Bayes' Rule
Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
⇒ Bayes' rule: P(a | b) = P(b | a) P(a) / P(b)
or in distribution form
P(Y|X) = P(X|Y) P(Y) / P(X) = αP(X|Y) P(Y)
Useful for assessing diagnostic probability from causal probability:
P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)
E.g., let M be meningitis, S be stiff neck:
P(m|s) = P(s|m) P(m) / P(s) = 0.5 × 0.0002 / 0.05 = 0.002
Note: posterior probability of meningitis still very small!
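The arithmetic above can be checked directly; a minimal Python sketch using the slide's stated values:

```python
# Bayes' rule applied to the meningitis example:
# P(m|s) = P(s|m) P(m) / P(s), with the values given on the slide.
p_s_given_m = 0.5    # P(stiff neck | meningitis)
p_m = 0.0002         # prior P(meningitis)
p_s = 0.05           # P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)   # 0.002
```

Even though a stiff neck is strong evidence for meningitis (0.5 vs. 0.05), the tiny prior keeps the posterior small.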

Bayes' Rule and conditional independence
P(Cavity | toothache ∧ catch)
= α · P(toothache ∧ catch | Cavity) P(Cavity)
= α · P(toothache | Cavity) P(catch | Cavity) P(Cavity)
This is an example of a naïve Bayes model:
P(Cause, Effect1, …, Effectn) = P(Cause) Πi P(Effecti|Cause)
Total number of parameters is linear in n
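The cavity factorization above can be worked through numerically. The probabilities below are illustrative assumptions of mine, not values from the slide; the point is the α normalization step:

```python
# Naive Bayes posterior for the cavity example; all numbers here are
# made-up illustrative values, not from the slide.
p_cavity = 0.2
p_toothache_given = {True: 0.6, False: 0.1}   # P(toothache | Cavity)
p_catch_given = {True: 0.9, False: 0.2}       # P(catch | Cavity)

# Unnormalized score for each value of Cavity:
# P(toothache | Cavity) P(catch | Cavity) P(Cavity)
score = {c: p_toothache_given[c] * p_catch_given[c]
            * (p_cavity if c else 1 - p_cavity)
         for c in (True, False)}

# alpha is chosen so the posterior sums to 1
alpha = 1 / sum(score.values())
posterior = {c: alpha * s for c, s in score.items()}
print(posterior[True])
```

Note that only 2n + 1 conditional probabilities are needed, which is why the parameter count is linear in n.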

Naïve Bayes Classifier
Calculate most probable function value
vMAP = argmaxvj P(vj | a1, a2, …, an)
     = argmaxvj P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an)
     = argmaxvj P(a1, a2, …, an | vj) P(vj)
Naïve assumption (conditional independence): P(a1, a2, …, an | vj) = P(a1|vj) P(a2|vj) … P(an|vj)

Naïve Bayes Algorithm
NaïveBayesLearn(examples)
For each target value vj
   P'(vj) ← estimate P(vj)
   For each attribute value ai of each attribute a
      P'(ai|vj) ← estimate P(ai|vj)
ClassifyNewInstance(x)
vNB = argmaxvj P'(vj) Πi P'(ai|vj)
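The two procedures above can be sketched in Python. This is a minimal frequency-count version; the function names and the toy dataset are my own, and it uses raw counts with no smoothing (an unseen attribute value simply gets probability 0):

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_tuple, label) pairs.
    Returns estimated priors P'(vj) and conditionals P'(ai|vj)."""
    label_counts = Counter(label for _, label in examples)
    cond_counts = defaultdict(Counter)   # cond_counts[vj][(i, ai)]
    for attrs, label in examples:
        for i, a in enumerate(attrs):
            cond_counts[label][(i, a)] += 1
    n = len(examples)
    p_prior = {v: c / n for v, c in label_counts.items()}
    p_cond = {v: {k: c / label_counts[v] for k, c in counts.items()}
              for v, counts in cond_counts.items()}
    return p_prior, p_cond

def classify_new_instance(x, p_prior, p_cond):
    """vNB = argmax_vj P'(vj) * prod_i P'(ai|vj)."""
    def score(v):
        p = p_prior[v]
        for i, a in enumerate(x):
            p *= p_cond[v].get((i, a), 0.0)  # unseen value -> 0, no smoothing
        return p
    return max(p_prior, key=score)

# Toy dataset (invented for illustration): (outlook, temperature) -> play?
examples = [(('sunny', 'hot'),  'no'),
            (('rain',  'cool'), 'yes'),
            (('sunny', 'cool'), 'yes')]
p_prior, p_cond = naive_bayes_learn(examples)
prediction = classify_new_instance(('sunny', 'cool'), p_prior, p_cond)
print(prediction)
```

In practice the estimates are usually smoothed (e.g. Laplace/m-estimates) so that an unseen attribute value does not zero out an entire class.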

An Example
(due to MIT's OpenCourseWare slides)
