Bayes’ rule, simply

Bayes’ rule is usually written \[\begin{aligned} P(\theta|x) & =P(x|\theta)\frac{P(\theta)}{P(x)}\end{aligned}\]

In practice we’re trying to learn about some model parameter \(\theta\) given some observation \(x\). The model \(P(x|\theta)\) tells us how observations are influenced by the model parameter. This seems simple enough, but a small change in notation reveals just how simple Bayes’ rule really is. Let us call \(P(\theta)\) the prior on \(\theta\) and \(P'(\theta)\) the posterior on \(\theta\). Then Bayes’ rule says:

\[\begin{aligned} P'(\theta) & \propto P(x|\theta)P(\theta)\end{aligned}\] We have dropped the denominator \(P(x)\) because it is just a normalisation constant that makes the total probability sum to 1, and instead say that \(P'(\theta)\) is proportional to \(P(x|\theta)P(\theta)\). The value \(P(x|\theta)P(\theta)=P(x,\theta)\) is the joint probability of seeing a given pair \((x,\theta)\), so we can also write Bayes’ rule as:

\[\begin{aligned} P'(\theta) & \propto P(x,\theta)\end{aligned}\] So, up to normalisation, the posterior is obtained simply by substituting the actual observation \(X=x\) into the joint distribution.

How can we interpret this? Imagine that we have a robot whose current state of belief is given by \(P(x,\theta)\), and that \(x\) and \(\theta\) have only a finite number of possible values, so that the robot stores a finite number of probabilities \(P(x,\theta)\), one for each pair \((x,\theta)\). Suppose that the robot now learns \(X=x\) by observation. What does it do to compute its posterior belief? It first sets \(P(y,\theta)=0\) for all \(y\neq x\), because the actual observed value is \(x\). Then it renormalises the remaining probabilities \(P(x,\theta)\) so that they sum to 1 again. That’s all Bayes’ rule is: simply delete the possibilities that are incompatible with the observation, and renormalise the remainder.
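To make this concrete, here is a minimal sketch in Python, with made-up numbers for the prior and likelihood, that computes the same posterior three ways: by Bayes’ rule with the explicit normaliser \(P(x)\), by the proportionality form, and by the robot’s procedure of zeroing out the incompatible entries of the joint table and renormalising.

```python
import numpy as np

# Made-up numbers: theta takes 2 values, x takes 3 values.
# prior[t]         = P(theta = t)
# likelihood[t, j] = P(x = j | theta = t)   (each row sums to 1)
prior = np.array([0.3, 0.7])
likelihood = np.array([[0.5, 0.3, 0.2],
                       [0.1, 0.6, 0.3]])

# Joint table P(x, theta) = P(x | theta) P(theta), one entry per pair.
joint = likelihood * prior[:, None]

x_obs = 1  # suppose the robot observes X = 1

# 1. Bayes' rule with the explicit normaliser P(x) = sum_theta P(x|theta) P(theta).
p_x = np.sum(likelihood[:, x_obs] * prior)
posterior_bayes = likelihood[:, x_obs] * prior / p_x

# 2. Proportionality form: multiply likelihood by prior, then renormalise.
unnormalised = likelihood[:, x_obs] * prior
posterior_prop = unnormalised / unnormalised.sum()

# 3. The robot's procedure: set P(y, theta) = 0 for every y != x_obs,
#    then renormalise what is left of the joint table.
table = joint.copy()
table[:, [y for y in range(table.shape[1]) if y != x_obs]] = 0.0
posterior_robot = table.sum(axis=1) / table.sum()

# All three agree (for these numbers, roughly [0.176, 0.824]).
print(posterior_bayes)
print(posterior_prop)
print(posterior_robot)
```

The third computation is the “delete and renormalise” picture in code: the normaliser never has to be written down explicitly, because dividing by the sum of the surviving entries does the same job as dividing by \(P(x)\).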