STATS 413

Optimality of the conditional expectation

This post supplements the supervised learning slides. Please see the slides for the setup.

We wish to show that the conditional expectation \(\Ex\big[Y\mid X=x\big]\) is the minimum mean squared error (MSE) prediction function of \(Y\) from \(X\); i.e.

\[\Ex\big[(Y - f_*(X))^2\big] \le \Ex\big[(Y - f(X))^2\big]\text{ for any (other) function }f.\]

First, we note that the problem of finding the minimum MSE prediction function of \(Y\) from \(X\) reduces to the problem of finding the minimum MSE constant prediction of \(Y_x \triangleq Y\mid X=x\) (a random variable distributed according to the conditional distribution of \(Y\) given \(X=x\)); i.e. finding the constant \(\mu_x\in\reals\) such that

\[\Ex\big[(Y_x - \mu_x)^2\big] \le \Ex\big[(Y_x - c)^2\big]\text{ for any (other) constant }c\in\reals.\]
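As a quick numerical sanity check (the distribution below is an arbitrary choice for illustration, not part of the argument), a short Python simulation confirms that among constant predictions, the sample mean minimizes the empirical MSE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples standing in for Y_x (here, an arbitrary skewed distribution
# with mean 2, chosen only for illustration).
y = rng.exponential(scale=2.0, size=100_000)

# Approximate the MSE E[(Y_x - c)^2] on a grid of candidate constants c.
grid = np.linspace(0.0, 5.0, 501)
mse = np.array([np.mean((y - c) ** 2) for c in grid])

# The empirical MSE is a quadratic in c minimized exactly at the sample
# mean, so the argmin lands on the grid point nearest y.mean().
best_c = grid[np.argmin(mse)]
print(best_c, y.mean())
```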

This is because the minimum MSE prediction function \(f_*\) must equal \(\mu_x\) at \(x\); i.e. \(f_*(x) = \mu_x\). Otherwise, we could reduce the MSE of \(f_*\) by replacing its value at \(x\) with \(\mu_x\), contradicting the optimality of \(f_*\):

\[f(x') = \begin{cases}\mu_x & \text{if }x' = x, \\ f_*(x') & \text{otherwise}.\end{cases}\]

Second, we show that \(\mu_x = \Ex\big[Y_x\big]\) by solving the optimization problem: \(\min_{c\in\reals}\Ex\big[(Y_x - c)^2\big]\). The cost function seems complicated, but it is actually a quadratic function of \(c\):

\[\Ex\big[(Y_x - c)^2\big] = \Ex\big[Y_x^2\big] - 2c\Ex\big[Y_x\big] + c^2.\]
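One way to see the minimizer without calculus is to complete the square in \(c\):

\[\Ex\big[(Y_x - c)^2\big] = \big(c - \Ex\big[Y_x\big]\big)^2 + \Ex\big[Y_x^2\big] - \Ex\big[Y_x\big]^2.\]

Only the first term depends on \(c\), and it is smallest (namely zero) when \(c = \Ex\big[Y_x\big]\); the remaining terms are the variance of \(Y_x\), the irreducible error.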

We differentiate the cost function and set the derivative to zero:

\[\frac{d}{dc}\,\Ex\big[(Y_x - c)^2\big] = 2c - 2\Ex\big[Y_x\big] = 0 \iff c = \Ex\big[Y_x\big].\]

Since the cost is a convex quadratic in \(c\) (the coefficient of \(c^2\) is positive), this critical point is the minimizer, so \(\mu_x = \Ex\big[Y_x\big]\). Recalling \(f_*(x) = \mu_x\) from the first part, we conclude

\[f_*(x) = \Ex\big[Y_x\big] = \Ex\big[Y\mid X=x\big].\]
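The full claim can likewise be checked by simulation. In the sketch below, the model \(Y = X^2 + \varepsilon\) is an assumed toy example (so \(f_*(x) = x^2\)); the conditional expectation should achieve lower MSE than any competing prediction function:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (an assumption for illustration): Y = X^2 + Gaussian noise,
# so the conditional expectation is f_*(x) = E[Y | X = x] = x^2.
n = 200_000
x = rng.uniform(-1.0, 1.0, size=n)
y = x ** 2 + rng.normal(scale=0.5, size=n)

def mse(f):
    """Monte Carlo estimate of E[(Y - f(X))^2]."""
    return np.mean((y - f(x)) ** 2)

mse_star = mse(lambda t: t ** 2)                      # f_*: the conditional expectation
mse_const = mse(lambda t: np.full_like(t, y.mean()))  # best constant prediction
mse_line = mse(lambda t: t)                           # an arbitrary competitor, f(x) = x

print(mse_star, mse_const, mse_line)
```

Here `mse_star` should be close to the noise variance \(0.25\), the irreducible error, while both competitors incur strictly larger MSE.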

Posted on August 30, 2021 from Ann Arbor, MI