STATS 413

Bias-variance decomposition

This post supplements the supervised learning slides. Please see the slides for the setup.

We wish to derive the bias-variance decomposition on p.21 (of the slides):

\[\def\hf{\widehat{f}} \def\bias{\mathrm{bias}} \def\Ex{\mathbb{E}} \def\var{\mathrm{var}} \def\eps{\varepsilon} \begin{aligned} \Ex\big[(Y - \hf(X))^2\mid X=x\big] &= \bias\big[\hf(x)\big]^2 + \var\big[\hf(x)\big] + \var\big[\eps\mid X=x\big], \\ \bias\big[\hf(x)\big] &\triangleq f(x) - \Ex\big[\hf(x)\big], \\ \var\big[\hf(x)\big] &\triangleq \Ex\big[\big(\hf(x) - \Ex\big[\hf(x)\big]\big)^2\big]. \end{aligned}\]

All the expectations (unless otherwise stated) are with respect to \((X,Y)\) and \(\hf\). Note that the irreducible error \(\var\big[\eps\mid X=x\big]\) depends on \(x\). This is the more general form of the irreducible error for heteroscedastic problems in which the (conditional) variance of \(\eps\) depends on \(x\).
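For example (a toy model not taken from the slides), if

\[Y = f(X) + \eps, \qquad \eps \mid X = x \sim \mathcal{N}\big(0, \sigma^2(1 + x^2)\big),\]

then \(\var\big[\eps\mid X=x\big] = \sigma^2(1+x^2)\): the noise floor grows with \(|x|\), and no estimator, however good, can push the conditional MSE at \(x\) below it.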

First, we decompose the (conditional) MSE of a fixed \(\hf\) into reducible and irreducible parts (see p. 5):

\[\begin{aligned} &\Ex\big[(Y - \hf(X))^2\mid X=x\big] \\ &= \Ex\big[(f(X) + \eps - \hf(X))^2\mid X=x\big] \\ &= \Ex\big[(f(X) - \hf(X))^2\mid X=x\big] + \Ex\big[\eps^2\mid X=x\big] + 2\Ex\big[(f(X) - \hf(X))\eps\mid X=x\big] \\ &= \underbrace{(f(x) - \hf(x))^2}_\text{reducible error} + \Ex\big[\eps^2\mid X=x\big] + 2(f(x) - \hf(x))\Ex\big[\eps\mid X=x\big], \end{aligned}\]

where \(f(x) = \Ex\big[Y\mid X=x\big]\) is the regression function. It is not hard to check that the conditional mean of \(\eps\) is zero:

\[\Ex\big[\eps\mid X=x\big] = \Ex\big[Y - f(X)\mid X=x\big] = \Ex\big[Y\mid X=x\big] - f(x) = 0.\]

Thus the third term in the decomposition of \(\Ex\big[(Y - \hf(X))^2\mid X=x\big]\) vanishes, and since \(\Ex\big[\eps\mid X=x\big] = 0\), the second term is \(\Ex\big[\eps^2\mid X=x\big] = \var\big[\eps\mid X=x\big]\), the irreducible error. Note that this decomposition remains valid for a (random) \(\hf\) fit to training data because \((X,Y)\) is a test sample that is independent of the training data. In other words, we can average/integrate the decomposition with respect to the training data to obtain

\[\Ex\big[(Y - \hf(X))^2\mid X=x\big] = \Ex\big[(f(x) - \hf(x))^2\big] + \var\big[\eps\mid X=x\big].\]
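A quick numerical sanity check may help here. The following Python sketch simulates draws of \(Y \mid X = x\) for a fixed \(\hf\) and compares the conditional MSE against the sum of the reducible and irreducible errors; the particular \(f\), \(\hf\), and noise scale are assumptions made up for illustration, not part of the slides.

```python
import numpy as np

# Monte Carlo check of the reducible/irreducible split at a fixed
# test point x, for a fixed (deliberately imperfect) estimate hf.
rng = np.random.default_rng(0)

f = lambda x: np.sin(x)              # true regression function (assumed)
hf = lambda x: 0.5 * x               # fixed, imperfect estimate (assumed)
sigma = lambda x: 0.5 + 0.5 * x**2   # heteroscedastic noise s.d. (assumed)

x = 1.0
y = f(x) + sigma(x) * rng.standard_normal(1_000_000)  # draws of Y | X = x

mse = np.mean((y - hf(x)) ** 2)      # conditional MSE, estimated
reducible = (f(x) - hf(x)) ** 2      # (f(x) - hf(x))^2
irreducible = sigma(x) ** 2          # var[eps | X = x]

print(f"MSE                      {mse:.4f}")
print(f"reducible + irreducible  {reducible + irreducible:.4f}")
```

The two printed numbers should agree up to Monte Carlo error.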

Second, we decompose the reducible part of the MSE into (squared) bias and variance:

\[\begin{aligned} \Ex\big[(f(x) - \hf(x))^2\big] &= \Ex\big[(f(x) - \Ex\big[\hf(x)\big] + \Ex\big[\hf(x)\big] - \hf(x))^2\big] \\ &= \big(\underbrace{f(x) - \Ex\big[\hf(x)\big]}_{\bias\big[\hf(x)\big]}\big)^2 + \underbrace{\Ex\big[(\Ex\big[\hf(x)\big] - \hf(x))^2\big]}_{\var\big[\hf(x)\big]} \\ &\quad+2\Ex\big[\big(f(x) - \Ex\big[\hf(x)\big]\big)\big(\Ex\big[\hf(x)\big] - \hf(x)\big)\big]. \end{aligned}\]

The third term is zero because

\[\begin{aligned} \Ex\big[\big(f(x) - \Ex\big[\hf(x)\big]\big)\big(\Ex\big[\hf(x)\big] - \hf(x)\big)\big] &= \big(f(x) - \Ex\big[\hf(x)\big]\big)\Ex\big[\Ex\big[\hf(x)\big] - \hf(x)\big] \\ &= \big(f(x) - \Ex\big[\hf(x)\big]\big)\big(\underbrace{\Ex\big[\hf(x)\big] - \Ex\big[\hf(x)\big]}_{0}\big). \end{aligned}\]
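Combining the two decompositions yields the bias-variance decomposition stated at the top. As a closing sanity check, here is a hedged Monte Carlo sketch of the full decomposition: it refits \(\hf\) on many independent training sets and compares the average conditional squared error at a test point with \(\bias^2 + \var + \var[\eps\mid X=x]\). The data-generating process and the choice of a degree-1 polynomial fit are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda x: np.sin(x)      # true regression function (assumed)
sigma = 0.3                  # homoscedastic noise s.d. (assumed)
x0 = 1.0                     # test point
n, n_reps = 30, 20_000       # training-set size, number of refits

preds = np.empty(n_reps)
for r in range(n_reps):
    # Draw a fresh training set and fit a degree-1 polynomial.
    X = rng.uniform(-2.0, 2.0, n)
    Y = f(X) + sigma * rng.standard_normal(n)
    coef = np.polyfit(X, Y, deg=1)
    preds[r] = np.polyval(coef, x0)   # hf(x0) for this training set

bias2 = (f(x0) - preds.mean()) ** 2   # squared bias at x0
variance = preds.var()                # variance of hf(x0) across refits
irreducible = sigma ** 2              # var[eps | X = x0]

# Average squared error of a fresh Y at x0 against each refit hf(x0).
y_new = f(x0) + sigma * rng.standard_normal(n_reps)
mse = np.mean((y_new - preds) ** 2)

print(f"bias^2 + var + irreducible  {bias2 + variance + irreducible:.4f}")
print(f"Monte Carlo MSE             {mse:.4f}")
```

Because a straight line cannot track \(\sin\) on \([-2, 2]\), the bias term here is nonzero; the two printouts should nevertheless agree up to Monte Carlo error, as the derivation above predicts.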

Posted on September 01, 2021 from Ann Arbor, MI