STATS 415

Gradient boosting

This post supplements the tree-based methods slides. Please see the slides for the setup.

Gradient boosting seeks to assemble an ensemble of prediction rules \(\widehat{f} = \widehat{g}_1 + \dots + \widehat{g}_T\) that minimizes the training error; i.e.

\[\textstyle \widehat{f}\in\argmin_f\frac1n\sum_{i=1}^n\ell(f(X_i),Y_i),\]

where \(\ell\) is a loss function that measures the discrepancy between the predicted output \(f(X_i)\) and the actual output \(Y_i\). To derive the gradient boosting update rule, we observe that the training error depends on the prediction rule \(f\) only through its outputs at the training inputs \(X_1,\dots,X_n\). This allows us to simplify the functional optimization problem (in which the optimization variable is a function) to a more tractable problem in which the optimization variable is the vector of predicted outputs at the training inputs.

In terms of the vector of predicted outputs \(F\in\reals^n\), the optimization problem is

\[\textstyle \min_{F\in\reals^n}L(F,Y) \triangleq\frac1n\sum_{i=1}^n\ell(F_i,Y_i),\]

where \(L\) is the vectorized loss. The gradient descent update rule for minimizing \(L\) (WRT \(F\)) is

\[F_{t+1} \gets F_t - \eta\partial_F L(F_t,Y),\]

where \(\eta > 0\) is a step size parameter. Intuitively, gradient descent

  1. evaluates the gradient of the cost function at the current iterate and
  2. generates the next iterate by perturbing the current iterate in the negative gradient direction.
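The two steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not part of the slides: the names `grad_L`, `eta`, and `n_steps` are mine, and the square loss is used so the gradient has a closed form.

```python
import numpy as np

# Gradient descent on the vectorized square loss
# L(F, Y) = (1/2n) ||F - Y||^2, whose gradient WRT F is (F - Y)/n.

def grad_L(F, Y):
    return (F - Y) / len(Y)

rng = np.random.default_rng(0)
Y = rng.normal(size=5)           # actual outputs
F = np.zeros_like(Y)             # initialize at F_0 = 0
eta, n_steps = 1.0, 200          # step size and iteration count

for _ in range(n_steps):
    F = F - eta * grad_L(F, Y)   # step 1: gradient; step 2: perturb

print(np.allclose(F, Y))         # the square loss is minimized at F = Y
```

Because the square loss is minimized when every prediction equals its target, the iterates converge to \(Y\) itself, i.e. zero training error.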

Unraveling the recursion, we have

\[F_t = F_0 - \eta(\partial_F L(F_0,Y) + \dots + \partial_F L(F_{t-1},Y)).\]

In practice, we often initialize gradient descent at \(F_0 = 0\), so \(F_t\) is a linear combination of the gradients \(\partial_F L(F_0,Y),\dots,\partial_F L(F_{t-1},Y)\).
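A quick numerical check of the unraveled recursion, again using the square loss for concreteness (the helper names are mine):

```python
import numpy as np

# Verify that after t gradient descent steps, F_t equals
# F_0 - eta * (grad at F_0 + ... + grad at F_{t-1}).

def grad_L(F, Y):
    return (F - Y) / len(Y)   # gradient of the vectorized square loss

rng = np.random.default_rng(1)
Y = rng.normal(size=3)
F0 = np.zeros_like(Y)
eta, t = 0.5, 10

F, grads = F0.copy(), []
for _ in range(t):
    grads.append(grad_L(F, Y))   # gradient at the current iterate
    F = F - eta * grads[-1]      # gradient descent step

print(np.allclose(F, F0 - eta * sum(grads)))  # True by construction
```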

Gradient boosting assembles an ensemble of prediction rules that mimics the gradient descent update rule on the vector of predicted outputs by

  1. approximating \(\partial_F L(\widehat{f}_{t-1}(X),Y)\) (here \(\widehat{f}_{t-1}(X)\in\reals^n\) is the vector of outputs of \(\widehat{f}_{t-1}\) at the training inputs) with a prediction rule \(\widehat{g}_t\) and
  2. adding \(\widehat{g}_t\) to the ensemble: \(\widehat{f}_t \gets \widehat{f}_{t-1} - \eta\widehat{g}_t\).

This mimics the gradient descent update rule for minimizing \(L\) because the outputs of \(\widehat{f}_t\) at the training inputs are

\[\widehat{f}_t(X) \gets \widehat{f}_{t-1}(X) - \eta\widehat{g}_t(X) \approx \widehat{f}_{t-1}(X) - \eta\partial_F L(\widehat{f}_{t-1}(X),Y).\]
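Here is a sketch of the full gradient boosting loop, with least-squares regression stumps standing in for the base learners. The functions `fit_stump` and `gradient_boost` and the choice of stumps are my own illustrative choices; any regression method that can approximate the gradient vector would work.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares depth-1 stump: threshold on x, constant leaf values."""
    best = None
    for s in np.unique(x):
        left, right = r[x <= s], r[x > s]
        cl = left.mean() if left.size else 0.0
        cr = right.mean() if right.size else 0.0
        err = ((left - cl)**2).sum() + ((right - cr)**2).sum()
        if best is None or err < best[0]:
            best = (err, s, cl, cr)
    _, s, cl, cr = best
    return lambda z: np.where(z <= s, cl, cr)

def gradient_boost(x, y, grad_L, eta=0.1, T=200):
    """Mimic gradient descent on the predicted outputs with an ensemble."""
    F, ensemble = np.zeros_like(y), []     # initialize at f_0 = 0
    for _ in range(T):
        g = fit_stump(x, grad_L(F, y))     # step 1: approximate the gradient
        ensemble.append(g)
        F = F - eta * g(x)                 # step 2: add g to the ensemble
    return lambda z: -eta * sum(g(z) for g in ensemble)

# Square loss: the gradient is the (scaled) residual vector.
x = np.arange(10.0)
y = (x >= 5).astype(float)
f_hat = gradient_boost(x, y, lambda F, y: F - y)
print(np.allclose(f_hat(x), y, atol=1e-3))
```

The returned predictor is exactly the ensemble \(-\eta(\widehat{g}_1 + \dots + \widehat{g}_T)\), so its outputs at the training inputs coincide with the final iterate \(F\).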

To wrap up, we check that the boosting algorithm in the tree-based methods slides is an instance of gradient boosting. Indeed, if the loss is the square loss, then the vectorized loss (dropping the overall factor of \(\frac1n\), which merely rescales the gradient and can be absorbed into the step size \(\eta\)) is

\[\textstyle L(F,Y) = \sum_{i=1}^n\ell(F_i,Y_i) = \sum_{i=1}^n\frac12(F_i - Y_i)^2 = \frac12\|F - Y\|_2^2,\]

and its gradient WRT the predictions at the training inputs is the vector of residuals of \(\widehat{f}_t\):

\[\partial_F L(\widehat{f}_t(X),Y) = \widehat{f}_t(X) - Y.\]

Thus gradient boosting with the square loss leads to an ensemble of prediction rules in which each rule is trained to correct the errors of the previous ensemble.
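As a sanity check on the gradient formula, we can compare it against a central finite-difference approximation (a sketch; the helper names are mine, and the \(\frac1n\) factor is dropped as above):

```python
import numpy as np

# Sanity check: compare the analytic gradient F - Y of the square loss
# L(F, Y) = (1/2) ||F - Y||^2 against a central finite-difference
# approximation along each coordinate direction.

def L(F, Y):
    return 0.5 * np.sum((F - Y)**2)

rng = np.random.default_rng(3)
F, Y = rng.normal(size=4), rng.normal(size=4)
h = 1e-6

analytic = F - Y
numeric = np.array([(L(F + h*e, Y) - L(F - h*e, Y)) / (2*h)
                    for e in np.eye(len(F))])
print(np.allclose(analytic, numeric))
```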

Posted on November 02, 2022 from Ann Arbor, MI