STATS 415

SVM dual problem derivation

This post follows the margin maximization problem derivation. Please see the preceding post for the setup.

The dual of the margin maximization problem

Recall the margin maximization problem is

\[\begin{aligned} &\min\nolimits_{\beta_0\in\reals,\beta\in\reals^p} &&\textstyle\frac12\|\beta\|_2^2 \\ &\subjectto && \{1 - Y_i(\beta_0 + \beta^\top X_i) \le 0:\alpha_i\}_{i=1}^n \end{aligned},\]

where we assigned Lagrange multiplier \(\alpha_i\) to the \(i\)-th margin constraint (in preparation for the derivation of the SVM dual problem). Since the margin constraints are inequality (versus equality) constraints, the \(\alpha_i\)’s must be non-negative. The Lagrangian of the margin maximization problem is

\[\textstyle L(\beta_0,\beta,\alpha) \triangleq \frac12\|\beta\|_2^2 + \sum_{i=1}^n\alpha_i(1 - Y_i(\beta_0 + \beta^\top X_i)).\]
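For concreteness, the Lagrangian is easy to evaluate numerically. Below is a minimal numpy sketch; the function and variable names (`lagrangian`, `X`, `Y`) are illustrative, not part of the original setup.

```python
import numpy as np

def lagrangian(beta0, beta, alpha, X, Y):
    """Evaluate L(beta0, beta, alpha) for features X (n x p) and labels Y in {-1, +1}."""
    margins = 1.0 - Y * (beta0 + X @ beta)   # 1 - Y_i (beta0 + beta^T X_i)
    return 0.5 * beta @ beta + alpha @ margins
```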

The Lagrangian encodes the margin maximization problem because it is possible to recover (an equivalent form of) the margin maximization problem from the Lagrangian. As we shall see, the margin maximization problem is (in terms of the Lagrangian)

\[\textstyle \min_{\beta_0\in\reals,\beta\in\reals^p}\max_{\alpha\in\reals_+^n}L(\beta_0,\beta,\alpha).\]

To see this, fix \(\beta_0\) and \(\beta\), and consider the inner maximization with respect to \(\alpha\):

\[\max\nolimits_{\alpha\in\reals_+^n}L(\beta_0,\beta,\alpha).\]

If the \(i\)-th margin constraint is violated at \((\beta_0,\beta)\) (i.e. \(1 - Y_i(\beta_0 + \beta^\top X_i) > 0\)), then the maximum of the Lagrangian (WRT \(\alpha\)) is unbounded above because the maximization WRT \(\alpha_i\) is unbounded above:

\[\textstyle \max_{\alpha\in\reals_+^n}L(\beta_0,\beta,\alpha) = \frac12\|\beta\|_2^2 + \sum_{i=1}^n\max_{\alpha_i \ge 0}\alpha_i(\underbrace{1 - Y_i(\beta_0 + \beta^\top X_i)}_{> 0}) \to\infty\]

On the other hand, if all the margin constraints are satisfied at \((\beta_0,\beta)\) (i.e. the hyperplane \(\{x\in\reals^p:\beta_0 + \beta^\top x = 0\}\) separates the training data), then the maximum of the Lagrangian (WRT \(\alpha\)) is the cost function because each term in the sum is maximized by setting \(\alpha_i = 0\):

\[\textstyle \max_{\alpha\in\reals_+^n}L(\beta_0,\beta,\alpha) = \frac12\|\beta\|_2^2 + \sum_{i=1}^n\max_{\alpha_i \ge 0}\alpha_i(\underbrace{1 - Y_i(\beta_0 + \beta^\top X_i)}_{\le 0}) = \frac12\|\beta\|_2^2.\]

Thus minimizing (WRT \(\beta_0\) and \(\beta\)) the maximum (WRT \(\alpha\)) of the Lagrangian is equivalent to the margin maximization problem

\[\begin{aligned} \min\nolimits_{\beta_0\in\reals,\beta\in\reals^p}\max\nolimits_{\alpha\in\reals_+^n}L(\beta_0,\beta,\alpha) &= \min\nolimits_{\beta_0\in\reals,\beta\in\reals^p}\begin{cases}\frac12\|\beta\|_2^2 & (\beta_0,\beta)\text{ is feasible,} \\ \infty & \text{otherwise.}\end{cases} \\ &= \left\{\begin{aligned}&\min\nolimits_{\beta_0\in\reals,\beta\in\reals^p} &&\textstyle\frac12\|\beta\|_2^2 \\&\subjectto &&(\beta_0,\beta)\text{ is feasible}\end{aligned}\right\}. \end{aligned}\]
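To make the two cases concrete, here is a tiny (hypothetical) 1-D illustration: a feasible \((\beta_0,\beta)\) attains the inner maximum at \(\alpha = 0\) with value \(\frac12\|\beta\|_2^2\), while an infeasible one sends the Lagrangian to \(+\infty\).

```python
import numpy as np

# Two 1-D points: x = -1 with label -1, x = +1 with label +1.
X = np.array([[-1.0], [1.0]])
Y = np.array([-1.0, 1.0])

def lagrangian(beta0, beta, alpha):
    return 0.5 * beta @ beta + alpha @ (1.0 - Y * (beta0 + X @ beta))

# Feasible: beta0 = 0, beta = [2] satisfies both margin constraints strictly,
# so any alpha > 0 only decreases L; the max over alpha >= 0 is at alpha = 0.
print(lagrangian(0.0, np.array([2.0]), np.zeros(2)))        # 2.0 = (1/2)||beta||^2
print(lagrangian(0.0, np.array([2.0]), np.ones(2)))         # 0.0 <= 2.0

# Infeasible: beta = [0.5] violates both margin constraints,
# so L increases without bound as alpha grows.
for t in [1.0, 10.0, 100.0]:
    print(lagrangian(0.0, np.array([0.5]), np.full(2, t)))  # 1.125, 10.125, 100.125
```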

To obtain the SVM dual problem, we exchange the min and the max:

\[\textstyle \max_{\alpha\in\reals_+^n}\min_{\beta_0\in\reals,\beta\in\reals^p}L(\beta_0,\beta,\alpha).\]

In the rest of this post, we simplify the dual problem by (fixing \(\alpha\) and) minimizing the Lagrangian WRT \(\beta_0\) and \(\beta\). First, note that the Lagrangian depends linearly on \(\beta_0\):

\[\textstyle L(\beta_0,\beta,\alpha) = \frac12\|\beta\|_2^2 - \underbrace{\textstyle(\sum_{i=1}^n\alpha_iY_i)}_{=\,\alpha^\top Y}\beta_0 + \sum_{i=1}^n\alpha_i(1 - \beta^\top X_iY_i).\]

Thus the minimum of the Lagrangian WRT \(\beta_0\) is unbounded below unless the coefficient of \(\beta_0\) in the Lagrangian is zero; i.e.

\[\sum_{i=1}^n\alpha_iY_i = \alpha^\top Y = 0.\]

This is effectively a constraint on the \(\alpha_i\)’s: since we wish to maximize \(\min_{\beta_0\in\reals,\beta\in\reals^p}L(\beta_0,\beta,\alpha)\) WRT \(\alpha\), we can ignore \(\alpha\) values for which \(L(\beta_0,\beta,\alpha)\) (as a function of \(\beta_0\) and \(\beta\)) is unbounded below. We state this constraint explicitly to obtain

\[\begin{aligned} &\max\nolimits_{\alpha\in\reals_+^n} &&\textstyle\min_{\beta\in\reals^p} \frac12\|\beta\|_2^2 - \cancel{\alpha^\top Y\beta_0} + \sum_{i=1}^n\alpha_i(1 - \beta^\top X_iY_i) \\ &\subjectto &&\alpha^\top Y = 0. \end{aligned}\]
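To see the unboundedness concretely (another hypothetical 1-D sketch): when \(\alpha^\top Y \ne 0\), the Lagrangian is linear in \(\beta_0\) with nonzero slope, so it tends to \(-\infty\) in one direction.

```python
import numpy as np

X = np.array([[-1.0], [1.0]]); Y = np.array([-1.0, 1.0])
beta = np.array([1.0])
alpha = np.array([1.0, 0.5])                  # alpha^T Y = -0.5 != 0

def lagrangian(beta0):
    return 0.5 * beta @ beta + alpha @ (1.0 - Y * (beta0 + X @ beta))

# L is linear in beta0 with slope -alpha^T Y = 0.5, so it is unbounded below.
for beta0 in [0.0, -10.0, -100.0]:
    print(lagrangian(beta0))                  # 0.5, -4.5, -49.5
```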

Second, we minimize WRT \(\beta\). The optimal \(\beta\) satisfies the zero-gradient condition WRT \(\beta\):

\[0 = \frac{\partial}{\partial\beta}L(\beta_0,\beta_*(\alpha),\alpha) = \beta_*(\alpha) - \sum_{i=1}^n\alpha_iY_iX_i,\]

where we denote the optimal \(\beta\) as \(\beta_*(\alpha)\) to emphasize its dependence on \(\alpha\). We rearrange to obtain \(\beta_*(\alpha) = \sum_{i=1}^n\alpha_iY_iX_i\). Plugging the optimal \(\beta\) into the objective of the SVM dual problem, we obtain

\[\begin{aligned} &\max\nolimits_{\alpha\in\reals_+^n} &&\textstyle\frac12\|\sum_{i=1}^n\alpha_iY_iX_i\|_2^2 + \sum_{i=1}^n\alpha_i(1 - (\sum_{j=1}^n\alpha_jY_jX_j)^\top X_iY_i) \\ &\subjectto &&\alpha^\top Y = 0. \end{aligned}\]
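Before simplifying, here is a quick numerical sanity check of the stationarity condition (a hypothetical sketch with random data): the gradient of the Lagrangian WRT \(\beta\) should vanish at \(\beta_*(\alpha)\) for any fixed \(\alpha\).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                  # 5 points in R^3
Y = np.where(rng.normal(size=5) > 0, 1.0, -1.0)
alpha = rng.uniform(size=5)                  # any fixed alpha >= 0

beta_star = (alpha * Y) @ X                  # beta_*(alpha) = sum_i alpha_i Y_i X_i

def L_beta(beta):
    # Lagrangian as a function of beta alone (the beta0 term is constant in beta,
    # so we drop it by setting beta0 = 0).
    return 0.5 * beta @ beta + alpha @ (1.0 - Y * (X @ beta))

# Central finite differences approximate the gradient; it should vanish at beta_star.
eps = 1e-6
grad = np.array([(L_beta(beta_star + eps * e) - L_beta(beta_star - eps * e)) / (2 * eps)
                 for e in np.eye(3)])
print(np.allclose(grad, 0.0, atol=1e-6))     # True
```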

We simplify the objective: the squared norm expands as \(\frac12\|\sum_{i=1}^n\alpha_iY_iX_i\|_2^2 = \frac12\sum_{i,j=1}^n\alpha_i\alpha_jY_iY_jX_i^\top X_j\), and the cross term \(\sum_{i=1}^n\alpha_i(\sum_{j=1}^n\alpha_jY_jX_j)^\top X_iY_i\) equals the same double sum without the \(\frac12\), so the two combine to \(-\frac12\sum_{i,j=1}^n\alpha_i\alpha_jY_iY_jX_i^\top X_j\). This yields the standard form of the SVM dual problem:

\[\begin{aligned} &\max\nolimits_{\alpha\in\reals_+^n} &&\textstyle\sum_{i=1}^n\alpha_i - \frac12\sum_{i,j=1}^n\alpha_i\alpha_jY_iY_jX_i^\top X_j \\ &\subjectto &&\alpha^\top Y = 0. \end{aligned}\]

This is a quadratic optimization problem that can be solved with off-the-shelf optimization software. That said, there are algorithms specially designed to solve the SVM dual problem. The most widely used algorithm is the sequential minimal optimization (SMO) algorithm.
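For illustration, here is a minimal sketch that solves the dual with an off-the-shelf solver (scipy’s SLSQP) on a toy dataset. The data and variable names are hypothetical, and a serious implementation would use a dedicated QP solver or SMO.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative).
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],
              [-1.0, -1.0], [-2.0, -1.5], [-0.5, -2.0]])
Y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(Y)
G = (Y[:, None] * X) @ (Y[:, None] * X).T    # G_ij = Y_i Y_j X_i^T X_j

def neg_dual(alpha):
    # Negated dual objective: scipy minimizes, so we flip the sign.
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ Y}])  # alpha^T Y = 0
alpha_star = res.x
print(alpha_star.round(4))   # nonzero entries correspond to support vectors
```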

Recovering the max margin (linear) classifier from the dual maximizer

Solving the dual problem is only useful if we can quickly obtain the optimal \(\beta_0\) and \(\beta\) of the margin maximization problem from the optimal \(\alpha\) of the dual problem. This is possible for the margin maximization problem because

\[\textstyle \min_{\beta_0\in\reals,\beta\in\reals^p}\max_{\alpha\in\reals_+^n}L(\beta_0,\beta,\alpha) = \max_{\alpha\in\reals_+^n}\min_{\beta_0\in\reals,\beta\in\reals^p}L(\beta_0,\beta,\alpha);\]

i.e. the optimal values of the margin maximization problem and its dual problem coincide. The equality of the optimal values of an optimization problem and its dual problem is called strong duality. In terms of the optimal \(\alpha\) of the dual problem, the optimal \(\beta_0\) and \(\beta\) of the margin maximization problem are

\[\begin{aligned}\textstyle \beta_* &= \textstyle\sum_{i=1}^n\alpha_i^*Y_iX_i, \\ \beta_{0,*} &= \textstyle-\frac12\left(\max\{\beta_*^\top X_i\mid Y_i=-1\} + \min\{\beta_*^\top X_i\mid Y_i=1\}\right), \end{aligned}\]

where the expression for \(\beta_{0,*}\) places the separating hyperplane midway between the closest training point of each class.
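Continuing the dual-solver sketch above (again illustrative; it reuses `X`, `Y`, and `alpha_star`), the recovery takes a couple of lines:

```python
# Continuing the dual-solver sketch above (reuses X, Y, alpha_star, numpy as np).
beta_star = (alpha_star * Y) @ X             # beta_* = sum_i alpha_i^* Y_i X_i
scores = X @ beta_star                       # beta_*^T X_i for every i
beta0_star = -0.5 * (scores[Y == -1].max() + scores[Y == 1].min())

# Sanity check: every training point should satisfy its margin constraint,
# Y_i (beta0_* + beta_*^T X_i) >= 1, up to solver tolerance.
print(np.all(Y * (beta0_star + scores) >= 1.0 - 1e-4))
```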

Posted on October 26, 2022 from Ann Arbor, MI