STATS 606

Subgradient descent proofs

This post accompanies the lecture video on subgradient descent. Please see the video for the problem setup and definitions.

Basic inequality for projected subgradient descent

We have

\[\def\cC{\mathcal{C}} \begin{aligned} \|x_{t+1} - x_*\|_2^2 &= \|P_{\cC}(x_t - \eta_t g_t) - P_{\cC}(x_*)\|_2^2 \\ &\le \|x_t - \eta_tg_t - x_*\|_2^2 \\ &= \|x_t - x_*\|_2^2 - 2\eta_t\langle g_t,x_t - x_*\rangle + \eta_t^2\|g_t\|_2^2 \\ &\le \|x_t - x_*\|_2^2 - 2\eta_t(f(x_t) - f(x_*)) + \eta_t^2\|g_t\|_2^2, \end{aligned}\]

where we appealed to non-expansiveness of projection in the second step and the first-order characterization of convexity in the fourth step. We note that if \(f\) is \(\mu\)-strongly convex, then it is possible to strengthen the basic inequality by appealing to the first-order characterization of strong convexity, \(\langle g_t,x_t - x_*\rangle \ge f(x_t) - f(x_*) + \frac{\mu}{2}\|x_t - x_*\|_2^2\), in the fourth step:

\[\begin{aligned} \|x_{t+1} - x_*\|_2^2 &\le \|x_t - x_*\|_2^2 - 2\eta_t\langle g_t,x_t - x_*\rangle + \eta_t^2\|g_t\|_2^2 \\ &\le (1-\mu\eta_t)\|x_t - x_*\|_2^2 - 2\eta_t(f(x_t) - f(x_*)) + \eta_t^2\|g_t\|_2^2. \end{aligned}\]
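Before moving on to rates, it may help to see the iteration these inequalities track. Below is a minimal sketch of projected subgradient descent in Python; the subgradient oracle `subgrad` and the projection `project` are placeholders for the problem at hand (the lecture leaves both abstract), and the toy \(\ell_1\) example at the bottom is purely illustrative.

```python
import numpy as np

def projected_subgd(x0, subgrad, project, step_sizes):
    """Minimal sketch: x_{t+1} = P_C(x_t - eta_t * g_t) for g_t in the subdifferential."""
    x = np.asarray(x0, dtype=float)
    for eta in step_sizes:
        g = subgrad(x)             # any g_t in the subdifferential of f at x_t
        x = project(x - eta * g)   # project the subgradient step back onto C
    return x

# Toy usage (illustrative only): minimize f(x) = ||x - a||_1 over the box [-1, 1]^3.
a = np.array([2.0, -0.5, 3.0])
x_hat = projected_subgd(
    x0=np.zeros(3),
    subgrad=lambda x: np.sign(x - a),          # a valid subgradient of the l1 norm
    project=lambda x: np.clip(x, -1.0, 1.0),   # Euclidean projection onto the box
    step_sizes=[1.0 / np.sqrt(t + 1) for t in range(500)],
)
# x_hat ends up near clip(a) = [1.0, -0.5, 1.0], the constrained minimizer.
```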

Convergence of subGD with Polyak’s step size

Recall that Polyak’s step size minimizes the right side of the basic inequality with respect to \(\eta_t\): the right side is a convex quadratic in \(\eta_t\), minimized at \(\eta_t = \frac{f(x_t) - f(x_*)}{\|g_t\|_2^2}\). Plugging this in, Polyak’s step size ensures

\[\|x_{t+1} - x_*\|_2^2 \le \|x_t - x_*\|_2^2 - \frac{(f(x_t) - f(x_*))^2}{\|g_t\|_2^2}.\]

Rearranging, we have

\[\begin{aligned} (f(x_t) - f(x_*))^2 &\le (\|x_t - x_*\|_2^2 - \|x_{t+1} - x_*\|_2^2)\|g_t\|_2^2 \\ &\le L^2(\|x_t - x_*\|_2^2 - \|x_{t+1} - x_*\|_2^2). \end{aligned}\]

Summing over iterations up to \(T\), the right side telescopes and we obtain

\[\sum_{t=0}^T(f(x_t) - f(x_*))^2 \le L^2(\|x_0 - x_*\|_2^2 - \|x_{T+1} - x_*\|_2^2) \le L^2\|x_0 - x_*\|_2^2.\]

We lower bound the left side with \(T\min_{t\in[T]}(f(x_t) - f(x_*))^2\) and rearrange to obtain the stated result.
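In code, Polyak’s step size amounts to one extra division per iteration. A minimal sketch, assuming the optimal value \(f(x_*)\) is known (as the theorem requires; e.g. zero in many feasibility problems) and with `f` and `subgrad` as problem-specific placeholders; since the guarantee is on the best iterate, we track it explicitly.

```python
import numpy as np

def subgd_polyak(x0, f, subgrad, f_star, T):
    """Minimal sketch of subgradient descent with Polyak's step size."""
    x = np.asarray(x0, dtype=float)
    best = np.inf
    for _ in range(T):
        best = min(best, f(x))   # the bound controls min_t f(x_t) - f(x_*)
        g = subgrad(x)
        if not np.any(g):        # a zero subgradient certifies optimality
            return x, best
        eta = (f(x) - f_star) / np.dot(g, g)   # eta_t = (f(x_t) - f(x_*)) / ||g_t||^2
        x = x - eta * g
    return x, best
```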

Convergence of subGD on convex and Lipschitz problems

Iterating the basic inequality leads to

\[\|x_{T+1} - x_*\|_2^2 \le \|x_0 - x_*\|_2^2 - 2\sum_{t=0}^T\eta_t(f(x_t) - f(x_*)) + \sum_{t=0}^T\eta_t^2\|g_t\|_2^2.\]

Rearranging, we have

\[\begin{aligned} 2\sum_{t=0}^T\eta_t(f(x_t) - f(x_*)) &\le \|x_0 - x_*\|_2^2 - \|x_{T+1} - x_*\|_2^2 + \sum_{t=0}^T\eta_t^2\|g_t\|_2^2 \\ &\le \|x_0 - x_*\|_2^2 + L^2\sum_{t=0}^T\eta_t^2. \end{aligned}\]

Dividing both sides by \(2\sum_{t=0}^T\eta_t\), we obtain

\[\frac{\sum_{t=0}^T\eta_t(f(x_t) - f(x_*))}{\sum_{t=0}^T\eta_t} \le \frac{\|x_0 - x_*\|_2^2 + L^2\sum_{t=0}^T\eta_t^2}{2\sum_{t=0}^T\eta_t}.\]

The left side is a weighted average of the suboptimality gaps, so convexity of \(f\) (Jensen’s inequality applied to \(\bar{x}_T = \sum_{t=0}^T\eta_tx_t/\sum_{t=0}^T\eta_t\)) lower bounds it by \(f(\bar{x}_T) - f(x_*)\), which gives the stated result.
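A minimal sketch of the averaged iterate this bound concerns, assuming the common schedule \(\eta_t = c/\sqrt{t+1}\) (one choice that makes the right side of the bound vanish as \(T\to\infty\)); `subgrad` is again a placeholder.

```python
import numpy as np

def subgd_averaged(x0, subgrad, T, c=1.0):
    """Minimal sketch: subGD returning x_bar_T = (sum_t eta_t x_t) / (sum_t eta_t)."""
    x = np.asarray(x0, dtype=float)
    weighted_sum, eta_sum = np.zeros_like(x), 0.0
    for t in range(T + 1):
        eta = c / np.sqrt(t + 1)   # assumed schedule: sum eta_t^2 / sum eta_t -> 0
        weighted_sum += eta * x    # accumulate eta_t * x_t before stepping
        eta_sum += eta
        x = x - eta * subgrad(x)
    return weighted_sum / eta_sum
```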

Convergence of subGD on strongly convex and Lipschitz problems

Rearranging the basic inequality for strongly convex cost functions, we have

\[f(x_t) - f(x_*) \le \frac{1-\mu\eta_t}{2\eta_t}\|x_t - x_*\|_2^2 - \frac{1}{2\eta_t}\|x_{t+1} - x_*\|_2^2 + \frac{\eta_t}{2}\|g_t\|_2^2.\]

We plug in \(\eta_t = \frac{2}{\mu(t+1)}\) to obtain

\[f(x_t) - f(x_*) \le \frac{\mu(t-1)}{4}\|x_t - x_*\|_2^2 - \frac{\mu(t+1)}{4}\|x_{t+1} - x_*\|_2^2 + \frac{1}{\mu(t+1)}\|g_t\|_2^2.\]
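For completeness, the coefficient algebra behind this substitution is

\[\frac{1-\mu\eta_t}{2\eta_t} = \Big(1 - \frac{2}{t+1}\Big)\frac{\mu(t+1)}{4} = \frac{\mu(t-1)}{4}, \qquad \frac{1}{2\eta_t} = \frac{\mu(t+1)}{4}, \qquad \frac{\eta_t}{2} = \frac{1}{\mu(t+1)}.\]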

Multiplying both sides by \(t\), we have

\[t(f(x_t) - f(x_*)) \le \frac{\mu t(t-1)}{4}\|x_t - x_*\|_2^2 - \frac{\mu t(t+1)}{4}\|x_{t+1} - x_*\|_2^2 + \frac{t}{\mu(t+1)}\|g_t\|_2^2.\]

We sum over iterations up to \(T\); the first two terms on the right telescope, and \(\|g_t\|_2 \le L\) together with \(\sum_{t=0}^T\frac{t}{t+1} \le T\) bounds the last term, so we obtain

\[\sum_{t=0}^Tt(f(x_t) - f(x_*)) \le -\frac{\mu T(T+1)}{4}\|x_{T+1} - x_*\|_2^2 + \frac{1}{\mu}\sum_{t=0}^T\frac{t}{t+1}\|g_t\|_2^2 \le \frac{L^2T}{\mu}.\]

This implies (using convexity of \(f\) with the weighted average \(\bar{x}_T = \sum_{t=0}^T t\,x_t/\sum_{t=0}^T t\), and \(\sum_{t=0}^T t = \frac{T(T+1)}{2}\) in the last step)

\[f(\bar{x}_T) - f(x_*) \le \frac{\sum_{t=0}^Tt(f(x_t) - f(x_*))}{\sum_{t=0}^Tt} \le \frac{L^2T}{\mu\sum_{t=0}^Tt} \le \frac{2L^2}{\mu(T+1)}.\]
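A minimal sketch of this scheme, with the step size \(\eta_t = \frac{2}{\mu(t+1)}\) and the \(t\)-weighted average from the last display; `subgrad` and `mu` are problem-specific placeholders.

```python
import numpy as np

def subgd_strongly_convex(x0, subgrad, mu, T):
    """Minimal sketch: eta_t = 2 / (mu (t+1)), returning x_bar_T = sum_t t x_t / sum_t t."""
    x = np.asarray(x0, dtype=float)
    weighted_sum, weight = np.zeros_like(x), 0.0
    for t in range(T + 1):
        weighted_sum += t * x          # weight iterate t by t (t = 0 contributes nothing)
        weight += t
        eta = 2.0 / (mu * (t + 1))
        x = x - eta * subgrad(x)
    return weighted_sum / max(weight, 1.0)   # guard the degenerate T = 0 case
```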

Convergence of subGD on convex-concave saddle point problems

The convexity-concavity of \(f\) implies

\[\def\cX{\mathcal{X}} \def\cY{\mathcal{Y}} \begin{aligned} f(x_t,y_t) - f(x,y_t) &\le \langle g_t,x_t - x\rangle\text{ for any }x\in\cX,\\ f(x_t,y) - f(x_t,y_t) &\le \langle h_t,y - y_t\rangle\text{ for any }y\in\cY. \end{aligned}\]

Adding the two inequalities leads to

\[f(x_t,y) - f(x,y_t) \le \langle g_t,x_t-x\rangle - \langle h_t,y_t-y\rangle\]

for any \(x\in\cX\), \(y\in\cY\). We appeal to the convexity-concavity of \(f\) again to obtain

\[\def\cX{\mathcal{X}} \def\cY{\mathcal{Y}} \begin{aligned} \epsilon(\bar{x}_T,\bar{y}_T) &= \max_{y\in\cY}f(\bar{x}_T,y) - \min_{x\in\cX}f(x,\bar{y}_T) \\ &\le \frac{1}{\sum_{t=0}^T\eta_t}\Big(\max_{y\in\cY}\sum_{t=0}^T\eta_tf(x_t,y) - \min_{x\in\cX}\sum_{t=0}^T\eta_tf(x,y_t)\Big) \\ &\le \frac{1}{\sum_{t=0}^T\eta_t}\max_{x\in\cX,y\in\cY}\sum_{t=0}^T\eta_t(\langle g_t,x_t - x\rangle - \langle h_t,y_t-y\rangle). \end{aligned}\]

To complete the proof, it is enough to show

\[\max_{x\in\cX,y\in\cY}\sum_{t=0}^T\eta_t(\langle g_t,x_t - x\rangle - \langle h_t,y_t-y\rangle) \le \frac{D_{\cX}^2 + D_{\cY}^2 + L^2\sum_{t=0}^T\eta_t^2}{2}.\]

For any \(x\in\cX\), we have

\[\begin{aligned} \|x_{t+1} - x\|_2^2 &= \|P_{\cX}(x_t - \eta_tg_t) - P_{\cX}(x)\|_2^2 \\ &\le \|x_t - \eta_tg_t - x\|_2^2 \\ &= \|x_t - x\|_2^2 - 2\eta_t\langle g_t,x_t - x\rangle + \eta_t^2\|g_t\|_2^2, \end{aligned}\]

where we appealed to non-expansiveness of projection in the second step (note \(P_{\cX}(x) = x\) because \(x\in\cX\)). Rearranging, we obtain

\[2\eta_t\langle g_t,x_t - x\rangle \le \|x_t - x\|_2^2 - \|x_{t+1} - x\|_2^2 + \eta_t^2\|g_t\|_2^2\]

for any \(x\in\cX\). Similarly, it is possible to show

\[-2\eta_t\langle h_t,y_t - y\rangle \le \|y_t - y\|_2^2 - \|y_{t+1} - y\|_2^2 + \eta_t^2\|h_t\|_2^2\]

for any \(y\in\cY\). We combine the two inequalities to obtain

\[\begin{aligned} &2\eta_t(\langle g_t,x_t - x\rangle - \langle h_t,y_t - y\rangle) \\ &\quad\le \|x_t - x\|_2^2 + \|y_t - y\|_2^2 - \|x_{t+1} - x\|_2^2 - \|y_{t+1} - y\|_2^2 + \eta_t^2\|g_t\|_2^2 + \eta_t^2\|h_t\|_2^2 \\ &\quad\le \|x_t - x\|_2^2 + \|y_t - y\|_2^2 - \|x_{t+1} - x\|_2^2 - \|y_{t+1} - y\|_2^2 + \eta_t^2 L^2, \end{aligned}\]

where we appealed to the Lipschitz continuity of \(f\) (which gives \(\|g_t\|_2^2 + \|h_t\|_2^2 \le L^2\)) in the second step. Summing over iterations up to \(T\), we arrive at the desired bound:

\[\begin{aligned} &2\sum_{t=0}^T\eta_t(\langle g_t,x_t - x\rangle - \langle h_t,y_t - y\rangle) \\ &\quad\le \|x_0 - x\|_2^2 + \|y_0 - y\|_2^2 - \|x_{T+1} - x\|_2^2 - \|y_{T+1} - y\|_2^2 + L^2\sum_{t=0}^T\eta_t^2 \\ &\quad\le D_{\cX}^2 + D_{\cY}^2 + L^2\sum_{t=0}^T\eta_t^2. \end{aligned}\]
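For concreteness, a minimal sketch of the method these bounds analyze: projected subgradient descent in \(x\) and ascent in \(y\), returning the step-size-weighted averages \((\bar{x}_T,\bar{y}_T)\). The oracles `subgrad_x`, `subgrad_y` and projections `proj_X`, `proj_Y` are placeholders for the problem at hand.

```python
import numpy as np

def subgd_saddle(x0, y0, subgrad_x, subgrad_y, proj_X, proj_Y, step_sizes):
    """Minimal sketch of projected subgradient descent-ascent with averaging."""
    x = np.asarray(x0, dtype=float)
    y = np.asarray(y0, dtype=float)
    x_sum, y_sum, eta_sum = np.zeros_like(x), np.zeros_like(y), 0.0
    for eta in step_sizes:
        x_sum, y_sum, eta_sum = x_sum + eta * x, y_sum + eta * y, eta_sum + eta
        g = subgrad_x(x, y)         # g_t: subgradient of f(., y_t) at x_t
        h = subgrad_y(x, y)         # h_t: supergradient of f(x_t, .) at y_t
        x = proj_X(x - eta * g)     # descend in x
        y = proj_Y(y + eta * h)     # ascend in y
    return x_sum / eta_sum, y_sum / eta_sum   # (x_bar_T, y_bar_T)
```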

Posted on February 02, 2021 from San Francisco, CA.