STATS 606

Accelerated gradient methods proofs

This post accompanies the lecture video on accelerated gradient methods. Please see the video for the problem setup and definitions.

Convergence of the heavy ball method

Recall that the iterates of the heavy ball method satisfy

\[\begin{bmatrix} x_{t+1} \\ x_t \end{bmatrix} = \begin{bmatrix} (1+\theta)I_d & -\theta I_d \\ I_d & 0_{d\times d} \end{bmatrix}\begin{bmatrix} x_t \\ x_{t-1} \end{bmatrix} - \begin{bmatrix} \eta\partial f(x_t) \\ 0_d \end{bmatrix}\]
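For concreteness, here is a minimal NumPy sketch of the heavy ball iteration on a quadratic objective \(f(x) = \frac12 x^\top Qx - b^\top x\); the function name and problem data are illustrative assumptions, not anything from the lecture.

```python
import numpy as np

def heavy_ball(Q, b, x0, eta, theta, T):
    """Run T heavy ball iterations on the quadratic f(x) = 0.5 x'Qx - b'x (illustrative sketch)."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(T):
        grad = Q @ x - b  # gradient of the quadratic at the current iterate x_t
        # x_{t+1} = (1 + theta) x_t - theta x_{t-1} - eta * grad f(x_t)
        x, x_prev = (1 + theta) * x - theta * x_prev - eta * grad, x
    return x
```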

Since \(\partial f(x_t) = Q(x_t - x_*)\) for the quadratic objective, it is not hard to check that the iterate errors satisfy

\[\begin{aligned} \begin{bmatrix} x_{t+1} - x_* \\ x_t - x_* \end{bmatrix} &= \begin{bmatrix} (1+\theta)I_d & -\theta I_d \\ I_d & 0_{d\times d} \end{bmatrix}\begin{bmatrix} x_t - x_* \\ x_{t-1} - x_* \end{bmatrix} - \begin{bmatrix} \eta Q(x_t - x_*) \\ 0_d \end{bmatrix} \\ &= \underbrace{\begin{bmatrix} (1+\theta)I_d - \eta Q & -\theta I_d \\ I_d & 0_{d\times d} \end{bmatrix}}_{H} \begin{bmatrix} x_t - x_* \\ x_{t-1} - x_* \end{bmatrix} \end{aligned}\]

We iterate the preceding expression and take norms to obtain

\[\left\|\begin{bmatrix} x_{T+1} - x_* \\ x_T - x_* \end{bmatrix}\right\|_2 \le \|H^T\|_2\left\|\begin{bmatrix} x_1 - x_* \\ x_0 - x_* \end{bmatrix} \right\|_2\]

By Gelfand’s formula, \(\|H^T\|_2 \approx \rho(H)^T\), where \(\rho(H) \triangleq \max_j\{\vert\lambda_j(H)\vert\}\) is the spectral radius of \(H\). In the rest of this proof, we show that \(\rho(H) \le \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\).

We start by observing \(H\) is similar to a block diagonal matrix with \(2\times 2\) blocks, so their spectral radii coincide:

\[\begin{aligned} \rho(H) &= \rho\left(\begin{bmatrix}(1+\theta)I_d - \eta\Lambda & -\theta I_d \\ I_d & 0_{d\times d}\end{bmatrix}\right) \\ &= \max\nolimits_{j\in[d]}\rho\left(\begin{bmatrix}1 + \theta - \eta\lambda_j & -\theta \\ 1 & 0\end{bmatrix}\right), \end{aligned}\]

where \(Q = V\Lambda V^T\) is the eigenvalue decomposition of \(Q\) and the \(\lambda_j\)’s are the eigenvalues of \(Q\). The eigenvalues of the \(2\times 2\) blocks are the roots of

\[z^2 - (1+\theta-\eta\lambda_j)z + \theta = 0.\]

If \((1+\theta-\eta\lambda_j)^2 \le 4\theta\), then both roots have magnitude \(\sqrt{\theta}\). We set \(\theta = \max\{(1-\sqrt{\eta\lambda_{\max}(Q)})^2,(1-\sqrt{\eta\lambda_{\min}(Q)})^2\}\) to ensure

\[\textstyle\theta\in[(1-\sqrt{\eta\lambda_j})^2,(1+\sqrt{\eta\lambda_j})^2],\]

which in turn ensures \((1+\theta-\eta\lambda_j)^2 \le 4\theta\). This choice of momentum parameter ensures \(\rho(H) \le \sqrt{\theta}\). We set \(\eta = \frac{4}{(\lambda_{\max}(Q)^{\frac12} + \lambda_{\min}(Q)^{\frac12})^2}\) so that \((1-\sqrt{\eta\lambda_{\max}(Q)})^2 = (1-\sqrt{\eta\lambda_{\min}(Q)})^2\), i.e., so that the two terms in the definition of \(\theta\) coincide. This implies

\[\begin{aligned} \theta &= \max\left\{\left(1-\frac{2\lambda_{\max}(Q)^{\frac12}}{\lambda_{\max}(Q)^{\frac12} + \lambda_{\min}(Q)^{\frac12}}\right)^2,\left(1-\frac{2\lambda_{\min}(Q)^{\frac12}}{\lambda_{\max}(Q)^{\frac12} + \lambda_{\min}(Q)^{\frac12}}\right)^2\right\} \\ &= \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^2, \end{aligned}\]

where \(\kappa \triangleq \lambda_{\max}(Q)/\lambda_{\min}(Q)\) is the condition number of \(Q\). We conclude \(\rho(H) \le \sqrt{\theta} = \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\), as claimed.
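As a quick sanity check on this calculation, the following sketch (with an assumed randomly generated positive definite \(Q\)) forms \(H\) with the step size and momentum chosen above and compares its spectral radius to \(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\); the two printed numbers should agree up to numerical error.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
Q = A @ A.T + np.eye(d)                      # an assumed positive definite test matrix

evals = np.linalg.eigvalsh(Q)                # eigenvalues of Q in ascending order
lmin, lmax = evals[0], evals[-1]
kappa = lmax / lmin

eta = 4.0 / (np.sqrt(lmax) + np.sqrt(lmin)) ** 2
theta = max((1 - np.sqrt(eta * lmax)) ** 2, (1 - np.sqrt(eta * lmin)) ** 2)

H = np.block([
    [(1 + theta) * np.eye(d) - eta * Q, -theta * np.eye(d)],
    [np.eye(d), np.zeros((d, d))],
])
rho = np.max(np.abs(np.linalg.eigvals(H)))   # spectral radius of H
print(rho, (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))
```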

Heuristic derivation of N83 ODE

Rearranging the N83 update rule, we have

\[\frac{x_{t+1}-x_t}{\sqrt{\eta}} = \frac{t-1}{t+2}\frac{x_t - x_{t-1}}{\sqrt{\eta}} - \sqrt{\eta}\partial f(y_t).\]

We wish to consider the sequence of iterates \((x_t)\) as a discrete version of a curve \(X(s)\). It turns out the correct way to discretize the curve is \(x_t \approx X(t\sqrt{\eta})\), which suggests the change of variables \(t = \frac{s}{\sqrt{\eta}}\). This change of variables leads to \(X(s) \approx x_{s/\sqrt{\eta}} = x_t\) and \(X(s + \sqrt{\eta}) \approx x_{t+1}\). We Taylor expand \(X(s)\) to obtain

\[\begin{aligned} \frac{x_{t+1}-x_t}{\sqrt{\eta}} \approx \dot{X}(s) + \frac12\ddot{X}(s)\sqrt{\eta} \\ \frac{x_t-x_{t-1}}{\sqrt{\eta}} \approx \dot{X}(s) - \frac12\ddot{X}(s)\sqrt{\eta} \end{aligned}\]

where \(X(s)\) is the N83 flow. The N83 update rule implies

\[\dot{X}(s) + \frac12\ddot{X}(s)\sqrt{\eta} \approx (1 - \frac{3\sqrt{\eta}}{s})(\dot{X}(s) - \frac12\ddot{X}(s)\sqrt{\eta}) - \sqrt{\eta}\partial f(X(s)).\]

We rearrange and drop the asymptotically negligible terms (as \(\eta\to 0\)) to obtain the N83 ODE:

\[\ddot{X}(s) + \frac3s\dot{X}(s) + \partial f(X(s)) = 0.\]
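To illustrate the correspondence \(x_t \approx X(t\sqrt{\eta})\), here is a rough sketch with an assumed two-dimensional quadratic: it runs the N83 iterates and a crude forward-Euler integration of the ODE (evaluating the \(3/s\) term at the right endpoint of each step to sidestep the singularity at \(s = 0\)) on the grid \(s_t = t\sqrt{\eta}\). For small \(\eta\), the two trajectories should roughly track each other.

```python
import numpy as np

# Assumed example: f(x) = 0.5 x'Qx with diagonal Q, minimized at the origin.
Q = np.diag([1.0, 10.0])
grad = lambda x: Q @ x
eta = 1e-3                                    # small step size so the ODE limit is visible
T = 2000
h = np.sqrt(eta)                              # ODE time step; the grid is s_t = t * sqrt(eta)

# N83 iterates: y_t = x_t + (t-1)/(t+2) (x_t - x_{t-1}),  x_{t+1} = y_t - eta * grad f(y_t)
x_prev = x = np.array([1.0, 1.0])
xs = [x]
for t in range(1, T + 1):
    y = x + (t - 1) / (t + 2) * (x - x_prev)
    x_prev, x = x, y - eta * grad(y)
    xs.append(x)

# Forward-Euler integration of X'' + (3/s) X' + grad f(X) = 0 from X(0) = x_0, X'(0) = 0
X, V = np.array([1.0, 1.0]), np.zeros(2)
Xs = [X]
for t in range(1, T + 1):
    s = t * h
    V = V + h * (-(3.0 / s) * V - grad(X))
    X = X + h * V
    Xs.append(X)

print(np.max(np.linalg.norm(np.array(xs) - np.array(Xs), axis=1)))  # should shrink as eta -> 0
```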

Convergence of N83 ODE

Define the energy/Lyapunov function \(E(s) \triangleq s^2(f(X) - f_*) + 2\|X + \frac{s}{2}\dot{X} - x_*\|_2^2\). We have

\[\begin{aligned} \dot{E} &= 2s(f(X) - f_*) + s^2\langle\partial f(X),\dot{X}\rangle + 4\langle X+\frac{s}{2}\dot{X} - x_*,\frac32\dot{X} + \frac{s}{2}\ddot{X}\rangle \\ &= 2s(f(X) - f_*) - 2s\langle X-x_*,\partial f(X)\rangle \\ &\le 0, \end{aligned}\]

where we appealed to the N83 ODE in the second step and the convexity of \(f\) in the third step. This implies \(E\) is non-increasing (in \(s\)). Thus

\[f(X(s)) - f_* \le \frac{E(s)}{s^2} \le \frac{E(0)}{s^2} = O(\frac{1}{s^2}),\]

where we recalled the definition of \(E\) in the first step.
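As a sanity check on the computation of \(\dot{E}\), here is a symbolic sketch for the assumed one-dimensional instance \(f(x) = x^2/2\) (so \(f_* = 0\) and \(x_* = 0\)): substituting the N83 ODE into \(\dot{E}\) should leave \(-sX(s)^2\), which matches \(2s(f(X) - f_*) - 2s\langle X - x_*,\partial f(X)\rangle \le 0\) in this instance.

```python
import sympy as sp

s = sp.symbols('s', positive=True)
X = sp.Function('X')

# E(s) = s^2 (f(X) - f_*) + 2 ||X + (s/2) X' - x_*||^2 with f(x) = x^2/2, f_* = 0, x_* = 0
E = s**2 * (X(s)**2 / 2) + 2 * (X(s) + s / 2 * X(s).diff(s))**2
Edot = E.diff(s)

# Impose the N83 ODE: X'' = -(3/s) X' - grad f(X), where grad f(X) = X here
Edot_on_ode = Edot.subs(X(s).diff(s, 2), -(3 / s) * X(s).diff(s) - X(s))
print(sp.simplify(Edot_on_ode))               # expected: -s*X(s)**2
```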

Convergence of FISTA

We start by recalling the basic inequality for proximal gradient methods:

\[f(x_{t+1}) - f(x) \le \frac{L}{2}(\|x - y_t\|_2^2 - \|x - x_{t+1}\|_2^2).\]
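Before using it, here is a quick numerical spot check of the basic inequality in the smooth case (\(g \equiv 0\), so the proximal step from \(y_t\) with step size \(1/L\) is the gradient step \(x_{t+1} = y_t - \frac1L\partial f(y_t)\)); the least-squares instance and sample sizes are assumptions for illustration.

```python
import numpy as np

# Assumed example: f(x) = 0.5 ||Ax - b||^2, whose gradient is L-Lipschitz with L = lambda_max(A'A)
rng = np.random.default_rng(1)
n, d = 30, 8
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.eigvalsh(A.T @ A)[-1]

for _ in range(100):                          # random points y_t and comparison points x
    y = rng.standard_normal(d)
    x = rng.standard_normal(d)
    x_next = y - grad(y) / L                  # gradient step (the prox step when g = 0)
    lhs = f(x_next) - f(x)
    rhs = L / 2 * (np.dot(x - y, x - y) - np.dot(x - x_next, x - x_next))
    assert lhs <= rhs + 1e-9                  # the basic inequality should hold on every sample
print("basic inequality held on all samples")
```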

Using this basic inequality, we show that FISTA reduces an energy/Lyapunov function at each iteration. Letting \(x = \frac{1}{\theta_t}x_* + (1-\frac{1}{\theta_t})x_t\) in the basic inequality, we obtain

\[\begin{aligned} &\textstyle f(x_{t+1}) - f(\frac{1}{\theta_t}x_* + (1-\frac{1}{\theta_t})x_t) \\ &\quad\le\frac{L}{2}\|\frac{1}{\theta_t}x_* + (1-\frac{1}{\theta_t})x_t - y_t\|_2^2 - \frac{L}{2}\|\frac{1}{\theta_t}x_* + (1-\frac{1}{\theta_t})x_t - x_{t+1}\|_2^2 \\ &\quad=\frac{L}{2\theta_t^2}\|x_* + (\theta_t-1)x_t - \theta_ty_t\|_2^2 - \frac{L}{2\theta_t^2}\|x_* + (\theta_t-1)x_t - \theta_tx_{t+1}\|_2^2 \\ &\quad=\frac{L}{2\theta_t^2}(\|u_t\|_2^2 - \|u_{t+1}\|_2^2), \end{aligned}\]

where we defined \(u_t \triangleq \theta_{t-1}x_t - (x_* + (\theta_{t-1}-1)x_{t-1})\) and recalled the \(y_t\) update rule in the third step. The convexity of \(f\) implies

\[\textstyle f(\frac{1}{\theta_t}x_* + (1-\frac{1}{\theta_t})x_t) \le \frac{1}{\theta_t}f_* + (1-\frac{1}{\theta_t})f(x_t),\]

so the left side of the preceding inequality is at least

\[\textstyle f(x_{t+1}) - f_* - (1-\frac{1}{\theta_t})(f(x_t) - f_*) \le f(x_{t+1}) - f(\frac{1}{\theta_t}x_* + (1-\frac{1}{\theta_t})x_t).\]

Combining the upper and lower bounds of \(f(x_{t+1}) - f(\frac{1}{\theta_t}x_* + (1-\frac{1}{\theta_t})x_t)\), we have

\[\textstyle \theta_t^2(f(x_{t+1}) - f_*) - (\theta_t^2 - \theta_t)(f(x_t) - f_*) \le \frac{L}{2}(\|u_t\|_2^2 - \|u_{t+1}\|_2^2).\]

Finally, we recognize \(\theta_t^2 - \theta_t = \theta_{t-1}^2\) and rearrange to show that FISTA reduces an energy function at each iteration:

\[\|u_{t+1}\|_2^2 + \frac{2\theta_t^2}{L}(f(x_{t+1}) - f_*) \le \|u_t\|_2^2 + \frac{2\theta_{t-1}^2}{L}(f(x_t) - f_*).\]

In particular, iterating this inequality down to \(t = 1\) and dropping the \(\|u_t\|_2^2\) term on the left, we have

\[\begin{aligned} \frac{2\theta_{t-1}^2}{L}(f(x_t) - f_*) &\le \|u_1\|_2^2 + \frac{2\theta_0^2}{L}(f(x_1) - f_*) \\ &= \|x_1 - x_*\|_2^2 + \frac{2}{L}(f(x_1) - f_*), \end{aligned}\]

where we recalled \(\theta_0 = 1\), so \(u_1 = x_1 - x_*\).

Recalling the basic inequality for proximal gradient methods (with \(t = 0\) and \(x = x_*\)) and the initialization \(y_0 = x_0\), we have

\[\begin{aligned} \frac{2}{L}(f(x_1) - f_*) &\le \|y_0 - x_*\|_2^2 - \|x_1 - x_*\|_2^2 \\ &= \|x_0 - x_*\|_2^2 - \|x_1 - x_*\|_2^2. \end{aligned}\]

Rearranging, we have

\[\|x_1 - x_*\|_2^2 + \frac{2}{L}(f(x_1) - f_*) \le \|x_0 - x_*\|_2^2.\]

We combine this with the bound on the reduction in energy from the first iteration to obtain

\[\frac{2\theta_{t-1}^2}{L}(f(x_t) - f_*) \le \|x_0 - x_*\|_2^2.\]

Rearranging, we have

\[f(x_t) - f_* \le \frac{L}{2\theta_{t-1}^2}\|x_0 - x_*\|_2^2 \le \frac{2L\|x_0 - x_*\|_2^2}{(t+1)^2},\]

where we recalled \(\theta_t \ge \frac{t+2}{2}\) for all \(t \ge 0\) (so \(\theta_{t-1} \ge \frac{t+1}{2}\)) in the second step.
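To see the argument in action, here is a sketch of FISTA in the smooth case (again \(g \equiv 0\), so the proximal step is a gradient step with step size \(1/L\)) on an assumed random least-squares problem; along the run, the energy \(\|u_t\|_2^2 + \frac{2\theta_{t-1}^2}{L}(f(x_t) - f_*)\) should be non-increasing and the bound \(f(x_t) - f_* \le \frac{2L\|x_0 - x_*\|_2^2}{(t+1)^2}\) should hold.

```python
import numpy as np

# Assumed example: least squares f(x) = 0.5 ||Ax - b||^2 with L = lambda_max(A'A) and g = 0
rng = np.random.default_rng(0)
n, d = 50, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.eigvalsh(A.T @ A)[-1]
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
f_star = f(x_star)

x_prev = x = y = np.zeros(d)
x0 = x.copy()
theta_prev, theta = 1.0, 1.0                  # theta_0 = 1, y_0 = x_0
energies = []
for t in range(1, 201):
    x_prev, x = x, y - grad(y) / L            # x_t: the prox step with g = 0 is a gradient step
    theta_prev, theta = theta, (1 + np.sqrt(1 + 4 * theta ** 2)) / 2   # theta_t^2 - theta_t = theta_{t-1}^2
    y = x + (theta_prev - 1) / theta * (x - x_prev)                    # y_t
    u = theta_prev * x - (x_star + (theta_prev - 1) * x_prev)          # u_t
    energies.append(np.dot(u, u) + 2 * theta_prev ** 2 / L * (f(x) - f_star))
    assert f(x) - f_star <= 2 * L * np.dot(x0 - x_star, x0 - x_star) / (t + 1) ** 2 + 1e-9

print(np.all(np.diff(energies) <= 1e-9))      # the energy should be non-increasing
```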

Posted on February 16, 2021 from San Francisco, CA.