STATS 606

Gradient descent proofs

This post accompanies the lecture video on gradient descent. Please see the video for the problem setup and definitions.

Convergence of GD with constant step sizes on QP’s

The GD update rule implies successive optimization errors satisfy

\[x_{t+1} - x_* = x_t - x_* - \eta\partial f(x_t) = (I - \eta A)(x_t - x_*).\]

Taking norms, we have

\[\|x_{t+1} - x_*\|_2 \le \|I - \eta A\|_2\|x_t - x_*\|_2,\]

where \(\|\cdot\|_2\) with a matrix argument denotes the norm induced by the Euclidean norm on vectors. Recalling that the \(\|\cdot\|_2\) norm of a symmetric matrix is the magnitude of its eigenvalue furthest from zero,

\[\|I - \eta A\|_2 = \max\{|1 - \eta\lambda_{\max}(A)|,|1-\eta\lambda_{\min}(A)|\}.\]

Thus the choice of step length that minimizes \(\|I - \eta A\|_2\) is \(\eta = \frac{2}{\lambda_{\max}(A) + \lambda_{\min}(A)}\), and the minimum is \(\|I - \eta A\|_2 = \frac{\lambda_{\max}(A) - \lambda_{\min}(A)}{\lambda_{\max}(A) + \lambda_{\min}(A)}\). Combining this expression with the preceding error bound, we have

\[\begin{aligned} \|x_t - x_*\|_2 &\le (\frac{\lambda_{\max}(A) - \lambda_{\min}(A)}{\lambda_{\max}(A) + \lambda_{\min}(A)})\|x_{t-1} - x_*\|_2 \\ &\quad\vdots \\ &\le (\frac{\lambda_{\max}(A) - \lambda_{\min}(A)}{\lambda_{\max}(A) + \lambda_{\min}(A)})^t\|x_0 - x_*\|_2. \end{aligned}\]
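As a quick numerical sanity check (a sketch of my own, not part of the lecture), the following snippet runs GD with the step size \(\eta = \frac{2}{\lambda_{\max}(A) + \lambda_{\min}(A)}\) on a randomly generated quadratic and confirms the per-step contraction at the predicted rate. The dimension, the construction of \(A\), and the tolerance are arbitrary choices.

```python
# Sanity check: constant-step GD on f(x) = 0.5 (x - x_star)' A (x - x_star)
# contracts the error by (lmax - lmin)/(lmax + lmin) at every step.
import numpy as np

rng = np.random.default_rng(0)
d = 20
B = rng.standard_normal((d, d))
A = B @ B.T / d + np.eye(d)            # symmetric positive definite (arbitrary)
x_star = rng.standard_normal(d)

lmin, lmax = np.linalg.eigvalsh(A)[[0, -1]]
eta = 2.0 / (lmax + lmin)
rate = (lmax - lmin) / (lmax + lmin)   # predicted contraction factor

x = np.zeros(d)
for t in range(50):
    err_prev = np.linalg.norm(x - x_star)
    x = x - eta * A @ (x - x_star)     # grad f(x) = A (x - x_star)
    err = np.linalg.norm(x - x_star)
    assert err <= rate * err_prev + 1e-9   # small tolerance for rounding
print("final error:", np.linalg.norm(x - x_star))
```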

Convergence of GD with exact line search on QP’s

It is not hard to check that exact line search leads to \(\eta_t = \frac{\|\partial f(x_t)\|_2^2}{\langle\partial f(x_t),A\partial f(x_t)\rangle}\). This implies successive cost values satisfy

\[\begin{aligned} f(x_{t+1}) &= \frac12\langle x_t - \eta_t\partial f(x_t) - x_*,A(x_t - \eta_t\partial f(x_t) - x_*)\rangle \\ &= \frac12\langle x_t - x_*,A(x_t - x_*)\rangle - \eta_t\|\partial f(x_t)\|_2^2 + \frac{\eta_t^2}{2}\langle\partial f(x_t),A\partial f(x_t)\rangle \\ &= \frac12\langle x_t - x_*,A(x_t - x_*)\rangle - \frac{\|\partial f(x_t)\|_2^4}{2\langle\partial f(x_t),A\partial f(x_t)\rangle} \\ &= (1 - \frac{\|\partial f(x_t)\|_2^4}{\langle\partial f(x_t),A\partial f(x_t)\rangle\langle\partial f(x_t),A^{-1}\partial f(x_t)\rangle})f(x_t), \end{aligned}\]

where we recalled \(f(x_t) = \frac12\langle\partial f(x_t),A^{-1}\partial f(x_t)\rangle\) in the last step. We recall Kantorovich’s inequality:

\[\frac{\|x\|_2^4}{\langle x,Ax\rangle\langle x,A^{-1}x\rangle} \ge \frac{4\lambda_{\max}(A)\lambda_{\min}(A)}{(\lambda_{\max}(A) + \lambda_{\min}(A))^2}.\]

This inequality implies

\[\begin{aligned} f(x_{t+1}) &\le (1-\frac{4\lambda_{\max}(A)\lambda_{\min}(A)}{(\lambda_{\max}(A) + \lambda_{\min}(A))^2})f(x_t) \\ &= (\frac{\lambda_{\max}(A) - \lambda_{\min}(A)}{\lambda_{\max}(A) + \lambda_{\min}(A)})^2f(x_t). \end{aligned}\]

We iterate the preceding bound (and recall \(f(x_*) = 0\)) to obtain the stated result.
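The same kind of numerical check (again my own sketch, not from the lecture) applies to exact line search: each step should shrink the cost by at least the factor \((\frac{\lambda_{\max}(A)-\lambda_{\min}(A)}{\lambda_{\max}(A)+\lambda_{\min}(A)})^2\). The test problem below is an arbitrary random quadratic.

```python
# Sanity check: GD with exact line search on a random quadratic satisfies
# f(x_{t+1}) <= ((lmax - lmin)/(lmax + lmin))^2 f(x_t).
import numpy as np

rng = np.random.default_rng(1)
d = 15
B = rng.standard_normal((d, d))
A = B @ B.T / d + np.eye(d)            # symmetric positive definite (arbitrary)
x_star = rng.standard_normal(d)
f = lambda x: 0.5 * (x - x_star) @ A @ (x - x_star)

lmin, lmax = np.linalg.eigvalsh(A)[[0, -1]]
rate = ((lmax - lmin) / (lmax + lmin)) ** 2

x = np.zeros(d)
for t in range(25):
    g = A @ (x - x_star)               # gradient of f at x
    eta = (g @ g) / (g @ (A @ g))      # exact line search step size
    f_prev = f(x)
    x = x - eta * g
    assert f(x) <= rate * f_prev + 1e-9    # small tolerance for rounding
print("cost after 25 steps:", f(x))
```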

Convergence of GD on SC and SS problems

The fundamental theorem of calculus implies

\[\partial f(x_t) - \partial f(x_*) = \int_0^1\partial^2f(x_s)(x_t - x_*)ds,\]

where \(x_s \triangleq (1-s)x_t + sx_*\). This and the GD update rule imply that successive optimization errors satisfy

\[\begin{aligned} x_{t+1} - x_* = x_t - x_* - \eta\partial f(x_t) = (I - \eta\int_0^1\partial^2f(x_s)ds)(x_t - x_*). \end{aligned}\]

Taking norms and plugging in \(\eta = \frac{2}{L + \mu}\), we have

\[\|x_{t+1} - x_*\|_2 \le \sup\nolimits_{s\in[0,1]}\|I - \eta\partial^2f(x_s)\|_2\|x_t - x_*\|_2 \le (\frac{L-\mu}{L+\mu})\|x_t - x_*\|_2.\]

We recognize \(\frac{L-\mu}{L+\mu} = \frac{\kappa -1}{\kappa + 1}\) and iterate the preceding bound to obtain the stated result.
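To see the bound in action beyond quadratics, here is a one-dimensional sketch (my own example, not from the lecture) with \(f(x) = \frac{\mu}{2}x^2 + \log(1+e^x)\). Since \(f''(x) = \mu + \sigma(x)(1-\sigma(x))\) with \(\sigma\) the logistic function, \(f\) is \(\mu\)-strongly convex and \(L\)-strongly smooth with \(L = \mu + \frac14\), and GD with \(\eta = \frac{2}{L+\mu}\) should contract the error by \(\frac{L-\mu}{L+\mu}\) at every step. The value of \(\mu\) and the starting point are arbitrary.

```python
# Sanity check on a non-quadratic SC and SS function:
# f(x) = mu/2 * x^2 + log(1 + e^x), with f''(x) = mu + sigma(x)(1 - sigma(x)),
# so mu <= f'' <= mu + 1/4 = L.
import numpy as np

mu = 0.5
L = mu + 0.25
eta = 2.0 / (L + mu)
rate = (L - mu) / (L + mu)

grad = lambda x: mu * x + 1.0 / (1.0 + np.exp(-x))

# locate x_star to high accuracy with Newton's method
x_star = 0.0
for _ in range(50):
    s = 1.0 / (1.0 + np.exp(-x_star))
    x_star -= (mu * x_star + s) / (mu + s * (1 - s))

x = 5.0
for t in range(30):
    err_prev = abs(x - x_star)
    x -= eta * grad(x)
    assert abs(x - x_star) <= rate * err_prev + 1e-9   # tolerance for rounding
print("error after 30 steps:", abs(x - x_star))
```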

Convergence of GD under regularity conditions

We have

\[\begin{aligned} \|x_{t+1} - x_*\|_2^2 &\textstyle= \|x_t - x_* - \frac1L\partial f(x_t)\|_2^2 \\ &= \|x_t - x_*\|_2^2 + \frac{1}{L^2}\|\partial f(x_t)\|_2^2 - \frac2L\langle \partial f(x_t),x_t - x_*\rangle \\ &\le \|x_t - x_*\|_2^2 - \frac{\mu}{L}\|x_t - x_*\|_2^2 \\ &= (1-\frac{\mu}{L})\|x_t - x_*\|_2^2, \end{aligned}\]

where we appealed to the regularity condition in the third step. We iterate the preceding bound to obtain the stated result.
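For concreteness, if the regularity condition takes the standard form (the precise definition is in the lecture video; this restatement is an assumption on my part)

\[\langle\partial f(x),x - x_*\rangle \ge \frac{\mu}{2}\|x - x_*\|_2^2 + \frac{1}{2L}\|\partial f(x)\|_2^2,\]

then applying it at \(x = x_t\) gives \(-\frac2L\langle\partial f(x_t),x_t - x_*\rangle \le -\frac{\mu}{L}\|x_t - x_*\|_2^2 - \frac{1}{L^2}\|\partial f(x_t)\|_2^2\), so the \(\frac{1}{L^2}\|\partial f(x_t)\|_2^2\) term cancels exactly in the third step above.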

Convergence of GD under PL condition

Recalling the step size \(\eta_t = \frac1L\), the strong smoothness condition implies

\[\begin{aligned} f(x_{t+1}) - f(x_t) &\le \langle\partial f(x_t),x_{t+1} - x_t\rangle + \frac{L}{2}\|x_{t+1} - x_t\|_2^2 \\ &= -\eta_t\|\partial f(x_t)\|_2^2 + \frac{\eta_t^2 L}{2}\|\partial f(x_t)\|_2^2 \\ &= -\frac{1}{2L}\|\partial f(x_t)\|_2^2. \end{aligned}\]

This bound is a stronger version of the observation that GD is a descent method (on SS functions). It implies

\[\begin{aligned} f(x_{t+1}) - f(x_*) &\le f(x_t) - f(x_*) - \frac{1}{2L}\|\partial f(x_t)\|_2^2 \\ &\le f(x_t) - f(x_*) - \frac{\mu}{L}(f(x_t) - f(x_*)) \\ &= (1-\frac{\mu}{L})(f(x_t) - f(x_*)). \end{aligned}\]

We iterate the preceding bound to obtain the stated result.
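A concrete instance where the PL condition holds without strong convexity (my own example, not from the lecture) is rank-deficient least squares: \(f(w) = \frac12\|Xw - y\|_2^2\) satisfies the PL condition with \(\mu\) equal to the smallest nonzero singular value of \(X\) squared and is \(L\)-strongly smooth with \(L = \|X\|_2^2\). The sketch below checks the per-step contraction of the suboptimality gap; the problem sizes and random data are arbitrary.

```python
# Sanity check of the PL rate on overparameterized least squares:
# f(w) = 0.5 ||X w - y||^2 with rank(X) = r < d is not strongly convex,
# but satisfies PL with mu = (smallest nonzero singular value)^2 and L = ||X||_2^2.
import numpy as np

rng = np.random.default_rng(2)
n, d, r = 30, 50, 10
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))   # rank r
y = rng.standard_normal(n)

svals = np.linalg.svd(X, compute_uv=False)
L, mu = svals[0] ** 2, svals[r - 1] ** 2   # largest / smallest nonzero sv^2
f = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
f_star = f(np.linalg.lstsq(X, y, rcond=None)[0])   # optimal cost

w = np.zeros(d)
for t in range(100):
    gap_prev = f(w) - f_star
    w -= (1.0 / L) * X.T @ (X @ w - y)     # GD with step size 1/L
    assert f(w) - f_star <= (1 - mu / L) * gap_prev + 1e-9
print("suboptimality after 100 steps:", f(w) - f_star)
```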

Convergence of GD to stationary points

Recall GD on \(L\)-SS functions is a descent method:

\[\frac{1}{2L}\|\partial f(x_t)\|_2^2 \le f(x_t) - f(x_{t+1}).\]

Summing the preceding bound, we have

\[\begin{aligned} \frac{1}{2L}\sum_{t=0}^{T-1}\|\partial f(x_t)\|_2^2 &\le \sum_{t=0}^{T-1}(f(x_t) - f(x_{t+1})) \\ &= f(x_0) - f(x_T) \\ &\le f(x_0) - f(x_*). \end{aligned}\]

We lower bound the left side with \(\frac{T}{2L}\min_{0\le t<T}\|\partial f(x_t)\|_2^2\), and rearrange to obtain the stated result.
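The following sketch (my own example, not from the lecture) checks this bound on the nonconvex function \(f(x) = \frac{x^2}{2} + 2\sin(x)\), which is \(L\)-strongly smooth with \(L = 3\) since \(|f''(x)| = |1 - 2\sin(x)| \le 3\); the grid estimate of \(f(x_*) = \inf_x f(x)\) and the starting point are arbitrary choices.

```python
# Sanity check of the stationary-point bound on a nonconvex 1-d function:
# f(x) = x^2/2 + 2 sin(x), with |f''(x)| = |1 - 2 sin(x)| <= 3, so L = 3.
import numpy as np

f = lambda x: 0.5 * x**2 + 2.0 * np.sin(x)
grad = lambda x: x + 2.0 * np.cos(x)
L = 3.0

f_star = f(np.linspace(-5, 5, 100001)).min()   # grid estimate of inf f

x0, T = 4.0, 200
x = x0
grad_norms = []
for t in range(T):
    g = grad(x)
    grad_norms.append(g ** 2)
    x -= g / L                                 # GD with step size 1/L

bound = 2 * L * (f(x0) - f_star) / T           # 2L(f(x_0) - f(x_*))/T
print("min squared gradient:", min(grad_norms), "<= bound:", bound)
assert min(grad_norms) <= bound
```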

Posted on January 19, 2021 from San Francisco, CA.