STATS 606

Mirror descent proofs

This post accompanies the lecture video on mirror descent. Please see the video for the problem setup and definitions.

Three point lemma

We have

\[\begin{aligned} &D_\varphi(x,z) + D_\varphi(z,y) - D_\varphi(x,y) \\ &\quad=\varphi(x) - \varphi(z) - \langle\partial\varphi(z),x-z\rangle + \varphi(z) - \varphi(y) - \langle\partial\varphi(y),z-y\rangle \\ &\qquad-\varphi(x) + \varphi(y) + \langle\partial\varphi(y),x-y\rangle \\ &\quad=-\langle\partial\varphi(z),x-z\rangle - \langle\partial\varphi(y),z-y\rangle + \langle\partial\varphi(y),x-y\rangle \\ &\quad=\langle\partial\varphi(y) - \partial\varphi(z),x-z\rangle. \end{aligned}\]
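As a sanity check, the identity can be verified numerically. The sketch below (an illustration, not part of the lecture) uses the negative-entropy mirror map \(\varphi(x) = \sum_ix_i\log x_i\) and random points with positive coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_entropy(x):
    # mirror map phi(x) = sum_i x_i log x_i (negative entropy)
    return np.sum(x * np.log(x))

def grad_neg_entropy(x):
    # gradient of the negative entropy
    return np.log(x) + 1.0

def bregman(x, y):
    # D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
    return neg_entropy(x) - neg_entropy(y) - grad_neg_entropy(y) @ (x - y)

# three random points with positive coordinates
x, y, z = rng.uniform(0.1, 1.0, size=(3, 5))

lhs = bregman(x, z) + bregman(z, y) - bregman(x, y)
rhs = (grad_neg_entropy(y) - grad_neg_entropy(z)) @ (x - z)
assert abs(lhs - rhs) < 1e-10
```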

Bregman Pythagorean theorem

Let \(g \triangleq \partial_xD_\varphi(x,y)\mid_{x=y_*}\) (the gradient of \(D_\varphi(x,y)\) WRT its first argument evaluated at \((y_*,y)\)). Recall \(\def\cC{\mathcal{C}}y_* \in \argmin_{x\in\cC}D_\varphi(x,y)\), so the directional derivative of \(D_\varphi(x,y)\) in all feasible directions is non-negative: \(\langle g, x-y_*\rangle \ge 0\) for all \(x\in\cC\). Thus

\[0 \ge \langle g,y_* - x\rangle = \langle\partial\varphi(y) - \partial\varphi(y_*),x-y_*\rangle.\]

By the three point lemma (with \(z = y_*\)), the right side is exactly \(D_\varphi(x,y_*) + D_\varphi(y_*,y) - D_\varphi(x,y)\), so \(D_\varphi(x,y) \ge D_\varphi(x,y_*) + D_\varphi(y_*,y)\), which is the stated result.
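For a concrete instance (an illustration, not from the lecture), take \(\varphi\) to be negative entropy, so \(D_\varphi\) is a KL-type divergence, and let \(\cC\) be the probability simplex. The Bregman projection of \(y\) onto \(\cC\) is then just normalization, \(y_* = y/\sum_iy_i\), and for this affine constraint the Pythagorean relation in fact holds with equality:

```python
import numpy as np

rng = np.random.default_rng(1)

def bregman(x, y):
    # KL-type Bregman divergence generated by negative entropy
    return np.sum(x * np.log(x / y)) - np.sum(x) + np.sum(y)

y = rng.uniform(0.1, 2.0, size=6)   # point outside the simplex
y_star = y / y.sum()                # KL (Bregman) projection of y onto the simplex
x = rng.dirichlet(np.ones(6))       # arbitrary point in the simplex C

# Bregman Pythagorean inequality: D(x, y) >= D(x, y_*) + D(y_*, y);
# equality here because the simplex constraint is affine
gap = bregman(x, y) - bregman(x, y_star) - bregman(y_star, y)
assert gap >= -1e-10
assert abs(gap) < 1e-10
```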

Convergence of mirror descent

We start by showing a basic inequality for mirror descent:

\[\eta_t(f(x_t) - f(x_*)) \le D_{\varphi}(x_*,x_t) - D_\varphi(x_*,x_{t+1}) + \frac{\eta_t^2L^2}{2\mu}.\]

We have

\[f(x_t) - f(x_*) \le \langle g_t,x_t - x_*\rangle = \frac{1}{\eta_t}\langle\partial\varphi(x_t) - \partial\varphi(x_{t+\frac12}),x_t - x_*\rangle\]

where we recalled \(g_t\in\partial f(x_t)\) in the first step and appealed to the optimality conditions of \(x_{t+\frac12}\) in the second step. By the three point lemma and the (Bregman) Pythagorean theorem, we have

\[\begin{aligned} &\frac{1}{\eta_t}\langle\partial\varphi(x_t) - \partial\varphi(x_{t+\frac12}),x_t - x_*\rangle \\ &\quad= \frac{1}{\eta_t}(D_\varphi(x_*,x_t) + D_\varphi(x_t,x_{t+\frac12}) - D_\varphi(x_*,x_{t+\frac12})) \\ &\quad\le \frac{1}{\eta_t}(D_\varphi(x_*,x_t) + D_\varphi(x_t,x_{t+\frac12}) - D_\varphi(x_*,x_{t+1}) - D_\varphi(x_{t+1},x_{t+\frac12})) \\ &\quad= \frac{1}{\eta_t}(D_\varphi(x_*,x_t) - D_\varphi(x_*,x_{t+1})) + \frac{1}{\eta_t}(D_\varphi(x_t,x_{t+\frac12}) - D_\varphi(x_{t+1},x_{t+\frac12})). \end{aligned}\]

It remains to show

\[D_\varphi(x_t,x_{t+\frac12}) - D_\varphi(x_{t+1},x_{t+\frac12}) \le \frac{\eta_t^2L^2}{2\mu}.\]

We have

\[\begin{aligned} &D_\varphi(x_t,x_{t+\frac12}) - D_\varphi(x_{t+1},x_{t+\frac12}) \\ &\quad= \varphi(x_t) - \varphi(x_{t+1}) - \langle\partial\varphi(x_{t+\frac12}),x_t - x_{t+1}\rangle \\ &\quad\le \langle\partial\varphi(x_t),x_t - x_{t+1}\rangle - \frac{\mu}{2}\|x_t - x_{t+1}\|_2^2 - \langle\partial\varphi(x_{t+\frac12}),x_t - x_{t+1}\rangle \\ &\quad= \langle\partial\varphi(x_t) - \partial\varphi(x_{t+\frac12}),x_t - x_{t+1}\rangle - \frac{\mu}{2}\|x_t - x_{t+1}\|_2^2 \\ &\quad = \eta_t\langle g_t,x_t - x_{t+1}\rangle - \frac{\mu}{2}\|x_t - x_{t+1}\|_2^2 \\ &\quad\le \eta_tL\|x_t - x_{t+1}\|_2 - \frac{\mu}{2}\|x_t - x_{t+1}\|_2^2, \end{aligned}\]

where we appealed to the strong convexity of \(\varphi\) in the second step, the optimality condition of \(x_{t+\frac12}\) in the fourth step, and the Cauchy–Schwarz inequality together with the Lipschitz continuity of \(f\) in the fifth step. Maximizing the right side WRT \(\|x_t - x_{t+1}\|_2\) (the maximum is attained at \(\|x_t - x_{t+1}\|_2 = \frac{\eta_tL}{\mu}\)) gives

\[D_\varphi(x_t,x_{t+\frac12}) - D_\varphi(x_{t+1},x_{t+\frac12}) \le \frac{\eta_t^2L^2}{2\mu},\]

as desired. This shows the basic inequality for mirror descent.
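The basic inequality can also be checked numerically. With the Euclidean mirror map \(\varphi(x) = \frac12\|x\|_2^2\) (so \(\mu = 1\) and mirror descent reduces to projected subgradient descent), a linear objective, and \(\cC\) the unit ball, the per-step inequality holds along the whole trajectory. The problem data below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def bregman(x, y):
    # Euclidean mirror map phi(x) = ||x||^2/2 gives D_phi(x, y) = ||x - y||^2/2
    return 0.5 * np.sum((x - y) ** 2)

def project_ball(x):
    # Bregman (= Euclidean) projection onto C, the unit ball
    n = np.linalg.norm(x)
    return x / n if n > 1 else x

c = rng.normal(size=4)
L = np.linalg.norm(c)       # f(x) = <c, x> is L-Lipschitz with L = ||c||_2
x_star = -c / L             # minimizer of f over the unit ball
f = lambda x: c @ x

x_t = project_ball(rng.normal(size=4))
ok = True
for t in range(50):
    eta = 0.1 / np.sqrt(t + 1)
    x_half = x_t - eta * c              # mirror (gradient) step
    x_next = project_ball(x_half)       # Bregman projection onto C
    lhs = eta * (f(x_t) - f(x_star))
    rhs = bregman(x_star, x_t) - bregman(x_star, x_next) + eta ** 2 * L ** 2 / 2
    ok = ok and (lhs <= rhs + 1e-10)    # basic inequality with mu = 1
    x_t = x_next
assert ok
```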

To show convergence of mirror descent, we sum over the first \(T\) iterations to obtain

\[\begin{aligned} \sum_{t=0}^T\eta_t(f(x_t) - f(x_*)) &\le D_{\varphi}(x_*,x_0) - D_\varphi(x_*,x_{T+1}) + \frac{L^2}{2\mu}\sum_{t=0}^T\eta_t^2 \\ &\le R_{\cC,\varphi} + \frac{L^2}{2\mu}\sum_{t=0}^T\eta_t^2, \end{aligned}\]

where \(R_{\cC,\varphi} \triangleq \sup_{x\in\cC}D_\varphi(x,x_0)\). Letting \(\bar{x}_T \triangleq \frac{\sum_{t=0}^T\eta_tx_t}{\sum_{t=0}^T\eta_t}\) and appealing to the convexity of \(f\) (Jensen's inequality), this implies the stated result:

\[f(\bar{x}_T) - f(x_*) \le \frac{\sum_{t=0}^T\eta_t(f(x_t) - f(x_*))}{\sum_{t=0}^T\eta_t} \le \frac{R_{\cC,\varphi} + \frac{L^2}{2\mu}\sum_{t=0}^T\eta_t^2}{\sum_{t=0}^T\eta_t}.\]
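To see the rate in action, the sketch below runs entropic mirror descent (exponentiated gradient) on a linear objective over the simplex. Here \(\varphi\) is negative entropy, which is 1-strongly convex WRT \(\|\cdot\|_1\) by Pinsker's inequality (the argument above goes through with \(\|\cdot\|_1\) and its dual \(\|\cdot\|_\infty\) in place of \(\|\cdot\|_2\)), so \(L = \|c\|_\infty\) and \(R_{\cC,\varphi} \le \log n\) for a uniform start. All problem data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 10, 2000
c = rng.uniform(size=n)     # f(x) = <c, x>, minimized over the simplex at a vertex
f_star = c.min()

# entropic mirror map: 1-strongly convex WRT ||.||_1, and f is Lipschitz
# WRT ||.||_1 with constant L = ||c||_inf
L, mu = np.abs(c).max(), 1.0
R = np.log(n)               # sup_x KL(x, x0) <= log n for uniform x0

x = np.full(n, 1.0 / n)
etas, weighted = [], np.zeros(n)
for t in range(T):
    eta = np.sqrt(2 * mu * R / (L ** 2 * (t + 1)))
    weighted += eta * x     # accumulate eta_t * x_t before the update
    etas.append(eta)
    x = x * np.exp(-eta * c)    # mirror step in the dual, then ...
    x = x / x.sum()             # ... Bregman (KL) projection onto the simplex

x_bar = weighted / sum(etas)    # weighted average iterate
bound = (R + (L ** 2 / (2 * mu)) * sum(e ** 2 for e in etas)) / sum(etas)
gap = c @ x_bar - f_star
assert 0 <= gap <= bound + 1e-10
```

With \(\eta_t \propto 1/\sqrt{t+1}\), the bound decays like \(\log T/\sqrt{T}\), and the observed suboptimality gap stays below it.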

Posted on February 09, 2021 from San Francisco, CA.