Nesterov’s Accelerated Gradient Descent

Reposted from: http://blogs.princeton.edu/imabandit/2013/04/01/acceleratedgradientdescent/

In this lecture we consider the same setting as in the previous post (that is, we want to minimize a smooth convex function over \mathbb{R}^n). Previously we saw that the plain Gradient Descent algorithm has a rate of convergence of order 1/t after t steps, while the lower bound that we proved is of order 1/t^2.

We present now a beautiful algorithm due to Nesterov, called Nesterov’s Accelerated Gradient Descent, which attains a rate of order 1/t^2. First we define the following sequences:

  \begin{equation*} \lambda_0 = 0, \qquad \lambda_{s} = \frac{1 + \sqrt{1+ 4 \lambda_{s-1}^2}}{2}, \qquad \gamma_s = \frac{1 - \lambda_s}{\lambda_{s+1}} . \end{equation*}

(Note that \gamma_s \leq 0.) Now the algorithm is simply defined by the following equations, with an arbitrary initial point x_1 = y_1,

  \begin{eqnarray*} y_{s+1} & = & x_s - \frac{1}{\beta} \nabla f(x_s) , \\ x_{s+1} & = & (1 - \gamma_s) y_{s+1} + \gamma_s y_s . \end{eqnarray*}

In other words, Nesterov’s Accelerated Gradient Descent performs a simple step of gradient descent to go from x_s to y_{s+1}, and then it ‘slides’ a little bit further than y_{s+1} in the direction given by the previous point y_s.
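To make the updates concrete, here is a minimal Python sketch of the scheme above (this is not from the original post; the gradient oracle grad_f, the smoothness constant beta, the starting point x1 and the number of steps are assumptions supplied by the caller):

import numpy as np

def nesterov_agd(grad_f, beta, x1, num_steps):
    # Nesterov's accelerated gradient descent for a beta-smooth convex f,
    # following the recursions above with lambda_0 = 0 and x_1 = y_1.
    x = np.array(x1, dtype=float)
    y = x.copy()
    lam_prev = 0.0                                               # lambda_0
    for _ in range(num_steps):
        lam = (1.0 + np.sqrt(1.0 + 4.0 * lam_prev ** 2)) / 2.0   # lambda_s
        lam_next = (1.0 + np.sqrt(1.0 + 4.0 * lam ** 2)) / 2.0   # lambda_{s+1}
        gamma = (1.0 - lam) / lam_next                           # gamma_s (<= 0)
        y_new = x - grad_f(x) / beta           # gradient step: y_{s+1}
        x = (1.0 - gamma) * y_new + gamma * y  # 'slide': x_{s+1}
        y = y_new
        lam_prev = lam
    return y  # the last y iterate

# Hypothetical usage on f(x) = 0.5 * ||A x - b||^2, whose smoothness constant
# beta is the largest eigenvalue of A^T A:
# A = np.array([[3.0, 1.0], [1.0, 2.0]]); b = np.array([1.0, -1.0])
# beta = np.linalg.norm(A.T @ A, 2)
# x_min = nesterov_agd(lambda x: A.T @ (A @ x - b), beta, np.zeros(2), 200)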

The intuition behind the algorithm is quite difficult to grasp, and unfortunately the analysis will not be very enlightening either. Nonetheless Nesterov’s Accelerated Gradient is an optimal method (in terms of oracle complexity) for smooth convex optimization, as shown by the following theorem.


Theorem (Nesterov 1983) Let f be a convex and \beta-smooth function; then Nesterov’s Accelerated Gradient Descent satisfies

  \begin{equation*} f(y_t) - f(x^*) \leq \frac{2 \beta \| x_1 - x^* \|^2}{t^2} . \end{equation*}

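In particular, setting the right-hand side of the bound equal to a target accuracy \epsilon > 0 shows that

  \begin{equation*} t \geq \| x_1 - x^* \| \sqrt{\frac{2 \beta}{\epsilon}} \quad \Longrightarrow \quad f(y_t) - f(x^*) \leq \epsilon , \end{equation*}

that is, accuracy \epsilon is reached after of order 1/\sqrt{\epsilon} gradient steps, compared with order 1/\epsilon for plain gradient descent.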
We follow here the proof by Beck and Teboulle from the paper ‘A fast iterative shrinkage-thresholding algorithm for linear inverse problems’.

Proof: We start with the following observation, which makes use of Lemma 1 and Lemma 2 from the previous lecture (the first inequality below uses convexity of f, the second uses \beta-smoothness): let x, y \in \mathbb{R}^n, then

  \begin{align*} & f\left(x - \frac{1}{\beta} \nabla f(x) \right) - f(y) \\ & \leq f\left(x - \frac{1}{\beta} \nabla f(x) \right) - f(x) + \nabla f(x)^{\top} (x-y) \\ & \leq \nabla f(x)^{\top} \left(x - \frac{1}{\beta} \nabla f(x) - x \right) + \frac{\beta}{2} \left\|x - \frac{1}{\beta} \nabla f(x) - x \right\|^2 + \nabla f(x)^{\top} (x-y) \\ & = - \frac{1}{2 \beta} \| \nabla f(x) \|^2 + \nabla f(x)^{\top} (x-y) . \end{align*}

Now let us apply this inequality to x = x_{s} and y = y_s, which gives

(1) \begin{eqnarray*} f(y_{s+1}) - f(y_s) & = & f\left(x_s - \frac{1}{\beta} \nabla f(x_s) \right) - f(y_s) \notag \\ & \leq & - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 + \nabla f(x_s)^{\top} (x_s-y_s) \notag \\ & = & - \frac{\beta}{2} \| y_{s+1} - x_s \|^2 - \beta (y_{s+1} - x_s)^{\top} (x_s-y_s) .  \end{eqnarray*}

Similarly we apply it to x=x_s and y=x^* which gives

(2) \begin{equation*}  f(y_{s+1}) - f(x^*) \leq - \frac{\beta}{2} \| y_{s+1} - x_s \|^2 - \beta (y_{s+1} - x_s)^{\top} (x_s-x^*) . \end{equation*}

Now multiplying (1) by (\lambda_{s}-1) and adding the result to (2), one obtains with \delta_s = f(y_s) - f(x^*),

  \begin{equation*} \lambda_{s} \delta_{s+1} - (\lambda_{s} - 1) \delta_s \leq - \frac{\beta}{2} \lambda_{s} \| y_{s+1} - x_s \|^2 - \beta (y_{s+1} - x_s)^{\top} (\lambda_{s} x_{s} - (\lambda_{s} - 1) y_s - x^*) . \end{equation*}

Multiplying this inequality by \lambda_{s} and using that by definition \lambda_{s-1}^2 = \lambda_{s}^2 - \lambda_{s} one obtains

(3) \begin{align*} & \lambda_{s}^2 \delta_{s+1} - \lambda_{s-1}^2 \delta_s \notag \\ & \leq - \frac{\beta}{2} \bigg( \|\lambda_{s}( y_{s+1} - x_s )\|^2 + 2 \lambda_{s} (y_{s+1} - x_s)^{\top} (\lambda_{s} x_{s} - (\lambda_{s} - 1) y_s-x^*) \bigg).  \end{align*}
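For completeness, the identity \lambda_{s-1}^2 = \lambda_{s}^2 - \lambda_{s} used above follows directly from the definition of the sequence: rearranging \lambda_{s} = \frac{1 + \sqrt{1+ 4 \lambda_{s-1}^2}}{2} gives

  \begin{equation*} 2 \lambda_{s} - 1 = \sqrt{1+ 4 \lambda_{s-1}^2} \quad \Longrightarrow \quad 4 \lambda_{s}^2 - 4 \lambda_{s} + 1 = 1 + 4 \lambda_{s-1}^2 \quad \Longrightarrow \quad \lambda_{s}^2 - \lambda_{s} = \lambda_{s-1}^2 . \end{equation*}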

Now one can verify that

(4) \begin{align*} & \|\lambda_{s}( y_{s+1} - x_s )\|^2 + 2 \lambda_{s} (y_{s+1} - x_s)^{\top} (\lambda_{s} x_{s} - (\lambda_{s} - 1) y_s-x^*)\notag \\ & = \| \lambda_{s} y_{s+1} - (\lambda_{s} - 1) y_{s}-x^* \|^2 - \| \lambda_{s} x_{s} - (\lambda_{s} - 1) y_{s}-x^* \|^2 .  \end{align*}
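Indeed, (4) is nothing but the expansion \|a\|^2 + 2 a^{\top} b = \|a + b\|^2 - \|b\|^2 applied with a = \lambda_{s} ( y_{s+1} - x_s ) and b = \lambda_{s} x_{s} - (\lambda_{s} - 1) y_{s} - x^*, for which a + b = \lambda_{s} y_{s+1} - (\lambda_{s} - 1) y_{s} - x^*.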

Next remark that, by definition, one has

(5) \begin{align*} & x_{s+1} = y_{s+1} + \gamma_s (y_s - y_{s+1}) \notag \\ & \Leftrightarrow \lambda_{s+1} x_{s+1} = \lambda_{s+1} y_{s+1} + (1-\lambda_{s})(y_s - y_{s+1}) \notag \\ & \Leftrightarrow \lambda_{s+1} x_{s+1} - (\lambda_{s+1} - 1) y_{s+1}= \lambda_{s} y_{s+1} - (\lambda_{s}-1) y_{s} .  \end{align*}

Putting together (3), (4) and (5) one gets with u_s = \lambda_{s} x_{s} - (\lambda_{s} - 1) y_{s} - x^*,

  \begin{equation*} \lambda_{s}^2 \delta_{s+1} - \lambda_{s-1}^2 \delta_s \leq \frac{\beta}{2} \bigg(\|u_s\|^2 - \|u_{s+1}\|^2 \bigg) . \end{equation*}

Summing these inequalities from s=1 to s=t-1 one obtains:

  \begin{equation*} \delta_t \leq \frac{\beta \| u_1 \|^2}{2 \lambda_{t-1}^2} . \end{equation*}

Since \lambda_1 = 1 and x_1 = y_1, we have u_1 = x_1 - x^*, so \delta_t \leq \frac{\beta \| x_1 - x^* \|^2}{2 \lambda_{t-1}^2}. By induction it is easy to see that \lambda_{t-1} \geq \frac{t}{2}, which concludes the proof.
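Indeed, since \sqrt{1+ 4 \lambda_{s-1}^2} \geq 2 \lambda_{s-1}, the recursion gives \lambda_{s} \geq \lambda_{s-1} + \frac{1}{2}, and together with \lambda_1 = 1 this yields \lambda_{t-1} \geq 1 + \frac{t-2}{2} = \frac{t}{2} for every t \geq 2.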
