{"id":446,"date":"2018-04-25T20:20:56","date_gmt":"2018-04-25T20:20:56","guid":{"rendered":"http:\/\/wordpress.cs.vt.edu\/optml\/?p=446"},"modified":"2018-05-08T06:34:37","modified_gmt":"2018-05-08T06:34:37","slug":"proximal-algorithms","status":"publish","type":"post","link":"https:\/\/wordpress.cs.vt.edu\/optml\/2018\/04\/25\/proximal-algorithms\/","title":{"rendered":"Proximal Algorithms"},"content":{"rendered":"<h3>Introduction<\/h3>\n<p>Proximal algorithms are a class of algorithms that can be used to solve constrained optimization problems that may involve non-smooth penalties in the objective function. Their formulation is rather general in that a number of other optimization algorithms can be derived as special cases of proximal algorithms. At the core of\u00a0 proximal algorithms exists the evaluation of the proximal operator:<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctextnormal%7Bprox%7D_%7B%5Cgamma%2C+f%7D+%28x%29%3D+%5Ctextnormal%7Bargmin%7D_%7Bz%7D+%5Cleft%5C%7Bf%28z%29+%2B+%5Cfrac%7B1%7D%7B2%5Cgamma%7D+%5Cleft%5C%7Cx-z%5Cright%5C%7C%5E%7B2%7D_%7B2%7D%5Cright%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;textnormal{prox}_{&#92;gamma, f} (x)= &#92;textnormal{argmin}_{z} &#92;left&#92;{f(z) + &#92;frac{1}{2&#92;gamma} &#92;left&#92;|x-z&#92;right&#92;|^{2}_{2}&#92;right&#92;}\" class=\"latex\" \/><\/p>\n<p>The proximal operator returns a point in the search space that minimizes the problem defined by the Moreau envelope of the objective function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f\" class=\"latex\" \/> [2]:<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7B%5Cgamma%7D%28x%29+%3D+%5Cinf_%7Bz%7D+%5Cleft%5C%7Bf%28z%29+%2B+%5Cfrac%7B1%7D%7B2%5Cgamma%7D+%5Cleft%5C%7Cx-z%5Cright%5C%7C%5E%7B2%7D_%7B2%7D%5Cright%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{&#92;gamma}(x) = &#92;inf_{z} &#92;left&#92;{f(z) + &#92;frac{1}{2&#92;gamma} &#92;left&#92;|x-z&#92;right&#92;|^{2}_{2}&#92;right&#92;}\" class=\"latex\" \/><\/p>\n<p>The Moreau envelope is a regularized version of the function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f\" class=\"latex\" \/>. It is continuously differentiable and defined in all of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BR%7D%5E%7Bn%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{R}^{n}\" class=\"latex\" \/>, even though <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f\" class=\"latex\" \/> may not satisfy either condition,\u00a0 and has the same set of minimizers as\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f\" class=\"latex\" \/> [1,2].<\/p>\n<p><em>Example 1<\/em>. Let <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%28x%29+%3D+%7Cx%7C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f(x) = |x|\" class=\"latex\" \/>. The Moreau envelope of\u00a0\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f(x)\" class=\"latex\" \/> is\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%7B%5Cgamma%7D%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^{&#92;gamma}(x)\" class=\"latex\" \/> and for <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cgamma+%3D+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;gamma = 1\" class=\"latex\" \/>:<\/p>\n<figure id=\"attachment_791\" aria-describedby=\"caption-attachment-791\" style=\"width: 300px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-791 size-medium\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Moreau-300x205.png\" alt=\"\" width=\"300\" height=\"205\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Moreau-300x205.png 300w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Moreau.png 374w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><figcaption id=\"caption-attachment-791\" class=\"wp-caption-text\">Moreau envelope of Abs(x)<\/figcaption><\/figure>\n<p>Notice that the Moreau envelope is smooth even though <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f\" class=\"latex\" \/> itself is not.<\/p>\n<h3>Standard Problem Formulation<\/h3>\n<p>Consider the minimization of functions of the form<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctextnormal%7Bargmin%7D_%7Bx+%5Cin+%5Cmathcal%7BX%7D%7D%5C%2C+F%28x%29+%3A%3D+g%28x%29+%2B+h%28x%29%5Cqquad%5Cqquad+%281%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;textnormal{argmin}_{x &#92;in &#92;mathcal{X}}&#92;, F(x) := g(x) + h(x)&#92;qquad&#92;qquad (1)\" class=\"latex\" \/><\/p>\n<p>where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g(x)\" class=\"latex\" \/> is convex and differentiable and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h(x)\" class=\"latex\" \/> is convex but non-smooth. Such objective functions arise when a penalty such as the lasso is added to the original objective function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g(x)\" class=\"latex\" \/> to induce sparsity in the solution.<\/p>\n<p>Gradient descent can be used to minimize the original objective <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g(x)\" class=\"latex\" \/> as it is smooth and convex, and we can converge to an <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon\" class=\"latex\" \/>-suboptimal solution at a rate of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=O%281%2F%5Cepsilon%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"O(1\/&#92;epsilon)\" class=\"latex\" \/>. Adding a penalty or regularization term <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h(x)\" class=\"latex\" \/> that is non-smooth, however, may cause gradient descent to fail and converge to an incorrect minima. In such a circumstance, subgradient methods [3] are used to minimize the non-smooth function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=F%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"F(x)\" class=\"latex\" \/>. The drawback of using subgradient methods is that they converge far more slowly than gradient descent. An\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon\" class=\"latex\" \/>-suboptimal solution can be obtained only after <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=O%281%2F%5Cepsilon%5E2%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"O(1\/&#92;epsilon^2)\" class=\"latex\" \/> iterations.<\/p>\n<p>When the solution to the proximal operator of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h(x)\" class=\"latex\" \/> is known in closed form, proximal algorithms can be used to significantly speed up the convergence. Table 1 in [2] lists the closed form solution of evaluating the proximal operator on an extensive list of non-smooth functions. Convergence to the an\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon\" class=\"latex\" \/>-suboptimal solution can be attained after <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=O%281%2F%5Cepsilon%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"O(1\/&#92;epsilon)\" class=\"latex\" \/> calls to evaluate the prox operator. Therefore, if the evaluation of the prox operator is cheap, convergence is linear as in the case of gradient descent.<\/p>\n<h3>Proximal Gradient Descent Method<\/h3>\n<p>Proximal gradient descent method can be employed to solve optimization problems of type (1). It is composed of the following two steps:<\/p>\n<ol>\n<li style=\"text-align: left\"><em>Gradient step<\/em>: Starting at <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_k&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_k\" class=\"latex\" \/>, take a step in the direction of the gradient of the differentiable part, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g(x)\" class=\"latex\" \/>:\n<div style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7Bk%7D%5E%7B%2B%7D+%3D+x_%7Bk%7D+-+%5Cgamma+%5Cnabla+g%28x_k%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{k}^{+} = x_{k} - &#92;gamma &#92;nabla g(x_k)\" class=\"latex\" \/><\/div>\n<\/li>\n<li style=\"text-align: left\"><em>Evaluate prox operator<\/em>: Starting at <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_k%5E%7B%2B%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_k^{+}\" class=\"latex\" \/>, evaluate the prox operator of the non-smooth part, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h(x)\" class=\"latex\" \/>:\n<div style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7Bk%2B1%7D+%3D+%5Ctextnormal%7Bprox%7D_%7B%5Cgamma%2C+h%7D%28x_%7Bk%7D%5E%7B%2B%7D%29+%3D+%5Ctextnormal%7Bprox%7D_%7B%5Cgamma%2C+h%7D%28x_k+-+%5Cgamma+%5Cnabla+g%28x_k%29%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{k+1} = &#92;textnormal{prox}_{&#92;gamma, h}(x_{k}^{+}) = &#92;textnormal{prox}_{&#92;gamma, h}(x_k - &#92;gamma &#92;nabla g(x_k))\" class=\"latex\" \/><\/div>\n<\/li>\n<\/ol>\n<p>In [2], the authors give a glimpse into how the above recipe can be motivated in two different ways &#8212; first as an MM-algorithm and second by deriving optimality conditions using subdifferential calculus.<\/p>\n<p>Proximal Gradient Descent method will require <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=O%281%2F%5Cepsilon%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"O(1\/&#92;epsilon)\" class=\"latex\" \/> calls to the prox operator in order to obtain an <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon\" class=\"latex\" \/>-suboptimal solution.<\/p>\n<h3>Proximal Newton Method<\/h3>\n<p>In Proximal Gradient Descent Method, the gradient step is obtained by constructing a quadratic under-approximant to the differentiable function using the scaled identity, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=I%2F%5Cgamma&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"I\/&#92;gamma\" class=\"latex\" \/>, as an approximation to the Hessian of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g(x)\" class=\"latex\" \/>.<\/p>\n<p>Proximal Newton Method can be obtained by using the actual Hessian, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=H+%3A%3D+%5Cnabla%5E%7B2%7Dg%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"H := &#92;nabla^{2}g(x)\" class=\"latex\" \/>, to build the quadratic under-approximant of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g(x)\" class=\"latex\" \/>. Since <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g(x)\" class=\"latex\" \/> is convex, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=H+%5Csucceq+0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"H &#92;succeq 0\" class=\"latex\" \/>. Defining the scaled proximal map as in [4]:<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctextnormal%7Bprox%7D_%7BH%7D%28x%29+%3D+%5Ctextnormal%7Bargmin%7D_%7Bz%7D+%5Cfrac%7B1%7D%7B2%7D%5Cleft%5C%7Cx-z%5Cright%5C%7C%5E%7B2%7D_%7BH%7D+%2B+h%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;textnormal{prox}_{H}(x) = &#92;textnormal{argmin}_{z} &#92;frac{1}{2}&#92;left&#92;|x-z&#92;right&#92;|^{2}_{H} + h(x)\" class=\"latex\" \/><\/p>\n<p>where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cleft%5C%7Cx%5Cright%5C%7C%5E%7B2%7D_%7BH%7D+%3D+x%5E%7BT%7DHx&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;left&#92;|x&#92;right&#92;|^{2}_{H} = x^{T}Hx\" class=\"latex\" \/>, the Proximal Newton Method can be written as [4]:<\/p>\n<ol>\n<li style=\"text-align: left\"><em>Newton step<\/em>: Starting at <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_k&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_k\" class=\"latex\" \/>, take a Newton step:\n<div style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7Bk%7D%5E%7B%2B%7D+%3D+x_%7Bk%7D+-+H_%7Bk%7D%5E%7B-1%7D+%5Cnabla+g%28x_k%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{k}^{+} = x_{k} - H_{k}^{-1} &#92;nabla g(x_k)\" class=\"latex\" \/><\/div>\n<\/li>\n<li style=\"text-align: left\"><em>Evaluate prox operator<\/em>: Starting at <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_k%5E%7B%2B%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_k^{+}\" class=\"latex\" \/>, evaluate the scaled prox operator:\n<div style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=z_%7Bk%7D+%3D+%5Ctextnormal%7Bprox%7D_%7BH_%7Bk%7D%7D%28x_%7Bk%7D%5E%7B%2B%7D%29+%3D+%5Ctextnormal%7Bprox%7D_%7BH_%7Bk%7D%7D%28x_%7Bk%7D+-+H_%7Bk%7D%5E%7B-1%7D+%5Cnabla+g%28x_k%29%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"z_{k} = &#92;textnormal{prox}_{H_{k}}(x_{k}^{+}) = &#92;textnormal{prox}_{H_{k}}(x_{k} - H_{k}^{-1} &#92;nabla g(x_k))\" class=\"latex\" \/><\/div>\n<\/li>\n<li style=\"text-align: left\"><em>Take a step in direction <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28z%5E%7Bk%7D+-+x%5E%7Bk%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(z^{k} - x^{k})\" class=\"latex\" \/><\/em>: Starting at <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_k&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_k\" class=\"latex\" \/>, evaluate <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7Bk%2B1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{k+1}\" class=\"latex\" \/> as:\n<div style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7Bk%2B1%7D+%3D+x_%7Bk%7D+%2B+t_%7Bk%7D+%28z_%7Bk%7D+-+x_%7Bk%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{k+1} = x_{k} + t_{k} (z_{k} - x_{k})\" class=\"latex\" \/><\/div>\n<\/li>\n<\/ol>\n<p>It should be noted that the the proximal operator in the case of Proximal Newton Method relies on the computation of the Hessian. Hence, any line search strategy to identify an appropriate step size will prove expensive.<\/p>\n<p>Proximal Gradient Descent Method can be recovered from Proximal Newton Method by replacing the Hessian with <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=H+%3D+I%2F%5Cgamma&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"H = I\/&#92;gamma\" class=\"latex\" \/>. Whereas the Proximal Gradient Descent Method converges linearly, Proximal Newton Method shows local quadratic convergence [4].<\/p>\n<h3>Nesterov Acceleration<\/h3>\n<p>Nesterov suggests that the optimal convergence rate for first order methods is <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=O%281%2F%5Csqrt%7B%5Cepsilon%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"O(1\/&#92;sqrt{&#92;epsilon})\" class=\"latex\" \/> in [6]. In [5], he introduces ideas to accelerate convergence of smooth functions and shows that the optimal rate is achievable. Accelerated Proximal Gradient Descent Method is built upon the foundations of the ideas laid in [6]. The method is structured as follows [4]:<\/p>\n<ol>\n<li style=\"text-align: left\"><em>Choose initial point<\/em>: <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7B-1%7D+%3D+x_%7B0%7D+%5Cin+%5Cmathbb%7BR%7D%5E%7Bn%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{-1} = x_{0} &#92;in &#92;mathbb{R}^{n}\" class=\"latex\" \/><\/li>\n<li style=\"text-align: left\"><em>Compute an intermediate vector <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7Bk-1%7D%5E%7B%2B%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{k-1}^{+}\" class=\"latex\" \/><\/em>: Starting at <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7Bk-2%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{k-2}\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7Bk-1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{k-1}\" class=\"latex\" \/>, evaluate:\n<div style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7Bk-1%7D%5E%7B%2B%7D+%3D+x_%7Bk-1%7D+%2B+%5Cfrac%7Bk-2%7D%7Bk%2B1%7D+%5Cleft%28x_%7Bk+-+1%7D+-+x_%7Bk+-+2%7D%5Cright%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{k-1}^{+} = x_{k-1} + &#92;frac{k-2}{k+1} &#92;left(x_{k - 1} - x_{k - 2}&#92;right)\" class=\"latex\" \/><\/div>\n<\/li>\n<li style=\"text-align: left\"><em>Evaluate the prox operator after taking a gradient step using <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7Bk-1%7D%5E%7B%2B%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{k-1}^{+}\" class=\"latex\" \/><\/em>: Starting at <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7Bk-1%7D%5E%7B%2B%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{k-1}^{+}\" class=\"latex\" \/>, evaluate <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7Bk%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{k}\" class=\"latex\" \/> as:\n<div style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7Bk%7D+%3D+%5Ctextnormal%7Bprox%7D_%7Bt_k%2C+h%7D%28x_%7Bk-1%7D%5E%7B%2B%7D+-+t_%7Bk%7D+%5Cnabla+g%28x_%7Bk-1%7D%5E%7B%2B%7D%29%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{k} = &#92;textnormal{prox}_{t_k, h}(x_{k-1}^{+} - t_{k} &#92;nabla g(x_{k-1}^{+}))\" class=\"latex\" \/><\/div>\n<\/li>\n<\/ol>\n<p>The intepretation of<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_%7Bk-1%7D%5E%7B%2B%7D+%3D+x_%7Bk-1%7D+%2B+%5Cfrac%7Bk-2%7D%7Bk%2B1%7D+%5Cleft%28x_%7Bk+-+1%7D+-+x_%7Bk+-+2%7D%5Cright%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_{k-1}^{+} = x_{k-1} + &#92;frac{k-2}{k+1} &#92;left(x_{k - 1} - x_{k - 2}&#92;right)\" class=\"latex\" \/><\/p>\n<p>is the same as momentum. If <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=h%28x%29+%5Cequiv+0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"h(x) &#92;equiv 0\" class=\"latex\" \/>, then the prox operator is identity and accelerated gradient descent method is recovered [4].<\/p>\n<h3>Summary<\/h3>\n<p>Proximal algorithms can be used to solve constrained optimization problems that can be split into sum of convex differentiable and convex non-smooth parts. If the prox operator is cheap to evaluate, then linear convergence is recovered in the usual scenario, like in the case of gradient descent. Several other algorithms can be recast in terms of a proximal method [1,2]. Although closed form solutions to prox operator may be required, in [7] the authors study the convergence rates when the prox operator is evaluated in an inexact manner.<\/p>\n<h3>References<\/h3>\n<p>[1]\u00a0Parikh, N. and Boyd, S., 2014. Proximal algorithms.\u00a0<i>Foundations and Trends\u00ae in Optimization<\/i>,\u00a0<i>1<\/i>(3), pp.127-239. [<a href=\"https:\/\/web.stanford.edu\/~boyd\/papers\/pdf\/prox_algs.pdf\">paper<\/a>]<\/p>\n<p>[2] Polson, N.G., Scott, J.G. and Willard, B.T., 2015. Proximal algorithms in statistics and machine learning.\u00a0<i>Statistical Science<\/i>,\u00a0<i>30<\/i>(4), pp.559-581. [<a href=\"https:\/\/arxiv.org\/pdf\/1502.03175.pdf\">paper<\/a>]<\/p>\n<p>[3]\u00a0Boyd, S., Xiao, L. and Mutapcic, A., 2003. Subgradient methods.\u00a0<i>lecture notes of EE392o, Stanford University, Autumn Quarter<\/i>,\u00a0<i>2004<\/i>, pp.2004-2005. [<a href=\"http:\/\/web.mit.edu\/6.976\/www\/notes\/subgrad_method.pdf\">monograph<\/a>]<\/p>\n<p>[4]\u00a0Tibshirani, R., 2016. Proximal Newton Method.\u00a0<i>Slides from 10-725, Carnegie Mellon University, Fall <\/i>,\u00a0<i>2016<\/i>. [<a href=\"http:\/\/www.stat.cmu.edu\/~ryantibs\/convexopt\/lectures\/prox-newton.pdf\">presentation<\/a>]<\/p>\n<p>[5]\u00a0Nesterov, Y., 1983, February. A method of solving a convex programming problem with convergence rate O (1\/k2). In\u00a0<i>Soviet Mathematics Doklady<\/i>\u00a0(Vol. 27, No. 2, pp. 372-376). [<a href=\"http:\/\/www.cis.pku.edu.cn\/faculty\/vision\/zlin\/1983-A%20Method%20of%20Solving%20a%20Convex%20Programming%20Problem%20with%20Convergence%20Rate%20O(k%5E(-2))_Nesterov.pdf\">paper<\/a>]<\/p>\n<p>[6]\u00a0Nesterov, Y., 2013.\u00a0<i>Introductory lectures on convex optimization: A basic course<\/i>\u00a0(Vol. 87). Springer Science &amp; Business Media. [<a href=\"http:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?doi=10.1.1.693.855&amp;rep=rep1&amp;type=pdf\">book<\/a>]<\/p>\n<p>[7]\u00a0Schmidt, M., Roux, N.L. and Bach, F.R., 2011. Convergence rates of inexact proximal-gradient methods for convex optimization. In\u00a0<i>Advances in neural information processing systems<\/i>\u00a0(pp. 1458-1466). [<a href=\"http:\/\/papers.nips.cc\/paper\/4452-convergence-rates-of-inexact-proximal-gradient-methods-for-convex-optimization.pdf\">paper<\/a>]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Proximal algorithms are a class of algorithms that can be used to solve constrained optimization problems that may involve non-smooth penalties in the objective function. Their formulation is rather general in that a number of other optimization algorithms can be derived as special cases of proximal algorithms. At the core of\u00a0 proximal algorithms exists &hellip; <a href=\"https:\/\/wordpress.cs.vt.edu\/optml\/2018\/04\/25\/proximal-algorithms\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Proximal Algorithms<\/span><\/a><\/p>\n","protected":false},"author":146,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[],"class_list":["post-446","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p9CQAE-7c","_links":{"self":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/446","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/users\/146"}],"replies":[{"embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/comments?post=446"}],"version-history":[{"count":4,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/446\/revisions"}],"predecessor-version":[{"id":789,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/446\/revisions\/789"}],"wp:attachment":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/media?parent=446"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/categories?post=446"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/tags?post=446"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}