{"id":22,"date":"2018-02-26T00:43:54","date_gmt":"2018-02-26T00:43:54","guid":{"rendered":"http:\/\/wordpress.cs.vt.edu\/optml\/?p=22"},"modified":"2018-03-10T01:17:38","modified_gmt":"2018-03-10T01:17:38","slug":"22","status":"publish","type":"post","link":"https:\/\/wordpress.cs.vt.edu\/optml\/2018\/02\/26\/22\/","title":{"rendered":"Subgradient Method and Stochastic Gradient Descent"},"content":{"rendered":"<h2>Subgradient Method<\/h2>\n<p>The subgradient method is a basic optimization algorithm for convex nondifferentiable functions. First, let&#8217;s see what a subgradient is. A subgradient is a generalization of the gradient to nondifferentiable functions. For simplicity, let&#8217;s illustrate it in one dimension. The gradient of a function at a given point is the slope of the tangent line at that point.<\/p>\n<figure id=\"attachment_33\" aria-describedby=\"caption-attachment-33\" style=\"width: 560px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-33 size-full\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/02\/gradient.jpg\" alt=\"\" width=\"560\" height=\"420\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/02\/gradient.jpg 560w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/02\/gradient-300x225.jpg 300w\" sizes=\"auto, (max-width: 560px) 100vw, 560px\" \/><figcaption id=\"caption-attachment-33\" class=\"wp-caption-text\">Fig. 1: Gradient of a function.<\/figcaption><\/figure>\n<p>However, for nondifferentiable functions, there is no unique gradient at nondifferentiable points. 
Instead, there are subgradients <span class=\"fontstyle0\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g\" class=\"latex\" \/> which satisfy the following condition:<\/span><\/p>\n<p><span class=\"fontstyle0\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%28y%29%5Cgeq+f%28x%29+%2Bg%5ET%28y-x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f(y)&#92;geq f(x) +g^T(y-x)\" class=\"latex\" \/> for all y.<\/span><\/p>\n<p>All vectors <span class=\"fontstyle0\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g\" class=\"latex\" \/><\/span> that satisfy this condition are called subgradients of the function f at point x.<\/p>\n<figure id=\"attachment_35\" aria-describedby=\"caption-attachment-35\" style=\"width: 560px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-35 size-full\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/02\/subgradient.png\" alt=\"\" width=\"560\" height=\"420\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/02\/subgradient.png 560w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/02\/subgradient-300x225.png 300w\" sizes=\"auto, (max-width: 560px) 100vw, 560px\" \/><figcaption id=\"caption-attachment-35\" class=\"wp-caption-text\">Fig. 2: Slopes of both the blue and red lines are subgradients of the nondifferentiable function at the breakpoint.<\/figcaption><\/figure>\n<div class=\"mceTemp\"><\/div>\n<p>In Fig. 
2, you can see that the slopes of both the blue and red lines are subgradients of the function at\u00a0<span class=\"fontstyle0\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%3D1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x=1\" class=\"latex\" \/><\/span>.<\/p>\n<p>Now assume that our goal is to minimize the cost function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f(x)\" class=\"latex\" \/>. The update rule is simply<\/p>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%5E%7B%28k%2B1%29%7D%3Dx%5E%7B%28k%29%7D-%5Calpha_k+g%5E%7B%28k%29%7D%2C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x^{(k+1)}=x^{(k)}-&#92;alpha_k g^{(k)},\" class=\"latex\" \/>\n<p>where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g%5E%7B%28k%29%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g^{(k)}\" class=\"latex\" \/> is a subgradient of\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f(x)\" class=\"latex\" \/> at point <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%5E%7B%28k%29%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x^{(k)}\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Calpha_k&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;alpha_k\" class=\"latex\" \/> is the step size.<\/p>\n<p>However, the subgradient direction is not always a descent direction (can you think of an example?), so we have to keep track of the smallest objective function value that we have reached so far (i.e., up to 
iteration <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"k\" class=\"latex\" \/>):<\/p>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5Ek_%7Bbest%7D%3D%5Cmin+%5C%7Bf%28x%5E%7B%281%29%7D%29%2C%5Ccdots%2Cf%28x%5E%7B%28k%29%7D%29%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^k_{best}=&#92;min &#92;{f(x^{(1)}),&#92;cdots,f(x^{(k)})&#92;}\" class=\"latex\" \/>\n<p>The proof of convergence for this algorithm depends on three conditions:<\/p>\n<ul>\n<li>The function\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f(x)\" class=\"latex\" \/> is convex, but not necessarily strongly convex.<\/li>\n<li>The size of the subgradient is bounded, i.e., <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7Cg%5E%7B%28k%29%7D%5C%7C_2+%5Cleq+G&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;|g^{(k)}&#92;|_2 &#92;leq G\" class=\"latex\" \/> for all <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"k\" class=\"latex\" \/>. 
For example, if the function\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f(x)\" class=\"latex\" \/> is Lipschitz continuous, then its subgradient is bounded.<\/li>\n<li>The distance of the initial point to the optimal set is bounded by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=R&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"R\" class=\"latex\" \/>, i.e.,\u00a0 <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5C%7Cx%5E%7B%281%29%7D-x%5E%2A%5C%7C_2+%5Cleq+R&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;|x^{(1)}-x^*&#92;|_2 &#92;leq R\" class=\"latex\" \/>.<\/li>\n<\/ul>\n<p>Under the aforementioned conditions, it can be shown that the gap between the best value found so far, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5Ek_%7Bbest%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^k_{best}\" class=\"latex\" \/>, and the optimal value <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5E%2A+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^* \" class=\"latex\" \/> is bounded, i.e.,<\/p>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5Ek_%7Bbest%7D-f%5E%2A+%5Cleq+%5Cfrac%7BR%5E2%2BG%5E2%5Csum_%7Bi%3D1%7D%5Ek+%5Calpha_i%5E2%7D%7B2+%5Csum_%7Bi%3D1%7D%5Ek+%5Calpha_i%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^k_{best}-f^* &#92;leq &#92;frac{R^2+G^2&#92;sum_{i=1}^k &#92;alpha_i^2}{2 &#92;sum_{i=1}^k &#92;alpha_i}\" class=\"latex\" \/>\n<p>Now, depending on the step size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Calpha_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;alpha_i\" class=\"latex\" \/>, we obtain different values for the bound. 
Two step-size rules that we covered in detail in class were:<\/p>\n<ul>\n<li>Constant step size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Calpha_i%3D%5Calpha&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;alpha_i=&#92;alpha\" class=\"latex\" \/>, a positive constant. In this case, we will have <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5Ek_%7Bbest%7D-f%5E%2A+%5Cleq+%5Cfrac%7BR%5E2%2BG%5E2+k+%5Calpha%5E2%7D%7B2+k+%5Calpha%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^k_{best}-f^* &#92;leq &#92;frac{R^2+G^2 k &#92;alpha^2}{2 k &#92;alpha}\" class=\"latex\" \/>. As <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"k\" class=\"latex\" \/> goes to infinity, the bound converges to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=G%5E2+%5Calpha%2F2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"G^2 &#92;alpha\/2\" class=\"latex\" \/>.<\/li>\n<li>Square summable but not summable step sizes, i.e., <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Csum_%7Bk%3D1%7D%5E%5Cinfty+%5Calpha_k%5E2%3C%5Cinfty&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;sum_{k=1}^&#92;infty &#92;alpha_k^2&lt;&#92;infty\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Csum_%7Bk%3D1%7D%5E%5Cinfty+%5Calpha_k%3D%5Cinfty&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;sum_{k=1}^&#92;infty &#92;alpha_k=&#92;infty\" class=\"latex\" \/>. In this case, the bound goes to zero, so the method converges to the optimal value.<\/li>\n<\/ul>\n<p>We showed that the convergence rate in both cases is sublinear, per the definition <a href=\"https:\/\/calculus.subwiki.org\/wiki\/Rate_of_convergence#Steadily_linear.2C_superlinear.2C_and_sublinear_convergence\">here<\/a>.<\/p>\n<p>Also, we discussed in class that if the number of iterations is fixed in advance, i.e., we want to reach the best we can in <img 
decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"k\" class=\"latex\" \/> iterations, then the best possible bound that we can get with any step size is:<\/p>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%5Ek_%7Bbest%7D-f%5E%2A+%5Cleq+%5Cfrac%7BR+G%7D%7B%5Csqrt%7Bk%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f^k_{best}-f^* &#92;leq &#92;frac{R G}{&#92;sqrt{k}}\" class=\"latex\" \/>\n<p>Using this bound, we found that if we want to reduce the gap between the current objective value and the optimal objective by a factor of 1000, we need <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=10%5E6&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"10^6\" class=\"latex\" \/> iterations. For comparison, if <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7Bm%7D%7BM%7D%3D0.5&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac{m}{M}=0.5\" class=\"latex\" \/>, gradient descent needs only about 10 iterations for the same reduction. However, this gradient descent convergence rate holds for the strongly convex case only.<\/p>\n<h2>Stochastic Gradient Descent<\/h2>\n<p>We have seen that we can use standard gradient descent given a convex function and its gradient. We discussed earlier in this section that we can use the subgradient method given a nondifferentiable convex function and its subgradient. 
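To make the subgradient method concrete, here is a minimal sketch in Python (not from the post; the objective, starting point, and step-size rule are illustrative choices). It minimizes a piecewise-linear convex function with a square-summable-but-not-summable step size while tracking the best value found so far:

```python
def f(x):
    # Convex, nondifferentiable objective: f(x) = |x - 1| + |x + 2|,
    # minimized (f* = 3) at every x in [-2, 1].
    return abs(x - 1.0) + abs(x + 2.0)

def subgradient(x):
    # One valid subgradient at each point; at a kink, any element of
    # the subdifferential would do, and sign() simply picks one.
    sign = lambda t: (t > 0) - (t < 0)
    return sign(x - 1.0) + sign(x + 2.0)

x = 10.0       # initial point
f_best = f(x)  # a subgradient step need not decrease f, so track the best
for k in range(1, 5001):
    alpha = 1.0 / k                 # square summable but not summable
    x = x - alpha * subgradient(x)
    f_best = min(f_best, f(x))

print(f_best)  # close to the optimal value f* = 3
```

Printing f(x) at every iteration would show that it is not monotonically decreasing, which is exactly why the f_best bookkeeping matters.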
What approach is appropriate when we are given a convex function with a stochastic gradient?<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-142\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/03\/convex-stochastic-gradient-300x151.png\" alt=\"\" width=\"632\" height=\"318\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/03\/convex-stochastic-gradient-300x151.png 300w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/03\/convex-stochastic-gradient-768x387.png 768w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/03\/convex-stochastic-gradient-1024x516.png 1024w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/03\/convex-stochastic-gradient.png 1421w\" sizes=\"auto, (max-width: 632px) 100vw, 632px\" \/><\/p>\n<p>Stochastic gradient descent (SGD) is a term used to describe the application of classic gradient descent and subgradient methods when only a stochastic estimate of the true (sub)gradient is available.<\/p>\n<p>To get a better feel for what this means and why this is useful, imagine the following two motivating examples. First, consider the training of a neural network (NN) when we have millions of training samples. The huge number of training samples makes the calculation of the gradient very expensive, slowing training. We could randomly select a subset of training samples (say, 1000 of them) and compute a gradient with respect to only that subset. We assume (and in this case, know) that the gradient estimate generated from a sample will behave like the true gradient on average and still point us in a general descent direction.<\/p>\n<p>In another example, consider a black-box optimization problem, meaning that the optimization algorithm only has access to actual function values <em>without<\/em> gradient information. 
At first glance, it may seem like gradient descent methods cannot be used for this problem. However, if we assume that the underlying function has some amount of &#8220;smoothness&#8221; and randomly sample function values \u200bnear the current iterate, we can use a least squares fit to estimate the gradient.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-143\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/03\/stochastic-function-estimate-300x151.png\" alt=\"\" width=\"626\" height=\"315\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/03\/stochastic-function-estimate-300x151.png 300w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/03\/stochastic-function-estimate-768x386.png 768w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/03\/stochastic-function-estimate-1024x514.png 1024w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/03\/stochastic-function-estimate.png 1432w\" sizes=\"auto, (max-width: 626px) 100vw, 626px\" \/><\/p>\n<p>In general, the formal assumption for any SGD method is that the optimization algorithm has access to an estimate <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Bg%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{g}\" class=\"latex\" \/> for the true (sub)gradient <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cnabla+f&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;nabla f\" class=\"latex\" \/> at every point <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x\" class=\"latex\" \/>, which satisfies<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=%7B%5Cmathbb%7BE%7D%5Cleft%28%5Chat%7Bg%7D%28x%29%5Cright%29%7D+%3D+%5Cnabla+f%28x%29%2C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"{&#92;mathbb{E}&#92;left(&#92;hat{g}(x)&#92;right)} = &#92;nabla f(x),\" class=\"latex\" \/><\/p>\n<p>where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BE%7D%5Cleft%28%5Chat%7Bg%7D%28x%29%5Cright%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{E}&#92;left(&#92;hat{g}(x)&#92;right)\" class=\"latex\" \/> denotes the <em>expected value<\/em> of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Bg%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{g}\" class=\"latex\" \/> at <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x\" class=\"latex\" \/>.<\/p>\n<p>Using <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Bg%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{g}\" class=\"latex\" \/> as a proxy for <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cnabla+f&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;nabla f\" class=\"latex\" \/>, we can perform standard (sub)gradient descent in order to converge in expected value to the true minimum. 
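As a sketch of this idea (an illustrative problem and parameters, not from the post), the following Python snippet runs minibatch SGD on a synthetic least-squares objective; the minibatch gradient is an unbiased estimate of the full gradient, since its expectation over the sampled rows equals the true gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: f(x) = (1/2n) * ||A x - b||^2
n, d = 1000, 5
A = rng.normal(size=(n, d))
x_true = np.arange(1.0, d + 1.0)   # the minimizer, since b = A @ x_true
b = A @ x_true

def stochastic_gradient(x, batch_size=32):
    # Gradient on a random minibatch of rows; averaging over the sampled
    # rows makes its expectation equal the full gradient (1/n) A^T (A x - b).
    idx = rng.integers(0, n, size=batch_size)
    Ab, bb = A[idx], b[idx]
    return Ab.T @ (Ab @ x - bb) / batch_size

x = np.zeros(d)
alpha = 0.01                        # constant step size
for k in range(2000):
    x = x - alpha * stochastic_gradient(x)

print(np.linalg.norm(x - x_true))   # small: x ends up near the minimizer
```

Because this synthetic problem has no label noise, the gradient estimate vanishes at the minimizer and the iterates settle down; with noisy data, a constant step size would instead leave x hovering in a neighborhood of the minimizer, as discussed below for robust SGD.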
That is, we can use the &#8220;vanilla&#8221; SGD update<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%5E%7B%28k%2B1%29%7D+%3D+x%5E%7B%28k%29%7D+-+%5Calpha%5E%7B%28k%29%7D+%5Chat%7Bg%7D%28x%5E%7B%28k%29%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x^{(k+1)} = x^{(k)} - &#92;alpha^{(k)} &#92;hat{g}(x^{(k)})\" class=\"latex\" \/><\/p>\n<p>where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Calpha%5E%7B%28k%29%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;alpha^{(k)}\" class=\"latex\" \/> is the step size.<\/p>\n<p>We saw two convergence proofs for this vanilla SGD. The first leverages an inductive argument to show that for a strongly convex differentiable function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f\" class=\"latex\" \/>, after <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"t\" class=\"latex\" \/> iterations using a diminishing step size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Calpha%5E%7B%28k%29%7D+%3D+%5Cfrac%7BC%7D%7Bk%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;alpha^{(k)} = &#92;frac{C}{k}\" class=\"latex\" \/>, the expected error in function value satisfies<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathbb%7BE%7D%28f%28x%5E%7B%28t%29%7D%29+-+f%28x%5E%2A%29%29+%5Capprox+%5Cmathcal%7BO%7D%28%5Cfrac%7B1%7D%7Bt%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathbb{E}(f(x^{(t)}) - f(x^*)) &#92;approx &#92;mathcal{O}(&#92;frac{1}{t})\" class=\"latex\" \/><\/p>\n<p>where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%5E%2A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x^*\" class=\"latex\" \/> is the true optimal value for <img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x\" class=\"latex\" \/>.<\/p>\n<p>In a slight modification, often called &#8220;robust&#8221; SGD, we relax the assumptions so that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f\" class=\"latex\" \/> need only be convex with a subgradient. Now, using a constant step size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Calpha%5E%7B%28k%29%7D+%3D+%5Calpha&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;alpha^{(k)} = &#92;alpha\" class=\"latex\" \/>, robust SGD can be proven to converge to some neighborhood of the true minimum <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%5E%2A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x^*\" class=\"latex\" \/> with a radius proportional to <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Calpha&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;alpha\" class=\"latex\" \/> at a rate of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathcal%7BO%7D%28%5Cfrac%7B1%7D%7B%5Csqrt%7Bt%7D%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathcal{O}(&#92;frac{1}{&#92;sqrt{t}})\" class=\"latex\" \/>. It follows that, for a general function, it is good practice to begin SGD with a constant step size, then diminish the step size whenever the algorithm seems to have stalled in its progress.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Subgradient Method Subgradient method is the basic optimization algorithm for convex\u00a0 nondifferentiable functions. First, let&#8217;s see what a subgradient is. A subgradient is a generalization of the gradient for nondifferentiable functions. For simplicity, let&#8217;s elaborate it in a one-dimensional space. 
The gradient of a function at a certain point is the slope of the tangent &hellip; <a href=\"https:\/\/wordpress.cs.vt.edu\/optml\/2018\/02\/26\/22\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Subgradient Method and Stochastic Gradient Descent<\/span><\/a><\/p>\n","protected":false},"author":151,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[],"class_list":["post-22","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/s9CQAE-22","_links":{"self":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/22","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/users\/151"}],"replies":[{"embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/comments?post=22"}],"version-history":[{"count":51,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/22\/revisions"}],"predecessor-version":[{"id":149,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/22\/
revisions\/149"}],"wp:attachment":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/media?parent=22"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/categories?post=22"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/tags?post=22"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}