{"id":5,"date":"2018-02-03T16:40:19","date_gmt":"2018-02-03T16:40:19","guid":{"rendered":"http:\/\/wordpress.cs.vt.edu\/optml\/?p=5"},"modified":"2018-02-03T16:40:19","modified_gmt":"2018-02-03T16:40:19","slug":"gradient-descent-newtons-method-and-lbfgs","status":"publish","type":"post","link":"https:\/\/wordpress.cs.vt.edu\/optml\/2018\/02\/03\/gradient-descent-newtons-method-and-lbfgs\/","title":{"rendered":"Gradient Descent, Newton&#8217;s Method, and LBFGS"},"content":{"rendered":"<p>In the first few sessions of the course, we went over gradient descent (with exact line search), Newton&#8217;s Method, and quasi-Newton methods. For me, and many of the students, this was the first time I had sat down to go over the convergence guarantees of these methods and how they are proven.<\/p>\n<p>Generally, we were examining descent methods that aim to solve optimizations of the form<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmin_%7Bx+%5Cin+%5Cmathbb%7BR%7D%7D+f%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;min_{x &#92;in &#92;mathbb{R}} f(x)\" class=\"latex\" \/><\/p>\n<p>by iteratively trying input vectors in a sequence defined by<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%5E%7B%28k%2B1%29%7D+%3D+x%5E%7B%28k%29%7D+%2B+t%5E%7B%28k%29%7D+%5CDelta+x&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x^{(k+1)} = x^{(k)} + t^{(k)} &#92;Delta x\" class=\"latex\" \/>.<\/p>\n<p>For the analyses we studied so far, we assume the function is strongly convex, meaning that there exists some constant <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m\" class=\"latex\" \/> for which<\/p>\n<p><img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=f%28y%29+%5Cge+f%28x%29+%2B+%5Cnabla+f%28x%29%5E%5Ctop+%28y+-+x%29+%2B+%5Cfrac%7Bm%7D%7B2%7D+%7C%7C+y+-+x+%7C%7C%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f(y) &#92;ge f(x) + &#92;nabla f(x)^&#92;top (y - x) + &#92;frac{m}{2} || y - x ||^2\" class=\"latex\" \/>,<\/p>\n<p>and there is another constant <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=M&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"M\" class=\"latex\" \/> such that<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%28y%29+%5Cle+f%28x%29+%2B+%5Cnabla+f%28x%29%5E%5Ctop+%28y+-+x%29+%2B+%5Cfrac%7BM%7D%7B2%7D+%7C%7C+y+-+x+%7C%7C%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f(y) &#92;le f(x) + &#92;nabla f(x)^&#92;top (y - x) + &#92;frac{M}{2} || y - x ||^2\" class=\"latex\" \/>.<\/p>\n<p>In other words, the function can be lower and upper bounded by quadratic functions.<\/p>\n<h2>Gradient Descent<\/h2>\n<p>We first went over the proof in Boyd and Vandenberghe&#8217;s <a href=\"https:\/\/web.stanford.edu\/~boyd\/cvxbook\/\">textbook<\/a> on convex optimization for gradient descent. The approach they used to show the convergence rate of gradient descent was to bound the improvement at each iteration, which shows that the gap between the current objective value and the optimal objective value shrinks by a constant factor each iteration. This constant-factor shrinkage is known as\u00a0<em>linear convergence<\/em>. 
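To make the iteration concrete, here is a minimal sketch of gradient descent (my own illustration, not code from the book) with the fixed step size 1/M on a two-dimensional strongly convex quadratic; the constants m = 1 and M = 10 are illustrative values I chose:

```python
# Gradient descent sketch on f(x) = 0.5 * (m * x0^2 + M * x1^2),
# a strongly convex quadratic with grad f(x) = (m * x0, M * x1).
# Illustrative constants: m = 1, M = 10.
M_CONST = 10.0
m_const = 1.0

def f(x):
    return 0.5 * (m_const * x[0] ** 2 + M_CONST * x[1] ** 2)

def grad(x):
    return (m_const * x[0], M_CONST * x[1])

def gradient_descent(x0, steps=100):
    x = x0
    t = 1.0 / M_CONST  # a fixed step size 1/M suffices for an M-smooth function
    for _ in range(steps):
        g = grad(x)
        x = (x[0] - t * g[0], x[1] - t * g[1])
    return x

x_final = gradient_descent((5.0, -3.0))
# f(x_final) is now very close to the optimal value p* = 0
```

With these constants the slow coordinate contracts by a factor of 1 - m/M = 0.9 per iteration, a small numeric illustration of the linear convergence just described.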
This type of analysis results in a bound of the form<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%28x%29+-+p%2A+%5Cle+c%5Ek+%28f%28x%5E%7B%280%29%7D%29+-+p%2A%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f(x) - p* &#92;le c^k (f(x^{(0)}) - p*)\" class=\"latex\" \/>,<\/p>\n<p>where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=c+%3D+1+-+m+%2F+M&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"c = 1 - m \/ M\" class=\"latex\" \/> is considered a constant, and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=p%2A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"p*\" class=\"latex\" \/> is the value of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f\" class=\"latex\" \/> at the optimum. Each individual step to arrive at this bound was fairly easy to understand, but I found it a bit hard to see how the original analysts decided what steps to take. Did they start with the linear convergence rate and work backwards? Or did they try different manipulations of the strong convexity conditions until they reached a linear rate? I hope, as we study more proofs this semester, that the students and I start to gain a better intuition about how to do these proofs ourselves when faced with a new method.<\/p>\n<h2>Newton&#8217;s Method<\/h2>\n<p>We continued with the Boyd &amp; Vandenberghe <a href=\"https:\/\/web.stanford.edu\/~boyd\/cvxbook\/\">book<\/a>, looking at its discussion of Newton&#8217;s Method (with backtracking). 
Many of us in the class had the rough idea that Newton&#8217;s Method, because it uses second-order information, achieves a quadratic convergence rate, and the analysis does confirm that.<\/p>\n<p>The backtracking Newton Update is<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%5E%7B%28k%2B1%29%7D+%3D+x%5E%7B%28k%29%7D+-+t%5E%7B%28k%29%7D+%5Cnabla%5E2+f%28x%29%5E%7B-1%7D+%5Cnabla+f%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x^{(k+1)} = x^{(k)} - t^{(k)} &#92;nabla^2 f(x)^{-1} &#92;nabla f(x)\" class=\"latex\" \/>,<\/p>\n<p>where the step size <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=t%5E%7B%28k%29%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"t^{(k)}\" class=\"latex\" \/> starts at 1 and is repeatedly scaled by a factor <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta+%5Cin+%280%2C+1%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta &#92;in (0, 1)\" class=\"latex\" \/> until the sufficient-decrease condition <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%28x+%2B+t+%5CDelta+x%29+%5Cle+f%28x%29+%2B+%5Calpha+t+%5Cnabla+f%28x%29%5E%5Ctop+%5CDelta+x&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f(x + t &#92;Delta x) &#92;le f(x) + &#92;alpha t &#92;nabla f(x)^&#92;top &#92;Delta x\" class=\"latex\" \/> holds, with <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5CDelta+x&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;Delta x\" class=\"latex\" \/> the Newton direction.<\/p>\n<p>The analysis adds another assumption about <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f\" class=\"latex\" \/>: that its Hessian is Lipschitz continuous:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7C%7C%5Cnabla%5E2+f%28x%29+-+%5Cnabla%5E2+f%28y%29%7C%7C+%5Cle+L+%7C%7Cx+-+y%7C%7C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"||&#92;nabla^2 f(x) - &#92;nabla^2 f(y)|| &#92;le L ||x - y||\" class=\"latex\" \/>.<\/p>\n<p>The convergence analysis is broken into two phases: a damped phase, where the backtracking search shrinks the step size, and a pure phase, where the full Newton Step is used. 
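Before the phase analysis, the update and the backtracking loop can be sketched together. This is my own one-dimensional illustration (where the Hessian is just the scalar second derivative), not code from the book, and the parameter values alpha = 0.25 and beta = 0.5 are assumed defaults:

```python
import math

# Damped Newton sketch in one dimension.
# f(x) = exp(x) + exp(-x) has its minimum p* = 2 at x = 0, and f''(x) > 0 everywhere.

def f(x):
    return math.exp(x) + math.exp(-x)

def df(x):
    return math.exp(x) - math.exp(-x)

def d2f(x):
    return math.exp(x) + math.exp(-x)

def newton(x, alpha=0.25, beta=0.5, tol=1e-10, max_iter=50):
    for _ in range(max_iter):
        step = -df(x) / d2f(x)       # Newton direction: -(f'')^{-1} f'
        decrement2 = -df(x) * step   # squared Newton decrement
        if decrement2 / 2.0 < tol:   # standard stopping rule
            break
        t = 1.0
        # backtracking: shrink t until the sufficient-decrease condition holds
        while f(x + t * step) > f(x) + alpha * t * df(x) * step:
            t *= beta
        x = x + t * step
    return x
```

Starting from, say, x = 3, the early iterates creep toward the minimum, and the last few collapse the error quadratically, mirroring the two phases described above.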
During the phase with the full Newton Step, it is shown that the error is bounded by a rapidly shrinking constant:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f%28x%5E%7B%28l%29%7D%29+-+p%2A+%5Cle+%5Cfrac%7B2m%5E3%7D%7BL%5E2%7D+%5Cleft%28+%5Cfrac%7B1%7D%7B2%7D+%5Cright%29%5E%7B2%5E%7Bl+-+k+%2B+1%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f(x^{(l)}) - p* &#92;le &#92;frac{2m^3}{L^2} &#92;left( &#92;frac{1}{2} &#92;right)^{2^{l - k + 1}}\" class=\"latex\" \/>,<\/p>\n<p>where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"k\" class=\"latex\" \/> is the step at which we enter this second phase. This is the quadratically converging phase of Newton&#8217;s Method. The key is then how quickly we enter the second phase. The eventual bound on how many steps it takes to enter the second phase is<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=k+%5Cle+%5Cfrac%7BM%5E2+L%5E2+%2F+m%5E5%7D%7B%5Calpha+%5Cbeta+%5Cmin%5C%7B1%2C+9+%281+-+2+%5Calpha%29%5E2+%5C%7D%7D+%28f%28x%5E%7B%280%29%7D%29+-+p%2A%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"k &#92;le &#92;frac{M^2 L^2 \/ m^5}{&#92;alpha &#92;beta &#92;min&#92;{1, 9 (1 - 2 &#92;alpha)^2 &#92;}} (f(x^{(0)}) - p*)\" class=\"latex\" \/>.<\/p>\n<p>In this formula <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Calpha&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;alpha\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta\" class=\"latex\" \/> are constant parameters of the backtracking line search, so the number of iterations necessary to start the quadratically converging phase is a constant scaling of the initial error.<\/p>\n<h2>LBFGS<\/h2>\n<p>Newton&#8217;s Method is great, but each iteration is rather expensive because it involves the computation of the Hessian and inverting it. 
For high-dimensional problems, this can make Newton&#8217;s Method practically unusable. Our last topic of this block of classes was on one of the more famous quasi-Newton methods. We decided to read the highly cited paper by Liu and Nocedal, &#8220;On the Limited Memory BFGS Method for Large Scale Optimization,&#8221; but we were surprised as a class to discover that this paper is more of an empirical study of LBFGS in comparison to other quasi-Newton approaches. So we supplemented with other resources, like Vandenberghe&#8217;s <a href=\"http:\/\/www.seas.ucla.edu\/~vandenbe\/236C\/lectures\/qnewton.pdf\">slides<\/a>.<\/p>\n<p>We were also surprised that the proof presented in Liu and Nocedal&#8217;s paper shows LBFGS to be a linearly converging method. If LBFGS is not provably faster than gradient descent, it&#8217;s not clear why anyone would use it. However, we hypothesized as a class that there may be other proofs elsewhere that show super-linear convergence, because many of us thought we had seen LBFGS listed as a super-linear algorithm. We&#8217;ll certainly look into this later as a class.<\/p>\n<p>We spent time in class going over the secant condition that LBFGS, and BFGS, uses to approximate the Hessian. The key idea we had to understand is that the secant method can be viewed as a linear approximation of the Hessian analogous to making a linear approximation of the gradient by measuring the function value at two points. 
One measures the gradient at two locations <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cnabla+f%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;nabla f(x)\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cnabla+f%28y%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;nabla f(y)\" class=\"latex\" \/>, and the secant condition is<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=H%28x+-+y%29+%3D+%5Cnabla+f%28x%29+-+%5Cnabla+f%28y%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"H(x - y) = &#92;nabla f(x) - &#92;nabla f(y)\" class=\"latex\" \/>.<\/p>\n<p>The BFGS method then iteratively projects the previous Hessian approximation onto the space of Hessian approximations that agree with this condition, which is reasonable if you believe the Hessian does not change much. LBFGS instead stores only the previous <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m\" class=\"latex\" \/> differences of iterates and gradients and implicitly rebuilds the Hessian approximation from those.<\/p>\n<p>We also puzzled over how LBFGS can avoid the quadratic cost of either storing or computing the approximate Hessian. 
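Before getting to that cost question, one sanity check that helped me think about the secant condition: for a quadratic function it holds exactly, since the gradient is a linear function of the input. A tiny sketch (my own illustration) with a diagonal Hessian:

```python
# For f(x) = 0.5 * x^T H x with diagonal H, grad f(x) = H x, so the secant
# condition H (x - y) = grad f(x) - grad f(y) holds exactly.
H = [2.0, 5.0]  # diagonal entries of the Hessian

def grad(x):
    return [H[i] * x[i] for i in range(len(x))]

x = [1.0, -2.0]
y = [3.0, 0.5]

lhs = [H[i] * (x[i] - y[i]) for i in range(len(x))]
gx, gy = grad(x), grad(y)
rhs = [gx[i] - gy[i] for i in range(len(x))]
# lhs and rhs are identical: the secant condition is exact for quadratics
```

For a non-quadratic function the condition is only an approximation, which is why treating it as a constraint on the Hessian estimate is reasonable when the Hessian changes slowly between iterates.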
As a class, we were initially confused because most formulas for the approximate Hessian include outer product terms involving the differences between the input vectors and the gradients, e.g.,<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cunderbrace%7B%28x%5E%7B%28j%29%7D+-+x%5E%7B%28j+-+1%29%7D%29%7D_%7Bs_j%7D+%7E+%5Cunderbrace%7B%28%5Cnabla+f%28x%5E%7B%28j%29%7D%29+-+%5Cnabla+f%28x%5E%7B%28j+-+1%29%7D%29%29%5E%5Ctop%7D_%7By_j%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;underbrace{(x^{(j)} - x^{(j - 1)})}_{s_j} ~ &#92;underbrace{(&#92;nabla f(x^{(j)}) - &#92;nabla f(x^{(j - 1)}))^&#92;top}_{y_j}\" class=\"latex\" \/>,<\/p>\n<p>where <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=s_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"s_j\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=y_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"y_j\" class=\"latex\" \/> are the shorthand used to describe these difference vectors in the slides. Based on the existence of these outer products, it appears as if an <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=O%28n%5E2%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"O(n^2)\" class=\"latex\" \/> cost is unavoidable, whereas the literature says the memory and running time costs are <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=O%28mn%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"O(mn)\" class=\"latex\" \/>.<\/p>\n<p>The key insight into why we can avoid the quadratic cost (of even thinking about the full-sized Hessian) is that in all cases of these outer products, they are eventually right multiplied by yet another vector of length <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/>. 
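The reassociation can be seen in a few lines of code (my own illustration): building the outer product first costs quadratic time and memory, while taking the scalar product first costs linear time, and both give the same vector.

```python
# Compute (a b^T) c two ways: the naive way materializes an n x n outer
# product; the reassociated way never forms a matrix at all.
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
c = [7.0, 8.0, 9.0]
n = len(a)

# O(n^2): build the outer product a b^T, then multiply it by c
outer = [[a[i] * b[j] for j in range(n)] for i in range(n)]
v_naive = [sum(outer[i][j] * c[j] for j in range(n)) for i in range(n)]

# O(n): compute the scalar b^T c first, then scale a by it
b_dot_c = sum(b[j] * c[j] for j in range(n))
v_fast = [a[i] * b_dot_c for i in range(n)]
# v_naive and v_fast are the same vector
```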
We then have products of length-<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n\" class=\"latex\" \/> vectors that look like<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cunderbrace%7Ba+b%5E%5Ctop%7D_%7Bn+%5Ctimes+n%7D+%5Cunderbrace%7Bc%7D_%7Bn+%5Ctimes+1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;underbrace{a b^&#92;top}_{n &#92;times n} &#92;underbrace{c}_{n &#92;times 1}\" class=\"latex\" \/>,<\/p>\n<p>which can be computed much more efficiently by first evaluating the dot product on the right:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cunderbrace%7Ba%7D_%7Bn+%5Ctimes+1%7D+%5Cunderbrace%7Bb%5E%5Ctop+c%7D_%7B1+%5Ctimes+1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;underbrace{a}_{n &#92;times 1} &#92;underbrace{b^&#92;top c}_{1 &#92;times 1}\" class=\"latex\" \/>.<\/p>\n<h2>Concluding Thoughts<\/h2>\n<p>One of my goals with looking at these foundational methods first was to gain an intuition for how to prove convergence of more sophisticated optimization approaches. To do so, I was hoping to find nice patterns of proof techniques. I&#8217;m not certain that I did that, personally, with these three analyses. Parts of the analyses were analogous to each other, but overall, their convergence proofs seemed rather different to me. Perhaps as we see more analyses throughout the semester, we&#8217;ll start to recognize patterns more.<\/p>\n<p>It&#8217;s interesting to have sat down to really dig into these methods, since so much of what I know as a machine learning researcher comes from &#8220;mythology&#8221; about these methods. One important note is the strength of the assumptions underlying the convergence guarantees. We often use these methods and variants on functions that may not be strongly convex or convex at all. 
Such misuse of these algorithms will naturally occur. &#8220;When all you have is a hammer&#8230;&#8221;\u00a0After spending three (and a half) classes looking at these convergence guarantees, I don&#8217;t yet have a clear idea of what we can say theoretically when these assumptions are violated.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the first few sessions of the course, we went over gradient descent (with exact line search), Newton&#8217;s Method, and quasi-Newton methods. For me, and many of the students, this was the first time I had sat down to go over the convergence guarantees of these methods and how they are proven. Generally, we were &hellip; <a href=\"https:\/\/wordpress.cs.vt.edu\/optml\/2018\/02\/03\/gradient-descent-newtons-method-and-lbfgs\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Gradient Descent, Newton&#8217;s Method, and LBFGS<\/span><\/a><\/p>\n","protected":false},"author":141,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[],"class_list":["post-5","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p9CQAE-5","_links":{"self":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/5","targetHints":{"allo
w":["GET"]}}],"collection":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/users\/141"}],"replies":[{"embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/comments?post=5"}],"version-history":[{"count":5,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/5\/revisions"}],"predecessor-version":[{"id":17,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/5\/revisions\/17"}],"wp:attachment":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/media?parent=5"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/categories?post=5"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/tags?post=5"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}