{"id":796,"date":"2018-05-11T05:43:39","date_gmt":"2018-05-11T05:43:39","guid":{"rendered":"http:\/\/wordpress.cs.vt.edu\/optml\/?p=796"},"modified":"2018-05-11T05:43:39","modified_gmt":"2018-05-11T05:43:39","slug":"nesterov-accelerated-gradient-descent","status":"publish","type":"post","link":"https:\/\/wordpress.cs.vt.edu\/optml\/2018\/05\/11\/nesterov-accelerated-gradient-descent\/","title":{"rendered":"(Nesterov) Accelerated Gradient Descent"},"content":{"rendered":"<p>We define standard gradient descent as:<br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=w%5E%7Bk%2B1%7D+%3D+w%5Ek+-+%5Calpha+%5CDelta+f%28w%5Ek%29+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"w^{k+1} = w^k - &#92;alpha &#92;Delta f(w^k) \" class=\"latex\" \/><\/p>\n<p>Momentum adds a relatively subtle change to gradient descent by replace f with z. Where<br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=z%5E%7Bk%2B1%7D+%3D+%5Cbeta+z%5Ek+%2B+%5CDelta+f%28w%5Ek%29+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"z^{k+1} = &#92;beta z^k + &#92;Delta f(w^k) \" class=\"latex\" \/><br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=w%5E%7Bk%2B1%7D+%3D+w%5Ek+-+%5Calpha+z%5E%7Bk%2B1%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"w^{k+1} = w^k - &#92;alpha z^{k+1} \" class=\"latex\" \/><\/p>\n<p>As such, alpha retains the role of representing the step size. However, we introduce a new constant, beta. Beta adds a decaying effect on momentum, and must be between 0 and 1, noninclusive. We briefly demonstrate the edge cases of beta below:<br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta+%3D+0%3A++++w%5E%7Bk%2B1%7D+%3D+w%5Ek+%2B+%5Calpha+%5Cbig+%28+0+%5Ctimes+z%5Ek+%2B+%5CDelta+f%28w%5Ek%29+%5Cbig+%29++++%5Crightarrow++++w%5E%7Bk%2B1%7D+%3D+w%5Ek+%2B%5Calpha+%5CDelta+f%28w%5Ek%29++++&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta = 0:    w^{k+1} = w^k + &#92;alpha &#92;big ( 0 &#92;times z^k + &#92;Delta f(w^k) &#92;big )    &#92;rightarrow    w^{k+1} = w^k +&#92;alpha &#92;Delta f(w^k)    \" class=\"latex\" \/><br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta+%3D+1%3A+w%5E%7Bk%2B1%7D+%3D+w%5Ek+-+%5Calpha+%5Cbig+%28++++%5CDelta+f%28w%5Ek%29+%2B+%5CDelta+f%28w%5E%7Bk-1%7D%29+%2B+%5CDelta+f%28w%5E%7Bk-2%7D%29+%2B+%5Cdots++++%5Cbig+%29++++&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta = 1: w^{k+1} = w^k - &#92;alpha &#92;big (    &#92;Delta f(w^k) + &#92;Delta f(w^{k-1}) + &#92;Delta f(w^{k-2}) + &#92;dots    &#92;big )    \" class=\"latex\" \/><\/p>\n<p>We now introduce a change of basis which will be convenient for later. Consider a matrix form of gradient descent:<br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=w%5E%7Bk%2B1%7D+%3D+w%5Ek+-+%5Calpha+%28Aw%5Ek+-+b%29+%5Ctext%7B+where+%7D+%5CDelta+f%28w%29+%3D+Aw%5Ek-b%C2%A0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"w^{k+1} = w^k - &#92;alpha (Aw^k - b) &#92;text{ where } &#92;Delta f(w) = Aw^k-b\u00a0\" class=\"latex\" \/><\/p>\n<p>Which has optimal solution<br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=w%5E%7B%5Cstar%7D+%3D+A%5E%7B-1%7Db%C2%A0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"w^{&#92;star} = A^{-1}b\u00a0\" class=\"latex\" \/><br \/>\nwe consider the eigenvectors of A which is a natural space of viewing A where all dimensions act independently.<\/p>\n<p>Given:<br \/>\n$latex\u00a0A = Q \\text{ diag }(\\lambda_1 \\dots \\lambda_n)Q^T $<br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%5Ek+%3D+Q%5ET+%28w%5Ek+-+w%5E%7B%5Cstar%7D%29++++%5Cleftarrow+%5Ctext%7Bnew+basis%7D++++&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x^k = Q^T (w^k - w^{&#92;star})    &#92;leftarrow &#92;text{new basis}    \" class=\"latex\" \/><\/p>\n<p>We may derive this by:<br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=w%5E%7Bk%2B1%7D+%3D+w%5Ek+-+%5Calpha+%28Aw%5Ek+-+b%29%C2%A0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"w^{k+1} = w^k - &#92;alpha (Aw^k - b)\u00a0\" class=\"latex\" \/><br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=w%5E%7Bk%2B1%7D+-w%5E%7B%5Cstar%7D+%3D+w%5Ek+-+w%5E%7B%5Cstar%7D+-+%5Calpha+%28Aw%5Ek+-+b%29%C2%A0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"w^{k+1} -w^{&#92;star} = w^k - w^{&#92;star} - &#92;alpha (Aw^k - b)\u00a0\" class=\"latex\" \/><br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=Q%5ET%5Cbig%28w%5E%7Bk%2B1%7D+-w%5E%7B%5Cstar%7D%5Cbig%29+%3D+Q%5ET%5Cbig%28w%5Ek+-+w%5E%7B%5Cstar%7D%5Cbig%29+-+%5Calpha+Q%5ET+%5Cbig%28Aw%5Ek+-+b%5Cbig+%29%C2%A0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"Q^T&#92;big(w^{k+1} -w^{&#92;star}&#92;big) = Q^T&#92;big(w^k - w^{&#92;star}&#92;big) - &#92;alpha Q^T &#92;big(Aw^k - b&#92;big )\u00a0\" class=\"latex\" \/><br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%5E%7Bk%2B1%7D+%3D+x%5Ek+-+%5Calpha+Q%5ET+%5Cbig%28Aw%5Ek+-+b%5Cbig+%29%C2%A0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x^{k+1} = x^k - &#92;alpha Q^T &#92;big(Aw^k - b&#92;big )\u00a0\" class=\"latex\" \/><\/p>\n<p><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=w%5E%7B%5Cstar%7D+%3D+A%5E%7B-1%7Db++++%5Cimplies++++Aw%5E%7B%5Cstar%7D+%3D+b++++&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"w^{&#92;star} = A^{-1}b    &#92;implies    Aw^{&#92;star} = b    \" class=\"latex\" \/><br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%5E%7Bk%2B1%7D+%3D+x%5Ek+-+%5Calpha+Q%5ET+%5Cbig%28Aw%5Ek+-+Aw%5E%5Cstar+%5Cbig+%29%C2%A0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x^{k+1} = x^k - &#92;alpha Q^T &#92;big(Aw^k - Aw^&#92;star &#92;big )\u00a0\" class=\"latex\" \/><br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%5E%7Bk%2B1%7D+%3D+x%5Ek+-+%5Calpha+Q%5ET+A+%5Cbig+%28w%5Ek+-+w%5E%5Cstar+%5Cbig+%29%C2%A0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x^{k+1} = x^k - &#92;alpha Q^T A &#92;big (w^k - w^&#92;star &#92;big )\u00a0\" class=\"latex\" \/><br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%5E%7Bk%2B1%7D+%3D+x%5Ek+-+%5Calpha+Q%5ET+Q+%5Ctext%7Bdiag%7D%5Cbig%28%5Clambda%5Cbig%29Q%5ET+%5Cbig+%28w%5Ek+-+w%5E%5Cstar+%5Cbig+%29%C2%A0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x^{k+1} = x^k - &#92;alpha Q^T Q &#92;text{diag}&#92;big(&#92;lambda&#92;big)Q^T &#92;big (w^k - w^&#92;star &#92;big )\u00a0\" class=\"latex\" \/><br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%5E%7Bk%2B1%7D+%3D+x%5Ek+-+%5Calpha+%5Ctext%7Bdiag%7D%5Cbig%28%5Clambda%5Cbig%29Q%5ET+%5Cbig+%28w%5Ek+-+w%5E%5Cstar+%5Cbig+%29%C2%A0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x^{k+1} = x^k - &#92;alpha &#92;text{diag}&#92;big(&#92;lambda&#92;big)Q^T &#92;big (w^k - w^&#92;star &#92;big )\u00a0\" class=\"latex\" \/><br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%5E%7Bk%2B1%7D+%3D+x%5Ek+-+%5Calpha+%5Ctext%7Bdiag%7D%5Cbig%28%5Clambda%5Cbig%29Q%5ET+%5Cbig+%28w%5Ek+-+w%5E%5Cstar+%5Cbig+%29%C2%A0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x^{k+1} = x^k - &#92;alpha &#92;text{diag}&#92;big(&#92;lambda&#92;big)Q^T &#92;big (w^k - w^&#92;star &#92;big )\u00a0\" class=\"latex\" \/><br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x%5E%7Bk%2B1%7D+%3D+x%5Ek+-+%5Calpha+%5Ctext%7Bdiag%7D%5Cbig%28%5Clambda%5Cbig%29+x%5Ek%C2%A0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x^{k+1} = x^k - &#92;alpha &#92;text{diag}&#92;big(&#92;lambda&#92;big) x^k\u00a0\" class=\"latex\" \/><br \/>\n<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_i%5E%7Bk%2B1%7D+%3D+x_i%5Ek+-+%5Calpha+%5Clambda_i+x_i%5Ek++++%3D+%281-+%5Calpha+%5Clambda_i+%29x_i%5Ek++++&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_i^{k+1} = x_i^k - &#92;alpha &#92;lambda_i x_i^k    = (1- &#92;alpha &#92;lambda_i )x_i^k    \" class=\"latex\" \/><br \/>\n$latex\u00a0= \\sum_i^n x_i^0(1-\\alpha \\lambda_i)^k q_i $<\/p>\n<p>Thus, we now can view gradient descent in basis of the eigenvectors of A. We consider the contributions of error as<br \/>\n$latex\u00a0f(w^k)-f(w^\\star) = \\sum (1 &#8211; \\alpha \\lambda_i)^{2k} \\lambda_i [x_i^0]^2 $<br \/>\nFor convergence, we note that<br \/>\n$latex\u00a01 &#8211; \\alpha \\lambda_i $<br \/>\nmust be strictly less than 1.<\/p>\n<p>&nbsp;<\/p>\n<p>Note: To be continued&#8230;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We define standard gradient descent as: Momentum adds a relatively subtle change to gradient descent by replace f with z. Where As such, alpha retains the role of representing the step size. However, we introduce a new constant, beta. Beta adds a decaying effect on momentum, and must be between 0 and 1, noninclusive. We &hellip; <a href=\"https:\/\/wordpress.cs.vt.edu\/optml\/2018\/05\/11\/nesterov-accelerated-gradient-descent\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">(Nesterov) Accelerated Gradient Descent<\/span><\/a><\/p>\n","protected":false},"author":164,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[1],"tags":[],"class_list":["post-796","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p9CQAE-cQ","_links":{"self":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/796","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/users\/164"}],"replies":[{"embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/comments?post=796"}],"version-history":[{"count":20,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/796\/revisions"}],"predecessor-version":[{"id":816,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/796\/revisions\/816"}],"wp:attachment":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/media?parent=796"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/categories?post=796"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/tags?post=796"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}