<h1>Learning to learn by gradient descent by gradient descent</h1>
<p><em>Published 2018-04-13.</em></p>
<h2>Introduction</h2>
<p>Tasks in machine learning are typically defined as finding a minimizer of an objective function $f(\theta)$ over some domain $\theta \in \Theta$. The minimization itself is usually performed with gradient-descent-based methods, which update the parameters using gradient information: $\theta_{t+1} = \theta_t - \alpha_t \nabla f(\theta_t)$. Many refinements have been proposed over the years that improve on plain gradient descent by exploiting the geometry of the problem, adaptive learning rates, or momentum.</p>
<p>The paper approaches optimization from a novel perspective: it replaces the hand-designed update rule with a learned one, capable of generalizing to a specific class of problems.</p>
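To make the contrast concrete, here is a minimal sketch, not taken from the paper: a toy quadratic objective with a hand-designed gradient-descent step next to the learned-update form $\theta_{t+1} = \theta_t + g_t$, where `learned_update` is a hypothetical stand-in for a trained model.

```python
import numpy as np

def f(theta):
    """Toy objective f(theta) = ||theta||^2 (illustrative, not from the paper)."""
    return float(theta @ theta)

def grad_f(theta):
    return 2.0 * theta

def sgd_step(theta, alpha=0.1):
    # Hand-designed rule: theta_{t+1} = theta_t - alpha * grad f(theta_t)
    return theta - alpha * grad_f(theta)

def learned_update(grad):
    # Stand-in for the trained optimizer: a real g_t would be predicted by
    # an RNN from the gradient (and its hidden state), not a fixed formula.
    return -0.1 * grad

theta = np.array([1.0, -2.0])
for _ in range(50):
    theta = theta + learned_update(grad_f(theta))  # theta_{t+1} = theta_t + g_t
print(f(theta))  # a small value near zero
```

The point of the learned form is that nothing forces $g_t$ to be a scaled negative gradient; the network is free to discover richer update behavior.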
<p>The update to the current set of parameters is predicted by a neural network, specifically an RNN: $\theta_{t+1} = \theta_t + g_t$.</p>
<img src="http://wordpress.cs.vt.edu/optml/wp-content/uploads/sites/69/2018/04/teaser.png" alt="Hand-designed versus learned update rules" width="578" />
<h3>Framework</h3>
<p>Given an optimizee $f$ parameterized by $\theta$, and an optimizer $m$ parameterized by $\phi$, the objective function for the optimizer is</p>
<p>$L(\phi) = \mathbb{E}_{f}\left[ \sum_{t=1}^{T} w_t\, f(\theta_t) \right]$</p>
<p>where $\theta_{t+1} = \theta_t + g_t$ and $(g_t, h_{t+1}) = m(\nabla_t, h_t, \phi)$.</p>
<p>The update step $g_t$ and the next hidden state $h_{t+1}$ are the outputs of the recurrent neural network $m$, parameterized by $\phi$ and fed the optimizee gradient $\nabla_t = \nabla_\theta f(\theta_t)$.</p>
<p>The $w_t \in \mathbb{R}_{\geq 0}$ are weights associated with the optimizee values over the $T$ unrolled time steps; $w_t = 1$ is used in all the experiments. The optimizer parameters $\phi$ are updated with truncated back-propagation through time. The gradient flow can be seen in the computational graph below.</p>
<p>The gradients along the dashed lines are ignored in order to avoid computing expensive second derivatives.</p>
<img src="http://wordpress.cs.vt.edu/optml/wp-content/uploads/sites/69/2018/04/flowgradient.png" alt="Computational graph used for computing the gradient of the optimizer" width="798" />
<h3>Coordinatewise LSTM architecture</h3>
<p>Applying a separate LSTM to each parameter would be computationally very expensive and would introduce tens of thousands of additional parameters to optimize. To avoid this, the paper uses a coordinatewise architecture: the optimizer parameters are shared across all coordinates of the optimizee, while a separate hidden state is maintained for each optimizee parameter.</p>
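The unrolled objective $L(\phi)$ and the coordinatewise update can be sketched as follows. This is a deliberately tiny stand-in under stated assumptions: `m` here is a leaky gradient accumulator with a single shared gain `phi`, not the paper's two-layer LSTM, and the optimizee is a toy quadratic. What it preserves is the structure: optimizer parameters shared across coordinates, a per-coordinate hidden state, and a loss summed over the unrolled trajectory.

```python
import numpy as np

def f(theta):               # toy optimizee (hypothetical)
    return float(theta @ theta)

def grad_f(theta):
    return 2.0 * theta

def m(grad, h, phi):
    # Coordinatewise update: the same parameter phi is applied to every
    # coordinate, but each coordinate carries its own hidden state h.
    h_next = 0.9 * h + grad      # simple leaky accumulator as hidden state
    g = -phi * h_next            # predicted update g_t
    return g, h_next

def unrolled_loss(phi, theta0, T=20, w=1.0):
    theta, h, total = theta0.copy(), np.zeros_like(theta0), 0.0
    for _ in range(T):
        g, h = m(grad_f(theta), h, phi)
        theta = theta + g        # theta_{t+1} = theta_t + g_t
        total += w * f(theta)    # w_t = 1 in the paper's experiments
    return total

# Training phi would mean descending on unrolled_loss via truncated BPTT;
# here we only evaluate it for one candidate value of phi.
print(unrolled_loss(0.05, np.array([1.0, -2.0])))
```

A nonzero `phi` that drives the iterates toward the minimum yields a smaller unrolled loss than the do-nothing optimizer `phi = 0`, which is exactly the signal the meta-training objective exploits.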
<p>The base architecture is a two-layer LSTM using the standard forget-gate formulation.</p>
<img src="http://wordpress.cs.vt.edu/optml/wp-content/uploads/sites/69/2018/04/coordinate.png" alt="One step of the coordinatewise LSTM optimizer" width="784" />
<h3>Preprocessing</h3>
<p>One key difficulty in passing parameter gradients to the optimizer network is handling their widely different scales: typical deep networks only work well when their inputs are not arbitrarily scaled. Instead of passing the gradient $\nabla$ directly, the pair $(\log|\nabla|, \mathrm{sgn}(\nabla))$, i.e. the log-magnitude and the sign of the gradient, is passed to the coordinatewise architecture.</p>
<h2><strong>Experiment</strong></h2>
<p>The authors use two-layer LSTMs with 20 hidden units in each layer for the optimizer. Each optimizer is trained by minimizing the loss $L(\phi)$ with truncated BPTT, using Adam with a learning rate chosen by random search. The trained optimizers are compared with standard optimizers: SGD, RMSProp, Adam, and NAG.</p>
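The gradient preprocessing described above can be sketched as follows. The two-branch rule and the threshold `p = 10` follow the scheme reported in the paper's appendix (tiny gradients get a separate branch so the log does not blow up); the small additive constant inside the log is an implementation guard added here, not part of the paper.

```python
import numpy as np

def preprocess(grad, p=10.0):
    """Map a gradient to (log|grad|/p, sgn(grad)) when |grad| >= e^-p,
    and to (-1, e^p * grad) otherwise, so inputs to the optimizer RNN
    have comparable scale across many orders of magnitude."""
    grad = np.asarray(grad, dtype=float)
    big = np.abs(grad) >= np.exp(-p)
    # 1e-300 guards the log for the branch np.where still evaluates at 0.
    log_part = np.where(big, np.log(np.abs(grad) + 1e-300) / p, -1.0)
    sign_part = np.where(big, np.sign(grad), np.exp(p) * grad)
    return np.stack([log_part, sign_part], axis=-1)

# Gradients spanning six orders of magnitude map to similarly scaled inputs:
print(preprocess([1e-3, 1.0, 1e3]))
```

Note that the transform is invertible in both branches, so no gradient information is discarded, only rescaled.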
<p>The learning rate of each baseline optimizer is tuned, while its other hyperparameters are left at their default values in Torch7.</p>
<p>The first evaluation uses a class of 10-dimensional synthetic quadratic functions of the form</p>
<p>$f(\theta) = \| W\theta - y \|_2^2$</p>
<p>where $W$ is a 10×10 matrix and $y$ is a 10-dimensional vector, both drawn from an i.i.d. Gaussian distribution. As shown in the figure below, the LSTM-based optimizer performs much better than the standard optimizers.</p>
<img src="http://wordpress.cs.vt.edu/optml/wp-content/uploads/sites/69/2018/04/ded-1.jpg" alt="Learning curves on random quadratic functions" width="477" />
<p>The authors also trained an optimizer on a neural network with one hidden layer of 20 units and sigmoid activations on the MNIST training set.</p>
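The synthetic quadratic family from the experiment above can be sketched as follows, with plain gradient descent standing in for the tuned baselines; the seed, step size, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

# One sample from the family f(theta) = ||W theta - y||^2 with W a 10x10
# Gaussian matrix and y a 10-dim Gaussian vector.
rng = np.random.default_rng(0)
W = rng.standard_normal((10, 10))
y = rng.standard_normal(10)

def f(theta):
    r = W @ theta - y
    return float(r @ r)

def grad_f(theta):
    return 2.0 * W.T @ (W @ theta - y)

# Plain gradient descent as a baseline trajectory on this sample.
theta = np.zeros(10)
losses = [f(theta)]
for _ in range(200):
    theta -= 0.005 * grad_f(theta)
    losses.append(f(theta))
print(losses[0], losses[-1])
```

The learned optimizer is trained on many such random $(W, y)$ draws and evaluated on fresh draws from the same distribution, so it only ever needs to generalize within this family.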
<p>They show experimentally that this optimizer generalizes better than the standard optimizers to modified architectures with more layers (2) and more hidden units (40), as shown below.</p>
<img src="http://wordpress.cs.vt.edu/optml/wp-content/uploads/sites/69/2018/04/dewq.jpg" alt="Generalization to larger MNIST architectures" width="502" /><img src="http://wordpress.cs.vt.edu/optml/wp-content/uploads/sites/69/2018/04/rf.jpg" alt="Generalization to larger MNIST architectures" width="433" />
<p>However, when the architecture is changed drastically, for example by using ReLU activations instead of sigmoid, the LSTM optimizer no longer generalizes, as shown below.</p>
<img src="http://wordpress.cs.vt.edu/optml/wp-content/uploads/sites/69/2018/04/dasa.jpg" alt="Failure to generalize to ReLU activations" width="415" />
<p>The authors also applied the LSTM optimizer to neural art, generating multiple image styles using one style image and 1800 content images.</p>
<p>They found that the LSTM optimizer outperforms the standard optimizers on test content images, both at the training resolution and at twice the training resolution, as shown below.</p>
<img src="http://wordpress.cs.vt.edu/optml/wp-content/uploads/sites/69/2018/04/art.jpg" alt="Neural art results at training resolution" width="483" /><img src="http://wordpress.cs.vt.edu/optml/wp-content/uploads/sites/69/2018/04/art2.jpg" alt="Neural art results at twice the training resolution" width="619" />
<h2><strong>Conclusion</strong></h2>
<p>The authors show how the design of optimization algorithms can be cast as a learning problem. Their experiments demonstrate that learned neural optimizers can outperform state-of-the-art hand-designed optimizers on the classes of problems they are trained on.</p>
<h2><strong>References</strong></h2>
<p>[1] M. Andrychowicz et al., "Learning to Learn by Gradient Descent by Gradient Descent" (https://arxiv.org/abs/1606.04474)</p>