{"id":84,"date":"2018-03-13T09:31:42","date_gmt":"2018-03-13T09:31:42","guid":{"rendered":"http:\/\/wordpress.cs.vt.edu\/optml\/?p=84"},"modified":"2018-03-13T09:31:42","modified_gmt":"2018-03-13T09:31:42","slug":"adam","status":"publish","type":"post","link":"https:\/\/wordpress.cs.vt.edu\/optml\/2018\/03\/13\/adam\/","title":{"rendered":"Adam"},"content":{"rendered":"<p>This post presents a summary of the background and class discussion on the Adam (adaptive moment estimation) algorithm. The primary source for this discussion was the original Adam\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1412.6980\">paper<\/a>. Adam is quite possibly the most popular optimization algorithm being used today in machine learning, particularly in deep learning.<\/p>\n<h3>Background (Ryan Kingery)<\/h3>\n<p>Adam can be thought of as a generalization of stochastic gradient descent (SGD). It essentially combines three popular additions to SGD into one algorithm: AdaGrad, momentum, and RMSProp.<\/p>\n<p style=\"text-align: center\">Adam = AdaGrad + momentum + RMSProp<\/p>\n<p>Recall from a previous post that AdaGrad modifies SGD by allowing for variable learning rates that can hopefully take advantage of the curvature of the objective function to allow for faster convergence. With Adam we extend this idea by using moving averages of the gradient moment estimates to smooth the stochastic descent trajectory. The idea is that smoothing this trajectory should speed up convergence by making Adam converge more like batch gradient descent, but without the need to use the entire dataset.<\/p>\n<p>To discuss momentum and RMSProp we must recall the definition of an exponentially weighted moving average (EWMA), also called exponential smoothing or infinite impulse response filtering. 
Suppose we have a random sample <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_1%2C%5Ccdots%2Cx_T&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_1,&#92;cdots,x_T\" class=\"latex\" \/> of time-ordered data. An EWMA generates a new sequence\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m_1%2C%5Ccdots%2Cm_T&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m_1,&#92;cdots,m_T\" class=\"latex\" \/> defined by<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m_t+%3D%C2%A0+%5Cbeta+m_%7Bt-1%7D+%2B+%281-%5Cbeta%29x_t+%3D+%281-%5Cbeta%29+%5Csum_%7Bi%3D1%7D%5Et+%5Cbeta%5E%7Bt-i%7Dx_i.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m_t =\u00a0 &#92;beta m_{t-1} + (1-&#92;beta)x_t = (1-&#92;beta) &#92;sum_{i=1}^t &#92;beta^{t-i}x_i.\" class=\"latex\" \/><\/p>\n<p>The hyperparameter <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=0+%5Cleq+%5Cbeta+%5Cleq+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"0 &#92;leq &#92;beta &#92;leq 1\" class=\"latex\" \/> is called a smoothing parameter. Intuitively, the EWMA is a way of smoothing time-ordered data. When <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta+%5Capprox+0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta &#92;approx 0\" class=\"latex\" \/>, the EWMA sequence depends only on the current value of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_t&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_t\" class=\"latex\" \/>, which just gives the original non-smoothed sequence back. 
When\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta+%5Capprox+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta &#92;approx 1\" class=\"latex\" \/>, the EWMA doesn&#8217;t depend at all on the current value\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=x_t&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"x_t\" class=\"latex\" \/>, which results in a constant value for all <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m_t&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m_t\" class=\"latex\" \/>.<\/p>\n<p>The EWMA is useful when we&#8217;re given noisy data and would like to filter out some of the noise to allow for better estimation of the underlying process. In practice we thus tend to favor higher values of\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta\" class=\"latex\" \/>, usually 0.9 or higher. It is precisely this idea that momentum and RMSProp exploit. Before getting to that, though, a minor point.<\/p>\n<p>It turns out that the EWMA tends to produce biased estimates for the early values <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m_t&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m_t\" class=\"latex\" \/>. Because the recursion is initialized at 0 before the process has generated any data, the earlier part of the EWMA sequence is biased towards 0. 
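To make the smoothing and the zero-initialization bias concrete, here is a minimal NumPy sketch (the data and the choice of beta are made up for illustration); the corrected values divide by the factor 1 - beta^t introduced next:

```python
import numpy as np

# Made-up noisy observations around a true mean of 1.0.
rng = np.random.default_rng(0)
x = 1.0 + 0.1 * rng.standard_normal(100)

beta = 0.9   # smoothing parameter, favoring heavy smoothing
m = 0.0      # the recursion starts at 0, which is what causes the early bias
ewma, corrected = [], []
for t, x_t in enumerate(x, start=1):
    m = beta * m + (1 - beta) * x_t      # plain EWMA
    ewma.append(m)
    corrected.append(m / (1 - beta**t))  # bias-corrected EWMA

# The first plain EWMA value is only (1 - beta) * x_1, biased towards 0,
# while the corrected value recovers x_1 exactly.
print(ewma[0], corrected[0])
```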
To fix this problem, one can use instead a bias-corrected EWMA defined by<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat+m_t+%3D+%5Cfrac%7Bm_t%7D%7B1-%5Cbeta%5Et%7D.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat m_t = &#92;frac{m_t}{1-&#92;beta^t}.\" class=\"latex\" \/><\/p>\n<p>One might naturally ask whether this bias correction term is something to worry about in practice, and the answer, of course, is it depends. The time it takes for the bias to go away goes roughly like <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7B3%7D%7B1-%5Cbeta%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac{3}{1-&#92;beta}\" class=\"latex\" \/>. Thus, the larger <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta\" class=\"latex\" \/> is the longer it takes for this bias effect to go away. For\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta%3D0.9&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta=0.9\" class=\"latex\" \/> it takes about 30 iterations, and for\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta%3D0.999&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta=0.999\" class=\"latex\" \/> it takes about 3000 iterations.<\/p>\n<p>In the context of machine learning though, it doesn&#8217;t really make much of a difference when you&#8217;re training for many thousands of iterations anyway. To see why, suppose you have a modest dataset of 10,000 observations. If you train this using a mini-batch size of 32, that gives about 312 iterations per epoch. Training for only 10 epochs then already puts you over 3000 iterations. Thus, unless you&#8217;re in a scenario where you can train your data very quickly (e.g. 
transfer learning), the bias correction probably won&#8217;t be important to you. It is thus not uncommon in practice to ignore bias correction when implementing EWMA in machine learning algorithms.<\/p>\n<p>Now, the Adam algorithm uses EWMA to estimate the first and second moments of the objective function gradient <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cnabla+f%28%5Ctheta%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;nabla f(&#92;theta)\" class=\"latex\" \/>. Let us then briefly review what the moment of a random variable is.<\/p>\n<p>Recall that the\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n%5E%7Bth%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"n^{th}\" class=\"latex\" \/> moment\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmu_n&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mu_n\" class=\"latex\" \/> of a random variable\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/> is defined by<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmu_n+%3D+E%28X%5En%29.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mu_n = E(X^n).\" class=\"latex\" \/><\/p>\n<p>We can see then that the first moment is just the expectation\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmu+%3D+E%28X%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mu = E(X)\" class=\"latex\" \/>, and the second moment is the uncentered variance<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmu_2+%3D+%5Cmu%5E2+%2B+var%28X%29.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mu_2 = &#92;mu^2 + var(X).\" class=\"latex\" \/><\/p>\n<p>For a random 
vector\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X+%5Cin+%5Cmathbb%7BR%7D%5En&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X &#92;in &#92;mathbb{R}^n\" class=\"latex\" \/>, we can extend the definition by defining the expectation component-wise by<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28%5Cmu%29_i+%3D+E%28X_i%29%2C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(&#92;mu)_i = E(X_i),\" class=\"latex\" \/><\/p>\n<p>and the second moment component-wise by<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28%5Cmu_2%29_%7Bij%7D+%3D+E%5Cbig%28%28XX%5ET%29_%7Bij%7D%5Cbig%29.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(&#92;mu_2)_{ij} = E&#92;big((XX^T)_{ij}&#92;big).\" class=\"latex\" \/><\/p>\n<p>Back to Adam again, denote the gradient <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cnabla+f%28%5Ctheta%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;nabla f(&#92;theta)\" class=\"latex\" \/> by\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g\" class=\"latex\" \/> and the element-wise square of the gradient by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g^2\" class=\"latex\" \/>. 
We then take an EWMA of each of these using smoothing parameters <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta_1%2C+%5Cbeta_2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta_1, &#92;beta_2\" class=\"latex\" \/>:<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m_t+%3D+%5Cbeta_1+m_%7Bt-1%7D+%2B+%281-%5Cbeta_1%29g_t&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m_t = &#92;beta_1 m_{t-1} + (1-&#92;beta_1)g_t\" class=\"latex\" \/>\u00a0 \u00a0 (momentum)<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=v_t+%3D+%5Cbeta_2+v_%7Bt-1%7D+%2B+%281-%5Cbeta_2%29g_t%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"v_t = &#92;beta_2 v_{t-1} + (1-&#92;beta_2)g_t^2\" class=\"latex\" \/>\u00a0 \u00a0 (RMSProp)<\/p>\n<p>In practice, it&#8217;s common to take <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta_1%3D0.9&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta_1=0.9\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta_2%3D0.999&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta_2=0.999\" class=\"latex\" \/>, which from the above discussion you can see favors high smoothing. We can now state the Adam parameter update formally, with learning rate <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Calpha&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;alpha\" class=\"latex\" \/>:<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Ctheta_t+%3D+%5Ctheta_%7Bt-1%7D+-+%5Calpha%5Cfrac%7B%5Chat+m_t%7D%7B%5Csqrt%7B%5Chat+v_t%7D%2B%5Cvarepsilon%7D.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;theta_t = &#92;theta_{t-1} - &#92;alpha&#92;frac{&#92;hat m_t}{&#92;sqrt{&#92;hat v_t}+&#92;varepsilon}.\" class=\"latex\" \/><\/p>\n<p>Note that the bias-corrected terms are included in the update! 
This follows the original paper; in practice the bias correction is often omitted, especially for the momentum term\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=m_t&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"m_t\" class=\"latex\" \/>. Also note the presence of the\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cvarepsilon&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;varepsilon\" class=\"latex\" \/>. This is included to prevent numerical instability in the denominator, and is typically fixed in advance to a very small number, usually\u00a0<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cvarepsilon+%3D+10%5E%7B-8%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;varepsilon = 10^{-8}\" class=\"latex\" \/>. You can think of this as a form of Laplace smoothing if you wish.<\/p>\n<p>The ratio <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7Bm_t%7D%7B%5Csqrt%7Bv_t%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac{m_t}{&#92;sqrt{v_t}}\" class=\"latex\" \/> is used to provide an estimate of the ratio of moments <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7BE%28g%29%7D%7B%5Csqrt%7BE%28g%5E2%29%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac{E(g)}{&#92;sqrt{E(g^2)}}\" class=\"latex\" \/>, which can be thought of as a point estimate of the best direction in which to descend and the best step size to take. Note that these are ratios of vectors, meaning that the divisions are done component-wise.<\/p>\n<p>We thus now have a &#8220;smoother&#8221; version of AdaGrad that allows for robustness in the presence of the noise inherent in SGD. That the algorithm works well in practice has been validated repeatedly in numerous deep learning applications. 
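Putting the pieces together, one Adam step can be sketched in a few lines of NumPy. This is a paraphrase of the update described above, not the authors' reference code; the toy quadratic objective and the loop length are invented for illustration, and the default hyperparameter values match the common choices mentioned in the text:

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction, as described above."""
    m = beta1 * m + (1 - beta1) * g       # EWMA of the gradient (momentum)
    v = beta2 * v + (1 - beta2) * g**2    # EWMA of the squared gradient (RMSProp)
    m_hat = m / (1 - beta1**t)            # bias corrections
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy example: minimize f(theta) = theta^2, whose gradient is 2*theta.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    g = 2 * theta
    theta, m, v = adam_step(theta, g, m, v, t, alpha=0.01)
print(theta)  # near 0
```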
Whether this actually helps in theory leads to a discussion of the algorithm&#8217;s convergence &#8220;guarantees&#8221;.<\/p>\n<h3>Convergence (Colin Shea-Blymyer)<\/h3>\n<p>The proof of convergence for this algorithm is very long, and I fear that we won&#8217;t gain much from going through it step-by-step. Instead, we&#8217;ll turn our attention to an analysis of the regret bounds for this algorithm.<\/p>\n<p style=\"text-align: left\">To begin, we&#8217;ll look at what regret means in this instance. Formally, regret is defined as<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=R%28T%29+%3D+%5Csum%5Climits_%7Bt%3D1%7D%5ET%5Bf_t%28%5Ctheta_t%29-f_t%28%5Ctheta%5E%2A%29%5D.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"R(T) = &#92;sum&#92;limits_{t=1}^T[f_t(&#92;theta_t)-f_t(&#92;theta^*)].\" class=\"latex\" \/><\/p>\n<p style=\"text-align: left\">In English, that represents the sum of how far every step of our optimization has been from the optimal point. A high regret means our optimization has not been very efficient.<\/p>\n<p>Next we&#8217;ll define <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g_t&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g_t\" class=\"latex\" \/> as the gradient of our function at the point we&#8217;ve reached in our algorithm at step <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"t\" class=\"latex\" \/>, and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g_%7Bt%2Ci%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g_{t,i}\" class=\"latex\" \/> as the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=i%5E%7Bth%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"i^{th}\" class=\"latex\" \/> element of that gradient. 
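As a toy numerical illustration of the regret definition above (the per-step losses below are invented), regret just accumulates the gap between the loss we incurred at each step and the loss at the optimal point:

```python
# Invented per-step losses f_t(theta_t) along the trajectory,
# and f_t(theta*) at the fixed optimal point.
f_at_iterates = [1.00, 0.50, 0.20, 0.10, 0.05]
f_at_optimum = [0.02, 0.02, 0.02, 0.02, 0.02]

regret = sum(ft - fstar for ft, fstar in zip(f_at_iterates, f_at_optimum))
print(round(regret, 2))                       # total regret R(T): 1.75
print(round(regret / len(f_at_iterates), 2))  # average regret R(T)/T: 0.35
```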
We&#8217;ll use slicing on these gradients to define<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=g_%7B1%3At%2Ci%7D+%3D+%5Bg_%7B1%2Ci%7D%2C+g_%7B2%2Ci%7D%2C+...%2C+g_%7Bt%2Ci%7D%5D.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"g_{1:t,i} = [g_{1,i}, g_{2,i}, ..., g_{t,i}].\" class=\"latex\" \/><\/p>\n<p>That is, a vector that contains the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=i%5E%7Bth%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"i^{th}\" class=\"latex\" \/> element of all gradients we&#8217;ve seen. We&#8217;ll also define <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cgamma&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;gamma\" class=\"latex\" \/> as <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7B%5Cbeta_1%5E2%7D%7B%5Csqrt%5B%5D%7B%5Cbeta_2%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac{&#92;beta_1^2}{&#92;sqrt[]{&#92;beta_2}}\" class=\"latex\" \/>, which is a ratio of the decay rates of the first and second moments, squared and square-rooted, respectively. Further, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Clambda&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;lambda\" class=\"latex\" \/> represents the exponential decay rate of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta_%7B1%2Ct%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta_{1,t}\" class=\"latex\" \/>.<\/p>\n<p>The next step is to make some assumptions. 
First, assume the gradients of our function <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=f_t&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"f_t\" class=\"latex\" \/> are bounded thus: <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7C%7C%5Cnabla+f_t%28%5Ctheta%29%7C%7C_2+%5Cleq+G&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"||&#92;nabla f_t(&#92;theta)||_2 &#92;leq G\" class=\"latex\" \/>, and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7C%7C%5Cnabla+f_t%28%5Ctheta%29%7C%7C_%5Cinfty+%5Cleq+G_%5Cinfty&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"||&#92;nabla f_t(&#92;theta)||_&#92;infty &#92;leq G_&#92;infty\" class=\"latex\" \/>. Second, we assume a bound on the distance between any two points discovered by Adam:<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7C%7C%5Ctheta_n-%5Ctheta_m%7C%7C_2+%5Cleq+D%2C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"||&#92;theta_n-&#92;theta_m||_2 &#92;leq D,\" class=\"latex\" \/><\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%7C%7C%5Ctheta_n-%5Ctheta_m%7C%7C_%5Cinfty+%5Cleq+D_%5Cinfty.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"||&#92;theta_n-&#92;theta_m||_&#92;infty &#92;leq D_&#92;infty.\" class=\"latex\" \/><\/p>\n<p>Finally, we define some ranges for our parameters: <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta_1%2C+%5Cbeta_2+%5Cin+%5B0%2C1%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta_1, &#92;beta_2 &#92;in [0,1)\" class=\"latex\" \/>, and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7B%5Cbeta_1%5E2%7D%7B%5Csqrt%7B%5Cbeta_2%7D%7D+%3C+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac{&#92;beta_1^2}{&#92;sqrt{&#92;beta_2}} &lt; 1\" class=\"latex\" \/>; the learning 
rate at a given step <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=t&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"t\" class=\"latex\" \/> is <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Calpha_t+%3D+%5Cfrac%7B%5Calpha%7D%7B%5Csqrt%7Bt%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;alpha_t = &#92;frac{&#92;alpha}{&#92;sqrt{t}}\" class=\"latex\" \/>; finally, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta_%7B1%2Ct%7D+%3D+%5Cbeta_1%5Clambda%5E%7Bt-1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta_{1,t} = &#92;beta_1&#92;lambda^{t-1}\" class=\"latex\" \/> for <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Clambda+%5Cin+%280%2C1%29.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;lambda &#92;in (0,1).\" class=\"latex\" \/><\/p>\n<p>Now we are armed to tackle the guarantee given for all <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=T+%5Cgeq+1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"T &#92;geq 1\" class=\"latex\" \/>:<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=R%28T%29%5Cleq%5Cfrac%7BD%5E2%7D%7B2%5Calpha%281-%5Cbeta_1%29%7D%5Csum%5Climits_%7Bi%3D1%7D%5Ed%5Csqrt%7BT%5Chat%7Bv%7D_%7BT%2Ci%7D%7D%2B%5Cfrac%7B%5Calpha%281%2B%5Cbeta_1%29G_%5Cinfty%7D%7B%281-%5Cbeta_1%29%5Csqrt%7B1-%5Cbeta_2%7D%281-%5Cgamma%29%5E2%7D%5Csum%5Climits_%7Bi%3D1%7D%5Ed%7C%7Cg_%7B1%3AT%2Ci%7D%7C%7C_2%2B%5Csum%5Climits_%7Bi%3D1%7D%5Ed%5Cfrac%7BD_%5Cinfty%5E2G_%5Cinfty%5Csqrt%7B1-%5Cbeta_2%7D%7D%7B2%5Calpha%281-%5Cbeta_1%29%281-%5Clambda%29%5E2%7D.&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" 
alt=\"R(T)&#92;leq&#92;frac{D^2}{2&#92;alpha(1&#92;beta_1)}&#92;sum&#92;limits_{i=1}^d&#92;sqrt{T&#92;hat{v}_{T,i}}+&#92;frac{&#92;alpha(1+&#92;beta_1)G_&#92;infty}{(1-&#92;beta_1)&#92;sqrt{1-&#92;beta_2}(1-&#92;gamma)^2}&#92;sum&#92;limits_{i=1}^d||g_{1:T,i}||_2+&#92;sum&#92;limits_{i=1}^d&#92;frac{D_&#92;infty^2G_&#92;infty&#92;sqrt{1-&#92;beta_2}}{2&#92;alpha(1-&#92;beta_1)(1-&#92;lambda)^2}.\" class=\"latex\" \/><\/p>\n<p>To make analysis of this easier, we&#8217;ll refer to specific terms in this inequality as such:<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=R%28T%29%5Cleq+A+B+%2B+C+D+%2B+E&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"R(T)&#92;leq A B + C D + E\" class=\"latex\" \/><\/p>\n<p>Allowing each summation, and fraction without summation be a separate term.<\/p>\n<p>Term <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"A\" class=\"latex\" \/> is saying that a large maximum distance between points discovered by Adam can allow for a larger regret, but can be tempered by a large learning rate, and a smaller decay on the first moment. This is scaled by <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=B&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"B\" class=\"latex\" \/>. Recall that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7Bv%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{v}\" class=\"latex\" \/> is the bias-corrected second moment estimate, so <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=B&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"B\" class=\"latex\" \/> refers to the amount of variance in the function, scaled by the step number &#8211; more variance allows for more regret, especially in the later iterations of our method. 
<img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"C\" class=\"latex\" \/> has a lot to say.<\/p>\n<p>We&#8217;ll start with the numerator: the learning rate, the decay of the first moment, and the maximum Chebyshev magnitude of our gradients &#8211; all aspects that allow for a larger regret when endowed with large values. In the denominator we have subtractive terms, so to maximize regret, the decay of the second and first moments, and (something like) the ratio between them should all be small. <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"D\" class=\"latex\" \/> scales <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"C\" class=\"latex\" \/> much in the same way <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=B&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"B\" class=\"latex\" \/> scaled <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"A\" class=\"latex\" \/> &#8211; instead of referring to the variance (the second moment), as <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=B&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"B\" class=\"latex\" \/> does, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"D\" class=\"latex\" \/> is looking at the magnitudes of vectors formed by piecing together all previous gradients.<\/p>\n<p>This, for me, is the hardest part to gain an intuition of. Finally, we find ourselves at <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=E&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"E\" class=\"latex\" \/>. 
This piece is tricky because it&#8217;s opaque as to what the index for the summation refers to. Disregarding that, however, we see the maximum (Chebyshev) distance between points discovered (squared) multiplying the maximum (Chebyshev) norm of the gradients, multiplying a term that was in the denominator in <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"C\" class=\"latex\" \/> &#8211; one less the decay rate on the second moment. In the denominator of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=E&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"E\" class=\"latex\" \/> we see the first bit of the denominator of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=A&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"A\" class=\"latex\" \/> &#8211; the learning rate, and one less the decay rate on the first moment, and one less the decay of the decay of the first moment (squared).<\/p>\n<p>Looking back through this, we can start to see how certain terms affect how large the regret of the algorithm can be: the <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%281-%5Cbeta_1%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(1-&#92;beta_1)\" class=\"latex\" \/> term is never found in the numerator, though <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%281%2B%5Cbeta_1%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"(1+&#92;beta_1)\" class=\"latex\" \/> is, suggesting that high values of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cbeta_1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;beta_1\" class=\"latex\" \/> allow for larger values of <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=R%28T%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"R(T)\" class=\"latex\" \/>. 
Inversely, <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"D\" class=\"latex\" \/> and <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=G&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"G\" class=\"latex\" \/> are never found in the denominator, indicating that smaller bounds on the distances and gradients lead to smaller regret values.<\/p>\n<p>Finally, by accepting that<\/p>\n<p style=\"text-align: center\"><img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Csum%5Climits_%7Bi%3D1%7D%5Ed%7C%7Cg_%7B1%3AT%2Ci%7D%7C%7C_2+%5Cleq+dG_%5Cinfty+%5Csqrt%7BT%7D%2C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;sum&#92;limits_{i=1}^d||g_{1:T,i}||_2 &#92;leq dG_&#92;infty &#92;sqrt{T},\" class=\"latex\" \/><\/p>\n<p>we find that <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7BR%28T%29%7D%7BT%7D%3DO%28%5Cfrac%7B1%7D%7B%5Csqrt%7BT%7D%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac{R(T)}{T}=O(&#92;frac{1}{&#92;sqrt{T}})\" class=\"latex\" \/>. This allows us to say that the average regret of Adam converges to 0 in the limit as <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=T%5Crightarrow%5Cinfty&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"T&#92;rightarrow&#92;infty\" class=\"latex\" \/>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post presents a summary of the background and class discussion on the Adam (adaptive moment estimation) algorithm. The primary source for this discussion was the original Adam\u00a0paper. Adam is quite possibly the most popular optimization algorithm being used today in machine learning, particularly in deep learning. 
Background (Ryan Kingery) Adam can be thought of &hellip; <a href=\"https:\/\/wordpress.cs.vt.edu\/optml\/2018\/03\/13\/adam\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Adam<\/span><\/a><\/p>\n","protected":false},"author":155,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[],"class_list":["post-84","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/s9CQAE-adam","_links":{"self":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/84","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/users\/155"}],"replies":[{"embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/comments?post=84"}],"version-history":[{"count":28,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/84\/revisions"}],"predecessor-version":[{"id":155,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/84\/revisions\/155"}],"wp:attachment":[{"href":"https:\/\/wordpress.cs.
vt.edu\/optml\/wp-json\/wp\/v2\/media?parent=84"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/categories?post=84"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/tags?post=84"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}