Adam

This post presents a summary of the background and class discussion on the Adam (adaptive moment estimation) algorithm. The primary source for this discussion was the original Adam paper. Adam is quite possibly the most popular optimization algorithm being used today in machine learning, particularly in deep learning.

Background (Ryan Kingery)

Adam can be thought of as a generalization of stochastic gradient descent (SGD). It essentially combines popular additions to SGD into one algorithm, most notably AdaGrad-style adaptive learning rates, momentum, and RMSProp.

Adam = AdaGrad + momentum + RMSProp

Recall from a previous post that AdaGrad modifies SGD by allowing for variable learning rates that can hopefully take advantage of the curvature of the objective function to allow for faster convergence. With Adam we extend this idea by using moving averages of the gradient's moment estimates to smooth the stochastic descent trajectory. The idea is that smoothing this trajectory should speed up convergence by making Adam behave more like batch gradient descent, but without the need to use the entire dataset at each step.

To discuss momentum and RMSProp we must recall the definition of an exponentially weighted moving average (EWMA), also called exponential smoothing or infinite impulse response filtering. Suppose we have a random sample x_1,\cdots,x_T of time-ordered data. An EWMA generates a new sequence m_1,\cdots,m_T defined by

m_t =  \beta m_{t-1} + (1-\beta)x_t = (1-\beta) \sum_{i=1}^t \beta^{t-i}x_i.

The hyperparameter 0 \leq \beta \leq 1 is called a smoothing parameter. Intuitively, the EWMA is a way of smoothing time-ordered data. When \beta \approx 0, the EWMA sequence depends only on the current value of x_t, which just gives the original non-smoothed sequence back. When \beta \approx 1, the EWMA doesn’t depend at all on the current value x_t, which results in a constant value for all m_t.
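To make this concrete, here is a minimal NumPy sketch of an EWMA; the function name and the synthetic data are just for illustration, not anything from the paper.

import numpy as np

def ewma(x, beta=0.9):
    # Exponentially weighted moving average with m_0 = 0.
    m = np.zeros(len(x))
    prev = 0.0
    for t, x_t in enumerate(x):
        prev = beta * prev + (1 - beta) * x_t
        m[t] = prev
    return m

# Noisy observations of a constant signal; the EWMA filters out much of the noise.
x = 1.0 + 0.1 * np.random.randn(200)
print(ewma(x, beta=0.9)[-5:])   # values hover near 1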

The EWMA is useful when we're given noisy data and would like to filter out some of the noise to allow for better estimation of the underlying process. In practice we thus tend to favor higher values of \beta, usually 0.9 or higher. It is precisely this idea that momentum and RMSProp exploit. Before getting to that, though, a minor point.

It turns out that the EWMA tends to produce biased estimates early in the sequence. Since the recursion starts from m_0 = 0, the EWMA effectively treats the values it hasn't seen as 0, which biases the early m_t towards 0. To fix this problem, one can instead use a bias-corrected EWMA defined by

\hat m_t = \frac{m_t}{1-\beta^t}.

One might naturally ask whether this bias correction term is something to worry about in practice, and the answer, of course, is it depends. The time it takes for the bias to go away goes roughly like \frac{3}{1-\beta}. Thus, the larger \beta is the longer it takes for this bias effect to go away. For \beta=0.9 it takes about 30 iterations, and for \beta=0.999 it takes about 3000 iterations.
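As a rough numerical check of this rule of thumb (the particular values of t below are chosen purely for illustration), the correction factor 1/(1-\beta^t) starts large and decays towards 1 after roughly 3/(1-\beta) iterations:

import numpy as np

for beta in (0.9, 0.999):
    t = np.array([1, 10, int(3 / (1 - beta)), int(10 / (1 - beta))])
    correction = 1.0 / (1.0 - beta ** t)
    print(beta, correction.round(4))
# For beta = 0.9 the factor is ~1.04 by t = 30; for beta = 0.999 it is
# still ~1.05 at t = 3000, in line with the 3/(1-beta) heuristic.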

In the context of machine learning though, it doesn’t really make much of a difference when you’re training for many thousands of iterations anyway. To see why, suppose you have a modest dataset of 10,000 observations. If you train this using a mini-batch size of 32, that gives about 312 iterations per epoch. Training for only 10 epochs then already puts you over 3000 iterations. Thus, unless you’re in a scenario where you can train your data very quickly (e.g. transfer learning), the bias correction probably won’t be important to you. It is thus not uncommon in practice to ignore bias correction when implementing EWMA in machine learning algorithms.

Now, the Adam algorithm uses EWMAs to estimate the first and second moments of the objective function's gradient \nabla f(\theta). Let us then briefly review what the moments of a random variable are.

Recall that the n^{th} moment \mu_n of a random variable X is defined by

\mu_n = E(X^n).

We can see then that the first moment is just the expectation \mu = E(X), while the second moment \mu_2 = E(X^2) is related to the mean and variance by

\mu_2 = \mu^2 + var(X).
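As a quick sanity check of this identity (not part of Adam itself), one can verify it numerically on a sample; note that np.var uses the biased sample variance, so the identity holds exactly for the sample:

import numpy as np

x = np.random.randn(100_000) + 2.0       # sample with mean ~2 and variance ~1
first = x.mean()                          # estimate of E(X)
second = (x ** 2).mean()                  # estimate of E(X^2)
print(np.isclose(second, first ** 2 + x.var()))   # True: E(X^2) = E(X)^2 + var(X)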

For a random vector X \in \mathbb{R}^n, we can extend the definition by defining the expectation component-wise by

(\mu)_i = E(X_i),

and the second moment component-wise by

(\mu_2)_{ij} = E\big((XX^T)_{ij}\big).

Back to Adam again, denote the gradient \nabla f(\theta) by g and the element-wise square of the gradient by g^2. We then take an EWMA of each of these using smoothing parameters \beta_1, \beta_2:

m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t    (momentum)

v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2    (RMSProp)

In practice, it's common to take \beta_1=0.9 and \beta_2=0.999, which from the above discussion you can see favors high smoothing. We can now state the Adam parameter update formally, with learning rate (step size) \alpha:

\theta_t = \theta_{t-1} - \alpha\frac{\hat m_t}{\sqrt{\hat v_t}+\varepsilon}.

Note that the bias-corrected terms are included in the update! This follows the original paper; it is not necessarily what is done in practice, especially for the momentum term m_t. Also note the presence of the \varepsilon. This is included to prevent numerical instability (division by zero) in the denominator, and is typically fixed in advance to a very small number, usually \varepsilon = 10^{-8}. You can think of this as a form of Laplace smoothing if you wish.

The ratio \frac{m_t}{\sqrt{v_t}} is used to provide an estimate of the ratio of moments \frac{E(g)}{\sqrt{E(g^2)}}, which can be thought of as a point estimate of the best direction in which to descend and the best step size to take. Note that these are ratios of vectors, meaning the divisions are done component-wise.
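Putting the pieces together, a minimal sketch of the full update loop might look as follows; the toy objective, hyperparameter values, and step count are placeholders for illustration, not a drop-in implementation.

import numpy as np

def adam(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    # Minimal Adam loop: grad(theta) returns a (possibly stochastic) gradient at theta.
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)   # EWMA of gradients (first moment)
    v = np.zeros_like(theta)   # EWMA of squared gradients (second moment)
    for t in range(1, n_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)               # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy example: minimize f(theta) = ||theta||^2, whose gradient is 2*theta.
print(adam(lambda th: 2 * th, np.array([1.0, -3.0]), alpha=0.1))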

We thus now have a "smoother" version of AdaGrad that is robust to the noise inherent in SGD. Whether this helps in practice is a question that has been answered affirmatively in numerous deep learning applications. Whether it actually helps in theory leads to a discussion of the algorithm's convergence "guarantees".

Convergence (Colin Shea-Blymyer)

The proof of convergence for this algorithm is very long, and I fear that we won’t gain much from going through it step-by-step. Instead, we’ll turn our attention to an analysis of the regret bounds for this algorithm.

To begin, we’ll look at what regret means in this instance. Formally, regret is defined as

R(T) = \sum\limits_{t=1}^T[f_t(\theta_t)-f_t(\theta^*)].

In English, that represents the sum of how far every step of our optimization has been from the optimal point. A high regret means our optimization has not been very efficient.

Next we’ll define g_t as the gradient of our function at the point we’ve reached in our algorithm at step t, and g_{t,i} as the i^{th} element of that gradient. We’ll use slicing on these gradients to define

g_{1:t,i} = [g_{1,i}, g_{2,i}, ..., g_{t,i}].

That is, a vector that contains the i^{th} element of every gradient we've seen so far. We'll also define \gamma = \frac{\beta_1^2}{\sqrt{\beta_2}}, a ratio of the decay rates of the first and second moments, squared and square-rooted, respectively. Further, \lambda represents the exponential decay rate of \beta_{1,t}.
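Concretely, if we stack the gradients seen so far into a T \times d array, g_{1:t,i} is just the i^{th} column of the first t rows (the values below are toy numbers for illustration):

import numpy as np

T, d = 5, 3
G = np.arange(T * d, dtype=float).reshape(T, d)   # row t-1 holds the gradient g_t (toy values)
t, i = 4, 1
g_slice = G[:t, i]                        # g_{1:t,i}: the i-th component of the first t gradients
print(g_slice, np.linalg.norm(g_slice))   # norms like this appear as ||g_{1:T,i}||_2 in the bound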

The next step is to make some assumptions. First, assume the gradients of our function f_t are bounded thus: ||\nabla f_t(\theta)||_2 \leq G, and ||\nabla f_t(\theta)||_\infty \leq G_\infty. Second, we assume a bound on the distance between any two points discovered by Adam:

||\theta_n-\theta_m||_2 \leq D,

||\theta_n-\theta_m||_\infty \leq D_\infty.

Finally, we define some ranges for our parameters: \beta_1, \beta_2 \in [0,1), and \frac{\beta_1^2}{\sqrt{\beta_2}} < 1; the learning rate at a given step t is \alpha_t = \frac{\alpha}{\sqrt{t}}; finally, \beta_{1,t} = \beta_1\lambda^{t-1} for \lambda \in (0,1).
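The decayed schedules assumed by the analysis are easy to write out explicitly; the numerical values below are arbitrary and purely illustrative:

import numpy as np

alpha, beta1, lam = 0.001, 0.9, 0.999
t = np.arange(1, 6)
alpha_t = alpha / np.sqrt(t)          # step size schedule alpha_t = alpha / sqrt(t)
beta1_t = beta1 * lam ** (t - 1)      # decayed first-moment rate beta_{1,t} = beta_1 * lambda^(t-1)
print(alpha_t.round(6))
print(beta1_t.round(6))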

Now we are armed to tackle the guarantee given for all T \geq 1:

R(T)\leq\frac{D^2}{2\alpha(1-\beta_1)}\sum\limits_{i=1}^d\sqrt{T\hat{v}_{T,i}}+\frac{\alpha(1+\beta_1)G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2}\sum\limits_{i=1}^d||g_{1:T,i}||_2+\sum\limits_{i=1}^d\frac{D_\infty^2G_\infty\sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2}.

To make analysis of this easier, we'll refer to specific terms in this inequality as follows:

R(T)\leq A B + C D + E

Here each summation, and each fraction not attached to a summation, is treated as a separate term.

Term A is saying that a large maximum distance between points discovered by Adam can allow for a larger regret, but can be tempered by a large learning rate, and a smaller decay on the first moment. This is scaled by B. Recall that \hat{v} is the bias-corrected second moment estimate, so B refers to the amount of variance in the function, scaled by the step number – more variance allows for more regret, especially in the later iterations of our method. C has a lot to say.

We'll start with the numerator: the learning rate, the decay of the first moment, and the maximum Chebyshev magnitude of our gradients – all aspects that allow for a larger regret when endowed with large values. In the denominator we have the subtractive terms (1-\beta_1), \sqrt{1-\beta_2}, and (1-\gamma)^2, so to maximize regret the decay rates of the first and second moments, and (something like) the ratio between them, \gamma, should all be close to 1, driving the denominator towards 0. D scales C much in the same way B scaled A – instead of referring to the variance (the second moment), as B does, D is looking at the magnitudes of vectors formed by piecing together all previous gradients.

This, for me, is the hardest part to gain an intuition for. Finally, we find ourselves at E. This piece looks tricky at first, but notice that the summand doesn't actually depend on the summation index i, so the sum simply contributes a factor of d, the number of parameters. Beyond that, we see the maximum (Chebyshev) distance between points discovered (squared) multiplying the maximum (Chebyshev) norm of the gradients, multiplying a term that was in the denominator in C – one less the decay rate of the second moment, under a square root. In the denominator of E we see the first bit of the denominator of A – the learning rate and one less the decay rate of the first moment – together with one less the decay of the decay of the first moment (squared).

Looking back through this, we can start to see how certain terms affect how large the regret of the algorithm can be: the (1-\beta_1) term is only ever found in denominators, while (1+\beta_1) appears in a numerator, so values of \beta_1 close to 1 push the bound on R(T) up. Meanwhile, D and G are never found in a denominator, clearly stating that smaller bounds on the distances and gradients lead to smaller regret values.

Finally, by accepting that

\sum\limits_{i=1}^d||g_{1:T,i}||_2 \leq dG_\infty \sqrt{T},

we find that \frac{R(T)}{T}=O(\frac{1}{\sqrt{T}}). This allows us to say that the average regret R(T)/T converges to 0 in the limit as T\rightarrow\infty.
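To sketch why, note that the bounded-gradient assumption also gives \hat{v}_{T,i} \leq G_\infty^2, so each term in the bound grows at most like \sqrt{T}:

R(T) \leq \frac{dD^2G_\infty\sqrt{T}}{2\alpha(1-\beta_1)} + \frac{\alpha(1+\beta_1)dG_\infty^2\sqrt{T}}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2} + \frac{dD_\infty^2G_\infty\sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2} = O\big(dG_\infty\sqrt{T}\big),

and dividing by T gives the stated \frac{R(T)}{T}=O(\frac{1}{\sqrt{T}}).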