Proximal Algorithms

Introduction

Proximal algorithms are a class of algorithms that can be used to solve constrained optimization problems that may involve non-smooth penalties in the objective function. Their formulation is rather general in that a number of other optimization algorithms can be derived as special cases of proximal algorithms. At the core of proximal algorithms lies the evaluation of the proximal operator:

\textnormal{prox}_{\gamma, f} (x)= \textnormal{argmin}_{z} \left\{f(z) + \frac{1}{2\gamma} \left\|x-z\right\|^{2}_{2}\right\}

The proximal operator returns the point in the search space that attains the infimum defining the Moreau envelope of the objective function f [2]:

f^{\gamma}(x) = \inf_{z} \left\{f(z) + \frac{1}{2\gamma} \left\|x-z\right\|^{2}_{2}\right\}

The Moreau envelope is a regularized version of the function f. It is continuously differentiable and defined on all of \mathbb{R}^{n}, even though f may satisfy neither condition, and it has the same set of minimizers as f [1,2].

Example 1. Let f(x) = |x|. The Moreau envelope of f is f^{\gamma}(x); for \gamma = 1 it is the Huber function, equal to x^{2}/2 for |x| \le 1 and |x| - 1/2 otherwise:

[Figure: Moreau envelope of f(x) = |x|]

Notice that the Moreau envelope is smooth even though f itself is not.
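To make the example concrete, here is a small numerical check (my own sketch; the grid and test points are arbitrary choices, not from [1,2]) that the Moreau envelope of f(x) = |x| with \gamma = 1 coincides with the Huber function given above:

```python
import numpy as np

def moreau_envelope_abs(x, gamma=1.0, grid=np.linspace(-5, 5, 20001)):
    # f^gamma(x) = inf_z { |z| + (x - z)^2 / (2*gamma) }, approximated on a dense grid
    return np.min(np.abs(grid) + (x - grid) ** 2 / (2 * gamma))

def huber(x, gamma=1.0):
    # closed form of the Moreau envelope of |x|
    return x ** 2 / (2 * gamma) if abs(x) <= gamma else abs(x) - gamma / 2

for x in [-2.0, -0.3, 0.0, 0.7, 3.0]:
    print(x, moreau_envelope_abs(x), huber(x))  # the two values agree up to grid resolution
```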

Standard Problem Formulation

Consider the minimization of functions of the form

\textnormal{argmin}_{x \in \mathcal{X}}\, F(x) := g(x) + h(x)\qquad\qquad (1)

where g(x) is convex and differentiable and h(x) is convex but non-smooth. Such objective functions arise when a penalty such as the lasso (\ell_1) penalty is added to the original objective g(x) to induce sparsity in the solution.

Gradient descent can be used to minimize the original objective g(x), since it is smooth and convex, and it reaches an \epsilon-suboptimal solution in O(1/\epsilon) iterations. Adding a non-smooth penalty or regularization term h(x), however, may cause gradient descent to fail or converge to a point that is not a minimizer. In such a circumstance, subgradient methods [3] are used to minimize the non-smooth function F(x). The drawback of subgradient methods is that they converge far more slowly than gradient descent: an \epsilon-suboptimal solution is obtained only after O(1/\epsilon^2) iterations.

When the proximal operator of h(x) is known in closed form, proximal algorithms can be used to speed up convergence significantly. Table 1 in [2] lists closed-form proximal operators for an extensive list of non-smooth functions. Convergence to an \epsilon-suboptimal solution is attained after O(1/\epsilon) evaluations of the prox operator. Therefore, if the prox operator is cheap to evaluate, the overall rate matches that of gradient descent on a smooth problem.

Proximal Gradient Descent Method

The proximal gradient descent method can be employed to solve optimization problems of type (1). It is composed of the following two steps:

  1. Gradient step: Starting at x_k, take a step in the direction of the gradient of the differentiable part, g(x):
    x_{k}^{+} = x_{k} - \gamma \nabla g(x_k)
  2. Evaluate prox operator: Starting at x_k^{+}, evaluate the prox operator of the non-smooth part, h(x):
    x_{k+1} = \textnormal{prox}_{\gamma, h}(x_{k}^{+}) = \textnormal{prox}_{\gamma, h}(x_k - \gamma \nabla g(x_k))

In [2], the authors give a glimpse into how the above recipe can be motivated in two different ways — first as an MM-algorithm and second by deriving optimality conditions using subdifferential calculus.

Proximal Gradient Descent method will require O(1/\epsilon) calls to the prox operator in order to obtain an \epsilon-suboptimal solution.
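As an illustration of the two steps above, here is a minimal sketch for the lasso problem, with g(x) = \frac{1}{2}\|Ax-b\|^{2}_{2} and h(x) = \lambda\|x\|_{1}, whose prox operator is soft-thresholding; the random problem instance and the step size \gamma = 1/L are my own choices, not taken from the references:

```python
import numpy as np

def soft_threshold(v, t):
    # prox of t*||.||_1: shrink each coordinate toward zero by t
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient_lasso(A, b, lam, n_iter=500):
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2       # step size 1/L with L = ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                  # gradient step on the smooth part g
        x = soft_threshold(x - gamma * grad, gamma * lam)  # prox step on the non-smooth part h
    return x

rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
print(proximal_gradient_lasso(A, b, lam=1.0))
```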

Proximal Newton Method

In the Proximal Gradient Descent Method, the gradient step comes from minimizing a quadratic model of the differentiable function, built by using the scaled identity, I/\gamma, as an approximation to the Hessian of g(x).

The Proximal Newton Method is obtained by using the actual Hessian, H := \nabla^{2}g(x), to build the quadratic model of g(x). Since g(x) is convex, H \succeq 0. Defining the scaled proximal map as in [4]:

\textnormal{prox}_{H}(x) = \textnormal{argmin}_{z} \left\{\frac{1}{2}\left\|x-z\right\|^{2}_{H} + h(z)\right\}

where \left\|x\right\|^{2}_{H} = x^{T}Hx, the Proximal Newton Method can be written as [4]:

  1. Newton step: Starting at x_k, take a Newton step:
    x_{k}^{+} = x_{k} - H_{k}^{-1} \nabla g(x_k)
  2. Evaluate prox operator: Starting at x_k^{+}, evaluate the scaled prox operator:
    z_{k} = \textnormal{prox}_{H_{k}}(x_{k}^{+}) = \textnormal{prox}_{H_{k}}(x_{k} - H_{k}^{-1} \nabla g(x_k))
  3. Take a step in direction (z_{k} - x_{k}): Starting at x_k, evaluate x_{k+1} as:
    x_{k+1} = x_{k} + t_{k} (z_{k} - x_{k})

It should be noted that the proximal operator in the Proximal Newton Method relies on the computation of the Hessian. Hence, each evaluation, and any line search strategy used to identify an appropriate step size t_k, will prove expensive.

Proximal Gradient Descent Method can be recovered from Proximal Newton Method by replacing the Hessian with H = I/\gamma. Whereas the Proximal Gradient Descent Method converges linearly, Proximal Newton Method shows local quadratic convergence [4].

Nesterov Acceleration

In [6], Nesterov shows that the best rate achievable by first-order methods on smooth convex problems requires O(1/\sqrt{\epsilon}) iterations to reach an \epsilon-suboptimal solution. In [5], he introduces ideas to accelerate convergence on smooth functions and shows that this optimal rate is achievable. The Accelerated Proximal Gradient Descent Method is built upon the foundations of the ideas laid in [6]. The method is structured as follows [4]:

  1. Choose initial point: x_{-1} = x_{0} \in \mathbb{R}^{n}
  2. Compute an intermediate vector x_{k-1}^{+}: Starting at x_{k-2} and x_{k-1}, evaluate:
    x_{k-1}^{+} = x_{k-1} + \frac{k-2}{k+1} \left(x_{k - 1} - x_{k - 2}\right)
  3. Evaluate the prox operator after taking a gradient step using x_{k-1}^{+}: Starting at x_{k-1}^{+}, evaluate x_{k} as:
    x_{k} = \textnormal{prox}_{t_k, h}(x_{k-1}^{+} - t_{k} \nabla g(x_{k-1}^{+}))

The interpretation of

x_{k-1}^{+} = x_{k-1} + \frac{k-2}{k+1} \left(x_{k - 1} - x_{k - 2}\right)

is the same as momentum. If h(x) \equiv 0, then the prox operator is the identity and the accelerated gradient descent method is recovered [4].
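A minimal sketch of the accelerated iteration for the same lasso setup used earlier (again, the problem instance and step size are assumptions for illustration only):

```python
import numpy as np

def accelerated_proximal_gradient(A, b, lam, n_iter=500):
    t = 1.0 / np.linalg.norm(A, 2) ** 2           # step size 1/L
    x_prev = x = np.zeros(A.shape[1])
    for k in range(1, n_iter + 1):
        v = x + (k - 2) / (k + 1) * (x - x_prev)  # momentum / extrapolation step
        g = v - t * A.T @ (A @ v - b)             # gradient step on the smooth part
        x_prev, x = x, np.sign(g) * np.maximum(np.abs(g) - t * lam, 0.0)  # prox of lam*||.||_1
    return x
```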

Summary

Proximal algorithms can be used to solve constrained optimization problems that split into a sum of a convex differentiable part and a convex non-smooth part. If the prox operator is cheap to evaluate, then the convergence rate of gradient descent is recovered in the usual scenario. Several other algorithms can be recast in terms of a proximal method [1,2]. Although closed-form solutions to the prox operator are typically assumed, the authors of [7] study the convergence rates when the prox operator is evaluated inexactly.

References

[1] Parikh, N. and Boyd, S., 2014. Proximal algorithms. Foundations and Trends in Optimization, 1(3), pp.127-239. [paper]

[2] Polson, N.G., Scott, J.G. and Willard, B.T., 2015. Proximal algorithms in statistics and machine learning. Statistical Science, 30(4), pp.559-581. [paper]

[3] Boyd, S., Xiao, L. and Mutapcic, A., 2003. Subgradient methods. Lecture notes of EE392o, Stanford University, Autumn Quarter 2003-2004. [monograph]

[4] Tibshirani, R., 2016. Proximal Newton Method. Slides from 10-725, Carnegie Mellon University, Fall 2016. [presentation]

[5] Nesterov, Y., 1983. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2), pp.372-376. [paper]

[6] Nesterov, Y., 2013. Introductory lectures on convex optimization: A basic course (Vol. 87). Springer Science & Business Media. [book]

[7] Schmidt, M., Roux, N.L. and Bach, F.R., 2011. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in neural information processing systems (pp. 1458-1466). [paper]

Learning to learn by gradient descent by gradient descent

Introduction

Tasks in ML are typically defined as finding a minimizer of an objective function f(\theta) over some domain \theta \in \Theta. The minimization itself is performed using gradient-descent-based methods, where the parameters are updated using gradient information: \theta_{t+1} = \theta_{t} - \alpha_{t} \nabla f(\theta_{t}). Many variants have been proposed over the years that try to improve on descent-based methods by incorporating techniques that exploit the geometry of the problem, adaptive learning rates, or momentum.

The paper approaches optimization from a novel perspective: it uses learned update rules, capable of generalizing to a specific class of problems. The update to the current set of parameters is predicted by a neural network, specifically an RNN-based network: \theta_{t+1} = \theta_{t} + g_{t}.

Framework

Consider an optimizee f(\theta) parameterized by \theta and an optimizer m parameterized by \phi.

The objective function for the optimizer is

L(\phi) = \mathop{\mathbb{E}}_{f}{ \lbrack \sum_{t=1}^{T} w_{t} f(\theta_{t}) \rbrack }

where \theta_{t+1} = \theta_{t} + g_{t}

and g_{t}, h_{t+1} = m(\nabla_{t}, h_{t}, \phi).

The update step g_{t} and the next hidden state h_{t+1} are the outputs of the recurrent neural network architecture parameterized by \phi.

w_{t} \in \mathbb{R}_{\geq 0} are weights associated with the optimizee values over the last t time steps; w_{t} = 1 is used in all the experiments. The optimizer parameters are trained with truncated back-propagation through time. The gradient flow can be seen in the computational graph below. The gradients along the dashed lines are ignored in order to avoid the expensive computation of second derivatives.
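To make the unrolled objective concrete, here is a toy numpy sketch of how L(\phi) is accumulated over T update steps. Everything in it, the quadratic optimizee, the hand-written linear stand-in for the RNN optimizer m, and the horizon T, is a hypothetical illustration rather than the paper's architecture:

```python
import numpy as np

def optimizee(theta):
    return np.sum((theta - 3.0) ** 2)            # toy f(theta) standing in for the task loss

def optimizee_grad(theta):
    return 2.0 * (theta - 3.0)

def m(grad, h, phi):
    # stand-in for the RNN optimizer: a per-coordinate linear map with a decaying hidden state
    h_next = 0.9 * h + grad
    g = -phi[0] * grad - phi[1] * h_next         # predicted update g_t
    return g, h_next

def unrolled_loss(phi, theta0, T=20, w=1.0):
    theta, h, L = theta0.copy(), np.zeros_like(theta0), 0.0
    for _ in range(T):
        g, h = m(optimizee_grad(theta), h, phi)
        theta = theta + g                        # theta_{t+1} = theta_t + g_t
        L += w * optimizee(theta)                # accumulate w_t * f(theta_t)
    return L

print(unrolled_loss(np.array([0.05, 0.01]), np.zeros(5)))
```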

 

Coordinatewise LSTM architecture

Applying a separate LSTM to each parameter would have been computationally very expensive and would have introduced tens of thousands of additional parameters to optimize.
To avoid this issue, the paper uses a coordinatewise network architecture (LSTM-based) in which the optimizer parameters are shared across the different parameters of the optimizee, while a separate hidden state is maintained for each optimizee parameter. The base LSTM architecture is a two-layer LSTM using the standard forget-gate architecture.

 

Preprocessing

One key difficulty when passing the gradients of the optimizee parameters to the optimizer network is handling their very different scales. Typical deep networks work well only when their inputs are not arbitrarily scaled.

Instead of passing the gradient \nabla directly, the pair (\log |\nabla|, \textnormal{sgn}(\nabla)), i.e., the logarithm of the gradient magnitude and the sign of the gradient, is passed to the coordinatewise architecture.
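A small sketch of this preprocessing (the threshold parameter p and the tiny constant added inside the log are assumptions; the paper uses a similar clipped scheme to keep the inputs well scaled near zero):

```python
import numpy as np

def preprocess_gradient(grad, p=10.0):
    # pass (log|grad|/p, sign(grad)) for large gradients; fall back to a scaled value near zero
    large = np.abs(grad) >= np.exp(-p)
    log_part = np.where(large, np.log(np.abs(grad) + 1e-18) / p, -1.0)
    sign_part = np.where(large, np.sign(grad), np.exp(p) * grad)
    return np.stack([log_part, sign_part], axis=-1)

print(preprocess_gradient(np.array([3e-1, -2.0, 1e-12])))
```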

Experiment

The authors used 2-layer LSTMs with 20 hidden units in each layer for the optimizer. Each optimizer is trained by minimizing the loss equation using truncated BPTT, with Adam and a learning rate chosen by random search. The authors compare the trained optimizers with standard optimizers such as SGD, RMSProp, ADAM and NAG. The learning rate for each of these optimizers is tuned, while the other parameters are set to the default values in Torch7.

The authors first evaluated a class of 10-dimensional synthetic quadratic functions of the form f(\theta) = \|W\theta - y\|^{2}_{2},

where W is a 10x10 matrix and y is a 10-dimensional vector, both drawn from an IID Gaussian distribution. As shown in the figure below, the LSTM-based optimizer performs much better than the standard optimizers.

The authors then trained the optimizer on a neural network with 1 hidden layer of 20 hidden units and sigmoid activations on the MNIST training set. They show experimentally that this optimizer generalizes much better than standard optimizers to different architectures with more layers (2) and more hidden units (40), as shown below.

However, when the architecture is changed drastically, for example by using ReLU activations instead of sigmoid, the LSTM optimizer does not generalize well, as shown below.

The authors also used the LSTM optimizer for neural art, generating multiple image styles using 1 style image and 1800 content images. They found that the LSTM optimizer outperforms standard optimizers on test content images at the training resolution, as well as at twice the training resolution, as shown below.

Conclusion

The authors have shown how to cast the design of optimization algorithms as a learning problem. Their experiments show that learned neural optimizers perform better than state-of-the-art optimizers for deep learning on the classes of problems they were trained for.

References

[1] Learning to Learn by Gradient Descent by Gradient Descent (https://arxiv.org/abs/1606.04474)

A Bayesian Perspective On Generalization And Stochastic Gradient Descent. Part 2

Section 4. Bayes Theorem and Stochastic Gradient Descent

Section 4 develops the idea that Bayesian principles account for the generalization gap:

  1. The test set accuracy often falls as the SGD batch size is increased (holding all other hyper-parameters constant).
  2. The gradient drives SGD towards deep minima, while noise drives SGD towards broad minima.
  3. The test set performance is therefore expected to show a peak at an optimal batch size, which balances these competing contributions to the evidence.

To evaluate the idea that there is a generalization gap, they used a shallow neural network with 800 hidden units and ReLU hidden activations, trained on MNIST without regularization. They use SGD with a momentum parameter of 0.9 and a constant learning rate of 1.0, which does not depend on the batch size or decay during training. The model is trained on just 1000 images, selected at random from the MNIST training set; this enables a comparison between small-batch and full-batch training.

In figure 3, they exhibit the evolution of the test accuracy and test cross-entropy during training. The small batches are composed of 30 images, randomly sampled from the training set. Looking first at figure 3a, small-batch training takes longer to converge, but after a thousand gradient updates a clear generalization gap in model accuracy emerges between small and large training batches.

In figure 4a, they exhibit training curves for a range of batch sizes between 1 and 1000. They find that the model cannot train when the batch size B is less than 10. In figure 4b they plot the mean test set accuracy after 10000 training steps. A clear peak emerges, indicating that there is indeed an optimum batch size which maximizes the test accuracy, consistent with the Bayesian intuition.

Section 5. Stochastic Differential Equations and the Scaling Rules

Based on the above evaluation, they conclude that:

  1. The test accuracy peaks at an optimal batch size, if one holds the other SGD hyper-parameters constant.
  2. The authors argued that this peak arises from the tradeoff between depth and breadth in the Bayesian evidence.
  3. However it is not the batch size itself which controls this tradeoff, but the underlying scale of random fluctuations in the SGD dynamics.
  4. They now identify this SGD “noise scale”, and use it to derive three scaling rules which predict how the optimal batch size depends on the learning rate, training set size, and momentum coefficient.

To relate the noise scale to the other four hyper-parameters, batch size, learning rate, training set size and momentum coefficient, the derivation proceeds in three steps:

1. Central limit theorem:

2. Model the gradient error with Gaussian random noise:

3. Stochastic differential equation:

Based on these above three rules, we get:

The noise scale falls when the batch size increases, consistent with the earlier observation of an optimal batch size B_opt while holding the other hyper-parameters fixed. Notice that one would equivalently observe an optimal learning rate if one held the batch size constant. When we vary the learning rate or the training set size, we should keep the noise scale fixed, which implies:

This scaling rule allows us to increase the learning rate with no loss in test accuracy and no increase in computational cost, simply by simultaneously increasing the batch size. We can then exploit increased parallelism across multiple GPUs, reducing model training times.
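The noise scale behind these rules is, as far as it can be reconstructed here, approximately g = \epsilon N / B (with an extra 1/(1-m) factor when momentum m is used); treat the exact formula as an assumption. The snippet below simply shows that scaling the learning rate and the batch size together leaves this noise scale, and hence the expected test accuracy peak, unchanged:

```python
def noise_scale(eps, N, B, momentum=0.0):
    # approximate SGD noise scale: larger g means noisier, broader-minima-seeking dynamics
    return eps * N / (B * (1.0 - momentum))

# doubling the learning rate and the batch size together keeps the noise scale fixed
print(noise_scale(eps=1.0, N=1000, B=30), noise_scale(eps=2.0, N=1000, B=60))
```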

In figure 5a, the authors plot the test accuracy as a function of batch size after (10000/\epsilon) training steps, for a range of learning rates \epsilon. Exactly as predicted, the peak moves to the right as \epsilon increases. In figure 5b, they plot the best-observed batch size as a function of learning rate, observing a clear linear trend. The error bars indicate the distance from the best-observed batch size to the next batch size sampled in their experiments.

In figure 6a they exhibit the test set accuracy as a function of batch size, for a range of training set sizes after 10000 steps. Once again, the peak shifts right as the training set size rises, although the generalization gap becomes less pronounced as the training set size increases.

In figure 7a, they plot the test set performance as a function of batch size after 10000 gradient updates, for a range of momentum coefficients. In figure 7b, the authors plot the best-observed batch size as a function of the momentum coefficient and fit their results to the scaling rule above, obtaining remarkably good agreement. They propose a simple heuristic for tuning the batch size, learning rate and momentum coefficient.

A Bayesian Perspective On Generalization And Stochastic Gradient Descent. Part 1

1. Background

Zhang et al. (2016) observed a very "surprising" result. They trained deep networks on the same input images but with randomized labels, and found that the networks still memorized the training labels perfectly even though they could no longer predict the test set. This raises a question: why do such models generalize well on real labels when they are expressive enough to fit random labels?

Some useful results:

  • Holding the learning rate fixed and increasing the batch size usually causes the test accuracy to fall.
  • a linear scaling rule between batch size and learning rate
  • a square root rule on theoretical grounds

2. Empirical principle

  • "Broad minima" generalize better than "sharp minima": the curvature of a broad minimum is small.
  • stochastic gradient descent (SGD) finds wider minima as the batch size is reduced.
  • the curvature of a minimum can be arbitrarily increased by changing the model parameterization

In this paper, they consider the following two questions from a Bayesian perspective

a) how can we predict if a minimum will generalize to the test set?

b) why does stochastic gradient descent find minima that generalize well?

3. Bayesian Model Comparison

We consider a classification model with a single parameter ω

The first equation is the formula for the posterior distribution; the second equation is the likelihood, and a Gaussian is chosen as the prior.

Therefore, we can write the posterior distribution as below:

Given a new input x_t, to predict an unknown label y_t we need to compute these integrals; however, they are dominated by the region near w_0, so the integral can be approximated as above.

Probability ratio:

The second factor on the right is the prior ratio, which describes which model is more plausible a priori. To avoid unnecessary subjectivity, we usually set this to 1. Meanwhile the first factor on the right is the evidence ratio, which controls how much the training data changes our prior beliefs. To calculate the evidence P(y|x;M), we need to integrate over w. Since this integral is dominated by the region near the minimum w_0, we can estimate the evidence by Taylor expanding around w_0 and applying the Laplace approximation. Finally, we get the evidence for a model with a single parameter.

Then we can generalize this to a model with many parameters, as below:

“Occam factor” enforces Occam’s razor: when two models describe the data equally well, the simpler model is usually better.

Next, we need to compare model M with the null model, in which all labels are assigned randomly with equal probability. We can get the evidence ratio:

E(w_0) is the log evidence ratio in favor of the null model.

From these formulas, we can get:

  1. Broad minima generalize better than sharp minima, but the comparison should not depend on the parameterization.
  2. One needs to rescale \lambda_i and \lambda together to keep the model the same under reparameterization.
  3. The evidence is hard to evaluate for deep networks, since they have millions of parameters.

4. Bayes Theorem and Generalization

  • Consider a simple example: logistic regression with labels 0/1.

Task a: the labels of both the training and test sets are randomized.

Task b: the labels are informative, matching the true label

  • replicating the observations of Zhang et al. (2016) in deep networks

Experiment results:

Figure 1 shows that the model perfectly memorizes the random labels.

Figure 2(a) shows that the model trained on random labels can never be expected to generalize well; 2(b) shows that the model trained on informative labels is much better than the null model according to the Bayesian evidence and the cross-entropy.

Reference:

  1. A Bayesian Perspective on Generalization and Stochastic Gradient Descent
  2. related material

Convex-Concave Procedure

Section 1 of the paper "Variations and extension of the convex–concave procedure" provides an overview of the convex-concave problem. The convex-concave procedure (CCP) is a heuristic method used to find local optima of difference-of-convex (DC) programming problems.

Difference of convex programming

Consider DC programming problems as the form below

minimize f_{0}(x) - g_{0}(x)

subject to f_{i}(x) - g_{i}(x) \le 0, \quad i = 1, \dots, m,

where x \in \mathbb{R}^{n} is the optimization variable and f_{i}: \mathbb{R}^{n} \rightarrow \mathbb{R} and g_{i}: \mathbb{R}^{n} \rightarrow \mathbb{R} for i = 0, \dots, m are convex. The DC program is itself a convex problem when all of the g_{i} are affine.

What if the objective and constraint functions are not convex? For problems such as the Boolean LP discussed in this section and the traveling salesman problem, no polynomial-time algorithm is known and they are hard to solve exactly. The convex-concave procedure (CCP) is therefore presented as a way of finding a local optimum of such DC problems.

For ease of notation, assume that all of the f_{i} and g_{i} are differentiable. The basic CCP algorithm has the form of a descent algorithm with a convexification step. The procedure starts by choosing an initial feasible point, randomly or through a heuristic if one is known, and then iterates until the stopping criterion is satisfied. In each iteration every g_{i} is replaced with its first-order Taylor approximation, which makes the subproblem convex and efficient to solve. The key step is the convexification, shown below:

  • Convexify: \hat{g}_{i}(x; x_{k}) \equiv g_{i}(x_{k}) + \nabla g_{i}(x_{k})^{T}(x - x_{k}) \quad for \quad i = 0, \dots, m
  • Solve: Set x_{k+1} to the solution of the convexified problem, with each g_{i} replaced by \hat{g}_{i}
  • Update the iteration counter: k = k + 1

In the paper, one reasonable stopping criterion is proposed as (f_{0}(x_{k}) - g_{0}(x_{k})) - (f_{0}(x_{k+1}) - g_{0}(x_{k+1})) \le \delta, where \delta is a threshold. As we discussed in class, this stopping criterion can be further improved.
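A minimal unconstrained sketch of the procedure (the test functions f_{0}(x) = x^4 and g_{0}(x) = 2x^2, the starting point, and the threshold are my own choices): each iteration convexifies g_{0} by its first-order Taylor approximation at x_k and minimizes the resulting convex surrogate.

```python
from scipy.optimize import minimize_scalar

f0 = lambda x: x ** 4        # convex part
g0 = lambda x: 2 * x ** 2    # convex part that enters with a minus sign
dg0 = lambda x: 4 * x        # gradient of g0

x_k, delta = 3.0, 1e-8
while True:
    # convexify: replace g0 with its first-order Taylor approximation at x_k
    surrogate = lambda x, xk=x_k: f0(x) - (g0(xk) + dg0(xk) * (x - xk))
    x_next = minimize_scalar(surrogate).x          # solve the convex subproblem
    decrease = (f0(x_k) - g0(x_k)) - (f0(x_next) - g0(x_next))
    x_k = x_next
    if decrease <= delta:                          # stopping criterion from the text
        break

print(x_k)  # converges to a local minimum of x**4 - 2*x**2 (x = 1 from this starting point)
```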

The paper proves that each iteration of the algorithm produces a feasible point and that the convex-concave procedure is indeed a descent algorithm; the objective values therefore converge provided the problem is bounded below (they cannot decrease to negative infinity). Section 1 thus gives a general recipe for converting a problem with non-convex constraints into a sequence of problems that can be solved using convex optimization: minimize a convexified objective, using linear approximations of the concave parts of the constraints.

The authors then go on to show the advantages of CCP: it retains more information in each of the iterates, and it works globally without requiring trust regions, which limit progress at an iteration to a region where the approximation is valid. The argument is sketched below.

convexification at { x }_{ k } : replace f(x)-g(x) with

\hat{f}(x) = f(x) - g(x_{k}) - \nabla g(x_{k})^{T}(x - x_{k})

Since \hat{f}(x) \ge f(x) - g(x) for all x, the convexified objective is a global upper bound on the true objective, so no trust region is needed.

  • true objective at \tilde { x } is better than convexified objective
  • true feasible set contains feasible set for convexified problem

AdaGrad

AdaGrad is an optimization method based on the proximal-point view of gradient descent, specifically on the stochastic version of gradient descent (SGD). The motivation behind AdaGrad is that SGD converges slowly when some features of our model are significantly more important than others: the original SGD algorithm treats all coordinates equally, regardless of how much each factor contributes to the model's outputs.

Seen below, the stochastic gradient descent algorithm takes a large amount of time to find the optimal solution because it constantly jumps back and forth around the expected gradient direction (left). On the other hand, if we average all of the iterates in SGD (not AdaGrad), we obtain much quicker optimization and decrease the amount of backtracking done.

Now consider a case where the problem domain is extremely wide, most likely because one feature has an exaggerated effect on the objective. Seen below, the standard proximal gradient method (SGD) ends up jumping back and forth because the problem is poorly conditioned. On the other hand, the AdaGrad variant (which partially fixes the poor conditioning of the problem) reaches the optimum in a significantly smaller number of steps.

The original SGD algorithm’s update method defines the update rule as:

x^{(k+1)} = \text{argmin}_{x \in \chi} \{ \langle g^{(k)}, x \rangle + \frac{1}{2 \alpha_k} \lvert \lvert x - x^{(k)} \rvert \rvert_2^2 \}

This version of the algorithm has a general convergence rate of O(1/\sqrt{T})

AdaGrad takes advantage of a matrix norm; in our case, B will be a positive definite matrix such that:

\lvert \lvert x \rvert \rvert^2_B := x^T Bx \geq 0

This matrix B (which AdaGrad will take to be diagonal) modifies the update rule of SGD such that:

x^{(k+1)} = \text{argmin}_{x \in \chi} \{ \langle \nabla f(x^{(k)}), x \rangle + \frac{1}{2 \alpha_k} \lvert \lvert x - x^{(k)} \rvert \rvert_B^2 \}

which is equivalent to an update rule of:

x^{(k+1)} = x^{(k)} - \alpha B^{-1} \nabla f(x^{(k)})

Compared to the original update rule, we have introduced the matrix B in order to converge more quickly when some features are far more influential than others. The question now is how best to approximate B, since it is unknown to the user.

AdaGrad now extends the SGD algorithm by defining within iteration k:

G^{(k)} = \text{diag} \lbrack \sum_{i=1}^{k} g^{(i)} (g^{(i)})^T \rbrack^{1/2}

We can define G^{(k)} as a diagonal matrix with entries:

G^{(k)}_{jj} = \sqrt{\sum_{i=1}^{k} (g_j^{(i)})^2}

This modifies the previous update rule into the final update rule for AdaGrad:

x^{(k+1)} = \text{argmin}_{x \in \chi} \{ \langle \nabla f(x^{(k)}), x \rangle + \frac{1}{2 \alpha_k} \lvert \lvert x - x^{(k)} \rvert \rvert_{G^{(k)}}^2 \}

which is equivalent to an update rule of:

x^{(k+1)} = x^{(k)} - \alpha G^{-1} \nabla f(x^{(k)})
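For concreteness, here is a compact sketch of the diagonal update above in the unconstrained case, where the argmin reduces to the explicit update; the quadratic test problem and the small constant added for numerical stability are assumptions:

```python
import numpy as np

def adagrad(grad_fn, x0, alpha=0.5, n_iter=200, eps=1e-12):
    x, sum_sq = x0.astype(float), np.zeros_like(x0, dtype=float)
    for _ in range(n_iter):
        g = grad_fn(x)
        sum_sq += g ** 2                           # running per-coordinate sum of squared gradients
        x -= alpha * g / (np.sqrt(sum_sq) + eps)   # divide by the diagonal entries of G^(k)
    return x

# poorly conditioned quadratic: the first coordinate has far larger curvature than the second
grad = lambda x: np.array([100.0 * x[0], 1.0 * x[1]])
print(adagrad(grad, np.array([1.0, 1.0])))         # both coordinates approach the optimum at 0
```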

Proof of Convergence:

To prove that this new update rule converges to the optimal solution just as SGD does, we must look at the logic behind choosing the diagonal matrix as our value of the matrix B.

To go about proving this we must have a couple of initial assumptions, definitions, and observations:

Assumption 1

\lvert \lvert x - x^{*} \rvert \rvert_{\infty} \leq R_{\infty} for all x \in \chi

Definition 1

(bound on the second moments of the jth partial derivative of the function F)

\sigma_j^2 \geq E\lbrack (\nabla_j F(x;a_i))^2 \rbrack for all x \in \chi

Observation 1

C \subset R^n

\pi_C^A (y) := \text{argmin}_{x \in C} \lvert \lvert x-y \rvert \rvert^2_A

for any x \in C then:

\lvert \lvert \pi_C^A (y) - x \rvert \rvert^2_A \leq \lvert \lvert y - x \rvert \rvert^2_A

Observation 2

f is convex and differentiable such that:

f(y) \geq f(x) + \langle \nabla f(x), y-x \rangle

We utilize a theorem defined in the original paper (and proved later in it), which states that given a step size \alpha = R_\infty and the averaged iterate \hat{x}_K = \frac{1}{K} \sum_{k=1}^{K} x_k, the following holds:

Theorem 3

E \lbrack f(\hat{x}_K) \rbrack - f(x^*) \leq \frac{2 R_\infty}{K} \sum_{j=1}^{n} E \lbrack ( \sum_{k=1}^{K} g_{k,j}^2 )^{1/2} \rbrack \leq 2 \frac{R_\infty \sum_{j=1}^{n} \sigma_j}{\sqrt{K}}

The above inequality is what we want to prove in order to show that the AdaGrad algorithm converges to an optimal solution given enough steps K.

Proof

Now that we’ve gone about and defined all of our required values, we can go about proving the convergence of the AdaGrad algorithm.

We first look at our update step of:

x_{k+1} = \text{argmin}_{x \in \chi} \{ \langle g_k, x \rangle + \frac{1}{2\alpha} \lvert \lvert x - x_k \rvert \rvert^2_{G_k} \}

which when projected into our problem space:

\tilde{x}_{k+1} = x_k - \alpha G_k^{-1} g_k

such that our new update step can be:

x_{k+1} = \text{argmin}_{x \in \chi} \{ \lvert \lvert \tilde{x}_{k+1} - x \rvert \rvert^2_{G_k} \}

The potential function used in the standard SGD analysis is:

\frac{1}{2} \lvert \lvert x_k - x^* \rvert \rvert^2_2

and the corresponding AdaGrad potential is:

\frac{1}{2} \lvert \lvert x_{k+1} - x^* \rvert \rvert^2_{G_k}

Here, we’ll modify our AdaGrad function utilizing the previous definitions and the Potential function:

\frac{1}{2} \lvert \lvert x_{k+1} - x^* \rvert \rvert^2_{G_k} \leq \frac{1}{2} \lvert \lvert x_k - \alpha G_k^{-1} g_k - x^* \rvert \rvert^2_{G_k}

= \frac{1}{2} \lvert \lvert \tilde{x}_{k+1} - x^* \rvert \rvert^2_{G_k}

= \frac{1}{2} \lvert \lvert x_k - x^* \rvert \rvert^2_{G_k} + \alpha \langle g_k, x^* - x_k \rangle + \frac{\alpha^2}{2} \lvert \lvert G_k^{-1} g_k \rvert \rvert^2_{G_k}

= \frac{1}{2} \lvert \lvert x_k - x^* \rvert \rvert^2_{G_k} + \alpha \langle g_k, x^* - x_k \rangle + \frac{\alpha^2}{2} \lvert \lvert g_k \rvert \rvert^2_{G_k^{-1}}

By convexity, the middle term of the final expression satisfies:

\langle g_k, x^* - x_k \rangle \leq f_k(x^*) - f_k(x_k)

Substituting this bound into the inequality above and rearranging yields:

f_k(x_k) - f_k(x^*) \leq \frac{1}{2 \alpha} \lbrack \lvert \lvert x_k - x^* \rvert \rvert^2_{G_k} - \lvert \lvert x_k - \alpha G_k^{-1} g_k - x^* \rvert \rvert^2_{G_k} \rbrack + \frac{\alpha}{2} \lvert \lvert g_k \rvert \rvert^2_{G_k^{-1}}

Summing this inequality over k = 1, \dots, K gives:

\sum_{k=1}^{K} \lbrack f_k(x_k) - f_k(x^*) \rbrack \leq \frac{1}{2 \alpha} \sum_{k=1}^{K} \lbrack \lvert \lvert x_k - x^* \rvert \rvert^2_{G_k} - \lvert \lvert x_{k+1} - x^* \rvert \rvert^2_{G_k} \rbrack + \frac{\alpha}{2} \sum_{k=1}^{K} \lvert \lvert g_k \rvert \rvert^2_{G_k^{-1}}

Note that G_{k+1} \succeq G_k for all k, since G_k := \text{diag} (\sum_{i=1}^{k} g_i g_i^T)^{1/2}. With this definition, the following lemma holds:

\sum_{k=1}^{K} \langle g_k, G_k^{-1} g_k \rangle \leq 2 \sum_{j=1}^{n} \sqrt{\sum_{k=1}^{K} g_{k,j}^2}

Since we’ve expanded out our summation of elements above, we need to reduce our first term on the right side of the $\leq$ sign:

\sum_{k=1}^{K} \lbrack \lvert \lvert x_k - x^* \rvert \rvert^2_{G_k} - \lvert \lvert x_{k+1} - x^* \rvert \rvert^2_{G_k} \rbrack = \lbrack \lvert \lvert x_1 - x^* \rvert \rvert^2_{G_1} - \lvert \lvert x_{K+1} - x^* \rvert \rvert^2_{G_K} \rbrack + \sum_{k=2}^{K} \lbrack \lvert \lvert x_k - x^* \rvert \rvert^2_{G_k} - \lvert \lvert x_{k} - x^* \rvert \rvert^2_{G_{k-1}} \rbrack

The final term above follows from the definition of the norm, with the difference G_k - G_{k-1} appearing once the summation terms are regrouped.

Using the fact that G_k is diagonal with non-negative entries, we have the bound:

\lvert \lvert y \rvert \rvert^2_{G_k} \leq \lvert \lvert \text{diag} (G_k) \rvert \rvert_1 \lvert \lvert y \rvert \rvert^2_\infty

We use this definition in the final term of the previous equation that was expanded out in order to see that:

\sum_{k=2}^{K} \lvert \lvert \text{diag}( G_k - G_{k-1} )\rvert \rvert_{1} \lvert \lvert x_{k} - x^* \rvert \rvert^2_\infty \leq \sum_{k=2}^{K} \lvert \lvert \text{diag}( G_k - G_{k-1} )\rvert \rvert_{1} R^2_\infty

For better readability, note that for a diagonal matrix X with non-negative entries,

\text{tr} (X) = \lvert \lvert \text{diag} (X) \rvert \rvert_1, so the sum of the terms \lvert \lvert \text{diag}(G_k - G_{k-1}) \rvert \rvert_1 telescopes.

Using Assumption 1 and telescoping, we can now see that:

\sum_{k=1}^{K} \lbrack \lvert \lvert x_k - x^* \rvert \rvert^2_{G_k} - \lvert \lvert x_{k+1} - x^* \rvert \rvert^2_{G_k} \rbrack \leq R_\infty^2 \text{tr} (G_K) = R_\infty^2 \sum_{j=1}^{n} \sqrt{\sum_{k=1}^{K} g_{k,j}^2}

Going back to the lemma and combining the two bounds, we can reduce this to:

\sum_{k=1}^{K} \lbrack f_k(x_k) - f_k(x^*) \rbrack \leq \lbrack \frac{1}{2 \alpha} R_\infty^2 + \alpha \rbrack \sum_{j=1}^{n} \sqrt{\sum_{k=1}^{K} g_{k,j}^2}

Given the fact that E \lbrack f_k(x_k) \mid x_k \rbrack = f(x_k), we can finally arrive at the convergence bound we were looking for:

E \lbrack f(\hat{x}_K) \rbrack - f(x^*) \leq \frac{1}{K} \lbrack \frac{1}{2 \alpha} R_\infty^2 + \alpha \rbrack E \lbrack \sum_{j=1}^{n} \sqrt{\sum_{k=1}^{K} g_{k,j}^2} \rbrack

This proves that our given new update rule for AdaGrad will produce an optimal solution at (approximately) the same rate that SGD does.

Understanding Bayesian Optimization

The Prior and its Possibilities

I discussed Bayesian optimization, which is a way of optimizing a function that does not have a formula but can be evaluated. Bayesian optimization uses Bayesian inference and thus has prior, likelihood, and posterior distributions. Bayesian optimization can be used to optimize hyperparameters in machine learning: given a data set to learn on, the hyperparameters are the input to a function, and the output of the function is some assessment of the machine learning model's performance on the data, such as F1-score, precision and recall, accuracy, and so on.

The following links were used in my understanding of Bayesian optimization:

PRACTICAL BAYESIAN OPTIMIZATION OF MACHINE LEARNING ALGORITHMS
Gaussian Processes for Dummies (blog post)
A Tutorial on Bayesian Optimization for Machine Learning (slides)
Introduction to Bayesian Optimization (slides)
Tutorial: Gaussian process models for machine learning (slides)

The prior distribution in Bayesian optimization is a Gaussian process. This terminology was confusing to me at first since I thought that Bayesian optimization was basically synonymous with Gaussian processes, but I now think it is the prior distribution that is called a Gaussian process. Gaussian processes are also used as a machine learning method in their own right.

What exactly the posterior and the prior are doing was confusing to me as well as some people in the class.  At first I thought that the prior was a distribution on functions.  For example, if you had a polynomial function with parameterized coefficients, theta, then I thought that the prior and the posterior were a distribution on these coefficients theta, but this is not correct.  Thus, initially, I thought that the prior must specify a family of functions.

The prior distribution and the posterior distribution are distributions on function values, without regard to the type or family of function. Hence, each possible function value for a given input has a probability assigned by the distribution. A common choice for a prior distribution on the function values is a normal distribution centered at 0 with some variance. This means that the function that maximizes the probability under this prior is y = 0. The posterior distribution gives a probability distribution on possible function values after function evaluations are performed and incorporated with the likelihood function, which is jointly normal over the observed function evaluations.

There is an animated GIF on page 28 of some slides I found that gives a nice visualization of this phenomenon: Introduction to Bayesian Optimization (slides).

Since the normal prior centered at 0 is usually chosen, I wonder how much the choice of prior changes things.  The prior in a machine learning context could be the average results that given hyperparameters produce across many machine learning projects.  There are statistics to calculate the relative effects of the prior and the likelihood on the posterior distribution.  It is often the case that the prior has little effect while the likelihood has a greater effect especially with more data; thus, optimizing the prior distribution could be a lot of effort for little gain.

 

The Meaning and Selection of the Kernel

In Bayesian optimization, there is a function to determine the covariance between any two function evaluation points.  This function determines the covariance matrix used in the multivariate Gaussian matrix for the likelihood function.  The covariance function is called a kernel, and there are many kernels used in Gaussian processes.

The paper that we discussed in class, Practical Bayesian Optimization of Machine Learning Algorithms, discussed two kernels, the squared exponential kernel and the Matérn 5/2 kernel. The authors argue that the squared exponential kernel is too smooth for hyperparameter tuning in machine learning while the Matérn kernel is more appropriate. The paper shows the Matérn kernel performing better in some empirical experiments.

How was this kernel really chosen?  Did the authors simply try several kernels in their experiments and pick the one that performed best?  This doesn't really give a provable way to choose a good kernel.  We wondered whether there was some mathematically formal way to prove that one kernel is better than another, but we thought that this was probably not the case.  Maybe it could be proven under certain conditions such as convexity or smoothness assumptions.  Intuitively, similar choices of hyperparameters will probably produce similar results, so kernels that assume a somewhat smooth function are probably appropriate.

One of the drawbacks of Bayesian optimization in machine learning hyperparameter tuning is that Bayesian optimization has its own hyperparameters that could be tuned, so this just pushes back the problem of hyperparameter tuning.  How are the hyperparameters of Bayesian optimization meant to be tuned?  The choice of kernel, acquisition function, prior distribution, and stopping criteria are examples of hyperparameters for Bayesian optimization.

The Meaning of the Acquisition Function

In Bayesian optimization, an acquisition function is used to choose the next point for function evaluation.  The paper that we discussed in class mentions three acquisition functions: probability of improvement, expected improvement, and upper confidence bound.

The definition of these functions is very abstract and mathematical, and I had some difficulty interpreting what the functions did.  This makes reasoning about them intuitively difficult.  The paper focuses on the expected improvement function, and I found an alternative formula for it on slide 35 of the presentation Introduction to Bayesian Optimization.  This formula is the following.

\int \max(0, y_{best} - y)\, p(y \mid x; \theta, D)\, dy

From this formula, it is clearer that the expected improvement criterion selects the point that maximizes the expected amount by which the function evaluation falls below the best y value found so far (whereas the probability of improvement only maximizes the chance of falling below it). Perhaps the other acquisition functions can be written in a similar manner, giving a better understanding of what they do.
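For a Gaussian posterior at a candidate point with mean mu and standard deviation sigma, the integral above has a standard closed form, shown in this small sketch (the numbers in the example call are made up):

```python
from math import erf, exp, pi, sqrt

def expected_improvement(mu, sigma, y_best):
    # EI = E[max(0, y_best - y)] for y ~ N(mu, sigma^2), i.e. improvement when minimizing
    if sigma <= 0.0:
        return max(0.0, y_best - mu)
    z = (y_best - mu) / sigma
    cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))      # standard normal CDF at z
    pdf = exp(-0.5 * z * z) / sqrt(2.0 * pi)    # standard normal PDF at z
    return (y_best - mu) * cdf + sigma * pdf

print(expected_improvement(mu=0.8, sigma=0.3, y_best=1.0))
```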

The paper did not present a mathematically formal way of choosing the best acquisition function.  Maybe under certain conditions, a given acquisition function can be shown to be optimal.

Saddle-Points in Non-Convex Optimization

Identifying the Saddle Point Problem in High-dimensional Non-convex Optimization (Mukund)

Non-Convex Optimization

Non-convex optimization problems are those that do not fall into an easy class such as convex, concave, or linear programs. These problems are difficult to solve because non-convex functions have potentially many local minima, and finding the global minimum among all the local minima is hard. For this reason, solving such optimization problems is in general at least NP-hard. Other features that make non-convex problems hard to optimize are:

  1. the presence of saddle points
  2. very flat regions exhibited in the curvature
  3. the curvature of such functions can be widely varying.

Traditionally, methods like Gradient descent, Newton Methods which are used to solve convex optimization problems are used to solve non-convex optimization problems as well.

Saddle Points

Given a function f, a critical point is a point where the derivative (gradient) of f is 0. Typically, critical points are either maxima or minima (local or global) of the function. Saddle points are a special type of critical point where the slope is 0 but which is not a local extremum: the function curves up along some directions and down along others.

For example, for the graph {z = x^2 - y^2}, the point (0,0) is a saddle point because the gradient at (0,0) is 0, but the point is neither a local maximum nor a local minimum.

At this point, let us talk a bit more about critical points. The curvature of the function around a critical point tells us a lot about the critical point itself. The curvature can be evaluated using the Hessian at that point, and the eigenvalues of the Hessian at the critical point characterize it:

  • If the eigenvalues are all non-zero and all positive, then the critical point is a local minimum
  • If the eigenvalues are all non-zero and all negative, then the critical point is a local maximum
  • If the eigenvalues are all non-zero and a mix of positive and negative values, then the critical point is a saddle point
  • If the Hessian matrix is singular, then the critical point is a degenerate critical point

Saddle Point Problem in High Dimensional Non-convex Problems

When a non-convex error function has a single scalar variable, it might have many local maxima and local minima, but saddle points occur with negligible probability. Error functions of N scalar variables, on the other hand, are likely to have many more saddle points than local minima; in fact, as the dimensionality N increases, the number of saddle points relative to local minima increases exponentially, and many studies support this point. When dealing with high-dimensional non-convex problems, which is the case for real-world problems, we therefore have to deal with these saddle points. The traditional gradient descent and Newton methods perform poorly in the presence of saddle points: gradient descent methods are repelled away from saddle points, but their convergence is painfully slow, whereas Newton methods, in an effort to speed up convergence, are attracted to saddle points and do not converge to a local minimum.

Dynamics of Optimization Algorithms around Saddle Points

To better understand the difficulty of existing optimization algorithms for high dimensional non-convex problems, it is helpful to understand how these algorithms behave near saddle points during the optimization process.

We can re-parameterize the critical points using the Taylor expansion so that they can be locally analysed. The Taylor expansion can be given as:

f(\theta^* + \Delta \theta) = f(\theta^*) + \frac {1}{2} \sum_{i=1}^{n} \lambda_{i} \Delta v_{i}^2

The step that the gradient descent method takes along an eigendirection is -\lambda_i \Delta v_i . A step of the gradient descent method always moves in the right direction around the saddle point: along directions of negative curvature it moves away from the critical point, so gradient descent is repelled from saddle points. The problem with gradient descent is the step size. Because the gradients are proportional to the corresponding eigenvalues, the eigenvalues dictate how fast we move in each direction. If there is a large discrepancy among the eigenvalues, gradient descent takes very small steps in some directions. It might take a very long time to move away from the critical point, if the critical point is a saddle point, or to reach the critical point if it is a minimum.

The Newton method rescales the gradient in each direction by the inverse of the corresponding eigenvalue, so the step it takes along an eigendirection is - \Delta v_{i} . If the eigenvalue is negative, the Newton method moves in the direction opposite to gradient descent due to this rescaling. This means it moves in the wrong direction around saddle points, and Newton methods, instead of escaping saddle points, converge to them.

The trust region approach is a second-order method where the step taken is - ( \frac {\lambda_i} {\lambda_i + \alpha} ) \Delta v_i . This speeds up convergence significantly, and the method moves in the same direction as gradient descent, thus escaping saddle points. It can still suffer from slow convergence if \alpha is too large.
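The following toy sketch (the eigenvalues and displacements are my own numbers) compares the per-eigendirection steps of the three methods just described, for one large positive and one small negative eigenvalue:

```python
# displacement dv along each eigendirection, with eigenvalues of very different magnitude
eigvals, dv, alpha = [5.0, -0.001], [1.0, 1.0], 0.5

for lam, v in zip(eigvals, dv):
    gd = -lam * v                        # gradient descent: correct direction, eigenvalue-scaled
    newton = -v                          # Newton: equal lengths, wrong direction when lam < 0
    trust = -(lam / (lam + alpha)) * v   # damped / trust-region step from the text
    print(lam, gd, newton, trust)
```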

Attacking the Saddle Point Problem in High-dimensional Non-convex Optimization (Yufeng)

Problems with Stochastic Gradient Descent (SGD) and Newton Method

Based on our previous analysis, we know that saddle points are ubiquitous in high dimensional non-convex optimization problems. However, we are still uncertain about how modern optimizers would behave around these saddle points. Specifically, we would like to focus on Stochastic Gradient Descent (SGD) and Newton method here.

Suppose we have a function f(\theta), in which \theta stands for the whole parameters of a model that we want to optimize. Besides, we have a saddle point \theta^*. Then we may approximate f(\theta) around this saddle point using second-order Taylor expansion by

f(\theta^* + \Delta\theta) = f(\theta^*) + \nabla f(\theta^*)^\top \Delta\theta + \frac{1}{2} (\Delta\theta)^\top \mathbf{H} \Delta\theta.

Since a saddle point is also a critical point, the first-order term vanishes, as \nabla f(\theta^*) is the \mathbf{0} vector. Now let's reparametrize \Delta\theta in terms of how it changes along the orthonormal eigenvector directions of the Hessian matrix \mathbf{H}, i.e., \mathbf{e}_1, \dots, \mathbf{e}_{n_\theta}. Here n_{\theta} is the number of eigenvectors of \mathbf{H}, with corresponding eigenvalues \lambda_1, \dots, \lambda_{n_\theta}. The movement along the eigendirections can then be written as

\begin{aligned}  \Delta \mathbf{v} &=  \left[  \begin{array}{c}  e_1^\top\\  \vdots\\  e_{n_\theta}^\top  \end{array}  \right]  \Delta\theta  \end{aligned}.

Since a Hessian matrix should be both real and symmetric, with orthonormal eigenvectors, it can be represented by

\mathbf{H} = [ e_1, \dots, e_{n_\theta} ] \Lambda \left[  \begin{array}{c}  e_1^\top \\  \vdots \\  e_{n_\theta}^\top  \end{array}  \right],

where \mathbf{\Lambda} is the matrix with eigenvalues on the diagonal. Therefore, our Taylor expansion approximation can be expressed as:

\begin{aligned}  f( \theta^* + \Delta\theta ) &= f(\theta^*) + \frac{1}{2} (\Delta\theta)^\top \overbrace{ [ e_1, \dots, e_{n_\theta} ] \mathbf{\Lambda} \left[  \begin{array}{c}  e_1^\top \\  \vdots \\  e_{n_\theta}^\top  \end{array}  \right] }^{\mbox{ Orthonormal }} \Delta\theta \\  &= f(\theta^*) + \frac{1}{2} \sum_{i=1}^{n_\theta} \lambda_i \Delta\mathbf{v}_i^2  \end{aligned}.

Based on the above reparametrization, we can now go forward with how SGD and Newton method stroll around the saddle point \theta^*. First for SGD, the movement along a specific eigenvector e_i is defined as

\mathbf{m}_i = -\lambda_i \Delta\mathbf{v}_i.

It’s worth mentioning that we are actually moving along the correct direction here. However, with \lambda_i scaling before \Delta\mathbf{v}_i, eigenvalues of different magnitudes will result in different step sizes along eigenvector directions.

While for Newton method, the movement is quantified by

\mathbf{m}_i = - \Delta\mathbf{v}_i.

Although the magnitudes of the movements are equalized by rescaling with the eigenvalues, negative eigenvalues cause the resulting step to go in the opposite direction, which unexpectedly increases the loss. Furthermore, since the Newton method was originally developed to find roots of an equation and is applied here to the first derivatives of the objective function, it converges to critical points of any kind, including saddle points.

General Trust Region Methods

In order to tackle the saddle point problem, the authors define a class of generalized trust region methods (GTRM), which extends the classical trust region method (CTRM). First, let's look at how CTRM works: a second-order Taylor expansion (quadratic model) is used to approximate the original objective function

\begin{aligned}  \Delta\theta &= \text{argmin}_{\Delta\theta} f(\theta) + \nabla f(\theta)^\top \Delta\theta + \frac{1}{2} \Delta\theta^\top \mathbf{H} \Delta\theta \\  & \mbox{s.t. } ||\Delta\theta|| \le \Delta, \end{aligned}

in which \Delta defines the radius of the trust region. For their proposed GTRM, they make two changes:

  1. The minimization of first-order Taylor expansion is allowed, which can in some way alleviate the difficulty of optimization;
  2. Original norm constraints on the step \Delta\theta are relaxed, which is replaced by some hand-designed distance metrics.

With these extensions, the generalized trust region methods can be summarized as:

\begin{aligned}  \Delta\theta &= \text{argmin}_{\Delta\theta} \mathcal{T}_k\{ f, \theta, \Delta\theta \} \quad \mbox{with }k \in \{1,2\} \\  & \mbox{s.t. } d( \theta, \theta + \Delta\theta ) \le \Delta,  \end{aligned}

in which \mathcal{T}_k represents the k-th order Taylor expansion and d( \theta, \theta + \Delta\theta ) defines some distance metrics.

Attacking the Saddle Point Problem

Now let’s get back to how we should attack the saddle point challenge that can’t be completely solved by either SGD or Newton method. It would be straightforward to consider combining the advantages of both methods. Intuitively, we can just rescale the SGD movement by the absolute value of corresponding eigenvalues, i.e.,

\mathbf{m}_i = - \frac{\lambda_i}{|\lambda_i|} \Delta \mathbf{v}_i,

which uses the same rescaling as the Newton method but has the property of moving in the correct eigendirection. However, there had not been any mathematical justification of such a design before. We then need to answer two potential questions:

  1. Are we optimizing the same objective function, if we replace \mathbf{H} with |\mathbf{H}| which is the matrix obtained by taking the absolute value of each eigenvalue of \mathbf{H}?
  2. The above justification might hold near saddle points (or, more generally, critical points), but will it still work well far away from the critical points?

With the above concerns, the authors were able to show that this is actually a valid heuristic design based on their GTRM framework.

To begin with, a first-order Taylor expansion of the objective function is used as the objective to be minimized:

\Delta\theta = \text{argmin}_{\Delta\theta} \mathcal{T}_1\{ f, \theta, \Delta\theta \} = f(\theta) + \nabla f(\theta)^\top \Delta\theta,

which is an affine function of \Delta\theta, so the minimum is negative infinity if there are no constraints on the step. Accordingly, the minimizer must lie on the boundary of the trust region. Besides, we need to incorporate curvature information into this optimization problem, and it has to come from the distance metric in the constraint. In this paper, the authors define the trust region by how far the second-order Taylor expansion is allowed to deviate from the first-order expansion. Formally, the distance constraint is defined as

d( \theta, \theta + \Delta\theta ) = |\mathcal{T}_2(f,\theta, \Delta\theta) - \mathcal{T}_1(f,\theta, \Delta\theta) | \le \Delta.

Expanding the distance constraint, we get

\frac{1}{2} \left| \Delta\theta^\top \mathbf{ H }\Delta\theta \right| \le \Delta .

This constraint is not easy to handle directly when \Delta\theta is high dimensional. To circumvent this, the authors propose the lemma shown below.

Lemma 1. Let \mathbf{A} be a nonsingular symmetric matrix in \mathbb{R}^{n\times n}, and \mathbf{x}\in \mathbb{R}^n be any vector. Then it holds that \mathbf{|x^\top Ax | \le x^\top |A|x}, where |\mathbf{A}| is the matrix obtained by taking the absolute value of each of the eigenvalues of \mathbf{A}.

Proof. \mathbf{A} is nonsingular and symmetric \Rightarrow \{e_1, e_2, \dots, e_n\} are orthogonal. If they have been normalized, then \mathbf{I} = \sum_i e_ie_i^\top.

\begin{aligned}  |\mathbf{x^\top Ax} | &= \left| \sum_i \mathbf{ (x^\top e}_i ) \mathbf{e}_i^\top \mathbf{A x }\right| = \left| \sum_i (\mathbf{x^\top e}_i )\lambda_i (\mathbf{e}_i^\top \mathbf{x}) \right| = \left| \sum_i \lambda_i (\mathbf{x}^\top \mathbf{e}_i)^2 \right| \\  & \le \sum_i |\lambda_i (\mathbf{x}^\top \mathbf{e}_i)^2| = \sum_i (\mathbf{x}^\top \mathbf{e}_i) | \lambda_i | (\mathbf{x}^\top \mathbf{e}_i) = \mathbf{ x^\top |A| x } \quad\quad \square  \end{aligned}

Therefore, instead of using the original constraint, we can fold \frac{1}{2} into the right-hand side distance upper bound and relax the constraint slightly as

\left| \Delta\theta^\top \mathbf{ H }\Delta\theta \right| \le \Delta\theta^\top |\mathbf{ H }|\Delta\theta \le \Delta.

As already stated, the minimizer will always lie on the boundary of the trust region, so we can replace the inequality with an equality. Our final optimization problem becomes

\begin{aligned}  \Delta\theta &= \text{argmin}_{\theta} f(\theta) + \nabla f(\theta)^\top \Delta\theta \\  & \mbox{s.t. } \Delta\theta^\top |\mathbf{H}| \Delta\theta = \Delta  \end{aligned},

which can be solved analytically by the method of Lagrange multipliers.

\begin{aligned}  \mathcal{L}( \Delta\theta, \lambda ) = f(\theta) &+ \nabla f(\theta)^\top \Delta\theta + \lambda ( \Delta\theta^\top |\mathbf{H}| \Delta\theta - \Delta ) \\  &\min_{\Delta\theta} \max_{\lambda} \mathcal{L}( \Delta\theta, \lambda ) \\  \frac{\partial \mathcal{L}}{\partial \Delta\theta } &= \nabla f(\theta) + 2\lambda |\mathbf{H}| \Delta\theta = 0\\  & \Longrightarrow \Delta\theta = -\frac{1}{2\lambda} |\mathbf{H}|^{-1} \nabla f(\theta).  \end{aligned}

If the scalar in front is folded into the learning rate, the update step becomes

\Delta\theta = -|\mathbf{H}|^{-1} \nabla f(\theta).

This is called the saddle-free Newton method (SFN); it inherits the merits of both SGD and the Newton method, as summarized below and illustrated by the sketch after this list.

  1. It can move further (less) in the directions of low (high) curvature as compared to SGD.
  2. With always correct directions, it can move away from saddle points when the Hessian matrix is not positive definite.
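A small dense-matrix sketch of the update \Delta\theta = -|\mathbf{H}|^{-1} \nabla f(\theta), using an exact eigendecomposition (the paper's practical version instead works in a low-dimensional Krylov subspace, as described next); the test function is my own example:

```python
import numpy as np

def saddle_free_newton_step(grad, hessian):
    # |H| is obtained by taking absolute values of the eigenvalues of H
    eigvals, eigvecs = np.linalg.eigh(hessian)
    abs_h_inv = eigvecs @ np.diag(1.0 / np.abs(eigvals)) @ eigvecs.T
    return -abs_h_inv @ grad

# f(x, y) = x^2 - y^2 has a saddle at the origin with Hessian diag(2, -2)
H = np.array([[2.0, 0.0], [0.0, -2.0]])
theta = np.array([0.5, 0.5])
grad = np.array([2.0 * theta[0], -2.0 * theta[1]])
print(saddle_free_newton_step(grad, H))  # steps toward 0 in x and away from the saddle in y
```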

However, the exact computation of the Hessian matrix is infeasible for high-dimensional problems. To avoid exact computation, the authors use a Krylov subspace method, which optimizes over a space of much lower dimension than the original one, i.e., \hat{f}(\alpha) = f(\theta + \alpha \mathbf{V} ). Here \mathbf{V} contains the k largest eigenvectors of the Hessian matrix, which can be found efficiently through the Lanczos iteration, as stated in the paper. Note that we are now optimizing over the variable \alpha. Afterward, the absolute-value Hessian associated with this new space is calculated via eigendecomposition, |\hat{\mathbf{H}}| \leftarrow \left| \frac{\partial^2 f}{\partial \alpha^2} \right|, and Newton-like steps can then be applied iteratively. The authors also include a step to determine the learning rate (step size) for their SFN method. The overall pseudocode is shown in Algorithm 1 in the following figure.

As mentioned before, line 8 of the algorithm searches for a suitable learning rate \lambda that minimizes the current \hat{f}. Then, in line 9, the step is mapped back to the original parameter space (\theta) and the final update is performed.

Experimental Validation

After the theoretical analysis above, it is essential to validate these results experimentally. In this section, the existence of saddle points and the effectiveness of SFN on deep feed-forward and recurrent neural networks are examined through various experiments.

The Existence of Saddle Points in Neural Networks

To allow an exact comparison with popular optimizers such as SGD and the damped Newton method, small neural networks are trained on downsampled MNIST and CIFAR-10, which makes exact computation of the updates feasible. The experimental results are shown in the following figure.

Based on the above figure, we can observe that:

  1. In (a) and (d), as the number of hidden units increases, the model becomes much more high-dimensional. Both SGD and the damped Newton method get stuck at around 25 hidden units, whereas the proposed SFN is still able to reach a region of much lower error. This indirectly indicates that the number of saddle points grows exponentially with the dimensionality of the model.
  2. The learning curves in (b) and (e), plotted against epochs, confirm that SFN can escape from saddle points and achieves a much lower training error, while SGD and the damped Newton method both get trapped in regions where SFN does not.
  3. Finally, (c) and (f) show the evolution of the absolute value of the most negative eigenvalue, which shrinks as a lower error is reached. The Hessian therefore gradually becomes closer to positive definite, indicating a local minimum rather than a saddle point, i.e., the saddle points have been successfully escaped.

The Effectiveness of Saddle-Free Newton Method

Furthermore, the authors verify that their SFN method is also able to escape saddle points and reach a local minimum in deep neural networks. A seven-layer deep autoencoder is trained on the full MNIST dataset, and the approximate SFN takes over after mini-batch SGD gets stuck. The learning curve is shown in (a) of the figure below. We can also see that after SFN is applied, both the magnitude of the most negative eigenvalue and the norm of the gradient are reduced significantly.

Recurrent Neural Network Optimization

Finally, the authors test whether SFN can help train recurrent neural networks (RNNs) when the training difficulty is caused by saddle points. A small RNN with 120 hidden units for character-level language modeling is trained on the classic Penn Treebank dataset; again, SFN is applied after SGD gets trapped. A similar trend in the learning curve is found, as shown in (c) of the following figure. Furthermore, (d) shows that the Hessian at the final solution found by SFN has far fewer negative eigenvalues than the one found by mini-batch SGD.

Alternating Direction Method of Multipliers (ADMM)

 

1. Introduction

This blog post covers the Alternating Direction Method of Multipliers (ADMM), which solves convex optimization problems efficiently by breaking them into subproblems. Sections {1} to {6} are written by Nischel Kandru and the rest by Ahmadreza Azizi.

2. Dual Ascent

{\rightarrow} Consider the convex optimization problem of minimizing {f(x)} subject to {Ax = b}.

The Lagrangian for this problem is:
{L(x,y) = f(x) + y^T (Ax-b)}

The dual of the original problem is:
{max_{y} \; g(y) = max_{y} \; -f^{*}(-A^{T}y) - b^{T}y}, where {f^*} is the convex conjugate of the objective function {f} and {y} is the dual variable or Lagrange multiplier.

{\rightarrow} If strong duality holds, then the optimal values of the primal and dual problems are the same.

{x^* = argmin_x \quad L(x,y^*)}

where {x^*} is the primal optimal point and {y^*} is the dual optimal point; provided there is only one minimizer of {L(x,y^*)} ({f} is strictly convex)

{\rightarrow} If {f} is merely convex (so there may not be a unique minimizer), a dual subgradient step is performed instead of a dual gradient step;

{x^* \in argmin_x \quad L(x, y^*)}

{\rightarrow} In the dual ascent method, the dual problem is solved with gradient ascent. Assuming {g} is differentiable, we compute:
{x^+ = argmin_x L(x,y)} and {\nabla g(y) = Ax^+ - b}

{\rightarrow} If {g} is non-differentiable, we use the subgradient method instead. The resulting iterations are:

{x^{k+1} = argmin_x L(x,y^k)} : x-minimization step
{y^{k+1} = y^k + \alpha^k (Ax^{k+1} - b)} : dual variable update,
where {k} is the iteration index and {\alpha^k} is the step size.

Repeat the above steps iteratively to solve the optimization problem. This is called dual ascent because, with an appropriate step size {\alpha^k}, we always have {g(y^{k+1}) > g(y^k)}.
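As an illustration, here is a minimal NumPy sketch of dual ascent on a made-up equality-constrained quadratic, {min \; \frac{1}{2}x^TPx + q^Tx} subject to {Ax = b}; the problem data and the step size are arbitrary choices, not taken from a particular source. For a strictly convex quadratic, the x-minimization step has a closed form.

```python
import numpy as np

# Made-up problem: min 0.5*x^T P x + q^T x   s.t.   A x = b
P = np.array([[2.0, 0.0], [0.0, 4.0]])   # positive definite, so f is strictly convex
q = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

y = np.zeros(1)    # dual variable
alpha = 0.1        # step size (arbitrary choice)

for k in range(200):
    # x-minimization step: argmin_x L(x, y) = argmin_x 0.5 x^T P x + q^T x + y^T(Ax - b)
    x = np.linalg.solve(P, -(q + A.T @ y))
    # dual variable update: gradient ascent on g(y), with gradient A x^+ - b
    y = y + alpha * (A @ x - b)

print(x, A @ x - b)   # x approaches the constrained minimizer and the residual tends to 0
```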

Let’s derive the above minimization steps:

Given an objective function {f}, its conjugate is {f^*(y) = max_{x \in R^n} \quad y^T x - f(x)}; the maximum is taken over all points {x}, and the conjugate is evaluated at the point {y}.

If {f} is closed and convex, then {f^{**} = f}; the conjugate of the conjugate is the function itself. Also, {x \in \partial f^* (y)} if and only if {y \in \partial f(x)}, if and only if {x \in argmin_x \; f(x) - y^Tx}; that is, {x} is a subgradient of the conjugate at {y} if and only if {y} is a subgradient of the original function at {x}.

{x \in \partial f^*(y)} if and only if {x \in argmax_x \; y^Tx - f(x)}, since a subgradient of a pointwise maximum is given by a point achieving the maximum. Maximizing {y^Tx - f(x)} is the same as minimizing its negative, so
{x \in argmin_x -(y^Tx - f(x))}
{\implies x \in argmin_x (f(x) - y^Tx)}
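As a quick sanity check of these identities, take {f(x) = \frac{1}{2}||x||_2^2}. Then {f^*(y) = max_x \; y^Tx - \frac{1}{2}||x||_2^2}; setting the gradient to zero gives {x = y}, so {f^*(y) = \frac{1}{2}||y||_2^2}, and indeed {\partial f^*(y) = \{y\} = argmin_x (f(x) - y^Tx)}.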

Dual problem:
{max_y \; g(y) = max_y \; -f^*(-A^Ty) - b^Ty} — (1)
Let {G(y) = -f^*(-A^Ty)}, so that {g(y) = G(y) - b^Ty}.
The subgradients of (1) are {\partial G(y) - b}, where
{\partial G(y) = A \partial f^* (-A^Ty)}
{x \in \partial f^* (-A^Ty)} if and only if {x \in argmin_x \; f(x) + y^TAx}

{x^{k+1} \in argmin_x \; f(x) + {y^k}^TAx}
{\implies x^{k+1} \in argmin_x \; L(x,y^k)} (the term {-{y^k}^Tb} does not depend on {x})
{y^{k+1} = y^k + \alpha^k (Ax^{k+1} -b)}

Disadvantages of Dual Ascent

  1. Slow Convergence
  2. Even though we may achieve convergence in the dual objective value, convergence of {y^k} and {x^k} to solutions requires strong assumptions

Advantages of Dual Ascent
Decomposability, which leads to Dual Decomposition

3. Dual Decomposition

If {f} is separable, then dual ascent can be decentralized.

{f(x) = \sum_{i=1}^{N} f_i (x_i)}

Partitioning {A = [A_1 \cdots A_N]} conformably with {x = (x_1, \ldots, x_N)}, the Lagrangian for this problem becomes:

\displaystyle L(x,y) = \sum_{i=1}^{N} L_i (x_i, y) = \sum_{i=1}^{N} (f_i (x_i) + y^TA_ix_i - (1/N)y^Tb)

\displaystyle {x_i}^{k+1} := argmin_{x_i} L_i(x_i, y^k)

We can compute the x-minimization steps in parallel, thanks to this decomposability.

\displaystyle y^{k+1} := y^k +\alpha^{k} (Ax^{k+1} -b)

This essentially involves two steps:

  1. Broadcast: Send {y} to each of the other nodes, each optimizes in parallel to find {x_i}
  2. Gather: Collect {A_ix_i} from each node, update the global dual variable {y}
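As a toy illustration of this broadcast/gather pattern, here is a minimal NumPy sketch with a made-up separable objective {f(x) = \sum_i \frac{1}{2}(x_i - c_i)^2} and a single coupling constraint {a^Tx = b}; the data and step size are arbitrary, and the "parallel" node updates are written as an ordinary loop for simplicity.

```python
import numpy as np

# Separable made-up problem: min sum_i 0.5*(x_i - c_i)^2   s.t.   a^T x = b
c = np.array([1.0, 2.0, 3.0])
a = np.array([1.0, 1.0, 1.0])
b = 2.0

y = 0.0        # dual variable
alpha = 0.2    # step size (arbitrary choice)

for k in range(200):
    # Broadcast: each node i solves argmin_{x_i} 0.5*(x_i - c_i)^2 + y*a_i*x_i
    # (written as a loop here, but the N updates are independent)
    x = np.array([c[i] - y * a[i] for i in range(len(c))])
    # Gather: collect the a_i x_i terms and update the dual variable
    y = y + alpha * (a @ x - b)

print(x, a @ x - b)   # the constraint residual tends to 0
```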

4. Augmented Lagrangian and Method of Multipliers

To obtain convergence without assumptions such as strict convexity, we add a new term so that the function becomes "more convex".

\displaystyle L_{\rho} (x,y) = f(x) + y^T(Ax-b) + \frac{\rho}{2}{||Ax-b||_2}^2

where {\rho > 0} is the penalty parameter

We can clearly see that the addition of the new term does not change the problem, because the constraint is {Ax=b}. The new term provides:

  • numerical stability,
  • smoothing of the primal criterion function, and
  • additional curvature in the objective function.

The augmented Lagrangian can be viewed as the ordinary Lagrangian associated with the problem

\displaystyle min \quad f(x) + \frac{\rho}{2}{||Ax-b||_2}^2

subject to {Ax = b}

{\rightarrow} With the penalty parameter {\rho}, the dual function {g_\rho} can be shown to be differentiable under rather mild conditions on the original problem.

{\rightarrow} Applying dual ascent yields:

{x^{k+1} := argmin_x L_\rho (x,y^k)}
{y^{k+1} := y^k + \rho (Ax^{k+1} -b)}
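Here is a minimal NumPy sketch of the method of multipliers on the same kind of made-up equality-constrained quadratic used earlier; note that the x-step now minimizes the augmented Lagrangian {L_\rho} and the dual update uses {\rho} itself as the step size.

```python
import numpy as np

# Same made-up problem: min 0.5*x^T P x + q^T x   s.t.   A x = b
P = np.array([[2.0, 0.0], [0.0, 4.0]])
q = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

rho = 1.0          # penalty parameter, also used as the dual step size
y = np.zeros(1)

for k in range(50):
    # x-step: minimize L_rho(x, y) = f(x) + y^T(Ax - b) + (rho/2)||Ax - b||^2 in closed form
    x = np.linalg.solve(P + rho * A.T @ A, -(q + A.T @ y) + rho * A.T @ b)
    # dual update with step size rho keeps (x^{k+1}, y^{k+1}) dual feasible
    y = y + rho * (A @ x - b)

print(x, A @ x - b)
```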

{\rightarrow} Optimality conditions for the original problem are primal and dual feasibility, i.e.,

\displaystyle Ax^* - b = 0, \nabla f(x^*) + A^Ty^* = 0

Let’s see the motivation behind having {\rho} as the stepsize instead of {\alpha^k}.

{\nabla_x L_\rho (x, y^k) \big|_{x=x^{k+1}} = 0}, since {x^{k+1}} minimizes {L_\rho(x, y^k)}:
{\nabla f(x^{k+1}) + A^Ty^k + \rho A^T (Ax^{k+1} - b) = 0}
{\nabla f(x^{k+1}) + A^T(y^k + \rho(Ax^{k+1} - b)) = 0}
{\nabla f(x^{k+1}) + A^T y^{k+1} = 0}
so {(x^{k+1}, y^{k+1})} is dual feasible.

{\rightarrow} This is the stationarity condition for the original problem, so to preserve it we use {\rho} as the step size.
{\rightarrow} The augmented Lagrangian gives much better convergence, but we lose decomposability; the solution is ADMM, which gets the best of both worlds.

5. Alternating Direction Method of Multipliers

ADMM combines the decomposability of dual ascent with the convergence of the method of multipliers.

\displaystyle min \quad f(x) + g(z)

where {f} and {g} are convex, subject to {Ax+Bz=c}

The augmented Lagrangian for this problem is:
{L_\rho (x,z,y) = f(x) + g(z) + y^T(Ax+Bz-c) + \frac{\rho}{2}{||Ax+Bz-c||_2}^2}

{x^{k+1} := argmin_x L_\rho (x, z^k, y^k)} — x-minimization step
{z^{k+1} := argmin_z L_\rho (x^{k+1}, z, y^k)} — z-minimization step
{y^{k+1} := y^k + \rho (Ax^{k+1} + Bz^{k+1} -c)} — dual variable update step

{\rightarrow} {x} and {z} are updated in an alternating fashion.
{\rightarrow} We can see that the z-minimization step depends on the updated value {x^{k+1}}, so these two steps cannot be fully parallelized.

The method of multipliers for this problem would have the form:

\displaystyle (x^{k+1}, z^{k+1}) := argmin_{x,z} L_\rho (x,z,y^k)

\displaystyle y^{k+1} := y^k + \rho (Ax^{k+1} +Bz^{k+1} -c)

We can clearly see that here we minimize over {x} and {z} jointly and therefore lose decomposability.

{\rightarrow} ADMM forces decomposability: although {z^{k+1}} depends on {x^{k+1}}, so we do not have full decomposability, within each minimization step all the {x_i}'s (and likewise all the {z_i}'s) can be computed in parallel.

{\rightarrow} ADMM does not converge as fast as the method of multipliers and does not decompose as fully as dual decomposition, but it takes the best of both worlds.

{\rightarrow} In general, ADMM requires many iterations to reach highly accurate solutions, but it obtains a reasonably accurate solution within a few iterations.

Scaled Form
We can define a scaled dual variable {u = \frac{1}{\rho}y}, which makes the computations more convenient.

\displaystyle x^{k+1} := argmin_x (f(x) + \frac{\rho}{2}{||Ax+Bz^k-c+u^k||_2}^2)

\displaystyle z^{k+1} := argmin_z (g(z) + \frac{\rho}{2}{||Ax^{k+1}+Bz-c+u^k||_2}^2)

\displaystyle u^{k+1} := u^k + Ax^{k+1} +Bz^{k+1} -c
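To make the scaled-form updates concrete, here is a minimal NumPy sketch of ADMM applied to the lasso, {min_x \; \frac{1}{2}||Dx-d||_2^2 + \lambda ||x||_1}, written in splitting form with {f(x) = \frac{1}{2}||Dx-d||_2^2}, {g(z) = \lambda||z||_1}, and the constraint {x - z = 0} (so {A = I}, {B = -I}, {c = 0}); the data matrix {D}, {\lambda}, and {\rho} are made-up choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((30, 10))   # made-up data matrix
d = rng.standard_normal(30)
lam, rho = 0.5, 1.0                 # regularization and penalty parameters (arbitrary)

def soft_threshold(v, t):
    """Closed-form prox of t*||.||_1, used for the z-update."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

n = D.shape[1]
x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
DtD, Dtd = D.T @ D, D.T @ d

for k in range(200):
    # x-update: argmin_x 0.5||Dx - d||^2 + (rho/2)||x - z + u||^2
    x = np.linalg.solve(DtD + rho * np.eye(n), Dtd + rho * (z - u))
    # z-update: argmin_z lam||z||_1 + (rho/2)||x - z + u||^2  (soft-thresholding)
    z = soft_threshold(x + u, lam / rho)
    # scaled dual variable update
    u = u + x - z

print(z)   # a sparse lasso solution
```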

6. References

  • http://stanford.edu/~boyd/admm.html
  • http://www.stat.cmu.edu/~ryantibs/convexopt-F13/lectures/23-dual-meth.pdf

Convex Optimization with Submodular Functions

During our introductory lecture on submodular functions, based on Francis Bach’s paper, we discussed the importance of combinatorial optimization for machine learning applications, reviewed set notation, and defined the key concepts from chapters 2-3 of Bach’s paper.

Importance of Combinatorial Optimization for Machine Learning

Feature selection, factoring distributions, ranking, clustering, and graph cuts were given as examples of machine learning problems that are combinatorial by nature and require either maximization or minimization. When discussing the importance of submodular functions and combinatorial optimization, we referred to the feature selection example from this slide deck produced by professors Andreas Krause and Carlos Guestrin.

In the feature selection example, we want to choose the k best features that add the most certainty in predicting the dependent variable (in this example, Y = “Sick”). As Krause and Guestrin point out, this problem is inherently combinatorial and therefore requires combinatorial optimization techniques. The goal of choosing an optimal feature set, |A*| ≤ k, is outlined in the first two slides. The problem is NP-hard (there are 2^p − 1 non-empty candidate subsets to consider), so we use a greedy algorithm: at each step we add the single element that most reduces the marginal uncertainty when appended to the current set, A*. We repeat this process for up to k iterations until the set is finalized. Although this certainly will not give us the true optimal set every time, because some combinations of features may only be valuable when paired together, Krause and Guestrin inform us that the greedy solution is guaranteed to achieve at least 1 − 1/e (≈63%) of the optimal value.
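Below is a minimal sketch of this kind of greedy procedure for maximizing a monotone submodular function under a cardinality constraint; the coverage-style objective is a made-up example, not Krause and Guestrin's feature-selection objective.

```python
def greedy_max(F, ground_set, k):
    """Greedily pick k elements, each time adding the element with the
    largest marginal gain F(S + {e}) - F(S)."""
    S = set()
    for _ in range(k):
        best = max((e for e in ground_set if e not in S),
                   key=lambda e: F(S | {e}) - F(S))
        S.add(best)
    return S

# Made-up coverage objective: each element "covers" some items, and
# F(S) counts the items covered by S in total (a submodular function).
covers = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}, 4: {"a", "b", "c"}}
F = lambda S: len(set().union(*(covers[e] for e in S))) if S else 0

print(greedy_max(F, set(covers), k=2))   # picks element 4 first, then 3
```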

Background

Before covering the Bach paper material, we reviewed set function notations for those people who hadn’t taken a discrete math or combinatorics class. The slide below outlines the notations covered during our class lecture.

 

The three notations at the bottom of the slide seemed to be the ones that students were least familiar with. Cardinality (absolute value signs around a set) is simply the number of elements contained within a given set. The indicator vector, 1_A, for the set A is simply a vector with a 1 in a particular dimension if the element corresponding to that vector dimension is contained in the set A and is a 0 if it isn’t. Lastly, the notation V \ A is simply set V with subset A removed.

Submodular Functions and the Lovász Extension

When we introduced submodular function notation, we used both Definition 2.1 and Proposition 2.2 from the book. Definition 2.1 states that a set function F is submodular if F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B) for all A, B ⊆ V, while Proposition 2.2 characterizes submodularity through diminishing returns: adding an element to a smaller set increases F by at least as much as adding it to a larger set containing it. When first learning about submodularity, we found that Proposition 2.2 was more intuitive and made Definition 2.1 easier to understand.

To further drive home the concept of submodular functions and the property of diminishing returns, we reviewed the set cover example from the Krause and Guestrin slides below. The goal of this visual aid was to demonstrate that adding a new element to a set A, a subset of set B, can only cover the same or more additional surface area than adding the same new element to set B.

During the next part of the lecture, we reviewed the associated polyhedra, which was by far the most confusing material for the class. The difference between the submodular and base polyhedra, as expressed in Bach's paper, is clear enough: the base polyhedron, B(F), can be thought of as the hollow outer shell formed by the intersecting constraint hyperplanes, and the submodular polyhedron, P(F), can be thought of as the base polyhedron together with the non-empty interior space it encloses. What we found confusing was whether Figure 2.1 below was representing just the domain of the set function, F, or if it was also representing the function values. We agreed that the figure was only representing the domain, because it would need an additional dimension to represent the set function values for the elements s1, s2, and s3.

Finally, we had the opportunity to introduce the Lovász extension (also referred to as the Choquet integral), which allows us to extend a set function defined on the vertices of the {0,1}^p hypercube to the continuous hypercube [0, 1]^p (think of taking the convex hull of the vertices). By dividing the continuous hypercube into p! simplices, we can create regions with ordered variable values, as in the figures from Bach's paper below.
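As a concrete illustration, here is a small sketch that evaluates the Lovász extension of a set function F (with F(∅) = 0) using the standard sorting formula f(w) = Σ_k w_{j_k} [F({j_1, …, j_k}) − F({j_1, …, j_{k−1}})], where the components of w are sorted in decreasing order; the tiny set function at the end is a made-up example.

```python
import numpy as np

def lovasz_extension(F, w):
    """Evaluate the Lovász extension of a set function F (with F(empty) = 0)
    at a point w in R^p, using the sorting (greedy) formula."""
    order = np.argsort(-w)        # indices of w sorted in decreasing order
    value, prev, S = 0.0, 0.0, []
    for j in order:
        S.append(int(j))
        F_S = F(frozenset(S))
        value += w[j] * (F_S - prev)   # marginal gain of adding j, weighted by w_j
        prev = F_S
    return value

# Made-up submodular set function on V = {0, 1} (a tiny cut-like function)
table = {frozenset(): 0.0, frozenset({0}): 1.0,
         frozenset({1}): 1.0, frozenset({0, 1}): 1.0}
F = lambda S: table[S]

print(lovasz_extension(F, np.array([1.0, 0.0])))   # 1.0, agrees with F({0})
print(lovasz_extension(F, np.array([1.0, 1.0])))   # 1.0, agrees with F({0, 1})
```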

One of the last things we were able to cover in class was the properties of the Lovász extension. The Lovász extension is important mainly because it allows us to create a continuous extension of discrete problems. Additionally, it has a unique connection with submodularity and convexity. A set function is submodular if and only if its Lovász extension is convex. We ended the class by proving this theorem.

Theorem: A set-function, F, is submodular if and only if its Lovász extension, f, is convex. 

Proof: First assume f is convex, and let A, B ⊆ V. The vector 1_A + 1_B

has components equal to 0 where [ V \ (A ∪ B) ],
                        2 where [ A ∩ B ], 
                    and 1 where [ A Δ B = (A \ B) ∪ (B \ A) ].

Using the property that f(1_C) = F(C) for every C ⊆ V (the Lovász extension agrees with F on indicator vectors, with F(∅) = 0), and decomposing 1_A + 1_B by its level sets ({1_A + 1_B ≥ t} equals A ∪ B for t ∈ (0, 1] and A ∩ B for t ∈ (1, 2]), we get

f(1_A + 1_B) = F(A ∪ B) + F(A ∩ B).

Since f is convex and positively homogeneous, then

f(1_A + 1_B) = 2 f( (1_A + 1_B) / 2 ) ≤ f(1_A) + f(1_B) = F(A) + F(B).

Therefore F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B), which is exactly the definition of submodularity, so F is submodular. Conversely, if F is submodular, then for all ω ϵ ℝ^p, f(ω) can be written as a maximum of linear functions of ω (it is the support function of the base polyhedron B(F), computed by the greedy algorithm), and a pointwise maximum of linear functions is convex on ℝ^p. ∎