Entropy-SGD: Biasing GD Into Wide Valleys

This is a friendly summary and critical review of the paper “Entropy-SGD: Biasing GD Into Wide Valleys” by Chaudhari et al. (2017) [1].

The post is designed to incrementally deepen the analysis and your understanding of the paper (i.e. you can read only the first parts and still get a bird's-eye view of what the paper is about).

However, by reading this post in its entirety you will, hopefully, understand the following:

  1. The core problem that motivated this paper.
  2. The historical background and research upon which this paper is based. I highly encourage you to read the papers that I cite; they form the foundation for understanding this paper and other methodologically similar work.
  3. The important points of the methodology and the algorithm that is used.
  4. Important caveats that are not adequately stressed in the paper, especially with respect to the algorithm.
  5. Omissions that should have been included in the paper.
  6. Assumptions, especially the ones that are referenced.
  7. Theoretical properties of the paper and methods of proof.
  8. Resources for the mathematical notation, which is heavily influenced by statistical physics.

Introduction

The authors derive a variant of Stochastic Gradient Descent (henceforth SGD), which they label "Entropy-SGD". The change lies in a modification of the Loss Function of the optimization algorithm. Specifically, they render the Loss Function "smoother".

Motivation

Why do they make the Loss Function "smoother"? So that the optimizer ends up in "flat minima" (or "wide valleys", as the authors write; the two terms can be used interchangeably).

Let’s see what this cryptic notion of “flat minima” is all about. Look carefully at the following Figures 1 and 2.

Figure 1

 

Figure 2

 

When we are in a region characterized as “flat minima”, the Loss Function changes more slowly. It is hypothesized that flat minima generalize better. Thus, the motivation is to “bias” the objective function so that the optimizer ends up in “flat minima” (or “wide valleys”). This is where the title of the paper partially comes from.

Now what do they mean by “generalize”? They refer to the “generalization error” (also known as the “out-of-sample error”). It also relates to the concept of “stability”, which I will discuss in detail.

Modify (Bias) the SGD

The SGD is modified (or biased) by utilizing the concept of local entropy. The authors were motivated by Baldassi et al. (2016) [2], who use local entropy as a measure for sampling solutions in constraint satisfaction problems.

The goal is to maximize (or minimize its negative) the following Equation 1:

Equation 1

The components of Equation 1 are as follows:

  1. F(x, \gamma) is the log partition function.
  2. It accounts for the “depth” of the “valley” at a location \textbf{x} \in R^N.
  3. The “flatness” is measured through the values of f(x^{'}) in a neighborhood of \textbf{x}.
  4. The hyper-parameter γ looks for “valleys” of a specific “width”.

Intuition of Equation 1: “Bias” the optimization towards “wide (flat) valleys”.
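For reference, here is a reconstruction of Equation 1 (the local entropy), written with β = 1 as the authors set it; take it as a best guess of the notation rather than a verbatim copy:

F(\mathbf{x}, \gamma) = \log \int_{\mathbf{x}' \in R^{N}} \exp\left( -f(\mathbf{x}') - \frac{\gamma}{2} \left\| \mathbf{x} - \mathbf{x}' \right\|_{2}^{2} \right) d\mathbf{x}'

The first term inside the exponential rewards “depth” (low loss), while the Gaussian-like second term restricts attention to a neighborhood of \mathbf{x} whose size is controlled by γ.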

The γ parameter

Let’s see what γ does in practice. Figure 3 illustrates what happens to the negative local entropy when we increase γ: it becomes smoother. Note that in Figure 3 the x-axis is the parameter space and the y-axis is the negative local entropy.

Figure 3

 

The Probabilistic Foundations

This is the point where we go deeper (deep learning, pun intended).

The probabilistic foundations of local entropy lie in Equation 2:

Equation 2

This is a conditional probability with the following properties:

  1. For an \textbf{x} \in R^N parameter vector
  2. Consider a Gibbs-Sampling Distribution where:
    • f(x): is the energy landscape
    • β is the inverse “temperature” (see “Langevin dynamics”)
    • Z is the normalizing constant (the partition function).

The reason we have these cryptic terms (“temperature”?) is that this mathematical machinery comes from physics, specifically the modeling of molecular systems. Not so cryptic after all.

Intuition: when \beta \rightarrow \infty, the probability distribution concentrates on the global minimum of f(\textbf{x}).
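In symbols, the Gibbs distribution referred to above presumably has the standard form

P(\mathbf{x}'; \beta) = \frac{1}{Z_{\beta}} \exp\left( -\beta f(\mathbf{x}') \right)

with Z_{\beta} the normalizing constant.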

Here is the gist, however: the authors want “flat regions”. Consequently, they need a modified Gibbs distribution; specifically, Equation 2 is modified into Equation 3:

Equation 3

Equation 3 is a conditional probability with the following terms:

  1. γ, which “biases” the distribution toward x.
  2. Large γ \rightarrow all the “mass” concentrates near x, regardless of the energy term f(x^{'}) (which measures “flatness”, as discussed above).
  3. Small γ \rightarrow f(x^{'}) dominates.
  4. The authors set β = 1, so β is effectively a nuisance parameter.
  5. Instead, they focus on tuning γ.
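Reconstructing Equation 3 from the description above (shown with β = 1, as the authors set it):

P(\mathbf{x}'; \mathbf{x}, \gamma) \propto \exp\left( -f(\mathbf{x}') - \frac{\gamma}{2} \left\| \mathbf{x} - \mathbf{x}' \right\|_{2}^{2} \right)

The log of its normalizing constant is exactly the local entropy F(\mathbf{x}, \gamma) of Equation 1.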

Revisiting the γ parameter: “scoping”

What Equation 3 tells us is that γ determines the “scope”. Specifically, in a neighborhood (region) around “x” (parameters):

  1. Large γ \rightarrow the proposed Stochastic Gradient Langevin Dynamics (henceforth SGLD) updates in a smaller “neighborhood”.
  2. Small γ \rightarrow allows SGLD to explore a wider “neighborhood” around x.

Why “Local” entropy?

Caveat: according to the authors, local entropy is not equivalent to simply smoothing the loss.

Local entropy puts more weight on wide local minima even if they are shallower than the global minimum. As the authors stress, “minimization of classical entropy would give us an optimal of x_{candidate}”.

ENTROPY-GUIDED SGD (SGLD)

By now you should know that SGLD and Entropy-SGD are closely related: SGLD is the MCMC procedure that runs inside Entropy-SGD, and the authors use both terms when describing their method.

Entropy-SGD is essentially an “SGD-inside-SGD” loop:

  1. An outer SGD loop updates the parameters.
  2. An inner SGD (SGLD) loop estimates the gradient of the local entropy.

Furthermore, the classical (vanilla) optimization problem looks like this:

whereas the Entropy-SGD optimization problem is:

 

where Ξ is the data.
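In symbols (my reconstruction, using the notation introduced above), the two problems being contrasted are

\mathbf{x}^{*} = \arg\min_{\mathbf{x}} f(\mathbf{x}; \Xi) \qquad \text{versus} \qquad \mathbf{x}^{*} = \arg\max_{\mathbf{x}} F(\mathbf{x}, \gamma; \Xi)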

The Gradient of Local Entropy

For \xi_{l_{i}} \in \Xi^{l}, a randomly sampled mini-batch of m samples:

where, for the computation of \left\langle \Xi^{l} \right\rangle, the current weights are fixed to x.
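The key identity behind the algorithm (again a reconstruction, consistent with the definitions above) is that the gradient of the negative local entropy is proportional to the distance between the current weights and this expectation:

\nabla_{\mathbf{x}} \left[ -F(\mathbf{x}, \gamma) \right] = \gamma \left( \mathbf{x} - \left\langle \mathbf{x}' \right\rangle \right)

where \left\langle \mathbf{x}' \right\rangle is the expectation of \mathbf{x}' under the modified Gibbs distribution of Equation 3. The inner SGLD loop exists precisely to estimate this expectation.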

Moreover, the Gibbs Distribution of the Original Optimization is:

To this end, they utilize Stochastic Gradient Langevin Dynamics (SGLD), an MCMC algorithm. The algorithm is as follows; however, there is a caveat in Step 7 that is not properly addressed in the paper.

  • The algorithm listing covers one (outer) iteration.
  • ε: thermal noise.
  • L, ε, and η are fixed.
  • Step 7: as the authors stress, γ has to be tuned (“scoping”); thus in practice Step 7 is modified accordingly (a sketch of the whole update follows below).
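To make the structure concrete, here is a minimal sketch of one outer iteration in plain Python/NumPy. The function and hyper-parameter names (and their default values) are illustrative, not the paper's exact settings; grad_minibatch stands for a user-supplied stochastic gradient of f.

    import numpy as np

    def entropy_sgd_step(x, grad_minibatch, L=20, eta=0.1, eta_prime=0.1,
                         gamma=1e-4, eps=1e-4, alpha=0.75, rng=None):
        """One outer Entropy-SGD update (sketch).

        x              : current weight vector (np.ndarray)
        grad_minibatch : callable returning a stochastic gradient of f at a point
        L              : number of inner SGLD steps
        eta, eta_prime : outer / inner step sizes
        gamma          : scope of the local entropy
        eps            : thermal noise
        alpha          : exponential-averaging coefficient for <x'>
        """
        rng = np.random.default_rng() if rng is None else rng
        x_prime, mu = x.copy(), x.copy()
        for _ in range(L):                                   # inner SGLD loop
            dx = grad_minibatch(x_prime) - gamma * (x - x_prime)
            x_prime = (x_prime - eta_prime * dx
                       + np.sqrt(eta_prime) * eps * rng.standard_normal(x.shape))
            mu = (1.0 - alpha) * mu + alpha * x_prime        # running estimate of <x'>
        # outer step follows the gradient of -F(x, gamma), i.e. gamma * (x - <x'>)
        return x - eta * gamma * (x - mu)

A training loop would simply call entropy_sgd_step repeatedly, drawing a fresh mini-batch inside grad_minibatch at every inner step.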

 

The «Scoping» of γ

The scoping is formulated by the following framework:

  •  \gamma(t) = \gamma_{0} (1+\gamma_{1})^{t}
  • for the t^{th} parameter update.
  • The general strategy (also written out in code below) is:

1. Set \gamma_{0} very small.

2. Use a large value for η.

3. Set \gamma_{1} such that the magnitude of the local-entropy gradient is comparable to that of SGD.
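In code, the scoping schedule above is just an exponential increase of γ with the update index (the constants here are placeholders):

    def gamma_scope(t, gamma0=1e-4, gamma1=1e-3):
        # gamma(t) = gamma0 * (1 + gamma1)**t, evaluated at the t-th parameter update
        return gamma0 * (1.0 + gamma1) ** t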

THEORETICAL PROPERTIES

The authors show that:

  1. Entropy-SGD results in a smoother loss function
  2. and obtains better generalization

With two assumptions:

  1. The original f(x) is β-smooth
  2. No eigenvalue of the Hessian \nabla^{2} f(x) lies in the interval [-2\gamma - c, c], for some small c > 0.

The proof is based on a Laplace Transformation.

 

However, the assumption regarding this bound is less innocuous than the authors suggest. Indeed, the authors cite Hardt et al. (2015), who stress:

“This (assumption) is admittedly unrealistic [..] even if this assumption is violated to an extent before convergence happens.”

The authors are also concerned with the “stability” of the algorithm. Consequently, they invoke the concept of an “ε-stable algorithm”, which places an upper bound on how much the output of the algorithm can change when the training sample changes slightly:

Intuition: under different samples from the general “population”, the algorithm gives nearly the same result. The upper bound is derived in Theorem 3 and Lemma 2:

Experiments

The experiments were done as follows:

A. Two image classification datasets

  1. viz. MNIST
  2. CIFAR-10

B. Two Text prediction datasets

  1. viz. PTB
  2. text of War and Peace

The authors compute the Hessian at a local minimum at the end of training for the following networks:

  1. small-LeNet on MNIST
  2. small-mnistfc on MNIST
  3. char-lstm for text generation
  4. All-CNN-BN on CIFAR-10

When the eigenvalues of the Hessian are very close to zero, the local minimum is almost flat. The authors compute the exact Hessian at convergence. The Entropy-SGD model is then compared to SGD/Adam. The results are reported in the following table, which shows the Error/Perplexity (%) against the number of epochs. The table should be read as follows: the better-performing model reaches the same Error/Perplexity (%) in a lower number of epochs.

The authors also conducted a comparison between SGLD and Entropy-SGD, which is oddly reported in the Appendix, on the last page of the paper. This comparison was conducted for LeNet (300 epochs) and All-CNN-BN (500 epochs). The reported test error is 0.630 \pm 1 \% on LeNet after 300 epochs and 9.890 \pm 11 \% on All-CNN-BN after 500 epochs.

Assessment

The authors conclude that estimating “local entropy” using Langevin dynamics “can be done efficiently even for large deep networks using mini-batch updates”. One of the main problems with the results is that no run-times are reported, so it is unclear whether the procedure’s increased computational cost is offset by the faster convergence. In addition, Baldassi et al. (2016) used a method called “belief propagation” to estimate local entropy, which is probably not given enough credit in this paper.

Notation

The paper uses bra–ket (Dirac) notation, borrowed from quantum mechanics, for probabilities and expectations:

https://quantummechanics.ucsd.edu/ph130a/130_notes/node107.html

References

[1] Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. ICLR, 2017.

[2] Baldassi, C., Borgs, C., Chayes, J. T., Ingrosso, A., Lucibello, C., Saglietti, L., and Zecchina, R. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. PNAS, 113(48):E7655–E7662, 2016.

[3] Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. CoRR, 2017.

[4] Hochreiter, S. and Schmidhuber, J. Flat minima. Neural Computation, 9(1):1–42, 1997.

[6] Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. arXiv:1509.01240, 2015.

 


 

 

Lagrange dual

1. Lagrange dual problem: standard form, Lagrange dual function, and dual problem

First, we consider an optimization problem in the standard form:

minimize  f_{0}(x)

subject to  f_{i}(x) \leq 0, i = 1, …, m

 h_{i}(x) = 0, i = 1, …, p

with variable  x \in R^{n} , where the domain  D = \bigcap_{i=0}^{m} \mathrm{dom}\, f_{i} \cap \bigcap _{i=1}^{p} \mathrm{dom}\, h_{i}  is assumed to be nonempty. The optimal value is  p^{*} .

The idea of Lagrangian duality is to take the constraints into account by augmenting the objective function with a weighted sum of the constraint functions.

We define the Lagrangian  L: R^{n}\times R^{m} \times R^{p}\rightarrow R  as:
L(x, \lambda , \nu ) = f_{0}(x)+ \sum_{i=1}^{m}\lambda _{i}f_{i}(x)+\sum_{i=1}^{p}\nu _{i}h_{i}(x),

with  dom L = D \times R^{m}\times R^{p} , where  \lambda _{i}  is the Lagrange multiplier associated with the  i^{th}  inequality constraint  f_{i}(x) \leq 0  and  \nu_{i}  is the Lagrange multiplier associated with the  i^{th}  equality constraint  h_{i}(x) = 0 .

The Lagrange dual function is the minimum value of the Lagrangian over x:

For  \lambda \in R^{m}, \nu \in R^{p} :
 g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu) = \inf_{x \in D}\left(f_{0}(x) + \sum_{i=1}^{m}\lambda _{i}f_{i}(x)+\sum_{i=1}^{p}\nu _{i}h_{i}(x)\right)

If  \lambda \geq 0 , then  g(\lambda, \nu) \leq p^{*} .

Proof:

If  \tilde{x}  is feasible and  \lambda \geq 0 , then
 f_{0}(\tilde{x}) \geq L(\tilde{x}, \lambda, \nu) \geq \inf_{x\in D}L(x, \lambda , \nu) = g(\lambda , \nu) .
Minimizing over all feasible  \tilde{x}  gives  p^{*} \geq g(\lambda , \nu) .

The dual problem: what is the best lower bound that can be obtained from the Lagrange dual function?

maximize  g(\lambda , \nu)
subject to  \lambda \geq 0
Optimal value is denoted as  d^{*}
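As a tiny worked example (mine, not from the book): take  f_{0}(x) = x^{2}  with the single inequality constraint  f_{1}(x) = 1 - x \leq 0 . Then

 L(x, \lambda) = x^{2} + \lambda (1 - x), \qquad g(\lambda) = \inf_{x} L(x, \lambda) = \lambda - \frac{\lambda^{2}}{4}

Maximizing  g(\lambda)  over  \lambda \geq 0  gives  \lambda^{*} = 2  and  d^{*} = 1 , which equals  p^{*} = 1 ; in this example strong duality holds.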

2. Weak and strong duality

The inequality  d^{*} \leq p^{*}  is called weak duality; it holds even when  p^{*}  and  d^{*}  are infinite. If the equality  d^{*} = p^{*}  holds, then we say that strong duality holds.

3. Geometric interpretation of dual function and lower bound  g(\lambda) \leq p^{*}

Consider a problem with one (inequality) constraint. Given λ, we minimize  (\lambda, 1)^{T}(u, t)  over  G =\{(f _{1} (x), f _{0} (x)) \mid x \in D\} . This yields a supporting hyperplane with slope −λ. The intersection of this hyperplane with the u = 0 axis gives g(λ). The figures show supporting hyperplanes corresponding to three dual-feasible values of λ, i.e. the optimum  \lambda ^{\ast } ,  \lambda ^{1 } , and  \lambda ^{2 } . As shown in the figures, strong duality does not hold, so the optimal duality gap  p^{\ast }-d^{\ast }  is positive.

4. Price or profit interpretation

Suppose the variable x denotes how an enterprise operates and  f_{0}(x) denotes the cost of operating at x, then  -f_{0}(x) is the profit made at the operating condition x.

Each constraint  f_{i}(x) \leq 0 represents some limit on resources (e.g., warehouse space, labor) or a regulatory limit (e.g., environmental). The operating condition that maximizes profit while respecting the limits can be found by solving the problem:

minimize f_{0}(x)

subject to  f_{i}(x) \leq 0, i = 1, …, m.

The resulting optimal profit is  - p ^{*} .

Now imagine a second scenario in which the limits can be violated by paying an additional cost that is linear in the amount of violation, measured by  f_{i} . The cost to the enterprise for violating the  i^{th}  limit or constraint is  \lambda _{i}f_{i}(x) . Payments are also made to the firm for constraints that are not tight: if  f_{i}(x) \leq 0 , the term  \lambda _{i}f_{i}(x) \leq 0  represents a payment to the firm.

The coefficient  \lambda _{i}  has the interpretation of the price for violating  f_{i}(x) \leq 0 ; its units are dollars per unit violation. For the same price the enterprise can sell any ‘unused’ portion of the  i^{th}  constraint. We assume  \lambda _{i} \geq 0 , i.e., the firm must pay for violations.

Reference

Boyd and Vandenberghe, Chapter 5–5.5 https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf

Don’t Decay the Learning Rate, Increase the Batch Size

Background – 1 Stochastic gradient descent
SGD is the dominant optimization algorithm for deep learning. While SGD finds minima that generalize well, each parameter update only takes a small step late in training, because the learning rate has decayed to a very small value.

Background – 2 Previous work
Previous work identifies an optimum fluctuation (noise) scale g that maximizes the test set accuracy (at constant learning rate). The corresponding optimal batch size satisfies B << N, with g = epsilon(N/B - 1), where B is the batch size, N is the training set size, and epsilon is the learning rate.

Background – 3 Large batch training
Instead of decaying the learning rate to make the optimization converge, another approach is to increase the batch size. The advantages are that it reduces the number of parameter updates required, parallelizes well, and reduces training time. The drawback is that when the batch size increases, the test set accuracy often falls a little.

Reference – 1 Industry need for the training algorithm

Jonathan Hseu, a Google Brain scientist, discusses the time-complexity tolerance that industry has for training algorithms.

Reference – 2 Existing training approach speed on GPU
Compared to existing work on GPUs, this paper trains ResNet-50 on ImageNet to 76.1% accuracy in 30 minutes, by making full use of the characteristics of TPUs.

Main content – 1 Existed problem
When one decays the learning rate, one simultaneously decays the noise scale g. However, decaying the learning rate is not the only way to decay g and make training converge.

Main content – 2 Solution
Increase the batch size during training instead of decaying the learning rate: whenever the learning rate would be dropped by a factor alpha, increase the batch size by the same factor alpha instead.
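A minimal sketch of this schedule (the function, milestone epochs, and constants are hypothetical, not from the paper): whenever the usual schedule would divide the learning rate by a factor alpha, multiply the batch size by alpha instead, falling back to learning-rate decay once the batch size hits a memory limit.

    def schedule(epoch, base_lr=0.1, base_batch=128, alpha=5,
                 milestones=(60, 120, 160), max_batch=8192):
        """Return (learning_rate, batch_size) for a given epoch."""
        lr, batch = base_lr, base_batch
        for m in milestones:
            if epoch >= m:
                if batch * alpha <= max_batch:
                    batch *= alpha      # grow the batch instead of decaying the lr
                else:
                    lr /= alpha         # memory limit reached: decay the lr as usual
        return lr, batch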

Main content – 3 Advantage
First, this approach achieves near-identical model performance with the same number of training epochs. Second, it significantly reduces the number of parameter updates in the training process. Third, it doesn't require any additional fine-tuning.

Main content – 4 Further Enhancement
After matching the quality of results of the learning-rate-decay approach, the batch-size-increase approach is also able to further reduce the number of parameter updates: by further increasing the learning rate epsilon and scaling B proportionally to epsilon, or by increasing the momentum coefficient m and scaling B proportionally to 1/(1-m).

Main content – 5 Consequent
It successfully trains an Inception-ResNet-V2 on ImageNet in under 2500 parameter updates using batches of 65536 images, reaching 77% accuracy. It also successfully trains a ResNet-50 on ImageNet to 76.1% accuracy in 30 minutes.

Main content – 6 Contribution
First, it shows the quantitative equivalence of increasing the batch size and decaying the learning rate. Second, it provides a straightforward pathway towards large batch training.

Result analysis – 1
Figure 1 shows an example of how the learning rate and batch size change under the two schedules. For the blue line (decaying learning rate), the learning rate is reduced during training (as the number of epochs increases) while the batch size stays constant. For the red line (increasing batch size), the batch size is increased during training while the learning rate stays constant.

Result analysis – 2
Figure 2 shows that the increasing-batch-size approach achieves a result qualitatively equivalent to the decreasing-learning-rate approach, while needing only around 1/3 of the parameter updates.

Result analysis – 3 & 4
Figure 3 and Figure 4 show that the batch-size-increasing approach can be applied to many different optimization algorithms, such as SGD with momentum, Nesterov momentum, vanilla SGD, and Adam.

Result analysis – 5
Figure 5 shows that, after achieving results equivalent to the learning-rate-decay approach, the batch-size-increase approach is able to reduce the number of parameter updates even further. Blue: original learning-rate-decay method; Green: batch-size-increase method; Red: batch-size-increase method + further increased initial learning rate; Purple: batch-size-increase method + further increased initial learning rate + further increased momentum coefficient.

Result analysis – 6
Figure 6 demonstrates the validity and reproducibility of this approach by repeating the experiments several times.

Result analysis – 7
Figure 7 shows the influence of the momentum coefficient: increasing the momentum reduces the number of updates and the training time, but lets the accuracy fall slightly.

Backpropagation is not just the chain rule

Author: Jiamin Wang and Spencer Jenkins

Motivation

Basically, a machine learning problem is about function approximation. Given a dataset (x, y), where x are the input variables and y is the response, we want to fit some function f, parameterized by a parameter vector θ, so that f(x) predicts y. In the neural network framework, the fitted function is usually nonlinear. We define a loss function measuring how good f's predictions are. Finally, some local optimization algorithm is employed to minimize the loss function.

minimize  L(\theta)

If we want to find the minimum of the loss function L(\theta), we need to compute its derivative with respect to the parameters \theta.

But the loss function obtained from a neural network model is a very large expression, so we generally can't write its derivative down in any closed form. One example is provided in the blog:

f(x)=exp(exp(x) + exp(x)^2) + sin(exp(x) + exp(x)^2)

If we compute the derivative using symbolic differentiation on the 'flat' version of this function, the expression becomes:

df/dx=exp(exp(x)+exp(x)^2)(exp(x) +2exp(x)^2) + cos(exp(x) +exp(x)^2)(exp(x)+2exp(x)^2)

It isn't too hard to explicitly write down this expression. However, an actual loss function is much more complicated, and its symbolic derivative explodes in size.

Automatic differentiation

A smarter way is to use automatic differentiation (autodiff), which has two modes: forward and reverse. In the neural network framework, reverse-mode autodiff is also called backpropagation. Both modes need to evaluate f(x); forward mode becomes slow when the input is high-dimensional, which is why reverse mode is the one used for training.

What is autodiff? Autodiff computes the derivative by defining intermediate variables, as shown below:

a=exp(x)

b=a^2

c=a+b

d=exp(c)

e=sin(c)

f=d+e

It is convenient to draw the graph to illustrate the relationship between all the intermediate variables. We can work backwards to calculate the derivative of f with respect to each variable by the chain rule.

            

In this way, we compute the derivative of f with respect to each variable by reusing the derivatives of that variable's children.
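To make this concrete, here is a small hand-written reverse pass (my own code, not from the blog) for the example function above; the returned derivative matches the symbolic expression for df/dx given earlier.

    import math

    def f_and_grad(x):
        # Forward pass: the intermediate variables defined above
        a = math.exp(x)
        b = a ** 2
        c = a + b
        d = math.exp(c)
        e = math.sin(c)
        f = d + e
        # Reverse pass: accumulate df/d(variable), children before parents
        df_dd = 1.0
        df_de = 1.0
        df_dc = df_dd * math.exp(c) + df_de * math.cos(c)
        df_db = df_dc              # c = a + b
        df_da = df_dc + df_db * 2 * a
        df_dx = df_da * math.exp(x)
        return f, df_dx

    print(f_and_grad(0.5))   # value and derivative at x = 0.5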

  • The general formulation of forward propagation. Variables x_{1}, \dots, x_{n} are the inputs, x_{n+1}, \dots, x_{N-1} are the intermediate variables, and x_{N} is the final value. The functions f_{i} are elementary functions evaluated on the 'parents' Pa(i) of variable i. The formulation of forward propagation is

For i = n+1, n+2, \dots, N: \quad x_{i} = f_{i}(x_{\mathrm{Pa}(i)})

  • The general formulation of backward propagation. The derivative of the function f with respect to the final value is \partial f / \partial x_{N} = 1, and then

For i = N-1, N-2, \dots, 1: \quad \frac{\partial f}{\partial x_{i}} = \sum_{j:\, i \in \mathrm{Pa}(j)} \frac{\partial f}{\partial x_{j}} \frac{\partial f_{j}}{\partial x_{i}}

Advantages of Autodiff

Space complexity: Autodiff has fewer things to memorize. The rules for transforming the code for a function into code for the gradient are really minimal.

Time complexity: The program for the gradient has exactly the same structure as the function, which implies that the gradient can be computed in the same runtime (up to a constant factor).

Autodiff by the method of Lagrange multipliers

The intermediate variables become equality constraints in an optimization problem. The blog example can be written in this constrained form:

argmax f

s.t.  a=exp(x)

b=a^2

c=a+b

d=exp(c)

e=sin(c)

f=d+e

The general formulation

  • Input variables: x_{1}, \dots, x_{n}.
  • Intermediate variables: x_{i} = f_{i}(x_{\mathrm{Pa}(i)}), where \mathrm{Pa}(i) denotes the parents of variable i.
  • Output variable (x_{N}): we assume the program has a single scalar output variable x_{N}, which represents the quantity we want to maximize.

The standard way to solve the constrained optimization problem is to convert it into an unconstrained one with additional parameters \lambda, called Lagrange multipliers.

The really interesting result in this blog post comes from applying the method of Lagrange multipliers to the constrained optimization form of our problem as shown above. By introducing additional variables (known as Lagrangian multipliers), we can turn our problem into an unconstrained optimization problem which takes the following form:
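In symbols (my reconstruction of the blog's formula), with one multiplier per intermediate-variable constraint, the Lagrangian reads

\mathcal{L}(x, \lambda) = x_{N} - \sum_{i} \lambda_{i} \left( x_{i} - f_{i}(x_{\mathrm{Pa}(i)}) \right)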

We call the above form the Lagrangian. To optimize the Lagrangian, we take its gradient and set it equal to zero. We then solve for the Lagrange multipliers \lambda and the intermediate variables x.

The original blog post does a good job of deriving the results of this optimization, so we won’t go into too much detail on the process. The main results, however, are summarized below:

There are a few interesting things to observe from the results of the Lagrangian method. First, we can see that satisfying the constraints on our intermediate variables can be interpreted as the feedforward step in backpropagation. Second, the result we get for the Lagrange multipliers has the same form as what we get during backpropagation: we sum over all \lambda_{j} such that j directly depends on i, scaling each by the derivative that relates i and j. This is equivalent to determining the δ terms in backpropagation. Lastly, we see that we can solve for the Lagrange multipliers and the intermediate variables by performing back-substitution. We cannot, however, obtain a closed-form solution for the input variables; we can only solve for their multipliers and thus obtain the gradient of our original function. To solve for the input variables, we have to perform something like gradient ascent.

The author goes on to briefly describe how cyclic constraints could also be handled by this approach. The feedforward procedure becomes more difficult and a linear solver is required to determine the Lagrangian multipliers, but the procedure essentially stays the same.

The main benefit of thinking about backpropagation in the Lagrangian sense is the ability to apply constrained optimization algorithms to our problem. The author doesn’t go into much detail on use cases and benefits for this approach, but it isn’t hard to imagine that considering other algorithms might prove useful given certain problems. The connection between backpropagation and the method of Lagrange multipliers is more than just a curiosity.

As a final note, we found the following paper helpful in understanding the ideas presented in the blog post: A Theoretical Framework for Back-Propagation (LeCun, 1988). It presents the same idea as the blog post, organized in a very similar manner. The notation in this paper is closer to how backpropagation is typically formulated (the blog post deals with a slightly more general form), so the connection between backpropagation and the method of Lagrange multipliers may become clearer after reading this resource.

Training Neural Networks Without Gradients: A Scalable ADMM Approach


Introduction

 

Much research has been conducted on training neural networks. The most common method is stochastic gradient descent combined with backpropagation. The paper notes the weaknesses of this conventional method and proposes an alternative based on the alternating direction method of multipliers (ADMM) with Bregman iteration.

Why do we need the proposed method?

A weakness of conventional, stochastic-gradient-based optimization of neural networks is that it does not scale well to a large number of cores in a cluster setting. Also, because the method is gradient-based, there are convergence problems such as saturation effects, getting stuck in local optima, and vanishing gradients. In addition, backpropagation does not easily parallelize over layers.

To solve these problems, the proposed method separates the objective function at each layer of the neural network into two terms.

The first term measures the relation between the weights and the input activations. The second term contains the nonlinear activation function.

This structure allows the weights to be updated without the effects of vanishing gradients or getting stuck in local optima. The form of the objective also allows the weights of every layer to be updated independently, enabling parallelization over layers.

By separating the objective function, the proposed method decomposes training into a sequence of minimization sub-problems (the ADMM scheme), each of which can be solved globally and in closed form. Also, by combining it with Bregman iteration, the solution of the proposed method becomes stable.

In addition, the proposed method exhibits linear scaling when data is parallelized across cores.

Notation

Let’s define the notation before examining the model of the proposed method.

The neural network consists of L layers, where each layer is defined by a linear operator W followed by a nonlinear activation.

The objective function is shown above.

Model of proposed method

Now, let’s check the model.

 

We want to minimize the loss function. The two constraints correspond to the layer structure explained above. To obtain an unconstrained problem, they add penalty terms.

The proposed method uses a Lagrange multiplier only for the constraint related to the objective (output) term. Thus the model becomes as below.

Compared to classical ADMM, which uses Lagrange multipliers for all constraints, the proposed method applies one only to the objective term.

This is because the proposed scheme involves more than two coupled variable blocks and a non-smooth function, so it lies outside the scope of known convergence results for ADMM. Thus they cannot apply multipliers to all constraints, but only to the objective term, and this is where Bregman iteration enters the proposed method: the convergence of Bregman iteration is well understood in the presence of linear constraints.

Before looking at Bregman iteration, let's see the sub-problems that make up the ADMM scheme of the proposed method.

There are several quantities to update in the neural network: the weights, the input activations, the output activations, and the Lagrange multiplier.

  • Weight update (LSE)

  • Activation update (LSE)

  • Output update

The sub-problems of updating the weights and the activations are easy: each can be solved as a least-squares (LSE) problem.

The output update is the more challenging step in the ADMM scheme. However, since the entries in Z_{l} are decoupled, the problem decomposes into a large number of one-dimensional problems, one for each entry in Z_{l}.
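For intuition, the weight sub-problem is an ordinary least-squares fit of W_{l} to the current targets. A minimal NumPy sketch (the shapes and names are my own assumptions, not the paper's code):

    import numpy as np

    def update_weights(z_l, a_prev):
        """Least-squares weight update: W_l <- z_l a_{l-1}^+ (Moore-Penrose pseudo-inverse).

        z_l    : (n_l, n_samples) current layer-l pre-activation targets
        a_prev : (n_{l-1}, n_samples) activations of the previous layer
        """
        return z_l @ np.linalg.pinv(a_prev)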

Bregman iteration

Now let’s look at the update of the Lagrange multiplier and at Bregman iteration. We already saw why the proposed method uses Bregman iteration.

The Bregman distance associated with a convex function E at the point v is defined below.
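In standard notation (my reconstruction), for a subgradient  p \in \partial E(v) :

D_{E}^{p}(u, v) = E(u) - E(v) - \langle p, u - v \rangle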

Geometrically, we can interpret it as below:

the gap between the function value at u and the tangent line (linearization) of E taken at v.

Starting from the problem of minimizing a convex function E subject to a linear constraint H, with penalty coefficient lambda, Bregman iteration performs the updates below.

This shows how the subgradient is updated in Bregman iteration.

Let’s come back to our problem.

With respect to Z_{l}, we can obtain

We can check that the above equation is identical to the subgradient update in Bregman iteration. Thus, we can interpret the Lagrange multiplier as the subgradient in Bregman iteration.

I think this is the reason the Lagrange multiplier is applied only to the objective term: to use Bregman iteration, they need the Lagrange-multiplier term related to the objective function to be updated like the subgradient in Bregman iteration.

Below is the algorithm for the proposed method, combining everything mentioned so far.

Implementation & Data parallel

  • Distributing the algorithm across N worker nodes. (CPU parallel computation)
  • Different nodes store activations and outputs corresponding to different subsets of training data.
  • To obtain a good initial value for convergence, they first run several iterations of the proposed method without Lagrange-multiplier updates.
  • They set large values for the penalty coefficients.
  • The training labels are binary class labels.
  • The loss function is the hinge loss, which is convex.

Experiment Result

  • They compare the proposed method with GPU-based gradient methods: conjugate gradients, SGD, and L-BFGS.

SVHN DATA

First, they focus on the problem posed by the SVHN dataset. For this dataset, they optimized a net with two hidden layers of 100 and 50 nodes and ReLU activation functions. The dataset is relatively small and easy to train on.

Above are the results. The first graph shows the time required for the proposed method to reach 95% test accuracy versus the number of cores; it demonstrates the scaling advantage of the proposed method. The second graph shows test set predictive accuracy as a function of time (in seconds) for each method; here we cannot say that the proposed method outperforms the other methods.

Second, they focus on the Higgs dataset, which is larger and more difficult to train on. They optimized a simple network with ReLU activation functions and a hidden layer of 300 nodes.

Above are the results. The first graph again shows the scalability of the proposed method. The second graph shows that, on this dataset, the proposed method (blue) outperforms the GPU-based gradient methods.

 

Conclusion

  • They present a method for training neural networks without using gradient steps.
  • The proposed method avoids many difficulties of gradient-based methods.
  • Performance of the proposed method scales linearly up to thousands of cores.
  • The proposed approach outperforms other methods on problems involving extremely large datasets.

Federated Optimization: Distributed Machine Learning for On-Device Intelligence

Introduction

Standard machine learning approaches require centralizing the training data on one machine or in a datacenter. This work introduces an additional approach: Federated Learning, which leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally computed updates via a central coordinating server.

State-of-the-art optimization algorithms are typically inherently sequential. Moreover, they usually rely on performing a large number of very fast iterations. However, a round of communication is much more time-consuming than a single iteration of the algorithm, which has led to the development of novel algorithms specialized for distributed optimization.

Problem Formulation

The model proposed in this paper considers the problem of minimizing an objective function that has the form of a sum:
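Presumably this is the standard finite-sum (empirical risk) objective,

f(w) = \frac{1}{n} \sum_{i=1}^{n} f_{i}(w)

where f_{i}(w) is the loss of the model w on the i^{th} data point (this reconstruction is mine; the paper's exact notation may differ).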

Examples of problems with this structure include empirical-risk-minimization problems such as linear regression and logistic regression.

The Setting of Federated Optimization

Since communication efficiency is of the utmost importance, algorithms for federated optimization must handle the following characteristics.

Massively Distributed:

Data points are stored across a large number of nodes K. In particular, the number of nodes can be much bigger than the average number of training examples stored on a given node (n/K).

Non-IID:

Data on each node may be drawn from a different distribution; that is, the data points available locally are far from being a representative sample of the overall distribution.

Unbalanced:

Different nodes may vary by orders of magnitude in the number of training examples they hold.

Federated Learning

Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. This goes beyond the use of local models that make predictions on mobile devices by bringing model training to the device as well.

It works like this: your device downloads the current model, improves it by learning from data on your phone, and then summarizes the changes as a small focused update. Only this update to the model is sent to the cloud, using encrypted communication, where it is immediately averaged with other user updates to improve the shared model. All the training data remains on your device, and no individual updates are stored in the cloud.

Distributed settings and desirable algorithmic properties

We consider two distributed settings in this work. On a single machine, we compute the execution time as

\mathcal{E} = \mathcal{I}_{A}(\epsilon) \cdot \mathcal{T}_{A}

where \mathcal{I}_{A}(\epsilon) is the number of iterations algorithm A needs to converge to some fixed accuracy \epsilon, and \mathcal{T}_{A} is the time needed for a single iteration.

On naturally distributed data, the execution time also includes the communication overhead c of a single round, giving \mathcal{E} = \mathcal{I}_{A}(\epsilon) \cdot (c + \mathcal{T}_{A}), with c \gg \mathcal{T}_{A}.

Desirable algorithmic properties for the non-IID, unbalanced, and massively-distributed setting are:

Property (A) is valuable in any optimization setting. Properties (B) and (C) are extreme cases of the federated optimization setting (non-IID, unbalanced, and sparse), whereas (D) is an extreme case of the classic distributed optimization setting (large amounts of IID data per machine). Thus, (D) is the least important property for algorithms in the federated optimization setting.

SVRG algorithm on a single node

The central idea is that the algorithm evaluates two stochastic gradients, ∇f_i(w) and ∇f_i(w_t), to estimate the change of the gradient of the entire function between the points w_t and w, namely ∇f(w) − ∇f(w_t).

Under a distributed setting, we modify the objective as below:

The original problem formulation is  If we define  then 

Therefore, the objective is changed to

Distributed Approximate Newton algorithm (DANE)

The main idea of DANE is to form a local subproblem, dependent only on the local data and on the gradient of the entire function, which can be computed in a single round of communication.

Naive Federated SVRG

This proposition shows the effectiveness of the naive federated SVRG algorithm. However, the algorithm is inherently stochastic, and the analysis is valid only under an identical sequence of samples. This work further improves the algorithm to obtain Federated SVRG.

The Federated SVRG

The notations used in this work are summarized below:

Federated SVRG improves on naive federated SVRG through four modifications:

Since the local data sizes n_k are not identical, the step size h_k is set inversely proportional to n_k, making sure each node makes progress of roughly the same magnitude overall.

If K is larger than 1, the values of G_k are in general biased estimates of ∇f(w_t).

This motivates aggregating the updates from the nodes proportionally to n_k, the number of data points available locally.

Because of the uneven distribution of nonzeros for a particular feature, repeatedly sampling it and overshooting the magnitude of the gradient would likely cause the iterative process to diverge quickly, so the stochastic gradients are scaled by the diagonal matrix S_k.

If a variable appears in the data on every node, we simply take the average. However, the fewer nodes a particular variable appears on, the more we want to trust those few nodes to inform us about the meaningful update to this variable, or, alternatively, the longer the step we want to take. Hence the per-variable scaling of aggregated updates.
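A toy sketch of the data-proportional aggregation step (my own simplification; the full FSVRG update also applies the per-variable diagonal scaling discussed above):

    import numpy as np

    def aggregate(w_t, local_iterates, n_k):
        """Combine node results in proportion to their local data sizes n_k."""
        n = float(sum(n_k))
        update = np.zeros_like(w_t)
        for nk, wk in zip(n_k, local_iterates):
            update += (nk / n) * (wk - w_t)
        return w_t + update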

Experiments

They provide results on a dataset based on public Google+ posts, clustered by user, simulating each user as an independent node.

The following figure shows the frequency of different features across nodes. In particular, over 88% of features are present on fewer than 1000 nodes, which illustrates the structure of the sparsity pattern.

Figure 1: Features vs. appearance on nodes. The x-axis is a feature index, and the y-axis represents the number of nodes where a given feature is present.

Figure 2: Rounds of communication vs. objective function (left) and test prediction error (right).

Figure 2 shows that FSVRG converges to the optimal test classification accuracy in just 30 rounds of communication. Since the core reason other methods fail to converge is the non-IID data distribution, the experiment also tests the FSVRG algorithm with the data randomly reshuffled among the same number of nodes, which illustrates that the algorithm is robust to the challenges present in federated optimization.

Proximal Algorithms

Introduction

Proximal algorithms are a class of algorithms that can be used to solve constrained optimization problems that may involve non-smooth penalties in the objective function. Their formulation is rather general in that a number of other optimization algorithms can be derived as special cases of proximal algorithms. At the core of  proximal algorithms exists the evaluation of the proximal operator:

\textnormal{prox}_{\gamma, f} (x)= \textnormal{argmin}_{z} \left\{f(z) + \frac{1}{2\gamma} \left\|x-z\right\|^{2}_{2}\right\}

The proximal operator returns a point in the search space that minimizes the problem defined by the Moreau envelope of the objective function f [2]:

f^{\gamma}(x) = \inf_{z} \left\{f(z) + \frac{1}{2\gamma} \left\|x-z\right\|^{2}_{2}\right\}

The Moreau envelope is a regularized version of the function f. It is continuously differentiable and defined in all of \mathbb{R}^{n}, even though f may not satisfy either condition,  and has the same set of minimizers as f [1,2].

Example 1. Let f(x) = |x|. The Moreau envelope of  f(x) is f^{\gamma}(x) and for \gamma = 1:

Moreau envelope of Abs(x)

Notice that the Moreau envelope is smooth even though f itself is not.
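For reference, the envelope in this example can be computed in closed form (it is the Huber function; easy to verify from the definition above):

f^{1}(x) = \frac{1}{2}x^{2} \ \text{ for } |x| \leq 1, \qquad f^{1}(x) = |x| - \frac{1}{2} \ \text{ for } |x| > 1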

Standard Problem Formulation

Consider the minimization of functions of the form

\textnormal{argmin}_{x \in \mathcal{X}}\, F(x) := g(x) + h(x)\qquad\qquad (1)

where g(x) is convex and differentiable and h(x) is convex but non-smooth. Such objective functions arise when a penalty such as the lasso is added to the original objective function g(x) to induce sparsity in the solution.

Gradient descent can be used to minimize the original objective g(x), as it is smooth and convex, and we converge to an \epsilon-suboptimal solution in O(1/\epsilon) iterations. Adding a penalty or regularization term h(x) that is non-smooth, however, may cause gradient descent to fail or converge to an incorrect point. In such a circumstance, subgradient methods [3] are used to minimize the non-smooth function F(x). The drawback of subgradient methods is that they converge far more slowly than gradient descent: an \epsilon-suboptimal solution is obtained only after O(1/\epsilon^2) iterations.

When the solution to the proximal operator of h(x) is known in closed form, proximal algorithms can be used to significantly speed up convergence. Table 1 in [2] lists the closed-form solution of the proximal operator for an extensive list of non-smooth functions. Convergence to an \epsilon-suboptimal solution can be attained after O(1/\epsilon) calls to the prox operator. Therefore, if evaluating the prox operator is cheap, convergence is as fast as in the case of gradient descent.

Proximal Gradient Descent Method

Proximal gradient descent method can be employed to solve optimization problems of type (1). It is composed of the following two steps:

  1. Gradient step: Starting at x_k, take a step in the direction of the gradient of the differentiable part, g(x):
    x_{k}^{+} = x_{k} - \gamma \nabla g(x_k)
  2. Evaluate prox operator: Starting at x_k^{+}, evaluate the prox operator of the non-smooth part, h(x):
    x_{k+1} = \textnormal{prox}_{\gamma, h}(x_{k}^{+}) = \textnormal{prox}_{\gamma, h}(x_k - \gamma \nabla g(x_k))
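As a concrete instance, take g(x) = \frac{1}{2}\left\|Ax - b\right\|_{2}^{2} and h(x) = \lambda \left\|x\right\|_{1} (the lasso); the prox of h is soft thresholding, so the two steps above become the following sketch (step size and iteration count are illustrative):

    import numpy as np

    def soft_threshold(x, t):
        # Closed-form prox of t * ||.||_1
        return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

    def proximal_gradient_lasso(A, b, lam, gamma, n_iter=500):
        """Proximal gradient descent for 0.5*||Ax-b||^2 + lam*||x||_1.
        A sensible step size is gamma = 1 / ||A||_2^2."""
        x = np.zeros(A.shape[1])
        for _ in range(n_iter):
            grad = A.T @ (A @ x - b)                            # gradient step on g
            x = soft_threshold(x - gamma * grad, gamma * lam)   # prox step on h
        return x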

In [2], the authors give a glimpse into how the above recipe can be motivated in two different ways — first as an MM-algorithm and second by deriving optimality conditions using subdifferential calculus.

Proximal Gradient Descent method will require O(1/\epsilon) calls to the prox operator in order to obtain an \epsilon-suboptimal solution.

Proximal Newton Method

In the Proximal Gradient Descent Method, the gradient step is obtained by constructing a quadratic approximation to the differentiable function, using the scaled identity I/\gamma as an approximation to the Hessian of g(x).

The Proximal Newton Method is obtained by using the actual Hessian, H := \nabla^{2}g(x), to build the quadratic approximation of g(x). Since g(x) is convex, H \succeq 0. Defining the scaled proximal map as in [4]:

\textnormal{prox}_{H}(x) = \textnormal{argmin}_{z} \frac{1}{2}\left\|x-z\right\|^{2}_{H} + h(z)

where \left\|x\right\|^{2}_{H} = x^{T}Hx, the Proximal Newton Method can be written as [4]:

  1. Newton step: Starting at x_k, take a Newton step:
    x_{k}^{+} = x_{k} - H_{k}^{-1} \nabla g(x_k)
  2. Evaluate prox operator: Starting at x_k^{+}, evaluate the scaled prox operator:
    z_{k} = \textnormal{prox}_{H_{k}}(x_{k}^{+}) = \textnormal{prox}_{H_{k}}(x_{k} - H_{k}^{-1} \nabla g(x_k))
  3. Take a step in direction (z^{k} - x^{k}): Starting at x_k, evaluate x_{k+1} as:
    x_{k+1} = x_{k} + t_{k} (z_{k} - x_{k})

It should be noted that the proximal operator in the case of the Proximal Newton Method relies on the computation of the Hessian. Hence, any line search strategy to identify an appropriate step size will prove expensive.

Proximal Gradient Descent Method can be recovered from Proximal Newton Method by replacing the Hessian with H = I/\gamma. Whereas the Proximal Gradient Descent Method converges linearly, Proximal Newton Method shows local quadratic convergence [4].

Nesterov Acceleration

Nesterov shows in [6] that the optimal convergence rate for first-order methods is O(1/\sqrt{\epsilon}). In [5], he introduces ideas to accelerate the convergence of smooth functions and shows that the optimal rate is achievable. The Accelerated Proximal Gradient Descent Method is built upon the foundations of the ideas laid out in [5]. The method is structured as follows [4]:

  1. Choose initial point: x_{-1} = x_{0} \in \mathbb{R}^{n}
  2. Compute an intermediate vector x_{k-1}^{+}: Starting at x_{k-2} and x_{k-1}, evaluate:
    x_{k-1}^{+} = x_{k-1} + \frac{k-2}{k+1} \left(x_{k - 1} - x_{k - 2}\right)
  3. Evaluate the prox operator after taking a gradient step using x_{k-1}^{+}: Starting at x_{k-1}^{+}, evaluate x_{k} as:
    x_{k} = \textnormal{prox}_{t_k, h}(x_{k-1}^{+} - t_{k} \nabla g(x_{k-1}^{+}))

The interpretation of

x_{k-1}^{+} = x_{k-1} + \frac{k-2}{k+1} \left(x_{k - 1} - x_{k - 2}\right)

is the same as momentum. If h(x) \equiv 0, then the prox operator is identity and accelerated gradient descent method is recovered [4].

Summary

Proximal algorithms can be used to solve constrained optimization problems whose objective can be split into the sum of a convex differentiable part and a convex non-smooth part. If the prox operator is cheap to evaluate, then the convergence rate of gradient descent is recovered. Several other algorithms can be recast in terms of a proximal method [1,2]. Although closed-form solutions to the prox operator may seem to be required, in [7] the authors study the convergence rates when the prox operator is evaluated in an inexact manner.

References

[1] Parikh, N. and Boyd, S., 2014. Proximal algorithms. Foundations and Trends in Optimization, 1(3), pp. 127-239. [paper]

[2] Polson, N.G., Scott, J.G. and Willard, B.T., 2015. Proximal algorithms in statistics and machine learning. Statistical Science, 30(4), pp. 559-581. [paper]

[3] Boyd, S., Xiao, L. and Mutapcic, A., 2003. Subgradient methods. Lecture notes of EE392o, Stanford University, Autumn Quarter 2003-2004. [monograph]

[4] Tibshirani, R., 2016. Proximal Newton Method. Slides from 10-725, Carnegie Mellon University, Fall 2016. [presentation]

[5] Nesterov, Y., 1983. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2), pp. 372-376. [paper]

[6] Nesterov, Y., 2013. Introductory lectures on convex optimization: A basic course (Vol. 87). Springer Science & Business Media. [book]

[7] Schmidt, M., Roux, N.L. and Bach, F.R., 2011. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in neural information processing systems (pp. 1458-1466). [paper]

Learning to learn by gradient descent by gradient descent

Introduction

Tasks in ML are typically defined as finding a minimizer of an objective function f(\theta) over some domain \theta \in \Theta. The minimization itself is performed using gradient-descent-based methods, where the parameters are updated using gradient information: \theta_{t+1} = \theta_{t} - \alpha_{t} \nabla f(\theta_{t}). Many modifications have been proposed over the years that try to improve descent-based methods by incorporating various techniques: exploiting the geometry of the problem, adaptive learning rates, and momentum.

The paper approaches optimization from a novel perspective: using learned update rules, capable of generalizing to a specific class of problems. The update to the current set of parameters is predicted by a neural network, specifically an RNN-based network (\theta_{t+1} = \theta_{t} + g_{t}).

Framework

Given an optimizee f(\theta), parameterized by \theta, and an optimizer m, parameterized by \phi:

The objective function for the optimizer is

L(\phi) = \mathop{\mathbb{E}}_{f}{ \lbrack \sum_{t=1}^{T} w_{t} f(\theta_{t}) \rbrack }

where \theta_{t+1} = \theta_{t} + g_{t}

and g_t , h_{t+1} = m( \nabla_{t} , h_{t}, \phi)  .

The update step g_{t} and next hidden state h_{t+1} are the output of the recurrent neural network architecture parameterized by \phi.

w_{t} \in R_{\geq 0} are weights associated with the values of the optimizee over the last T time steps; w_{t} = 1 is used in all the experiments. The optimizer parameters \phi are trained with truncated back-propagation through time. The gradient flow can be seen in the computational graph below. The gradients along the dashed lines are ignored in order to avoid expensive second-derivative calculations.

 

Coordinatewise LSTM architecture

Applying a different LSTM per optimizee parameter would have been computationally very expensive and would have introduced tens of thousands of additional parameters to optimize.
To avoid this issue, the paper utilizes a coordinatewise network architecture (LSTM-based), where the optimizer parameters are shared across the different parameters of the optimizee and a separate hidden state is maintained for each optimizee parameter. The base architecture is a two-layer LSTM using the standard forget-gate architecture.

 

Preprocessing

One key difficulty when passing the gradients of the parameters to the optimizer network is how to handle the different scales of the gradients. Typical deep networks work well only when their inputs are not arbitrarily scaled.

Instead of passing the gradient \nabla directly, the pair (\log |\nabla|, \mathrm{sgn}(\nabla)), i.e. the logarithm of the gradient magnitude and the sign of the gradient, is passed to the coordinatewise architecture.
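A rough NumPy sketch of this preprocessing, following my reading of the paper's appendix (the constant p and the small-gradient branch are approximate, so treat the details as assumptions):

    import numpy as np

    def preprocess_gradient(g, p=10.0):
        """Map a raw gradient to (scaled log-magnitude, sign); gradients smaller
        than exp(-p) get a separate, rescaled encoding."""
        big = np.abs(g) >= np.exp(-p)
        first = np.where(big, np.log(np.maximum(np.abs(g), 1e-45)) / p, -1.0)
        second = np.where(big, np.sign(g), np.exp(p) * g)
        return np.stack([first, second], axis=-1)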

Experiment

The authors used two-layer LSTMs with 20 hidden units in each layer for the optimizer. Each optimizer is trained by minimizing the loss above using truncated BPTT; Adam, with the learning rate chosen by random line search, is used for this minimization. The authors compare the trained optimizers with standard optimizers such as SGD, RMSProp, Adam, and NAG. The learning rates of these baselines are tuned, while the other parameters are set to the default values in Torch7.

The authors evaluated a class of 10-dimensional synthetic quadratic functions whose objective is of the form:
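As best I recall, the objective is

f(\theta) = \left\| W\theta - y \right\|_{2}^{2}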

where W is a 10x10 matrix and y is a 10-dimensional vector, both drawn from an IID Gaussian distribution. As shown in the figure below, the LSTM-based optimizer performs much better than the standard optimizers.

The authors also trained the optimizer on a neural network with one hidden layer of 20 hidden units and sigmoid activations on the MNIST training set. They show experimentally that the learned optimizer generalizes to different architectures with more layers (2) and more hidden units (40), still doing better than the standard optimizers, as shown below.

However, when the architecture is changed drastically, e.g. using ReLU activations instead of sigmoid, the LSTM optimizer does not generalize well, as shown below.

The authors also used the LSTM optimizer for neural art, generating multiple image styles using 1 style image and 1800 content images. They found that the LSTM optimizer outperforms standard optimizers on test content images at the training resolution as well as at twice the training resolution, as shown below.

Conclusion

The authors have shown how to cast the design of optimization algorithms as a learning problem. Their experiments show that learned neural optimizers can perform better than widely used hand-designed optimizers on the problem classes they were trained for.

References

[1]  Learning to Learn by gradient descent by gradient descent( https://arxiv.org/abs/1606.04474)

A Bayesian Perspective On Generalization And Stochastic Gradient Descent. Part 2

Section 4. Bayes Theorem and Stochastic Gradient Descent

In Section 4, the main idea is that “Bayesian principles account for the generalization gap”:

  1. The test set accuracy often falls as the SGD batch size is increased (holding all other hyper-parameters constant).
  2. The gradient drives SGD towards deep minima, while the noise drives SGD towards broad minima.
  3. We therefore expect the test set performance to show a peak at an optimal batch size, which balances these competing contributions to the evidence.

To evaluate the claim that there is a “generalization gap”, they use a shallow neural network with 800 hidden units and ReLU activations, trained on MNIST without regularization. They use SGD with a momentum parameter of 0.9 and a constant learning rate of 1.0, which neither depends on the batch size nor decays during training. The model is trained on just 1000 images, selected at random from the MNIST training set; this makes it feasible to compare small-batch to full-batch training.

In figure 3, they exhibit the evolution of the test accuracy and test cross-entropy during training. The small batches are composed of 30 images, randomly sampled from the training set. Looking first at figure 3a, small-batch training takes longer to converge, but after a thousand gradient updates a clear generalization gap in model accuracy emerges between small and large training batches.

In figure 4a, they exhibit training curves for a range of batch sizes between 1 and 1000. They find that the model cannot train when the batch size B is less than 10. In figure 4b they plot the mean test set accuracy after 10000 training steps. A clear peak emerges, indicating that there is indeed an optimum batch size which maximizes the test accuracy, consistent with the Bayesian intuition.

Section 5. Stochastic Differential Equations and the Scaling Rules

Based on the above evaluation, they conclude that:

  1. The test accuracy peaks at an optimal batch size, if one holds the other SGD hyper-parameters constant.
  2. The authors argued that this peak arises from the tradeoff between depth and breadth in the Bayesian evidence.
  3. However it is not the batch size itself which controls this tradeoff, but the underlying scale of random fluctuations in the SGD dynamics.
  4. They now identify this SGD “noise scale”, and use it to derive three scaling rules which predict how the optimal batch size depends on the learning rate, training set size, and momentum coefficient.

To obtain the relationship between the noise scale and the other four quantities (batch size, learning rate, training set size, and momentum coefficient), they use the following ingredients:

1. Central limit theorem:

2. Model the gradient error with Gaussian random noise:

3. Stochastic differential equation:

Based on the three ingredients above, they obtain the noise scale:

The noise scale falls when the batch size increases, consistent with the earlier observation of an optimal batch size B_opt while holding the other hyper-parameters fixed. Notice that one would equivalently observe an optimal learning rate if one held the batch size constant. When we vary the learning rate or the training set size, we should keep the noise scale fixed, which implies:
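The implied relations (my reconstruction of the scaling rules) are

g = \epsilon\left( \frac{N}{B} - 1 \right) \approx \frac{\epsilon N}{B} \quad (B \ll N), \qquad \text{so} \quad B_{opt} \propto \epsilon N

and, with a momentum coefficient m, g \approx \frac{\epsilon N}{B(1 - m)}, so B_{opt} \propto \frac{1}{1 - m}.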

This scaling rule allows us to increase the learning rate with no loss in test accuracy and no increase in computational cost, simply by simultaneously increasing the batch size. We can then exploit increased parallelism across multiple GPUs, reducing model training times.

In figure 5a, the authors plot the test accuracy as a function of batch size after (10000/\epsilon) training steps, for a range of learning rates \epsilon. Exactly as predicted, the peak moves to the right as \epsilon increases. In figure 5b, they plot the best-observed batch size as a function of the learning rate, observing a clear linear trend. The error bars indicate the distance from the best-observed batch size to the next batch size sampled in their experiments.

In figure 6a they exhibit the test set accuracy as a function of batch size, for a range of training set sizes after 10000 steps. Once again, the peak shifts right as the training set size rises, although the generalization gap becomes less pronounced as the training set size increases.

In figure 7a, they plot the test set performance as a function of batch size after 10000 gradient updates, for a range of momentum coefficients. In figure 7b, the authors plot the best-observed batch size as a function of the momentum coefficient, and fit the results to the scaling rule above, obtaining remarkably good agreement. They propose a simple heuristic for tuning the batch size, learning rate, and momentum coefficient.

A Bayesian Perspective On Generalization And Stochastic Gradient Descent. Part 1

1. Background

Zhang et al. (2016) observed a very “surprising” result. They trained deep networks on the same input images but with randomized labels, and found that the networks could still memorize the training labels perfectly, even though they obviously could not predict the test set. This raises a question: why do these models generalize well on real data when they are expressive enough to memorize random labels?

Some useful results:

  • Holding the learning rate fixed and increasing the batch size, the test accuracy usually falls.
  • a linear scaling rule between batch size and learning rate
  • a square root rule on theoretical grounds

2. Empirical principle

  • “broad minima”>”sharp minima”:curvature of  “broad minima” is small
  • stochastic gradient descent (SGD) finds wider minima as the batch size is reduced.
  • the curvature of a minimum can be arbitrarily increased by changing the model parameterization

In this paper, they consider the following two questions from a Bayesian perspective

a) how can we predict if a minimum will generalize to the test set?

b) why does stochastic gradient descent find minima that generalize well?

3. Bayesian Model Comparison

We consider a classification model with a single parameter ω

The first equation is the formula for the posterior distribution; the second equation is the likelihood, and we choose a Gaussian prior.

Therefore, we can write the posterior distribution as below:

Given a new input x_{t} for which we want to predict an unknown label y_{t}, we need to compute an integral over w; however, this integral is dominated by the region near w_{0}, so it can be approximately computed as above.

Probability ratio:

The second factor on the right is the prior ratio, which describes which model is a priori more plausible. To avoid unnecessary subjectivity, we usually set this to 1. Meanwhile, the first factor on the right is the evidence ratio, which controls how much the training data changes our prior beliefs. To calculate the evidence P(y|x; M), we need to integrate over w. Since this integral is dominated by the region near the minimum w_0, we can estimate the evidence by a Taylor expansion and the Laplace approximation. Finally, we get the evidence for a model with a single parameter.
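The Laplace approximation used here has the generic form (a standard result, written in my own notation): letting  C(w) = -\log\left[ P(y|x, w) P(w) \right]  and  w_{0}  its minimizer,

P(y|x; M) = \int e^{-C(w)} dw \approx e^{-C(w_{0})} (2\pi)^{p/2} \left| \nabla\nabla C(w_{0}) \right|^{-1/2}

where p is the number of parameters.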

Then we can generalize this to a model with many parameters, as below:

“Occam factor” enforces Occam’s razor: when two models describe the data equally well, the simpler model is usually better.

Next, we need to compare model M with the null model, in which all labels are assigned randomly with equal probability. We get the evidence ratio:

E(w_0) is the log evidence ratio in favor of the null model.

From these formulas, we can get:

  1. Broad minima generalize better than sharp minima, and the evidence, unlike the curvature, does not depend on the parameterization.
  2. One needs to rescale \lambda_i and \lambda together to keep the model the same under reparameterization.
  3. The evidence is hard to evaluate for deep networks, since there are millions of parameters.

4. Bayes Theorem and Generalization

  • Consider a simple example: logistic regression with labels 0/1.

Task a: the labels of both the training and test sets are randomized.

Task b: the labels are informative, matching the true labels.

  • replicating the observations of Zhang et al. (2016) in deep networks

Experiment results:

Figure 1 shows that the model perfectly memorizes the random labels.

Figure 2(a) shows that the randomly-labeled model can never be expected to generalize well; Figure 2(b) shows that the informatively-labeled model is much better than the null model according to the Bayesian evidence and the cross-entropy.

Reference:

  1. A Bayesian Perspective on Generalization and Stochastic Gradient Descent (Smith and Le, 2017).
  2. Related material.