We define standard gradient descent as:
$latex w^{k+1} = w^k - \alpha \nabla f(w^k) $
Momentum adds a relatively subtle change to gradient descent by replacing $latex \nabla f $ with $latex z $, where
$latex z^{k+1} = \beta z^k + \nabla f(w^k) $
$latex w^{k+1} = w^k - \alpha z^{k+1} $
As such, alpha retains its role as the step size, and we introduce a new constant, beta. Beta applies a decaying effect to the accumulated momentum, and must be strictly between 0 and 1. We briefly demonstrate the edge cases of beta below:
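Both edge cases follow directly from the update above, taking the usual initialization $latex z^0 = 0 $ (an assumption here). When $latex \beta = 0 $, the update collapses to $latex z^{k+1} = \nabla f(w^k) $, so $latex w^{k+1} = w^k - \alpha \nabla f(w^k) $ and we recover plain gradient descent. When $latex \beta = 1 $, nothing decays: $latex z^{k+1} = \nabla f(w^k) + \nabla f(w^{k-1}) + \dots + \nabla f(w^0) $ keeps every past gradient at full weight, so the iterates need not settle down.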
We now introduce a change of basis which will be convenient for later. Consider a matrix form of gradient descent, obtained by running the update on the convex quadratic
$latex f(w) = \frac{1}{2}w^T A w - b^T w $
with $latex A $ symmetric and positive definite, so that $latex \nabla f(w) = Aw - b $ and the iteration becomes
$latex w^{k+1} = w^k - \alpha (A w^k - b) $
which has optimal solution
$latex w^\star = A^{-1} b $
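As a quick numerical sketch of both updates on this kind of quadratic (the particular A, b, step size, momentum constant, and iteration count below are illustrative choices, not values from this post), both iterates should approach $latex w^\star = A^{-1}b $:

```python
import numpy as np

# Toy quadratic f(w) = 0.5 * w^T A w - b^T w.
# A, b, alpha, beta, and the iteration count are illustrative choices.
A = np.array([[2.0, 0.0],
              [0.0, 10.0]])          # symmetric positive definite
b = np.array([1.0, 1.0])
w_star = np.linalg.solve(A, b)       # optimal solution w* = A^{-1} b

alpha, beta = 0.15, 0.8              # step size and momentum constant
w_gd = np.zeros(2)                   # plain gradient descent iterate
w_mom = np.zeros(2)                  # momentum iterate
z = np.zeros(2)                      # accumulated (decaying) momentum

for _ in range(200):
    w_gd = w_gd - alpha * (A @ w_gd - b)   # w^{k+1} = w^k - alpha * grad f(w^k)

    z = beta * z + (A @ w_mom - b)         # z^{k+1} = beta * z^k + grad f(w^k)
    w_mom = w_mom - alpha * z              # w^{k+1} = w^k - alpha * z^{k+1}

print(np.linalg.norm(w_gd - w_star))    # both distances should be close to zero
print(np.linalg.norm(w_mom - w_star))
```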
We consider the eigenvectors of A, which give a natural basis in which to view the iteration: one where all dimensions act independently.
Given:
$latex A = Q \text{ diag }(\lambda_1 \dots \lambda_n)Q^T $
We may derive this view by changing coordinates to $latex x^k = Q^T(w^k - w^\star) $, in which the update acts on each component independently:
$latex x_i^{k+1} = x_i^k - \alpha \lambda_i x_i^k = (1-\alpha\lambda_i) x_i^k $
so that $latex x_i^k = (1-\alpha\lambda_i)^k x_i^0 $, and back in the original coordinates
$latex w^k - w^\star = Qx^k = \sum_i^n x_i^0(1-\alpha \lambda_i)^k q_i $
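The closed form is easy to check numerically. In the sketch below, the matrix, vector, starting point, step size, and step count are again arbitrary illustrative choices; the iterated coordinates in the eigenbasis should match $latex (1-\alpha\lambda_i)^k x_i^0 $ up to floating point error:

```python
import numpy as np

# Verify x_i^k = (1 - alpha * lambda_i)^k * x_i^0 for gradient descent
# on f(w) = 0.5 * w^T A w - b^T w.  All concrete values are illustrative.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
alpha, k = 0.2, 25

lam, Q = np.linalg.eigh(A)            # A = Q diag(lam) Q^T
w_star = np.linalg.solve(A, b)

w = np.array([2.0, 2.0])              # arbitrary starting point w^0
x0 = Q.T @ (w - w_star)               # eigenbasis coordinates of the initial error

for _ in range(k):                    # k steps of gradient descent
    w = w - alpha * (A @ w - b)

x_iterated = Q.T @ (w - w_star)       # eigenbasis coordinates after k steps
x_closed = (1 - alpha * lam) ** k * x0

print(np.allclose(x_iterated, x_closed))   # expect True
```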
Thus, we can now view gradient descent in the basis of the eigenvectors of A, and decompose the error into the contributions of the individual components:
$latex f(w^k)-f(w^\star) = \frac{1}{2}\sum_i (1 - \alpha \lambda_i)^{2k} \lambda_i [x_i^0]^2 $
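To see where this expression comes from, expand the quadratic around the optimum, using $latex b = Aw^\star $ and the eigenbasis coordinates $latex x^k $ from above:
$latex f(w^k)-f(w^\star) = \frac{1}{2}(w^k-w^\star)^T A (w^k-w^\star) = \frac{1}{2}\sum_i \lambda_i [x_i^k]^2 $
and substituting $latex x_i^k = (1-\alpha\lambda_i)^k x_i^0 $ recovers the sum above.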
For convergence, we note that each of the rates
$latex |1 - \alpha \lambda_i| $
must be strictly less than 1.
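Unpacking that requirement (assuming, as above, that every $latex \lambda_i $ is positive):
$latex |1 - \alpha\lambda_i| < 1 \iff 0 < \alpha\lambda_i < 2 \iff 0 < \alpha < \frac{2}{\lambda_i} $
so the admissible step sizes are limited by the largest eigenvalue of A: $latex 0 < \alpha < 2/\lambda_{\max} $.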
Note: To be continued…