(Nesterov) Accelerated Gradient Descent

We define standard gradient descent as:
w^{k+1} = w^k - \alpha \nabla f(w^k)
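
As a minimal sketch of this update rule (in Python, using a hypothetical one-dimensional objective f(w) = w^2 purely for illustration):

def gradient_descent(grad_f, w0, alpha, num_steps):
    # Iterate w^{k+1} = w^k - alpha * grad f(w^k)
    w = w0
    for _ in range(num_steps):
        w = w - alpha * grad_f(w)
    return w

# Example: f(w) = w^2 has gradient 2w and minimizer w = 0.
print(gradient_descent(lambda w: 2 * w, w0=5.0, alpha=0.1, num_steps=100))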

Momentum makes a relatively subtle change to gradient descent: rather than stepping along the raw gradient \nabla f(w^k), we step along an accumulated term z, where
z^{k+1} = \beta z^k + \nabla f(w^k)
w^{k+1} = w^k - \alpha z^{k+1}
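
A matching sketch of the momentum update on the same toy objective (here z is initialized to 0, an assumption the equations above leave implicit):

def momentum_descent(grad_f, w0, alpha, beta, num_steps):
    w, z = w0, 0.0
    for _ in range(num_steps):
        z = beta * z + grad_f(w)   # z^{k+1} = beta * z^k + grad f(w^k)
        w = w - alpha * z          # w^{k+1} = w^k - alpha * z^{k+1}
    return w

# Same toy objective f(w) = w^2 as before.
print(momentum_descent(lambda w: 2 * w, w0=5.0, alpha=0.1, beta=0.9, num_steps=100))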

As such, alpha retains its role as the step size. However, we introduce a new constant, beta, which controls how quickly the accumulated momentum decays and must lie strictly between 0 and 1. We briefly demonstrate the edge cases of beta below:
\beta = 0: \quad w^{k+1} = w^k - \alpha \big ( 0 \times z^k + \nabla f(w^k) \big )    \rightarrow    w^{k+1} = w^k - \alpha \nabla f(w^k)
\beta = 1: \quad w^{k+1} = w^k - \alpha \big (    \nabla f(w^k) + \nabla f(w^{k-1}) + \nabla f(w^{k-2}) + \dots    \big )
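
A quick numerical check of the beta = 0 edge case (a sketch only, on the same toy objective; with beta = 0 the momentum iterates should match plain gradient descent step for step):

grad = lambda w: 2 * w   # gradient of the toy objective f(w) = w^2
alpha, beta = 0.1, 0.0
w_plain, w_mom, z = 5.0, 5.0, 0.0
for _ in range(10):
    w_plain = w_plain - alpha * grad(w_plain)   # plain gradient descent
    z = beta * z + grad(w_mom)                  # beta = 0: z is just the current gradient
    w_mom = w_mom - alpha * z
print(w_plain, w_mom)   # identical iterates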

We now introduce a change of basis which will be convenient later. Consider gradient descent in matrix form:
w^{k+1} = w^k - \alpha (Aw^k - b) \text{ where } \nabla f(w^k) = Aw^k-b 

Which has optimal solution
w^{\star} = A^{-1}b 
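
As a sketch (in NumPy, with an arbitrary symmetric positive-definite A and b chosen only for illustration), iterating this matrix form does converge to A^{-1}b:

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])
alpha = 0.1

w = np.zeros(2)
for _ in range(500):
    w = w - alpha * (A @ w - b)   # w^{k+1} = w^k - alpha (A w^k - b)

print(w)                          # iterate after 500 steps
print(np.linalg.solve(A, b))      # w* = A^{-1} b
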
We consider the eigenvectors of A, which provide a natural basis in which to view the iteration: in this basis, every dimension acts independently.

Given:
$latex A = Q \text{ diag }(\lambda_1 \dots \lambda_n)Q^T $
x^k = Q^T (w^k - w^{\star})    \leftarrow \text{new basis}

We may derive this by:
w^{k+1} = w^k - \alpha (Aw^k - b) 
w^{k+1} -w^{\star} = w^k - w^{\star} - \alpha (Aw^k - b) 
Q^T\big(w^{k+1} -w^{\star}\big) = Q^T\big(w^k - w^{\star}\big) - \alpha Q^T \big(Aw^k - b\big ) 
x^{k+1} = x^k - \alpha Q^T \big(Aw^k - b\big ) 

w^{\star} = A^{-1}b    \implies    Aw^{\star} = b
x^{k+1} = x^k - \alpha Q^T \big(Aw^k - Aw^\star \big ) 
x^{k+1} = x^k - \alpha Q^T A \big (w^k - w^\star \big ) 
x^{k+1} = x^k - \alpha Q^T Q \text{diag}\big(\lambda\big)Q^T \big (w^k - w^\star \big ) 
x^{k+1} = x^k - \alpha \text{diag}\big(\lambda\big)Q^T \big (w^k - w^\star \big ) 
x^{k+1} = x^k - \alpha \text{diag}\big(\lambda\big) x^k 
x_i^{k+1} = x_i^k - \alpha \lambda_i x_i^k    = (1- \alpha \lambda_i )x_i^k
Unrolling this recursion and mapping back to the original coordinates gives
$latex w^k - w^{\star} = \sum_i^n x_i^0(1-\alpha \lambda_i)^k q_i $
where q_i denotes the i-th eigenvector of A (the i-th column of Q).
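
A numerical check of this closed form, reusing the same illustrative A and b as above (np.linalg.eigh supplies the decomposition A = Q diag(lambda) Q^T):

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
alpha = 0.1
lam, Q = np.linalg.eigh(A)        # A = Q diag(lambda) Q^T
w_star = np.linalg.solve(A, b)    # w* = A^{-1} b

w = np.zeros(2)
x0 = Q.T @ (w - w_star)           # x^0, the initial coordinates in the eigenbasis
k = 25
for _ in range(k):
    w = w - alpha * (A @ w - b)

print(Q.T @ (w - w_star))             # x^k computed from the iterates
print((1 - alpha * lam) ** k * x0)    # closed form (1 - alpha lambda_i)^k x_i^0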

Thus, we can now view gradient descent in the basis of the eigenvectors of A. The error then decomposes into contributions from each eigenvector:
$latex f(w^k)-f(w^\star) = \frac{1}{2}\sum_i (1 - \alpha \lambda_i)^{2k} \lambda_i [x_i^0]^2 $
For convergence, we note that
$latex |1 - \alpha \lambda_i| $
must be strictly less than 1 for every i; equivalently (for positive eigenvalues), the step size must satisfy 0 < \alpha < 2/\lambda_{\max}.
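
A sketch tying both claims together numerically, again with the illustrative A and b from above and f(w) = (1/2) w^T A w - b^T w so that its gradient is Aw - b:

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
alpha = 0.1
lam, Q = np.linalg.eigh(A)
w_star = np.linalg.solve(A, b)
f = lambda w: 0.5 * w @ A @ w - b @ w   # so that grad f(w) = A w - b

w = np.zeros(2)
x0 = Q.T @ (w - w_star)
k = 25
for _ in range(k):
    w = w - alpha * (A @ w - b)

# Error decomposition over the eigenvectors of A.
print(f(w) - f(w_star))
print(0.5 * np.sum((1 - alpha * lam) ** (2 * k) * lam * x0 ** 2))

# Convergence requires |1 - alpha * lambda_i| < 1 for every i,
# i.e. 0 < alpha < 2 / lambda_max.
print(2 / lam.max())   # largest admissible step size here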


Note: To be continued…