### Momentum

1. $$m_k$$ is the momentum, also called the velocity.
2. $$w_k$$ is the parameter vector at step $$k$$.
3. $$\nabla f(w_k)$$ is the gradient of the objective $$f$$ at $$w_k$$.

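Gradient descent with momentum uses the following update, where $$\alpha$$ is the step size and $$\beta$$ is the momentum coefficient. The gradient is evaluated at the previous iterate $$w_{k-1}$$, so each step depends only on quantities that are already known:

$$
\begin{aligned} m_k &= \beta m_{k-1} + \nabla f(w_{k-1}) \\
w_k &= w_{k-1} - \alpha m_k \end{aligned}
$$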

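1. With $$\beta=0$$, we recover plain gradient descent.
2. With $$\beta=0.99$$, the velocity keeps a long memory of past gradients (on the order of $$1/(1-\beta) = 100$$ steps), which appears to be the boost we need.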
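The update rule above can be sketched on the CPU as follows. This is a minimal C++ sketch, not a reference implementation: the function names `momentum_step`, `quadratic_grad`, and `run_momentum`, as well as the quadratic test objective $$f(w) = \frac{1}{2}\sum_i a_i w_i^2$$, are assumptions made for illustration. The gradient is evaluated at the current iterate before that iterate is updated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One momentum step:
//   m <- beta * m + grad
//   w <- w - alpha * m
// With beta == 0 this is exactly one step of plain gradient descent.
void momentum_step(std::vector<float> &w, std::vector<float> &m,
                   const std::vector<float> &grad, float alpha, float beta) {
  assert(w.size() == m.size() && w.size() == grad.size());
  for (std::size_t i = 0; i < w.size(); ++i) {
    m[i] = beta * m[i] + grad[i];
    w[i] -= alpha * m[i];
  }
}

// Gradient of the illustrative objective f(w) = 0.5 * sum_i a[i] * w[i]^2.
std::vector<float> quadratic_grad(const std::vector<float> &a,
                                  const std::vector<float> &w) {
  assert(a.size() == w.size());
  std::vector<float> g(w.size());
  for (std::size_t i = 0; i < w.size(); ++i) g[i] = a[i] * w[i];
  return g;
}

// Runs `steps` momentum iterations from `w` (velocity starts at zero)
// and returns the final iterate.
std::vector<float> run_momentum(const std::vector<float> &a,
                                std::vector<float> w, float alpha, float beta,
                                int steps) {
  std::vector<float> m(w.size(), 0.0f);
  for (int k = 0; k < steps; ++k) {
    momentum_step(w, m, quadratic_grad(a, w), alpha, beta);
  }
  return w;
}
```

For example, on $$f(w) = \frac{1}{2} w^2$$ starting from $$w = 1$$ with $$\alpha = 0.5$$, two steps give $$w = 0.25$$ with $$\beta = 0$$ and $$w = 0$$ with $$\beta = 0.5$$.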

### Adam

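Adam keeps exponential moving averages of the gradient ($$m_k$$) and of its elementwise square ($$v_k$$), corrects both for their bias toward the initial value, and scales the step by their ratio. Here $$\alpha$$ is the step size, $$\varepsilon$$ is a small constant that prevents division by zero, and the square, square root, and division are elementwise:

$$
\begin{aligned} m_k &= \beta_1 m_{k-1} + (1-\beta_1)\nabla f(w_{k-1}) \\
v_k &= \beta_2 v_{k-1} + (1-\beta_2)(\nabla f(w_{k-1}))^2 \\
\hat m_k &= \frac{m_k}{1-\beta_1^k} \\
\hat v_k &= \frac{v_k}{1-\beta_2^k} \\
w_k &= w_{k-1} - \alpha \frac{\hat m_k}{\sqrt{\hat v_k} +\varepsilon} \end{aligned}
$$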
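As a quick check of the bias-correction terms, here is the first step worked out. It assumes the moments start at zero, $$m_0 = v_0 = 0$$ (the usual initialization, though it is not stated above), and writes $$g$$ for the first gradient. At the first step the exponent in both correction terms is 1:

$$
\begin{aligned}
m_1 &= (1-\beta_1)\, g, & \hat m_1 &= \frac{(1-\beta_1)\, g}{1-\beta_1} = g \\
v_1 &= (1-\beta_2)\, g^2, & \hat v_1 &= \frac{(1-\beta_2)\, g^2}{1-\beta_2} = g^2 \\
w_1 &= w_0 - \alpha \frac{g}{|g| + \varepsilon} \approx w_0 - \alpha \,\mathrm{sign}(g)
\end{aligned}
$$

So the first update has magnitude close to $$\alpha$$ in every coordinate, regardless of the scale of the gradient. Without the correction, the step would instead be scaled by $$(1-\beta_1)/\sqrt{1-\beta_2}$$, which is about 3.2 for $$\beta_1 = 0.9$$ and $$\beta_2 = 0.999$$.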

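```cuda
#include <cmath>

// CUDA_KERNEL_LOOP, CUDA_NUM_THREADS and GET_BLOCKS are Caffe-style helper
// macros that are assumed to be defined elsewhere in the project.

// Applies one Adam step to the N parameters in w.
// `correction` folds both bias-correction factors into a single scalar.
__global__
void adam_update_kernel(float *w, float *g, float *m, float *v,
                        int N, float beta1, float beta2,
                        float correction, float eps, const float lr) {
  CUDA_KERNEL_LOOP(i, N) {
    float gi = g[i];
    // Update the running averages of the gradient and its square in place.
    float mi = m[i] = m[i] * beta1 + gi * (1 - beta1);
    float vi = v[i] = v[i] * beta2 + gi * gi * (1 - beta2);
    float ng = lr * correction * mi / (sqrtf(vi) + eps);
    w[i] -= ng;  // step against the gradient, as in the update rule for w_k
  }
}

// Host-side launcher for the kernel above.
void adam_update(float *w, float *g, float *m, float *v, int N, float beta1,
                 float beta2, float correction, float eps, const float lr) {
  const dim3 blockSize(CUDA_NUM_THREADS, 1, 1);
  const dim3 gridSize(GET_BLOCKS(N), 1, 1);
  adam_update_kernel<<<gridSize, blockSize>>>(w, g, m, v, N, beta1, beta2,
                                              correction, eps, lr);
}

// _t is the current step count and _lr the learning rate (class members).
void Optimzer::Adam(float *w, float *g, float *m, float *v, int size) {
  const float eps = 1e-8, beta1 = 0.9, beta2 = 0.999;
  // sqrt(1 - beta2^t) / (1 - beta1^t): both bias corrections in one factor.
  const float correction = sqrt(1. - pow(beta2, _t)) / (1. - pow(beta1, _t));
  adam_update(w, g, m, v, size, beta1, beta2, correction, eps, _lr);
}
```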
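A small CPU version of the same step can be handy for checking the kernel's output on a few elements. The sketch below is an assumption-laden illustration rather than part of the original code: the name `adam_step_cpu` and the `std::vector` interface are hypothetical, and `t` plays the role of the 1-based step counter. It folds the two bias corrections into one scalar in the same way as the host code above. Note that this folded form matches the textbook update exactly only when the epsilon term is zero; otherwise epsilon is applied to the uncorrected second moment, which is a common and usually harmless variation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for one Adam step.
// `t` is the 1-based step count; m and v hold the running moments and are
// updated in place along with the parameters w.
void adam_step_cpu(std::vector<float> &w, const std::vector<float> &g,
                   std::vector<float> &m, std::vector<float> &v, int t,
                   float lr, float beta1 = 0.9f, float beta2 = 0.999f,
                   float eps = 1e-8f) {
  assert(t >= 1);
  assert(w.size() == g.size() && w.size() == m.size() && w.size() == v.size());

  // Both bias corrections folded into one scalar:
  //   sqrt(1 - beta2^t) / (1 - beta1^t)
  const double bc1 = 1.0 - std::pow(static_cast<double>(beta1), t);
  const double bc2 = 1.0 - std::pow(static_cast<double>(beta2), t);
  const float correction = static_cast<float>(std::sqrt(bc2) / bc1);

  for (std::size_t i = 0; i < w.size(); ++i) {
    m[i] = beta1 * m[i] + (1.0f - beta1) * g[i];
    v[i] = beta2 * v[i] + (1.0f - beta2) * g[i] * g[i];
    w[i] -= lr * correction * m[i] / (std::sqrt(v[i]) + eps);
  }
}
```

Starting from zero moments, the first call with `t = 1` moves every parameter by approximately `lr` in the direction opposite to the sign of its gradient, which makes a convenient sanity check.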


### Practical Approximations to Newton’s Method

1. Diagonal approximation
2. Limited-memory quasi-Newton (L-BFGS)
3. Barzilai-Borwein approximation
4. Hessian-free method
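To make one of the items above concrete, the Barzilai-Borwein approximation replaces the inverse Hessian with a single scalar step size computed from the last two iterates and gradients. The sketch below shows the "long" variant, $$\alpha = \frac{s^\top s}{s^\top y}$$ with $$s = w_k - w_{k-1}$$ and $$y = \nabla f(w_k) - \nabla f(w_{k-1})$$. The function name `bb_step_size` and the `fallback` parameter are assumptions for illustration, not taken from any particular library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Barzilai-Borwein step size (long variant):
//   alpha = (s . s) / (s . y),  s = w - w_prev,  y = g - g_prev.
// Returns `fallback` when the curvature estimate s . y is not positive,
// since the formula then gives no usable step size.
float bb_step_size(const std::vector<float> &w,
                   const std::vector<float> &w_prev,
                   const std::vector<float> &g,
                   const std::vector<float> &g_prev, float fallback) {
  assert(w.size() == w_prev.size() && w.size() == g.size() &&
         w.size() == g_prev.size());
  double ss = 0.0, sy = 0.0;
  for (std::size_t i = 0; i < w.size(); ++i) {
    const double s = static_cast<double>(w[i]) - w_prev[i];
    const double y = static_cast<double>(g[i]) - g_prev[i];
    ss += s * s;
    sy += s * y;
  }
  return sy > 0.0 ? static_cast<float>(ss / sy) : fallback;
}
```

On a one-dimensional quadratic $$f(w) = \frac{1}{2} a w^2$$ the formula returns exactly $$1/a$$, the Newton step size, which is why it is grouped with the Newton approximations.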