# Adam (adaptive moment estimation) optimizer

Reviewing the week 3 assignment of NLP with Deep Learning brought the Adam optimization algorithm back to my attention. Here I summarize what I did in homework 3 for future reference:

#### Adam optimizer

In standard SGD, we use a mini-batch (e.g. a single sample) of data in the update rule below to update the parameters \(\theta\) and minimize the cost \(J(\theta)\):

\[\theta := \theta - \alpha \nabla_\theta J(\theta)\]where \(\alpha\) is the learning rate and \(\nabla_\theta J(\theta)\) represents the partial derivatives of the cost function with respect to \(\theta\).
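
As a minimal sketch (the toy cost, function name, and learning rate here are illustrative, not from the assignment), one SGD step in NumPy could look like this:

```python
import numpy as np

# a minimal SGD step on a toy quadratic cost J(theta) = ||theta||^2 / 2,
# whose gradient is simply theta
def sgd_step(theta, grad, alpha=0.1):
    return theta - alpha * grad

theta = np.array([1.0, -2.0])
for _ in range(10):
    grad = theta                      # gradient of the toy cost at theta
    theta = sgd_step(theta, grad, alpha=0.1)
```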

Adam optimization takes two additional steps beyond SGD:

##### Update biased first-order moment estimate

\[\begin{align*} m & := \beta_1 m + (1 - \beta_1)\nabla_\theta J(\theta)\\ \theta & := \theta - \alpha m \end{align*}\]Since \(m\) is a weighted average of the rolling gradient average from the **previous** iterations and the gradient of the **current** iteration, we can expect the **momentum** step to make the gradient descent update smoother than that of SGD (which only considers the **current** iteration). Unrolling the recursion, the gradient from \(k\) iterations ago carries a weight of \((1-\beta_1)\beta_1^k\), so past gradients are down-weighted geometrically and the moving average effectively spans roughly the last \(\frac{1}{1-\beta_1}\) iterations, while the **current** gradient \(\nabla_{\theta} J(\theta)\) always receives the largest single weight.
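
A minimal NumPy sketch of this momentum step, assuming illustrative names and default values for \(\alpha\) and \(\beta_1\):

```python
import numpy as np

def momentum_update(theta, grad, m, alpha=0.001, beta1=0.9):
    """Momentum step of Adam (without the second-moment scaling).

    m tracks an exponential moving average of past gradients, so the
    update direction is a smoothed version of the raw gradient.
    """
    m = beta1 * m + (1 - beta1) * grad   # update biased first moment estimate
    theta = theta - alpha * m            # step along the smoothed gradient
    return theta, m
```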

##### Update biased second-order raw moment estimate

\[\begin{align*} v & := \beta_2 v + (1 - \beta_2)(\nabla_\theta J(\theta) \odot \nabla_\theta J(\theta))\\ \theta & := \theta - \alpha \frac{m}{\sqrt{v}} \end{align*}\]Since \(m\) is divided element-wise by \(\sqrt{v}\) to form the Adam update, parameters whose gradients have stayed small (small \(v\)) receive relatively larger updates, while parameters with large or noisy gradients are scaled down. Because \(v\) accumulates the element-wise square of \(\nabla_{\theta} J(\theta)\), \(\sqrt{v}\) tracks the typical magnitude of the recent gradients for each parameter. This adaptive per-parameter scaling might help further smooth out the gradient descent beyond the **momentum** step, by giving larger steps to the parameters with smaller gradients when the SGD update does not guarantee a continuous decrease in the gradients.
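
Putting both steps together, a rough NumPy sketch of the full update might look as follows; the `eps` term and the default hyperparameter values are my own additions for numerical stability and illustration (the original paper also applies bias correction, which the equations above omit):

```python
import numpy as np

def adam_update(theta, grad, m, v, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update combining both moment estimates.

    eps is a small constant added for numerical stability; bias
    correction from the original paper is omitted, matching the
    update rules above.
    """
    m = beta1 * m + (1 - beta1) * grad             # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad      # second raw moment
    theta = theta - alpha * m / (np.sqrt(v) + eps) # per-parameter adaptive step
    return theta, m, v

# usage sketch on the same toy cost J(theta) = ||theta||^2 / 2
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for _ in range(100):
    grad = theta                                   # gradient of the toy cost
    theta, m, v = adam_update(theta, grad, m, v)
```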

##### Note: Moments in statistics

The terms "first moment" and "second moment" come from the definition of moments in statistics:

The \(k^{th}\) moment of a random variable \(X\) with pdf \(f(x)\) can be expressed as:

\[E(X^k) = \int_X x^k f(x) dx\]Therefore, the first moment of \(X\) is \(E(X)\), which is the mean of the distribution; and the second moment of \(X\) is \(E(X^2)\), which equals the squared mean plus the variance (since \(\text{Var}(X) = E(X^2) - E(X)^2\)).
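
As a quick numerical sanity check of that identity (the distribution and sample size below are arbitrary):

```python
import numpy as np

# verify E(X^2) = Var(X) + E(X)^2 on random samples
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)

print(np.mean(x ** 2))               # second raw moment, ~ 3^2 + 2^2 = 13
print(np.var(x) + np.mean(x) ** 2)   # matches up to sampling noise
```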

The original Adam paper is Kingma & Ba, "Adam: A Method for Stochastic Optimization" (arXiv:1412.6980); see also standard documentation on moments in statistics.