# word2vec

Word2vec was published by researchers at Google in 2013 (Mikolov et al. 2013) with the goal of learning high-quality word vectors from very large datasets. Two log-linear model architectures (CBOW and Skip-gram) were proposed.

#### Objective function of word2vec

We want to compute how likely a context word (o) and a center word (c) are to appear together. Since each word can serve as both a context word and a center word, we use two vectors to describe each word w:

- When the word w is a `center` word, its word vector is written as \(v_w\).
- When the word w is a `context` word, its word vector is written as \(u_w\).
- The dot product of the two vectors helps us evaluate how similar/close a center word \(v_{c}\) and a context word \(u_{o}\) are (see the sketch below).
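
To make the two-vector setup concrete, here is a minimal numpy sketch; the vocabulary size, dimensionality, and word indices below are made-up values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 100    # assumed sizes, for illustration only

V = rng.normal(scale=0.01, size=(vocab_size, dim))  # v_w: center-word vectors
U = rng.normal(scale=0.01, size=(vocab_size, dim))  # u_w: context-word vectors

c, o = 42, 1337        # hypothetical center and context word indices
score = U[o] @ V[c]    # u_o^T v_c: a larger dot product means "more similar"
```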

##### Skip-gram Model

We compute how likely a context word (o) is to appear within a fixed window of size m around a given center word (c). We can write this probability \(P(o|c)\) as \(P(w_{t+j}|w_t)\), where \(w_t\) is the center word and \(w_{t+j}\) is a context word.

The softmax function is a common choice:

\[P(o|c)= \text{softmax}(u_o^Tv_c) = \frac{\exp(u_o^Tv_c)}{\sum_{w \in V}\exp(u_w^Tv_c)}\]
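
As a sketch, this softmax translates directly into numpy (the max subtraction is a standard numerical-stability trick, not part of the formula itself):

```python
import numpy as np

def p_context_given_center(U, V, o, c):
    """P(o|c): softmax of u_w^T v_c over the vocabulary, evaluated at w = o."""
    scores = U @ V[c]                  # u_w^T v_c for every word w in V
    scores -= scores.max()             # shift scores for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]
```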

As we have T words in our corpus, the likelihood of predicting the context words within a fixed window of size m around each center word \(w_t\) is:

\[L(\theta)= \prod_{t=1}^T \prod_{-m \leq j \leq m, j \neq 0} P(w_{t+j}|w_t)\]

\[\ell(\theta)= \sum_{t=1}^T \sum_{-m \leq j \leq m, j \neq 0} \log P(w_{t+j}|w_t)\]
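
A sketch of how \(\ell(\theta)\) is accumulated over a corpus, assuming `corpus` is a list of word indices and `m` is the window size:

```python
import numpy as np

def log_likelihood(corpus, U, V, m):
    """ℓ(θ): sum of log P(w_{t+j} | w_t) over every center/context pair."""
    total = 0.0
    for t in range(len(corpus)):
        scores = U @ V[corpus[t]]          # u_w^T v_c for all w
        shifted = scores - scores.max()    # log softmax, computed stably
        log_probs = shifted - np.log(np.exp(shifted).sum())
        for j in range(-m, m + 1):         # window positions, skipping j = 0
            if j != 0 and 0 <= t + j < len(corpus):
                total += log_probs[corpus[t + j]]
    return total
```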

The objective function is then the average negative log-likelihood:

\[\mathrm{J}(\theta) = -\frac{1}{T}\ell(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} \log P(w_{t+j}|w_t)\]

For a single center/context pair (c, o), each term simplifies to:

\[-\log P(o|c) = -u_o^Tv_c + \log \sum_{w \in V}\exp(u_w^Tv_c)\]

Taking the gradient with respect to the center vector \(v_c\):

\[\frac{\partial}{\partial v_c}\bigl(-\log P(o|c)\bigr) = -u_o + \sum_{w \in V}\frac{\exp(u_w^Tv_c)}{\sum_{w' \in V}\exp(u_{w'}^Tv_c)}\,u_w = \boxed{-u_o + \sum_{w \in V} P(w|c)\, u_w}\]

With respect to the context vector \(u_o\), only the \(w = o\) term of the log-sum depends on \(u_o\), so:

\[\frac{\partial}{\partial u_o}\bigl(-\log P(o|c)\bigr) = -v_c + \frac{\exp(u_o^Tv_c)}{\sum_{w \in V}\exp(u_w^Tv_c)}\,v_c = \boxed{\bigl(P(o|c) - 1\bigr)\,v_c}\]

and for any other context vector \(u_w\) with \(w \neq o\), the gradient is \(P(w|c)\,v_c\). Our goal is to find the set of parameters \(\theta\) that minimizes this cost function, and we can leverage stochastic gradient descent.
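
Here is a minimal sketch of one SGD step on the per-pair loss, implementing the two gradients derived above (the learning rate is an assumed value):

```python
import numpy as np

def sgd_step(U, V, o, c, lr=0.025):
    """One SGD step on -log P(o|c) for a single (center c, context o) pair."""
    scores = U @ V[c]
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()   # P(w|c) for every w

    grad_vc = -U[o] + probs @ U         # -u_o + sum_w P(w|c) u_w
    grad_U = np.outer(probs, V[c])      # P(w|c) v_c for every context vector ...
    grad_U[o] -= V[c]                   # ... minus v_c for the observed w = o

    V[c] -= lr * grad_vc                # update parameters in place
    U -= lr * grad_U
```

Note that the full-softmax update touches every row of U, which is expensive for a large vocabulary; this is why the word2vec papers use approximations such as hierarchical softmax and negative sampling in practice.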

##### Continuous Bag of Words (CBOW) Model

We compute how likely a center word (c) is to appear given all of its context words. We can write this probability \(P(c|o)\) as \(P(u_c|\vec{v})\), where \(\vec{v}\) is the average of the context word vectors.

Similarly, with softmax:

\[P(c|o)= \text{softmax}(u_c^T\vec{v}) = \frac{\exp(u_c^T\vec{v})}{\sum_{w \in V}\exp(u_w^T\vec{v})}\]

Meaning that the objective function is again the average negative log-likelihood, with one prediction per corpus position:

\[\mathrm{J}(\theta)= -\frac{1}{T}\sum_{t=1}^{T} \log \frac{\exp(u_c^T\vec{v})}{\sum_{w \in V}\exp(u_w^T\vec{v})}\]

where each term simplifies to \(-u_c^T\vec{v} + \log \sum_{w \in V}\exp(u_w^T\vec{v})\).
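
Putting the CBOW pieces together, a minimal sketch of this per-prediction loss, taking \(\vec{v}\) to be the average of the context word vectors:

```python
import numpy as np

def cbow_loss(U, V, context_ids, c):
    """-log P(c | context), with v̄ the average of the context word vectors."""
    v_bar = V[context_ids].mean(axis=0)    # \vec{v}
    scores = U @ v_bar                     # u_w^T v̄ for every word w
    shifted = scores - scores.max()        # stable log softmax
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[c]
```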

Formal notes from CS224N can be referenced here: lecture 1, lecture 2.

I’ll continue with a notebook example tomorrow…