word2vec

Word2vec was published by researchers at Google in 2013 (Mikolov et al. 2013) with the goal of learning high-quality word vectors from very large datasets. Two log-linear model architectures were proposed: continuous bag-of-words (CBOW) and Skip-gram.

Objective function of word2vec

We want to compute how likely a context word (o) and a center word (c) appear together. Since every word can serve as both a context word and a center word, we use two vectors to represent each word w:

  • When the word w is used as a center word, its word vector is written \(v_w\).
  • When the word w is used as a context word, its word vector is written \(u_w\).
  • The dot product \(u_o^T v_c\) measures how similar/compatible a center word c and a context word o are (see the sketch below).
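Here is a minimal sketch of this two-vector representation; the toy vocabulary, the embedding dimensionality, and the randomly initialized matrices are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "quick", "brown", "fox", "jumps"]   # hypothetical toy vocabulary
word2id = {w: i for i, w in enumerate(vocab)}
dim = 8                                             # assumed embedding dimensionality

V = rng.normal(scale=0.1, size=(len(vocab), dim))   # v_w: vectors used as center words
U = rng.normal(scale=0.1, size=(len(vocab), dim))   # u_w: vectors used as context words

# The dot product u_o^T v_c scores how compatible context word o is with center word c.
c, o = word2id["fox"], word2id["quick"]
score = U[o] @ V[c]
print(score)
```
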
Skip-gram Model

We compute how likely a context word (o) is to appear within a fixed window of size m around a given center word (c). We can write this probability \(P(o|c)\) as \(P(w_{t+j}|w_t)\), where \(w_t\) is the center word and \(w_{t+j}\) is a context word at offset j.

The softmax function is commonly used to turn these dot products into a probability distribution:

\[P(o|c)= \text{softmax}(u_o^Tv_c) = \frac{\exp(u_o^Tv_c)}{\sum_{w \in V}\exp(u_w^Tv_c)}\]
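Here is a minimal sketch of this softmax probability, assuming a toy vocabulary size and randomly initialized vectors; subtracting the maximum score is only for numerical stability and does not change the result.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 8                              # toy assumptions
V = rng.normal(scale=0.1, size=(vocab_size, dim))   # center vectors v_w
U = rng.normal(scale=0.1, size=(vocab_size, dim))   # context vectors u_w

def p_context_given_center(c: int) -> np.ndarray:
    """Return P(w|c) for every word w in the vocabulary."""
    scores = U @ V[c]                      # u_w^T v_c for all w
    scores -= scores.max()                 # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # softmax over the vocabulary

probs = p_context_given_center(c=3)
print(probs, probs.sum())                  # the probabilities sum to 1
```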

Given a corpus of T words, the likelihood of predicting the context words within a fixed window of size m around each center word \(w_t\) is:

\[L(\theta)= \prod_{t=1}^T\prod_{-m \leq j \leq m,\, j \neq 0} P(w_{t+j}|w_t)\] \[\ell(\theta)= \sum_{t=1}^T \sum_{-m \leq j \leq m,\, j \neq 0} \log P(w_{t+j}|w_t)\]
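As a sanity check, here is a minimal sketch that accumulates \(\ell(\theta)\) over a toy corpus of word ids; the corpus, window size, and embeddings are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = [0, 1, 2, 3, 4, 2, 1]             # word ids of a hypothetical toy corpus
vocab_size, dim, m = 5, 8, 2               # m = window size (assumed)
V = rng.normal(scale=0.1, size=(vocab_size, dim))   # center vectors v_w
U = rng.normal(scale=0.1, size=(vocab_size, dim))   # context vectors u_w

def log_p(o: int, c: int) -> float:
    """log P(o|c) under the softmax model above."""
    scores = U @ V[c]
    scores -= scores.max()                 # numerical stability (cancels in the ratio)
    return scores[o] - np.log(np.exp(scores).sum())

log_likelihood = 0.0
for t, c in enumerate(corpus):             # every position t is a center word
    for j in range(-m, m + 1):             # every offset in the window, except j = 0
        if j == 0 or not (0 <= t + j < len(corpus)):
            continue
        log_likelihood += log_p(corpus[t + j], c)
print(log_likelihood)
```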

The objective function is the (average) negative log-likelihood:

\[\mathrm{J}(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \sum_{-m \leq j \leq m,\, j \neq 0} \log P(w_{t+j}|w_t)\]

For a single center word c and observed context word o, the per-pair loss is:

\[\mathrm{J}_{o,c}(\theta) = -\log P(o|c) = -u_o^Tv_c + \log \sum_{w \in V}\exp(u_w^Tv_c)\]

Taking gradients of this per-pair loss with respect to the center vector \(v_c\) and the observed context vector \(u_o\):

\[\frac{\partial}{\partial v_c}\mathrm{J}_{o,c}(\theta) = -u_o + \sum_{w \in V}\frac{\exp(u_w^Tv_c)}{\sum_{x \in V}\exp(u_x^Tv_c)}\,u_w = \boxed{-u_o + \sum_{w \in V} P(w|c)\, u_w}\] \[\frac{\partial}{\partial u_o}\mathrm{J}_{o,c}(\theta) = -v_c + \frac{\exp(u_o^Tv_c)}{\sum_{w \in V}\exp(u_w^Tv_c)}\,v_c = \boxed{\left(P(o|c) - 1\right)v_c}\]

For any other context vector \(u_w\) with \(w \neq o\), only the normalization term contributes, giving \(\frac{\partial}{\partial u_w}\mathrm{J}_{o,c}(\theta) = P(w|c)\,v_c\).
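Below is a minimal numpy sketch of these per-pair gradients, together with a finite-difference check on one coordinate; the vocabulary size, dimensionality, and random embeddings are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 8                              # toy assumptions
V = rng.normal(scale=0.1, size=(vocab_size, dim))   # center vectors v_w
U = rng.normal(scale=0.1, size=(vocab_size, dim))   # context vectors u_w

def softmax(scores: np.ndarray) -> np.ndarray:
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

def loss(U: np.ndarray, V: np.ndarray, o: int, c: int) -> float:
    """Per-pair loss: -u_o^T v_c + log sum_w exp(u_w^T v_c)."""
    scores = U @ V[c]
    return -scores[o] + scores.max() + np.log(np.exp(scores - scores.max()).sum())

o, c = 1, 3
p = softmax(U @ V[c])                      # P(w|c) for every w
grad_vc = -U[o] + p @ U                    # -u_o + sum_w P(w|c) u_w
grad_uo = (p[o] - 1.0) * V[c]              # (P(o|c) - 1) v_c

# Finite-difference check of one coordinate of grad_vc.
eps = 1e-6
V_plus, V_minus = V.copy(), V.copy()
V_plus[c, 0] += eps
V_minus[c, 0] -= eps
numeric = (loss(U, V_plus, o, c) - loss(U, V_minus, o, c)) / (2 * eps)
print(grad_vc[0], numeric)                 # the two values should be close
```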

Our goal is to find the parameters \(\theta\) (all of the vectors \(u_w\) and \(v_w\)) that minimize this cost function, which we can do with stochastic gradient descent.
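Here is a minimal sketch of a single stochastic-gradient-descent step on one (center, context) pair, using the gradients above; the learning rate and toy embeddings are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, lr = 5, 8, 0.05                    # lr is an arbitrary learning rate
V = rng.normal(scale=0.1, size=(vocab_size, dim))   # center vectors v_w
U = rng.normal(scale=0.1, size=(vocab_size, dim))   # context vectors u_w

def sgd_step(U: np.ndarray, V: np.ndarray, o: int, c: int, lr: float) -> None:
    """Update the parameters in place for one observed (center c, context o) pair."""
    scores = U @ V[c]
    scores -= scores.max()
    p = np.exp(scores) / np.exp(scores).sum()       # P(w|c) for every w
    grad_V_c = -U[o] + p @ U                        # gradient w.r.t. v_c
    grad_U = np.outer(p, V[c])                      # P(w|c) v_c for every w
    grad_U[o] -= V[c]                               # extra -v_c for the observed word o
    V[c] -= lr * grad_V_c
    U -= lr * grad_U

sgd_step(U, V, o=1, c=3, lr=lr)
```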

Continuous Bag of Words (CBOW) Model

We compute how likely a center word (c) is given all of its context words. The context word vectors are combined (typically averaged) into a single vector \(\vec{v}\), so we can write the probability \(P(c|o)\) as \(P(c|\vec{v})\).

Similarly, with softmax:

\[P(c|o)= \text{softmax}(u_c^T\vec{v}) = \frac{\exp(u_c^T\vec{v})}{\sum_{w \in V}\exp(u_w^T\vec{v})}\]
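A minimal sketch of this CBOW probability, where the context vectors are averaged into \(\vec{v}\) and scored against every \(u_w\); the context window and embeddings below are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 8                              # toy assumptions
V = rng.normal(scale=0.1, size=(vocab_size, dim))   # v_w vectors (context side in CBOW)
U = rng.normal(scale=0.1, size=(vocab_size, dim))   # u_w vectors (center side in CBOW)

context_ids = [0, 2, 4, 1]                 # hypothetical words surrounding the center
v_bar = V[context_ids].mean(axis=0)        # averaged context vector \vec{v}

scores = U @ v_bar                         # u_w^T \vec{v} for every candidate center w
scores -= scores.max()                     # numerical stability
probs = np.exp(scores) / np.exp(scores).sum()
print(probs)                               # P(c|\vec{v}) over the whole vocabulary
```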

Analogously to Skip-gram, the objective function is the negative log-likelihood over all center positions; for a single training example with center word c and averaged context vector \(\vec{v}\), the loss is:

\[\mathrm{J}_{c}(\theta) = -\log P(c|\vec{v}) = -u_c^T\vec{v} + \log \sum_{w \in V}\exp(u_w^T\vec{v})\]
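And the corresponding per-example CBOW loss as a minimal sketch, under the same toy assumptions as above:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 8                              # toy assumptions
V = rng.normal(scale=0.1, size=(vocab_size, dim))   # v_w vectors (context side)
U = rng.normal(scale=0.1, size=(vocab_size, dim))   # u_w vectors (center side)

def cbow_loss(center: int, context_ids: list[int]) -> float:
    """-u_c^T v_bar + log sum_w exp(u_w^T v_bar) for one training example."""
    v_bar = V[context_ids].mean(axis=0)             # \vec{v}
    scores = U @ v_bar                              # u_w^T \vec{v}
    m = scores.max()                                # for numerical stability
    return -scores[center] + m + np.log(np.exp(scores - m).sum())

print(cbow_loss(center=3, context_ids=[0, 2, 4, 1]))
```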

Formal notes from CS224N can be referenced here: lecture 1 and lecture 2.

I’ll continue with a notebook example tomorrow…