Given a sequence of word vectors \(x_1,\cdots,x_{t-1},x_t,x_{t+1},\cdots,x_T\), the RNN language model predicts a distribution over the next word at each time step:

\[\begin{aligned} \hat y_t &=\text{softmax}(W h_t) \\ \hat p(x_{t+1}=v_j\mid x_t,\cdots,x_1) &=\hat{y}_{t,j} \end{aligned}\]

\(h_t\) is the hidden state output at time step \(t\), and \(\hat{y}_t\in\mathbb{R}^{|V|}\) is a probability distribution over the vocabulary. We use the same cross-entropy loss function as in classification, except that we are now predicting words instead of classes:

\[J^{(t)}(\theta)=-\sum_{j=1}^{|V|}{y_{t,j}\log \hat{y}_{t,j}}\]
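As a sanity check of the two formulas above, here is a minimal NumPy sketch. The sizes \(D_h\) and \(V\), the matrix shapes, and the variable names are assumptions for illustration, and \(h_t\) is taken as already computed by the RNN:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: hidden dimension D_h, vocabulary size V.
D_h, V = 100, 10000
rng = np.random.default_rng(0)

W = rng.normal(scale=0.01, size=(V, D_h))  # output projection
h_t = rng.normal(size=D_h)                 # hidden state at step t (assumed given)

y_hat_t = softmax(W @ h_t)                 # \hat{y}_t: distribution over the vocabulary

# Cross-entropy loss at step t; y_t is one-hot, so only the true
# next word's log-probability contributes to the sum.
true_word_index = 42                       # index j of the observed word x_{t+1}
J_t = -np.log(y_hat_t[true_word_index])
```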

For evaluation, we can take the negative average log probability over a dataset of \(T\) words:

\[\ell=-\frac{1}{T}\sum_{t=1}^T\sum_{j=1}^{|V|}{y_{t,j}\log \hat{y}_{t,j}}\]

Perplexity: \(\exp(\ell)\)

Lower is better!
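A small NumPy sketch of this evaluation step, as a hedged illustration (the helper name `evaluate` and the toy probabilities are made up for this example):

```python
import numpy as np

def evaluate(log_probs_of_true_words):
    """Average negative log probability and perplexity over T words.

    log_probs_of_true_words: array of log(y_hat_{t,j}) for the observed
    word j at each of the T time steps (hypothetical input format).
    """
    ell = -np.mean(log_probs_of_true_words)  # dataset-average cross-entropy
    return ell, np.exp(ell)                  # perplexity = exp(ell)

# Toy example: three steps where the model gave the true next word
# probabilities 0.9, 0.5 and 0.25.
ell, ppl = evaluate(np.log([0.9, 0.5, 0.25]))
print(ell, ppl)  # higher probabilities -> lower loss and lower perplexity
```

A model that spreads probability uniformly over the vocabulary has perplexity \(|V|\), so perplexity can be read as the effective number of words the model is hesitating between at each step.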

References