For one training sample:

\[x_i \to [y_i^0,\cdots,y_{i}^{k}]\]

where \(y_i^0\) is the index of the true (labeled) word, and \(y_i^1,\cdots,y_i^{k}\) are the indices of \(k\) noise words, drawn from the unigram distribution \(q(w)\) of the dataset.
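As a rough illustration of how the noise indices might be produced, here is a minimal sketch using only the Python standard library. The helper names `unigram_distribution` and `draw_noise_indices` and the toy corpus are made up for this example. It also assumes noise words are drawn with replacement and that the true word is not excluded from the draw, which the text above does not specify.

```python
import random
from collections import Counter


def unigram_distribution(corpus_ids, vocab_size):
    """Return q(w) as a list of relative frequencies, one entry per word index."""
    counts = Counter(corpus_ids)
    total = len(corpus_ids)
    return [counts.get(w, 0) / total for w in range(vocab_size)]


def draw_noise_indices(q, k, rng):
    """Draw k noise word indices from q, with replacement (an assumption)."""
    return rng.choices(range(len(q)), weights=q, k=k)


# Toy corpus of word indices (hypothetical data): word 3 is the most frequent.
corpus_ids = [0, 1, 1, 2, 2, 2, 3, 3, 3, 3]
q = unigram_distribution(corpus_ids, vocab_size=4)

rng = random.Random(0)  # fixed seed so the draw is reproducible
noise = draw_noise_indices(q, k=5, rng=rng)
```

Frequent words are drawn as noise more often, which is exactly what the \(k*q(\cdot)\) terms in the probabilities account for.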

  1. the probability of true data:
\[p(y_i^0=1|x_i,\theta)=\frac{\exp(y_i^0 h_\theta)}{\exp(y_i^0 h_\theta) + k*q(y_i^0)}\]
  1. the noise sample probability:
\[p(y_i^t=0|x_i,\theta)=\frac{k*q(y_i^t)}{\exp(y_i^t h_\theta) + k*q(y_i^t)},t=1,\cdots,k\]
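  1. the cost function of this sample (written here as a log-likelihood to be maximized; the loss to minimize is its negative):
\[\ell_{nce}^{(i)}=\log p(y_i^0=1|x_i,\theta)+\sum_{t=1}^k{\log p(y_i^t=0|x_i,\theta)}\]
  1. the overall cost function of the dataset, averaged over the \(N\) samples:
\[\ell_{nce}=\frac{1}{N}\sum_{i=1}^N{\left\{\log p(y_i^0=1|x_i,\theta)+\sum_{t=1}^k{\log p(y_i^t=0|x_i,\theta)}\right\}}\]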
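The four steps above can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, and `scores[w]` stands in for the exponent written above for word `w` (assumed here to be a precomputed real number, for example a dot product between the output embedding of `w` and the hidden vector). It calls `math.exp` directly, so very large scores would overflow; a practical version would work in log space.

```python
import math


def nce_sample_objective(scores, q, true_idx, noise_idx):
    """Log-likelihood of one sample: true word labeled 1, noise words labeled 0."""
    k = len(noise_idx)

    # Step 1: log-probability that the true word is classified as data.
    u = math.exp(scores[true_idx])
    total = math.log(u / (u + k * q[true_idx]))

    # Step 2: log-probability that each noise word is classified as noise.
    for w in noise_idx:
        u = math.exp(scores[w])
        total += math.log(k * q[w] / (u + k * q[w]))

    # Step 3: the per-sample objective is the sum of both parts.
    return total


def nce_dataset_objective(samples, q):
    """Step 4: average the per-sample objective over all N samples."""
    return sum(
        nce_sample_objective(scores, q, true_idx, noise_idx)
        for scores, true_idx, noise_idx in samples
    ) / len(samples)


# Toy setup (hypothetical numbers): 4 words, uniform q, all scores zero, k = 4.
# Then exp(score) = 1 and k * q = 1, so every probability equals 0.5.
q = [0.25, 0.25, 0.25, 0.25]
scores = [0.0, 0.0, 0.0, 0.0]
sample = (scores, 0, [1, 2, 3, 1])

per_sample = nce_sample_objective(scores, q, 0, [1, 2, 3, 1])
dataset = nce_dataset_objective([sample, sample], q)
loss = -dataset  # the quantity to minimize
```

In this toy setup the objective is five times \(\log 0.5\), one term for the true word and four for the noise words. Training would adjust \(\theta\) so that the true word's score rises and the noise words' scores fall, pushing the objective toward zero.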

