### one sample:

$x_i \to [y_i^0,\cdots,y_{i}^{k}]$

where $$y_i^0$$ is the index of the true (labeled) word, and $$y_i^1,\cdots,y_i^{k}$$ are the indices of $$k$$ noise words sampled from the unigram distribution $$q(w)$$ of the dataset.
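As a minimal sketch of how such a sample could be assembled, the NumPy snippet below estimates $$q(w)$$ as relative word frequencies and draws $$k$$ noise indices from it. The function names (`unigram_distribution`, `build_sample`) and the toy corpus are hypothetical and only serve as illustration; they are not taken from any particular library.

```python
import numpy as np

def unigram_distribution(token_ids, vocab_size):
    """Estimate q(w) as the relative frequency of each word index in the corpus."""
    counts = np.bincount(token_ids, minlength=vocab_size).astype(np.float64)
    return counts / counts.sum()

def build_sample(true_index, q, k, rng):
    """Return [y^0, y^1, ..., y^k]: the true index followed by k noise indices drawn from q."""
    noise = rng.choice(len(q), size=k, p=q)
    return np.concatenate(([true_index], noise))

# Hypothetical toy corpus of word indices over a 5-word vocabulary (assumption).
corpus = np.array([0, 1, 1, 2, 2, 2, 3, 3, 3, 3])
q = unigram_distribution(corpus, vocab_size=5)

rng = np.random.default_rng(0)
sample = build_sample(true_index=2, q=q, k=3, rng=rng)
```

Here `sample[0]` is the true word index and `sample[1:]` are the noise indices. A noise draw may coincide with the true word; this sketch does not filter such collisions.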

1. the probability that the true word is classified as data:
$p(y_i^0=1|x_i,\theta)=\frac{\exp(y_i^0 h_\theta)}{\exp(y_i^0 h_\theta) + k\,q(y_i^0)}$
1. the probability that a noise word is classified as noise:
$p(y_i^t=0|x_i,\theta)=\frac{k\,q(y_i^t)}{\exp(y_i^t h_\theta) + k\,q(y_i^t)},\quad t=1,\cdots,k$
1. the objective for this sample (a log-likelihood to be maximized; its negative is the cost to minimize):
$\ell_{nce}^{(i)}=\log p(y_i^0=1|x_i,\theta)+\sum_{t=1}^k{\log p(y_i^t=0|x_i,\theta)}$
1. the overall objective for a dataset of $N$ samples:
$\ell_{nce}=\frac{1}{N}\sum_{i=1}^N{\left\{\log p(y_i^0=1|x_i,\theta)+\sum_{t=1}^k{\log p(y_i^t=0|x_i,\theta)}\right\}}$
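The four formulas above can be sketched in NumPy as follows. This is a naive illustration rather than a reference implementation: it assumes that `scores[t]` already holds the exponent $y_i^t h_\theta$ for each of the $k+1$ words and that `q_vals[t]` holds $q(y_i^t)$. The function names are hypothetical, and the direct use of `np.exp` is not numerically robust for large scores.

```python
import numpy as np

def nce_sample_objective(scores, q_vals, k):
    """Log-likelihood of one sample.

    scores[0] and q_vals[0] belong to the true word; entries 1..k belong to the noise words.
    scores[t] is assumed to be the exponent y_i^t h_theta from the formulas.
    """
    exp_s = np.exp(scores)
    kq = k * q_vals
    denom = exp_s + kq
    log_p_true = np.log(exp_s[0] / denom[0])    # log p(y^0 = 1 | x, theta)
    log_p_noise = np.log(kq[1:] / denom[1:])    # log p(y^t = 0 | x, theta), t = 1..k
    return log_p_true + log_p_noise.sum()

def nce_dataset_objective(all_scores, all_q, k):
    """Average of the per-sample objectives over the N samples."""
    return np.mean([nce_sample_objective(s, q, k) for s, q in zip(all_scores, all_q)])

# Toy check (assumed values): with all scores 0 and k * q = 1, every probability is 1/2.
k = 2
scores = np.zeros(k + 1)
q_vals = np.full(k + 1, 0.5)
value = nce_sample_objective(scores, q_vals, k)
```

In the toy check each of the three log-probabilities equals $\log\frac{1}{2}$, so `value` is $3\log\frac{1}{2}$. To train with a minimizer, one would use the negative of this quantity as the loss.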