How does Negative Sampling work in word2vec?

Edward Ma
4 min read · May 12, 2019
Photo by Tim Bennett on Unsplash

During neural network training, the model adjusts all neuron weights so that it learns to make correct predictions. In NLP, the vocabulary can easily exceed 100k (or even 1M) words, and updating the output weights for every word becomes a performance (training time) concern. How can we reduce the work per training sample in a better way? Hierarchical softmax was introduced earlier, but word2vec also introduces the negative sampling methodology to address this problem.

What is that?

For each word we try to predict (we call it the context word), the words within a certain window around it (e.g. 5) are considered positive words. We can then either use those positive words to predict the context word (CBOW) or use the context word to predict the positive words (skip-gram).
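A minimal sketch of this pairing step, assuming a toy tokenized sentence and a hypothetical helper name, might look like this (skip-gram style, where each word is paired with its in-window positive words):

```python
# Collect (word, positive word) pairs from a tokenized sentence.
# `window=2` means two words on each side count as positive words.
def positive_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        # Every word within `window` positions of the current word is a positive word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(positive_pairs("the quick brown fox jumps".split()))
```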

However, we also need false labels to train the prediction. To create negative cases, we can pick some non-surrounding words (we call them negative words) and use them as false labels.
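As a rough illustration (the function and variable names here are hypothetical), any vocabulary word that is neither the context word nor one of its in-window positive words can serve as a negative word:

```python
import random

# Draw k negative words: vocabulary words outside the current word's window.
def negative_words(vocab, center, window_words, k=5):
    candidates = [w for w in vocab if w != center and w not in window_words]
    return random.sample(candidates, min(k, len(candidates)))

vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog", "cat", "sat"]
print(negative_words(vocab, center="fox", window_words={"brown", "jumps"}, k=3))
```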

How do we select negative samples?

“small eggs on table and in the cup” by Gaelle Marcel on Unsplash

To reduce the number of neuron weights updated per step (and therefore the training time) while keeping good prediction quality, negative sampling is introduced in word2vec. For…
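The article is truncated here, but for reference, the original word2vec paper draws negative samples from the unigram (word frequency) distribution raised to the 3/4 power, which gives rare words a somewhat higher chance than their raw frequency would. A minimal sketch of that idea, using an illustrative toy corpus:

```python
import random
from collections import Counter

# Toy corpus; in practice these counts come from the full training corpus.
corpus = "the quick brown fox jumps over the lazy dog the dog sleeps".split()
counts = Counter(corpus)
words = list(counts)
# Raise each word's frequency to the 3/4 power to flatten the distribution.
weights = [counts[w] ** 0.75 for w in words]

def sample_negatives(k=5):
    # Sample k negative words according to the adjusted unigram distribution.
    return random.choices(words, weights=weights, k=k)

print(sample_negatives(5))
```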


Edward Ma

Focused on Natural Language Processing and Data Science Platform Architecture. https://makcedward.github.io/