During neural network training, every weight is normally adjusted at each step so that the model learns to make correct predictions. In NLP, the vocabulary can exceed 100k (or even 1M) words, so updating all output weights becomes a serious performance (training time) concern. How can we reduce the work per training sample in a better way? Hierarchical softmax was an earlier answer, but word2vec also introduces the negative sampling methodology to resolve this problem.
What is that?
When we pick a target word (we call it the context word), the words within a certain window around it (e.g. 5) are considered positive words. We can then either use those positive words to predict the context word (CBOW) or use the context word to predict the positive words (skip-gram).
However, we also need some false labels to train on, so we have to come up with negative cases. We can pick some non-surrounding words (we call them negative words) to serve as false labels.
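To make the pair generation concrete, here is a minimal sketch (not word2vec's actual implementation) that builds skip-gram training pairs: in-window words get a positive label, and randomly picked non-surrounding words act as false labels. The toy corpus, window size, and number of negatives per pair are illustrative assumptions.

```python
import random

# Illustrative toy corpus; real training uses a large text corpus.
corpus = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
vocab = sorted(set(corpus))
window = 2         # assumption: words this far from the center count as "positive"
num_negatives = 3  # assumption: negatives drawn per positive pair

pairs = []
for i, center in enumerate(corpus):
    # Positive words: neighbors inside the window around the center word.
    lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
    context = [corpus[j] for j in range(lo, hi) if j != i]
    for pos in context:
        pairs.append((center, pos, 1))  # label 1 = true neighbor
        # Negative words: vocabulary words outside the window (false labels).
        negatives = [w for w in vocab if w not in context and w != center]
        for neg in random.sample(negatives, num_negatives):
            pairs.append((center, neg, 0))  # label 0 = false label

print(pairs[:8])
```

With labeled pairs like these, the model only needs to score the one positive word and a handful of negatives per step, instead of the full vocabulary.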
How do we select negative samples?
To reduce the number of weight updates (and thus training time) while still getting good prediction results, negative sampling is introduced in word2vec. For…
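For reference, the word2vec paper draws negative samples from the unigram distribution raised to the 3/4 power, which slightly boosts the chance of picking rarer words. Below is a minimal sketch of that selection rule, assuming a toy corpus; the counts are illustrative.

```python
from collections import Counter
import random

# Illustrative toy corpus; in practice, counts come from the training corpus.
corpus = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
counts = Counter(corpus)
words = list(counts)

# P(w) is proportional to count(w) ** 0.75, the exponent used in the word2vec paper.
weights = [counts[w] ** 0.75 for w in words]

def draw_negatives(k):
    """Draw k negative words according to the smoothed unigram distribution."""
    return random.choices(words, weights=weights, k=k)

print(draw_negatives(5))
```

The 3/4 exponent sits between pure unigram frequency (exponent 1, which over-picks very common words like "the") and uniform sampling (exponent 0), and was found empirically to work well.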