Understanding the Continuous Bag of Words (CBOW) Model: Architecture, Working Mechanism and Math Behind It

Code Thulo
Feb 1, 2023


The Continuous Bag of Words (CBOW) model is a popular method for training word embeddings, which are representations of words in a numerical vector space. The goal of CBOW is to predict a target word given the context words in a sentence. This is in contrast to the skip-gram model, which predicts context words given a target word.

[Figure: Continuous Bag of Words (CBOW) model]

The architecture of the CBOW model is relatively simple, consisting of an input layer, a hidden layer, and an output layer. The input layer represents the context words, the hidden layer learns the word embeddings, and the output layer predicts the target word.

Each context word at the input layer is typically represented by a one-hot encoded vector, where each element in the vector corresponds to a specific word in the vocabulary. For example, if the vocabulary contains 10,000 words, each input vector will have 10,000 elements.
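
As a rough sketch, one-hot encoding could be implemented like this (the five-word vocabulary here is made up purely for illustration):

```python
import numpy as np

# Toy vocabulary; a real one would be built from the training corpus.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a vector of vocabulary length with a 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("cat"))  # [0. 1. 0. 0. 0.]
```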

The hidden layer is where the word embeddings are learned. It is a dense layer whose number of neurons equals the embedding dimension, a hyperparameter that is typically much smaller than the vocabulary size (for example, 100 to 300 neurons for a 10,000-word vocabulary). Each word's embedding is the vector of weights connecting its input neuron to this layer.

The output layer is also a dense layer, with each neuron representing a specific word in the vocabulary, so the number of neurons in the output layer equals the vocabulary size.
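
To make the layer shapes concrete, here is a minimal sketch of this three-layer architecture in PyTorch; the sizes are illustrative assumptions, and nn.Embedding stands in for the one-hot input multiplied by the input weight matrix:

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        # Input -> hidden: an embedding lookup is equivalent to multiplying
        # a one-hot vector by the input-to-hidden weight matrix.
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        # Hidden -> output: one score (logit) per vocabulary word.
        self.output = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_idxs):
        # Average the context word embeddings to form the hidden layer.
        h = self.embeddings(context_idxs).mean(dim=-2)
        return self.output(h)  # raw logits over the vocabulary

model = CBOW(vocab_size=10_000, embed_dim=300)
logits = model(torch.tensor([[12, 45, 78, 90]]))  # four context word indices
print(logits.shape)  # torch.Size([1, 10000])
```

Note that forward returns raw logits rather than probabilities; in practice the softmax is folded into the loss function (nn.CrossEntropyLoss) during training.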

The working mechanism of CBOW can be mathematically represented as follows (a small numerical example of this forward pass appears after the figure below):

  • Let x(1), x(2), …, x(n) be the context words, where x(i) is a one-hot encoded vector.
  • Let W(1) be the weight matrix connecting the input layer to the hidden layer, and W(2) be the weight matrix connecting the hidden layer to the output layer.
  • Let h be the hidden layer, which is the average of the input vectors projected through W(1), h = 1/n * W(1) * (x(1) + x(2) + … + x(n)), i.e., the average of the context words' embeddings.
  • Let y be the output layer, which is the probability distribution over the vocabulary, y = softmax(W(2) * h)
  • The target word is selected as the word with the highest probability in y.
[Figure: Continuous Bag of Words (CBOW) model]
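
As a sanity check on these equations, here is a minimal NumPy sketch of the forward pass with made-up sizes and random weights; the variable names mirror the formulas above:

```python
import numpy as np

np.random.seed(0)
V, d = 5, 3  # toy vocabulary size and embedding dimension

W1 = np.random.randn(d, V) * 0.1  # input -> hidden weights
W2 = np.random.randn(V, d) * 0.1  # hidden -> output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Two context words, one-hot encoded (indices 0 and 2 of the toy vocabulary).
x1, x2 = np.eye(V)[0], np.eye(V)[2]

h = W1 @ (x1 + x2) / 2   # h = 1/n * W(1) * (x(1) + ... + x(n))
y = softmax(W2 @ h)      # y = softmax(W(2) * h)

print(y)                  # probability distribution over the 5 words
print(int(np.argmax(y)))  # predicted target word index
```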

The training process for CBOW minimizes the difference between the predicted probability distribution y and the actual target word using a loss function such as cross-entropy loss. This process is repeated for every context window in the training corpus; after training, the weights of W(1), one vector per vocabulary word, serve as the learned word embeddings.
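
A minimal training sketch, again in NumPy with a single made-up context/target pair, might look like the following; it uses the standard gradient of softmax plus cross-entropy, d(loss)/d(logits) = y - t:

```python
import numpy as np

np.random.seed(0)
V, d, lr = 5, 3, 0.1  # toy vocabulary size, embedding dim, learning rate

W1 = np.random.randn(d, V) * 0.1  # input -> hidden
W2 = np.random.randn(V, d) * 0.1  # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A single toy training pair: two context word indices and one target index.
context, target = [0, 2], 1

for step in range(200):
    # Forward pass, exactly as in the equations above.
    x_avg = np.eye(V)[context].mean(axis=0)
    h = W1 @ x_avg
    y = softmax(W2 @ h)

    loss = -np.log(y[target])  # cross-entropy against the one-hot target

    # Backward pass: gradient of softmax + cross-entropy is (y - t).
    dlogits = y.copy()
    dlogits[target] -= 1.0
    dW2 = np.outer(dlogits, h)
    dh = W2.T @ dlogits
    dW1 = np.outer(dh, x_avg)

    W2 -= lr * dW2
    W1 -= lr * dW1

print(f"loss after training: {loss:.4f}")  # approaches 0 as the model fits
```

After training, each column of W1 is the embedding of the corresponding vocabulary word.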

One of the key advantages of CBOW is its efficiency. Because the model makes a single prediction per context window rather than one prediction per context word, it trains several times faster than the skip-gram model, making it better suited to very large datasets. Skip-gram, by contrast, tends to perform better on smaller datasets and for rare words, while CBOW gives slightly better representations of frequent words.

Overall, the Continuous Bag of Words (CBOW) model is a powerful tool for training word embeddings. Its simplicity and efficiency make it a popular choice for natural language processing tasks such as text classification and machine translation.
