The NLP-ista — Feature Engineering with Bag of Words

ShipItParrot
6 min read · Dec 10, 2022


Ahoy parrot! In this series, we will be soaring into the world of natural language processing and exploring how we can build kick-ass use cases around text!

ChatGPT and GPT-3 have made incredible waves! GPT-3’s Codex model and GitHub Copilot have made the lives of software engineers easier with reliable code suggestions, and ChatGPT has stunned millions of users in its free beta!

In this series, we will be exploring NLP together, starting with the question:

Q: How do we prepare our text data to be understood by machine learning models?

Welcome to Feature Engineering, fellow parrots! We will learn how to turn our messy text data into something our machine learning models can understand and use!

We will be exploring one of the simplest Feature Engineering techniques today: Bag of Words!

Bag of words is a simple and widely used approach to representing text data as numerical vectors. It works by creating a vocabulary of all the unique words in the text data, and then representing each document (e.g. an email) as a vector of the counts of each word in the vocabulary.

For example, let’s say we have two sentences: “I love sunflower seeds” and “I hate millet seeds”. The bag of words representation of these sentences would look like this:

{'I': 1, 'love': 1, 'sunflower': 1, 'seeds': 1, 'hate': 0, 'millet': 0}, 
{'I': 1, 'love': 0, 'sunflower': 0, 'seeds': 1, 'hate': 1, 'millet': 1}

As you can see, each sentence is represented as a vector of the counts of each word in the vocabulary.

The first sentence has a count of 1 for the words “I”, “love”, “sunflower”, and “seeds”, and a count of 0 for the words “hate” and “millet”.

The second sentence has a count of 1 for “I”, “seeds”, “hate”, and “millet”, and a count of 0 for “love” and “sunflower”.
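To make the vector idea explicit: if we fix an order for the vocabulary, each dictionary becomes a plain list of counts, which is what a model actually consumes. Here is a minimal sketch (the vocabulary order below is an arbitrary choice for illustration):

vocabulary: list[str] = ["I", "love", "sunflower", "seeds", "hate", "millet"]

def to_vector(bag_of_words: dict[str, int]) -> list[int]:
    # Words missing from the dictionary default to a count of 0
    return [bag_of_words.get(word, 0) for word in vocabulary]

# "I love sunflower seeds" -> [1, 1, 1, 1, 0, 0]
print(to_vector({"I": 1, "love": 1, "sunflower": 1, "seeds": 1}))

# "I hate millet seeds" -> [1, 0, 0, 1, 1, 1]
print(to_vector({"I": 1, "hate": 1, "millet": 1, "seeds": 1}))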

Here is an example of how we would implement a simple word count for a sentence!

def single_sentence_bag_of_words(sentence: str) -> dict[str, int]:
    """
    Takes in one sentence, and returns a bag of words dictionary for the sentence.

    The dictionary contains each word in the sentence as a key,
    and the number of times it appears in the sentence as the value.
    """
    bag_of_words: dict[str, int] = {}
    words: list[str] = sentence.split()
    for word in words:
        # count each occurrence of the word
        bag_of_words[word] = bag_of_words.get(word, 0) + 1
    return bag_of_words

if __name__ == "__main__":
    sentence_one: str = "I love sunflower seeds"
    sentence_two: str = "I hate millet seeds"

    # first_bag_of_words: {'I': 1, 'love': 1, 'sunflower': 1, 'seeds': 1}
    first_bag_of_words: dict[str, int] = single_sentence_bag_of_words(sentence_one)
    print(f"first_bag_of_words: {first_bag_of_words}")

    # second_bag_of_words: {'I': 1, 'hate': 1, 'millet': 1, 'seeds': 1}
    second_bag_of_words: dict[str, int] = single_sentence_bag_of_words(sentence_two)
    print(f"second_bag_of_words: {second_bag_of_words}")

The function creates a dictionary, splits the sentence into words, and increments the count of each word in the dictionary. It then returns the dictionary. The __main__ block uses the function to create a bag of words dictionary for each of our two sentences.

You sharp parrots will notice:

Hey! The bag of words for each sentence does not contain the words from the other sentence!

And yes, you are right! Let’s take it a step further and edit the function above to take in multiple sentences. We will also make every word from every sentence appear in each sentence’s bag of words, with a count of 0 where the word is absent.

def multiple_sentence_bag_of_words(sentences: list[str]) -> list[dict[str, int]]:
    """
    Takes in a list of sentences, and returns a bag of words dictionary for each sentence.

    Each dictionary contains every word seen across all passed sentences as a key,
    and the number of times that word appears in the given sentence as the value.
    """
    # collect every unique word across all sentences
    all_words: set[str] = set()
    for sentence in sentences:
        words: list[str] = sentence.split()
        for word in words:
            all_words.add(word)

    bag_of_words: list[dict[str, int]] = []

    # build a bag of words for each sentence
    for sentence in sentences:
        current_bag_of_words: dict[str, int] = {}
        words = sentence.split()
        for word in words:
            current_bag_of_words[word] = current_bag_of_words.get(word, 0) + 1

        # add words found only in other sentences, with a count of 0
        for word in all_words:
            current_bag_of_words[word] = current_bag_of_words.get(word, 0)
        bag_of_words.append(current_bag_of_words)
    return bag_of_words

if __name__ == "__main__":
    sentence_one: str = "I love sunflower seeds"
    sentence_two: str = "I hate millet seeds"

    """
    list_of_bag_of_words: [
        {'I': 1, 'love': 1, 'sunflower': 1, 'seeds': 1, 'hate': 0, 'millet': 0},
        {'I': 1, 'love': 0, 'sunflower': 0, 'seeds': 1, 'hate': 1, 'millet': 1}
    ]
    """
    list_of_bag_of_words: list[dict[str, int]] = multiple_sentence_bag_of_words(
        [sentence_one, sentence_two]
    )
    print(f"list_of_bag_of_words: {list_of_bag_of_words}")

Okay! We got the output we desired!

Of course, there are many smart parrots out there who have open-sourced better, production-level implementations and made it easy for us to use Bag of Words!

For example, we can use scikit-learn’s CountVectorizer class:

from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix

"""
In scikit-learn, the CountVectorizer class is a commonly used implementation of bag of words.

It takes in a list of strings (e.g. a list of sentences), and returns a matrix of the counts of each word in the vocabulary.

We use the CountVectorizer to create a bag of words representation of a list of sentences.

The resulting bag of words vectors are represented as a sparse matrix,
where each row corresponds to a sentence,
and each column corresponds to a word in the vocabulary.

The entries in the matrix represent the counts of each word in the vocabulary for each sentence.
"""

if __name__ == "__main__":
    # Create a list of sentences
    sentences: list[str] = [
        "I love sunflower seeds",
        "I hate millet seeds",
    ]

    # Create a CountVectorizer object
    vectorizer: CountVectorizer = CountVectorizer()

    # Fit the vectorizer on the sentences, and transform them into bag of words vectors
    # X is a compressed sparse row matrix
    X: csr_matrix = vectorizer.fit_transform(sentences)

    # Print the feature words
    # feature names: {'love': 1, 'sunflower': 4, 'seeds': 3, 'hate': 0, 'millet': 2}
    print(f"feature names: {vectorizer.vocabulary_}")

    # Print the bag of words vectors
    print(f"matrix: {X}")
"""
matrix:
1st sentence: "I love sunflower seeds"
2nd sentence: "I hate millet seeds",
feature names: {'love': 1, 'sunflower': 4, 'seeds': 3, 'hate': 0, 'millet': 2}

matrix:
word count

1st sentence: "I love sunflower seeds"
(0, 1) 1 # love
(0, 4) 1 # sunflower
(0, 3) 1 # seeds

2nd sentence: "I hate millet seeds",
(1, 3) 1 # seeds
(1, 0) 1 # hate
(1, 2) 1 # millet

"""

The CountVectorizer class takes in a list of strings (e.g. a list of sentences), and returns a matrix of the counts of each word in the vocabulary.

The resulting bag of words vectors are represented as a sparse matrix, where each row corresponds to a sentence, and each column corresponds to a word in the vocabulary. The entries in the matrix represent the counts of each word in the vocabulary for each sentence. Notice that “I” is missing from the vocabulary: CountVectorizer lowercases the input and, by default, its token pattern ignores single-character tokens.
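If you prefer to see the familiar dense vectors, you can convert the sparse matrix with toarray() and line the columns up against the learned vocabulary. A small sketch, continuing from the example above (get_feature_names_out needs scikit-learn 1.0 or newer):

    # Columns follow the sorted vocabulary: ['hate' 'love' 'millet' 'seeds' 'sunflower']
    print(vectorizer.get_feature_names_out())

    # [[0 1 0 1 1]   <- "I love sunflower seeds"
    #  [1 0 1 1 0]]  <- "I hate millet seeds"
    print(X.toarray())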

Hrmm.. but what are the strengths and weaknesses of this approach?

One of the main strengths of bag of words is its simplicity. It is easy for parrots like us to implement and understand!

It can also be used with a wide range of machine learning algorithms, so you don’t have to be a rocket scientist to make it work.

It also allows us to easily compare and measure the similarity between different documents, based on the counts of the words in the vocabulary.
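For instance, we can feed the matrix X from the CountVectorizer example above straight into scikit-learn’s cosine_similarity to compare the two sentences. A quick sketch (the 0.33 figure follows from the vectors above, where the sentences overlap only on “seeds”):

from sklearn.metrics.pairwise import cosine_similarity

# Compare every pair of sentence vectors from the CountVectorizer example above.
# The result is a 2x2 matrix where entry (i, j) is the cosine similarity
# between sentence i and sentence j. Here the off-diagonal value is about 0.33,
# since the two sentences share only the word "seeds".
print(cosine_similarity(X))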

However, bag of words also has some weaknesses. One of the main weaknesses is that it ignores the order of the words in the text data.

This means that two sentences containing exactly the same words in a different order end up with identical representations.
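To see this concretely, here is a quick sketch: two sentences built from the same words, in a different order, produce identical bag of words vectors, even though they mean very different things.

from sklearn.feature_extraction.text import CountVectorizer

sentences: list[str] = ["the parrot ate the seeds", "the seeds ate the parrot"]
X = CountVectorizer().fit_transform(sentences)

# Both rows come out identical, even though the meaning differs:
# [[1 1 1 2]
#  [1 1 1 2]]
# (columns, in sorted order: ate, parrot, seeds, the)
# Bag of words cannot tell who ate whom!
print(X.toarray())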

This loss of word order can make it difficult for the model to capture the meaning and context of the text data. But don’t worry, there are ways to fix this, and we will chat about them in future articles!

Aight, and that’s all, parrots!

Please check out this GitHub repository if you want the code samples above!

Thank you, and see you parrots in the next article!
