Working With Text Data In Machine Learning

Transforming text data into a structured format for machine learning modeling

Chigozie Mbonu
Hamoye Blog
Dec 22, 2020


Have you ever wondered if you could train a machine learning model using text as the training data, without having to perform encoding using a one-hot encoder or the pandas get_dummies function? Well, to answer that question: yes, it is entirely possible.

Working with text data in machine learning falls under natural language processing (NLP), the field devoted to algorithms and methods for processing human languages with computers. Some examples of machine learning applications that use NLP include topic modeling, language translation, and sentiment analysis.

Generally, to build a machine learning model, we need data in a structured format, that is, with rows and columns. This is a major problem when working with text data, because text comes in an unstructured format: sentences, paragraphs, and words. It is not immediately clear what counts as an observation or a feature. Consequently, we have to carry out a large amount of preprocessing on the raw text to transform it into a structure that a machine learning algorithm can recognize. To better understand this process, let us define some terms that are used extensively in NLP.

i. Corpus: in NLP, this is the body of text to be analyzed. A corpus can be an article from a web page, a tweet, a story, etc.; it is typically an extensive body of text.

ii. Document: a document is a single unit of observation from the corpus. The size of a document is strictly at your discretion and is mostly influenced by the size of the corpus: you could choose to make each sentence, each paragraph, or even each single word (not advisable) a document.

Working With Spacy in NLP

Spacy is a very powerful Python package that provides the tools and functionality for NLP, much like the way we use Scikit-Learn for building machine learning models. Spacy can be used to analyze almost any language that can be written as text; all you have to do is install the language model and load it. Let's demonstrate how Spacy works. Say we have a corpus (a couple of sentences) and we want to define our documents from it: we can pass the text into Spacy, which analyzes it and splits it into sentences based on the punctuation in the corpus, and from these sentences we choose what to treat as a document. Let us demonstrate this with some simple sentences:

import spacy

# load the English model; recent Spacy releases name it 'en_core_web_sm'
# (older versions accepted the shortcut 'en')
nlp = spacy.load('en_core_web_sm')

doc = nlp('I would come to the house. Please, leave me alone. Would you come back home?')

for sent in doc.sents:
    print(sent)

Output:

I would come to the house.

Please, leave me alone.

Would you come back home?

Interesting, right? Another useful piece of functionality that Spacy provides is part-of-speech tagging, the process of identifying whether a word is, for example, a noun, adjective, or pronoun. This is done through the attribute "pos_", which gives the coarse-grained part of speech (e.g. NOUN), and "tag_", which gives the more detailed, fine-grained tag (e.g. PRP for a personal pronoun). The output of the tag_ and pos_ attributes is in an abbreviated format, e.g. ADJ for an adjective. The full meaning of each abbreviation can be found in the Spacy documentation, or by passing the abbreviation to the spacy.explain function.
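As a minimal sketch (reusing the nlp object loaded above, on a made-up sentence):

for token in nlp('The quick brown fox gave me a gift.'):
    # word, coarse part of speech, fine-grained tag, and its explanation
    print(token.text, token.pos_, token.tag_, spacy.explain(token.tag_))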

The Bag of Words Model

As stated earlier, machine learning models need to ingest data in a structured format with rows and columns. For example, let's assume we want to build a classifier that can distinguish between the two senses of a homograph, say a corpus about "Bat" the animal versus "bat" the object, or "Python" the animal versus "Python" the programming language. If we go ahead and collect an article on each of these two topics, we would first pass them into Spacy to split each body of text into documents. However, at this point, all we have are just words in an unorganized form.

How do we transform these words into a structured format? One thing we know is that words like "prey", "food", and "habitat" will appear frequently in documents referring to python the animal, while words like "memory", "byte", and "variable" will appear in documents referring to Python the programming language. One technique we can adopt is to transform the text data into a matrix of the count of appearances of each word in each document. This technique is called the bag of words model.

The name bag of words is derived from the fact that each document is viewed as a bag holding all the words, disregarding word order, context, and grammar. After applying the bag of words model to a corpus, the resulting matrix will exhibit patterns that a machine learning model can exploit. Below is the output of applying the bag of words model to a corpus of two documents describing python the animal and python the programming language.

Bag of words model counts

The table shows, for example, that the word "although" appeared once in document 1 and not at all in document 2. Scikit-learn provides two transformers that implement the bag of words model: CountVectorizer and HashingVectorizer.
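As a minimal illustration of the counting itself (with two made-up one-line documents), the same idea can be sketched with Python's collections.Counter; the scikit-learn transformers below do this far more efficiently and at scale:

from collections import Counter

docs = ['the python hunts its prey at night',
        'the python interpreter stores each variable in memory']

vocabulary = sorted(set(' '.join(docs).split()))
for i, doc in enumerate(docs, start=1):
    counts = Counter(doc.split())
    # one row per document, one column per word in the shared vocabulary
    print('document', i, [counts[word] for word in vocabulary])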

CountVectorizer

The CountVectorizer transformer is found in scikit-learn; scikit-learn uses the word vectorizer for transformers that convert raw input (such as text documents or dictionaries) into a numerical feature matrix. Since it is a transformer, we first fit the object and then call the transform method, which returns a sparse matrix. A sparse matrix is a more efficient way of storing a matrix: if a matrix has mostly zero entries, it is better to store only the non-zero entries along with their row and column positions. Sparse matrices have a toarray() method that returns the full matrix, but calling it on a large matrix may cause memory issues. Below are some key hyperparameters of the CountVectorizer:

i. min_df: only count words that appear in at least a minimum number (or proportion) of documents.

ii. max_df: only count words that do not appear in more than a maximum number (or proportion) of documents.

iii. max_features: limit the number of generated features, keeping only the most frequent terms.

After fitting a CountVectorizer object, the following method and attribute help with determining what index belongs to what word.

get_feature_names(): Returns a list of words used as features; the index of a word in this list corresponds to its column index. (Recent scikit-learn versions rename this method get_feature_names_out().)

vocabulary_: A dictionary mapping a word to its corresponding feature index.
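A minimal sketch tying these pieces together (with a made-up two-document corpus; get_feature_names_out is the name used in recent scikit-learn releases, while older versions expose get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['the python hunts its prey at night',
        'the python interpreter stores each variable in memory']

vectorizer = CountVectorizer(min_df=1, max_features=20)
X = vectorizer.fit_transform(docs)  # sparse document-term matrix of raw counts

print(X.toarray())                         # full matrix (fine for a tiny corpus)
print(vectorizer.get_feature_names_out())  # column index -> word
print(vectorizer.vocabulary_)              # word -> column index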

HashingVectorizer

The CountVectorizer requires that we hold the mapping of words to features in memory. In addition, document processing cannot be parallelized, because each worker needs the same mapping of word to column index. CountVectorizer objects are said to have state: they retain information from previous interactions and usage. A trick to improve on the CountVectorizer is to use a hash function to convert the words into numbers. A hash function is a function that converts an input into a deterministic value; in our context, we use it to convert a word into a number, and the resulting number determines which feature column the word is mapped to. Ideally, no two inputs would produce the same hash value, but collisions cannot be avoided entirely. When different inputs generate the same hash, it is referred to as a "hash collision".
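A minimal sketch of the idea (using Python's built-in hash purely for illustration; scikit-learn's HashingVectorizer uses its own hashing function internally):

n_features = 8  # number of feature columns

for word in ['python', 'prey', 'memory', 'variable']:
    # map each word to a column; deterministic within one run
    # (Python randomizes string hashing between runs)
    column = abs(hash(word)) % n_features
    print(word, '->', column)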

The HashingVectorizer class is similar to the CountVectorizer, but it uses a hash function, which makes it stateless. The stateless nature of HashingVectorizer objects allows the counting process to be parallelized.

There are two main disadvantages of the HashingVectorizer:

i. Hash collisions are possible but in practice are often inconsequential.

ii. Because the transformer is stateless, there is no mapping from word to feature index.
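A minimal usage sketch (reusing the made-up docs list from the CountVectorizer sketch above):

from sklearn.feature_extraction.text import HashingVectorizer

# stateless: no vocabulary is stored, so there is nothing to fit
hasher = HashingVectorizer(n_features=2**10, alternate_sign=False)
X = hasher.transform(docs)

print(X.shape)  # (number of documents, n_features)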

Term Frequency-Inverse Document Frequency

The CountVectorizer and HashingVectorizer create a feature matrix of raw counts. Using raw counts has two problems: documents vary widely in length, and the counts will be large for common words such as "the" and "is". Therefore, we need a weighting scheme that accounts for both. The term frequency-inverse document frequency, tf-idf for short, is a popular weighting scheme used to improve the simple count-based data from the bag of words model. It is the product of two values, the term frequency and the inverse document frequency. There are several variants, but a popular definition of the term frequency is given below:

\mathrm{tf}(t, d) = \frac{\mathrm{counts}(t, d)}{\sqrt{\sum_{t' \in d} \mathrm{counts}(t', d)^2}}
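The inverse document frequency can be defined in several ways as well; a common smoothed variant (the one scikit-learn uses by default) is:

\mathrm{idf}(t) = \ln\left(\frac{1 + n}{1 + \mathrm{df}(t)}\right) + 1

where n is the number of documents in the corpus and df(t) is the number of documents containing the term t. The final tf-idf weight is the product tf(t, d) × idf(t), and scikit-learn additionally normalizes each document's vector (L2 normalization by default).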

With the idf weighting, words that are very common throughout the documents get weighted down, while the counts of rare words get weighted up. With the tf-idf weighting scheme, a machine learning model will have an easier time learning the patterns needed to properly predict labels.

There are two ways to apply the tf-idf weighting in scikit-learn, differing in what input they work on. TfidfVectorizer works on an array of documents (e.g., list of sentences) while the TfidfTransformer works on a count matrix, like the outputs of HashingVectorizer and CountVectorizer. TfidfVectorizer encapsulates the CountVectorizer and TfidfTransformer into one class.

When we use the TfidfTransformer, we no longer have raw counts but re-weighted counts in our feature matrix. We can use the "idf_" attribute of the fitted tf-idf transformer to inspect the top idf weights and their corresponding terms.
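A minimal sketch (again reusing the made-up docs list; TfidfVectorizer performs the counting and the re-weighting in one step):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)  # re-weighted counts, L2-normalized per document

terms = tfidf.get_feature_names_out()
top = np.argsort(tfidf.idf_)[::-1][:5]  # indices of the largest idf weights
for i in top:
    print(terms[i], round(tfidf.idf_[i], 3))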

Improving The Signal

Apart from using the tf-idf weights instead of the raw counts to improve the signal in our model, there are several other approaches that can help boost signal strength. Some of these approaches are listed below, with a short sketch following the list:

  • Lemmatization: lemmatization is the process of reducing a word to its lemma, the dictionary form of the word. Spacy does not have a stemming algorithm, but it does offer lemmatization. Each word analyzed by Spacy has the attribute "lemma_", which returns the lemma of the word. Words like give, given, and gave would each generate an individual count when they appear in a document, but applying lemmatization in the bag of words model counts all of them as "give", the lemma of the three words. To use lemmatization, you pass a lemmatizer function to the tokenizer keyword of whichever text vectorizer you are using.
  • Stop Words: stop words are words that are excluded from the counting process. Words such as "the", "a", and "or" are likely to be so common throughout a corpus that they contribute no useful signal to the data set. Furthermore, omitting these words reduces an already high-dimensional data set, so it is best not to have them as features or count them in the analysis. Spacy provides a set of around 300 commonly used English words as stop words. When using stop words, it is advisable to first examine the entries in case there are certain words you want included or excluded. Since the words are provided as a Python set, we can use the methods available to set objects to add or remove words from the stop words object.
  • Tokenization and ngram_range: tokenization refers to dividing a document into the pieces to be counted. Sometimes it is useful to count a sequence of words such as "higher dimension" or "virtual reality" as one token instead of as single words; counting these bigrams can sometimes boost performance. More generally, an n-gram is a sequence of n words. In scikit-learn, n-grams can be included by setting the hyperparameter ngram_range = (min_n, max_n) on the vectorizer, where min_n and max_n are the lower and upper bounds of the range of n-grams to include. For example, ngram_range = (1, 2) will include unigrams and bigrams, while ngram_range = (2, 2) will only count bigrams.
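A minimal sketch combining the three ideas above (assuming the Spacy nlp object loaded earlier; the spacy_lemmatizer helper name is made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

def spacy_lemmatizer(text):
    # tokenize with Spacy, drop punctuation and Spacy's stop words, return lemmas
    return [token.lemma_ for token in nlp(text)
            if not token.is_punct and not token.is_stop]

vectorizer = CountVectorizer(tokenizer=spacy_lemmatizer,  # lemmatization + stop words
                             ngram_range=(1, 2))          # unigrams and bigrams
X = vectorizer.fit_transform(['He gave away the book.',
                              'She gives books to her friends.'])
print(vectorizer.get_feature_names_out())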

We have come a long way from collecting plain text to having structured data that is ingestible by a machine learning model. With all these transformations in place, we can go ahead and use the numerical counts of words to train a word usage classifier, which takes a body of text and predicts what that text is referring to, provided you have labels for the observations.

In stage G of the Hamoye data science open source project, my group and I worked on a word usage classifier for movies, which could tell whether or not a movie is an animation movie simply from the job titles of the movie's crew. We performed similar transformations on the dataset to arrive at the classifier.

Conclusion

Although we can work with text data in NLP without using Spacy, Spacy is a very powerful tool that helps simplify the workload especially when trying to define our documents from a corpus.

I hope this little piece of mine has been educative, and I also suggest you use some other resources to learn more about Natural Language Processing. Until next time, keep learning!
