Introduction to classifying text using Bag of Words

Vinh Bui
3 min read · Oct 4, 2019


This year I have a research project in my club: predicting the score of an essay from collected data. For that, I used a dataset available for free on Kaggle (https://www.kaggle.com/c/asap-aes).

Photo by Markus Spiske on Unsplash

Bag of Words

The method applied here is bag-of-words: we count the frequency of each word in the text, regardless of the order of the words.

In this guide, I use these packages: pandas, nltk, re, and scikit-learn's CountVectorizer, KNeighborsClassifier, and train_test_split.

Let's start by reading the data with the following code:
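The embedded snippet did not survive the export; here is a minimal sketch. The file name, separator, and encoding are assumptions about the Kaggle ASAP-AES download, and the inline sample frame is a stand-in so the snippet runs without it:

```python
import pandas as pd

# Assumed file name and encoding for the Kaggle ASAP-AES training data:
# data = pd.read_csv("training_set_rel3.tsv", sep="\t", encoding="latin-1")

# Tiny inline stand-in so the snippet runs without the download:
data = pd.DataFrame({
    "Essay": ["I went to school with @PERSON1.",
              "Computers help people learn and communicate."],
    "Score": [2, 3],
})
print(data.shape)
```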

Before doing any work, it is important to see what the data look like:
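A quick look at the first rows, column types, and non-null counts does the job; sketched here against a toy frame, since the original cell is missing (with the real download, `data` would be the DataFrame returned by `read_csv`):

```python
import pandas as pd

# Toy stand-in for the loaded ASAP-AES frame:
data = pd.DataFrame({
    "Essay": ["I went to school with @PERSON1.",
              "Computers help people learn and communicate."],
    "Score": [2, 3],
})

print(data.head())  # first rows of the table
data.info()         # column dtypes and non-null counts
```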

I saw that the data have been anonymized, so I think this information is not necessary. Let's remove it:
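The ASAP essays replace names and places with tokens such as @PERSON1 or @LOCATION2 (my assumption about what "this information" refers to); those can be stripped with a regular expression. A sketch against a toy frame:

```python
import pandas as pd

data = pd.DataFrame({"Essay": ["I went to school with @PERSON1.",
                               "@LOCATION2 is where I live."]})

# Drop anonymization tokens of the form @WORD123:
data["Essay"] = data["Essay"].str.replace(r"@\w+", "", regex=True)
print(data["Essay"].tolist())
```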

The result:

Stemming

An important step of text mining is stemming. Word families such as go, going, and goes are variants of the word "go", so grouping them together as one word is necessary. There are many algorithms that do stemming; I chose the Porter Stemmer to get the job done:
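A minimal sketch of applying nltk's PorterStemmer token by token (splitting on whitespace for simplicity; the helper name `stem_text` is my own):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    # Stem each whitespace-separated token and rejoin into one string.
    return " ".join(stemmer.stem(word) for word in text.split())

print(stem_text("running through the gates"))
```

In the full pipeline, this would be applied to every essay, e.g. `data["Essay"] = data["Essay"].apply(stem_text)`.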

After these final cleaning steps, we have processed data. Now we can vectorize it with CountVectorizer:
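A sketch with a toy corpus standing in for the cleaned, stemmed essays:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus in place of the processed essay column:
essays = ["i go to school",
          "school help peopl learn",
          "peopl go to learn"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(essays)  # sparse document-term count matrix

print(sorted(vectorizer.vocabulary_))  # the learned vocabulary
print(X.toarray())                     # one row per essay, one column per word
```

Note that the default tokenizer drops single-character tokens such as "i".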

This builds a matrix of the words and their frequencies, which we can then fit into a classifier algorithm. There are many classifiers we could use; in this guide I use KNeighborsClassifier.

Before fitting, we should split the data into train and test sets, then fit the model. The target label is "Score":
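End to end, the split-and-fit step looks roughly like this. The toy essays and scores below are stand-ins for the real data, and the `n_neighbors` value is an assumption (the article does not state the author's settings):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the cleaned essays and their "Score" labels:
essays = [
    "good essay with strong argument", "weak short text",
    "strong clear argument and evidence", "short vague text",
    "clear evidence and good structure", "vague and short",
    "good structure strong evidence", "weak vague argument",
    "clear strong good essay", "short weak vague essay",
]
scores = [3, 1, 3, 1, 3, 1, 3, 1, 3, 1]

X = CountVectorizer().fit_transform(essays)
X_train, X_test, y_train, y_test = train_test_split(
    X, scores, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3)  # assumed k, not from the article
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on the held-out set
```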

Conclusion

The accuracy of the method is 32.7%. Quite low, but this is just the start. We need to implement more methods to clean the data, try different algorithms, or even combine several algorithms. In a later guide, I will try to improve the accuracy of the prediction.


Vinh Bui

Undergraduate student at UC Berkeley, major in Data Science