Natural Language Processing (NLP) is one of the most important concepts in Machine Learning, as it combines artificial intelligence, linguistics, and computer science. An NLP model learns patterns in its input and uses them to make suggestions or recommendations. Autocorrect, spellcheck, and autocomplete are a few examples of NLP we use every day.
Today, we will put NLP into action to tell apart Reddit posts about the BMW E46 and the E90, using Logistic Regression in Python.
From each of the E46 and E90 subreddits, the 2,000 most recent posts were pulled via Pushshift’s API, resulting in a total of about 4,000 posts, or data points.
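A minimal sketch of how such a pull might look, using only the standard library. The endpoint URL, the `subreddit`, `size`, and `before` query parameters, and the `data` / `created_utc` response fields reflect Pushshift's public submission search as it was commonly used; its availability and page-size limits have changed over time, so treat all of them as assumptions. The helper names (`build_url`, `fetch_json`, `pull_posts`) are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; access rules and limits may differ today.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_url(subreddit, size=100, before=None):
    """Build a query URL for recent submissions in one subreddit."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        params["before"] = before  # only posts older than this epoch time
    return PUSHSHIFT_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Download and decode one page of results (performs a network request)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def pull_posts(subreddit, total=2000, page_size=100, fetch=fetch_json):
    """Page backwards in time until `total` posts are collected."""
    posts, before = [], None
    while len(posts) < total:
        page = fetch(build_url(subreddit, page_size, before)).get("data", [])
        if not page:
            break  # no older posts available
        posts.extend(page)
        # Continue from the oldest post seen on this page.
        before = min(post["created_utc"] for post in page)
    return posts[:total]
```

Calling `pull_posts("E46")` and `pull_posts("E90")` would then give the two sets of posts, assuming those are the subreddit names. The `fetch` argument is injectable so the paging logic can be tried without network access.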
A check of the data showed that some posts were only links. Links carry no text for this analysis to work with, so those posts were removed.
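A minimal sketch of how link posts might be dropped with pandas. The column names (`title`, `selftext`, `is_self`) follow Reddit's submission fields and are assumptions here, as are the toy rows; the exact filter used for the article's data may have differed.

```python
import pandas as pd

# Toy rows standing in for the pulled posts (hypothetical data).
df = pd.DataFrame({
    "title": ["Coolant leak help", "Check out this build", "New wheels"],
    "selftext": ["Found a puddle under the car.", "", "Just mounted them today."],
    "is_self": [True, False, True],
})

# Keep only self (text) posts whose body is non-empty.
has_text = df["selftext"].fillna("").str.strip() != ""
df = df[df["is_self"] & has_text].reset_index(drop=True)

print(df["title"].tolist())
```

Filtering on both `is_self` and a non-empty body also catches text posts whose body was left blank.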
In addition, many posts contained HTML artifacts left behind by the API process. These were also removed so that each post contains only the necessary text and numbers.
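One way this cleanup could be sketched with the standard library. Which artifacts actually appeared in the data is not stated, so the cases handled here (HTML entities, stray tags, zero-width spaces) are assumptions, and `clean_text` is a hypothetical helper.

```python
import html
import re


def clean_text(text):
    """Strip common HTML leftovers from one post."""
    text = html.unescape(text)             # e.g. "&amp;" becomes "&"
    text = re.sub(r"<[^>]+>", " ", text)   # drop any literal tags
    text = text.replace("\u200b", " ")     # zero-width space left by "&#x200B;"
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()


print(clean_text("Oil &amp; filter&#x200B;<br>change\n\n done"))
```

With a DataFrame, the same function could be applied column-wise, for example `df["selftext"].apply(clean_text)`.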
After data cleaning, the dataset had a total of 2,400 rows. The title and selftext were combined to form a single feature for the NLP process. The label, subreddit, was mapped to ‘0’ for E46 and ‘1’ for E90.
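The combining and mapping steps can be sketched as follows. The column name `comb` comes from the article; the toy rows and the assumption that the raw `subreddit` column holds the strings "E46" and "E90" are illustrative.

```python
import pandas as pd

# Toy rows standing in for the cleaned posts (hypothetical data).
df = pd.DataFrame({
    "subreddit": ["E46", "E90", "E46"],
    "title": ["Vanos rattle", "N52 oil leak", "Subframe check"],
    "selftext": ["Any advice?", None, "What should I look for?"],
})

# One text feature: title followed by the post body (missing bodies become "").
df["comb"] = (df["title"].fillna("") + " " + df["selftext"].fillna("")).str.strip()

# Binary label: 0 for E46, 1 for E90.
df["subreddit"] = df["subreddit"].map({"E46": 0, "E90": 1})

print(df[["comb", "subreddit"]])
```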
X = df['comb']       # df: the cleaned DataFrame (variable name assumed)
y = df['subreddit']
Split X and y into training and test sets with the default 75%/25% train-test split.
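A minimal sketch of the split with scikit-learn's `train_test_split`, whose default `test_size` is 0.25. The toy `X` and `y` stand in for the real columns, and `random_state` and `stratify` are additions for reproducibility and class balance rather than settings taken from the article.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the combined text (X) and the 0/1 labels (y).
X = [f"post {i}" for i in range(8)]
y = [0, 1] * 4

# No test_size given, so the default 75%/25% split applies.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
```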
Before the text is ready for Logistic Regression, it must be converted to numerical values. To do so, we will use the CountVectorizer transformer, which converts a collection of text documents into a matrix of token (word or word-sequence) counts.
We will feed CountVectorizer and Logistic Regression through a Pipeline and a grid search (GridSearchCV), so the computer can iterate over parameter combinations and find the one with the best accuracy.
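A sketch of what such a pipeline and search could look like in scikit-learn. The step names (`cvec`, `lr`), the candidate values in `param_grid`, `cv=2`, and the toy training data are all assumptions for illustration; the article does not list the grid it searched.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy training data; with real data these would be X_train and y_train.
X_train = [
    "e46 m54 vanos rattle", "e46 subframe crack",
    "e46 window regulator", "e46 cooling overhaul",
    "e90 n52 oil leak", "e90 idrive update",
    "e90 water pump failure", "e90 n54 turbo wastegate",
]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("lr", LogisticRegression(max_iter=1000)),
])

# Illustrative candidates only; "<step>__<param>" addresses a pipeline step.
param_grid = {
    "cvec__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "cvec__stop_words": [None, "english"],
    "lr__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)  # best combination found
print(search.best_score_)   # its mean cross-validated accuracy
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so the vocabulary never sees the held-out fold.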
Best parameters found:
ngram_range = (1, 3)
stop_words = 'english'
C = 0.1
Accuracy and confusion matrix:
Accuracy = 86%
Total misclassified (false positives + false negatives) = 85
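These figures can be computed with scikit-learn's metrics, as sketched below on made-up labels and predictions (not the article's results). As a rough consistency check, if the test set held about 600 posts (25% of 2,400 rows), 85 errors would leave 515 correct, or roughly 86%.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (0 = E46, 1 = E90).
y_test = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

acc = accuracy_score(y_test, y_pred)

# For binary labels, ravel() yields the counts in this fixed order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
total_false = fp + fn  # all misclassified posts

print(acc, total_false)
```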
Top words for E46 and E90s:
These are the words that contribute the most to classifying a post as E46 or E90.
As we can see, after Logistic Regression was trained on the words and their relationship to the E46 and E90, it was fairly accurate at telling new data points from the two series of cars apart.
The top words were no surprise, as they are some of the words users most frequently use in posts to describe their E-Series BMWs. Of the 85 data points that were misclassified, many did not use any of the important words. They were very short posts written with very general words that could apply to almost anything.
An NLP model is only as good as we train it to be. This model is not foolproof and can be improved in many ways. It is only applicable to differentiating posts between the E46 and E90, since that is all it was trained to do. Still, it illustrates some of the important tasks NLP can help us with.
There you go: a simple example of NLP in action.