Natural Language Processing (NLP) Using a Classification Model

Tenzin Wangdu · Published in Geek Culture · Mar 22, 2021 · 5 min read


What is Natural Language Processing, or NLP? As humans, we are able to read, understand, and gather meaning from language. For a machine to perform all of those actions, it needs NLP: the field of getting computers to understand language the way we do.

NLP has many applications, such as voice-to-text services for people who are hard of hearing, text-to-voice services that we use in messaging every day, chatbots that we hate communicating with unless they solve our problem, and translation services.

Before a machine can perform the magic of NLP, we need to preprocess the data. Tokenizing is a method we can use to split text into tokens while stripping out spaces, special characters, and much more, using a regular expression.

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')

The code above uses a regular expression: it picks out sequences of alphanumeric characters as tokens and drops everything else. The next part of preprocessing is lemmatizing/stemming. This process shortens words so we can combine similar forms of the same word; in simple terms, it returns the base/dictionary form of a word.
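For example, here is a minimal sketch of lemmatizing with NLTK's WordNetLemmatizer (the sample words are just illustrations, and it assumes you've run nltk.download('wordnet') once):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ['skis', 'boots', 'mountains']:
    # each inflected form collapses to its base form: ski, boot, mountain
    print(word, '->', lemmatizer.lemmatize(word))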

Another preprocessing step is to remove the most common English words, such as 'i', 'me', 'we', and so on. We can define stop words as words that carry little to no meaning on their own. Below is a quick way to pull up the English stopword list:
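A minimal sketch using NLTK's built-in list (assumes you've run nltk.download('stopwords') once):

from nltk.corpus import stopwords

stop_words = stopwords.words('english')
print(len(stop_words))    # a couple hundred words in recent NLTK versions
print(stop_words[:10])    # starts with 'i', 'me', 'my', 'myself', 'we', ...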

To practice and showcase the NLP skills I learned, I did a project on predicting which subreddit a post came from using a classification model. I chose the ski and snowboard subreddits to see how well the model could distinguish between two similar topics:

Predicting Subreddits Using Natural Language Processing:

My project’s problem statement was:

“A ski resort wants to find out which subreddit posts are coming from using a classification model, and look into which types of words are frequently used in each subreddit.”

The data was collected from Reddit: I scraped it using Pushshift's API at https://api.pushshift.io/reddit/search/submission. I used the requests library to extract posts from the two subreddits, then called .json() to parse each response into dictionaries and turned those into a data frame for modeling.
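A minimal sketch of that scraping step; the subreddit names and batch size here are assumptions, not the exact parameters from the project:

import requests
import pandas as pd

url = 'https://api.pushshift.io/reddit/search/submission'
frames = []
for sub in ['skiing', 'snowboarding']:   # assumed subreddit names
    res = requests.get(url, params={'subreddit': sub, 'size': 100})
    frames.append(pd.DataFrame(res.json()['data']))   # 'data' holds the list of posts

df = pd.concat(frames, ignore_index=True)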

EDA and Data Cleaning: The data set contained a lot of unnecessary columns I didn't need, so I filtered it down to three: "subreddit" (which subreddit the post came from), "selftext" (the text of the post), and "title" (the post title). After that, I dropped all the duplicate posts that came back from the API and combined selftext and title into one column so the model could predict better. Then I converted the ski and snowboard subreddit labels into a binary column.
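A minimal sketch of those cleaning steps; the combined-text and label column names ('text', 'is_ski') are my own choices for illustration:

df = df[['subreddit', 'selftext', 'title']].drop_duplicates()
df['text'] = df['title'].fillna('') + ' ' + df['selftext'].fillna('')   # combine title and body
df['is_ski'] = (df['subreddit'] == 'skiing').astype(int)                # 1 = ski post, 0 = snowboard post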

Preprocessing: As I explained earlier, we need to preprocess the text to get better results from the model. I used a RegexpTokenizer with the pattern '[a-z]\w+', which returns only lowercase tokens without any punctuation or special characters. I also lemmatized the text to normalize it and collapse derived forms of the same word. The sketch below shows how these steps strip out the whitespace and special characters that appear in the raw text.
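A minimal sketch of the tokenize-and-lemmatize step applied to the combined text column; the helper name preprocess is hypothetical:

from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

tokenizer = RegexpTokenizer(r'[a-z]\w+')
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = tokenizer.tokenize(str(text).lower())             # lowercase, drop punctuation
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens)   # reduce words to base form

df['text'] = df['text'].apply(preprocess)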

Modeling: After preprocessing the data, I decided to go with two models to predict which subreddit a post came from: Logistic Regression with a CountVectorizer, and Random Forest with a TfidfVectorizer.

CountVectorizer: transforms text into a vector of counts of each word that occurs in the text.

TfidfVectorizer: TF-IDF stands for term frequency-inverse document frequency. It transforms text into feature vectors that aim to capture how important a word is to a document, while also taking into account its frequency across the other documents in the same corpus.

Logistic Regression: models the probability of a certain class or event, such as pass/fail, win/lose, or ski/snowboard.

Random Forest: fits a number of decision tree classifiers on bootstrapped sub-samples of the data set and uses voting to improve predictive accuracy and control over-fitting.

Training Score: How the model fits the training data

Testing Score: How the model generalizes to new data

Cross Val Score: How the model performs on average across several validation folds
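Putting it together, here is a minimal sketch of the two pipelines and the three scores above; hyperparameters, step names, and column names are assumptions, not the exact settings from the project:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['is_ski'], stratify=df['is_ski'], random_state=42)

logreg_pipe = Pipeline([('cvec', CountVectorizer(stop_words='english')),
                        ('lr', LogisticRegression(max_iter=1000))])
rf_pipe = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                    ('rf', RandomForestClassifier(random_state=42))])

for name, pipe in [('logistic regression', logreg_pipe), ('random forest', rf_pipe)]:
    pipe.fit(X_train, y_train)
    print(name,
          'train:', pipe.score(X_train, y_train),   # training score
          'test:', pipe.score(X_test, y_test),      # testing score
          'cv:', cross_val_score(pipe, X_train, y_train, cv=5).mean())   # cross val score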

Confusion Matrix for the Model

We were able to accurately predict 67 of 100 snowboard posts and 242 of 250 ski posts. The model predicted ski posts better than snowboard posts: when it did predict snowboard, it was right about 89% of the time (precision: 67 of 75 snowboard predictions were correct, minimizing false positives), but it only caught about 67% of the actual snowboard posts (sensitivity, which minimizes false negatives).
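A sketch of how these numbers come out of scikit-learn, assuming the logistic regression pipeline above was the final model:

from sklearn.metrics import confusion_matrix, classification_report

preds = logreg_pipe.predict(X_test)
print(confusion_matrix(y_test, preds))         # rows = actual class, columns = predicted class
print(classification_report(y_test, preds))   # precision, recall (sensitivity), f1 per class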

As you can see, all the top words that predict a ski post were related to skiing, and some of them were quite interesting. 'Pas' refers to the PAs (physician assistants) who serve on ski patrols. Switzerland is famous for backcountry skiing. 'Hackerone' was a strange one: HackerOne is a company, and its CEO is a big ski fanatic.

For snowboarding, 'burton' was one word that stood out, as Jake Burton had just passed away; he did so much for snowboarding, helping turn it into a sport approved by the Olympics. Japan has some of the best terrain parks and resorts, which snowboarders love.
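One way to surface these top words is to inspect the logistic regression coefficients from the pipeline sketched earlier (the step names follow that sketch, and this assumes a recent scikit-learn). Positive weights push a prediction toward the ski label, negative weights toward snowboard:

import numpy as np

words = logreg_pipe.named_steps['cvec'].get_feature_names_out()
coefs = logreg_pipe.named_steps['lr'].coef_[0]
order = np.argsort(coefs)   # ascending: most negative weights first

print('top snowboard words:', [words[i] for i in order[:10]])   # strongest snowboard predictors
print('top ski words:', [words[i] for i in order[-10:]])        # strongest ski predictors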

The model was able to accurately predict 88% of the posts. To improve it, I could dive more deeply into feature engineering and make the data more balanced. Another option would be to remove giveaway words like 'ski', 'snowboard', and 'skiing'; however, that might make the accuracy drop by a lot.

Below is the link to the project's GitHub repo if you want to check it out.

https://github.com/tw1270/Web-APIs-and-Predicting-Subreddit
