I recently embarked on a work project to perform sentiment classification on some customer feedback. This was my first time working with text data for machine learning, and I found it quite difficult to get started in such a complex area. I wanted to write a guide for anyone else new to natural language processing (NLP) in Python, to give a quick overview of the end-to-end process without having to delve into all the complexities right at the beginning.
This post is for anyone who would like to get up and running with some code to perform an NLP task in Python without going into all the details. My preferred learning style is to get something working first, and then delve into the nuts and bolts of how it all works. I hope this helps others with a similar learning style.
For this post I am going to use a data set taken from the Analytics Vidhya website, which can be downloaded here. This is a nice simple data set to practice on, and it has the added advantage of being part of an ongoing competition with a leaderboard, so you can quickly get an idea of how well your chosen workflow is performing.
The data set consists of a training set and a test set. The training set comprises 31,962 tweets, each with a corresponding id and a label of 0 or 1. The particular sentiment you are asked to identify in this problem is whether or not the tweet is racist or sexist (in which case it is labelled as 1).
So to start with I have imported the data sets, and am returning a little information about them.
import pandas as pd

# Load both files and print the column names, shape and row count of each
train = pd.read_csv('train.csv')
print("Training Set:", train.columns, train.shape, len(train))
test = pd.read_csv('test_tweets.csv')
print("Test Set:", test.columns, test.shape, len(test))
Most text data will likely need some processing in order for the chosen machine learning algorithm to perform well. In this case each text document is a tweet and therefore will contain lots of characters that will not be meaningful to any machine learning algorithm. You can see below from just viewing the first few rows of the data that the tweets contain characters such as #, @ and punctuation marks.
In order to remove these I am using the Python re library, which provides regular expression matching operations. The following function cleans up most of these characters. Additionally, it converts everything to lower case.
This is generally a good idea, as many text classification tools rely on counting the occurrences of words. If both upper and lower case versions of the same word are found in the text, the algorithm will count them as different words even though the meaning is the same. Of course, this does mean that where the capitalised version of a word has a different meaning, that distinction is lost, for example the company Apple versus the fruit apple. This could result in poorer performance for some data sets. This is one area of NLP where you may try different methods to see how they affect the overall performance of the model.
import re

def clean_text(df, text_field):
    # Lower-case first so the patterns below only have to handle one case
    df[text_field] = df[text_field].str.lower()
    # Strip @mentions, non-alphanumeric characters, URLs and a leading "rt"
    df[text_field] = df[text_field].apply(lambda elem: re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", elem))
    return df

test_clean = clean_text(test, "tweet")
train_clean = clean_text(train, "tweet")
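To get a feel for what the regular expression does, it can help to run it on a single string. The snippet below is a small sketch that applies the same pattern to a made-up tweet (the sample text is my own invention, not a row from the data set).

```python
import re

# Same pattern as in clean_text, applied to one made-up tweet
PATTERN = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"

sample = "@user Thanks for the #Lyft credit! http://t.co/abc"
cleaned = re.sub(PATTERN, "", sample.lower())

# Removing tokens leaves stray spaces behind, so strip them for display
print(repr(cleaned.strip()))
```

Note that the mention and the URL are removed entirely, while for the hashtag only the # symbol is dropped and the word itself is kept.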
The output after cleaning looks like this. Not perfect but we can see how it performs later in our model.
Handling imbalanced classes
If we count the number of tweets for each label we can see that there are significantly more tweets labelled as 0. In fact only 7% are classified as sexist/racist. This is problematic because an algorithm trained on this data is quite likely to default to predicting 0 for every tweet.
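One quick way to check the balance is with the pandas value_counts method. The sketch below uses a tiny made-up data frame as a stand-in for the real training set (the toy name and its ten rows are assumptions for illustration only).

```python
import pandas as pd

# Toy stand-in for the training set: nine tweets labelled 0, one labelled 1
toy = pd.DataFrame({"label": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

# Absolute number of rows per label
counts = toy["label"].value_counts()
# Share of rows per label, as fractions that sum to 1
shares = toy["label"].value_counts(normalize=True)

print(counts)
print(shares)
```

On the real data you would replace toy with the cleaned training data frame.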
There are a number of methods you can use to handle this. One approach is to use either upsampling or downsampling. In the case of upsampling we use a function that repeatedly takes samples, with replacement, from the minority class until the class is the same size as the majority. With replacement means that the same sample can be used multiple times.
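from sklearn.utils import resample

train_majority = train_clean[train_clean.label==0]
train_minority = train_clean[train_clean.label==1]
# Sample the minority class with replacement until it matches the majority class in size.
# random_state is an arbitrary seed, added here so the result is reproducible.
train_minority_upsampled = resample(train_minority, replace=True, n_samples=len(train_majority), random_state=0)
train_upsampled = pd.concat([train_minority_upsampled, train_majority])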
When downsampling we use the same function to take samples from the majority class, without replacement, until it is the same size as the minority class. Without replacement means that each sample can only be drawn once.
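train_majority = train_clean[train_clean.label==0]
train_minority = train_clean[train_clean.label==1]
# Sample the majority class without replacement down to the size of the minority class.
# random_state is an arbitrary seed, added here so the result is reproducible.
train_majority_downsampled = resample(train_majority, replace=False, n_samples=len(train_minority), random_state=0)
train_downsampled = pd.concat([train_majority_downsampled, train_minority])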
I have included my code for this above. I tried both upsampling and downsampling and achieved a better result with upsampling so I have used that in my workflow.
Machines are not capable of reading text in the same way as humans do. In order for a machine learning algorithm to determine patterns in text it must first be converted into a numeric structure. One of the most common techniques for this is called Bag of Words, or BoW.
A BoW model splits a piece of text into tokens (words), disregarding grammar and word order. The model also counts how often each word occurs in the text, and assigns a weight proportional to this frequency. The output is a matrix of term frequencies where each row represents a document (here, a tweet) and each column a word in the vocabulary.
scikit-learn has a number of built-in tools for this type of modelling. For this walkthrough I am going to use one of the simplest, which is CountVectorizer. It works quite well with the default settings, so I will use those for the first iteration of my model.
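As a small illustration of the term frequency matrix described above, here is a sketch that runs CountVectorizer with its default settings on two made-up sentences (the sentences are assumptions for illustration, not part of the data set).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny made-up documents
docs = ["the cat sat", "the cat sat on the mat"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort the words by that index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)          # one column per word
print(bow.toarray())  # one row per document, values are word counts
```

The word order within each sentence is lost; only the counts per word remain. Note also that the default settings drop single-character tokens.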
As mentioned earlier in the post, the BoW model which we are using to process the text has three steps. CountVectorizer accomplishes the first two, splitting the text into tokens and counting the frequency. We can use another scikit-learn class called TfidfTransformer to apply the frequency weighting.
In any text document there will be a number of words that appear very frequently, such as I, we, and get. If we were to build a model without weighting these words, they would overshadow less frequent words during training. By down-weighting these high-frequency words we can assign more importance to less frequent but perhaps more useful words.
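To make the effect of the weighting concrete, the sketch below applies TfidfTransformer with default settings to three made-up sentences and compares the weight of a word that appears in every sentence ("we") with one that appears in only one ("good"). The sentences are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# "we" and "get" appear in every document, "good" in only one
docs = ["we get good service", "we get bad service", "we get refunds"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
weights = TfidfTransformer().fit_transform(counts).toarray()

col = vectorizer.vocabulary_  # word -> column index
first = weights[0]            # weights for the first document

print("we:", round(first[col["we"]], 3))
print("good:", round(first[col["good"]], 3))
```

Both words occur once in the first sentence, but the rarer word ends up with the larger weight.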
For simplicity and reproducibility I am going to use a scikit-learn pipeline with an SGDClassifier. The code below creates a pipeline object that, when used, will apply each step to the data in turn.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier

# Step names ('vect', 'tfidf', 'clf') are arbitrary labels
pipeline_sgd = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
Before training the model I am splitting the training data into a training and test set.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train_upsampled['tweet'], train_upsampled['label'], random_state=0)
The code below fits the model to the training data and computes the F1 score.
from sklearn.metrics import f1_score

model = pipeline_sgd.fit(X_train, y_train)
y_predict = model.predict(X_test)
print(f1_score(y_test, y_predict))
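For anyone unfamiliar with it, the F1 score is the harmonic mean of precision and recall. The sketch below checks this by hand on a few made-up labels and predictions (the values are assumptions for illustration, not results from the model above).

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 2 true positives, 1 false negative, 1 false positive
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

# Harmonic mean of precision and recall
manual_f1 = 2 * precision * recall / (precision + recall)
f1 = f1_score(y_true, y_pred)

print(manual_f1, f1)
```

Because it ignores true negatives, F1 is less flattering than plain accuracy when one class dominates, which is the situation in this data set.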
You will notice above that I am using the upsampled data set. This is not ideal, as the model performs vastly better on the held-out portion of the training data than on the new test data. For example, on the training data I achieved (after some further optimisation) an F1 score of 0.97, but this translated to a score of 0.75 once I made my submission. One likely contributor is that the upsampling was done before the split, so copies of the same minority-class tweet can land in both the training and the held-out portions. Downsampling performed even more poorly, because of the low volume of data for the minority class. There are other, more complex techniques for handling imbalanced classes, such as weight balancing, but that subject in itself warrants a whole other post.
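As a pointer for further reading only, one form of weight balancing in scikit-learn is the class_weight parameter that SGDClassifier accepts. The sketch below is not what I used for my submission; it is a minimal example on made-up data, and the texts, labels and step names are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Tiny made-up, imbalanced data set: five tweets labelled 0, one labelled 1
texts = [
    "lovely day at the beach",
    "great service thanks",
    "happy birthday to my friend",
    "new phone arrived today",
    "enjoying the weekend",
    "hateful abusive message",
]
labels = [0, 0, 0, 0, 0, 1]

# class_weight="balanced" weights each class inversely to its frequency,
# so no rows need to be duplicated or discarded
pipeline_weighted = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(class_weight="balanced", random_state=0)),
])

pipeline_weighted.fit(texts, labels)
preds = pipeline_weighted.predict(texts)
print(preds)
```

Whether this beats upsampling on the real data is something you would need to test yourself.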
Natural language processing is a vastly complex subject and there is so much more that I could cover in this article. My aim here is to give enough information, and code, to get up and running with your first text classification model. The final score is not too bad for a first attempt, but there is much scope for improvement. For example, there are many other models besides BoW for processing the text data that may work better. There is more work that could be done with the re library to further clean the data, and the removal of stop words may help to improve the model further.
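As a starting point for the stop word idea, CountVectorizer has a stop_words parameter that can use a built-in English list. The sketch below compares the vocabulary with and without it on two made-up sentences (the sentences are assumptions for illustration; I have not tested the effect on the competition score).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny made-up documents
docs = ["the service was great", "we love the app"]

# Vocabulary with the default settings
default_vocab = set(CountVectorizer().fit(docs).vocabulary_)
# Vocabulary after removing scikit-learn's built-in English stop words
filtered_vocab = set(CountVectorizer(stop_words="english").fit(docs).vocabulary_)

# Words that were dropped as stop words
print(sorted(default_vocab - filtered_vocab))
```

In the pipeline above this would mean passing stop_words="english" to the CountVectorizer step.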