Natural Language Processing (NLP) Project Example for Beginners
When you become interested in machine learning, you end up working through many different studies, and these can take a really long time. But if you keep going without giving up, after a while you will find yourself taking steps and thinking about details without even noticing.
As someone who has taken some of these steps, I wanted to share a simple but instructive machine learning project that I have already completed, aimed especially at beginners.
This sample project is based on a competition posted on Kaggle. To explain briefly: people post all kinds of tweets on Twitter. These are about serious issues, jokes, criticism, insults, disasters, and much more.
The data set Kaggle offers includes many tweets. There are 3243 unique records in the test data set and 7503 unique records in the train data set. Both contain id, keyword, location, and text (the tweet itself) columns. We will do our development on the train data set to improve our predictions, so it also has a target column that works in our favor. The test data set has no target column yet; we will create this column ourselves and see how well we can predict.
The target column in the train data set contains only 1 and 0, so it is a categorical variable with discrete values. If target = 1, the tweet in the same row is a disaster tweet. If target = 0, the tweet in the same row is not a disaster tweet.
Because this is an NLP problem and the value we need to predict is only 1 or 0, we can use a classification algorithm as a solution. More information is in Step 8.
The competition is here: Real or Not? NLP with Disaster Tweets
Step 1: Concatenating data sets to make them available
The train and test sets come separately, but it is more consistent and efficient to apply the same processing steps to a single data set. So, as a first step, we combine the data sets. In doing so, the target column is dropped; it will be added back later.
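As a rough sketch of this step, assuming pandas and two tiny made-up frames in place of the real Kaggle files (the sample rows and the `n_train` helper are my own illustration, not the original code):

```python
import pandas as pd

# Tiny stand-ins for the Kaggle files; the real ones would be loaded with pd.read_csv.
train = pd.DataFrame({
    "id": [1, 2, 3],
    "keyword": ["fire", None, "flood"],
    "location": ["NYC", None, "Texas"],
    "text": ["Forest fire near the town", "I love this song", "Flood warning issued"],
    "target": [1, 0, 1],
})
test = pd.DataFrame({
    "id": [4, 5],
    "keyword": ["storm", None],
    "location": [None, "London"],
    "text": ["Storm hits the coast", "What a lovely day"],
})

# Keep the labels and the train row count aside so the sets can be separated later.
target = train["target"]
n_train = len(train)

# Stack the two sets on their shared columns.
all_data = pd.concat([train.drop(columns="target"), test], ignore_index=True)

print(all_data.shape)  # (5, 4)
```

Remembering how many rows came from the train set is what makes it possible to split the combined frame again later.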
Step 2: Visualizing Data
Expressing the data visually provides a better understanding of the subject being dealt with.
There are a lot of modules and methods available for data visualization in Python. I chose a simple method, the word cloud. A word cloud highlights significant words: each word is drawn larger or smaller depending on how frequently it is used.
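To make the idea concrete, here is a small standard-library sketch of the counting that sits behind a word cloud. The sample tweets and the linear font-size scaling are assumptions for illustration; an actual word cloud package applies its own sizing and layout rules.

```python
from collections import Counter

tweets = [
    "Forest fire near the town",
    "Fire crews fight the forest fire",
    "I love this song",
]

# Count how often each lowercase word occurs across all tweets.
word_counts = Counter(word for tweet in tweets for word in tweet.lower().split())

# Scale each word to a font size between min_size and max_size by relative frequency.
min_size, max_size = 10, 40
top = word_counts.most_common(1)[0][1]
font_sizes = {w: min_size + (max_size - min_size) * c / top for w, c in word_counts.items()}

print(word_counts["fire"], font_sizes["fire"])  # 3 40.0
```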
Step 3: Clean The Text Body
When working with text, it is necessary to highlight the words that point to specific topics. Removing unnecessary, nonsensical, or meaningless words that won’t help the prediction makes it easier for us to classify tweets.
The most important of these are stop words. Stop words are a set of very commonly used words in a language; search engines remove them from queries because they carry little information. Python’s NLTK library provides a list of stop words.
As you would expect on a platform like Twitter, where people write down their ideas instantly, there are many non-alphanumeric characters, some meaningful and some not. Along with these characters, there are also structures such as e-mail addresses and shared links. Getting rid of these structures will increase our efficiency and performance, right? So what should we do? We write a regular expression that matches them and use it in the replace function to find and delete these unnecessary or meaningless structures.
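A minimal sketch of such a cleaning function, using only Python’s `re` module rather than a data frame replace call. The patterns, the `clean_tweet` name, and the short stop-word set are assumptions for illustration; NLTK’s English list is much longer.

```python
import re

# A few stop words for illustration only; NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "in", "on", "at", "and"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # shared links
EMAIL_RE = re.compile(r"\S+@\S+")               # e-mail addresses
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")       # anything that is not a letter, digit, or space

def clean_tweet(text):
    """Lowercase a tweet, strip links, e-mails and symbols, then drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

print(clean_tweet("Fire at the mall!!! Details: http://t.co/abc123 #breaking @news"))
# fire mall details breaking news
```

The order matters here: links and e-mail addresses are removed before the symbol pattern runs, otherwise their pieces would survive as stray words.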
Step 4: Dropping the unhelpful columns
The location and id columns in the data sets are not helpful for our predictions, so I dropped them.
Keyword is a column in which tweets are tagged according to their theme. So, instead of deleting it, it is more useful to encode it so that it can go through the classification process. Let’s first look at the number of unique values in the keyword column and choose an encoding solution accordingly.
We get the result (221,). In other words, there are 221 unique values in the keyword column.
Next, print the tags/categories in the keyword column.
Let’s review what we can do:
- A label encoder could be a suitable solution, but label encoding usually works well with tree-based algorithms, and we will not use a tree-based algorithm.
- One-hot encoding takes up a lot of memory, as it means adding a column for each category.
- For now, the best solution looks like binary encoding. Binary encoding works really well when there are lots of categories. It uses memory more efficiently because it needs fewer features than one-hot encoding.
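To see why binary encoding needs fewer columns, here is a hand-rolled sketch in plain Python. The `binary_encode` function and the sample keywords are hypothetical, and a dedicated encoder library may number the categories differently, but the column count works the same way: each category gets an integer code, and the code is written out in binary digits.

```python
def binary_encode(values):
    """Return one row of 0/1 digits per value, plus the number of columns used."""
    # Assign each distinct category an integer code in order of first appearance.
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    # Number of bits needed to represent the largest code (at least 1).
    n_bits = max(1, (len(codes) - 1).bit_length())
    # Write each code as a fixed-width list of digits, most significant bit first.
    rows = [[(codes[v] >> bit) & 1 for bit in range(n_bits - 1, -1, -1)] for v in values]
    return rows, n_bits

keywords = ["fire", "flood", "storm", "fire", "quake", "crash"]
encoded, n_bits = binary_encode(keywords)
print(n_bits)      # 3
print(encoded[5])  # [1, 0, 0]
```

With 221 distinct keywords, this scheme needs 8 columns, whereas one-hot encoding would need 221.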
Step 5: Analyzing Word and Document Frequency
In data mining and NLP problems, there are statistical methods for measuring how important a term is within a group of texts called documents. One such method for calculating a term’s weight is TF-IDF (Term Frequency-Inverse Document Frequency).
TF (Term Frequency): Measures how frequently a term occurs in a document. This calculation is actually a normalization process, because documents differ in length and a term can appear more often in a long document than in a short one. If we look at all terms in all documents, normalization becomes essential to calculate term frequency. For this reason, the number of times a term appears in a document is divided by the total number of terms in the document.
TF = (Number of times term appears in a document) / (Total number of terms in the document)
IDF (Inverse Document Frequency): In short, measures the importance of the term. Words that appear in almost all documents, most of which are conjunctions and similar function words, are not important according to this calculation. Rarely used or unique words are important and have a high IDF value.
IDF = log_e(Total number of documents / Number of documents with term in it)
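Let’s think of an example word and name it ‘cat’. The higher the number of documents containing the term ‘cat’, the lower its TF-IDF value. On the other hand, the more often ‘cat’ appears in a given document, the higher its TF-IDF value for that document.
In Python, scikit-learn’s CountVectorizer produces the raw term counts; TfidfTransformer, or TfidfVectorizer, which combines both steps, turns them into TF-IDF weights.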
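The two formulas above can be checked by hand with a few lines of plain Python. This is a sketch on three made-up documents, not the scikit-learn implementation: scikit-learn’s default IDF adds smoothing and normalizes each document vector, so its numbers will differ.

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

def tf(term, doc):
    # Times the term appears in the document / total number of terms in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log_e(total documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ("the", "cat", "mat"):
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

‘the’ appears in every document, so its IDF is log(1) = 0 and its weight vanishes, while ‘mat’, which appears in only one document, gets the highest weight of the three.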
Step 6: Imbalanced Data Set Control
Data sets with imbalanced classes need special care when building predictive models. Imbalanced data sets adversely affect the results of classification-based predictive models: while the model predicts the larger class well, it predicts the minority class badly and may ignore it altogether.
If we look at our data set, the classes are not extremely imbalanced. But there is still an imbalance that affects the prediction model.
The most common way to solve this problem is re-sampling, a fast and simple technique. It comes in two forms, which bring the data set to a more balanced level either by adding samples to the minority class or by removing samples from the majority class.
- Oversampling: This method adds copies of samples to the minority class, resulting in a balanced data set.
- Undersampling: This method provides a balanced data set by deleting samples from the majority class using various methods.
We can get a more balanced data set by over-sampling the minority class. Python’s imbalanced-learn (imblearn) package can be used for over-sampling. I used the SMOTE method from the imblearn package.
SMOTE (Synthetic Minority Over-sampling Technique): Creates new synthetic samples in the neighborhood of samples that already exist in the minority class. SMOTE finds the K-nearest neighbors of each sample in the minority class. Then, new data points are added on the lines connecting the original sample and its detected neighbors.
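The interpolation idea can be sketched in plain Python. This is a simplified illustration, not the imblearn implementation: the `smote_sketch` function, its parameters, and the sample points are all hypothetical, and in a real project you would use the SMOTE class from imblearn instead.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        sample = rng.choice(minority)
        # k nearest neighbours of the sample within the minority class (excluding itself).
        others = [p for p in minority if p is not sample]
        neighbours = sorted(others, key=lambda p: math.dist(sample, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random point on the segment between the sample and that neighbour.
        gap = rng.random()
        synthetic.append(tuple(s + gap * (n - s) for s, n in zip(sample, neighbour)))
    return synthetic

minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_points = smote_sketch(minority, n_new=4)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies.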
Step 7: Splitting Data
We passed the data through various processes such as deletion and encoding. This provides a more efficient environment for classification and is necessary for good predictive performance.
At the very beginning of all these operations, we combined the train and test data sets and created the all_data data frame. Now it’s time to get the train and test data sets again.
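A sketch of the split, assuming pandas and assuming that the row order of all_data has not changed since the concatenation, so that the first n_train rows are the train set. The sample rows and the `n_train` and `target` names are my own illustration.

```python
import pandas as pd

# Stand-in for the combined, processed frame; the first n_train rows came from train.
all_data = pd.DataFrame(
    {"text": ["fire town", "love song", "flood warning", "storm coast", "lovely day"]}
)
target = pd.Series([1, 0, 1], name="target")
n_train = len(target)

# Slice by position, then put the labels back on the train part only.
train = all_data.iloc[:n_train].copy()
train["target"] = target.values
test = all_data.iloc[n_train:].reset_index(drop=True)

print(train.shape, test.shape)  # (3, 2) (2, 1)
```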
Step 8: Model Selection
To start with a very basic question: what is the purpose of a learning model? We try to find the pattern that best fits the data we have, using various mathematical learning algorithms. The train set is what lets us find that pattern. When the learned model captures the pattern closely, we get a successful result on our test set.
At the beginning of the article, we stated that it would be appropriate to use a classification algorithm, which is a supervised learning approach, to solve this problem.
When choosing a model, it is sometimes good to experiment with a few models to see which one gives better results. I compared Naive Bayes, Random Forest, and SVM, and got the best predictions with SVM (Support Vector Machine). I suggest you do these trials too. Besides achieving the best result, we also gain experience. Like many other fields, machine learning becomes more understandable with experience. As you experiment, more varied and practical ideas will come to mind.
SVM (Support Vector Machine): Consider a data set that cannot be separated linearly: a curve can separate the two categories, but a straight line cannot. Most real-world data sets are linearly inseparable like this. We can map such data to a higher-dimensional space, for example a three-dimensional one. After the transformation, a flat surface can define the boundary between the two categories. Since we are now in three-dimensional space, we represent it as a separating plane.
This plane can be used to classify new or unknown data. In general, the SVM algorithm finds an optimal hyperplane that categorizes new samples.
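The mapping idea can be tried on a toy example in plain Python. This is only an illustration of why lifting data to a higher dimension helps, not an SVM implementation; the points and the `separable_by_threshold` helper are made up, and it goes from one dimension to two rather than to three to keep things short.

```python
# One-dimensional points: class 1 sits in the middle, class 0 on both sides.
points = [-3.0, -2.5, -0.5, 0.0, 0.8, 2.2, 3.1]
labels = [0, 0, 1, 1, 1, 0, 0]

def separable_by_threshold(values, labels):
    """True if a single cut point puts each class entirely on its own side."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1

# In the original space no single threshold works.
print(separable_by_threshold(points, labels))   # False

# Lift each point x to (x, x**2) and look at the new coordinate.
squared = [x ** 2 for x in points]
print(separable_by_threshold(squared, labels))  # True
```

In the lifted space, a horizontal line such as x² = 2.5 separates the two classes. An SVM with a kernel applies the same kind of mapping implicitly and then looks for the separating boundary with the widest margin.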
Step 9: Calculating The Accuracy of The Model
After training our model, we made target predictions in the test set. So what percentage of the predictions are correct? If the accuracy turns out undesirably low, we go back and review the data processing: either we try to optimize the data for the prediction model, or we change the model. In Python, accuracy is calculated with the accuracy_score metric from the sklearn library. The accuracy score takes a value between 0 and 1, where 1 is the best value and 0 the worst. We will express the accuracy score as a percentage.
A confusion matrix is another way to assess the accuracy of a classifier. This matrix shows correct and incorrect predictions compared to the real labels. Each row of the confusion matrix shows the true labels in the test set, and the columns show the labels predicted by the classifier. Metrics such as the F1-score are computed from these counts. The matrix lets us see more clearly how many instances of each class were predicted correctly and how many were predicted wrong.
This is the step where we find out whether we need to rearrange our data set or choose another model. So if we cannot get the result we want here, it is likely that we will repeat all the previous steps.
The accuracy came out at 99.15%. This is a good result.
Interpretation of the Table: According to the confusion matrix, 10 of the samples with a target value of 1 were predicted as 0 and 991 as 1. In other words, 10 tweets were misclassified. 7 of the samples with a target value of 0 were predicted as 1 and 992 as 0. In other words, 7 tweets were misclassified.
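The reported accuracy can be reproduced from these four counts with a few lines of plain Python. Treating target 1 as the positive class is my assumption; the precision, recall, and F1 values below are derived from the counts and are not figures from the original results.

```python
# Counts taken from the table described above (positive class = target 1).
tp, fn = 991, 10   # actual 1: predicted 1, predicted 0
fp, tn = 7, 992    # actual 0: predicted 1, predicted 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the tweets predicted as disasters, how many really were
recall = tp / (tp + fn)      # of the real disaster tweets, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy: {accuracy:.2%}")  # accuracy: 99.15%
print(f"f1: {f1:.4f}")
```

The correct predictions (991 + 992 = 1983) out of 2000 samples give exactly the 99.15% above.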
Here are the summary notes we can draw from this successful result:
- When you encounter a problem that you need to solve, first review the data and try to understand the general outline of the problem. In this way, you will understand which models you can work with.
- Don’t be afraid to visualize the data you have; it will allow you to better understand it.
- If you are working with text (as in this example), think about the words that will help you during the prediction phase and try to highlight the words or groups of words that have a say in the results you will get. We can call this cleaning the text for short.
- If you are working on a classification problem, check whether the data set is balanced or imbalanced. Imbalance can become a problem that greatly affects the results.
- Do not hesitate to experiment while preparing data for the model or choosing a model. If a step you add is affecting the results badly, remove it, then consider another option. Machine learning is an area where you can move forward if you’re really patient.
Thank you for reading!!