Yet another ML recipe

Pratish Mashankar
13 min read · Jan 26, 2024


Hands-on Machine Learning Basics: Part 1

We are living in the big bloom of Machine Learning (ML). Almost every major conglomerate and company out there is finding all sorts of ways to integrate ‘AI’ into their service. For many aspiring developers, a question arises — where do we start? In this first of my many upcoming blogs, I am sharing with you a simple code setup to learn the very core of Machine Learning.

I touch on a few questions below; if you already have the setup ready, you can jump directly to the code.

Who should read this? If you are a new developer looking to write your first code, a seasoned professional wishing to explore this domain, a movie buff wanting to understand how people reacted to a new show, or just another curious mind, then this blog is for you.

What will you be doing? You will predict whether a given sentence is positive or negative using a simple ML algorithm. This class of problems is called sentiment analysis and is performed by almost all major companies (including McDonald’s!) to improve their product.

Why another beginner's blog? It is true that a gazillion blogs like this one, or a simple ChatGPT prompt, will give you a good start on ML. However, in my experience with learning so far, the quality of knowledge profoundly improves when it comes from a personal source. The topics taught to me by my friends right before an exam have etched deeper than day-long classes. Hence, this blog is dedicated to the people I have met along the way—those who wish to know what I do, or how I do it. In the end, this blog merely satiates the curiosity: what happens in ML?

What will you need? A stable internet connection and patience. It is great to have a working knowledge of Python programming language, but I will walk you through the simplest of steps.

Setting things up

ML is essentially playing with data. And isn’t it always helpful to see what you have done with the data at hand? Jupyter Notebooks do just that for you! In these ‘Notebooks,’ you have a user interface with a code cell (where you write the code) and an output block (to view your output).

While it is encouraged to set up the Jupyter web application on your local system, we will be using Colab, a service offered by Google that runs Jupyter notebooks without any prior setup. Head to Google Colab and create a ‘New notebook’.

Let's write your first code in the code cell. Copy this and click on the run icon:

print("I am learning Machine Learning!")
Output of your code. The yellow highlighted icon is the “run” button to execute your code.

Jupyter notebooks allow you to treat your code, well, like a notebook. You can include a title or a subheading to better understand what the code block does. These texts are written in a language called Markdown. Simply click on the ‘+Text’ option at the top of the notebook to add the Markdown cell. Colab also offers a simple guide to play with these text cells.

Dive in!

To reiterate our goal: we will be performing sentiment analysis, a way to predict whether a movie review (or any text) is positive or negative. Such a classification is called binary classification in ML.

Step 1: Data Collection

The rule of thumb in any ML development (like this one) is that we need data. And luckily, there's plenty of it. For our purposes, we will be downloading the IMDB Dataset of 50K Movie Reviews from Kaggle, a dataset platform. You may be asked to create an account here. Kaggle is one of the best sources for obtaining and uploading data, executing Jupyter notebooks (much like Colab), and participating in a variety of ML competitions.

File explorer highlighted in Colab. Upload the IMDB Dataset.csv here.

Once downloaded, you can unzip the file and upload the IMDB Dataset.csv to Google Colab. It may take some time to upload.

Remember, Colab sessions have a limited runtime: if we leave the notebook idle for too long, the runtime disconnects, we lose our uploaded files, and we have to re-upload them every time we wish to work with our data. This is one of the major reasons why it is better to run Jupyter Notebook on your local system.

Step 2: Importing the Data

Once uploaded, we need to store our data in a variable through which we can interact with it further.

Python allows us to store such tabular Excel or CSV (comma-separated values) data in dataframes. These dataframes can be created and managed using a Python library (a collection of reusable code) called Pandas. Copy the below code to import your data and view the first five rows of your data:

# Importing 'Pandas' library under the name 'pd' for processing the data
import pandas as pd

# Reading our CSV file as a dataframe and storing in a variable df_imdb
df_imdb = pd.read_csv("/content/IMDB Dataset.csv")

# Prints the number of rows in the dataframe
print("Length of IMDB File:",len(df_imdb))

# Prints the first five rows of the dataframe
df_imdb.head()

Your output should look something like this:

Step 3: Visualizing the Data

Once we have imported and stored our data, it is a good practice to visualize the data. This gives us an idea of how the data distribution looks — the number of positive and negative records in the data.

Copy and run this code below in a new code cell to plot a pie chart distribution of the data.

import matplotlib.pyplot as plt

# Counting the occurrences of each sentiment
sentiment_counts = df_imdb['sentiment'].value_counts()

# Creating the pie chart
plt.figure(figsize=(8, 6))
plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%', startangle=140, colors=['green', 'red'])
plt.title('Sentiment Distribution')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

Just as Pandas handles data, Matplotlib is the Python library used for visualization and plotting graphs. Your output should look like this:

We see that there are an equal number of positive and negative classes. Such a distribution is called a balanced distribution of data.

CONCEPT ALERT❗

Our task was to predict the sentiment of a movie review, but our data seems to have already been annotated (labeled) for all the texts. Such data, which has already been labeled (often manually) is called labeled data. Using labeled data to build machine learning models is called supervised learning.

Essentially in supervised learning, we use historical data to train our model. The model tries to ‘learn’ the underlying patterns in the data and tests itself on unseen data. Imagine this testing as one of your final exams and training as your preparation for the test. This training and testing form the foundation of Machine Learning.

Step 4: Data Preprocessing

All ML models follow a basic principle — GIGO (Garbage In Garbage Out). Before we can use our data to train our model, it is essential to ensure that our data is ‘clean’. If we had multiple columns, we would’ve deleted the unnecessary ones. If we had empty rows, we would've deleted them as well.

Often low-level ML models rank the importance of words by their frequency. In the case of our text data, we can get rid of any stopwords (a, an, the, etc.) that, even though they are present in abundance, add little meaning to the sentiment of the sentence. We can also remove any numbers, punctuation, accented strings (à, è, î, ô, æ, etc.), or special characters (@, +, /, etc.). We will also be converting our text to lowercase to ensure that words like ‘angry’ and ‘Angry’ receive equal importance.

There are two other important methods for cleaning text data, stemming and lemmatization, which are largely beyond the scope of this article.
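
If you are curious, here is a minimal, optional sketch of what those two methods do, using the NLTK library (this snippet is not needed for the rest of the recipe):

# Optional: a tiny illustration of stemming vs. lemmatization with NLTK
import nltk
nltk.download('wordnet')

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("movies"))          # 'movi'  -> crude chopping of suffixes
print(lemmatizer.lemmatize("movies"))  # 'movie' -> dictionary-based base form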

Copy and run the below code to download and import the libraries which will help us clean our data:

#imports and downloads
import re

#to remove html tags
from bs4 import BeautifulSoup

#to remove accented strings
!pip install unidecode
import unidecode

# Import stopwords with nltk.
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

Copy and run the below code to clean our data:

#1. Convert all text to lower case
df_imdb['review'] = df_imdb['review'].str.lower()

#2. Removing html tags
df_imdb['review'] = df_imdb['review'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())

#3. Remove URLs, numbers and emails
def clean_text(text):
    text = re.sub(r'http\S+', '', text)  # remove URLs
    text = re.sub(r'\d+', '', text)      # remove numbers
    text = re.sub(r'\S+@\S+', '', text)  # remove email addresses
    return text

df_imdb['review'] = df_imdb['review'].apply(lambda x: clean_text(x))

#4. Remove punctuation and any remaining non-alphabetic characters
df_imdb['review'] = df_imdb['review'].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))

#5. Removing stopwords with Python's list comprehension and pandas.DataFrame.apply
stop = stopwords.words('english')
df_imdb['review'] = df_imdb['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

#6. Remove accented strings
df_imdb['review'] = df_imdb['review'].apply(lambda x: unidecode.unidecode(x))

# View your clean data
print("Cleaning of data completed!")
df_imdb.head()

Your output should look like this:

Every step of the ML model development follows the no free lunch theorem. That is, there is no shortcut to knowing which method will work best for our data and model. Maybe we are losing important data upon removing stopwords? We will never know until we determine the performance of our model, and redo the cleaning process.

Step 5: Train Test Split

Remember the “CONCEPT ALERT❗”? We assess the performance of our ML model based on how well it does on an unseen test set. The most common performance metric is accuracy. The accuracy of our model defines the percentage of correctly predicted sentiments of all available records. It is as simple as saying that if I asked my model to predict the sentiment of 100 sentences, and it was able to predict 80 of them correctly, the accuracy of my model would be 80%.
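
In code, that calculation is just a ratio. Here is a tiny sketch of the formula, using the same numbers as the example above:

# A small illustration of the accuracy formula described above
correct_predictions = 80
total_predictions = 100
accuracy = correct_predictions / total_predictions * 100
print(f"Accuracy: {accuracy}%")  # 80.0%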

To obtain this test set of 100 records, we can use our already available data. To ensure that the test set is not seen by our model, we split the data into two parts — the training set and the testing set. The model will be trained only on the training set, trying to understand the patterns that make a sentence positive or negative. Then, from its understanding, it will try to predict the sentiment of the test set. Based on how well it has performed, we will determine the accuracy of our model.

A good split is 80–20, where we reserve 80% of the data for training and 20% of the data for testing. Moreover, we don't want to choose only the first 80% of the records, as in extreme cases they all may belong to the same label. So we randomly sample 80% of the data as training and the rest as testing.

Copy and run the code to split the data:

# split the dataset in train and test
from sklearn.model_selection import train_test_split

X = df_imdb['review']
y = df_imdb['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 6: Vectorization

This step is specific to dealing with text data. Computers only understand numbers, hence we convert our text to numbers. It could be as simple as replacing each word with its frequency in the entire dataset. While there are many ways to do so, we will be using term frequency-inverse document frequency (TF-IDF) vectorization. Quite a mouthful, but in essence, it measures how important a word is in a record (row) compared to how often it appears in other records (rows).

Since these numbers describe the properties of a record, they are called features. At all times, there is only ONE column for the label; the rest of the columns are the features.
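
To make this a little more concrete, here is a small sketch of what a TF-IDF matrix looks like for two made-up mini ‘reviews’ (these sentences are not from the IMDB data):

from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up mini reviews, purely for illustration
toy_reviews = ["good movie good acting", "bad movie boring plot"]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_reviews)

# Each row is a review, each column is a word, and each value is that word's TF-IDF weight
print(toy_vectorizer.get_feature_names_out())
print(toy_matrix.toarray().round(2))

Notice that ‘movie’, which appears in both toy reviews, ends up with a lower weight than the words unique to each review.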

Copy and run the below code to perform vectorization of the train and test data:

from sklearn.feature_extraction.text import TfidfVectorizer

# Define TFIDF Vectorizer
vectorizer = TfidfVectorizer(max_features=5000)

# Vectorize your X_train and X_test
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)

Step 7: Training the Model

We have finally reached the step of training our machine-learning model! While there are a plethora of models to choose from, we will be using a logistic regression model.

Logistic regression predicts how likely a piece of text is to be positive or negative. For example, it might say that the sentence “If you like original gut-wrenching laughter you will like this movie. If you are young or old then you will love this movie, hell even my mom liked it” has a 95% chance of being positive. If this chance is higher than a set threshold, like 50%, the text is labeled as positive.
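
Under the hood, logistic regression squashes a weighted sum of the features into a probability using the sigmoid function. Here is a tiny sketch of that idea with a made-up score (not our actual trained model):

import numpy as np

def sigmoid(z):
    # Squashes any real-valued score into a probability between 0 and 1
    return 1 / (1 + np.exp(-z))

score = 2.9                   # made-up weighted sum of features for a review
probability = sigmoid(score)  # roughly 0.95
threshold = 0.5
label = "positive" if probability >= threshold else "negative"
print(f"P(positive) = {probability:.2f} -> predicted label: {label}")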

We train the model using only 80% of our data, the train set. Copy and run this code to declare and train your logistic regression model:

from sklearn.linear_model import LogisticRegression

# Defining the Logistic Regression Model
logistic_regression = LogisticRegression(random_state=42)

# Training the logistic regression Model
logistic_regression.fit(train_vectors, y_train)

Step 8: Testing the Model

Once we have trained our model, we can test its performance on our test set which we had kept aside earlier.

As mentioned earlier, the performance of a model is measured using performance metrics. Along with accuracy, we have precision, recall, F1, AUC-ROC, confusion matrix, etc. We can go ahead and choose accuracy as our metric.

Copy and run this code to test your logistic regression model:

from sklearn.metrics import accuracy_score

# Predicting the test data
predictions = logistic_regression.predict(test_vectors)

# Calculating the accuracy
accuracy = accuracy_score(y_test, predictions) * 100
print(f"Accuracy of Logistic Regression: {accuracy}")

You should receive an output of around 89%, which is great news! Your model is 89% accurate in predicting the sentiment of a movie review.
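
If you want a peek at the other metrics mentioned above (precision, recall, and F1), scikit-learn can report them from the same predictions. Here is a short optional snippet:

from sklearn.metrics import classification_report, confusion_matrix

# Precision, recall and F1 for each class, plus the confusion matrix
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))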

We can also compare our Logistic Regression model with a different Machine Learning model. Let's take the k-Nearest Neighbors (kNN) model as our example.

Imagine k-NN as plotting all the training examples on a multidimensional graph. Each example is like a point in space, with its features determining its position. When we want to classify a new point, k-NN checks its k nearest neighbors, those points closest to it in the graph. By seeing which category those neighbors belong to, k-NN predicts the category of the new point. Here k is a positive integer, and a 5-NN would mean that kNN checks the 5 closest neighbors of a test point. It’s like guessing based on the company you keep on the graph. The k is called the hyperparameter of the model.
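
To see this neighbor-voting idea in action on a tiny made-up dataset (separate from our movie reviews), here is a minimal sketch:

from sklearn.neighbors import KNeighborsClassifier

# Made-up 2D points: three near (1, 1) labeled 'A' and three near (8, 8) labeled 'B'
points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
labels = ['A', 'A', 'A', 'B', 'B', 'B']

toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(points, labels)

# A new point near the 'A' cluster is classified by majority vote of its 3 closest neighbors
print(toy_knn.predict([[2, 2]]))  # ['A']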

Since kNN calculates the distance between the points for each test example, it may take a long time to execute (around 30 minutes in this case).

Copy and run the code below to find the accuracy of a kNN model to predict the sentiment of a movie review.

from sklearn.neighbors import KNeighborsClassifier

# Defining the KNN Model
knn = KNeighborsClassifier(n_neighbors=5) # You can adjust n_neighbors as needed

# Training the KNN Model
knn.fit(train_vectors, y_train)

from sklearn.metrics import accuracy_score

# Predicting the test data
predictions = knn.predict(test_vectors)

# Calculating the accuracy
accuracy = accuracy_score(y_test, predictions) * 100
print(f"Accuracy of KNN: {accuracy}")

You should receive an accuracy of around 73% in this case.

Step 9: Hyperparameter tuning

This is the most time-consuming step of our process. Whenever we use a Machine Learning model, it is associated with a set of hyperparameters (k in the case of KNN). We never know what combination of those hyperparameters will give us the best result (no free lunch!). The only way out is to try as many combinations as possible, try to identify a pattern, and choose the best set of parameters that will maximize our performance metric (accuracy in this case).
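
One common way to automate this search is scikit-learn's GridSearchCV, which tries every combination of hyperparameters for you. Below is a small optional sketch that searches over a few values of k for kNN; note that a grid search over kNN on the full IMDB vectors can be very slow, so you may want to try it on a smaller sample first.

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Candidate values of k, chosen arbitrarily for illustration
param_grid = {'n_neighbors': [3, 5, 7, 9]}

# 3-fold cross-validation on the training set, scored by accuracy
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3, scoring='accuracy')
grid_search.fit(train_vectors, y_train)

print("Best k:", grid_search.best_params_)
print("Best cross-validation accuracy:", grid_search.best_score_)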

Step 10: Model Comparison

The final step is to compare the best accuracies of our models after hyperparameter tuning and using the best model for any of our future tasks. A good way to visualize the accuracy is using a bar graph.

Copy and run the below code to compare our logistic regression and kNN models:

import matplotlib.pyplot as plt

def plot_accuracies(accuracy1, accuracy2, labels=('Logistic Regression', 'kNN')):
    # Bar positions
    x = range(len(labels))

    # Bar heights (accuracies)
    accuracies = [accuracy1, accuracy2]

    # Plotting the bar graph
    plt.bar(x, accuracies, color=['blue', 'orange'])

    # Adding labels and title
    plt.xlabel('Models')
    plt.ylabel('Accuracy')
    plt.title('Comparison of Accuracies')
    plt.xticks(x, labels)
    plt.ylim(0, 100)

    # Display the plot
    plt.show()

# change to the accuracies you received
accuracy_model1 = 89.39
accuracy_model2 = 73.94

plot_accuracies(accuracy_model1, accuracy_model2)

Your output should look like this:

Comparison of accuracies for the Logistic Regression and KNN models

Toying around with the model

We can see that our logistic regression model performs better than the kNN model. We could use this logistic regression model in a mobile application or website, or just as a fun game to play with! You can try it for yourself and see how well your sentences are predicted by your model.

Copy and run the following code to play with your model:

# New sentence to classify
new_sentence = "This movie is great!"

# Convert the new sentence to a TF-IDF feature vector
# (for best results, apply the same cleaning steps used on the training data first)
new_sentence_tfidf = vectorizer.transform([new_sentence])

# Use trained model to predict sentiment
prediction = logistic_regression.predict(new_sentence_tfidf)

# Get probability scores
probability_scores = logistic_regression.predict_proba(new_sentence_tfidf)

print("Predicted class:", prediction[0])
print("Probability scores:", probability_scores)

We have reached the end of our recipe for ML! This should give you a good start on your ML journey. While this blog touches upon the key structure of any ML setup, the list of topics to learn is extensive. From multi-layer neural networks to the foundations of ChatGPT, and from medicine to finance, the reach of machine learning goes above and beyond. Watch out for this space as we explore more such data science experiments!

You can find my colab notebook here. The GitRepo for the data and code can be found here. Connect with me on LinkedIn here.
