Building a Machine Learning Pipeline

Harsh Bardhan Mishra · Tesseract Coding · Jun 5, 2020

Have you ever received a message or email claiming, out of the blue, that you have won prize money for a competition you never entered, or a trip to some foreign destination? It happens to all of us, and what lies beneath these messages is a malicious intent to dupe a gullible public and make easy money off them. The rise of the Internet and mobile networks has heralded a new revolution in cyberspace, where advertisers use ever newer means to reach the public, and sometimes those means are put to the wrong use.


People are often fooled by such content, which usually carries an advertisement that misrepresents the product or service on offer. To counter this, such messages came to be classified as “Spam”, which serves as an explicit warning about their content, while messages regarded as useful to the receiver were classified as “Ham”.

With the rise of Artificial Intelligence and the use of Machine Learning across many use cases, corporations and email service providers have started using Machine Learning Models to classify these messages and emails as “Spam” or “Ham”. In this article, we will cover how to develop an End-to-End Machine Learning Model that classifies messages as “Spam” or “Ham”.


What is an End-to-End Machine Learning Model?

In simple terms, End-to-End Machine Learning denotes how we serve our Machine Learning Models in production. It allows us to serve a model directly from a server to customers on the client side, bypassing the manual procedures we follow during development.

The majority of Machine Learning practitioners and students prefer to develop and model their data in an interactive environment like Jupyter Notebook or Azure Notebooks. They formulate a hypothesis, load the Dataset, and clean and pre-process it before moving ahead with specific libraries to model the Data. This means that an original input, let’s say X, goes through multiple processes to reach the end result Y. So how does an End-to-End Machine Learning Model help us?


It allows us to take an input X directly to a prediction Y, without manually running any intermediate process in between. Developing an End-to-End Machine Learning Model lets us build scalable models that can be served to multiple users and support real-time processing.

While in traditional Machine Learning the historical data is collected, pre-processed, modelled using a specific algorithm, and then compared on evaluation scores, an End-to-End Machine Learning Model collects the data, streams it through feature extraction, and uses the existing model to generate results. In this project, we will develop a Machine Learning Pipeline that can be served directly to an application and help us check whether a message is “Spam” or “Ham”.

Data Collection and Exploratory Data Analysis

We will be using the SMS Spam Collection Data Set from the UCI Machine Learning Repository, a collection of 5,574 text messages gathered for classification and clustering purposes. The messages are suitably labelled as “Spam” or “Ham”, which makes the dataset ideal for training our Machine Learning Model.

Let’s first load our Dataset using the Pandas library, a popular library for building DataFrames and preparing data for modelling.
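A minimal sketch of that step, assuming the raw UCI file has been downloaded locally under the name SMSSpamCollection (the file is tab-separated, one label and one message per line, with no header row):

```python
import pandas as pd

# The UCI "SMS Spam Collection" file is tab-separated: label<TAB>message,
# with no header row. The local filename here is an assumption.
messages = pd.read_csv(
    "SMSSpamCollection",
    sep="\t",
    names=["label", "message"],
)
print(messages.head())
```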

Our Dataset has two columns: labels and messages. The label denotes whether a message is “Spam” or “Ham”, while the message column holds the actual text that was labelled. Once we have loaded our Dataset, we can move forward with some Exploratory Data Analysis to understand it better before the final implementation.

We can check out some of the properties of our Data using Pandas. Let’s start with the length of each text:
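A sketch of that check, reusing the messages DataFrame loaded above:

```python
# Add a character-length column and summarize its distribution
messages["length"] = messages["message"].apply(len)
print(messages["length"].describe())
```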

The resulting summary statistics show how message lengths are distributed across the Dataset.

Let’s go forward and compare the lengths of “Spam” and “Ham” messages:
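One way to visualize this, assuming matplotlib is installed, is a pair of side-by-side histograms:

```python
import matplotlib.pyplot as plt

# Plot separate message-length histograms for the "ham" and "spam" labels
messages.hist(column="length", by="label", bins=50, figsize=(12, 4))
plt.show()
```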

In the resulting plot, we can see that the length of a message is a useful signal for judging whether it is spam: Spam messages tend to contain more characters than Ham messages.

With this basic Feature Engineering sorted out, we can move forward with pre-processing our Dataset and fine-tuning it for Data Modelling.

Data Pre-Processing

In this Dataset, all the input variables are strings. Machine Learning algorithms, on the other hand, work on numerical vectors to model the data and generate predictions. We will therefore convert the whole dataset into numerical values that our Machine Learning Model can process easily.

To make this possible, each raw message, which is a sequence of characters, is converted into a vector of numbers. For this we will use the Bag-of-Words approach, in which each distinct word is represented by a count of how many times it appears. Before that, we will remove Stopwords from our strings. Stopwords are common filler words that people use in everyday language and that hold little significance for Data Modelling.


Let us define a function that takes in any string and removes the Stopwords and punctuation from it. We will be using the NLTK library, which ships a list of English stopwords, for this purpose.
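A minimal sketch of such a function (the name text_process is our choice, not prescribed by any library):

```python
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of NLTK's stopword list
STOPWORDS = set(stopwords.words("english"))

def text_process(text):
    """Strip punctuation, drop English stopwords, return the words left."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    return [word for word in no_punct.split() if word.lower() not in STOPWORDS]

# Sanity check on the first few messages
print(messages["message"].head().apply(text_process))
```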

Now that we have removed the Stopwords and punctuation from the Dataset, we can move forward with normalizing it, converting all the words into a form that Scikit-Learn’s Machine Learning libraries can operate on.

We will do this in three well-defined steps. First, we calculate the term frequency, which denotes the number of times a word occurs in a document. Next, we weigh the counts so that tokens frequent across the corpus get a lower weight; this is the inverse document frequency. Finally, we convert each vector to unit length (the L2 norm), to abstract away from the original text length.
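For reference, a sketch of the textbook weighting these steps describe (Scikit-Learn's actual TfidfTransformer adds smoothing terms on top of this):

```python
import math

def tfidf_weight(term_count, n_docs, doc_freq):
    """Textbook TF-IDF for one term in one document.

    term_count: occurrences of the term in the document (term frequency)
    n_docs:     total number of documents in the corpus
    doc_freq:   number of documents containing the term
    """
    return term_count * math.log(n_docs / doc_freq)

# Each document's vector of such weights is then scaled to unit L2 length.
```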

Let’s use Scikit-Learn’s CountVectorizer to convert the Dataset into a sparse matrix and see how a string becomes a vector:
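A sketch of the Bag-of-Words step, reusing the text_process function from above as the analyzer:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary over all messages, tokenizing with text_process
bow_transformer = CountVectorizer(analyzer=text_process).fit(messages["message"])
print("Vocabulary size:", len(bow_transformer.vocabulary_))

# Convert the whole corpus into a large, sparse document-term matrix
messages_bow = bow_transformer.transform(messages["message"])
print("Shape of sparse matrix:", messages_bow.shape)
```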

We have used the transform() function to convert the entire DataFrame of messages into a large, sparse matrix. We can now use another technique, TF-IDF (term frequency-inverse document frequency), to weigh the importance of each word across the corpus. So, let’s go ahead and implement this with our Dataset:
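A sketch using Scikit-Learn's TfidfTransformer, which also applies the L2 normalization described above:

```python
from sklearn.feature_extraction.text import TfidfTransformer

# Learn IDF weights from the bag-of-words counts, then re-weight the corpus;
# TfidfTransformer L2-normalizes each row by default (norm="l2")
tfidf_transformer = TfidfTransformer().fit(messages_bow)
messages_tfidf = tfidf_transformer.transform(messages_bow)
print(messages_tfidf.shape)
```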

We have now normalized our whole Dataset and can finally move forward with Data Modelling. We will also build a pipeline to serve the Model, which is the central purpose of this article.

Data Modelling

For the purpose of Data Modelling, we will be using a Naive Bayes classifier. The reason is that it is quite simple, yet it often achieves results that outperform far more complex algorithms, especially on classification problems. Here the Naive Bayes classifier fits a distribution over words for each class, which simplifies our task at hand.


We will implement Naive Bayes using Scikit-Learn’s implementation, which lets us write minimal code for the task at hand:
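A sketch with MultinomialNB, a common choice for word-count features, trained on the TF-IDF matrix from above:

```python
from sklearn.naive_bayes import MultinomialNB

# Fit a multinomial Naive Bayes model on the TF-IDF features and labels
spam_detect_model = MultinomialNB().fit(messages_tfidf, messages["label"])
```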

Pretty easy, right? We modelled our pre-processed, clean Dataset in just two lines of code! Let us now evaluate the performance of our model using another Scikit-Learn utility: classification_report.
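A sketch of that evaluation, predicting on the same messages we trained on:

```python
from sklearn.metrics import classification_report

# Predict on the training data itself (optimistic; see the split below)
all_predictions = spam_detect_model.predict(messages_tfidf)
print(classification_report(messages["label"], all_predictions))
```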

We can now check out the Classification Report for the model we have prepared. However, we have evaluated it on the very same data we trained it on, which paints an over-optimistic picture. We will therefore follow another methodology and test the model by dividing (or splitting) the dataset into two parts: Train and Test.

Let’s use another Scikit-Learn function, train_test_split, to split our dataset into two parts. We can use 20% of the dataset for testing while the remaining 80% is used for training the model.
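A sketch of the split on the raw messages (the fixed random_state is our addition, for reproducibility):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the raw messages and their labels for testing
msg_train, msg_test, label_train, label_test = train_test_split(
    messages["message"], messages["label"], test_size=0.2, random_state=42
)
```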

Preparing the Pipeline

Let us now arrive at the main part: preparing our Pipeline to serve the Model. In the previous steps we loaded, pre-processed, and cleaned the Dataset; now we need to move ahead with creating a Pipeline. A pipeline will help us automate the whole workflow in a single step and a few lines of code, so let’s try it out.

We will be leveraging Scikit-Learn’s Pipeline class to chain our whole workflow together and model the dataset. The resulting object can sit behind an Application Programming Interface (API), be served directly to a Web Application, and perform its computations quickly.
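A sketch of the full pipeline, chaining the three steps we built by hand and then training and evaluating it on the split from above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Chain vectorization, TF-IDF weighting, and classification into one estimator
pipeline = Pipeline([
    ("bow", CountVectorizer(analyzer=text_process)),  # raw text -> token counts
    ("tfidf", TfidfTransformer()),                    # counts -> TF-IDF weights
    ("classifier", MultinomialNB()),                  # TF-IDF -> spam/ham label
])

# The pipeline accepts raw text and runs every intermediate step itself
pipeline.fit(msg_train, label_train)
predictions = pipeline.predict(msg_test)
print(classification_report(label_test, predictions))
```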

In a few easy steps, we have set up a Pipeline Model that can be served through an API to some client-side interface, and we no longer need to take care of all those background processes ourselves.

Conclusion

In this article, we have covered why Pipelines are needed to serve End-to-End Machine Learning Models and how you can classify “Spam” or “Ham” messages using a Naive Bayes algorithm. Naive Bayes works so well because it rests on the assumption that the features of the dataset are independent of each other. We have also seen how to serve the Model using a Pipeline; in further articles, we will learn how to serve it directly using a Flask API on a client-side interface, picking up the basics of Flask along the way.

Check out the Code here.

Leave a Clap

~ Harsh Bardhan Mishra

