Basics of Machine Learning and a simple implementation of the Naive Bayes algorithm

Hugo Ferreira
Hugo Ferreira’s blog
13 min readMar 29, 2018

An overview of what I learned during the workshop Bootstrap ML

Venue of the workshop “Bootstrap ML”

Last Thursday morning I went to a workshop called “Bootstrap ML”, a two-hour introduction of Machine Learning (ML) directed to a maths-oriented audience. It was part of “Jornadas da Matemática 2018”, a full day event organized by n{math}, the Mathematics students association of Instituto Superior Técnico (IST), the institute where I did my undergraduate studies.

This was a fantastic opportunity to learn the basics of ML without having to spend too much time dealing with the usual reviews of linear algebra and statistics which I find in many introductory expositions of the topic. We could go directly to the main ideas and quickly explore the maths behind the algorithms.

In this post, I’ll describe what I’ve learned in this workshop: namely, the basic ideas behind ML and some of the most common algorithms. At the end, I’ll show a simple application of one of these algorithms, the Naive Bayes Classifier, to spam filtering. I’m going to assume that you are comfortable with the fundamentals of linear algebra and probability theory.

Bootstrap ML and basics of machine learning

The workshop was given by Sérgio Agostinho, a PhD student in the Signal and Image Processing Group of the Institute for Systems and Robotics, an RD&I institution of IST.

The session was split in two parts. In the first hour (or so) we were given a broad overview of the field and explored the most popular families of algorithms. At the end, we saw how to implement one of the algorithms using Python and scikit-learn.

Below, I’ll give an overview of some of the things I learned in this workshop, ending with a simple implementation of the Naive Bayes algorithm to filter email spam using scikit-learn. You can find the slides of the presentation and Jupyter notebooks in Sérgio’s GitHub repository:

Main components of ML

The introduction to ML started with the following famous quote from Tom. M. Mitchell in his book Machine Learning (1997), which gives an operational definition of learning from data by a machine:

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

Let’s briefly explore each of these three components of ML.

Experience

The type of experience that ML algorithms have during the learning process allows us to categorize them into two types:

  • Supervised algorithms: the datasets used for these algorithms contain several features and have an associated label or target.
  • Unsupervised algorithms: the datasets also contain several features, but no targets; they instead learn properties of the structure of the datasets.

To better understand this distinction, let’s consider an example in which we use on the oldest datasets studied by statisticians, the Iris dataset. It consists of measurements of different parts of 150 iris plants performed by R. A. Fisher in 1936. The dataset can be found in the UCI Machine Learning Repository. Here are a few of the datapoints.

The features of this dataset are the sepal length, sepal width, petal length and petal width (if you don’t know what these are, as I didn’t, see the image below). The target is the species each plant belongs to.

Source: https://www.datacamp.com/community/tutorials/keras-r-deep-learning

A supervised algorithm can use this dataset to learn to classify the plants into each of the species (target) based on the measurements (features). An unsupervised algorithm will not have access to the target and may try, for example, to divide the dataset into clusters of similar examples.

Tasks

When talking about “tasks”, we don’t refer to the process of learning itself, but what we want to achieve using ML. Below, I list some of the common tasks used in ML with a very brief explanation. I’ll detail two specific examples later in the post.

  • Regression: predict some numerical value given some input. Mathematically, if we have n features, the desired output is a function
Examples of regression algorithms (in this case, polynomial interpolation). Source: https://goo.gl/gvykrB
  • Classification: predict some category or class given some input. In this case, if we have n inputs and k categories the inputs can belong to, the desired output is a function
Examples of classification algorithms. Source: https://goo.gl/8zHfgm
  • Dimensionality reduction: reduce the dimensions of the data while trying to preserve as much information as possible.
Example of a dimensionality reduction algorithm (Principal Component Analysis) applied to the Iris dataset. Source: https://goo.gl/cxv6BR
  • Anomaly detection: go through a set of objects or events and flag those which are atypical (outliers). A simple example is credit card fraud detection.
An example using an anomaly detection algorithm (Isolation Forest). Source: https://goo.gl/67zscw
  • Synthesis and sampling: generate new examples that are similar to those in the dataset.

A very interesting example of the last type of task given in the workshop was from recent research in text-to-image algorithms to synthesize photo-realistic images conditioned on text descriptions. For example, in the image below, given the text descriptions three different algorithms generated the pictures below.

It is mind-blowing that the pictures were actually synthesized! At first, I thought the algorithm was just picking an appropriate picture from some database (which would be already impressive). If you want a technical overview of the StackGAN (Stacked Generative Adversarial Networks) algorithm which generated the last row of pictures, here is a talk from last November from one of the researchers.

Performance

We need a quantitative measure of the performance of the ML algorithms in order to evaluate their abilities. For example, for classification tasks, an appropriate measure might be the accuracy of the model, that is, the proportion of examples for which the model produces the correct output.

For regression problems, a good measure might be the quadratic error:

where

For complicated functions, there might be more than one minimum, though often it is enough to find a good enough local one. Below, we will apply this to the linear regression problem.

For classification problems, a confusion matrix allows us to visualize the performance of an algorithm. Each row of the matrix represents an instance of the output (predicted) class, whereas each column represents an instance of the target class.

Let’s consider an example with the MNIST dataset of handwritten digits.

Sample images from MNIST dataset. Source: https://goo.gl/RVg4Vi

Suppose that, after using our favorite classification algorithm, we test it against a test set and obtain the following confusion matrix.

Confusion matrix for test set of MNIST dataset.

For example, of the 40 actual instances of the digit 6 in the test set, the algorithm predicted correctly 39 of them and predicted that one of them is the digit 7. All the correct predictions are in the diagonal of the matrix.

In the end, we want to guarantee that our algorithm performs well on new, previously unseen inputs, and not just those which were used to train our model. This property usually goes by the name of generalization.

Take the following simple example which always helps me to visualize the problem. Imagine our training set consists of a random sample of (x, y) which satisfy a quadratic function. We try to fit three different models:

  • a linear function;
  • a quadratic function;
  • a polynomial function of degree 9.
Source: http://www.deeplearningbook.org/

Both the second and third models have very small training errors, but it is clear that the third model will have large test errors since the difference from a new test point (which we know to follow a quadratic relation) and the fitting function will be large. This is a typical case of overfitting. It is a mathematical fact that given n + 1 points there is one polynomial of order n passing through them (you only need basic linear algebra to prove this), so it is unsurprising that the training error decreases as you consider polynomials of higher degree (that is, more complex models to fit the data).

The opposite happens with the first model — underfitting. The simple linear model is not able to fit very well to the data and, therefore, to have a sufficiently low training error.

In general, in order to obtain the lowest test (or generalization) error, we need a model with optimal capacity. A model with low capacity usually struggles to fit the training set, whereas a model with high capacity usually overfits by “memorizing” the training data. In both extreme cases, the test error is large and the goal is to find the optimal middle point for which the test error is minimum.

Source: http://www.deeplearningbook.org/

Linear regression

Having explored the fundamentals of ML, let’s go into a bit more detail on two algorithms, starting with linear regression.

Example of a linear regression application. Source: https://goo.gl/rfXVJs

The output of a linear regression algorithm is a linear function of the input:

where

is a vector of parameters.

We want to find the parameters which minimize the mean squared error:

This is just a simple example of a more general family of linear regression problems, which can be succinctly written as

where we now include the intercept in the vector of parameters:

Note the regression problem is linear in β, not in x! Hence, the function Φ of x can be nonlinear in particular.

The parameters which minimize the mean squared error can be obtained in closed-form and they are given by

where

It is a short but instructive exercise to derive this by yourself!

Naive Bayes classifier

Now we turn to a classification problem. Suppose, for example, that we have two Gaussian number generators. If a new number x is generated, can we predict which of the generators did it?

Formally, as previously mentioned, a classification problems involves predicting which of k categories best corresponds to n features:

Hence, we want to find

To do it, we can use Bayes’ theorem:

The theorem states that the posterior probability of the j-class, given the x-features, is equal to the probability of the x-features, given the j-class, multiplied by the prior probability of the j-class, normalized by the probability of the x-features. Note that the denominator does not depend on the class, therefore it is sufficient to maximize the numerator.

However, it is usually difficult to compute P(x|j) in full generality. The Naive Bayes classifier adds the simplifying assumption that the features are conditional independent of the class:

Let’s look at an example. Suppose our dataset consists of measurements of the number of three different fruits which have the properties of being long and sweet, as given in the following table.

If we pick a new fruit, which happens to be long, but not sweet, can we predict which fruit it is?

For that, let’s compute some probabilities!

Now, using Bayes’ theorem, we can compute the probabilities (up to a common normalization) for each type of fruit, given that it is long and not sweet:

Therefore, we are about 82.8% sure that the fruit is a banana. Even though we had to assume that the features are conditionally independent of the type of fruit, as out speaker Sérgio Agostinho would say:

“Turns out it works surprisingly well!”

Sérgio also mentioned other algorithms, including a very brief introduction to neural networks, but I think this is enough for today’s post. Let’s move on to the practical implementation of the Naive Bayes algorithm.

Implementing Naive Bayes for spam detection

At the end of the workshop, we were shown how to implement some of these algorithms using Python and scikit-learn. Here, I’m going to show how to implement the Naive Bayes algorithm in a simple example with the objective of filtering spam. You can find the Jupyter notebook I wrote in my GitHub:

GitHub repository: https://github.com/hugorcf/data-science-nano-projects/

Below, I give an overview of its contents. Notice that the focus here is in the implementation of the algorithm, and as such the data I’m going to use was already preprocessed, so that we can go straight to the ML part. I know that data scientists spend most of their time cleaning the data, but fortunately I won’t have to deal with this today.

The dataset I am going to use, which can be found in this page, was used in the paper (the arXiv version may be found here):

I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, George Paliouras, and C.D. Spyropoulos, “An Evaluation of Naive Bayesian Anti-Spam Filtering”. In Potamias, G., Moustakis, V. and van Someren, M. (Eds.), Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (ECML 2000), Barcelona, Spain, pp. 9–17, 2000.

Reading the data

The dataset, given in an uncompressed .npz file, consists of 2893 emails and their categorization as spam or not. After reading the source file, the collection of emails was stored in a list of lists of strings, each containing a word of the emails.

All the words with less than 3 letters were previously filtered out and all the numbers were replaced with a ‘#’. Furthermore, stop words such as ‘and’ and ‘but’ were also removed and related words like ‘walk’, ‘walks’ and ‘walking’ were merged into one (‘walk’).

In these 2893 emails there were 779547 words, out of which 49932 were unique. That’s quite a lot of unique words for almost three thousand emails! To see which words were the most common, I produced the following bar plot.

We see that numbers are by far the most common words in the emails, followed by ‘language’, ‘university’, ‘mail’ and ‘linguistic’. It seems this dataset came from emails from a language department!

In any case, out of the 49932 unique words, you can find that 9451 capture 90% of the total word occurrences in these emails. This is a much smaller number! These are the ones we are going to use for the rest of this analysis, so we store it in a list called unique_words_90.

Creating the features

In order to implement the algorithm, we first need to create the feature vectors, together with the target vector which tells which emails are spam or not. Given each list of strings corresponding to each email, we construct an array in which each element is the number of occurrences of each unique word in the list unique_words_90. For example, for the first email, the feature vector is of the following form:

[12  5  0 ...  0  0  0]

This means that in the first email there are 12 instances of ‘#’, 5 instances of ‘language’, no instance of ‘university’, etc. With alls the emails, the feature matrix becomes:

array([[12,  5,  0, ...,  0,  0,  0],
[11, 0, 2, ..., 0, 0, 0],
[20, 1, 30, ..., 0, 0, 0],
...,
[10, 0, 0, ..., 0, 0, 0],
[ 1, 0, 0, ..., 0, 0, 0],
[ 0, 0, 0, ..., 0, 0, 0]], dtype=uint32)

Splitting the dataset

Before applying the algorithm, we split the dataset into a training set and a test set, in a 70:30 proportion. This can be done using the train_test_split function of sklearn.

from sklearn.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

Implementing the Naive Bayes Classifier

Here, we are going to use MultinomialNB, which implements the Naive Bayes algorithm for multinomially distributed data. First, we use the training set to fit the model.

from sklearn.naive_bayes import MultinomialNBnb = MultinomialNB()
nb.fit(X_train, y_train)

Then, we predict the output using the test set and compute the confusion matrix to evaluate the performance of the algorithm.

from sklearn.metrics import confusion_matrixprediction = nb.predict(X_test)print(confusion_matrix(y_test, prediction))

The output is:

[[724   7]
[ 1 136]]

From the confusion matrix, we see that the Naive Bayes classifier got the following results:

  • Out of the 731 actual instances of ‘not spam’, it predicted correctly 724 of them;
  • Out of the 137 actual instances of ‘spam’, it predicted correctly 136 of them.

You may wonder why we computed the confusion matrix, instead of obtaining the accuracy of the model. Suppose that our dataset had 99% real emails and 1% spam and that we built a classifier that predicted that all emails were real. Then, this algorithm would be 99% accurate, but horrible at classifying spam!

And that’s it for today! I hope you weren’t expecting a groundbreaking new application of ML! This was the first time I actually implemented an algorithm in practice, and I think I’ve learned a lot for such a simple exercise.

The Bootstrap ML workshop was a fantastic opportunity for me to learn the basics of machine learning, without hiding the mathematical details. I’d like to thank Sérgio Agostinho again for the effort! And writing this post forced me to digest all the information and try to convey it to you in a structured way. Let’s hope I succeeded (if not, please send me feedback!).

You can find me at:

--

--

Hugo Ferreira
Hugo Ferreira’s blog

Data Scientist and Machine Learning enthusiast; physicist and maths geek.