Understanding Machine Learning: a practical example in Python

Jonathan Barsotti
5 min read · Sep 28, 2020

Machine Learning (ML) now appears frequently in technology articles, newspapers and magazines. Experts and technology enthusiasts need no introduction to it. However, many people outside the field often hear about this “black box”, which is increasingly used in the algorithms behind social networks, virtual assistants and search engines, to mention some of the best-known cases, without having a precise idea of what ML is. This brief story is not meant to be an exhaustive treatment of all aspects of ML. Instead, I want to present and explain a simple ML application that mimics a real ML case. It is a concrete, tangible example of how to train a linear regression model, test it and finally use it to make predictions.

Task presentation:

A company sells clothing online. However, they also provide in-store style and clothing advice sessions. Customers come into the store, have sessions with a personal stylist, then go home and order the clothes they want through either the mobile app or the website.

Note: the data used are fake and just simulate real company data.

Goal:

The company is trying to decide whether to focus their efforts on their mobile app or their website, and this analysis should help them to figure it out!

As a first step we load the libraries: collections of ready-made code that let us perform many operations without writing them from scratch. In particular, we will use pandas to manage the data, numpy to perform some mathematical operations, and two libraries for advanced data visualization, seaborn and matplotlib.
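A minimal sketch of this import step. The short aliases (np, pd, plt, sns) are just the usual community conventions, not something this story prescribes.

```python
import numpy as np               # numerical operations
import pandas as pd              # tabular data handling (DataFrame)
import matplotlib.pyplot as plt  # base plotting library
import seaborn as sns            # statistical plots built on top of matplotlib

# In a Jupyter notebook one would typically also run: %matplotlib inline
```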

Once the libraries are imported, we need to read the data. To do so, we use pandas. We load the data into a data structure called a DataFrame and name it “customers”. This structure lets us manage the data and perform advanced operations on them, as a sort of “special table”. As many say, it can be thought of as a “doped version” of Excel :-).
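A sketch of the loading step. The data file is not named in this story, so the file name in the comment is hypothetical, and the `make_fake_customers` helper is my own stand-in that builds a synthetic DataFrame instead. The columns “Avg. Session Length” and “Time on Website”, and every number used to generate the data, are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# With the real file one would write something like:
#   customers = pd.read_csv("customers.csv")   # hypothetical file name


def make_fake_customers(n=500, seed=0):
    """Synthetic stand-in for the company data; every number here is invented."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Avg. Session Length": rng.normal(33.0, 1.0, n),   # assumed column
        "Time on App": rng.normal(12.0, 1.0, n),
        "Time on Website": rng.normal(37.0, 1.0, n),       # assumed column
        "Length of Membership": rng.normal(3.5, 1.0, n),
    })
    # Invented linear rule plus noise, so a linear model has something to find.
    df["Yearly Amount Spent"] = (
        -1050.0 + 26.0 * df["Avg. Session Length"] + 39.0 * df["Time on App"]
        + 0.4 * df["Time on Website"] + 61.0 * df["Length of Membership"]
        + rng.normal(0.0, 10.0, n)
    )
    return df


customers = make_fake_customers()
print(type(customers).__name__, customers.shape)
```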

We take a first look at the data to get an idea of their content:

Content of the customers DataFrame.

Here we can explore the data by looking at the content (columns). We can see some personal and profile data. However, what we are interested in are the statistical data about each customer (i.e. each row). We will be particularly interested in the “Yearly Amount Spent” by each client.
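A first look of this kind could be sketched as follows. The DataFrame here is a synthetic stand-in with invented numbers and partly assumed column names, so the printed values are not the company's.

```python
import numpy as np
import pandas as pd


def make_fake_customers(n=500, seed=0):
    """Synthetic stand-in for the company data; every number here is invented."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Avg. Session Length": rng.normal(33.0, 1.0, n),   # assumed column
        "Time on App": rng.normal(12.0, 1.0, n),
        "Time on Website": rng.normal(37.0, 1.0, n),       # assumed column
        "Length of Membership": rng.normal(3.5, 1.0, n),
    })
    df["Yearly Amount Spent"] = (
        -1050.0 + 26.0 * df["Avg. Session Length"] + 39.0 * df["Time on App"]
        + 0.4 * df["Time on Website"] + 61.0 * df["Length of Membership"]
        + rng.normal(0.0, 10.0, n)
    )
    return df


customers = make_fake_customers()

first_rows = customers.head()   # first five rows (one row per customer)
print(first_rows)

customers.info()                # column names, types and non-null counts

summary = customers.describe()  # count, mean, std, min, quartiles, max
print(summary)
```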

After a first look at the raw data, it is very useful to perform a visual exploration to get more information. We have to remember our goal: we are mainly interested in the correlation between the variables and how much clients spend.

Pairplot of the customers data: correlation between each pair of variables.

The data visualization provides a lot of useful information. For example, we clearly see that the variable most strongly correlated with the “Yearly Amount Spent” (last row) is membership duration (variable name: “Length of Membership”). However, “Time on app” also shows a correlation. We can then go deeper:
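The figure that followed here is not reproduced, so this is only one possible way to go deeper, not necessarily the one used originally: rank the variables by their correlation with the amount spent, then fit a straight line to the strongest one. The data are a synthetic stand-in (invented numbers, partly assumed column names), so the ranking below is built into the fake data.

```python
import numpy as np
import pandas as pd


def make_fake_customers(n=500, seed=0):
    """Synthetic stand-in for the company data; every number here is invented."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Avg. Session Length": rng.normal(33.0, 1.0, n),   # assumed column
        "Time on App": rng.normal(12.0, 1.0, n),
        "Time on Website": rng.normal(37.0, 1.0, n),       # assumed column
        "Length of Membership": rng.normal(3.5, 1.0, n),
    })
    df["Yearly Amount Spent"] = (
        -1050.0 + 26.0 * df["Avg. Session Length"] + 39.0 * df["Time on App"]
        + 0.4 * df["Time on Website"] + 61.0 * df["Length of Membership"]
        + rng.normal(0.0, 10.0, n)
    )
    return df


customers = make_fake_customers()
target = "Yearly Amount Spent"

# Pearson correlation of every variable with the target, strongest first.
corr_with_target = (
    customers.corr()[target].drop(target).sort_values(ascending=False)
)
print(corr_with_target)

# Straight-line fit of the target against the most correlated variable.
best = corr_with_target.index[0]
slope, intercept = np.polyfit(customers[best], customers[target], deg=1)
print(f"{target} ~ {slope:.1f} * {best} + {intercept:.1f}")
```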

After the visual data exploration it’s finally time to develop our ML model: we will use a linear regression model. It is worth noting that choosing the proper model in ML is an art which requires solid knowledge of the mathematical theory and a lot of experience in its implementation. In this case things are much easier, and a linear model is the most natural choice, as shown by the previous fit of the data. The library we will use for this task is sklearn. Now it is time to create a train dataset and train our model, which in this case means finding the proper coefficients for the linear function we are using. We will also create a test dataset, a subset of the data that the model has never seen before, and use it to check the model’s predictive performance. To do so we can use a dedicated function.

The train_test_split() function randomly splits the data contained in the customers DataFrame, to obtain two distinct datasets:

Train dataset: X_train, y_train

Test dataset: X_test, y_test
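A sketch of the split on synthetic stand-in data (invented numbers, partly assumed column names). The 30% test share and the `random_state` value are my assumptions; the story does not state which values were used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def make_fake_customers(n=500, seed=0):
    """Synthetic stand-in for the company data; every number here is invented."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Avg. Session Length": rng.normal(33.0, 1.0, n),   # assumed column
        "Time on App": rng.normal(12.0, 1.0, n),
        "Time on Website": rng.normal(37.0, 1.0, n),       # assumed column
        "Length of Membership": rng.normal(3.5, 1.0, n),
    })
    df["Yearly Amount Spent"] = (
        -1050.0 + 26.0 * df["Avg. Session Length"] + 39.0 * df["Time on App"]
        + 0.4 * df["Time on Website"] + 61.0 * df["Length of Membership"]
        + rng.normal(0.0, 10.0, n)
    )
    return df


customers = make_fake_customers()

# X holds the explanatory variables, y the quantity we want to predict.
X = customers.drop(columns="Yearly Amount Spent")
y = customers["Yearly Amount Spent"]

# test_size and random_state are assumptions chosen for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)
print(X_train.shape, X_test.shape)
```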

We now create an instance of the linear model we decided to use and train it on the train dataset just created:
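A sketch of this training step with sklearn's `LinearRegression`. The data are a synthetic stand-in (invented numbers, partly assumed column names), and the split parameters are my assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def make_fake_customers(n=500, seed=0):
    """Synthetic stand-in for the company data; every number here is invented."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Avg. Session Length": rng.normal(33.0, 1.0, n),   # assumed column
        "Time on App": rng.normal(12.0, 1.0, n),
        "Time on Website": rng.normal(37.0, 1.0, n),       # assumed column
        "Length of Membership": rng.normal(3.5, 1.0, n),
    })
    df["Yearly Amount Spent"] = (
        -1050.0 + 26.0 * df["Avg. Session Length"] + 39.0 * df["Time on App"]
        + 0.4 * df["Time on Website"] + 61.0 * df["Length of Membership"]
        + rng.normal(0.0, 10.0, n)
    )
    return df


customers = make_fake_customers()
X = customers.drop(columns="Yearly Amount Spent")
y = customers["Yearly Amount Spent"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101  # assumed values
)

lm = LinearRegression()   # create the (still untrained) model instance
lm.fit(X_train, y_train)  # training: find the best-fitting coefficients

print("intercept:", lm.intercept_)
print("coefficients:", lm.coef_)
```

After `fit`, the learned intercept and coefficients are stored in `lm.intercept_` and `lm.coef_`, one coefficient per column of `X_train`.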

Now it’s finally time to make some predictions, testing our trained model on the never-seen test dataset. The next step is to plot the predictions against the actual results (i.e. y_test), that is, the actual amount of money spent by the customers whose data were contained in X_test. In the following graph the predicted values are very well correlated with the actual data (provided by the company), showing that the model works well on this dataset.
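A sketch of the prediction step and of a predicted-versus-actual scatter plot. Everything runs on synthetic stand-in data (invented numbers, partly assumed column names, assumed split parameters), so the plot only illustrates the procedure. The "Agg" backend line lets the sketch run without a display and is not needed in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def make_fake_customers(n=500, seed=0):
    """Synthetic stand-in for the company data; every number here is invented."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Avg. Session Length": rng.normal(33.0, 1.0, n),   # assumed column
        "Time on App": rng.normal(12.0, 1.0, n),
        "Time on Website": rng.normal(37.0, 1.0, n),       # assumed column
        "Length of Membership": rng.normal(3.5, 1.0, n),
    })
    df["Yearly Amount Spent"] = (
        -1050.0 + 26.0 * df["Avg. Session Length"] + 39.0 * df["Time on App"]
        + 0.4 * df["Time on Website"] + 61.0 * df["Length of Membership"]
        + rng.normal(0.0, 10.0, n)
    )
    return df


customers = make_fake_customers()
X = customers.drop(columns="Yearly Amount Spent")
y = customers["Yearly Amount Spent"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101  # assumed values
)
lm = LinearRegression().fit(X_train, y_train)

# Predict the yearly amount for customers the model has never seen.
predictions = lm.predict(X_test)

# Points close to the diagonal mean predictions close to the true values.
fig, ax = plt.subplots()
ax.scatter(y_test, predictions)
ax.set_xlabel("Actual Yearly Amount Spent (y_test)")
ax.set_ylabel("Predicted Yearly Amount Spent")
```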

We evaluate our model performance using quantitative metrics, typically the “Root Mean Square Error”. Here we also show two more, the “Mean Absolute Error” and the “Mean Square Error”.
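The three metrics can be computed with `sklearn.metrics`, as in this sketch. The values it prints come from synthetic stand-in data (invented numbers, partly assumed column names, assumed split parameters) and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def make_fake_customers(n=500, seed=0):
    """Synthetic stand-in for the company data; every number here is invented."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Avg. Session Length": rng.normal(33.0, 1.0, n),   # assumed column
        "Time on App": rng.normal(12.0, 1.0, n),
        "Time on Website": rng.normal(37.0, 1.0, n),       # assumed column
        "Length of Membership": rng.normal(3.5, 1.0, n),
    })
    df["Yearly Amount Spent"] = (
        -1050.0 + 26.0 * df["Avg. Session Length"] + 39.0 * df["Time on App"]
        + 0.4 * df["Time on Website"] + 61.0 * df["Length of Membership"]
        + rng.normal(0.0, 10.0, n)
    )
    return df


customers = make_fake_customers()
X = customers.drop(columns="Yearly Amount Spent")
y = customers["Yearly Amount Spent"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101  # assumed values
)
lm = LinearRegression().fit(X_train, y_train)
predictions = lm.predict(X_test)

mae = metrics.mean_absolute_error(y_test, predictions)  # mean of |error|
mse = metrics.mean_squared_error(y_test, predictions)   # mean of error^2
rmse = np.sqrt(mse)                                     # same units as y

print("MAE: ", mae)
print("MSE: ", mse)
print("RMSE:", rmse)
```

The RMSE is usually the easiest to read because it is expressed in the same units as the “Yearly Amount Spent”, while the MSE is in squared units.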

Each coefficient describes the change (with sign) in the “Yearly Amount Spent” by a client for a one-unit increase in the related variable, with the other variables held fixed. In other words, a higher coefficient means that the corresponding variable makes the amount of money spent by the customers increase more when that variable increases. With this in mind, comparing the coefficient of the time spent on the app with that of the time spent on the website makes it clear that the company should invest in the mobile app and not in the website.
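A sketch of how the coefficients can be put in a small table for comparison. Note that the data are a synthetic stand-in in which I deliberately gave the app a larger effect than the website, so the ordering shown here is an assumption built into the fake data, not evidence about the real company.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def make_fake_customers(n=500, seed=0):
    """Synthetic stand-in for the company data; every number here is invented."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Avg. Session Length": rng.normal(33.0, 1.0, n),   # assumed column
        "Time on App": rng.normal(12.0, 1.0, n),
        "Time on Website": rng.normal(37.0, 1.0, n),       # assumed column
        "Length of Membership": rng.normal(3.5, 1.0, n),
    })
    df["Yearly Amount Spent"] = (
        -1050.0 + 26.0 * df["Avg. Session Length"] + 39.0 * df["Time on App"]
        + 0.4 * df["Time on Website"] + 61.0 * df["Length of Membership"]
        + rng.normal(0.0, 10.0, n)
    )
    return df


customers = make_fake_customers()
X = customers.drop(columns="Yearly Amount Spent")
y = customers["Yearly Amount Spent"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101  # assumed values
)
lm = LinearRegression().fit(X_train, y_train)

# One row per variable: change in yearly spend per one-unit increase,
# with the other variables held fixed.
coefficients = pd.DataFrame(
    lm.coef_, index=X.columns, columns=["Coefficient"]
).sort_values("Coefficient", ascending=False)
print(coefficients)
```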

As we have just seen, Machine Learning is a powerful tool and, thanks to dedicated libraries such as sklearn in Python, it often takes only a few lines of code to implement. Of course, model selection and the related data exploration, cleaning and manipulation are still needed to get optimal and trustworthy results, but this is another story…

I hope this short and simple example helped you better understand what an ML algorithm actually does.
