If you’re a beginner in the field of statistics (like I am), the following article might be of interest to you. Why? Because it gives an overview about a basic machine learning technique described in each and every beginner statistic book that you are able to find out there. It is about Simple Linear Regression and how to implement it using Python. I dedicated some time to dig into this topic and now I want to share some insights with you guys.
What you can expect:
At the end of this article I come up with a simple linear regression model written in Python that predicts the amount of startups of a mobile application during the day (for a specific hour). In more detail:
- I will visualize and evaluate some data representing the fundamentals for creating a regression model. This data comprises information about locations and timestamps of application startups in a specific location.
- Will introduce one approach (gradient descent) to find the parameters (also called weights) of the simple linear regression model that is able to predict the amount of app openings for a specific hour.
- I will evaluate my implementation against a well-known library (scikit learn)
FYI: I used Jupyter Notebook to visualize data and to implement the regression model in python . I gathered the following figures, plots and snippets from it.
I think everyone agrees that the success of a statistical model stands and falls based on the underlying data. Therefore, you should always get familiar with this data BEFORE you start to get things done. In the following I will do exactly that. I will get in touch with the data, perform aggregations, visualize it and finally will (or at least try to) remove outliers.
The data comprises timestamps of startup procedures of a specific app opened in a specific location.
It looks like this:
Just to keep the goal in mind: I want to use this data to predict the amount of app openings for a specific hour. Therefore, I can ignore location details like latitude or longitude for now and can focus on the timestamp column of the data.
What I need to know to make predictions is the amount of app openings for each individual hour for a specific day. These information can be extracted easily from the data.
The first thing that I did was to convert the ISO encoded timestamps to a python date representation. It can be done using:
import dateutil.parser## convert the timestamp to a python date object
Afterwards, I added an hour and day column to the data set.
## add dedicated columns for day, hour and weekday data['hour'] = list(map(lambda v : v.hour, data['timestamp']))
data['day'] = list(map(lambda v : v.day, data['timestamp']))
Based on these columns I’m able to aggregate the data to retrieve the amount of app openings. For this I grouped the data by day and hour. The result looks like the following:
If you now take a careful look at the plot shown above, three things are already visible:
1. There is a trend that users open the app in the late afternoon.
2. There are days where users do not open the app very often (or that our data is incomplete for these days). Either way, it seems that these data points are some kind of outliers.
3. The amount of app openings increases during the day and decreases again after a peek at 3/4/5 PM. It seems that a simple linear model is not suitable for my use-case due to its limited complexity.
And that is the magic of visualizing data: We quickly gathered some knowledge about the underlying data and can predict what may happen if we apply a specific regression approach. The first thing that I assume is, that the data comprises outliers that may skew my model towards a wrong direction.
It would be advisable to remove them or at least try to do so.
Second, I also recognise that I may need to reconsider the simple regression approach due to its limited capability to cover non-linear problems. A “straight line” does not seem to be appropriate to predict the app openings properly. However, I stick to my original plan and use the simple linear regression approach.
Before we tackle the regression problem, I want to remove outliers. They tend to skew parameters like the sample variance or the sample mean. Same applies to the (in my case: linear) parameters of a model.
To get rid of extreme values you can use for example the Interquartile Range (IQR) (https://en.wikipedia.org/wiki/Interquartile_range) to calculate the upper and lower fence. All values that are below or above these fences are considered as outliers and are removed from the data set. The implementation is fairly simple using numpy. First you have to get the first and the third quartile. The IQR is the difference between these two. The lower and upper fence can be calculated as follows:
As you can see in the output above our lower fence is smaller than 0. Therefore, all suspicious values that I highlighted above are indeed valid ones. At least we tried. Let’s carry on.
The Simple Linear Regression Model
I already covered a lot in the section above but I’m not tired of mentioning again that data processing is an important aspect. However, the next part deals with the (in my opinion) most interesting part, the regression model.
We will create a linear regression model by performing supervised learning. Based on the data we have (the “supervisor”) we try to find a hypothesis h (a linear function with parameters 𝛉_0 and 𝛉_1) that maps an input (in this case a specific hour) to an output or target value (amount of app openings) given its weights 𝛉 (theta). The formula can be represented as follows:
with x_0 = 1 (and therefore removed from the formula above).
The goal is to find a hypothesis h that fit our data best. What does “fit the data best” means? It means to find values for 𝛉 of hypothesis h in order to estimate the app openings for a given hour as good as possible. The difference between our estimated value based on h and the real solution y should be as small possible.
In order to measure how “good” we get, we will use a cost function. More precisely, the so called least-squares cost function. It is defined as the following:
What we do here is to calculate the squared difference between the predicted value specified by our hypothesis h(x) and the target value y given a specific input. As a result, we get costs for a specific hypothesis h given its weights 𝛉. Accumulated over the complete data set used for training, we end up with a value that determines how good or bad the hypothesis h performs for the data.
Speaking of training: Before we go on we need to clarify the terms training set and test set. In the field of machine learning it is common to split the complete data set into (often times disjunct) partitions of different size and for different purposes. To learn the weights of a model we use a dedicated training set. It should comprise the biggest chunk of data to obtain as much information of the underlying population as possible. To evaluate our model at the end we use another completely disjunct set of data, the test set.
To recap: our overall goal of is to find a linear model where the costs are as small as possible (we try to minimize J(𝛉)). One approach is to repeatedly change the parameters 𝛉 until we converge to a value of 𝛉 that minimizes J(𝛉). More precisely, we can use the gradient descent approach. I could now subjugate you with the mathematical foundations but we will skip this part and focus directly on the update rule for a single training example. It is represented by
( This rule is also known as the Widrow-Hoff learning rule. See https://blog.oureducation.in/learning-rule/ for more information)
In my opinion the rule is pretty straight forward and it really does not feel like we are doing some gradient descent magic. However, in a few moments you will see that it really works.
Before I present my implementation let us summarize what is described int the Widrow-Hoff learning rule: For each training example i, we update our weights j in regard to the difference between our predicted and the real value multiplied by a learning rate 𝞪 and the current input value. Not that complicated right? So let’s code.
First, I started to implement the cost function. We don’t need it for our gradient descent approach but it could be nice to see how our model improves for each iteration (or also called epoch) through the complete training set. The cost function can look like this:
I also added the part where we use numpy to represent the input, the output and the weights as matrices. The costs for the initial linear function where all weights are zero has a value of ca. 100 000 (which is not very good). Let’s see if we are able to change that.
What you can see above is the implementation of the stochastic gradient descent approach. For each training example we update our weights in regard to the Widrow-Hoff learning rule described above. We do this 10000 times for our training set with a learning rate of 0.00001. Our result converges to a linear function with weights specified by 𝛉_0 = 54.57221825 and 𝛉_1 = 2.67104843. The cost progression is shown in the following:
The plot shows that the costs approach a value of ca. 11000. But it seems that we are stuck at this specific level. Our model cannot improve further. Maybe we did something wrong…. let’s evaluate.
In the previous section I introduced test sets as foundation for model evaluation. So at the beginning we split up the data and could end up with a test set as the following:
It comprises less data than the training set but (I think it) represents the underlying population pretty well. We now can use our model to estimate the app openings given a specific hour. The result is the following:
Mh Ok. I think our model was able to capture the most obvious characteristic of the given data: The amount of app openings increases during the day. There is really nothing more that we are able to say about the current model despite the fact that is is maybe not powerful enough to express the other fluctuations during the day.
Imagine another scenario where you are dealing with a discrete target space: Based on data with attributes of different people (e.g. weight, height, sizes of hips or shoulders) you want to estimate their cloth size (S, M, L etc.) Your data comprises different attribute values for each person and the resulting cloth size. It’s your task to estimate the cloth size of a new person given his/her different body parameters. A classification problem.
You are able to easily evaluate the resulting model of the described scenario by using a test set and some simple metrics. You can count how often the model prediction is correct and could end up with a percentage value (precision and recall, a good description can be found here https://towardsdatascience.com/precision-vs-recall-386cf9f89488) that represents a measure of goodness of your model. However, this is not feasible for a regression problem where you are dealing with a continuous target space. We need another approach.
The first thing that came into my mind was a direct comparison with already existing solutions for building a linear regression model. I decided to use the python library scikit-learn for this.
To create a linear regression model with scikit-learn you need to call the LinearRegression constructor (how fitting) and can train the model using
model.fit. To estimate the target value for a given input, we use
model.predict . Thus, the code is pretty straightforward :
After training I visualised the results for the test set:
The red line represents the predictions made by scikit learn. The green line is based on predictions of our model. Pretty similar right? It seems that we did well. But which of the shown models fits best? All decisions that could be made based on the visualization above are too inaccurate. We need a better evaluation approach.
R-squared statistic to the rescue
A standard goodness-of-fit metric for linear regression models is the R-squared statistic. As stated by Agnes Ogee et. al “it is the percentage of the response variable variation that is explained by a linear model”. In other words, the R-squared statistic represents the proportion of variance covered by our linear model compared to the overall variance in the data. A value of 1 means that everything is covered, 0 indicates that our model is not able to explain any variability.
Fortunately, scikit learn provides the R-squared metric already. The results are shown below.
We can examine two things:
- Our R-squared values are very small in general.
- The R-squared value of our linear model is slightly smaller than the one of the scikit model.
What does this mean? That the approach in general is not sufficient to cover the concept of the target space? That our model fits slightly worse than the one of scikit learn? Yes and Yes.
First of all, we already discussed the fact that a simple linear regression model is not able to come up with a good concept that is able to cover the target space of our use-case. The overall small R-squared value indicates that. More evidence for this can be found if we take a close look at the scatter plots above and the residual error (the difference between the estimated and the target value) below.
If you are able to recognize a pattern within your residual plot that allows you to anticipate different properties of future values given a specific input, than you know that you are
- Missing variables within you model
- Missing a higher-order term of a variable in the model to explain the curvature, or
- Missing interaction between terms already in the model
And we are able to do exactly that. With high probability, we can anticipate that at 5 pm the residual will be negative. That is bad. We should switch to a nonlinear regression model! (Next Time)
Second, the fact that the R-squared statistic of the scikit model is slightly higher than the one of our linear model indicates, that our model is not as good as the one proposed by scikit learn. I don’t know which tweaks and algorithms they use to build up and to train their simple linear regression model but it seems that they might use a different approach. However, the difference between both models is very small and I think that our model can compete pretty well with already existing solutions.
Within this article I described a supervised, gradient descent based learning approach that comes up with a simple linear regression model able to predict app openings given a specific hour during the day. We covered each step for building up this regression model in great detail, from cleaning up data to the mathematical background of simple linear regression models up to evaluation approaches. We found out that our model performs well compared to already existing solutions provided by for example scikit learn. Nevertheless, we also recognized that a simple linear regression approach is not powerful enough to come up with a good concept able to represent our target space.
This article is the first of its kind. It is the first of a series of articles coping with machine learning models and the first medium article for me in general. I hope it helps you to create a bit more understanding of the overall topic. For me it really worked. Looking forward to your feedback :)