Linear Regression With Go
For a long time now, I have been interested in machine learning. It amazes me how a machine can learn and make predictions without anyone explicitly programming it to do so. Mind-blowing!
Truth be told, even though I have always been fascinated by machine learning, I have never looked deeper than the surface. Time is a scarce resource, and each time I tried, I got overwhelmed by the amount of available information. It would have been difficult and time-consuming to go through all of it. Also, I convinced myself that I did not have the required maths background to even get started.
But I have finally decided to take another look at machine learning with a different approach. Little by little, I will attempt to program my way through different concepts, starting from the basics and moving to more advanced topics, trying to grasp the foundations as well as possible. I will use Go as my language of choice, since it is one of my favorites and I am not familiar with the languages typically used for ML, like R or Python.
It is time to get started.
We are going to start by building a simple ML model from an example, so we can get a grasp of the steps you would follow for any other case.
Suppose our goal is to predict house prices in King County, located in the U.S. state of Washington.
To be able to do this, we need a dataset with historical information. We will build our model based on it.
The dataset comes as a csv file with the following structure:
Inside the file you will find the complete dataset. As you can see, each row holds a lot of information, so we will need to figure out which columns are helpful for predicting house prices.
We will go through several steps to train the model for this example:
1. Choosing a model
To recap, our goal is to predict house prices.
For our example, we are going to use one of the simplest well-known models: Linear Regression.
Linear Regression is the basis for a lot of other models, and it is also a good model to start with when analysing new data. If the resulting model has good enough predictive power, we do not need to move to a more complex one.
Let’s take a closer look at a Linear Regression:
The graph above represents a relationship between two different variables.
The vertical y-axis represents the dependent variable, in our case house prices. The horizontal x-axis represents the so-called independent variable, which in our case could be any of the other columns in the dataset, e.g. bedrooms, bathrooms, sqft_living.
The circles on the graph are y values for a given x value (y depends on x). The red line is the linear regression. It is a line that goes through the values and can be used to predict what value y is likely to take for a given x.
Our goal here is to train a model to find out what this red line looks like for our case.
As a refresher, in maths, a line function looks like y = ax + b. This function is the one we want to find. To be precise, we want to find the a and b values that best fit our data.
If you are lost, don’t worry, all of this will become clearer once we start applying it.
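To make the line function concrete, here is a minimal sketch in Go. The coefficients below are made up purely for illustration; they are not fitted to the housing data.

```go
package main

import "fmt"

// line returns y for a given x on the line y = a*x + b.
func line(a, b, x float64) float64 {
	return a*x + b
}

func main() {
	// Hypothetical coefficients, chosen only for illustration.
	a, b := 2.0, 1.0
	for _, x := range []float64{0, 1, 2} {
		fmt.Printf("x=%.0f y=%.0f\n", x, line(a, b, x))
	}
}
```

Training the model boils down to finding the a and b that make this function track the data as closely as possible.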
2. Understanding our data
So, we know what we want the model to predict and we also know the approach we are going to use. The only part that is missing is to analyze our data to see if it fits our use case.
Let’s plot our data to see how it is distributed. To do this, we are finally going to write some code!
Here are some useful packages we are going to be using:
- encoding/csv from the standard library, will help us to load the dataset and parse its contents
- github.com/gonum/plot will help us to draw plots
This snippet will open our CSV, and plot histograms for all the columns except the ID and the Date. This way we can choose which data column we want to use to train our model:
The code above generates a set of different graphs. From these graphs we need to find the one closest to a normal distribution (the best bell shape).
Here is the one I picked from the ones generated by the code above (if you want to see all of the histograms, you can find them here):
This is the graph for the Grade column of our dataset. It does not perfectly follow a normal distribution, but for now, this is close enough. In our particular case, if you check all the histograms, there are several other columns that could also be good choices. If the resulting model ends up behaving poorly, we could reevaluate our choice and use another column.
Just for reference, the Grade describes the quality level of the building. You can find more information about the Grade levels here under the BUILDING GRADE section.
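As a rough illustration of the histogram step, the sketch below parses one CSV column with encoding/csv and counts values per bin, printing a text histogram instead of drawing a plot. It is a stand-in using only the standard library, not the plotting code from this post; the sample rows, the column names and the helper names column and histogram are assumptions made for the example.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// sample is made-up data standing in for the real dataset.
const sample = `id,price,grade
1,221900,7
2,538000,7
3,180000,6
4,604000,8
5,510000,9
`

// column parses the named column of a CSV document into float64 values.
// Rows whose value does not parse as a number are skipped.
func column(data, name string) ([]float64, error) {
	records, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, fmt.Errorf("empty CSV")
	}
	idx := -1
	for i, h := range records[0] {
		if h == name {
			idx = i
		}
	}
	if idx < 0 {
		return nil, fmt.Errorf("column %q not found", name)
	}
	var values []float64
	for _, rec := range records[1:] {
		v, err := strconv.ParseFloat(rec[idx], 64)
		if err != nil {
			continue
		}
		values = append(values, v)
	}
	return values, nil
}

// histogram counts how many values fall into each of n equal-width bins
// spanning the minimum and maximum of the data. n must be positive.
func histogram(values []float64, n int) []int {
	counts := make([]int, n)
	if len(values) == 0 {
		return counts
	}
	lo, hi := values[0], values[0]
	for _, v := range values {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	width := (hi - lo) / float64(n)
	for _, v := range values {
		i := 0
		if width > 0 {
			i = int((v - lo) / width)
			if i >= n {
				i = n - 1 // the maximum value belongs to the last bin
			}
		}
		counts[i]++
	}
	return counts
}

func main() {
	grades, err := column(sample, "grade")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, c := range histogram(grades, 4) {
		fmt.Printf("bin %d: %s\n", i, strings.Repeat("#", c))
	}
}
```

The same bin counts are what a plotted histogram shows as bar heights, so a bell shape would appear here as longer rows of # in the middle bins.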
3. Prepare the data
So, to train our model we will use the dataset we downloaded. But, how do we know our model is actually accurate enough?
To answer this question we will need to test our model, and to do this we will use the same dataset. To be able to train and test our model with the same dataset we are going to split the dataset into two subsets, one for training and another for testing. This is a common practice.
We are going to use 80% of our data for training, and 20% for testing. There are other common split ratios, but in the end, picking the right split is more a matter of trial and error.
The aim is to find a balance between having enough data to train the model properly and having enough to test it with, to be sure we do not overfit it.
Let’s get to it:
The code above will generate 2 files:
- training.csv holds the records we are going to use to train the model
- testing.csv holds the records for testing
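A split like this could be sketched as follows. This is a minimal in-memory version under a few assumptions: the function name split is hypothetical, the rows are shuffled with a fixed seed before cutting (the post does not say whether the original code shuffles), and the records are made up. Writing each half to its file would be a matter of passing it to a csv.Writer from encoding/csv.

```go
package main

import (
	"fmt"
	"math/rand"
)

// split shuffles a copy of records using the given seed and returns the
// first ratio share as the training set and the rest as the testing set.
func split(records [][]string, ratio float64, seed int64) (training, testing [][]string) {
	shuffled := make([][]string, len(records))
	copy(shuffled, records)
	rng := rand.New(rand.NewSource(seed))
	rng.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	cut := int(float64(len(shuffled)) * ratio)
	return shuffled[:cut], shuffled[cut:]
}

func main() {
	// Ten made-up records standing in for the data rows of the CSV.
	var records [][]string
	for i := 0; i < 10; i++ {
		records = append(records, []string{fmt.Sprint(i)})
	}
	training, testing := split(records, 0.8, 42)
	fmt.Println("training rows:", len(training))
	fmt.Println("testing rows:", len(testing))
}
```

Shuffling first guards against any ordering in the file (by date, for example) leaking into the split, and the fixed seed keeps the result reproducible.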
4. Train the model
After all of this work, we are finally ready to train the actual model. There are several packages that we could use to do this. We are going to use github.com/sajari/regression, which implements all we need.
Of course, we could write everything from scratch, but I will keep things simple for now and will cover how we can do this ourselves in future posts.
First, we are going to load records from the training.csv file, iterate over them and put the Grade column’s data into the regression. Then, we will train the regression to get our line function.
Here is the snippet of code that does all of this for us:
After executing this, we obtain the formula:
Predicted = -1065201.67 + Grade*209786.29
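To show what this training step computes, here is a from-scratch sketch of a least-squares fit for a single variable. It is not the github.com/sajari/regression package used in this post; the function name fitLine and the data points are assumptions chosen so the result is easy to check by hand.

```go
package main

import (
	"errors"
	"fmt"
)

// fitLine returns the slope a and intercept b of the least-squares line
// y = a*x + b through the points (xs[i], ys[i]).
func fitLine(xs, ys []float64) (a, b float64, err error) {
	if len(xs) != len(ys) || len(xs) < 2 {
		return 0, 0, errors.New("need at least two points and matching lengths")
	}
	n := float64(len(xs))
	var sumX, sumY float64
	for i := range xs {
		sumX += xs[i]
		sumY += ys[i]
	}
	meanX, meanY := sumX/n, sumY/n

	// Slope = covariance(x, y) / variance(x).
	var sxy, sxx float64
	for i := range xs {
		dx := xs[i] - meanX
		sxy += dx * (ys[i] - meanY)
		sxx += dx * dx
	}
	if sxx == 0 {
		return 0, 0, errors.New("all x values are identical")
	}
	a = sxy / sxx
	b = meanY - a*meanX
	return a, b, nil
}

func main() {
	// Made-up points lying exactly on y = 100x + 50.
	grades := []float64{5, 6, 7, 8}
	prices := []float64{550, 650, 750, 850}
	a, b, err := fitLine(grades, prices)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("Predicted = %.2f + Grade*%.2f\n", b, a)
}
```

On real data the points will not sit on a line, and the fit returns the a and b that minimise the squared vertical distances to it.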
5. Test the model
The formula we generated above should, in theory, be able to predict sale prices based on the Grade.
To check how good this formula is, we need to test it. To do this, we will use testing.csv, the file we created earlier.
Even though we have our testing data, we still need some output that tells us if the formula is good or not. We are going to use the R-squared value for that.
The R-squared value tells us what proportion of the variation in the dependent variable is explained by the independent variable. In our case, it tells us how much of the variation in house prices can be explained by the data in the Grade column.
The R-squared value will usually be between 0 and 1 (higher is better).
Here is the code that will generate the R-squared value for us:
Notice that the predict function in the above code snippet is simply the line equation we were able to generate using the training dataset.
Here is the R-squared value generated by executing this code.
R-squared = 0.46
As you can see, it is not ideal.
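For reference, an R-squared calculation can be sketched as below, using the common definition of 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name rSquared and the test rows are assumptions for illustration; the predict function uses the formula obtained above.

```go
package main

import "fmt"

// rSquared returns 1 - SSres/SStot for the given observations, where
// predict maps an x value to the predicted y value.
func rSquared(xs, ys []float64, predict func(float64) float64) float64 {
	var sumY float64
	for _, y := range ys {
		sumY += y
	}
	mean := sumY / float64(len(ys))

	var ssRes, ssTot float64
	for i := range ys {
		res := ys[i] - predict(xs[i])
		ssRes += res * res
		dev := ys[i] - mean
		ssTot += dev * dev
	}
	return 1 - ssRes/ssTot
}

func main() {
	// The line equation from the training step.
	predict := func(grade float64) float64 {
		return -1065201.67 + grade*209786.29
	}
	// Made-up testing rows, not taken from the real testing.csv.
	grades := []float64{6, 7, 8, 9}
	prices := []float64{250000, 400000, 600000, 850000}
	fmt.Printf("R-squared = %.2f\n", rSquared(grades, prices, predict))
}
```

A model that predicts every point exactly scores 1, and one that does no better than always guessing the mean price scores 0.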
Let’s try to visualize what the regression looks like and then try to improve it.
6. Visualising the model
To visualise the regression model we need to write a bit more code:
At last, we have obtained our first regression:
If we look at the plot, we can see that the blue dots would fit a curve better than a straight line.
To try to get a better fit, we can replace our current linear formula with a parabola. A parabola formula looks like y = ax + bx^2 + c.
If we still use the Grade column to represent the x variable, the new formula will look something like this:
price = a * grade + b * grade^2 + c
Let’s update the code for our training function to look like this:
This gives us a new formula:
Predicted = 1639674.31 + Grade*-473161.41 + Grade2*42070.46
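To illustrate what fitting a parabola involves, the sketch below treats grade and grade^2 as two input variables and solves the least-squares normal equations directly. It is a from-scratch stand-in rather than the regression package used in this post; the function name fitParabola and the sample points are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// fitParabola fits y = c + a*x + b*x^2 by least squares and returns the
// coefficients in the order [c, a, b].
func fitParabola(xs, ys []float64) (coef [3]float64, err error) {
	if len(xs) != len(ys) || len(xs) < 3 {
		return coef, errors.New("need at least three points and matching lengths")
	}
	// Build the augmented normal-equation matrix for the features 1, x, x^2.
	var m [3][4]float64
	for i := range xs {
		f := [3]float64{1, xs[i], xs[i] * xs[i]}
		for r := 0; r < 3; r++ {
			for c := 0; c < 3; c++ {
				m[r][c] += f[r] * f[c]
			}
			m[r][3] += f[r] * ys[i]
		}
	}
	// Gauss-Jordan elimination with partial pivoting.
	for col := 0; col < 3; col++ {
		pivot := col
		for r := col + 1; r < 3; r++ {
			if math.Abs(m[r][col]) > math.Abs(m[pivot][col]) {
				pivot = r
			}
		}
		// Crude singularity check, e.g. fewer than three distinct x values.
		if math.Abs(m[pivot][col]) < 1e-12 {
			return coef, errors.New("normal equations are singular")
		}
		m[col], m[pivot] = m[pivot], m[col]
		for r := 0; r < 3; r++ {
			if r == col {
				continue
			}
			factor := m[r][col] / m[col][col]
			for c := col; c < 4; c++ {
				m[r][c] -= factor * m[col][c]
			}
		}
	}
	for i := 0; i < 3; i++ {
		coef[i] = m[i][3] / m[i][i]
	}
	return coef, nil
}

func main() {
	// Made-up points lying exactly on y = 3 + 2x + x^2.
	xs := []float64{1, 2, 3, 4, 5}
	ys := []float64{6, 11, 18, 27, 38}
	coef, err := fitParabola(xs, ys)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("Predicted = %.2f + Grade*%.2f + Grade2*%.2f\n", coef[0], coef[1], coef[2])
}
```

The model is still linear in its coefficients, which is why a linear regression can fit it: only the inputs have changed, not the method.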
We can now recalculate the R-squared value after we have updated our predict function. This is the new value we get:
R-squared = 0.52
As we can see, it improved a bit! Remember, we want to get the value as close to 1 as possible.
Now we have to update our code for plotting the regression again:
Here is our updated regression graph:
As you can see, this is now a bit better.
We can make a number of further improvements to get the R-squared value even closer to 1, but I will let you try implementing them in your own time. You can try to improve the model by changing or combining different variables, for example.
Assuming we are happy enough with our prediction formula, we can now start using it. Imagine we want to know how much we could sell a Grade 3 house for in King County. When we plug our Grade value into the prediction formula, we get this price:
I hope this was a useful introduction to the Linear Regression model and to the Go packages you could use when starting out with ML.
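Plugging a Grade into the formula can be sketched in a few lines of Go. The coefficients are the ones from the parabola formula above; the function name predict follows the text, and its exact shape here is an assumption.

```go
package main

import "fmt"

// predict applies the parabola formula obtained from training:
// Predicted = 1639674.31 + Grade*-473161.41 + Grade2*42070.46
func predict(grade float64) float64 {
	return 1639674.31 + grade*-473161.41 + grade*grade*42070.46
}

func main() {
	// Works out to roughly 598824.22 for Grade 3.
	fmt.Printf("Predicted price for Grade 3: %.2f\n", predict(3))
}
```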
You can find the code related to this post in this repository
If you would like to share some feedback or suggestions please leave a comment below. I would also love to see how you are using ML in your projects :)
And, if you have ideas of what you would like to see covered in future posts regarding ML or any other topic, all suggestions are more than welcome.