Linear Regression in Ruby
A bit of math and a bit of code: machine learning in Ruby
Linear Regression is one of the simplest machine learning algorithms, but it can still perform pretty well. More importantly, it is easy to pick up some of the core concepts of machine learning with it, without being distracted by the more complex algorithms out there.
What to expect?
We will go through the basics of linear regression, implement all the necessary code in Ruby, and use it to predict some things. There will be just enough math to understand what is going on, but the focus will be on the code.
Why?
Chances are, since you are reading this, you are interested in machine learning. For a developer it feels like a very different kind of beast from what we are usually dealing with. At least that was the case for me. What really, really helped me understand some things was writing my own little linear regression algorithm. No, don't worry, that is not a crazy feat. In fact, that is exactly what we are gonna do in a second. My hope is that tackling this topic code first might help some of my fellow Ruby developers get into machine learning a bit more easily.
Linear Regression, a quick overview
Linear Regression is a supervised machine learning algorithm. The predicted output is a continuous value, in contrast to, say, a discrete value.
Supervised: You tell the algorithm the expected results when training it; basically, you already know the desired results.
Continuous value: Usually a value like a price or an environmental measurement, in contrast to discrete values that would enable you to classify something. When your algorithm can, for example, say "there is a car in the picture", it outputs a discrete value. Right, sometimes it is easier to explain something by contrasting it with something that it is not :)
Ok, so the basic flow (and terminology) is as follows: You have some values x, which you will use to predict your y values. More specifically, x is often called your features, while y are your labels. For example, say you are Skynet, sending your terminators against those pesky humans. You have the following features:
number of terminators: 10, number of humans: 1000, number of dogs: 10
Maybe Skynet is interested in the resulting human losses. This would be your label.
[terminators: 10, humans: 1000, dogs: 10] -> human losses
Hypothesis
The part that is doing something with the features so that it can calculate the label is called the hypothesis.
Linear regression is pretty simple, which means that the hypothesis is too. All we do is assign a “weight” to each feature. Each weight is multiplied by its feature, and all of those products are summed up.
The result is the value of your label.
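In other words: predicted label = weight1 * feature1 + weight2 * feature2 + weight3 * feature3.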
Let’s assume we have the following weights: terminator weight: 200, human weight: -0.1, dog weight: -10
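Plugging in our features: 10 * 200 + 1000 * (-0.1) + 10 * (-10) = 2000 - 100 - 100.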
The resulting human losses: 1800
Hmm, that doesn’t sound like our current weights are doing a great job; we can’t have more human losses than there are humans, after all! (Thanks to @mediafinger for pointing that out :) )
Since this is a supervised problem, we will already have a few feature values and known labels in the beginning. In our example, Skynet has maybe recorded 3 battles against humans and dogs, including the resulting human losses (I hope the example is not too gruesome; I started with this example and now I gotta stick with it. On the bright side, there are no dog losses!):
10 terminators, 1000 humans, 10 dogs: 900 human losses
5 terminators, 800 humans, 2 dogs: 600 human losses
12 terminators, 2500 humans, 3 dogs: 1800 human losses
Each of those rows is called an instance.
I know I promised code and we haven’t written a single line yet! Stick with me a bit longer; I would like to introduce matrices to you. Did you know Ruby has matrices?
Anyways, the reason why I am mentioning matrices is that predicting the labels for the whole dataset above can be done with one matrix multiplication. Assuming we use the weights we picked before: 200, -0.1, -10
The result would be [1800, 900, 2120]. We are off by quite a bit for all the battles.
Finally, some code!
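Let's put our three battles into a Matrix and the weights into a Vector, using the matrix library that ships with Ruby (the variable names are my own):

```ruby
require 'matrix'

# One row per battle: [terminators, humans, dogs]
features = Matrix[
  [10, 1000, 10],
  [ 5,  800,  2],
  [12, 2500,  3]
]

# The weights we hand-picked before.
weights = Vector[200, -0.1, -10]

# One matrix multiplication predicts the label for every instance at once.
predictions = features * weights
# => Vector[1800.0, 900.0, 2120.0]
```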
A machine learning algorithm would use these instances to come up with the best values for the weights. “Best” means: the labels predicted with these weights should be as close as possible to the recorded labels. “As close as possible” is where the cost function comes in.
Once we have the best weights, we can use them to predict unknown labels (Skynet might be planning a large scale attack on the humans, sending in 3000 terminators against 19000 humans and 220 dogs, and would like to know its chances beforehand).
Cost Function
Ok, so now we know how to predict values when we have our weights and features. But how do we come up with these weights in the first place? Well, for that we need to measure how well our weights are predicting the labels. The part that measures the performance of our weights is called the cost function in machine learning.
In linear regression we use the mean squared error as the cost function. Scary math incoming!
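It looks like this:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Here, $h_\theta(x^{(i)})$ is the predicted label for instance $i$, and $y^{(i)}$ is the recorded label. (The division by 2 is a common convention that makes the math for gradient descent a bit cleaner.)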
Alright, let's go through this step by step:
theta: this is the vector with all your weights. J(theta) means: The cost when using these weights.
m: number of instances. In our example above that is 3 (rows).
error: predicted label - label
squared: why? Well, for one thing, you get rid of the sign if the error is negative. That simplifies the cost calculation.
Weird mathematical symbol in front of the error: the sum (Σ). We are summing up all the squared errors. In Ruby, that would be a call to reduce, for example.
I hope it will become a bit more understandable once you see the ruby code for the cost function:
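Something like this (a minimal sketch; features is the Matrix from above, weights and labels are Vectors):

```ruby
# Mean squared error: J(theta)
def cost(features, weights, labels)
  m = labels.size
  # The error per instance: predicted label - recorded label.
  errors = (features * weights) - labels
  # Sum up the squared errors (the inner product of the error
  # vector with itself), then divide by 2m.
  errors.inner_product(errors) / (2.0 * m)
end

labels = Vector[900, 600, 1800]
cost(features, Vector[200, -0.1, -10], labels)
# => 167066.66666666666
```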
If we use this cost function to calculate the cost for our initial weights [200, -0.1, -10], we get the following result: 167066.67 (rounded). That is our cost J(theta) for these weights.
Ok… and now what? There is only one thing left: we need to reduce the cost, to get it as close to zero as possible. Which means we need to find the optimal weights for our data.
Normal Equation
This is going to be the last missing part of the linear regression algorithm: the normal equation. It will reduce our cost, coming up with the best weights for our data set. One caveat: the normal equation gets slower and slower the larger your dataset gets. At that point, you would switch over to gradient descent, an iterative approach to figuring out the best weights.
Anyways, back to the normal equation: I will throw in the formula and a Ruby implementation here, but not go into any more detail, since properly explaining it would require quite a bit of math. I would rather recommend you dig into gradient descent and try to figure out how to implement it in Ruby if you want to go a bit deeper. See, I even got some homework for you ;)
(Gradient descent would use the cost function, in case you are wondering where it is actually used.)
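Here is the formula, with X being the matrix that holds all our feature rows and y being the vector of recorded labels:

$$\theta = (X^T X)^{-1} X^T y$$

And a Ruby sketch of it, again reusing the features matrix and labels vector from before:

```ruby
# Normal equation: theta = (X^T X)^-1 X^T y
# Solves for the best weights directly, no iteration needed.
def normal_equation(features, labels)
  x_t = features.transpose
  (x_t * features).inverse * x_t * labels
end

weights = normal_equation(features, labels)
# => Vector[(5/1), (27/40), (35/2)], i.e. [5.0, 0.675, 17.5]
```

Note that with integer inputs, Ruby's Matrix library returns exact Rationals.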
The normal equation will give us the following weights: [5.0, 0.675, 17.5]. Time for a sanity check: go ahead, use these values to predict the labels. I think you will be at least slightly impressed! Now, don’t expect to always find such perfectly matching weights (btw, I just picked some random values when selecting the features and labels. Yes, imagine that, I did not use data from real, apocalyptic battles).
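Or, if you would rather let Ruby do the multiplication:

```ruby
features * weights
# => Vector[(900/1), (600/1), (1800/1)], exactly the recorded human losses
```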
To summarize:
You “train your model”, which means you minimize the cost function by finding the best weights, with gradient descent or the normal equation.
You predict values with your hypothesis function.
You can measure your predictions with your cost function.
Here is the whole code, together with some tests that reflect the things we were checking before:
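(This is a sketch of one way to put the pieces together; the class layout and test names are my own.)

```ruby
require 'matrix'
require 'minitest/autorun'

# A minimal linear regression, built on Ruby's standard Matrix library.
class LinearRegression
  attr_reader :weights

  # features: a Matrix with one row per instance
  # labels:   a Vector with one known label per instance
  def initialize(features, labels)
    @features = features
    @labels   = labels
  end

  # Hypothesis: multiply every instance with the weights, sum the products.
  def predict(weights = @weights)
    @features * weights
  end

  # Cost function: mean squared error, J(theta).
  def cost(weights = @weights)
    m      = @labels.size
    errors = predict(weights) - @labels
    errors.inner_product(errors) / (2.0 * m)
  end

  # Normal equation: theta = (X^T X)^-1 X^T y
  def train
    x_t      = @features.transpose
    @weights = (x_t * @features).inverse * x_t * @labels
    self
  end
end

# The battles we recorded before: [terminators, humans, dogs] -> human losses
class LinearRegressionTest < Minitest::Test
  def setup
    features = Matrix[[10, 1000, 10], [5, 800, 2], [12, 2500, 3]]
    labels   = Vector[900, 600, 1800]
    @model   = LinearRegression.new(features, labels)
  end

  def test_predictions_with_our_hand_picked_weights
    predictions = @model.predict(Vector[200, -0.1, -10])
    assert_in_delta 1800, predictions[0], 0.01
    assert_in_delta 900,  predictions[1], 0.01
    assert_in_delta 2120, predictions[2], 0.01
  end

  def test_cost_of_our_hand_picked_weights
    assert_in_delta 167_066.67, @model.cost(Vector[200, -0.1, -10]), 0.01
  end

  def test_training_finds_the_best_weights
    weights = @model.train.weights
    assert_in_delta 5.0,   weights[0], 0.001
    assert_in_delta 0.675, weights[1], 0.001
    assert_in_delta 17.5,  weights[2], 0.001
  end
end
```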
How can I use this?
Now comes the fun part! Armed with your linear regression super powers, you can go out and hunt for fun datasets. Kaggle has tons of them (and is a great place in general; you can learn so much just by studying the solutions out there), just keep in mind that you want to solve regression problems with this algorithm, not classifications.
Some great websites to find datasets:
https://www.kaggle.com/datasets
A few of the things that I did not mention
I left out quite a few things, to avoid being too overwhelming: gradient descent, linear algebra, plotting data, cleaning up data, overfitting, underfitting, training and test sets, just to name a few. Machine learning is a large and complex field, but one that is very, very enjoyable!
What's next?
Compared to, say, Python or R, Ruby might not be the best language to pursue machine learning. That is not due to the language itself of course, but rather the libraries available. I would personally recommend playing around with Jupyter notebooks; the ability to plot graphs, among a lot of other cool things you can do, makes the whole thing a lot more fun.
Now, that said, there are plenty of machine learning resources even for Ruby:
The one online resource I would like to recommend most strongly, though, is the machine learning course on Coursera. It will give you a great foundation for all things machine learning. If you want to get into machine learning, or even just want a good refresher on some concepts, that is the go-to resource:
And there are some cool data science/ml/ai podcasts:
Shout out to @mediafinger for finding an error in the text!