Machine Learning: Linear Regression
We’ve previously gone over Logistic Regression using Python and Spark. This time we’ll focus on Linear Regression, one of the oldest and most widely used regression models. It’s also one of the simplest algorithms, which is why it’s available in software like Microsoft Excel and Google Sheets.
Linear regression is a predictive model used in supervised learning. For example, you could use it to:
- Calculate how many employees to hire to produce a certain amount of output.
- Look at a data set and see whether two variables are correlated.
In its simplest form, linear regression can be represented by this formula:
Y = B0 + B1 * X
Where “X” is the input variable (independent variable), “Y” is the output variable (dependent variable), “B0” is the intercept, and “B1” (Beta) is the coefficient that multiplies the input variable. In this problem, we only have a single X and a single Y variable. When we add more input variables to the problem, we’re just expanding the formula:
Y = B0 + B1 * X1 + B2 * X2 + … + Bn * Xn
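To make the formula concrete, here is a minimal Python sketch that evaluates it for one observation. The function name `predict` and all of the numbers are assumptions chosen for illustration; they do not come from a fitted model.

```python
def predict(intercept, coefficients, inputs):
    """Return B0 + B1*X1 + ... + Bn*Xn for a single observation."""
    if len(coefficients) != len(inputs):
        raise ValueError("need exactly one coefficient per input variable")
    return intercept + sum(b * x for b, x in zip(coefficients, inputs))


# Hypothetical values, picked only to keep the arithmetic easy to follow.
simple = predict(1.0, [2.0], [3.0])               # one input:  1 + 2*3
multiple = predict(1.0, [2.0, 0.5], [3.0, 4.0])   # two inputs: 1 + 2*3 + 0.5*4
print(simple, multiple)
```

With a single input this is the simple formula; with more inputs the same function just sums more terms.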
Let’s take a deeper dive and review some basic algebra. Fortunately, linear regression is one of the simplest predictive models, so the math is quite simple too.
A line describes a linear relationship between two variables. It is written as y = f(x) = mx + b. m is the slope of the line, meaning how steep it is; it is also called the coefficient. b is where the line crosses the y axis, called the y-intercept. In the example below, the intercept is the point (x, y) = (0, b). The other data points we have are o1, o2, o3, and o4.
The goal of linear regression is to find the line that best fits this set of data points o1, o2, o3, o4. Technically speaking, that is the line that minimizes the total distance from the line to these points, usually measured as the sum of the squared vertical distances (least squares). That line becomes our best estimate for predicting the value of y given some value of x. In such a model, x is called the independent variable and y is the dependent variable.
Like any other statistical model, we have to give some thought to calculating the error of the model to see whether we should use it or throw it away. A large error would suggest that our assumption of a linear relationship between the two variables is false.
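One common way to make "distance" precise is the sum of the squared vertical distances between the line and the points, which is what ordinary least squares minimizes. The sketch below scores two candidate lines that way. The four points are hypothetical stand-ins for o1 through o4, and the function name is our own.

```python
def sum_squared_error(m, b, xs, ys):
    """Sum of squared vertical distances between y = m*x + b and the points."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical stand-ins for the four points o1..o4.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 6.0]

# Two candidate lines: the least-squares line for these points, and a rough guess.
sse_best = sum_squared_error(1.4, 0.5, xs, ys)
sse_guess = sum_squared_error(1.0, 1.0, xs, ys)
print(sse_best, sse_guess)
```

The line with the smaller score fits these points better. Least-squares regression finds the slope and intercept with the smallest possible score.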
If we add another independent variable to the model, then we have multiple regression. With two input variables we can still visualize the model in three dimensions by plotting x, y, and z, as shown below. If we add a third input variable, we can no longer graph it, since we cannot see four dimensions.
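For readers who want to see multiple regression without a framework, here is a minimal sketch that fits y = b0 + b1*x1 + b2*x2 by solving the normal equations with plain Python lists. The helper names `solve` and `fit_plane` are our own, and the data is hypothetical, generated exactly from y = 1 + 2*x1 + 3*x2 so the result is easy to check. In practice a library such as scikit-learn or Spark ML would do this step.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augment the matrix so row operations update both sides together.
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("inputs are collinear; no unique fit")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def fit_plane(rows, ys):
    """Least-squares fit of y = b0 + b1*x1 + ... via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 carries the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data lying exactly on y = 1 + 2*x1 + 3*x2.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys = [1.0, 3.0, 4.0, 6.0, 8.0]
coefficients = fit_plane(rows, ys)
print(coefficients)  # approximately [1.0, 2.0, 3.0], up to floating-point rounding
```

The same code handles a third or fourth input variable unchanged, which is the point: we lose the picture, not the arithmetic.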
To illustrate, suppose we are trying to figure out how many people should work in our crafts shop making woven baskets. We assume that more basket weavers will produce more baskets. But how many more? How are the two numbers related? Is there some positive correlation?
We have this sample data (shown below) that we can put into Google Sheets. When we have only one input variable, Sheets and Excel can do our linear regression. If we wanted to get fancier, we could use a machine learning framework, like Spark ML or scikit-learn.
So the goal is to find m and b in y = f(x) = mx + b, where x is the number of hours worked, m is the coefficient (slope), b is the intercept, and y is the number of baskets made.
In Google Sheets we can do that using the SLOPE(y-range, x-range) and INTERCEPT(y-range, x-range) functions, both of which take the y values first. We have slope = 0.5 and intercept = -0.2. So our model is:
y = 0.5x + (-0.2) = 0.5x - 0.2
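The slope and intercept calculations can be sketched in Python using the standard closed-form least-squares formulas. The function name `fit_line` is our own, and the six data points are hypothetical: they are not the sample data from this article, so they will not reproduce 0.5 and -0.2.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (hours worked, baskets made) pairs, NOT the article's data.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
baskets = [0.0, 1.0, 1.0, 2.0, 2.0, 3.0]
slope, intercept = fit_line(hours, baskets)
print(slope, intercept)
```

For a single input variable this should agree with what a spreadsheet reports for the same two columns.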
We then calculate the predicted value for each data point and take the absolute value of the difference between it and the actual value (so that it is always positive) to give the error for each data point. We could then calculate the error of our model by summing all of those errors and dividing by the sample size, which is six in this example.
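This error measure is commonly called the mean absolute error. Here is a minimal sketch using the model y = 0.5x - 0.2 from above. The six observations are hypothetical placeholders, since the original sample data is not reproduced here, and the function names are our own.

```python
def predict_baskets(hours):
    """The fitted model from the text: y = 0.5x - 0.2."""
    return 0.5 * hours - 0.2


def mean_absolute_error(xs, ys):
    """Average of |actual - predicted| over all data points."""
    errors = [abs(y - predict_baskets(x)) for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Hypothetical observations (not the article's data): six (hours, baskets) pairs.
hours = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
baskets = [1.0, 2.0, 2.0, 4.0, 5.0, 6.0]
mae = mean_absolute_error(hours, baskets)
print(mae)  # about 0.3 baskets for these made-up points
```

Taking the absolute value keeps errors above and below the line from cancelling each other out.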
If you are following along, you might wonder why the intercept is not zero. After all, if you work zero hours you would think you would make zero baskets. But remember that mx + b is just the line that best fits the observed data; nothing forces it through the point (0,0). It would cross (0,0) only if the data called for an intercept of zero, for example if every point lay exactly on y = 2x + 0.
Training Set and Testing Set
In actual machine learning we would feed these values, called the training set, into an LR algorithm. That would then give us an array of coefficients (it’s an array because we would most likely be dealing with many input variables, which is why we would use machine learning and not Google Sheets). Then we feed in data the model has not seen, let the model make its predictions, and compare them with the actual values. That held-out data is called the testing set.
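Here is a minimal sketch of that workflow in plain Python: hold some data back, fit on the rest, then measure the error on the held-back part. The helper names, the 70/30 split, the seed, and the data (generated exactly from y = 3x + 1) are all assumptions for illustration. A framework such as scikit-learn or Spark ML provides its own utilities for this.

```python
import random


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def train_test_split(pairs, test_fraction, seed):
    """Shuffle a copy of the data and hold back a fraction of it for testing."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data lying exactly on y = 3x + 1.
pairs = [(float(x), 3.0 * x + 1.0) for x in range(10)]
train, test = train_test_split(pairs, test_fraction=0.3, seed=0)

# Fit on the training set only.
m, b = fit_line([x for x, _ in train], [y for _, y in train])

# Evaluate on the testing set: mean absolute error of the predictions.
test_error = sum(abs(y - (m * x + b)) for x, y in test) / len(test)
print(len(train), len(test), m, b, test_error)
```

Because this made-up data is perfectly linear, the test error is essentially zero. With real data it tells us how well the model generalizes to points it was not fitted on.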
Kirill Fuchs is a passionate developer at Fuzz Productions in Brooklyn, NY. He builds APIs and data-driven applications for clients such as CBS and Anheuser-Busch. Fuzz is a New York-based mobile app development company that specializes in designing and developing iOS, Android, and data-driven applications.