Machine Learning Taught Me High School Math—All Over Again
or, Regression for Amateurs (and Humanities Majors)
“It’s just a line.”
There’s a chance I paraphrased that a bit, but it’s something my instructor quipped to my cohort as we started the dreaded “statistical analysis” module of my (ongoing) data science boot camp. Was it a deliberately reductive non-sequitur about the mathematical underpinnings of linear regression, intended as a callback to our first experiences with one-variable equations in middle-school algebra? Sure, but it was also a kind of reassurance: that these concepts, stripped down to their skeletons, are not beyond the scope of my understanding, even as someone whose last formal experience with mathematics was an online stats course I took as a prerequisite for my MSW program.
For some context, I mentioned in a previous blog post that my entry into data science has felt like a hard turn from my background in history and social work, and the shift has rarely felt so stark as when we dove headlong into probabilities. The first (three-week) module of this particular course taught us the tools of the trade: basic Python, data manipulation and analysis with pandas, and plotting data with matplotlib. Some of my classmates had cursory experience with Python or other programming languages, but I felt that we were largely on the same page.
Then “Phase 2” reared its head, the formulae came out to play, and I started panicking. In the academic environs where I’d cut my teeth, “ML” stood for Marxism-Leninism, not Machine Learning. Now, I felt out of my depth, in more ways than one. Our second group project rolled around, and I was considering withdrawing from the course, wholly convinced that I wasn’t going to make it.
Now, before I write anything further, I’d like to disclaim that I am, without question, one of the “amateurs” referenced in the subtitle of this blog post. I do not have a degree in math or statistics, and I have no background in computer science whatsoever. I’ve been studying data science for a meager two months, and machine learning for even less time than that. I have a fraction of the expertise that’s probably necessary to write a post like this with any degree of confidence — possibly less relevant experience than you do, if you’re reading this post — but I’d also like to believe that this lack of technical expertise puts me in a position to articulate, in lay-reader’s terms, one of the most bare-bones examples of machine learning. In turn, I hope this blog post will calm the nerves of readers who might be fascinated with data and interested in a career transition, but worried about the potential barriers to entering a field like data science.
In the process of trying to “dumb down” this material, the probability that I’ll write something that’s flat-out wrong is, to put it charitably, non-negligible. If that happens, feel free to send me an email and call me a hack. That said, let’s turn back the clock.
Feature engineering… deep learning… gradient boosting… forget all the jargon, if only momentarily, and think back to Algebra I, or even Pre-Algebra, if that was how your school did things. It started with linear equations — if you’re anything like me, the simple formula y = mx + b is wired somewhere in your neuromuscular system, unlikely ever to fade. You plug in a number for x and get an answer y in return. The result is dependent, of course, on what you input as your independent variable — that is, x — but it also depends on the values of m and b, which represent the slope and the intercept of the line, respectively. Among the most rudimentary methods of visualizing a linear function is a table of values like this (crude) one:
# Sample of x and y values for
# the function y = 3x - 3
| x | y |
| 0 | -3 | = (0, -3)
| 1 | 0 | = (1, 0)
| 2 | 3 | = (2, 3)
| 3 | 6 | = (3, 6)
| 4 | 9 | = (4, 9)
Python can handle simple arithmetic operations on its own, and the addition of libraries like NumPy and matplotlib enables us to write out and plot a simple algebraic function like y = 3x - 3 with relative ease:
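Here is a minimal sketch of how x and y might be defined with NumPy. The range of x values (0 through 4, to match the table above) is an assumption; the full notebook may well use a different range.

```python
import numpy as np

# Assumption: x runs from 0 to 4, mirroring the table of values above
x = np.arange(5)

# NumPy applies the arithmetic to every element of x at once
y = 3 * x - 3

# Pair each x with its y, just like the rows of the table
print(list(zip(x.tolist(), y.tolist())))
# → [(0, -3), (1, 0), (2, 3), (3, 6), (4, 9)]
```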
Then, to get a visual representation of the relationship between x and y, we can type something like…
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(x, y)
(note: the above code is truncated for brevity; reproducing the figure on display below will require additional argument inputs, which can be found in the full notebook for this blog post)
…which gives us a figure that looks like this:
A little more glamorous than a TI-84 printout, no?
That’s a sleek line, sure, but therein lies the problem: how many practical situations can you think of that can be modeled with something as unsophisticated as a single-variable algebraic equation? When inspecting data — even cleaned, “toy” data used for instructional purposes — relationships between variables will never be this perfectly linear. In practice, we’ll be trying to determine the line of best fit — that is, how can we best express y as a function of x? But even that calculation, when fully deconstructed, isn’t any math you haven’t seen before; it’s just a little more tedious. That’s where machine learning kicks in.
Columns are just variables.
In its broadest definition, machine learning is the process by which algorithms learn from training data (i.e. existing data) in order to make predictions. The subject of this blog post is simple linear regression, a straightforward and easy-to-understand form of supervised learning (hyperlinks provided here and there so you can further explore subjects I don’t really address in any substantial fashion). That might read like precise, technical language — and it is, to an extent — but it’s also not impossible to wrap your head around if you have a baseline understanding of algebraic relationships. Let’s dive into some sample data to tease out what I mean.
For demonstrative purposes, I’ll be working with the iris dataset that comes pre-loaded with seaborn. It has a digestible number of rows (150), and four of its five columns (petal length, petal width, sepal length, and sepal width) list continuous, numerical data, which makes it extraordinarily convenient for this kind of example. Even better, the relationships between several of its numerical features (i.e. columns) are roughly linear in nature, meaning that a change in one feature tends to come with a proportional change in another feature.
I’ll go out on a limb and assume that if you’ve navigated to this webpage, you have some kind of programming environment — I used a Jupyter Notebook — that allows you to write and run Python code, and that you’re somewhat familiar with (and have installed!) popular libraries for data science like pandas and Seaborn.
Loading in the data is fairly straightforward:
# Standard aliases for package imports
import pandas as pd
import seaborn as sns

# Load in toy data, assign to variable `df`
df = sns.load_dataset('iris')
The iris dataset has five columns: four that describe the width and length of the flower’s petals and sepals, and a fifth that classifies the flower into one of three iris species. We’ll only be using two columns from the dataset for this simple example: petal_length and sepal_length. If we imagine that petal_length is our x (independent variable) and sepal_length our y (dependent), we can plot the values of those respective columns like so:
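A rough sketch of one way to draw that scatter plot is below. One loud assumption: to keep the snippet runnable offline, it loads scikit-learn's bundled copy of iris and renames the columns to match seaborn's names. If `df` is already loaded with seaborn as shown earlier, the loading lines can be skipped.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris stands in for sns.load_dataset('iris');
# columns are renamed to match the seaborn version used in this post
df = load_iris(as_frame=True).frame.rename(columns={
    "petal length (cm)": "petal_length",
    "sepal length (cm)": "sepal_length",
})

# One dot per flower: petal length on the x-axis, sepal length on the y-axis
fig, ax = plt.subplots()
ax.scatter(df["petal_length"], df["sepal_length"])
ax.set_xlabel("petal_length")
ax.set_ylabel("sepal_length")
```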
Looking at that spread of points, we can conclude pretty safely there does exist some kind of linear relationship between these two variables; as we move along the x-axis and petal length increases, sepal length tends to increase on the y-axis as well. This linear relationship indicates that simple linear regression might be a good fit for a situation like this one. We could calculate this regression by hand, but it’s a little exhausting, especially when scikit-learn contains as many tools as it does for expediting the process.
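For the curious, here is a sketch of what "by hand" might look like, using the standard least-squares formulas: the slope is the sum of (x - mean of x) times (y - mean of y), divided by the sum of (x - mean of x) squared, and the intercept is whatever makes the line pass through the point (mean of x, mean of y). The function name `fit_line` is made up for illustration, and the sample points are simply the table of values from earlier.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n

    # How x and y vary together, relative to how much x varies on its own
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)

    slope = numerator / denominator
    # The best-fit line always passes through (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Points that sit exactly on y = 3x - 3 should hand back m = 3 and b = -3
m, b = fit_line([0, 1, 2, 3, 4], [-3, 0, 3, 6, 9])
print(m, b)
# → 3.0 -3.0
```

It really is just means, subtraction, multiplication, and division; the tedium comes from repeating those steps for every row.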
No crunch necessary.
In some instances, data analysis software does a little too good of a job. seaborn’s regplot (shorthand for regression plot) function allows us to visualize the line of best fit in a “quick and dirty” fashion, with everything happening under the hood, but it doesn’t hand back any information about the line itself (its slope and intercept), or about the line’s relationship to the values!
Thankfully, using scikit-learn’s LinearRegression class is simple:
- Import the relevant class, LinearRegression, from sklearn.linear_model (by convention, class names are written in ThisFashion; this is also known as “CamelCase”).
- Instantiate the object using parentheses (i.e. LinearRegression()) and assign it to a variable. I like to use something short but self-explanatory, like lr.
- We can now access methods associated with the LinearRegression class with dot notation. The .fit method takes in at least two arguments, X and y, and trains the model lr on that data. In other words, the model we created in the second step learns from the data stored in X and y.
The block of code below illustrates how to execute those steps in Python. I did this all in a Jupyter Notebook, which you can inspect here.
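What follows is a hedged reconstruction of those three steps, not necessarily the exact code from the notebook. As an assumption for offline use, it pulls iris from scikit-learn's bundled datasets and renames the columns to match seaborn's; the variable names `X`, `y`, `slope`, and `intercept` are my own choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression  # Step 1: import the class

# Assumption: scikit-learn's bundled iris stands in for sns.load_dataset('iris');
# columns are renamed to match the seaborn version used in this post
df = load_iris(as_frame=True).frame.rename(columns={
    "petal length (cm)": "petal_length",
    "sepal length (cm)": "sepal_length",
})

# scikit-learn expects X to be two-dimensional (rows x features),
# hence the double brackets; y can stay one-dimensional
X = df[["petal_length"]]
y = df["sepal_length"]

lr = LinearRegression()  # Step 2: instantiate the model
lr.fit(X, y)             # Step 3: train it on the data

# The fitted line's m and b live on the model as attributes
slope = lr.coef_[0]
intercept = lr.intercept_
print(f"sepal_length ≈ {slope:.2f} * petal_length + {intercept:.2f}")
```

Unlike the quick plot, the fitted object exposes the slope and intercept directly, so we are back to a plain old y = mx + b.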
Now that the model, lr, has been fitted on the data, we can do loads of other stuff with that object. The LinearRegression class has three different functions:
- Estimator: it can use a .fit() method to learn from data (already done above!)
- Predictor: it can use a .predict() method to make predictions based on what it learned while fitting
- Model: it can use a .score() method to evaluate its predictions
…and it’s just one of numerous algorithms that scikit-learn offers, each of which has its own properties and hyperparameters. Calling .score() on the lr object and passing in your desired x and y data returns what’s called an r² score, a floating point number that typically ranges from 0.0 (x does not explain any of the variance in y) to 1.0 (x explains 100% of the variance in y); it can even dip below zero when a model fits worse than simply guessing the average of y every time. Meanwhile, .predict() returns predictions made based on the knowledge the model developed in the training process.
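A short sketch of both methods in action follows. The same assumption applies as before (iris loaded from scikit-learn's bundled copy, columns renamed to match seaborn), and the three petal lengths passed to .predict() are hypothetical values picked purely for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

# Assumption: scikit-learn's bundled iris stands in for sns.load_dataset('iris')
df = load_iris(as_frame=True).frame.rename(columns={
    "petal length (cm)": "petal_length",
    "sepal length (cm)": "sepal_length",
})
X = df[["petal_length"]]
y = df["sepal_length"]
lr = LinearRegression().fit(X, y)

# r² on the same data the model was trained on
r_squared = lr.score(X, y)
print(f"r² = {r_squared:.2f}")

# Hypothetical petal lengths (in cm) for flowers the model has never seen
new_flowers = pd.DataFrame({"petal_length": [1.5, 4.0, 6.0]})
predictions = lr.predict(new_flowers)
print(predictions)
```

Note that scoring a model on its own training data flatters it a little; the score is still handy here as a gauge of how tightly the points hug the line.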
If we reframed our problem and examined the dataset a different way, we could use measurements such as sepal_length as the (loosely) independent predictor variables we use to guess the species of an iris; species, in this case, would be our target. One popular classification algorithm is the
Like I mentioned up top, this material can get awfully complicated, but it helps me ground myself when I recall that these formulas, no matter how verbose, are ultimately built on mathematical operations and concepts that have been swimming in my head for years.