Machine Learning Isn’t Data Science
Too often, Machine Learning is used synonymously with Data Science. Before I knew what either term meant, I simply thought that Data Science was some new faddish word for Machine Learning. Over time though, I’ve come to appreciate the real differences between these terms. I’ve always wondered how misconceptions like these endure for so long. My current working hypothesis: people are deathly afraid of looking stupid, too afraid to ask someone “What is machine learning? What is data science? What is the difference?” So, for those too afraid to ask, I’m going to pretend that you asked. What follows are my hypothetical answers to your hypothetical questions :-). Enjoy.
Machine Learning is the set of techniques concerned with getting a program to perform a task better, with respect to some metric, as the program gains more experience. Amazon’s recommendation engine is an example of a machine learning system. The program is the recommendation engine. The task is to provide you with recommendations of things you’re likely to buy. Let’s say that the metric is the number of recommended purchases you’ve made divided by the number of recommendations the system sent you. The recommendation engine gets experience from monitoring what you view and what you buy. Machine Learning has three distinct areas that fully describe it: supervised learning, unsupervised learning, and reinforcement learning.
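To make the metric concrete, here is a minimal sketch. The function name and the monthly counts are made up for illustration, and this is in no way how Amazon actually measures anything. It computes the purchases-over-recommendations ratio described above and checks whether it improves as the system gains experience.

```python
def acceptance_rate(purchases_from_recommendations, recommendations_sent):
    """Recommended purchases divided by recommendations sent."""
    if recommendations_sent == 0:
        return 0.0
    return purchases_from_recommendations / recommendations_sent


# Hypothetical (purchases, recommendations sent) counts for three successive months.
monthly_counts = [(2, 100), (5, 100), (9, 100)]
rates = [acceptance_rate(bought, sent) for bought, sent in monthly_counts]

# "Learning" in the sense used here means the metric gets better with experience.
improving = all(earlier < later for earlier, later in zip(rates, rates[1:]))
print(rates, improving)
```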
Supervised learning is the process of trying to approximate a function. An example is predicting next year’s home prices in San Francisco based on the previous ten years of housing prices. The function you’re attempting to approximate is the price of a San Francisco home next year. This function is probably impossible to compute exactly. We are beholden to the data we can obtain, and that data is rarely perfect. For instance, the ten years of historical prices may not track all the information we’d need to make perfect predictions. A historical housing data set that only has pricing information is very different from one that also has location, number of bedrooms, date of the last kitchen update, etc. The price of a home next year can be affected by all kinds of things outside of any individual’s control (e.g., natural disasters, economic boom/bust). It would be difficult to construct a model that could perfectly predict the future in this way. Thankfully, for most use cases we are satisfied with approximate predictions of the future and, more generally, with approximations of the function we wish to find.
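As a rough sketch of what “approximating a function” can look like, the snippet below fits a straight line to ten years of prices with ordinary least squares and extrapolates one year ahead. The prices are invented for illustration, and a straight line is only one of many possible approximations, chosen here because it is the simplest.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up median home prices (in thousands of dollars) for ten consecutive years.
years = list(range(2010, 2020))
prices = [700, 720, 760, 800, 830, 880, 900, 950, 990, 1020]

slope, intercept = fit_line(years, prices)
predicted_2020 = slope * 2020 + intercept
print(f"Approximate price next year: {predicted_2020:.0f}k")
```

The fitted line is not the true pricing function. It is an approximation that is only as good as the data it was given.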
Unsupervised learning is the process of exploiting the structure of data to derive “interesting” summaries. Let’s assume that we have all the statistics associated with each NFL team. Furthermore, let’s say we want to know how similar teams are, because we think that once we find these similarities, we might find that certain attributes correlate with (un)successful franchises. Before we could embark on this path, we’d have to define what we mean by similar by choosing which statistics we want to measure distances between (e.g., years of experience of the offense, years of experience of the head coach). We’d also have to make sure that Euclidean distance is the type of distance we’re interested in. We’d then apply some algorithm that causes the teams to form clusters of one or more teams based on their distance to each other. Teams that are closer to each other will tend to end up in the same cluster; teams that are further from each other will tend not to. These clusters constitute summaries of the original NFL data. Now, here’s the important part: it takes human judgement to determine if the obtained clusters are in fact “interesting”.
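Here is a small sketch of the clustering idea using k-means, which is just one clustering algorithm among many. The team names, their two statistics, and the starting centroids are all invented for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means with Euclidean distance and caller-supplied starting centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its members (keep it if it has none).
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else centroid
            for members, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical teams: (avg. years of experience on offense, head coach's years of experience).
teams = {
    "Team A": (2.0, 1.0), "Team B": (2.5, 2.0), "Team C": (3.0, 1.5),
    "Team D": (8.0, 12.0), "Team E": (7.5, 10.0), "Team F": (9.0, 11.0),
}
points = list(teams.values())
clusters, centroids = kmeans(points, centroids=[points[0], points[3]])

clusters_by_name = [
    sorted(name for name, stats in teams.items() if stats in cluster)
    for cluster in clusters
]
print(clusters_by_name)
```

The algorithm only hands back groups. Whether “inexperienced offense with a new coach” is an interesting grouping is still a call a human has to make.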
Reinforcement learning is the process of learning from delayed reward. There’s a notion of an agent (or program) that takes actions in the world toward some objective. However, the agent doesn’t get immediate feedback for the actions it takes. It doesn’t find out until many steps in the future whether the 1st, 2nd, or 3rd action it took was a fatal one or a glorious one. Think of the game of checkers. The reward there is winning the game. After playing many games against a formidable opponent, the agent may realize that certain moves lead to certain failure and will tend to avoid those moves. A good agent will eventually learn to make better moves that increase its odds of winning against a formidable opponent.
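Checkers is far too big for a short example, so here is a toy stand-in that I am assuming for illustration: a corridor of five squares where the agent starts in the middle, loses if it reaches the left end, and wins if it reaches the right end. The reward only arrives at the end of a game. The sketch uses tabular Q-learning, one common reinforcement learning algorithm, with learning-rate, discount, and exploration settings picked arbitrarily.

```python
import random

random.seed(0)

LOSE, WIN = 0, 4      # terminal squares at either end of a 5-square corridor
ACTIONS = (-1, +1)    # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.3

# q[(state, action)] estimates the long-run reward of taking `action` in `state`.
q = {(state, action): 0.0 for state in range(1, 4) for action in ACTIONS}


def best_action(state):
    """Greedy action for a state, breaking ties at random."""
    top = max(q[(state, action)] for action in ACTIONS)
    return random.choice([a for a in ACTIONS if q[(state, a)] == top])


for _ in range(2000):                       # play 2000 games
    state = 2
    while state not in (LOSE, WIN):
        explore = random.random() < EPSILON
        action = random.choice(ACTIONS) if explore else best_action(state)
        next_state = state + action
        if next_state == WIN:
            target = 1.0                    # reward only shows up when the game ends
        elif next_state == LOSE:
            target = -1.0
        else:
            target = GAMMA * max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += ALPHA * (target - q[(state, action)])
        state = next_state

policy = {state: best_action(state) for state in range(1, 4)}
print(policy)
```

After enough games, the value of the final reward trickles back to earlier squares, which is how the agent comes to prefer moves whose payoff it only sees much later.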
Now, for data science. Data science is the newer term and thus less well defined. My definition of data science is derived from the Johns Hopkins Data Science Specialization. Data science is the process of obtaining, transforming, analyzing, and communicating data to answer a question. If you’re the type of person that craves linear processes, one follows: data question, raw data, tidy data, data analysis, data product.
However, as you might guess, this linear picture doesn’t quite capture reality. That said, this depiction isn’t completely useless. These are in fact the steps you’re moving through when doing data science. Now that you’ve been prepped with the fake, let’s take a look at the real:
This bus architecture captures the messiness of the process more accurately. Any future step can influence some previous step. Any previous step can influence some future step. For ease of discussion, we’ll use the linear process depiction. Let’s walk through each step.
The data question is the question that can be answered with data. It’s essential that the question asked can, in fact, be answered by the data you have or the data you can obtain in a reasonable timeframe. The question may be given to you, or it may be a question you develop.
The raw data is exactly what it sounds like. This is data required to answer your question, but in a “raw” state. In order for you to engage in the data analysis you want, you need to convert the raw data into tidy data. The process of turning raw data into tidy data is called cleaning the data. Suppose you downloaded the graduation rates for the past five years for males and females from universities around the country as a CSV file. This CSV file is the raw data. Beyond downloading raw data from a server with the click of a button, web scraping or programmatically pulling data from a distributed file system or database are also common. People rarely mention Sneakernet, but it’s also a thing.
The tidy data is the data after you’ve cleaned it for subsequent analyses. Continuing with the previously mentioned CSV file on graduation rates, it’s likely that the file wasn’t created specifically to support your analysis. Therefore, it’s likely to contain other bits of information that are of no interest to you, like the ID of the person who entered the data or a last-accessed timestamp. Moreover, it’s possible the file will have missing or invalid values in some entries (e.g., the value 432 as a graduation rate). You’ll need to rectify these issues, typically with a custom script, to get the tidy data. I’ll note that people have taken time to define what tidy data is, and it’s worth checking out.
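A cleaning script for the graduation-rate file might look something like the sketch below. The column names and rows are invented, since I’m not working from a real file here, and dropping bad rows is only one of several defensible choices (you might instead flag them or go back to the source).

```python
import csv
import io

# Hypothetical raw file contents; a real script would read these from disk.
RAW_CSV = """university,year,sex,graduation_rate,entered_by,last_accessed
Example State,2019,F,61.5,u1042,2021-03-01
Example State,2019,M,432,u1042,2021-03-01
Sample College,2019,F,,u2210,2021-03-02
Sample College,2019,M,58.0,u2210,2021-03-02
"""

KEEP = ("university", "year", "sex", "graduation_rate")


def tidy(raw_text):
    """Keep only the columns of interest and drop rows with missing or invalid rates."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        try:
            rate = float(record["graduation_rate"])
        except ValueError:
            continue                      # missing or non-numeric value
        if not 0.0 <= rate <= 100.0:
            continue                      # e.g. 432 is not a valid percentage
        row = {name: record[name] for name in KEEP}
        row["year"] = int(row["year"])
        row["graduation_rate"] = rate
        rows.append(row)
    return rows


tidy_rows = tidy(RAW_CSV)
for row in tidy_rows:
    print(row)
```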
The data analysis is the result of the analysis performed. This is the part that everyone tends to think about when they think of Data Science. It’s where things start to get sexy. Broadly speaking, there are six types of analysis one might engage in at this stage: descriptive, exploratory, inferential, predictive, causal, and mechanistic. So, let’s walk through them.
In descriptive analysis, you’re trying to understand the shape of your data. You’re principally interested in being able to summarize the properties of your data. Think min, max, mode, average, range, etc.
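For instance, with a handful of made-up graduation rates, those summaries take only a few lines using Python’s standard library:

```python
import statistics

# Hypothetical graduation rates (percent) from five universities.
graduation_rates = [58.0, 61.5, 61.5, 64.0, 70.5]

summary = {
    "min": min(graduation_rates),
    "max": max(graduation_rates),
    "range": max(graduation_rates) - min(graduation_rates),
    "mean": statistics.mean(graduation_rates),
    "median": statistics.median(graduation_rates),
    "mode": statistics.mode(graduation_rates),
}
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```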
In exploratory analysis, you’re trying to find what relationships exist in the data. You’re usually constructing a lot of quick and dirty plots to determine what type of analyses you might like to try next on the data. Think histograms, box plots, and good ol’ x-y plots.
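Quick and dirty really can mean quick and dirty. As a sketch, here is a text-only histogram over some made-up graduation rates. In practice you would more likely reach for a plotting library, but the idea is the same.

```python
from collections import Counter


def text_histogram(values, bin_width):
    """Return one line of '#' marks per bin, lowest bin first."""
    bins = Counter(int(value // bin_width) * bin_width for value in values)
    lines = []
    for start in sorted(bins):
        lines.append(f"{start:>3}-{start + bin_width:<3} {'#' * bins[start]}")
    return lines


# Hypothetical graduation rates (percent).
rates = [52, 55, 58, 61, 61.5, 63, 64, 66, 70.5, 88]
histogram_lines = text_histogram(rates, bin_width=10)
print("\n".join(histogram_lines))
```

Even this crude picture raises a question worth chasing: what is going on with the lone value up in the 80s?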
If you’re interested in making a claim about a population based on a sample from that population, then inferential analysis is the type of analysis you’ll want. Inferential analysis is often desirable because it communicates the estimated uncertainty associated with a claim. Think statistical hypothesis testing and confidence intervals.
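As a sketch, here is a 95% confidence interval for a mean, computed from a made-up sample of graduation rates. I’m assuming a normal approximation to keep it inside the standard library. With a sample this small, a t-distribution would be the more careful choice and would give a somewhat wider interval.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * standard_error, mean + z * standard_error


# Hypothetical graduation rates (percent) from a sample of ten universities.
sample = [58.0, 61.5, 61.5, 64.0, 70.5, 55.0, 66.0, 63.0, 59.5, 67.0]
low, high = mean_confidence_interval(sample)
print(f"Estimated mean rate: {statistics.mean(sample):.1f} (95% CI {low:.1f} to {high:.1f})")
```

The interval is the point: it reports a claim about all universities along with how unsure you are about it.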
If your question has to do with predicting phenomena, then you’ll eventually find yourself doing predictive analysis. Here, you’re trying to identify the best set of features that will allow you to make predictions about something else. Think supervised learning.
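One simple way to compare candidate features is to fit a model on part of the data and measure its error on data it has not seen. The sketch below does that with two single-feature linear models. The houses, the feature names, and the choice of mean absolute error are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(model, xs, ys):
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


# Made-up houses; price_k is the sale price in thousands of dollars.
houses = [
    {"square_feet": 800, "street_number": 12, "price_k": 410},
    {"square_feet": 1000, "street_number": 95, "price_k": 500},
    {"square_feet": 1200, "street_number": 40, "price_k": 610},
    {"square_feet": 1500, "street_number": 7, "price_k": 740},
    {"square_feet": 1800, "street_number": 63, "price_k": 900},
    {"square_feet": 2000, "street_number": 28, "price_k": 1010},
    {"square_feet": 1100, "street_number": 81, "price_k": 560},
    {"square_feet": 1600, "street_number": 55, "price_k": 790},
]
train, test = houses[:6], houses[6:]   # hold out the last two houses for evaluation

errors = {}
for feature in ("square_feet", "street_number"):
    model = fit_line([h[feature] for h in train], [h["price_k"] for h in train])
    errors[feature] = mean_absolute_error(
        model, [h[feature] for h in test], [h["price_k"] for h in test]
    )

best_feature = min(errors, key=errors.get)
print(errors, best_feature)
```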
If you wish to make claims such as “X causes Y”, you’re doing causal analysis, and you’ll really need to be able to perform randomized controlled experiments. If this is unavailable to you and all you have is observational data (the common case), you may consider leveraging a Quasi-Experimental Design (but its validity is questionable). Things like moderation analysis tend to come up when people are thinking about causal analysis as well. Fundamentally though, think randomized controlled experiments.
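To see why randomization matters, here is a small simulation with an invented true effect of 5 points. Because a coin flip decides who gets the treatment, everything else that influences the outcome is balanced across the two groups on average, so the difference in group means estimates the causal effect. All the numbers here are assumptions for illustration.

```python
import random
import statistics

random.seed(42)

TRUE_EFFECT = 5.0   # the causal effect we pretend not to know

treated_outcomes, control_outcomes = [], []
for _ in range(4000):
    baseline = random.gauss(50.0, 10.0)   # everything else that affects the outcome
    if random.random() < 0.5:             # coin-flip assignment, independent of baseline
        treated_outcomes.append(baseline + TRUE_EFFECT)
    else:
        control_outcomes.append(baseline)

estimated_effect = statistics.mean(treated_outcomes) - statistics.mean(control_outcomes)
print(f"Estimated effect: {estimated_effect:.2f}")
```

With observational data there is no coin flip, so the groups may differ in their baselines too, and the same subtraction no longer isolates the effect.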
Mechanistic analysis requires that you have a mathematical model (equation) to represent some phenomenon. This model isn’t chosen for statistical convenience (e.g., a Gaussian model) but for scientific reasons. With this model in hand, you aim to determine exactly how one variable influences another, using the data that you have. Think doing statistical analysis with scientifically chosen models.
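A classic illustration, which I’m choosing myself rather than taking from any particular source, is free fall. Physics says the distance fallen is d = 0.5 * g * t^2, so the form of the model is fixed by science and the data is only used to estimate g. The measurements below are made up.

```python
# Hypothetical measurements: fall time in seconds, distance fallen in meters.
times = [0.5, 1.0, 1.5, 2.0]
distances = [1.2, 4.9, 11.1, 19.6]


def estimate_g(times, distances):
    """Least squares estimate of g for the model d = 0.5 * g * t**2 (no intercept)."""
    numerator = sum(d * t ** 2 for t, d in zip(times, distances))
    denominator = sum(t ** 4 for t in times)
    return 2.0 * numerator / denominator


g = estimate_g(times, distances)
print(f"Estimated g: {g:.2f} m/s^2")
```

Notice there was no search over candidate models here. The equation came from the science and the statistics only filled in the constant.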
The data product is how you communicate the answer to your question. This can take the form of a presentation, a literate program, a blog post, a scholarly article, an interactive visualization, or a web/mobile/desktop/backend application. Who you are trying to communicate results to will influence what type of data product you end up creating.
The Difference Between Machine Learning and Data Science
If you read everything above, you definitely know the answer to this question now. Machine Learning is a type of analysis you *might* perform as part of Data Science. Stated another way, Machine Learning isn’t a necessary condition of Data Science (Statistics is though!). If you happen to be doing a predictive task, you’re reaching for supervised learning. If you happen to be doing descriptive/exploratory analysis, you *might* reach for unsupervised learning. As for reinforcement learning, it’s not as popular as supervised learning or unsupervised learning, and even less popular in Data Science.