# What’s in the Box?! — Chipy Mentorship, pt. 3

## Understanding Predictive Modeling and the Meaning of “Black Box” Models

Since embarking on my study of data science, the most frequent question I’ve gotten has been: “What is Data Science?” A reasonable question, to be sure, and one I am more than happy to answer concerning a field about which I am so passionate. The cleanest distillation of that answer goes something like this:

Data Science is the process of using machine learning to build predictive models.

As a natural follow-up, our astute listener might ask for a definition of terms:

- What is machine learning?
- What do you mean by model?

This post focuses on the answer to the latter. And the answer in this context — again in its simplest form — is that a model is a function; it takes an input and generates an output. To begin, we’ll look at one of the most transparent predictive models, linear regression.

#### Linear Regression

Let’s say we are trying to predict what a student, Taylor, will score on a test. We know what Taylor scored on past tests, *and* we know how many hours she studied. When she doesn’t study at all, she gets a 79/100, and for every hour she studies, she does three points better. Thus, our equation for this relationship is:

Taylor's_Test_Score = 3 * Hours_Studied + 79

This example is about as transparent a model as possible. Why? Because we not only know *that* hours studied influences Taylor’s scores, we know *how* it influences Taylor’s scores. Namely, being able to see the coefficient, 3, and the constant, 79, tells us the exact relationship between studying and scores. It’s a simple function: Put in two hours, get a score of 85. Put in 6 hours, get a score of 97.
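As a minimal sketch, Taylor’s equation can be written as a Python function. The name `predict_score` and its keyword arguments are only illustrative; the coefficient 3 and the constant 79 come straight from the equation above.

```python
def predict_score(hours_studied, coefficient=3, constant=79):
    """Predict Taylor's test score from the number of hours she studied."""
    return coefficient * hours_studied + constant


print(predict_score(0))  # 79
print(predict_score(2))  # 85
print(predict_score(6))  # 97
```

Everything the model “knows” is visible in those two numbers, which is what makes it so transparent.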

The real world is, of course, more complicated, so we might begin to add more inputs to our model. Maybe we also use hours slept, hours spent watching TV, and absences from school. All of these inputs are known in data science parlance as the *features* of our model, the data we use to make our predictions. If we update our equation, each new feature gets its own coefficient that describes the relationship of that input to the output. This linear regression is now *multi-dimensional*, meaning there are multiple inputs, and we can no longer easily visualize it in a 2D plane. The coefficients are telling, though. If our desired outcome is the highest possible test score, then we want to maximize the inputs with large positive coefficients (likely time spent sleeping and studying) and minimize the inputs with negative coefficients (TV and absences).
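To make the multi-feature version concrete, here is a small sketch. The coefficients and intercept below are invented purely for illustration (they are not fitted to any data), and the feature names are my own labels for the inputs mentioned above.

```python
# Hypothetical coefficients, made up for illustration only.
COEFFICIENTS = {
    "hours_studied": 3.0,
    "hours_slept": 1.5,
    "hours_tv": -2.0,
    "absences": -4.0,
}
INTERCEPT = 70.0  # also hypothetical


def predict_score(features):
    """Return the intercept plus coefficient * value summed over every feature."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * value for name, value in features.items()
    )


student = {"hours_studied": 4, "hours_slept": 8, "hours_tv": 1, "absences": 0}
print(predict_score(student))  # 92.0
```

The sign and size of each coefficient can still be read directly, even though the model can no longer be drawn as a single line.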

We can even use binary and categorical data in a linear regression. In my dataset, a beer either is or isn’t made by a macro brewery (e.g., Anheuser-Busch, Coors, etc.). If it is, the value for the feature `is_macrobrew` is set to one, and because that feature carries a negative coefficient in my linear regression, the predicted rating for the beer goes down.
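A rough sketch of how a binary feature enters the equation, and how a categorical value can be turned into binary features, might look like this. The base rating, the coefficient, and the list of styles are hypothetical stand-ins, not values from my actual beer model.

```python
# Hypothetical numbers for illustration; not fitted to any real beer data.
BASE_RATING = 3.75
MACROBREW_COEFFICIENT = -0.5


def predict_rating(is_macrobrew):
    """is_macrobrew is 1 for a macro brewery and 0 otherwise."""
    return BASE_RATING + MACROBREW_COEFFICIENT * is_macrobrew


def one_hot(value, categories):
    """Turn one categorical value into a list of 0/1 indicator features."""
    return [1 if value == category else 0 for category in categories]


print(predict_rating(0))  # 3.75
print(predict_rating(1))  # 3.25

STYLES = ["ipa", "stout", "lager"]  # hypothetical categories
print(one_hot("stout", STYLES))  # [0, 1, 0]
```

Each indicator produced by `one_hot` would then get its own coefficient, just like `is_macrobrew`.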

If linear regression is so transparent and can handle different data types, isn’t it better than a black box model? The answer, unfortunately, is no. As we introduce more features, those features may covary. Thinking back to Taylor, more time studying might mean less time sleeping, and to build a more effective model, we need to account for that by introducing a new feature that captures the *interaction of those two features*. We also need to consider that not all relationships are linear. An additional hour of sleep might steadily improve the score up to 9 or 10 hours, but beyond that it could have minimal or even adverse influence on the score.
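One way to picture these fixes is as extra, hand-built features. The sketch below is only an illustration: the function name, the feature names, and the choice to cap sleep at 9 hours are my own assumptions about how one might encode the ideas in this paragraph.

```python
def add_engineered_features(hours_studied, hours_slept, sleep_cap=9):
    """Return the raw inputs plus an interaction term and a capped sleep term."""
    return {
        "hours_studied": hours_studied,
        "hours_slept": hours_slept,
        # Interaction: lets the effect of studying depend on how much sleep was had.
        "studied_x_slept": hours_studied * hours_slept,
        # Non-linearity: sleep beyond the cap adds nothing more to this feature.
        "capped_sleep": min(hours_slept, sleep_cap),
    }


print(add_engineered_features(4, 11))
# {'hours_studied': 4, 'hours_slept': 11, 'studied_x_slept': 44, 'capped_sleep': 9}
```

The catch is that we have to think of each of these features ourselves, and every one we add makes the model a little harder to read.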

#### Tree-Based and Ensemble Modeling

In an attempt to overcome the weaknesses of linear modeling, we might try a tree-based approach. A decision tree can be visualized easily and followed like a flow chart. Take the following example using the iris dataset:

The iris dataset poses a classification problem rather than a regression problem. The model looks at 50 samples of each of three types of iris and attempts to classify which type each sample is. The first line in every box shows which feature the model is splitting on. For the topmost box, if the petal width is less than ~0.75 cm, the model classifies the sample as class Setosa, and if not, the model passes the sample on to the next split. With any given sample, we can easily follow the logic the model takes from one split to the next. We’ve also accounted for the covariance and non-linearity issues that plague linear models.
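Because a decision tree is just a flow chart, a small one can be written out by hand as nested conditions. The sketch below assumes a tree that splits on petal width only, with thresholds I picked for illustration; a trained tree would learn its own features and cut points.

```python
def classify_iris(petal_width_cm):
    """Follow a tiny hand-written decision tree, one split at a time."""
    # First split: very narrow petals are classified as setosa.
    if petal_width_cm <= 0.8:
        return "setosa"
    # Second split: mid-width petals are classified as versicolor.
    if petal_width_cm <= 1.75:
        return "versicolor"
    # Everything that falls through both splits is classified as virginica.
    return "virginica"


print(classify_iris(0.2))  # setosa
print(classify_iris(1.3))  # versicolor
print(classify_iris(2.1))  # virginica
```

For any sample, the path through the `if` statements is the full explanation of the prediction.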

A random forest is an even more powerful version of this type of model. In a random forest, as the name might suggest, we have multiple decision trees. Each tree sees a random subset of the data chosen with replacement (a process known as bootstrapping) as well as a random sample of the features. In our forest, we might have 10, 100, 1000, or more trees. Transparency has started to take a serious dip. I can no longer read off the relationship of a feature to the target, nor can I realistically follow the path of an individual sample to its outcome. The best I can do now is crack open the model and see how much each feature contributes to the splits, its feature importance.
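The two sources of randomness, plus the way the trees’ answers are combined, can be sketched with the standard library alone. This is a toy illustration rather than a working forest: the helper names are mine, the “rows” are just integers standing in for training samples, and many real implementations draw the feature subset at each split rather than once per tree.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features for a tree to consider."""
    return rng.sample(feature_names, k)


def majority_vote(predictions):
    """Combine the trees' class predictions by taking the most common one."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # seeded so the sketch is repeatable
rows = list(range(10))  # stand-ins for 10 training samples
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Each "tree" gets its own resampled rows and its own slice of the features.
for tree_number in range(3):
    print(tree_number, bootstrap_sample(rows, rng), random_feature_subset(features, 2, rng))

print(majority_vote(["setosa", "versicolor", "setosa"]))  # setosa
```

With hundreds of trees each trained on different rows and features, no single flow chart explains the final vote.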

#### A Brief Word on Neural Nets

As we continue up the scale of opacity, we finally arrive at neural networks. I’m going to simplify the various types here for the sake of brevity. The basic functioning of a neural network goes like this:

- The features are fed into a layer of “neurons”
- Each of these neurons is its own small function (like the one we built to predict Taylor’s test score, usually followed by a non-linear transformation) and feeds its output to the next layer
- The outputs of the final layer are combined to yield the prediction
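The steps above can be sketched as a tiny forward pass in plain Python. Everything here is an assumption made for illustration: the weights and biases are made up, the sigmoid is just one common choice of activation, and each neuron takes a weighted sum of all of its inputs.

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, squashed by a sigmoid activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 / (1 + math.exp(-total))


def layer(inputs, weight_rows, biases):
    """Run every neuron in a layer on the same inputs."""
    return [neuron(inputs, weights, bias) for weights, bias in zip(weight_rows, biases)]


# Hypothetical inputs and weights, chosen only to show the mechanics.
features = [2.0, 8.0]  # e.g. hours studied, hours slept
hidden = layer(features, [[0.5, -0.2], [-0.3, 0.1]], [0.0, 0.1])  # two hidden neurons
output = layer(hidden, [[1.0, -1.0]], [0.0])  # one output neuron

print(hidden)
print(output)
```

Even in this two-layer toy, the effect of one feature on the output is spread across several weights, which hints at why larger networks are so hard to interpret.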

The user determines the number of layers and the number of neurons in each layer when building the model. As you can imagine, one feature might undergo tens of thousands of manipulations. We have no practical way of reading off the relationship between our model’s inputs and its outputs. Which leads to the final question…

**Do we need to see inside the box?**

Anyone engaging in statistics, perhaps especially in the social sciences, has had the distinction between correlation and causation drilled into them. To return to our earliest example, we know that hours studied and test scores are *strongly* correlated, but we cannot be certain of causation.

Ultimately, for the data scientist, the question we need to ask ourselves is, *do we care about causation?* The answer, more often than not, is probably no. If the model works, if what we put into the machine gives us back accurate predictions, that might satisfy our success conditions. Sometimes, however, we might need to understand the precise relationships. Building the model might be more about sussing out those relationships than about predicting what’s going to happen, even if the accuracy of our model suffers.

These are the sorts of questions that data scientists ask themselves when building a model, and the answers guide which method they choose. In my own project, I am most concerned with the accuracy of my predictions and less interested in understanding why a beer would have a high or a low rating, particularly since it’s largely a matter of taste and public opinion.

I might not be able to tell you what’s happening in the black box, but I can tell you that it works. And that’s often enough.