How to Define a Machine Learning Problem Like a Detective

The common refrain among machine learning practitioners is that it’s as much an art as a science. True enough, but in this discipline, you can only appreciate the former if you understand the latter.

Let’s see if we can start down that direction by laying the groundwork for the very fundamentals of our understanding of machine learning — namely, what actually constitutes a machine learning problem? It’s kind of strange question that you might think you know the answer to, but it actually has a very formal definition that we’ll outline here.

Defining the machine learning problem

The most important step you can take is to start by asking yourself: do I think there’s a pattern?

The fundamental assumption that underlies all machine learning problems is that there is a pattern. You can’t make omelets without eggs; if there isn’t any pattern, then we’re already done. Ask yourself this question before you decide to embark on a project -; it might save you some heartbreak down the line.

If you still think there’s a pattern, then we can proceed. That pattern we’re looking for is some function f, which maps some input X onto an output Y. In the notation, this is f:XY.

Of course, you don’t know f — you can’t know f, or you wouldn’t be reading this. This is the entire crux of machine learning: if I have some pattern that I can’t directly observe, how can I at least approximate it?

Define the problem the way a detective would

Do it like a detective. If f has left behind evidence, or observations, we can begin to reconstruct what f looks like. Each observation is an input xi =[x1,x2,…xd] Ɛ Rd(that is, each xi is an array of real numbers of length d) and an observed output, yi. Collectively, these observations {(x1,y1), (x2,y2)… (xn,yn)}are our data set, D.

The problem is that even with those observations, an infinite number of possibilities could explain them. For example, take a D with only the following two observations.

We have some good guesses about what kind of function f would explain these observations. In all likelihood, it’s a straight line — right?

That would make the most sense. But why can’t it be a polynomial function as well?

Why can’t it be something even more complicated?

The fact of the matter is that there is no reason that any one of these explanations is completely implausible. They are all equally valid hypotheses.

So why even bother? If there’s no way of nailing down the truth, then aren’t we shot from the get-go?

If that were true, then you probably wouldn’t be looking at this page, and most crimes probably wouldn’t be solved. We’ll formalize this concept later on, but for now, take it at face value: we’re trying to formulate our hypothesis based on what is probably the case. We’re trying to pluck some hypothesis that we think works best, g, from an infinite space of different hypotheses, H.

How do we go about that? With our detective, of course — the algorithm, A. The algorithm will be responsible for picking through our infinitely large hypothesis space H and figuring out which one makes the most sense. The one we land on is g, and we say it approximates our real function f well enough that we can use it: gf.

This is all there really is to formalizing a machine learning problem. We have some unknown target function f:XY. We don’t know what f is, but we have examples of inputs and outputs that we call D. We also have an infinite number of possible hypotheses H, which we’ll slowly whittle down until we find a hypothesis that looks good enough, g. We’ll hone in on g by feeding D to A, which is able to guess which is the closest to the real thing.

Solving the mystery

In our detective analogy, Agent A has a pile of evidence D to figure out who the killer f is. A will look through a massive pool of suspects, H, until he thinks he has one, g, that fits the profile. There’s no guarantee that g is f — and in machine learning, we almost never have that guarantee — but this is his best guess based on the evidence that is available.

If there was more evidence available, we might arrive at a different conclusion. There’s a chance we’ll pin the wrong guy altogether, and that’s catastrophic. In practice, you’ll need to determine how much room for error you can allow for in your algorithm based on your requirements (something we’ll talk about later).

If it were a different agent looking at the evidence, they might suggest a totally different set of suspects. This is what happens when we switch the algorithm we’re using to evaluate the data. There’s a price to pay for this too, which we’ll discuss. For the meantime, choose your algorithm wisely.

All sorts of subtleties bubble to the surface as you spend more time delving into the theory. It can be a slog sometimes, but it will make you a better machine learning practitioner in the long run and will be vital if you ever want to use it in the real world.

Next time I’ll talk a bit about what it means to explore H, why it’s so hard to pick out the correct answer from this pile and why we can do it at all.

Read more data science articles on OpenDataScience.com