Machine Learning without Tears, Part 2

In the first post of our non-technical ML intro series we discussed some general characteristics of ML tasks. In this post we take a first baby step towards understanding how learning algorithms work. We’ll continue the dialog between an ML expert and an ML-curious person.

Ok I see that an ML program can improve its performance at some task after being trained on a sufficiently large amount of data, without explicit instructions given by a human. This sounds like magic! How does it work?

Let’s start with an extremely simple example. Say you’re running an ad campaign for a certain type of running shoe on the NYTimes web-site. Every time a user visits the web-site, an ad-serving opportunity arises, and given the features of the ad-opportunity (such as time, user-demographics, location, browser-type, etc) you want to be able to predict the chance of the user clicking on the ad. You have access to training examples: the last 3 weeks of historical logs of features of ads served, and whether or not there was a click. Can you think of a way to write code to predict the click-rate using this training data?

Let me see, I would write a program that looks at the trailing 3 weeks of historical logs, and if N is the total number of ad exposures, and k is the number of those that resulted in clicks, then for any ad-opportunity it would predict a click probability of k/N.
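
A minimal Python sketch of this idea might look like the following. The log format (a list of records with a `clicked` field) and the function name are assumptions made purely for illustration.

```python
def baseline_click_rate(logs):
    """Return k/N: the fraction of logged ad exposures that were clicked."""
    n = len(logs)
    if n == 0:
        raise ValueError("need at least one logged exposure")
    k = sum(1 for row in logs if row["clicked"])
    return k / n

# Hypothetical toy log: 3 clicks out of 100 exposures.
logs = [{"clicked": True}] * 3 + [{"clicked": False}] * 97
rate = baseline_click_rate(logs)
print(rate)  # 0.03
```

Note that the prediction is the same for every ad-opportunity: the features of the opportunity are never consulted.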

Great, and this would be an ML program! The program ingests historical data, and given any ad-serving opportunity, it outputs a click probability. If the historical (over the trailing 3-weeks) fraction of clicked ads changes over time, your program would change its prediction as well, so it’s adapting to changes in the data.

Wow, that’s it? What’s all the fuss about Machine Learning then?

Well this would be a very rudimentary learning algorithm at best: it would be accurate only in aggregate, over the whole population of ad-exposures. What if you want to improve the accuracy of your predictions for individual ad-opportunities?

Why would I want to do that?

Well if your goal is to show ads that are likely to elicit clicks, and you want to figure out how much you want to pay for showing an ad, the most important thing to predict is the click probability (or CTR, the click-through-rate) for each specific ad opportunity: you’ll want to pay more for higher CTR opportunities, and less for lower CTR opps.

Say you’re running your ad campaign in two cities: San Francisco and Minneapolis, with an equal number of exposures in each city. Suppose you found that overall, 3% of your ads result in clicks, and this is what you predict as the click-probability for any ad opportunity. However when you look more closely at the historical data, you realize that not all ad-opportunities are the same. You notice an interesting pattern: 5% of the ads shown to users in San Francisco are clicked, compared to only 1% of ads shown to users logging in from Minneapolis. Since there are an equal number of ads shown in the two cities, you’re observing an average click-rate of 3% overall, and …

Oh ok, I know how to fix my program! I will put in a simple rule: if the ad opportunity is from San Francisco, predict 5%, and if it’s from Minneapolis, predict 1%. Sorry to interrupt you, I got excited…

That’s ok… in fact you walked right into a trap I set up for you: you gave a perfect example of an ad-hoc static rule: You’re hard-coding an instruction in your program that leverages a specific pattern you found by manually slicing your data, so this would not be an ML program at all!
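
To make the trap concrete, here is roughly what such a static rule could look like in Python. The function name and the fallback value for other cities are assumptions added for illustration; the 5% and 1% figures are the ones from the example.

```python
def predict_click_rate_hardcoded(city):
    # Static rule: these numbers were typed in by a person, not learned from data.
    if city == "San Francisco":
        return 0.05
    if city == "Minneapolis":
        return 0.01
    return 0.03  # assumed fallback: the overall rate, for any other city

# With equal exposure in both cities, the rule averages out to the overall 3%.
average = (0.5 * predict_click_rate_hardcoded("San Francisco")
           + 0.5 * predict_click_rate_hardcoded("Minneapolis"))
```

Nothing in this code reads the historical logs, so its predictions stay the same no matter how the data changes.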

So… what’s so bad about such a program?

Several things: (a) this is just one pattern among many possible patterns that could exist in the data, and you just happened to find this one; (b) you discovered this pattern by manually slicing the data, which requires a lot of time, effort and cost; (c) the patterns can change over time, so a hard-coded rule may cease to be accurate at some point. On the other hand, a learning algorithm can find many relevant patterns, automatically, and can adapt over time.

I thought I understood how a learning algorithm works, now I’m back to square one!

You’re pretty close though. Instead of hard-coding a rule based on a specific pattern that you find manually, you write code to slice historical data by all features. Suppose there were just 2 features: city (the name of the city) and IsWeekend (1 if the opportunity is on a weekend, 0 otherwise). Do you see a way to improve your program so that it’s more general and avoids hard-coding a specific rule?

Yes! I can write code to go through all combinations of values of these features in the historical data, and build a lookup table showing for each (city, IsWeekend) pair, what the historical click-through-rate was. Then when the program encounters a new ad-opportunity, it will know which city it’s from, and whether or not it’s a weekend, and so it can look up the corresponding historical rate in the table, and output that as its prediction.
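
A lookup-table predictor along these lines could be sketched as follows. The record fields (`city`, `is_weekend`, `clicked`) and the function names are hypothetical, chosen only to mirror the two features in the example.

```python
from collections import defaultdict

def build_lookup_table(logs):
    """Map each (city, is_weekend) pair to its historical click-through rate."""
    counts = defaultdict(lambda: [0, 0])  # key -> [clicks, exposures]
    for row in logs:
        key = (row["city"], row["is_weekend"])
        counts[key][0] += int(row["clicked"])
        counts[key][1] += 1
    return {key: clicks / exposures
            for key, (clicks, exposures) in counts.items()}

def predict(table, city, is_weekend):
    # Raises KeyError if this pair never appeared in the logs.
    return table[(city, is_weekend)]

# Hypothetical toy log with four exposures.
logs = [
    {"city": "San Francisco", "is_weekend": 1, "clicked": True},
    {"city": "San Francisco", "is_weekend": 1, "clicked": False},
    {"city": "Minneapolis", "is_weekend": 0, "clicked": False},
    {"city": "Minneapolis", "is_weekend": 0, "clicked": False},
]
table = build_lookup_table(logs)
print(predict(table, "San Francisco", 1))  # 0.5
print(predict(table, "Minneapolis", 0))    # 0.0
```

Unlike a hand-written rule, the table is rebuilt from whatever logs it is given, so the rates are learned from the data rather than typed in.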

Great, yes you could do that, but there are a few problems with this solution. What if there were 30 different features? Even if each feature has only 2 possible values, that is already 2³⁰ possible combinations of values, or more than a billion (and of course, the number of possible values of many of the features, such as cities, web-sites, etc could be a lot more than just two). It would be very time-consuming to group the historical data by these billions of combinations, our look-up table would be huge, and so it would be very slow to even make a prediction. The other problem is this: what happens when an ad opportunity arises from a new city that the campaign had no prior data for? Even if we set aside these two issues, your algorithm’s click-rate predictions would in fact most likely not be very accurate at all.
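
The size of that table is easy to check with a couple of lines of Python (the variable names are just for illustration):

```python
n_features = 30
values_per_feature = 2  # the most optimistic case: every feature is binary

combinations = values_per_feature ** n_features
print(combinations)  # 1073741824
```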

Why would it not work well?

Your algorithm has essentially memorized the click-rates for all possible feature-combinations in the training data, so it would perform excellently if its performance is evaluated on the training data: the predicted click-rates would exactly match the historical rates. But predicting on new ad opportunities is a different matter; since there are 30 features, each with a multitude of possible values, it is highly likely that these new opportunities will have feature-combinations that were never seen before.

A more subtle point is that even if a feature-combination has occurred before, simply predicting the historical click-rate for that combination might be completely wrong. For example, suppose there were just 3 ad-opportunities in the training data which had this feature-combination: (Browser = “safari”, IsWeekend = 1, Gender = “Male”, Age = 32, City = “San Francisco”, ISP = “Verizon”), and the ad was not clicked in any of the 3 cases. Now if your algorithm encounters a new opportunity with this exact feature-combination, it would predict a 0% click-rate. This would be accurate with respect to the historical data your algorithm was trained on, but if we were to test it on a realistic distribution of ad opportunities, the prediction would almost certainly not be accurate.

What went wrong here? Suppose the true click-rate for ads with the above feature-combination is 1%. Then in a historical sample where just 3 such ad-opportunities are seen, it’s statistically very likely (a chance of about 97%) that we would see no clicks.
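
That likelihood can be worked out directly. The sketch below assumes each ad-opportunity is clicked independently with the same true rate; the function name is made up for this example.

```python
def prob_no_clicks(true_rate, n_opportunities):
    """Chance of seeing zero clicks in n independent opportunities."""
    return (1 - true_rate) ** n_opportunities

p = prob_no_clicks(0.01, 3)
print(round(p, 4))  # 0.9703
```

So with a true click-rate of 1%, a sample of 3 opportunities shows no clicks about 97% of the time, and the memorized 0% is exactly what we should expect to see.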

But what could the learning algorithm do to avoid this problem? Surely it cannot do any better given the data it has seen?

Actually it can. By examining the training data, it should be able to realize, for example, that the ISP and Browser features are not relevant to predicting clicks (for this specific campaign), and perhaps it finds that there are 1,000 training examples (i.e. ad-opportunity feature-combinations) that match the above example when ISP and Browser are ignored, and 12 of them had clicks, so it would predict a 1.2% click-rate.
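
Here is a rough Python sketch of the pooling step only. It assumes we have already been told which features to ignore; how an algorithm would discover that on its own is not shown. The toy logs and the "chrome" and "Comcast" values are invented to reproduce the numbers in the example.

```python
def pooled_click_rate(logs, opportunity, features):
    """Click rate over logged rows that match `opportunity` on `features` only."""
    matches = [row for row in logs
               if all(row[f] == opportunity[f] for f in features)]
    if not matches:
        return None  # no matching history at all
    return sum(row["clicked"] for row in matches) / len(matches)

target = {"Browser": "safari", "IsWeekend": 1, "Gender": "Male", "Age": 32,
          "City": "San Francisco", "ISP": "Verizon"}

# Hypothetical logs: 3 exact matches with no clicks, plus 997 rows that differ
# only in Browser and ISP, 12 of which were clicked.
exact = [dict(target, clicked=False) for _ in range(3)]
others = [dict(target, Browser="chrome", ISP="Comcast", clicked=(i < 12))
          for i in range(997)]
logs = exact + others

all_features = list(target)
relevant = [f for f in all_features if f not in ("Browser", "ISP")]

print(pooled_click_rate(logs, target, all_features))  # 0.0
print(pooled_click_rate(logs, target, relevant))      # 0.012
```

Matching on every feature reproduces the memorized 0%, while ignoring ISP and Browser pools 1,000 examples and gives the 1.2% estimate.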

So your algorithm, by memorizing the click-rates from the training data at a very low level of granularity, was “getting lost in the weeds” and was failing to generalize to new data. The ability to generalize is crucial to any useful ML algorithm, and indeed is a hallmark of intelligence, human or otherwise. For example think about how you learn to recognize cats: you don’t memorize how each cat looks and try to determine whether a new animal you encounter is a cat or not by matching it with your memory of a previously-seen cat. Instead, you learn the concept of a “cat”, and are able to generalize your ability to recognize cats beyond those that exactly match the ones you’ve seen.

In the next post we will delve into some ways to design true learning algorithms that generalize well.

Ok, looking forward to that. Today I learned that generalization is fundamental to machine-learning. And I will memorize that!

This post originally appeared on the MediaMath blog.