Machine Learning 1: Lesson 6

My personal notes from machine learning class. These notes will continue to be updated and improved as I continue to review the course to “really” understand it. Much appreciation to Jeremy and Rachel who gave me this opportunity to learn.



Video / Powerpoint

We’ve looked at a lot of different random forest interpretation techniques and a question that has come up a little bit on the forum is what are these for really? How do these help me get a better score on Kaggle, and my answer has been “they don’t necessarily”. So I wanted to talk more about why we do machine learning. What’s the point? To answer this question, I want to show you something really important which is examples of how people have used machine learning mainly in business because that’s where most of you are probably going to end up after this is working for some company. I’m going to show you applications of machine learning which are either based on things that I’ve been personally involved in myself or know of people who are doing them directly so none of these are going to be hypotheticals — these are all actual things that people are doing and I’ve got direct or secondhand knowledge of.

Two Groups of Applications [1:26]

  • Horizontal: In business, horizontal means something that you do across different kinds of business, e.g. everything involving marketing.
  • Vertical: Something you do within a business or within a supply chain or a process.

Horizontal Applications

Pretty much every company has to try to sell more products to its customers so therefore does marketing. So each of these boxes are examples of some of the things that people are using machine learning for in marketing:

Let’s take an example: churn. Churn refers to a model which attempts to predict who’s going to leave. I’ve done some churn modeling fairly recently in telecommunications. We were trying to figure out for this big cellphone company which customers are going to leave. That is not of itself that interesting. Building a highly predictive model that says Jeremy Howard is almost certainly going to leave next month is probably not that helpful because if I’m almost certainly going to leave next month, there’s probably nothing you can do about it. It’s too late and it’s going to cost you too much to keep me.

So in order to understand why we would do churn modeling, I’ve got a little framework that you might find helpful: Designing great data products. I wrote it with a couple of colleagues a few years ago and in it, I describe my experience of actually turning machine learning models into stuff that makes money. The basic trick is what I call the Drivetrain Approach which is these four steps:

Defined Objective [3:48]

The starting point to actually turn a machine learning project into something that’s actually useful is to know what I am trying to achieve, and that doesn’t mean I’m trying to achieve a high area under the ROC curve or trying to achieve a large difference between classes. It would be I’m trying to sell more books or I’m trying to reduce the number of customers that leave next month or I’m trying to detect lung cancer earlier. These are objectives. So the objective is something that absolutely directly is the thing that the company or the organization actually wants. No company or organization lives in order to create a more accurate predictive model. There is some reason. So that’s your objective. That’s obviously the most important thing. If you don’t know the purpose of what you are modeling for then you can’t possibly do a good job of it. And hopefully people are starting to pick that up out there in the world of data science, but interestingly what very few people are talking about, though it’s just as important, is the next thing which is levers.

Levers [5:04]

A lever is a thing that the organization can do to actually drive the objective. So let’s take the example of churn modeling. What is a lever that an organization could use to reduce the number of customers that are leaving? They could call someone and say “Are you happy? Anything we could do?” They could give them a free pen or something if they buy $20 worth of product next month. You could give them specials. So these are levers. Whenever you are working as a data scientist, keep coming back and thinking what are we trying to achieve (we being the organization) and how are we trying to achieve it, meaning what are the actual things we can do to make that objective happen. So building a model is never ever a lever, but it could help you with the lever.

Data [7:01]

So then the next step is what data does the organization have that could possibly help them to set that lever to achieve that objective. So this is not what data did they give you when you started the project. But think about it from a first-principles point of view: okay, I’m working for a telecommunications company, they gave me some certain set of data, but I’m sure they must know where their customers live, how many phone calls they made last month, how many times they called customer service, etc. So have a think about okay if we are trying to decide who should we give a special offer to proactively, then we want to figure out what information do we have that might help us to identify who’s going to react well or badly to that. Perhaps more interesting would be what if we were doing a fraud algorithm. So we are trying to figure out who’s going to not pay for the phone that they take out of the store, they are on some 12-month payment plan, and we never see them again. Now in that case, it doesn’t matter what’s in the database; what matters is the data that we can get when the customer is in the shop. So there are often constraints around the data that we can actually use. So we need to know what am I trying to achieve, what can this organization actually do specifically to change the outcome, and at the point that the decision is being made, what data do they have or could they collect.

Models [8:45]

So then the way I put that all together is with a model. This is not a model in the sense of a predictive model but it’s a model in the sense of a simulation model. So one of the main examples I gave in this paper is one I spent many years building, which is: if an insurance company changes their prices, how does that impact their profitability? So generally your simulation model contains a number of predictive models. So I had, for example, a predictive model called an elasticity model that said for a specific customer, if we charge them a specific price for a specific product, what’s the probability that they would say yes, both when it’s new business and then a year later what’s the probability that they’ll renew. Then there’s another predictive model which is what’s the probability that they are going to make a claim and how much is that claim going to be. You can then combine these models together to say all right, if we changed our pricing by reducing it by 10% for everybody between 18 and 25, we can run it through these models that are combined together into a simulation, and then the overall impact on our market share in 10 years’ time is X and our cost is Y and our profit is Z and so forth.
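A minimal sketch of how a simulation like this might be wired together, assuming two stand-in predictive models. The functions `p_accept` and `expected_claim_cost`, their functional forms, and all the numbers are hypothetical and exist only to show the structure: predictive models feed a simulation, and the simulation is evaluated for each setting of the pricing lever.

```python
import math

def p_accept(price, market_price, sensitivity=5.0):
    """Hypothetical elasticity model: probability that a customer accepts a quote."""
    # Logistic curve: acceptance falls as our price rises above the market price.
    return 1.0 / (1.0 + math.exp(sensitivity * (price / market_price - 1.0)))

def expected_claim_cost(age):
    """Hypothetical claims model: expected yearly claim cost for a customer."""
    return 900.0 if age < 25 else 500.0

def simulate_profit(customers, price_multiplier):
    """Simulation model: expected profit under one pricing decision (the lever)."""
    total = 0.0
    for c in customers:
        price = c["base_price"] * price_multiplier(c)
        prob = p_accept(price, c["market_price"])
        # Expected profit = P(customer accepts) * (premium - expected claims).
        total += prob * (price - expected_claim_cost(c["age"]))
    return total

# Made-up customers, for illustration only.
customers = [
    {"age": 22, "base_price": 1000.0, "market_price": 1000.0},
    {"age": 40, "base_price": 700.0, "market_price": 700.0},
]

baseline = simulate_profit(customers, lambda c: 1.0)
discount_young = simulate_profit(
    customers, lambda c: 0.9 if 18 <= c["age"] <= 25 else 1.0
)
print(baseline, discount_young)
```

With these invented numbers the 10% discount wins more young customers but wipes out the margin on them, so simulated profit falls. The point is only that the decision is judged on the simulated business outcome, not on the accuracy of either predictive model.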

In practice, most of the time, you really are going to care more about the results of that simulation than you do about the predictive model directly. But most people are not doing this effectively at the moment. For example, I read all of Douglas Adams’ books, and having read them all, the next time I went to Amazon they said would you like to buy the collected works of Douglas Adams. This is after I had bought every one of his books. So from a machine learning point of view, some data scientist had said oh people that buy one of Douglas Adams’ books often go on to buy the collected works. But recommending to me that I buy the collected works of Douglas Adams isn’t smart. It’s actually not smart at a number of levels. Not only am I unlikely to buy a box set of something of which I have every one individually, but furthermore it’s not going to change my buying behavior. I already know about Douglas Adams. I already know I like him, so taking up your valuable web space to tell me hey maybe you should buy more of the author who you’re already familiar with and bought lots of times isn’t actually going to change my behavior. So what if instead of creating a predictive model, Amazon had built an optimization model that could simulate and say: if we show Jeremy this ad, how likely is he then to go on to buy this book, and if we don’t show him this ad, how likely is he to go on to buy this book? So that’s the counterfactual. The counterfactual is what would have happened otherwise, and then you can take the difference and say what should we recommend to him that is going to maximally change his behavior, so maximally result in more books sold. And so you’d probably say oh he’s never bought any Terry Pratchett book, he probably doesn’t know about Terry Pratchett but lots of people that liked Douglas Adams did turn out to like Terry Pratchett so let’s introduce him to a new author.

So it’s the difference between a predictive model on the one hand versus an optimization model on the other hand. So the two tend to go hand in hand. First of all we have a simulation model. The simulation model is saying in the world where we put Terry Pratchett’s book on the front page of Amazon for Jeremy Howard, this is what would have happened. He would have bought it with a 94% probability. That then tells us with this lever of what do I put on my homepage for Jeremy today, we say okay the different settings of that lever that put Terry Pratchett on the homepage has the highest simulated outcome. Then that’s the thing which maximizes our profit from Jeremy’s visit to amazon.com today.
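The counterfactual comparison above can be sketched as a small calculation. Everything here is assumed for illustration: the probabilities would come from two predictive models (purchase probability with and without the recommendation), and the item names, margins, and the helper `best_recommendation` are hypothetical.

```python
def best_recommendation(p_buy_if_shown, p_buy_if_not_shown, margin):
    """Pick the item whose display changes expected profit the most."""
    uplift = {
        # Change in purchase probability caused by showing the item,
        # weighted by the profit made if the item is bought.
        item: (p_buy_if_shown[item] - p_buy_if_not_shown[item]) * margin[item]
        for item in p_buy_if_shown
    }
    return max(uplift, key=uplift.get), uplift

# Made-up numbers for a single customer.
p_shown = {"Douglas Adams box set": 0.05, "Terry Pratchett novel": 0.30}
p_not_shown = {"Douglas Adams box set": 0.04, "Terry Pratchett novel": 0.02}
margin = {"Douglas Adams box set": 20.0, "Terry Pratchett novel": 5.0}

choice, uplift = best_recommendation(p_shown, p_not_shown, margin)
print(choice, uplift)
```

A plain predictive model would rank items by `p_shown` alone; ranking by the difference is what makes this a statement about the lever rather than about the customer.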

Generally speaking, your predictive models feed into this simulation model but you kind of have to think about how they all work together. For example, let’s go back to churn. So it turned out that Jeremy Howard is very likely to leave his cell phone company next month. What are we going to do about it? Let’s call him. And I can tell you if my cell phone company calls me right now and says “just calling to say we love you” I’d be like I’m cancelling right now. That would be a terrible idea. So again, you would want a simulation model that says what’s the probability that Jeremy is going to change his behavior as a result of calling him right now. So one of the levers I have is call him. On the other hand, if I got a piece of mail tomorrow that said for each month you stay with us, we’re going to give you a hundred thousand dollars, then that’s going to definitely change my behavior, right? But then feeding that into the simulation model, it turns out that overall that would be an unprofitable choice to make. Do you see how this fits in together?

So when we look at something like churn, we want to be thinking what are the levers we can pull [14:33]. What are the kinds of models that we could build with what kinds of data to help us pull those levers better to achieve our objectives. When you think about it that way, you realize that the vast majority of these applications are not largely about a predictive model at all. They are about interpretation. They are about understanding what happens if. So if we take the intersection between on the one hand, here are all the levers that we could pull (here are all the things we can do) and then here are all of the features from our random forest feature importance that turn out to be strong drivers of the outcome. So then the intersection of those is here are the levers we could pull that actually matter. Because if you can’t change the thing, that is not very interesting. And if it’s not actually a significant driver, it’s not very interesting. So we can actually use our random forest feature importance to tell us what can we actually do to make a difference. Then we can use the partial dependence to actually build this kind of simulation model to say okay if we did change that, what would happen.
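As a rough sketch of that intersection: the feature names, importance values, and the 0.05 cutoff below are invented for illustration. In practice the importances would come from your random forest and the list of levers from the business.

```python
# Hypothetical feature importances from a churn model.
importance = {
    "monthly_price": 0.40,
    "customer_age": 0.25,
    "support_calls": 0.20,
    "handset_colour": 0.01,
}

# Hypothetical set of features the organization can actually change.
levers = {"monthly_price", "support_calls", "handset_colour"}

def actionable_drivers(importance, levers, min_importance=0.05):
    """Features that are both changeable (levers) and strong drivers of the outcome."""
    keep = [f for f in levers if importance.get(f, 0.0) >= min_importance]
    return sorted(keep, key=lambda f: importance[f], reverse=True)

actionable = actionable_drivers(importance, levers)
print(actionable)
```

Here `customer_age` is important but cannot be changed, and `handset_colour` can be changed but does not matter, so neither survives the intersection.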

So there are lots of examples and what I want you to think about as you think about the machine learning problems you are working on is why does somebody care about this [16:02]. What would a good answer to them look like and how could you actually positively impact this business. So if you are creating a Kaggle kernel, try to think about it from the point of view of the competition organizer. What would they want to know and how can you give them that information. So something like fraud detection on the other hand, you probably just basically want to know who’s fraudulent. So you probably do just care about the predictive model. But then you do have to think carefully about the data availability here. So okay, we need to know who is fraudulent at the point that we are about to deliver them a product. So there’s no point looking at data that’s available a month later, for instance. So you have this key issue of thinking about the actual operational constraints that you are working under.

Human Resources Applications [17:17]

There are lots of interesting applications in human resources, like employee churn. It’s another kind of churn model, where you find out that Jeremy Howard is sick of lecturing and he’s going to leave tomorrow. What are you going to do about it? Well, knowing that wouldn’t actually be helpful. It would be too late. You would actually want a model that said what kinds of people are leaving USF and it turns out that everybody that goes to the downstairs cafe leaves USF. I guess their food is awful or whatever. Or everybody that we are paying less than half a million dollars a year is leaving USF because they can’t afford basic housing in San Francisco. So you could use your employee churn model not so much to say which employees hate us but why do employees leave. Again it’s really the interpretation there that matters.

Question: For a churn model, it sounds like there are two things that you need to predict: one being churn, and the other being what you need to optimize for your profit. So how does it work [18:30]? Yes, exactly. So this is what the simulation model is all about. You figure out this objective we are trying to maximize which is company profitability. You can create a pretty simple Excel model or something that says here is the revenue and here are the costs and the cost is equal to the number of people we employ multiplied by their salary, etc. Inside that Excel model, there are certain cells/inputs that are kind of stochastic or uncertain. But we could predict them with a model and so that’s what I do then is to say okay we need a predictive model for how likely somebody is to stay if we change their salary, how likely they are to leave with the current salary, how likely they are to leave next year if I increased their salary now, etc. So you build a bunch of different models and then you can bind them together with simple business logic and then you can optimize that. You can then say okay if I pay Jeremy Howard half a million dollars, that’s probably a really good idea and if I pay him less then it’s probably not or whatever. You can figure out the overall impact. So it’s really shocking to me how few people do this. Most people in industry measure their models using AUC or RMSE or whatever, which is never actually what you really want.

More Horizontal Applications…[22:04]

Lead prioritization is a really interesting one. Every one of these boxes I’m showing, you can generally find a company or many companies whose sole job in life is to build models of that thing. So there are lots of companies that sell lead prioritization systems but again the question is how would we use that information. So if it’s like our best lead is Jeremy, he has the highest probability of buying. Does that mean I should send a salesperson out to Jeremy or I shouldn’t? If he’s highly probable to buy, why would I waste my time with him? So again, you really want some kind of simulation that says what’s the likely change in Jeremy’s behavior if I send my best salesperson out to go and encourage him to sign. I think there are many many opportunities for data scientists in the world today to move beyond predictive modeling to actually bringing it all together.

Vertical Applications [23:29]

As well as these horizontal applications that basically apply to every company, there’s a whole bunch of applications that are specific to every part of the world. For those of you that end up in healthcare, some of you will become experts in one or more of these areas. Like readmission risk. So what’s the probability that this patient is going to come back to the hospital. Depending on the details of the jurisdiction, it can be a disaster for hospitals when somebody is readmitted. If you find out that this patient has a high probability of readmission, what do you do about it? Again, the predictive model is helpful of itself. It rather suggests we shouldn’t send them home yet because they are going to come back. But wouldn’t it be nice if we had the tree interpreter and it said to us the reason that they are at high risk is because we don’t have a recent EKG/ECG for them. Without a recent EKG, we can’t have a high confidence about their cardiac health. In which case, it wouldn’t be like let’s keep them in the hospital for two weeks, it’ll be let’s give them an EKG. So this is interaction between interpretation and predictive accuracy.

Question: So what I’m understanding you are saying is that the predictive models are really great but in order to actually answer these questions, we really need to focus on the interpretability of these models [24:59]? Yeah, I think so. More specifically I’m saying we just learnt a whole raft of random forest interpretation techniques and so I’m trying to justify why. The reason why is because I’d say most of the time the interpretation is the thing we care about. You can create a chart or a table without machine learning and indeed that’s how most of the world works. Most managers build all kinds of tables and charts without any machine learning behind them. But they often make terrible decisions because they don’t know the feature importance of the objective they are interested in and so the table they create is of things that actually are the least important things anyway. Or they just do a univariate chart rather than a partial dependence plot, so they don’t actually realize that the relationship they thought they were looking at is due entirely to something else. So I’m kind of arguing for data scientists getting much more deeply involved in strategy and in trying to use machine learning to really help a business with all of its objectives. There are companies like dunnhumby which is a huge company that does nothing but retail applications with machine learning. I believe there’s like a dunnhumby product you can buy which will help you figure out if I put my new store in this location versus that location, how many people are going to shop there. Or if I put my diapers in this part of the shop versus that part of the shop, how is that going to impact purchasing behavior, etc. So it’s also good to realize that the subset of machine learning applications you tend to hear about in the tech press or whatever is this massively biased tiny subset of stuff which Google and Facebook do. Whereas the vast majority of stuff that actually makes the world go around is these kinds of applications that actually help people make things, buy things, sell things, build things, and so forth.

Question: About tree interpretation, we looked at which feature was more important for a particular observation. For businesses, they have a huge amount of data and they want this interpretation for a lot of observations so how do they automate it? Do they set thresholds [27:50]? The vast majority of machine learning models don’t automate anything. They are designed to provide information to humans. So for example, if you are a customer service phone operator for an insurance company and your customer asks you why is my renewal $500 more expensive than last time, then hopefully the insurance company provides in your terminal a little screen that shows the result of the tree interpreter or whatever. So you can jump there and tell the customer that last year you were in this different zip code which has lower amounts of car theft, and this year also you’ve actually changed your vehicle to a more expensive one. So it’s not so much about thresholds and automation, but about making these model outputs available to the decision makers in the organization, whether they be at the top strategic level of like are we going to shut down this whole product or not, all the way to the operational level of that individual discussion with a customer.

So another example is aircraft scheduling and gate management. There’s lots of companies that do that and basically what happens is that there are people at an airport whose job it is to basically tell each aircraft what gate to go to, to figure out when to close the doors, stuff like that. So the idea is you’re giving them software which has the information they need to make good decisions. So the machine learning models end up embedded in that software to say okay that plane that’s currently coming in from Miami, there’s a 48% chance that it’s going to be over 5 minutes late and if it does then this is going to be the knock-on impact through the rest of the terminal, for instance. So that’s how these things fit together.

Other applications [31:02]

There are lots of applications, and what I want you to do is to spend some time thinking about them. Sit down with one of your friends and talk about a few examples. For example, how would we go about doing failure analysis in manufacturing, who would be doing that, why would they be doing it, what kind of models might they use, what kind of data might they use. Start to practice and get a sense. Then when you’re at the workplace and talking to managers, you want to be straightaway able to recognize that the person you are talking to — what are they trying to achieve, what are the levers they have to pull, what are the data they have available to pull those levers to achieve that thing, and therefore how could we build models to help them do that and what kind of predictions would they have to be making. So then you can have this really thoughtful empathetic conversation with those people and then saying “in order to reduce the number of customers that are leaving, I guess you are trying to figure out who should you be providing better pricing to” and so forth.

Question: Are explanatory problems people are faced with in social sciences something machine learning can be useful for or is used for, or is that not really the realm it’s in [32:29]? I’ve had a lot of conversations about this with people in social sciences and currently machine learning is not well applied in economics or psychology or whatever on the whole. But I’m convinced it can be for the exact reasons we are talking about. So if you are going to try to do some kind of behavioral economics and you’re trying to understand why some people behave differently to other people, a random forest with a feature importance plot would be a great way to start. More interestingly, if you are trying to do some kind of sociology experiment or analysis based on a large social network dataset where you have an observational study, you really want to try and pull out all of the sources of exogenous variables (i.e. all the stuff that’s going on outside) so if you use a partial dependence plot with a random forest that happens automatically. I actually gave a talk at MIT a couple of years ago for the first conference on digital experimentation which was really talking about how do we experiment in things like social networks in these digital environments. Economists all do things with classic statistical tests but in this case, the economists I talked to were absolutely fascinated by this and they actually asked me to give an introduction to machine learning session at MIT to these various faculty and graduate folks in the economics department. And some of those folks have gone on to write some pretty famous books and so hopefully it’s been useful. It’s definitely early days but it’s a big opportunity. But as Yannet says, there’s plenty of skepticism still out there. The skepticism comes from unfamiliarity basically with this totally different approach. So if you spent 20 years studying econometrics and somebody comes along and says here is a totally different approach to all the stuff econometricians do, naturally your first reaction will be “prove it”. So that’s fair enough but I think over time the next generation of people who are growing up with machine learning, some of them will move into the social sciences, they’ll make huge impacts that nobody has ever managed to make before and people will start going wow. Just like happened in computer vision. Computer vision spent a long time with people saying “maybe you should use deep learning for computer vision” and everybody in computer vision said “Prove it. We have decades of work on amazing feature detectors for computer vision.” And then finally in 2012, Hinton and Krizhevsky came along and said “our model is twice as good as yours and we’ve only just started on this” and everybody was convinced. Nowadays every computer vision researcher basically uses deep learning. So I think that time will come in this area too.

Different random forest interpretation methods [37:17]

Having talked about why they are important, let’s now remind ourselves what they are.

Confidence based on tree variance

What does it tell us? Why would we be interested in that? How is it calculated?

The variance of the predictions of the trees. Normally the prediction is just the average, this is variance of the trees.

Just to fill in a detail here, what we generally do here is we take just one row/observation often and find out how confident we are about that (i.e. how much variance there is in the trees for that) or we can do as we did here for different groups [39:34].

What we’ve done here is to ask whether there are any groups that we are very unconfident about (which could be due to very few observations). Something that I think is even more important would be when you are using this operationally. Let’s say you are doing a credit decisioning algorithm. So we are trying to determine whether Jeremy is a good risk or a bad risk. Should we loan him a million dollars? And the random forest says “I think he’s a good risk but I’m not at all confident.” In which case, we might say okay maybe I shouldn’t give him a million dollars. Whereas, if the random forest said “I think he’s a good risk and I’m very sure of that” then we are much more comfortable giving him a million dollars. And I’m a very good risk. So feel free to give me a million dollars. I checked the random forest before, in a different notebook. Not in the repo 😆

It’s quite hard for me to give you folks direct experience with this kind of single observation interpretation because it’s really the kind of stuff that you actually need to be putting out to the front line [41:30]. It’s not something which you can really use so much in a Kaggle context but it’s more like if you are actually putting out some algorithm which is making big decisions that could cost a lot of money, you probably don’t so much care about the average prediction of the random forest but maybe you actually care about the average minus a couple standard deviations (i.e. what’s the worst-case prediction). Maybe there is a whole group that we are unconfident about, so that’s confidence based on tree variance.
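A minimal sketch of confidence based on tree variance, assuming scikit-learn’s `RandomForestRegressor` and a synthetic dataset made up for this example (not the course data). It collects one prediction per tree via the fitted forest’s `estimators_` list, then takes the mean and standard deviation across trees for each row.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=500)

model = RandomForestRegressor(n_estimators=40, random_state=0).fit(X, y)

rows = X[:5]
# One prediction per tree for each row: shape (n_trees, n_rows).
per_tree = np.stack([tree.predict(rows) for tree in model.estimators_])

mean = per_tree.mean(axis=0)   # the forest's usual prediction
std = per_tree.std(axis=0)     # disagreement between trees for each row
cautious = mean - 2 * std      # "average minus a couple of standard deviations"
print(np.round(mean, 2), np.round(std, 2), np.round(cautious, 2))
```

Rows with a large `std` are the ones the forest is least sure about. Whether to act on `mean` or on something like `cautious` depends on how costly a wrong decision is.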

Feature importance [42:36]

Student: It’s basically to find out which features are important. You take each feature and shuffle the values in the feature and check how the predictions change. If it’s very different, it means that the feature was actually important; otherwise it is not that important.

Jeremy: That was terrific. That was all exactly right. There were some details that were skimmed over a little bit. Anybody else wants to jump into a more detailed description of how it’s calculated? How exactly do we calculate feature importance for a particular feature?

Student: After you are done building a random forest model, you take each column and randomly shuffle it. And you run a prediction and check the validation score. If it gets bad after shuffling one of the columns, that means that column was important, so it has a higher importance. I’m not exactly sure how we quantify the feature importance.

Jeremy: Ok, great. Do you know how we quantify the feature importance? That was a great description. To quantify, we can take the difference in R² or score of some sort. So let’s say we’ve got our dependent variable which is price, and there’s a bunch of independent variables including year made [44:22]. We use the whole lot to build a random forest and that gives us our predictions. Then we can compare those to the actuals to get R², RMSE, whatever you are interested in from the model.

Now the key thing here is I don’t want to have to retrain my whole random forest. That’s slow and boring, so using the existing random forest, how can I figure out how important year made was? So the suggestion was, let’s randomly shuffle the whole column. Now that column is totally useless. It’s got the same mean, same distribution. Everything about it is the same, but there’s no connection at all between actual year made and what’s now in that column. I’ve randomly shuffled it. So now I put that new version through the same random forest (so there is no retraining done) to get some new ŷ (ym). Then I can compare that to my actuals to get RMSE (ym). So now I can start to create a little table where I’ve got the original RMSE (2, for example), then YearMade scrambled with an RMSE of 3, and Enclosure scrambled with an RMSE of 2.5. Then I just take these differences. For YearMade, the importance is 1, Enclosure is 0.5, and so forth: how much worse did my model get after I shuffled that variable?
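The shuffling procedure can be sketched as follows. This is a hand-rolled illustration on synthetic data, not the code used in the course: the helper names `rmse` and `shuffle_importance` are hypothetical, and importance is taken as the rise in validation RMSE after shuffling a column, with no retraining.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def shuffle_importance(model, X, y, seed=0):
    """Increase in RMSE when each column is shuffled; the model is never retrained."""
    rng = np.random.default_rng(seed)
    baseline = rmse(y, model.predict(X))
    importances = []
    for col in range(X.shape[1]):
        X_shuffled = X.copy()
        # Same values, same distribution, but no link to the target any more.
        X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
        importances.append(rmse(y, model.predict(X_shuffled)) - baseline)
    return np.array(importances)

# Synthetic data: column 0 matters a lot, column 1 a little, column 2 is noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.5, size=600)

X_train, X_valid = X[:400], X[400:]
y_train, y_valid = y[:400], y[400:]

model = RandomForestRegressor(n_estimators=40, random_state=0).fit(X_train, y_train)
imp = shuffle_importance(model, X_valid, y_valid)
print(np.round(imp, 3))
```

scikit-learn also provides `sklearn.inspection.permutation_importance`, which follows the same idea with repeated shuffles. The hand-written version is shown only to make the steps explicit.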

Question: Would all importances sum to one [46:52]? Honestly, I’ve never actually looked at what the units are, so I’m not quite sure. We can check it out during the week if somebody’s interested. Have a look at sklearn code and see exactly what those units of measures are because I’ve never bothered to check. Although I don’t check like the units of measure specifically, what I do check is the relative importance. Here is an example.

So rather than just asking what the top ten are, look at the relative numbers. Yesterday one of the practicum students asked me about a feature importance where they said “oh, I think these three are important” and I pointed out that the top one was a thousand times more important than the second one. So in that case, it’s like “no, don’t look at the top three, look at the one that’s a thousand times more important and ignore all the rest.” Your natural tendency is to want to be precise and careful, but this is where you need to override that and be very practical. This thing is a thousand times more important. Don’t spend any time on anything else. Then you can go and talk to the manager of your project and say this thing is a thousand times more important. And then they might say “oh, that was a mistake. It shouldn’t have been in there. We don’t actually have that information at the decision time or for whatever reason we can’t actually use that variable.” So then you could remove it and have a look. Or they might say “gosh, I had no idea that was by far more important than everything else put together. So let’s forget this random forest thing and just focus on understanding how we can better collect that one variable and better use that one variable.”

That’s something which comes up quite a lot, and it actually came up in another place just yesterday. Another practicum student asked me “I’m doing this medical diagnostics project and my R² is 0.95 for a disease which I was told is very hard to diagnose. Is this random forest a genius or is something going wrong?” And I said remember, the second thing you do after you build a random forest is feature importance, so do feature importance and what you’ll probably find is that the top column is something that shouldn’t be there. So that’s what happened. He came back to me half an hour later and said “yeah, I did the feature importance and you were right. The top column was basically another encoding of the dependent variable. I’ve removed it and now my R² is -0.1 so that’s an improvement.”
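On the question of units, here is a quick check. It is a sketch that assumes the importances come from scikit-learn's `feature_importances_` attribute; the synthetic data and the roles of the columns are made up for illustration. That attribute is normalized so the values sum to one, and the ratio between the top two features is the kind of relative number worth looking at.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
# Hypothetical target: column 0 dominates, column 1 matters a little,
# columns 2 and 3 are pure noise.
y = 10 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=500)

m = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
imp = m.feature_importances_

order = np.argsort(imp)[::-1]          # feature indices, most important first
total = imp.sum()                      # scikit-learn normalizes these to sum to 1
ratio = imp[order[0]] / imp[order[1]]  # how dominant is the top feature?
print(total, order, ratio)
```

If `ratio` is in the hundreds or thousands, the ranking below the top feature is not worth much attention until that one variable has been understood.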

The other thing I like to look at is this chart [50:03]:

Basically it says where things flatten off in terms of which ones I should be really focusing on. So that’s the most important one. When I did credit scoring in telecommunications, I found there were nine variables that basically predicted very accurately who was going to end up paying for their phone and who wasn’t. Apart from ending up with a model that saved them three billion dollars a year in fraud and credit costs, it also let them basically rejig their process so they focused on collecting those nine variables much better.

Partial dependence [50:46]

This is an interesting one. Very important but in some ways kind of tricky to think about.

Let’s come back to how we calculate this in a moment, but the first thing to realize is that the vast majority of the time, when somebody shows you a chart, it will be a univariate chart: they’ll just grab the data from the database and plot X against Y. Then managers have a tendency to want to make a decision. So it would be “oh, there’s this drop-off here, so we should stop dealing in equipment made between 1990 and 1995.” This is a big problem because real world data has lots of these interactions going on. So maybe there was a recession going on around the time that those things were being sold, or maybe around that time people were buying more of a different type of equipment. So generally what we actually want to know is, all other things being equal, what’s the relationship between YearMade and SalePrice. Because if you think about the drivetrain approach idea of the levers, you really want a model that says if I change this lever, how will it change my objective. It’s by pulling them apart using partial dependence that you can say actually this is the relationship between YearMade and SalePrice all other things being equal:

So how do we calculate that?

Student: For the variable YearMade, for example, you keep all other variables constant. Then you are going to pass every single value of the YearMade, train the model after that. So for every model you’ll have light blue lines and the median is going to be the yellow line.

Jeremy: So let’s try and draw that. By “leave everything else constant”, what she means is leave them at whatever they are in the dataset. So just like when we did feature importance, we are going to leave the rest of the dataset as it is. And we’re going to do partial dependence plot for YearMade. So we’ve got all of these other rows of data that we will just leave as they are. Instead of randomly shuffling YearMade, what we are going to do is replace every single value with exactly the same thing — 1960. Just like before, we now pass that through our existing random forests which we have not retrained or changed in any way to get back out a set of predictions y1960. Then we can plot that on a chart — YearMade against partial dependence.
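A minimal sketch of the procedure just described, assuming a scikit-learn random forest and a made-up two-column dataset (the column roles, coefficients, and the helper name `partial_dependence_curves` are hypothetical). The model is fitted once and never retrained; only the YearMade column is overwritten with a constant before predicting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
year_made = rng.integers(1960, 2009, size=n)
hours = rng.uniform(0, 10_000, size=n)
# Hypothetical log sale price: rises with year made, falls with hours used.
log_price = 0.03 * (year_made - 1960) - 0.00005 * hours + rng.normal(0, 0.05, n)

X = np.column_stack([year_made, hours]).astype(float)
m = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, log_price)

def partial_dependence_curves(model, X, col, values):
    """Return one prediction curve per row: rows of X by candidate values."""
    curves = np.empty((X.shape[0], len(values)))
    for j, v in enumerate(values):
        X_mod = X.copy()       # every other column stays as it is in the data
        X_mod[:, col] = v      # replace the whole column with one constant
        curves[:, j] = model.predict(X_mod)
    return curves

years = np.arange(1960, 2009)
curves = partial_dependence_curves(m, X, col=0, values=years)
pdp = np.median(curves, axis=0)   # summary across all rows, one value per year
```

Keeping one curve per row, rather than averaging straight away, is what makes it possible to look at individual rows as well as a summary across them.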

Now we can do that for 1961, 1962, 1963, and so forth. We can do that on average for all of them, or we could do it just for one of them. So when we do it for just one of them and we change its YearMade and pass that single thing through our model, that gives us one of these blue lines. So each one of these blue lines is a single row as we change its YearMade from 1960 up to 2008. So then we can just take the median of all of these blue lines to say on average what’s the relationship between YearMade and price all other things being equal. Why is it that it works? Why is it that this process tells us the relationship between YearMade and price all other things being equal? Maybe it’s good to think about a really simplified approach [56:03]. A really simplified approach would say what’s the average auction? What’s the average sale date, what’s the most common type of machine we sell? In which location do we mostly sell things? And we could come up with a single row that represents the average auction and then we could say okay, let’s run that row through the random forest but replace its YearMade with 1960 and then do it again with 1961 and we could plot those on our little chart. That would give us a version of the relationship between YearMade and sale price all other things being equal. But what if tractors looked like that and backhoe loaders looked like a flat line:

Then taking the average one would hide the fact that there are these totally different relationships. So instead, we basically say, okay, our data tells us what kinds of things we tend to sell, who we tend to sell them to, and when we tend to sell them, so let’s use that. Then we actually find out, for every blue line, here are actual examples of these relationships. So then what we can do is, as well as plotting the median, we can do a cluster analysis to find out a few different shapes.

In this case, they all look pretty much like different versions of the same thing with different slopes, so my main takeaway from this would be that the relationship between sale price and year made is basically a straight line. And remember, this was the log of sale price, so this is actually showing us an exponential. So this is where I would then bring in the domain expertise, which is like “okay, things depreciate over time by a constant ratio so therefore, I would expect year made to have this exponential shape.” As I mentioned, at the very start of my machine learning project, I generally try to avoid using domain expertise as much as I can and let the data do the talking. So one of the questions I got this morning was “there’s like a sale ID and model ID, I should throw those away, right? Because they are just IDs.” No. Don’t assume anything about your data. Leave them in and if they turn out to be super important predictors, you want to find out why that is. But now I’m at the other end of my project. I’ve done my feature importance, I’ve pulled out the stuff which is from that dendrogram (i.e. redundant features), I’m looking at the partial dependence and now I’m thinking okay, is this shape what I expected? Even better, before you plot this, first of all think about what shape you would expect this to be. Because it’s always easy to justify to yourself after the fact, oh, I knew it would look like this. So what shape do you expect, and then is it that shape? In this case, I’d say this is what I would expect, whereas the previous plot is not what I’d expect. So the partial dependence plot has really pulled out the underlying truth.

Question: Say you have 20 features that are important, are you going to measure the partial dependence for every single one of them [1:00:05]? If there are twenty features that are important, then I will do the partial dependence for all of them where important means like it’s a lever I can actually pull, the magnitude of its size is not much smaller than the other nineteen, you know, based on all these things it’s a feature I ought to care about then I will want to know how it’s related. It’s pretty unusual to have that many features that are important both operationally and from a modeling point of view in my experience.

Question: How do you define importance [1:00:58]? Important means it’s a lever (i.e. something I can change) and it’s on the spiky end of this tail (left):

Or maybe it’s not a lever directly. Maybe it’s like zip code and I can’t actually tell my customers where to live but I could focus my new marketing attention on a different zip code.

Question: Would it make sense to do pairwise shuffling for every combination of two features and hold everything else constant in feature importance to see interactions and compare scores [1:01:45]? You wouldn’t do that so much for partial dependence. I think your question is really getting to the question of could we do that for feature importance. I think interaction feature importance is a very important and interesting question. But doing it by randomly shuffling every pair of columns, if you’ve got a hundred columns, sounds computationally intensive, possibly infeasible. So what I’m going to do is, after we talk about tree interpreter, I’ll talk about an interesting but largely unexplored approach that will probably work.

Tree interpreter [1:02:43]

Prince: I was thinking of this as being like feature importance, but feature importance is for the complete random forest model, and this tree interpreter is feature importance for a particular observation. So let’s say it’s about hospital readmission. If patient A is going to be readmitted to a hospital, which feature for that particular patient is going to have an impact, and how can we change that? It is calculated starting from the mean prediction, then seeing how each feature changes the prediction for that particular patient.

Jeremy: I’m smiling because that was one of the best examples of technical communication I’ve heard in a long time, so it’s really good to think about why that was effective. What Prince did there was, he used as specific an example as possible. Humans are much less good at understanding abstractions. So if you say “it takes some kind of feature, and then there’s an observation in that feature”, that is hard to follow, whereas a hospital readmission is concrete. So we take a specific example. The other thing he did that was very effective was to take an analogy to something we already understand. We already understand the idea of feature importance across all of the rows in a dataset. So now we are going to do it for a single row. One of the things I was really hoping we would learn from this experience is how to become effective technical communicators. So that was a really great role model from Prince of using all the tricks we have at our disposal for effective technical communication. Hopefully you found that a useful explanation. I don’t have a lot to add to that other than to show you what that looks like.

With the tree interpreter, we picked out a row [1:04:56]:

Remember when we talked about the confidence intervals at the very start (i.e. the confidence based on tree variance). We said you mainly use that for a row. So this would also be for a row. So it’s like “why is this patient likely to be readmitted?” Here is all the information we have about that patient or in this case this auction. Why is this auction so expensive? So then we call ti.predict and we get back the prediction of the price, the bias (i.e. the root of the tree — so this is just the average price for everybody so this is always going to be the same), and then the contributions which is how important is each of these things:

The way we calculated that was to say at the very start, the average price was 10. Then we split on enclosure. For those with this enclosure, the average was 9.5. Then we split on year made less than 1990 and for those with that year made, the average price was 9.7. Then we split on the number of hours on the meter, and with this branch, we got 9.4.

We then have a particular auction which we pass through the tree. It just so happens that it takes the topmost path. One row can only have one path through the tree. So we ended up at 9.4. Then we can create a little table. As we go through, we start at the top and we start with 10, which is our bias. And we said enclosure resulted in a change from 10 to 9.5 (i.e. -0.5). Year made changed it from 9.5 to 9.7 (i.e. +0.2), then meter changed it from 9.7 down to 9.4 (i.e. -0.3). Then if we add all that together (10 - 0.5 + 0.2 - 0.3 = 9.4), lo and behold, that’s the prediction.
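The little table can be written out in a few lines of plain Python. This is only the arithmetic of the worked example above, not the tree interpreter library itself; the feature labels are shorthand for the three splits on the path.

```python
# Node averages along the single path this row takes:
# root (bias) -> enclosure split -> year made split -> meter split.
path = [
    ("bias", 10.0),
    ("enclosure", 9.5),
    ("year made", 9.7),
    ("meter", 9.4),
]

bias = path[0][1]
contributions = {}
previous = bias
for feature, node_mean in path[1:]:
    # A feature's contribution is the change in node average at its split.
    contributions[feature] = contributions.get(feature, 0.0) + (node_mean - previous)
    previous = node_mean

# Bias plus all contributions gets us back to the leaf value.
prediction = bias + sum(contributions.values())

for feature, delta in contributions.items():
    print(f"{feature:>10}: {delta:+.1f}")
print(f"{'prediction':>10}: {prediction:.1f}")
```

Using `contributions.get(feature, 0.0)` means a feature that is split on more than once along the path simply accumulates its changes.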

Which takes us to our Excel spreadsheet [1:08:07]:

Last week, we used Excel for this because there wasn’t a good Python library for doing waterfall charts. So we saw we got our starting point, which is the bias, and then we had each of our contributions and we ended up with our total. The world is now a better place because Chris has created a Python waterfall chart module for us and put it on pip. So never again will we have to use Excel for this. I wanted to point out that waterfall charts have been very important in business communications at least as long as I’ve been in business, so that’s about 25 years. Python is maybe a couple of decades old. But despite that, no one in the Python world ever got to the point where they actually thought “you know, I’m gonna make a waterfall chart”, so they didn’t exist until two days ago. Which is to say the world is full of stuff which ought to exist and doesn’t, and which doesn’t necessarily take a heck of a lot of time to build. It took Chris about 8 hours, so a hefty amount but not unreasonable. And now forevermore, when people want a Python waterfall chart, they will end up at Chris’ Github repo and hopefully find lots of other USF contributors who have made it even better.

In order for you to help improve Chris’ Python waterfall, you need to know how to do that. So you are going to need to submit a pull request. Life becomes very easy for submitting pull requests if you use something called hub. What they suggest you do is alias git to hub, because it turns out that hub is actually a strict superset of git. What it lets you do is go git fork, git push, and git pull-request, and you’ve now sent Chris a pull request. Without hub, this is actually a pain and requires going to the website and filling in forms and stuff. So this gives you no reason not to do a pull request. I mention this because when you are interviewing for a job, I can promise you that the person you are talking to will check your github, and if they see you have a history of submitting thoughtful pull requests that are accepted to interesting libraries, that looks great. It looks great because it shows you’re somebody who actually contributes. It also shows, if they are being accepted, that you know how to create code that fits with people’s coding standards, has appropriate documentation, passes their tests and coverage, and so forth. So when people look at you and they say oh, here is somebody with a history of successfully contributing accepted pull requests to open-source libraries, that’s a great part of your portfolio. And you can specifically refer to it. So either “I’m the person who built Python waterfall, here is my repo” or “I’m the person who contributed currency number formatting to Python waterfall, here is my pull request.” Anytime you see something that doesn’t work right in any open source software you use, it is not a problem, it’s a great opportunity, because you can fix it and send in the pull request. So give it a go. It actually feels great the first time you have a pull request accepted. And of course, one big opportunity is the fastai library. Thanks to one of our students, we now have docstrings for most of the fastai.structured library, which again came via a pull request.

Does anybody have any questions about how to calculate any of these random forest interpretation methods or why we might want to use them [1:12:50]? Towards the end of the week, you’re going to need to be able to build all of these yourself from scratch.

Question: Just looking at the tree interpreter, I noticed that some of the values are nans. I get why you keep them in the tree but how can nan have a feature importance [1:13:19]? Let me pass it back to you. Why not? In other words, how is nan handled in Pandas and therefore in the tree? Does anybody remember? Notice these are all categorical variables, so how does Pandas handle nans in a categorical variable and how does fastai deal with them? Pandas sets them to the -1 category code and fastai adds one to all of the category codes, so it ends up being zero. In other words, remember by the time it hits the random forest it’s just a number, and it’s just zero. And we map it back to the descriptions back here. So the question really is why shouldn’t the random forest be able to split on zero? It’s just another number. So it could be nan, high, medium, low = 0, 1, 2, 3. Missing values are one of these things that are generally taught really badly. Often people get taught here are some ways to remove columns with missing values or remove rows with missing values or to replace missing values. That’s never what we want because missingness is very very very often interesting. So we actually learnt from our feature importance that coupler system nan is one of the most important features. For some reason, well, I could guess, right? Coupler system nan presumably means this is a kind of industrial equipment that doesn’t have a coupler system. Now I don’t know what kind that is, but apparently it’s a more expensive kind.
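The encoding described above is easy to see with a tiny Pandas categorical. This sketch uses a made-up column and mimics the "add one" step by hand rather than calling any fastai function.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing value.
usage = pd.Series(["High", "Low", np.nan, "Medium"])
cat = pd.Categorical(usage, categories=["High", "Medium", "Low"], ordered=True)

codes = cat.codes        # pandas encodes the missing value as -1
shifted = codes + 1      # shifting every code by one turns missing into 0

print(codes.tolist())    # [0, 2, -1, 1]
print(shifted.tolist())  # [1, 3, 0, 2]
```

After the shift, missing is just the integer 0 alongside 1, 2, 3, so a tree can split on it like any other value.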

I did this competition for university grant research success where by far the most important predictors were whether or not some of the fields were null [1:15:41]. It turned out that this was data leakage that these fields only got filled in most of the time after a research grant was accepted. So it allowed me to win that Kaggle competition but didn’t actually help the university very much.

Extrapolation [1:16:16]

I am going to do something risky and dangerous, which is we are going to do some live coding. The reason we are going to do some live coding is I want to explore extrapolation together with you, and I also want to give you a feel for how you might go about writing code quickly in this notebook environment. This is the kind of thing you are going to need to be able to do in the real world and in the exam: quickly create the kind of code that we are going to talk about.

I really like creating synthetic datasets anytime I’m trying to investigate the behavior of something because if I have a synthetic dataset, I know how it should behave.

Which reminds me, before we do this, I promised that we would talk about interaction importance and I just about forgot.

Interaction importance [1:17:24]

Tree interpreter tells us the contributions for a particular row based on the difference in the tree. We could calculate that for every row in our dataset and add them up. That would tell us feature importance. And it would tell us feature importance in a different way. One way of doing feature importance is by shuffling the columns one at a time. Another way is by doing tree interpreter for every row and adding them up. Neither is more right than the other. They are actually both quite widely used, so this is kind of type 1 and type 2 feature importance. So we could try to expand this a little bit, to do not just single variable feature importance, but interaction feature importance. Now here is the thing. What I’m going to describe is very easy to describe. It was described by Breiman right back when random forests were first invented, and it is part of the commercial software product from Salford Systems, who have the trademark on random forests. But it is not part of any open source library I’m aware of, and I’ve never seen an academic paper that actually studies it closely. So what I’m going to describe here is a huge opportunity, but there are also lots and lots of details that need to be fleshed out. But here is the basic idea.

This particular difference here (in red) is not just because of year made but because of a combination of year made and enclosure [1:19:15]:

The fact that this is 9.7 is because enclosure was in this branch and year made was in this branch. So in other words, we could say the contribution of enclosure interacted with year made is -0.3.

So what about the difference between 9.5 and 9.4? That’s an interaction of year made and hours on the meter. I’m using star here not to mean “times” but to mean “interacted with”. It’s a common way of doing things like R’s formulas do it this way as well. So year made interacted with meter has a contribution of -0.1.

Perhaps we could also say from 10 to 9.4, this also shows an interaction between meter and enclosure with one thing in between them. So we could say meter interacted with enclosure equals …and what should it be? Should it be -0.6? Some ways that seems unfair because we are also including the impact of year made. So maybe it should be -0.6 and maybe we should add back this 0.2 (9.5 → 9.7). These are like details that I actually don’t know the answer to. How should we best assign a contribution to each pair of variables in this path? But clearly conceptually we can. The pairs of variables in that path all represent interactions.
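To make the open question concrete, here is a sketch of one possible convention, and it is only an assumption: credit each ordered pair of splits on the path with the whole change from just before the first split to just after the second. It reproduces the three numbers discussed above, including the -0.6 that arguably double counts year made.

```python
from itertools import combinations

# Same path as the worked example: (feature split on, node average after the split).
bias = 10.0
path = [("enclosure", 9.5), ("year made", 9.7), ("meter", 9.4)]

# Node average just *before* each split: the bias, then each preceding node.
before = [bias] + [mean for _, mean in path[:-1]]

interactions = {}
for i, j in combinations(range(len(path)), 2):
    first, second = path[i][0], path[j][0]
    # Assumed convention: change from before the first split to after the second.
    interactions[(first, second)] = round(path[j][1] - before[i], 10)

for pair, value in interactions.items():
    print(f"{pair[0]} * {pair[1]}: {value:+.1f}")
```

A different convention, such as subtracting out the contributions of any splits that sit between the pair, would change the enclosure and meter entry and leave the adjacent pairs unchanged. Which choice is best is exactly the detail left open here.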

Question: Why don’t you force them to be next to each other in the tree [1:21:47]? I’m not going to say it’s the wrong approach. I don’t think it’s the right approach though. Because it feels like in this path, meter and enclosure are interacting. So it seems like not recognizing that contribution is throwing away information. But I’m not sure. I had one of my staff at Kaggle actually do some R&D on this a few years ago and they actually found (I wasn’t close enough to know how they dealt with these details), but they got it working pretty well. But unfortunately it never saw the light of day as a software product. But this is something maybe a group of you could get together and build. Do some googling to check, but I really don’t think that there are any interaction feature importance parts of any open source library.

Question: Wouldn’t this exclude interactions though between variables that don’t matter until they interact? So say your row never chooses to split down that path, but that variable interacting with another one becomes your most important split [1:22:56]. I don’t think that happens. Because if there is an interaction that’s important only because it’s an interaction (and not in a univariate basis), it will appear sometimes, assuming that you set max features to less than one, so therefore it will appear in some path.

Question: What is meant by interaction? Is it multiplication, ratio, addition [1:23:31]? Interaction means appears on the same path through a tree. In the above example, there is an interaction between enclosure and year made because we branched on enclosure and then we branched on year made. So to get to 9.7, we have to have some specific value of enclosure and some specific value of year made.

Question: What if you went down the middle leaves between the two things you are trying to observe and you would also take into account what the final measure is? I mean if we extend the tree downwards, you’d have many measures both of like the two things you are trying to look at and also the in between steps. There seems to be a way to average information out in between them [1:24:03]? There could be. I think what we should do is talk about this on the forum. I think this is fascinating and I hope we build something great, but I need to do my live coding. That was a great discussion. Keep thinking about it and do some experiments.

Back to Live Coding [1:24:50]

So to experiment with that, you almost certainly want to create a synthetic dataset first. It’s like y = x1 + x2 + x1*x2 or something. Something where you know there is this interaction effect and there isn’t that interaction effect, and you want to make sure that the feature importance you get at the end is what you expected.

So probably the first step would be to do single variable feature importance using the tree interpreter style approach [1:25:14]. One nice thing about this is it doesn’t really matter how much data you have. All you have to do to calculate feature importance is just slide through the tree. So you should be able to write it in a way that’s actually pretty fast, so even writing it in pure Python might be fast enough depending on your tree size.
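A rough sketch of that first step, under several assumptions: a single scikit-learn `DecisionTreeRegressor` stands in for the forest, the synthetic target is the `y = x1 + x2 + x1*x2` example plus an irrelevant third column, and rows are aggregated by mean absolute contribution (signed contributions summed over the training rows would largely cancel). The helper name `path_contributions` is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))
y = X[:, 0] + X[:, 1] + X[:, 0] * X[:, 1]   # column 2 plays no role

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

def path_contributions(fitted_tree, row):
    """Walk one row down the tree; return (bias, per-feature contributions)."""
    t = fitted_tree.tree_
    node = 0
    bias = t.value[0, 0, 0]                  # average target at the root
    contribs = np.zeros(t.n_features)
    while t.children_left[node] != -1:       # -1 marks a leaf
        feat = t.feature[node]
        # scikit-learn compares float32 inputs against the threshold.
        go_left = np.float32(row[feat]) <= t.threshold[node]
        child = t.children_left[node] if go_left else t.children_right[node]
        contribs[feat] += t.value[child, 0, 0] - t.value[node, 0, 0]
        node = child
    return bias, contribs

bias = tree.tree_.value[0, 0, 0]
all_contribs = np.array([path_contributions(tree, row)[1] for row in X])
importance = np.abs(all_contribs).mean(axis=0)
print(importance)
```

Checking that `bias + all_contribs.sum(axis=1)` matches `tree.predict(X)` is a cheap way to confirm the walk through the tree is right before extending it to pairs of features.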

We are going to talk about extrapolation and the first thing I want to do is create a synthetic dataset that has a simple linear relationship. We are going to pretend it’s like a time series. So we need to create some x values. The easiest way to create some synthetic data of this type is to use linspace which just creates some evenly spaced data between start and stop by default 50 observations.

Then we are going to create a dependent variable, so let’s assume there is a linear relationship between x and y, and let’s add a little bit of randomness to it. random.uniform draws values between low and high, so we could add somewhere between -0.2 and 0.2, for example.

The next thing we need is a shape which is basically what dimensions do you want these random numbers to be, and obviously we want them to be the same shape as x’s shape. So we can just say x.shape.

So in other words, (50,) is x.shape. Remember when you see something in parentheses with a comma, that’s a tuple with just one thing in it. So this is shape 50 and so we added 50 random numbers. Now we can plot those.
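Put together, the steps above look roughly like this. The endpoints 0 and 1 passed to `linspace` are an assumption, since the notes only say "between start and stop".

```python
import numpy as np

x = np.linspace(0, 1)                          # 50 evenly spaced points by default
noise = np.random.uniform(-0.2, 0.2, x.shape)  # one random offset per x value
y = x + noise                                  # linear relationship plus a little randomness

print(x.shape, y.shape)
```

With matplotlib imported as `plt`, `plt.scatter(x, y)` is enough to throw this up in a plot.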

Alright, so there is our data. Whether you are working as a data scientist or doing your exams in this course, you need to be able to quickly whip up a dataset like that and throw it up in a plot without thinking too much. As you can see, you don’t really have to remember much, if anything. You just have to know how to hit shift + tab to check the names of parameters, or google something to try and find linspace if you forgot what it’s called.

So let’s assume that’s our data [1:28:33]. We’re now going to build a random forest model, and what I want to do is build a random forest model that kind of acts as if this is a time series. So I’m going to take the left part as a training set, and take the right part as our validation or test set, just like we did in groceries or bulldozers.

We can use exactly the same kind of code that we used in split_vals. So we can say:

x_trn, x_val = x[:40], x[40:]

That splits it into the first 40 versus the last 10. We can do the same thing for y and there we go.

y_trn, y_val = y[:40], y[40:]

The next thing to do is we want to create a random forest and fit it which requires x and y.

m = RandomForestRegressor().fit(x, y)

That’s actually going to give an error and the reason why is that it expects x to be a matrix, not a vector, because it expects x to have a number of columns of data.

So it’s important to know that a matrix with one column is not the same thing as a vector.

So if I try to run this, “Expected 2D array, got 1D array instead”:

So we need to convert a 1D array into a 2D array. Remember I said x.shape is (50,). So x has one axis and x’s rank is 1. The rank of a variable is equal to the length of its shape, i.e. how many axes it has. We can think of a vector as an array of rank 1 and a matrix as an array of rank 2. I very rarely use words like vector and matrix because they are kind of meaningless: they are specific examples of something more general, which is that they are all N dimensional tensors or N dimensional arrays. So an N dimensional array we can say is a tensor of rank N. They basically mean kind of the same thing. Physicists get crazy when you say that because to a physicist, a tensor has quite a specific meaning, but in machine learning, we generally use it in the same way.

So how do we turn a one dimensional array into a two dimensional array? There are a couple of ways we can do it but basically we slice it. Colon (:) means give me everything in that axis. :,None means give me everything in the first axis (which is the only axis we have) and then None is a special indexer which means add a unit axis here. So let me show you.

That is of shape (50, 1), so it’s a rank 2. It has two axes. One of them is a very boring axis — it’s a length one axis. So let’s move None to the left. There is (1, 50). Then to remind you, the original is (50,).

So you can see I can put None as a special indexer to introduce a new unit axis there. So x[None,:] has one row and fifty columns. x[:,None] has fifty rows and one column — so that’s what we want. This kind of playing around with ranks and dimension is going to become increasingly important in this course and in the deep learning course. So spend a lot of time slicing with None, slicing with other things, try to create 3 dimensional, 4 dimensional tensors and so forth. I’ll show you two tricks.

The first is you never ever need to write ,: as it’s always assumed. So these are exactly the same thing:

And you see that in code all the time, so you need to recognize it.

The second trick: x[:,None] is adding an axis in the second dimension (or I guess the index 1 dimension). What if I always want to put it in the last dimension? Often our tensors change dimensions without us looking, because you went from a one channel image to a three channel image, or you went from a single image to a mini batch of images. Suddenly, you get new dimensions appearing. So to make things general, I would write ... which means as many dimensions as you need to fill this up. So in this case (x[..., None].shape), it’s exactly the same, but I would always try to write it that way because it means it’s going to continue to work as I get higher dimensional tensors.
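The indexing variants discussed above, side by side. The `batch` array is a made-up stand-in for a mini batch of images, added to show where `...` and `:` start to differ.

```python
import numpy as np

x = np.linspace(0, 1)

print(x.shape)             # (50,)
print(x[:, None].shape)    # (50, 1)  fifty rows, one column
print(x[None, :].shape)    # (1, 50)  one row, fifty columns
print(x[None].shape)       # (1, 50)  the trailing ",:" is implied
print(x[..., None].shape)  # (50, 1)  new axis always goes last

# With more axes, ":" and "..." stop being interchangeable.
batch = np.zeros((8, 28, 28))
print(batch[:, None].shape)     # (8, 1, 28, 28)
print(batch[..., None].shape)   # (8, 28, 28, 1)
```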

So in this case, I want 50 rows and one column, so I’ll call that x1. Let’s now use that here and so this is now a 2D array and so I can create my random forest.

Then I could plot that, and this is where you’re going to have to turn your brains on because the folks this morning got this very quickly which was super impressive. I’m going to plot y_trn against m.predict(x_trn). Before I hit go, what is this going to look like? It should basically be the same. Our predictions hopefully are the same as the actuals. So this should fall on a line but there is some randomness so it won’t quite.

That was the easy one. Let’s now do the hard one, the fun one. What is that going to look like?

Think about what trees do and think about the fact that we have a validation set on the right and a training set on the left:

So think about a forest is just a bunch of trees.

Tim: I’m guessing since all the new data is actually outside of the original scope, so it’s all going to be basically the same — it’s like one huge group [1:37:15].

Jeremy: Yeah, right. So forget the forest, let’s create one tree. So we are probably going to split somewhere around here first, then split somewhere here, … So our final split is the rightmost node. When we take a point from the validation set and put it through the forest, it is going to end up predicting the rightmost average. It can’t predict anything higher than that because there is nothing higher to average.

So this is really important to realize: a random forest is not magic. It’s just returning the average of nearby observations, where nearby is kind of in this “tree space”. So let’s run it and see if Tim is right:

Holy crap, that’s awful. If you don’t know how random forests work then this is going to totally screw you. If you think that it’s actually going to be able to extrapolate to any kind of data it hasn’t seen before, particularly future time period, it’s just not. It just can’t. It’s just averaging stuff it’s already seen. That’s all it can do.
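The whole experiment fits in a few lines. This is a sketch assuming scikit-learn, with a fixed seed, `linspace(0, 1)` endpoints, and a small `n_estimators` chosen here for reproducibility rather than taken from the lecture.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

np.random.seed(0)
x = np.linspace(0, 1)
y = x + np.random.uniform(-0.2, 0.2, x.shape)

x1 = x[..., None]                      # 50 rows, one column
x_trn, x_val = x1[:40], x1[40:]        # left part trains, right part validates
y_trn, y_val = y[:40], y[40:]

m = RandomForestRegressor(n_estimators=10, random_state=0).fit(x_trn, y_trn)
pred_val = m.predict(x_val)

print(pred_val.round(3))               # the same value repeated ten times
print(y_trn.max(), y_val.mean())
```

Every validation point lies to the right of all training points, so each tree sends all ten of them to its rightmost leaf and the forest returns one flat value that cannot exceed the largest training target.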

Okay, so we are going to be talking about how to avoid this problem. We talked a little bit in the last lesson about trying to avoid it by avoiding unnecessary time dependent variables where we can. But in the end, if you really have a time series that looks like this, we actually have to deal with the problem. One way we could deal with the problem would be to use a neural net. Use something that actually has a function or shape that can fit something like this so it will extrapolate nicely:

Another approach would be to use all the time series techniques you guys are learning about in the morning class to fit some kind of time series and then detrend it. Then you’ll end up with detrended dots and then use the random forest to predict those. That’s particularly cool because imagine your random forest was actually trying to predict data which came from two different states. So the blue ones are down there, and the red ones are up here.

If you try to use a random forest, it’s going to do a pretty crappy job because time is going to seem much more important. So it’s basically still going to split like this and split like that, then finally once it gets down to left corner, it will be like “oh okay, now I can see the difference between the states.”

In other words, when you’ve got this big time piece going on, you’re not going to see the other relationships in the random forest until every tree deals with time. So one way to fix this would be with a gradient boosting machine (GBM). What a GBM does is, it creates a little tree, and runs everything through that first little tree (which could be the time tree), then it calculates the residuals and the next little tree just predicts the residuals. So it would be kind of like detrending it, right? GBMs still can’t extrapolate to the future but at least they can deal with time-dependent data more conveniently.

We are going to be talking about this quite a lot more over the next couple of weeks, and in the end the solution is going to be to just use neural nets. But for now, using some kind of time series analysis to detrend it, and then using a random forest on that, isn’t a bad technique at all. If you are playing around with something like the Ecuador groceries competition, that would be a really good thing to fiddle around with.
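A minimal sketch of detrend-then-forest on the same synthetic data. It assumes the simplest possible "time series technique", a straight line fitted with `np.polyfit` on the training part only; the seed and `n_estimators` are arbitrary choices made here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

np.random.seed(0)
x = np.linspace(0, 1)
y = x + np.random.uniform(-0.2, 0.2, x.shape)
x_trn, x_val = x[:40], x[40:]
y_trn, y_val = y[:40], y[40:]

# 1. Fit a straight-line trend using the training data only.
slope, intercept = np.polyfit(x_trn, y_trn, deg=1)

def trend(t):
    return slope * t + intercept

# 2. Train the forest on what is left after removing the trend.
m = RandomForestRegressor(n_estimators=10, random_state=0)
m.fit(x_trn[:, None], y_trn - trend(x_trn))

# 3. Predict residuals, then add the extrapolated trend back.
pred_detrended = trend(x_val) + m.predict(x_val[:, None])

# For comparison: a forest fitted directly on the raw targets.
plain = RandomForestRegressor(n_estimators=10, random_state=0)
pred_plain = plain.fit(x_trn[:, None], y_trn).predict(x_val[:, None])

print(pred_plain.round(3))
print(pred_detrended.round(3))
```

The raw forest is flat across the validation range, while the detrended version keeps rising because the line, not the forest, carries the extrapolation. The forest part of the prediction is still a constant out there.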

