Machine Learning 1: Lesson 12

Hiromi Suenaga
56 min read · Oct 20, 2018


My personal notes from machine learning class. These notes will continue to be updated and improved as I continue to review the course to “really” understand it. Much appreciation to Jeremy and Rachel who gave me this opportunity to learn.

Lessons: 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 · 9 · 10 · 11 · 12

Video / Notebook

I thought what we might do today is to finish off where we were in this Rossmann notebook, looking at time series forecasting and structured data analysis. Then we might do a little mini review of everything we’ve learnt because, believe it or not, this is the end. There is nothing more to know about machine learning, other than everything that you’re going to learn next semester and for the rest of your life. But anyway, I’ve got nothing else to teach. So I’ll do a little review and then we’ll cover the most important part of the course, which is thinking about how to use this kind of technology appropriately and effectively, in a way that has a positive impact on society.

Last time, we got to the point where we were building this CompetitionMonthsOpen derived variable, and we actually truncated it down to be no more than 24 months. We talked about the reason why being that we wanted to use it as a categorical variable, because categorical variables, thanks to embeddings, have more flexibility in how the neural net can use them. And so that was kind of where we left off.

for df in (joined, joined_test):
    df["CompetitionMonthsOpen"] = df["CompetitionDaysOpen"]//30
    df.loc[df.CompetitionMonthsOpen>24, "CompetitionMonthsOpen"] = 24
joined.CompetitionMonthsOpen.unique()
array([24, 3, 19, 9, 0, 16, 17, 7, 15, 22, 11, 13, 2, 23, 12, 4, 10, 1, 14, 20, 8, 18, 6, 21, 5])

Let’s keep working through this. Because what’s happening in this notebook is stuff which is probably going to apply to most time series datasets that you work with. As we talked about although we used df.apply here, this is something where it’s running a piece of Python code over every row and that’s terrifically slow. So we only do that if we can’t find a vectorized pandas or numpy function that can do it to the whole column at once. But in this case, I couldn’t find a way to convert a year and a week number into a date without using arbitrary Python.

Also worth remembering is this idea of a lambda function. Anytime you’re trying to apply a function to every row of something or every element of a tensor, if there isn’t a vectorized version already, you are going to have to call something like DataFrame.apply which will run a function you pass to every element. So this is basically a map in functional programming. Since very often the function you want to pass to it is something you’re just going to use once and then throw away, it’s really common to use this lambda approach. So this lambda is creating a function just for the purpose of telling df.apply what to use.

for df in (joined, joined_test):
    df["Promo2Since"] = pd.to_datetime(df.apply(lambda x: Week(
        x.Promo2SinceYear, x.Promo2SinceWeek).monday(), axis=1).astype(pd.datetime))
    df["Promo2Days"] = df.Date.subtract(df["Promo2Since"]).dt.days

We could also have written this in a different way [3:16]. The following two cells are the same thing:

One approach is to define the function (create_promo2since(x)) and then pass it by name; the other is to define the function in place using lambda. So if you are not comfortable creating and using lambdas, it’s a good thing to practice, and playing around with df.apply is a good way to practice it.
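To make that concrete, here is a rough sketch of those two equivalent cells. The function name create_promo2since comes from the text above; I’m assuming Week is isoweek.Week, which is what the notebook’s imports appear to use:

from isoweek import Week
import pandas as pd

# Version 1: define a named function and pass it to df.apply by name
def create_promo2since(x):
    # build a date from the ISO year and week number, taking the Monday of that week
    return Week(x.Promo2SinceYear, x.Promo2SinceWeek).monday()

df["Promo2Since"] = pd.to_datetime(df.apply(create_promo2since, axis=1))

# Version 2: define the same function inline with a lambda, just for this one call
df["Promo2Since"] = pd.to_datetime(
    df.apply(lambda x: Week(x.Promo2SinceYear, x.Promo2SinceWeek).monday(), axis=1))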

Durations [4:32]

Let’s talk about this durations section which may at first seem a little specific but actually it turns out not to be. What we are going to do is we’re going to look at three fields: “Promo”, “StateHoliday”, “SchoolHoliday”

So basically what we have is a table of:

  • for each store, for each date, does that store have a promo going on on that date?
  • is there a school holiday in the region of that store on that date?
  • is there a state holiday in the region of that store on that date?

These kinds of things are events. And time series with events are very, very common. If you are looking at oil and gas drilling data, you’re trying to say: here is the flow through this pipe, here is an event representing when it set off some alarm, or here is an event where the drill got stuck, or whatever. So most time series, at some level, will tend to represent some events. The fact that an event happened at a given time is interesting in itself, but very often a time series will also show something happening before and after the event. For example, in this case, we are doing grocery sales prediction. If there’s a holiday coming up, it’s quite likely that sales will be higher before and after the holiday, and lower during the holiday, if this is a city based store. Because you have to stock up before you go away to bring things with you, then when you come back, you’ll have to refill the fridge, for instance. Although we don’t have to do this kind of feature engineering to create features specifically about being before or after a holiday, the more we can give the neural net the kind of information it needs, the less it’s going to have to learn it. The less it’s going to have to learn it, the more we can do with the data we already have and the more we can do with the size architecture we already have. So feature engineering, even with stuff like neural nets, is still important because it means that we will be able to get better results with whatever limited data we have, whatever limited computation we have.

So the basic idea here, therefore, is when we have events in our time series, we want to create two new columns for each event [7:20]:

  1. How long is it going to be until the next time this event happens.
  2. How long has it been since the last time that event happened.

So in other words, how long until the next state holiday, and how long since the previous state holiday. That’s not something which I am aware of as existing in a library or anything like that. So I wrote it up here by hand.

def get_elapsed(fld, pre):
    day1 = np.timedelta64(1, 'D')
    last_date = np.datetime64()        # sentinel "never seen" date
    last_store = 0
    res = []

    for s, v, d in zip(df.Store.values, df[fld].values, df.Date.values):
        if s != last_store:            # we've moved on to a new store: reset
            last_date = np.datetime64()
            last_store = s
        if v: last_date = d            # the event happened on this row's date
        res.append(((d - last_date).astype('timedelta64[D]') / day1))
    df[pre+fld] = res

So importantly, I need to do this by store. So I want to say, for this store, when was this store’s last promo (i.e. how long has it been since the last time it had a promo), how long it will be until the next time it has a promo, for instance.

Here is what I’m going to do. I’m going to create a little function that’s going to take a field name and I’m going to pass it each of Promo and then StateHoliday, and then SchoolHoliday. So let’s do school holiday for example. So we say field equals school holiday, and then we’ll say get_elapsed('SchoolHoliday', 'After'). So let me show you what that’s going to do. We are going to first of all sort by store and date. Now when we loop through this, we are going to be looping through within a store. So store #1, January the first, January the second, January the third, and so forth.

fld = 'SchoolHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

As we loop through each store, we are basically going to say is this row a school holiday or not [8:56]. If it is a school holiday, then we’ll keep track of this variable called last_date which says this is the last date where we saw a school holiday. So then we are going to append to our result the number of days since the last school holiday.

Importance of using zip [9:26]

There are a few interesting features. One is the use of zip. I could actually write this much more simply by writing for row in df.iterrows(): then grab the fields we want from each row. It turns out this is 300 times slower than the version that I have. Basically, iterating through a DataFrame and extracting specific fields out of a row has a lot of overhead. What’s much faster is to iterate through a numpy array. So if you take a Series (e.g. df.Store), and add .values after it, that grabs a numpy array of that series.

So here are three numpy arrays. One is the store IDs, one is whatever fld is (in this case, that’s school holiday), and one is the date. So now what I want to do is loop through the first one, the second one, and the third one of each of those lists together. And this is a really, really common pattern. I need to do something like this in basically every notebook I write. And the way to do it is with zip. So zip means loop through all of these lists at the same time, one element from each. Then this (s, v, d) is where we grab the element out of the first list, the second list, and the third list:

So if you haven’t played around much with zip, that’s a really important function to practice with. Like I said, I use it in pretty much every notebook I write — all the time you have to loop through a bunch of lists at the same time.
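As a tiny sketch of the pattern (illustrative, not the notebook’s code):

stores   = df.Store.values   # numpy array of store ids
holidays = df[fld].values    # numpy array of 0/1 flags for the field we care about
dates    = df.Date.values    # numpy array of datetime64 dates

# zip walks the three arrays in lockstep: the first element of each, then the second of each, ...
for s, v, d in zip(stores, holidays, dates):
    pass   # s, v, d all come from the same row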

So we are going to loop through every store, every school holiday, and every date [11:34].

Question: Is it looping through all the possible combinations of each of those [11:44]? No. It takes the first element of each together, then the second of each, and so on (1-1-1, 2-2-2, etc.).

So in this case, we basically want to say let’s grab the first store, the first school holiday, and the first date. So for store 1, January the first, school holiday was true or false. So if it is a school holiday, I’ll keep track of that fact by saying the last time I saw a school holiday was that date, and append how long it has been since the last school holiday. And if the store ID is different to the last store ID, then I’ve now got to a whole new store, in which case, I have to basically reset everything.

Question: What will happen to the first points that we don’t have a last holiday [12:39]? Yeah, so I just set this to some arbitrary starting point (np.datetime64()), it’s going to end up with, I can’t remember, either the largest or the smallest possible date. You may need to replace this with a missing value afterwards or zeros. The nice thing is though, thanks to ReLU’s, it’s very easy for a neural net to cut off extreme values. So in this case, I didn’t do anything special with it. I ended up with these like negative billion date time stamps and it still worked fine.

The next thing to note is there’s a bunch of stuff that I need to do to both the training set and the test set [13:35]. So in the previous section, I actually added this loop where I go for each of the training DataFrame and the test DataFrame, do these things:

In each cell, I did it for each of the data frames:

Coming up, there are a whole series of cells that I want to run first of all for the training set and then for the test set. In this case, the way I did that was I have two different cells here: one which sets df to be the training set, one which sets it to be the test set.

The way I use this is, I run just the first cell (i.e. skip the df=test[columns]) then I run all the cells underneath, so it does it all to the training set. Then I come back and run the second cell, then run all the cells underneath. So this notebook is not designed to be just run from top to bottom. But it’s designed to be run in this particular way. I mentioned that because this can be a handy trick to know. You could, of course, put all the stuff underneath in a function that you pass the data frame to and call it once with a test set, once with a training set. But I kind of like to experiment a bit, more interactively look at each step as I go. So this way is an easy way to run something on different data frames without turning it into a function.
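Roughly, the two cells look like this (a sketch; the exact column list isn’t reproduced in these notes):

# Cell 1 -- run this, then all the cells below it, to process the training set:
df = train[columns]

# Cell 2 -- come back later, run this instead, then rerun the cells below for the test set:
df = test[columns]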

If I sort by store and by date, then this is keeping track of the last time something happened [15:11]. So d - last_date is, therefore, going to end up telling me how many days it has been since the last school holiday:

So now if I sort date descending and call the exact same function, then it’s going to say how long until the next holiday:

So that’s kind of a nice little trick for adding arbitrary event times into your time series models. If you are doing, for example, the Ecuadorean groceries competition right now, maybe this kind of approach would be useful for various events in that as well.

Do it for state holiday, do it for promo, there we go:

fld = 'StateHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')
fld = 'Promo'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

Rolling function [16:11]

The next thing that we look at here is rolling functions. Rolling in pandas is how we create what we call windowing functions. Let’s say I had some data like this. What I could do is to say okay let’s create a window around this point of like 7 days.

Then I could take the average sales in that seven day window. Then I could do the same thing over here, take the average sales over that seven day window.

So if we do that for every point and join up those averages, you are going to end up with a moving average:

The more generic version of a moving average is a window function, i.e. something where you apply some function to some window of data around each point. Very often the windows that I’ve shown here are not actually what you want. If you’re trying to build a predictive model, you can’t include the future as part of a moving average. So quite often you actually need a window that ends at a point (rather than the point being in the middle of the window). So that’ll be our window function:

Pandas lets you create arbitrary window functions using rolling:

bwd = df[['Store']+columns].sort_index().groupby("Store").rolling(7, min_periods=1).sum()
fwd = df[['Store']+columns].sort_index(ascending=False).groupby("Store").rolling(7, min_periods=1).sum()

The first argument says how many time steps do I want to apply the function to. The second argument says if I’m at the edge, in other words, if I’m at the left edge of the above graph, should you make that a missing value because I don’t have seven days to average over, or what’s the minimum number of time periods to use. So here, I said 1. Then optionally you can also say do you want to set the window at the start of a period, the end of a period, or the middle of a period. Then within that, you can then apply whatever function you like. So here, I’ve got my weekly, by store, sums. So there’s a nice easy way of getting moving average or whatever else.
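As a toy sketch of that idea (not the notebook’s exact code), a backward-looking 7-step rolling sum per store might look like this:

import pandas as pd

toy = pd.DataFrame({'Store': [1]*5 + [2]*5,
                    'SchoolHoliday': [0, 1, 0, 0, 1, 1, 0, 0, 0, 1]})
# 7-step rolling sum within each store; windows shorter than 7 at the start still count
toy['SchoolHoliday_bw'] = (toy.groupby('Store')['SchoolHoliday']
                              .rolling(7, min_periods=1).sum()
                              .reset_index(level=0, drop=True))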

I should mention [19:20], if you go to the time series page on Pandas, there’s a long list of indices on the left. There’s lots because Wes McKinney who created this, he was originally in hedge fund trading, I believe. And his work was all about time series. So I think Pandas originally was very focused on time series and still it’s perhaps the strongest part of Pandas. So if you’re playing around with time series computations, you definitely owe it to yourself to try to learn this entire API. And there’s a lot of conceptual pieces around time stamps, date offsets, resampling and stuff like that to get your head around. But it’s totally worth it because otherwise you’ll be writing this stuff as loops by hand. It’s going to take you a lot longer than leveraging what Pandas already does. And of course Pandas will do it in highly optimized vectorized C code for you, whereas your version is going to loop in Python. So definitely worth, if you are doing stuff in time series, learning the full Pandas time series API. They are just about as strong as any time series API out there.

Okay, so at the end of all that, you can see here’s those starting point values I mentioned [20:56] — slightly on the extreme side. So you can see here, 17th of September, store 1 was 13 days after the last school holiday. The 16th was 12, 11, 10, so forth.

We are currently in a promotion. Here, this is one day before a promotion:

And to the left of it, we’ve got 9 days after the last promotion, and so forth. So that’s how we can add event counters to our time series, and it’s probably always a good idea when you are doing work with time series.

Categorical versus continuous [21:46]

So now we’ve done that, we’ve got lots of columns in our dataset and so we split them out into categorical versus continuous columns. We’ll talk more about that in the review section, but these are going to be all the things I’m going to create an embedding for:

And contin_vars are all the things that I’m going to feed directly into the model. So for example, we’ve got CompetitionDistance, which is the distance to the nearest competitor, and maximum temperature, and we have a categorical value DayOfWeek. So here, we’ve got maximum temperature, maybe like 22.1 because they use centigrade in Germany, and we’ve got distance to the nearest competitor, which might be 321.7km. Then we’ve got day of week, and maybe Saturday is a 6. So the first two numbers are going to go straight into the vector that we are going to be feeding into our neural net (we will see in a moment that we actually normalize them first, but more or less). But this categorical variable, we are not. We need to put it through an embedding. So we will have some embedding matrix of 7 by 4 (i.e. a dimension 4 embedding). So this will look up the 6th row to get back the four items. So day of week 6 will turn into a length 4 vector which will then get appended to those continuous values.

So that’s how our continuous and categorical variables are going to work.
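Here is a toy sketch of that flow (the numbers follow the example in the text; the indexing convention and random weights are just for illustration):

import numpy as np

emb_day_of_week = np.random.randn(7, 4)   # 7 rows (one per day of week), embedding dimension 4
saturday = 6                              # "Saturday is a 6"

continuous = np.array([22.1, 321.7])      # max temperature, competition distance (before normalization)
embedded = emb_day_of_week[saturday]      # look up the row for day of week 6 -> a length 4 vector

row_vector = np.concatenate([continuous, embedded])   # length 6 input to the first linear layer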

Then all of our categorical variables will turn them into Pandas categorical variables in the same way that we’ve done before [24:21]:

for v in cat_vars:
    joined[v] = joined[v].astype('category').cat.as_ordered()

Then we are going to apply the same mappings to the test set. If Saturday is 6 in the training set, this apply_cats makes sure that Saturday is also 6 in the test set:

apply_cats(joined_test, joined)

For the continuous variables, make sure they’re all floats because PyTorch expects everything to be a float.

for v in contin_vars:
    joined[v] = joined[v].fillna(0).astype('float32')
    joined_test[v] = joined_test[v].fillna(0).astype('float32')

So then this is another little trick that I use.

idxs = get_cv_idxs(n, val_pct=150000/n)
joined_samp = joined.iloc[idxs].set_index("Date")
samp_size = len(joined_samp); samp_size
150000

Both of these cells (above and below) define something called joined_samp. One of them defines it as a random subset, the other defines it as the whole training set. So the idea is that I do all of my work on the sample, make sure it all works well, play around with different hyperparameters and architectures. And then when I’m happy with it, I go back and run this line of code (below) to say, okay, now make the sample be the whole dataset, then rerun it.

samp_size = n
joined_samp = joined.set_index("Date")

This is a good approach, again similar to what I showed you before; it lets you use the same cells in your notebook to run first of all on the sample and then go back later and run on the full dataset.

Normalizing data [25:51]

Now that we’ve got that joined_samp, we can then pass it to proc_df as we’ve done before to grab the dependent variable and deal with missing values. In this case, we pass one more thing, which is do_scale=True. This will subtract the mean and divide by the standard deviation.

df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
yl = np.log(y)

The reason for that is that our first layer is just a matrix multiply. So here is our set of weights. If our input has something like 0.001 in one place and something else which is 10⁶, for example, and our weight matrix has been initialized to be random numbers between 0 and 1, then basically the 10⁶ input is going to produce gradients that are 9 orders of magnitude bigger than the 0.001 one, which is not going to be good for optimization. So by normalizing everything to have a mean of zero and a standard deviation of 1 to start with, all of the gradients are going to be on the same kind of scale.

We didn’t have to do that in random forests because in random forests, we only cared about the sort order. We didn’t care about the values at all. But with linear models and things that are built out of layers of linear models i.e. neural nets, we care very much about the scale. So do_scale=True normalizes our data for us. Now since it normalizes our data for us, it returns one extra object mapper which is an object that contains for each continuous variable what was the mean and standard deviation it was normalized with. The reason being that we are going to have to use the same mean and standard deviation on the test set because we need our test set and our training set to be scaled in the exact same way; otherwise they are going to have different meanings.
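So the test set gets processed with the same mapper and the same missing-value dictionary. A sketch of how that call might look, assuming the fastai 0.7 proc_df signature (the notebook’s actual call may include extra arguments):

df_test, _, nas, mapper = proc_df(joined_test, 'Sales', do_scale=True,
                                  na_dict=nas, mapper=mapper)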

So these details about making sure that your test and training set have the same categorical codings, the same missing value replacement, and the same scaling normalization are really important to get right, because if you don’t get it right, then your test set is not going to work at all. But if you follow these steps, it’ll work fine. We also take the log of the dependent variable, and that’s because in this Kaggle competition, the evaluation metric was root mean squared percent error. Root mean squared percent error means we are being penalized based on the ratio between our answer and the correct answer. We don’t have a loss function in PyTorch called root mean squared percent error. We could write one, but it’s easier just to take the log of the dependent variable, because the difference between logs is the log of the ratio (log(a) - log(b) = log(a/b)). So by taking the log, we kind of get that for free.
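To make that concrete, here is a rough sketch of how the competition metric could be computed from log-space predictions. This shows the general idea only, not necessarily how fastai’s exp_rmspe is implemented:

import numpy as np

def rmspe_from_log(log_pred, log_true):
    pred, true = np.exp(log_pred), np.exp(log_true)
    # root mean squared percent error: penalizes the ratio between prediction and truth
    return np.sqrt(np.mean(((true - pred) / true) ** 2))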

You’ll notice the vast majority of regression competitions on Kaggle use either root mean squared percent error or a root mean squared error of the log as their evaluation metric [29:23]. That’s because in real world problems, most of the time, we care more about ratios than about raw differences. So if you are designing your own project, it’s quite likely that you want to think about using log of your dependent variable.

So then we create a validation set and as we’ve learned before, most of the time if you’ve got a problem involving a time component, your validation set probably wants to be the most recent time period rather than a random subset [30:00]. So that’s what I do here:

val_idx = np.flatnonzero(
(df.index<=datetime.datetime(2014,9,17)) &
(df.index>=datetime.datetime(2014,8,1)))

When I finished modeling and I found an architecture and a set of hyper parameters and a number of epochs and all that stuff that works really well, if I want to make my model as good as possible, I’ll retrain on the whole thing — including the validation set. Now, currently at least, Fast AI assumes that you do have a validation set, so my kind of hacky workaround is to set my validation set to just be one index which is the first row:

val_idx=[0]

That way all the code keeps working but there’s no real validation set. Obviously if you do this, you need to make sure that your final training is like the exact same hyper parameters, the exact same number of epochs, exactly the same as the thing that worked because you don’t actually have a proper validation set now to check against.

Question: I have a question regarding the get_elapsed function which we discussed before. In get_elapsed, we are trying to find how many days away the next holiday is. Every year, the holidays are more or less fixed, like there will be a holiday on the 4th of July, the 25th of December, and there is hardly any change. So can’t we just look at previous years and get a list of all the holidays that are going to occur this year [31:09]? Maybe. I mean, in this case, I guess that’s not true for Promo, and some holidays change, like Easter. So this way, I get to write one piece of code that works for all of them. And it doesn’t take very long to run. If your dataset was so big that this took too long, you could maybe do it on one year and then somehow copy it. But in this case, there was no need to. And I always value my time over my computer’s time, so I try to keep things as simple as I can.

Creating a model [32:31]

So now we can create our model. To create our model, we have to create a model data object as we always do with Fast AI. So a columnar model data object is just a model data object that represents a training set, a validation set, and an optional test set of standard columnar structured data.

md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32),
                                       cat_flds=cat_vars, bs=128, test_df=df_test)

We just have to tell it which of the variables should we treat as categorical. Then pass in our data frames.

For each of our categorical variables, here is the number of categories it has. So for each of our embedding matrices, this tells us the number of rows in that embedding matrix. Then we define what embedding dimensionality we want. If you are doing natural language processing, then the number of dimensions you need to capture all the nuance of what a word means and how it’s used has been found empirically to be about 600. It turns out when you do NLP models with embedding matrices that are smaller than 600, you don’t get as good a result as you do with size 600. Beyond 600, it doesn’t seem to improve much. I would say that human language is one of the most complex things that we model, so I wouldn’t expect you to come across many, if any, categorical variables that need embedding matrices with more than 600 dimensions. At the other end, some things may have a pretty simple kind of causality. So for example, StateHoliday: maybe if something is a holiday then at stores that are in the city there’s one behavior, at stores in the country there’s another behavior, and that’s about it. Maybe it’s a pretty simple relationship. So ideally, when you decide what embedding size to use, you would use your knowledge about the domain to decide how complex the relationship is and therefore how big an embedding you need. In practice, you almost never know that. You only know it because maybe somebody else has previously done that research and figured it out, like in NLP. So in practice, you probably need to use some rule of thumb, and having tried a rule of thumb, you could then maybe try a little bit higher, a little bit lower and see what helps. So it’s kind of experimental.

cat_sz = [(c, len(joined_samp[c].cat.categories)+1) for c in cat_vars]
cat_sz
[('Store', 1116),
('DayOfWeek', 8),
('Year', 4),
('Month', 13),
('Day', 32),
('StateHoliday', 3),
('CompetitionMonthsOpen', 26),
('Promo2Weeks', 27),
('StoreType', 5),
('Assortment', 4),
('PromoInterval', 4),
('CompetitionOpenSinceYear', 24),
('Promo2SinceYear', 9),
('State', 13),
('Week', 53),
('Events', 22),
('Promo_fw', 7),
('Promo_bw', 7),
('StateHoliday_fw', 4),
('StateHoliday_bw', 4),
('SchoolHoliday_fw', 9),
('SchoolHoliday_bw', 9)]

So here is my rule of thumb [35:45]. My rule of thumb is to look at how many discrete values the category has (i.e. the number of rows in the embedding matrix) and make the dimensionality of the embedding half of that. So for day of week, which is the second one, that’s eight rows and four columns. Here it is: (c+1)//2, the number of rows divided by two (rounded). But then I say don’t go above 50. Here you can see for Store (first row), there are 1,116 stores, so we only give it a dimensionality of 50. Why 50? I don’t know. It seems to have worked okay so far. You may find you need something a little different. Actually for the Ecuadorian groceries competition, I haven’t really tried playing with this, but I think we may need some larger embedding sizes. But it’s something to fiddle with.

emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]
emb_szs
[(1116, 50),
(8, 4),
(4, 2),
(13, 7),
(32, 16),
(3, 2),
(26, 13),
(27, 14),
(5, 3),
(4, 2),
(4, 2),
(24, 12),
(9, 5),
(13, 7),
(53, 27),
(22, 11),
(7, 4),
(7, 4),
(4, 2),
(4, 2),
(9, 5),
(9, 5)]

Question: As your cardinality size becomes larger and larger, you are creating wider and wider embedding matrices. Aren’t you therefore massively risking overfitting, because if you are choosing, say, 70 parameters per level, the model can capture variation that isn’t really there unless your data is actually huge [36:44]? That’s a great question, so let me remind you of my golden rule about the difference between modern machine learning and old machine learning. In old machine learning, we control complexity by reducing the number of parameters. In modern machine learning, we control complexity by regularization. So the short answer is no. I’m not concerned about overfitting, because the way I avoid overfitting is not by reducing the number of parameters but by increasing my dropout or increasing my weight decay. Now having said that, there’s no point using more parameters for a particular embedding than I need, because regularization is penalizing the model, either by giving it more random data (dropout) or by actually penalizing weights (weight decay). So we’d rather not use more than we have to. But my general rule of thumb for designing an architecture is to be generous on the side of the number of parameters. In this case, if after doing some work we felt like the store doesn’t actually seem to be that important, then I might manually go and change this to make it smaller. Or if I was really finding there’s not enough data here, and I’m either overfitting or I’m using more regularization than I’m comfortable with, then you might go back. But I would always start by being generous with parameters. In this case, this model turned out pretty good.

Okay, now we’ve got a list of tuples containing the number of rows and columns of each of our embedding matrices [38:41]. And so when we call get_learner to create our neural net, that’s the first thing we pass in:

  • emb_szs: how big is each of our embeddings
  • len(df.columns)-len(cat_vars): how many continuous variables we have
  • [1000,500]: how many activations to create for each layer
  • [0.001,0.01]: what dropout to use for each layer
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
m.summary()

Then we can go ahead and call fit. We fit for a while and we are getting something around the 0.1 mark.

m.fit(lr, 1, metrics=[exp_rmspe])
[ 0.       0.01456  0.01544  0.1148 ]

m.fit(lr, 3, metrics=[exp_rmspe])
[ 0.       0.01418  0.02066  0.12765]
[ 1.       0.01081  0.01276  0.11221]
[ 2.       0.00976  0.01233  0.10987]

m.fit(lr, 3, metrics=[exp_rmspe], cycle_len=1)
[ 0.       0.00801  0.01081  0.09899]
[ 1.       0.00714  0.01083  0.09846]
[ 2.       0.00707  0.01088  0.09878]

So I tried running this on the test set and I submitted it to Kaggle last week, and here it is [39:25]:

Private score .107, public score .103. So let’s have a look and see how that would go. Let’s start on public:

340th out of 3000. That’s not good. Let’s try the private leader board which is .107.

Oh, 5th [40:30]. So hopefully you are now thinking: oh, there are some Kaggle competitions finishing soon which I entered, and I spent a lot of time trying to get good results on the public leaderboard; I wonder if that was a good idea. The answer is no, it wasn’t. The Kaggle public leaderboard is not meant to be a replacement for your carefully developed validation set. So for example, if you are doing the iceberg competition (which ones are ships, which ones are icebergs), they’ve actually put something like 4,000 synthetic images into the public leaderboard and none into the private leaderboard. So one of the really good things Kaggle tests you on is “are you creating a good validation set and are you trusting it?” Because if you are trusting your leaderboard feedback more than your validation feedback, then you may find yourself in 350th place when you thought you were in 5th. In this case, we actually had a pretty good validation set because, as you can see, it was saying somewhere around 0.1 and we actually did get somewhere around 0.1. So in this case, the public leaderboard in this competition was entirely useless.

Question: So in regards to that, how much does the top of the public leaderboard actually correspond to the top of the private leaderboard? Because in the churn prediction challenge, there are like 4 people who are just completely above everyone else [42:07]. It totally depends. If they randomly sample the public and private leaderboards, then it should be extremely indicative. But it might not be. So in this case, the person who was second on the public leaderboard did end up winning. The first place on the public leaderboard came in 7th. In fact, you can see the little green thing here. Whereas this guy jumped 96 places.

If we had entered with the neural net we just looked at, we would have jumped 350 places. So it just depends. Sometimes they will tell you the public leaderboard was randomly sampled. Sometimes they will tell you it’s not. Generally you have to figure it out by looking at the correlation between your validation set results and the public leaderboard results to see how well they are correlated. Sometimes if 2 or 3 people are way ahead of everybody else, they may have found some kind of leakage or something like that. That’s often a sign that there’s some trick.

Okay, so that’s Rossmann and that brings us to the end of all of our material.

Review [44:21]

We’ve learnt two ways to train a model. One is by building a tree and one is with SGD. So the SGD approach is a way we can train a model which is a linear model or a stack of linear layers with nonlinearities between them. Whereas tree building specifically will give us a tree. Tree building we can then combine with bagging to create a random forest, or with boosting to create a GBM, or various other slight variations such as extremely randomized trees. So it’s worth reminding ourselves of what these things do. So let’s look at some data. Actually, let’s look specifically at categorical data. With categorical data, there’s a couple of possibilities of what it might look like. Let’s say we’ve got zip code, so 94003 is our zip code. Then we’ve got sales, say 50. For 94131, sales of 22, and so forth. So we’ve got some categorical variable. There’s a couple of ways we could represent that categorical variable. One would be just to use the number. Maybe it wasn’t a number to start with. Maybe it wasn’t a number at all; maybe the categorical variable is something like San Francisco, New York, Mumbai, and Sydney. But we can turn it into a number just by arbitrarily deciding to give them numbers. So it ends up being a number. We could just use that kind of arbitrary number. So if it turns out that zip codes that are numerically next to each other have somewhat similar behavior, then the zip code versus sales chart might look something like this, for example:

Or alternatively, if two zip codes next to each other didn’t have similar sales behavior in any way, you would expect to see something that looks more like this:

Kind of just all over the place. So those are the two possibilities. So what a random forest would do, if we just encoded zip in this way, is it’s going to say: alright, I need to find my single best split point, the split point that’s going to make the two sides have as small a standard deviation as possible, or mathematically equivalently, the lowest root mean squared error. So in this case, it might pick here as a first split point because on the left side there’s one average and on the other side there’s the other average [48:07].

Then for its second split point, it’s going to say how do I split the right hand side, and it’s probably going to say I would split here because now we’ve got this average versus this average:

Then finally, it’s going to say how do we split the middle bit, and it’s going to say okay, I’ll split right in the middle. So you can see that it’s able to hone in on the set of splits it needs even though it does it greedily, top down, one at a time. The only reason it wouldn’t be able to do this is if it was just such bad luck that the two halves were always exactly balanced. But even if that happens, it’s not going to be the end of the world. It’ll split on something else, some other variable, and next time around, it’s very unlikely that it’s still going to be exactly balanced in both parts of the tree. So in practice, this works just fine.

In the second case, it can do exactly the same thing [49:25]. It’ll say okay which is my best first split even though there’s no relationship between one zip code and its neighboring zip code numerically. We can still see here, if it splits here, there’s the average on one side and the average on the other side is probably about here:

Then where would it split next? Probably here, because here is the average on one side, here’s the average on the other side.

So again, can do the same thing. It’s going to need more splits because it’s going to end up having to narrow down on each individual large zip code and each individual small zip code. But it’s still going to be fine. So when we are dealing with building decision trees for random forests or GBM’s or whatever, we tend to encode our variables just as ordinals.

On the other hand [50:26], if we are doing a neural network or like a simplest version like a linear regression or logistic regression, the best it could do is that (in green) which is no good at all:

And ditto with this one. It’s going to be like that:

So an ordinal is not going to be a useful encoding for a linear model or something that stacks linear and nonlinear models together. So instead, what we do is we create a one hot encoding. Like so:

With that encoding, it can effectively create a little histogram where there’s a different coefficient for each level. So that way, it can do exactly what it needs to do.
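A toy sketch of the two encodings, using the city example from above (not code from the lesson):

import pandas as pd

cities = pd.Series(['San Francisco', 'New York', 'Mumbai', 'Sydney'], dtype='category')

ordinal = cities.cat.codes        # arbitrary integers -- fine for a tree
one_hot = pd.get_dummies(cities)  # one column per level -- what a linear model or neural net needs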

Question: At what point does that become too tedious for your system [51:36]? Pretty much never. Because remember, in real life, we don’t actually have to create that matrix. Instead, we can just have the four coefficients and do an index lookup, which is mathematically equivalent to multiplying by the one hot encoding. So that’s no problem.

One thing to mention [52:14]. I know you guys have been taught quite a bit about more analytical solutions to things. And with analytical solutions to something like a linear regression, you can’t solve something with this amount of collinearity. In other words, you know something is in Sydney if it’s not Mumbai, New York, or San Francisco. So there’s a hundred percent collinearity between the fourth of these classes and the other three. So if you try to solve a linear regression analytically that way, the whole thing falls apart. Now note, with SGD, we have no such problem. Why would SGD care? We’re just taking one step along the derivative. It cares a little, because in the end the main problem with collinearity is that there’s an infinite number of equally good solutions. In other words, we could increase all of these on the left and decrease this one, or decrease all of these and increase this one, and they are going to balance out.

And when there’s an infinitely large number of good solutions, it means there’s a lot of flat spots in the loss surface and it can be harder to optimize. So the really easy way to get rid of all those flat spots is to add a little bit of regularization. So if we added a little bit of weight decay, like 1e-7 even, then that says these are not all equally good anymore, the one which is the best is the one where the parameters are the smallest and the most similar to each other, and so that’ll again move it back to being a nice loss function.

Question: Could you clarify the point you made about why one hot encoding wouldn’t be that tedious [54:03]? Sure. If we have a one hot encoded vector and we are multiplying it by a set of coefficients, then that’s exactly the same thing as simply saying: let’s grab the coefficient where the one is. In other words, if we had stored 1000 as a zero, 0100 as a one, 0010 as a two, then it’s exactly the same as just saying hey, look up that thing in the array.

So we call that version an embedding. So an embedding is a weight matrix you can multiply by one hot encoding. And it’s just a computational shortcut. But it’s mathematically the same.
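Here is a tiny sketch of that equivalence (illustrative only):

import numpy as np

W = np.random.randn(4, 3)               # an "embedding": 4 levels, 3 coefficients each
one_hot = np.array([0., 0., 1., 0.])    # level number 2, one hot encoded

np.allclose(one_hot @ W, W[2])          # True: the matrix multiply and the row lookup give the same vector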

There is a key difference between solving a linear type model analytically versus with SGD [55:03]. With SGD, we don’t have to worry about collinearity and so on, at least not nearly to the same degree. Then there’s the difference between solving a linear, single layer, or multi-layer model with SGD versus a tree: a tree is going to complain about fewer things. In particular, you can just use ordinals as your categorical variables, and as we learnt just before, we also don’t have to worry about normalizing continuous variables for a tree, but we do have to worry about it for these SGD trained models.

Then we also learnt a lot about interpretation, random forests in particular. If you are interested, you might try to use those same techniques to interpret neural nets. If you want to know which features are important in a neural net, you could try the same thing: try shuffling each column in turn and see how much it changes your accuracy. That’s going to be your feature importance for your neural net. Then if you really want to have fun, recognize that shuffling a column is just a way of calculating how sensitive the output is to that input, which in other words is the derivative of the output with respect to that input. So maybe you could just ask PyTorch to give you the derivatives with respect to the input directly and see if that gives you the same kind of answers.
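A hedged sketch of what that might look like for any model with a predict function and a scoring function (the names here are placeholders, not a fastai API):

import numpy as np

def permutation_importance(predict, score, X_val, y_val):
    base = score(y_val, predict(X_val))
    importances = {}
    for col in X_val.columns:
        X_shuffled = X_val.copy()
        # shuffle one column to break its relationship with the target
        X_shuffled[col] = np.random.permutation(X_shuffled[col].values)
        importances[col] = base - score(y_val, predict(X_shuffled))  # drop in score = importance
    return importances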

You could do the same kind of thing for partial dependence plots. You could try doing the exact same thing with your neural net: replace everything in the column with the same value, do it for 1960, 1961, 1962, and plot that. I don’t know of anybody who’s done these things before, not because it’s rocket science but just because maybe no one thought of it, or it’s not in a library; I don’t know. But if somebody tried it, I think you would find it useful. It would make a great blog post, maybe even a paper if you wanted to take it a bit further. So there’s a thought on something you can do. Most of those interpretation techniques are not particularly specific to random forests. Things like the tree interpreter certainly are, because they are all about what’s inside the tree.
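And a similar hedged sketch of the partial dependence idea for a neural net (again, placeholder names):

def partial_dependence(predict, X_val, col, values):
    means = []
    for v in values:              # e.g. values = [1960, 1961, 1962, ...]
        X_tmp = X_val.copy()
        X_tmp[col] = v            # replace the whole column with one value
        means.append(predict(X_tmp).mean())
    return values, means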

Question: In the tree interpreter, we are looking at the paths and the contributions of the features. In the neural net case, would it be the same with activations, I guess the contributions of each activation along the path [57:42]? Yeah, maybe. I don’t know. I haven’t thought about it. Question continued: How can we make inference out of the activations? Jeremy: Be careful saying the word “inference” because people normally use the word inference specifically to mean test time prediction. You mean kind of interrogate the model. I’m not sure. We should think about that. Actually, Hinton and one of his students just published a paper on how to approximate a neural net with a tree for this exact reason. I haven’t read the paper yet.

Question: In linear regression and traditional statistics, one of the things that we focused on was statistical significance of the changes and things like that. So when thinking about the tree interpreter, or even the waterfall chart (which I guess is just a visualization), where does that fit in? Because we can see that something looks important in the sense that it causes large changes, but how do we know that it’s statistically significant in the traditional sense [58:43]? Most of the time, I don’t care about traditional statistical significance, and the reason why is that nowadays the main driver of statistical significance is data volume, not practical importance. Most of the models you build will have so much data that every tiny thing will be statistically significant, but most of them won’t be practically significant. So my main focus, therefore, is practical significance, which is: does the size of this influence impact your business? Statistical significance was much more important when we had a lot less data to work with. If you do need to know statistical significance, because, for example, you have a very small dataset because it’s really expensive to label or hard to collect, or it’s a medical dataset for a rare disease, you can always get statistical significance by bootstrapping, which is to say that you randomly resample your dataset a number of times, train your model a number of times, and you can then see the actual variation in predictions. So with bootstrapping, you can turn any model into something that gives you confidence intervals. There is a paper by Michael Jordan on a technique called the bag of little bootstraps which takes this a little further and is well worth reading if you are interested.
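A rough sketch of that bootstrapping idea (train_fn, predict, and the data types are placeholders; X is assumed to be a DataFrame and y a numpy array):

import numpy as np

def bootstrap_predictions(train_fn, X, y, X_new, n_boot=100):
    preds = []
    n = len(X)
    for _ in range(n_boot):
        idx = np.random.choice(n, n, replace=True)   # resample rows with replacement
        model = train_fn(X.iloc[idx], y[idx])
        preds.append(model.predict(X_new))
    return np.stack(preds)   # take percentiles across axis 0 for confidence intervals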

Question: You said we don’t need a one hot encoding matrix if we are doing random forests. What will happen if we do that, and how bad can the model be [1:00:46]? We actually did do it. Remember, we had that maximum category size and we did create one hot encodings, and the reason why we did it was that then our feature importance would tell us the importance of the individual levels, and in our partial dependence plot, we could include the individual levels. So it doesn’t necessarily make the model worse; it may make it better, but it probably won’t change it much at all. In this case, it hardly changed it. Question continued: This is something that we have noticed on real data also, that if the cardinality is higher, let’s say 50 levels, and you do one hot encoding, the random forest performs very badly? Jeremy: Yes, that’s right. That’s why in Fast.AI, we have a maximum categorical size, because at some point your one hot encoded variables become too sparse. So I generally cut it off at 6 or 7. Also because when you get past that, it becomes less useful, because for the feature importance there are going to be too many levels to really look at. Question continued: Can it just not look at those levels which are not important and just report the significant features as important? Jeremy: Yeah, it will be okay. It’s just that once the cardinality increases too high, you’re splitting your data up too much basically, and so in practice your ordinal version is likely to be better.
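For reference, the maximum categorical size mentioned here is, as I understand it, the max_n_cat argument to proc_df in fastai 0.7, used roughly like this (the DataFrame and column names are placeholders):

# levels with cardinality above max_n_cat stay as ordinals; the rest get one hot encoded
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', max_n_cat=7)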

There is no time to review everything, but those are the key concepts. And remember that the embedding matrix we use is likely to have more than just one coefficient; we will actually have a dimensionality of a few coefficients per level, which isn’t going to be useful for most linear models [1:02:42]. But once you’ve got multi-layer models, that’s now creating a representation of your category which is quite a lot richer and you can do a lot more with it.

Ethics and Data Science [1:03:13]

Powerpoint

Let’s now talk about the most important bit. We started off early in this course talking about how actually a lot of machine learning is kind of misplaced. People focus on predictive accuracy, like Amazon has a collaborative filtering algorithm for recommending books, and they end up recommending the book which it thinks you’re most likely to rate highly. So what they end up doing is probably recommending a book that you already have, or that you already know about and would have bought anyway, which isn’t very valuable. What they should instead have done is to figure out which book they can recommend that would cause you to change your behavior. That way, they actually maximize their lift in sales due to recommendations. So this difference between optimizing to influence your actions versus just improving predictive accuracy is a really important distinction which is very rarely discussed, crazily enough. It’s more discussed in industry; it’s particularly ignored in most of academia. It’s a really important idea, which is that in the end the goal of your model, presumably, is to influence behavior. So remember, I actually mentioned a whole paper I have about this where I introduced this thing called the drivetrain approach, where I talk about ways to think about how to incorporate machine learning into how we actually influence behavior. So that’s a starting point, but then the next question is: okay, if we are trying to influence behavior, what kind of behavior should we be influencing, and how, and what might it mean when we start influencing behavior? Because nowadays a lot of the companies that you are going to end up working at are big-arse companies, and you’ll be building stuff that can influence millions of people. So what does that mean?

Actually I’m not going to tell you what it means because I don’t know [1:05:34]. All I’m going to try and do is make you aware of some of the issues and make you believe two things about them:

  1. You should care.
  2. They are big current issues.

The main reason I want you to care is because I want you to want to be a good person, and to show you that not thinking about these things can make you a bad person. But if you don’t find that convincing, I will tell you this. Volkswagen was found to be cheating on its emissions tests. The person who was sent to jail for it was the programmer who implemented that piece of code. They did exactly what they were told to do. So if you are coming in here thinking hey, I’m just a techie, I’ll just do what I’m told, that’s my job: I’m telling you, if you do that, you can be sent to jail for doing what you are told. So a) don’t just do what you’re told, because you can end up being a bad person, and b) you can go to jail.

The second thing to realize is: in the heat of the moment, you’re in a meeting with twenty people at work and you’re all talking about how you’re going to implement this new feature, and everybody is discussing it [1:06:49]. Everybody’s like “we could do this and here’s a way of modeling it and then we can implement it and here are these constraints,” and there’s some part of you that’s thinking: am I sure we should be doing this? That’s not the right time to be thinking about it, because it’s really hard to step up then and say “excuse me, I’m not sure this is a good idea”. You actually need to think about how you would handle that situation ahead of time. So I want you to think about these issues now, and realize that by the time you’re in the middle of it, you might not even notice it’s happening. It’ll just be a meeting like every other meeting, and a bunch of people will be talking about how to solve this technical question. You need to be able to recognize: oh, this is actually something with ethical implications.

Rachel actually wrote all of these slides. I’m sorry she can’t be here to present this because she has studied this in depth. She’s actually been in difficult environments herself where she’s seen these things happening, and we know how hard it is. But let me give you a sense of what happens.

So engineers trying to solve engineering problems and causing problems is not a new thing. In Nazi Germany there was IBM, then known as Hollerith. Hollerith was the original name of IBM, and it comes from the guy who actually invented the use of punch cards for tracking the US Census: the first mass, wide-scale use of punch cards for data collection in the world. That turned into IBM, but at this point it was still called Hollerith. So Hollerith sold a punch card system to Nazi Germany, and each punch card would carry codes like Jew, 8; gypsy, 12; general execution for death by gas chamber, 6. So here is one of these cards describing the way these various people were to be killed. A Swiss judge ruled that IBM’s technical assistance facilitated the tasks of the Nazis and the commission of their crimes against humanity. This led to the death of something like twenty million civilians. According to the Jewish Virtual Library, where I got these pictures and quotes from, “the destruction of the Jewish people became even less important because the invigorating nature of IBM’s technical achievement was only heightened by the fantastical profits to be made”. So this was a long time ago, and hopefully you won’t end up working at companies that facilitate genocide. But perhaps you will [1:09:59].

https://www.nytimes.com/2017/10/27/world/asia/myanmar-government-facebook-rohingya.html https://www.nytimes.com/2017/10/24/world/asia/myanmar-rohingya-ethnic-cleansing.html

Because perhaps you’ll go to Facebook, who are facilitating genocide right now. And I know people at Facebook who are doing this and had no idea they were doing it. Right now, the Rohingya, a Muslim population of Myanmar, are in the middle of a genocide. Babies are grabbed out of their mothers’ arms and thrown into fires, people are being killed, and there are hundreds of thousands of refugees. When interviewed, the Myanmar generals doing this said: we are so grateful to Facebook for letting us know about the “Rohingya fake news”, that these people are actually not human, that they are actually animals. Now Facebook did not set out to enable the genocide of the Rohingya in Myanmar, no. Instead, what happened is they wanted to maximize impressions and clicks. It turns out that the algorithms written by the data scientists at Facebook learned that if you take the kinds of stuff people are interested in and feed them slightly more extreme versions of that, you are actually going to get a lot more impressions, and the project managers are saying maximize these impressions, and people are clicking, and it creates this thing. So the potential implications are extraordinary and global. And this is something that is literally happening. This is October 2017. It’s happening now.

Question: I just want to clarify what was happening here. So it was the facilitation of fake news or inaccurate media [1:11:48]? Yeah, let me go into it in more detail. What happened was that in mid 2016, Facebook fired its human editors. It used to be humans that decided how to order things on your homepage. Those people got fired and replaced with machine learning algorithms. The machine learning algorithms were written by data scientists like you; they had nice clear metrics and they were trying to maximize their predictive accuracy: okay, we think if we put this thing higher than this thing, we’ll get more clicks. It turned out that these algorithms for putting things on the Facebook newsfeed had a tendency to exploit the fact that human nature is to click on things which stimulate our views and are therefore more extreme versions of things we’ve already seen. This was great for the Facebook revenue model of maximizing engagement; it looked good on all of their KPIs. At that time, there was some negative press along the lines of “I’m not sure that the stuff Facebook is now putting in their trending section is actually that accurate”, but from the point of view of the metrics that people were optimizing at Facebook, it looked terrific. Then, by October 2016, people started noticing some serious problems.

For example, it is illegal to target housing to people of certain races in America. That is illegal, and yet a news organization discovered that Facebook was doing exactly that in October 2016. Again, not because somebody in the data science team said “let’s make sure black people can’t live in nice neighborhoods.” Instead, their automatic clustering and segmentation algorithm found there was a cluster of people who didn’t like African Americans, and if you targeted them with these kinds of ads, then they would be more likely to select this kind of housing or whatever. But the interesting thing is that even after being told about this three times, Facebook still hasn’t fixed it. And that is to say, these are not just technical issues. They are also economic issues. When you start saying the thing that you get paid for (that is, ads) has to be structured differently, so that you either use more people, which costs money, or are less aggressive with your algorithms that target people based on minority group status or whatever, that can impact revenues. The reason I mention this is that you will at some point in your career find yourself in a conversation where you’re thinking “I’m not confident that this is morally okay”, while the person you are talking to is thinking in their head “this is going to make us a lot of money”, and you don’t quite ever manage to have a successful conversation because you’re talking about different things. So when you are talking to somebody who may be more experienced and more senior than you, and they may sound like they know what they are talking about, just realize that their incentives are not necessarily going to be focused on “how do I be a good person”. They are not thinking “how do I be a bad person”, but in my experience, the more time you spend in an industry, the more desensitized you get to this stuff, and the easier it is to forget that maybe getting promotions and making money isn’t the most important thing.

So for example [1:15:45], I’ve got a lot of friends who are very good at computer vision, and some of them have gone on to create startups that seem almost handmade to help authoritarian governments surveil their citizens. When I ask my friends whether they have thought about how this could be used in that way, they are generally kind of offended that I ask. But I’m asking you to think about this. Wherever you end up working, or if you end up creating a startup, tools can be used for good or for evil. So I’m not saying don’t create excellent object tracking and detection tools with computer vision, because you could go on to create a much better surgical intervention robot toolkit. I’m just saying be aware of it, think about it, talk about it.

So here is something I find fascinating [1:16:50]. There is this really cool thing meetup.com did: they actually thought about this. They realized that if they built a collaborative filtering system, like we learned about in class, to help people decide which meetup to go to, it might notice that on the whole in San Francisco a few more men than women tend to go to techie meetups, and so it might start recommending techie meetups to more men than women. As a result of which more men go to techie meetups; as a result of which, when women go to techie meetups, they think “oh, this is all men, I don’t really want to go to techie meetups”; as a result of which the algorithm gets new data saying that men like techie meetups better, and so it continues. So a little bit of an initial push from the algorithm can create this runaway feedback loop, and you end up with almost all-male techie meetups, for instance. This kind of feedback loop is a subtle issue that you really want to think about when you ask: what is the behavior I’m changing with this algorithm that I’m building?
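Just to make the dynamic concrete, here is a minimal, purely hypothetical simulation of that loop. This is not meetup.com’s actual system; the 52/48 starting split and the amplification exponent are made-up numbers. The recommender pushes techie meetups in proportion to observed attendance, attendance follows the recommendations, and the small initial gap widens every round:

# Hypothetical numbers: a small real difference in who attends techie meetups.
attendance = {'men': 0.52, 'women': 0.48}

for step in range(5):
    # The recommender promotes techie meetups in proportion to observed
    # attendance, with a mild amplification because people tend to avoid
    # meetups that look homogeneous.
    raw = {group: share ** 1.5 for group, share in attendance.items()}
    total = sum(raw.values())
    attendance = {group: raw[group] / total for group in raw}
    print(step, {group: round(share, 3) for group, share in attendance.items()})

# The 52/48 split drifts further apart each iteration, even though nobody
# ever decided "recommend techie meetups to men".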

Another example, which is kind of terrifying, is described in a paper whose authors show that a lot of police departments in the US are now using predictive policing algorithms [1:18:18]: where should we go to find somebody who is about to commit a crime? The algorithm basically feeds back to you the data that you’ve given it. So if your police department has engaged in racial profiling at all in the past, the algorithm might suggest, slightly more often, that you should go to the black neighborhoods to check for people committing crimes. As a result of which more of your police officers go to the black neighborhoods; as a result of which they arrest more black people; as a result of which the data says that the black neighborhoods are less safe; as a result of which the algorithm tells the police maybe you should go to the black neighborhoods even more often, and so forth.

This is not some vague possibility of something that might happen in the future. This is documented work from top academics who have carefully studied the data and the theory. This serious scholarly work says: no, this is happening right now. Again, I’m sure the people who started creating these predictive policing algorithms didn’t think “how do we arrest more black people.” Hopefully they were thinking, gosh, I’d like my children to be safer on the streets, how do I create a safer society? But they didn’t think about this nasty runaway feedback loop.

The one about social network algorithms is actually from a recent article in the New York Times by one of my friends, Renee Diresta, and she did something that was kind of amazing [1:20:08]. She set up a second Facebook account, a fake account. She was very interested in the anti-vaccination movement at the time, so she started following a couple of anti-vaxxers and visited a couple of anti-vaxxer links. And suddenly her news feed started filling up with anti-vaxxer news, along with other stuff like chemtrails and deep state conspiracy theories. So she started clicking on those, and the more she clicked, the more hardcore, far-out conspiracy content Facebook recommended. Now when Renee goes to that Facebook account, the whole thing is just full of angry, crazy, far-out conspiracy stuff. That’s all she sees. So if that were your world, then as far as you’re concerned it is just a continuous reminder and proof of all this stuff. Again, this is the kind of runaway feedback loop that ends up telling Myanmar generals, through their Facebook homepage, that the Rohingya are animals, fake news, and whatever else.

A lot of this also comes from bias [1:21:51]. So let’s talk about bias specifically. Bias in image software comes from bias in data. Most of the folks I know at Google Brain building computer vision algorithms are not people of color, so when they train the algorithms with photos of their families and friends, they are training them with very few people of color. So when FaceApp decided to look at lots of Instagram photos to see which ones are upvoted the most, without them necessarily realizing it, the answer was light-colored faces. They built a generative model to make you more “hot”, and so this is the actual photo and here is the hotter version. The hotter version is more white, with smaller nostrils, more European-looking. This did not go down well, to say the least. Again, I don’t think anybody at FaceApp said “let’s create something that makes people look more white.” They just trained it on a bunch of images of the people they had around them. And this has serious commercial implications as well: they had to pull the feature, and they got a huge amount of negative pushback, as they should have.

Here is another example. Google Photos created a photo classifier: airplanes, skyscrapers, cars, graduations, and gorillas. Think about how this looks to most people. Most people look at this, they don’t know about machine learning, and they say “what the f@#$, somebody at Google wrote some code to take black people and call them gorillas.” That’s what it looks like. We know that’s not what happened. What happened is that the team of computer vision experts at Google, with few or no people of color on the team, built a classifier using all the photos they had available to them, so when the system came across a person with dark skin, it effectively said oh, I’ve mainly only seen that before amongst gorillas, so I’ll put it in that category. Again, the bias in the data creates a bias in the software, and again the commercial implications were very significant. Google got a lot of bad PR from this, as they should have. This was a photo that somebody put in their Twitter feed, saying look what Google Photos just decided to do.

You can imagine what happened with the first international beauty contest judged by artificial intelligence. Basically, it turned out all the beautiful people are white. So you see this bias in image software, thanks to bias in the data, thanks to a lack of diversity in the teams building it.

You see the same thing in natural language processing [1:25:18]. Here is Turkish. “O” is the third-person pronoun in Turkish, and it has no gender; there is no he vs. she. But in English we don’t really have a widely used un-gendered singular pronoun, so Google Translate converts it to this. Now there are plenty of people who saw this online and said literally “so what?”; it is correctly feeding back the usual usage in English. I know how this is trained: these are Word2vec-style word vectors, trained on the Google News corpus and the Google Books corpus; it’s just telling us how things are. From one point of view, that is entirely true. The biased data used to create this biased algorithm is the actual data of how people have written books and newspaper articles for decades. But does that mean this is the product you want to create? Does it mean this is the product you have to create? Just because the particular way you’ve trained the model means it ends up doing this, is this actually the design you want? And can you think of potential negative implications and feedback loops this could create? If any of these things bother you, then lucky you, you now have a cool new engineering problem to work on: how do I create unbiased NLP solutions? There are some startups starting to do that and starting to make some money. So these are opportunities for you. Here is some stuff where people are creating screwed up societal outcomes because of their shitty models; you can go and build something better. Another example of the bias in Word2vec word vectors: restaurant review rankings scored Mexican restaurants worse because the word “Mexican” tends to be associated with criminal words in the US press and books more often. Again, this is a real problem that is happening.

Rachel actually did some interesting analysis of the plain Word2vec word vectors [1:27:46]. She basically pulled them out and looked at these analogies, based on some research that had been done elsewhere. So you can see, the Word2vec vector directions show that father is to doctor as mother is to nurse; man is to computer programmer as woman is to homemaker; and so forth. It’s really easy to see what’s in these word vectors, and they are fundamental to just about all the NLP software we use today.
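If you want to poke at this yourself, here is a hedged sketch of how analogies are read out of word vectors. It assumes you have gensim installed and the pretrained GoogleNews word2vec file downloaded locally; the file name and the exact tokens (like computer_programmer) are assumptions about that particular vector set. The analogy is just vector arithmetic followed by a nearest-neighbor lookup:

from gensim.models import KeyedVectors

# Assumes the pretrained GoogleNews vectors have been downloaded to this path.
vecs = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

# "man is to computer_programmer as woman is to ...?"
# Computed as vector(computer_programmer) - vector(man) + vector(woman),
# then finding the nearest words to that point.
print(vecs.most_similar(positive=['computer_programmer', 'woman'],
                        negative=['man'], topn=5))

# "father is to doctor as mother is to ...?"
print(vecs.most_similar(positive=['doctor', 'mother'],
                        negative=['father'], topn=5))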

Here is a great example [1:28:30]. ProPublica has done a lot of good work in this area. Many judges now have access to sentencing guideline software, which says to the judge: for this individual, we would recommend this kind of sentence. Now of course a judge doesn’t understand machine learning, so they have two choices: do what it says, or ignore it entirely, and some people fall into each category. For the ones who do what it says, here is what happens. Amongst those labeled higher risk, the subset who actually turned out not to re-offend was about a quarter of the whites and about half of the African Americans. So people who didn’t re-offend were marked as higher risk nearly twice as often if they were African American, and vice versa: amongst those labeled lower risk who actually did re-offend, it was about half of the whites and only 28% of the African Americans. This is data which, I would like to think, nobody set out to create. But when you start with biased data, and the data reflects a system where whites and blacks smoke marijuana at about the same rate but blacks are jailed for it something like 5 times more often than whites, the nature of the justice system in America, at least at the moment, is that it’s not equal, it’s not fair. And therefore the data that’s fed into the machine learning model is going to basically support that status quo. And then, because of the negative feedback loop, it’s just going to get worse and worse.
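The numbers above are per-group error rates, and it is worth knowing how easy they are to check on your own models. Here is a minimal sketch of that kind of audit using made-up toy data; the column names and values are hypothetical, not ProPublica’s actual dataset:

import pandas as pd

# Hypothetical toy data standing in for (group, model label, actual outcome).
df = pd.DataFrame({
    'group':      ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'pred_high':  [1,   0,   0,   1,   1,   1,   0,   0],
    'reoffended': [1,   0,   0,   0,   0,   1,   1,   0],
})

for group, grp in df.groupby('group'):
    did_not = grp[grp.reoffended == 0]
    did = grp[grp.reoffended == 1]
    # False positive rate: labeled high risk amongst those who did not re-offend.
    fpr = did_not.pred_high.mean()
    # False negative rate: labeled low risk amongst those who did re-offend.
    fnr = (1 - did.pred_high).mean()
    print(group, 'FPR:', round(fpr, 2), 'FNR:', round(fnr, 2))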

Now I’ll tell you something else interesting about this one, which the researcher Abe Gong has pointed out [1:30:35]. Here are some of the questions that are being asked. Let’s take one: was your father ever arrested? Your answer to that question is going to help decide whether you’re locked up and for how long. Now, as a machine learning researcher, do you think that might improve the predictive accuracy of your algorithm and get you a better R²? It could well; maybe you try it out and find that it does. So does that mean you should use it? There’s another question: do you think it’s reasonable to lock somebody up for longer because of who their dad was? And yet these are actual examples of questions that we are asking offenders right now and then putting into a machine learning system to decide what happens to them. Again, whoever designed this was presumably laser-focused on technical excellence, getting the maximum area under the ROC curve: “I found these great predictors that give me another 0.02.” And I guess they didn’t stop to think, well, is that a reasonable way to decide who goes to jail for longer?

Putting this together, you can see how this can get more and more scary [1:32:03]. Take a company like Taser. Tasers are devices that basically give you a big electric shock, and Taser has managed to do a great job of creating strong relationships with some academic researchers who seem to say whatever they are told to say, to the extent that if you look at the data, it turns out there is a pretty high probability that if you get tased you’ll die. It happens not infrequently. And yet the researchers they’ve paid to look into this have consistently come back and said “oh no, it was nothing to do with the taser. The fact that they died immediately afterwards was totally unrelated; it was just random, things happen.” This company now owns 80% of the market for body cameras, and they’ve started buying computer vision AI companies. They are going to try to use these police body camera videos to anticipate criminal activity. So what does that mean? Is it like, okay, I now have some augmented reality display saying tase this person because they are about to do something bad? It’s a worrying direction, and I’m sure nobody who’s a data scientist at Taser or at the companies they bought is thinking “this is the world I want to help create.” But they could find themselves, or you could find yourself, in the middle of this kind of discussion, where it’s not explicitly about that topic but there’s a part of you wondering “is this how this could be used?” And I don’t know exactly what the right thing to do in that situation is, because you can ask, and of course people are going to say “no no no.” So what could you do? You could ask for some kind of written promise, you could decide to leave, you could start doing some research into the legality of things so that you at least protect your own legal situation. I don’t know. Have a think about how you would respond to that.

So these are some questions that Rachel created as things to think about [1:34:39]. If you are looking at building a data product or using a model, if you are building a machine learning model, it’s for a reason: you’re trying to do something. So what bias may be in that data? Because whatever bias is in that data ends up being a bias in your predictions, which potentially then biases the actions you are influencing, which potentially then biases the data that comes back to you, and you may get a feedback loop.

If the team that built it isn’t diverse, what might you be missing? For example, one senior executive at Twitter raised the alarm about major Russian bot problems at Twitter well before the election. That was the one black person on the exec team at Twitter. The one. And shortly afterwards, they lost their job. Having a more diverse team means having a more diverse set of opinions, beliefs, ideas, and things to look for, and so forth. Non-diverse teams seem to make more of these bad mistakes.

Can we audit the code? Is it open source? Have we checked the error rates amongst different groups? Is there a simple rule we could use instead that’s extremely interpretable and easy to communicate? And if something goes wrong, do we have a good way to deal with it?

When we’ve talked to people about this, a lot of them have come to Rachel and said: I’m concerned about something my organization is doing, what do I do [1:36:21]? Or: I’m just concerned about my toxic workplace, what do I do? Very often Rachel will say, have you considered leaving? And they will say, I don’t want to lose my job. But actually, if you can code, you’re in something like 0.3% of the population. If you can code and do machine learning, you’re in maybe 0.01% of the population. You are massively, massively in demand. Realistically, an organization does not want you to feel like you are somebody who could just leave and get another job; that’s not in their interest. But it’s absolutely true. So one of the things I hope you’ll leave this course with is enough self-confidence to recognize that you have the skills to get a job, and particularly once you’ve got your first job, your second job is an order of magnitude easier. This is important not just so that you feel you have the ability to act ethically, but also because if you find yourself in a toxic environment, which is pretty damn common unfortunately (there are a lot of shitty tech cultures and environments, particularly in the Bay Area), the best thing to do is to get the hell out. And if you don’t have the self-confidence to think you can get another job, you can get trapped. So it’s really important to know that you are leaving this program with very in-demand skills, and particularly after you have that first job, you’re somebody with in-demand skills and a track record of being employed in that area.

Question: This is kind of a broad question, but what are some things that you know of that people are doing to treat bias in data [1:38:41]? It’s a bit of a controversial subject at the moment. There are people trying to use an algorithmic approach, where they basically try to identify the bias and subtract it out. But the most effective approaches I know of are the ones that try to treat it at the data level. So start with a more diverse team, particularly a team that includes people from the humanities: sociologists, psychologists, economists, people who understand feedback loops and the implications for human behavior. They tend to be equipped with good tools for identifying and tracking these kinds of problems. And then try to incorporate the solutions into the process itself. But there isn’t some standard process I can point you to and say, here’s how to solve it; if there is such a thing, we haven’t found it yet. The short answer is that it requires a diverse team of smart people being aware of the problems and working hard at them.

Comment: This is just a general thing for the whole class: if you are interested in this stuff, I read a pretty cool book, Jeremy you’ve probably heard of it, Weapons of Math Destruction by Cathy O’Neil. It covers a lot of the same stuff [1:40:09]. Yeah, thanks for the recommendation. Cathy is great. She’s also got a TED talk. I didn’t manage to finish the book; it’s so damn depressing, I was just like “no more”. But yeah, it’s very good.

All right. That’s it. Thank you, everybody. This has been really intense for me. Obviously this was meant to be something that I was sharing with Rachel, so I’ve ended up doing one of the hardest things in my life, which is to teach two people’s worth of courses on my own, while also looking after a sick wife, having a toddler, and doing a deep learning course. And doing all of this with a new library that I just wrote. So I’m looking forward to getting some sleep. But it’s been totally worth it, because you’ve been amazing. I’m thrilled with how you’ve reacted to the opportunities I’ve given you and also to the feedback that I’ve given you. So congratulations.

Lessons: 123456789101112
