Machine Learning 1: Lesson 1

Hiromi Suenaga
Feb 2, 2018 · 15 min read

My personal notes from machine learning class. These notes will continue to be updated and improved as I continue to review the course to “really” understand it. Much appreciation to Jeremy and Rachel who gave me this opportunity to learn.

Lessons: 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 · 9 · 10 · 11 · 12


Syllabus in brief

Depending on time and class interests, we’ll cover something like (not necessarily in this order):

Train vs. test

Trees and ensembles

What is ML? Why do we use it?

Feature engineering

Regularized models trained with SGD

Basic neural nets

KNN

CV / bootstrap (Diabetes data set?)

Ethical considerations


Random Forest: Blue Book for Bulldozers

Notebook / Kaggle

%load_ext autoreload
%autoreload 2
%matplotlib inline
from fastai.imports import *
from fastai.structured import *
from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display
from sklearn import metrics

Data science ≠ software engineering [08:43]. You will see code that does not follow PEP 8 and things like import *, but go along with it for a while. What we are doing right now is prototyping models, and prototyping models has a very different set of best practices that are taught nowhere. The key is to be able to do things very interactively and iteratively. Jupyter Notebook makes this easy. If you ever wonder what display is, you can do one of three things:

1. Type display in a cell and press shift+enter — it tells you where it comes from
2. Type display? in a cell and press shift+enter — it shows you the documentation
3. Type display?? in a cell and press shift+enter — it shows you the source code

Download data [12:05]

Entering a Kaggle competition will let you know whether you are competent at this kind of data and this kind of model. Is the accuracy bad because the data is so noisy that you cannot do better? Or is it actually an easy dataset and you have made a mistake? When you are working on your own project with your own dataset, you will not get this kind of feedback — we just have to know that we have good, effective techniques for reliably building a baseline model.

Machine learning should help us understand a dataset, not just make predictions about it [15:36]. So picking an area we are not familiar with is a good test of whether we can build an understanding. Otherwise, what can happen is that your intuition about the data makes it very difficult for you to be open-minded enough to see what the data really says.

There are a few options to download the data:
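(The lecture screenshots showing the download options are omitted here.) One option, assuming you have the kaggle package installed and an API token set up, is the official Kaggle CLI run from a notebook cell:

# Assumes the competition rules have been accepted on the Kaggle website;
# the destination folder is arbitrary.
!kaggle competitions download -c bluebook-for-bulldozers -p data/bulldozers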


Jupyter trick [21:39] — from the notebook home page, you can open a web-based terminal (New → Terminal).


The goal of this competition is to use the training set which contains data through the end of 2011 to predict the sale price of bulldozers.

Let’s look at the data [25:25]:

Structured data: Columns representing a wide range of different types of things such as identifier, currency, date, size.

Unstructured data: Images

pandas is the most important library for working with structured data; it is usually imported as pd.

df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False,
                     parse_dates=["saledate"])

def display_all(df):
    with pd.option_context("display.max_rows", 1000):
        with pd.option_context("display.max_columns", 1000):
            display(df)

display_all(df_raw.tail().transpose())

In Jupyter Notebook, if you type a variable name and press ctrl+enter — whether it is a DataFrame, video, HTML, etc. — it will generally figure out a way of displaying it for you [32:13].


The variable we want to predict is called the dependent variable; in this case, our dependent variable is SalePrice.

Question: Should you never look at the data because of the risk of overfitting? [33:08] We want to find out at least enough to know that we have managed to import it okay, but we tend not to really study it at all at this point, because we do not want to make too many assumptions about it. Many books say to do a lot of exploratory data analysis (EDA) first. We will learn machine learning driven EDA today.

Purpose of the project — Evaluation [34:06]

Root mean squared log error (RMSLE). The reason we use the log is that generally you care not so much about missing by $10 as missing by 10%. So if it was a $1,000,000 item and you are $100,000 off, or it was a $10,000 item and you are $1,000 off — we would consider those equivalent scale issues.

df_raw.SalePrice = np.log(df_raw.SalePrice)
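Since we took the log of SalePrice above, an ordinary RMSE computed on the logged prices is equivalent to RMSLE on the raw prices. A minimal sketch of the metric (the rmsle name is just for illustration):

import numpy as np

def rmsle(pred, actual):
    # RMSE of the log of predictions vs. the log of actuals
    # (prices here are large, so the usual +1 inside the log barely matters)
    return np.sqrt(((np.log(pred) - np.log(actual)) ** 2).mean())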

What is a random forest? [36:37]

Random forest is a universal machine learning technique.

Question: What about the curse of dimensionality? [38:16] There are two concepts you often hear about — the curse of dimensionality and the no free lunch theorem. They are both largely meaningless and basically stupid, and yet many people in the field not only don't know that but think the opposite, so it is well worth explaining. The curse of dimensionality is the idea that the more columns you have, the more you create a space that is more and more empty. There is a fascinating mathematical idea that the more dimensions you have, the more all of the points sit on the edge of that space. If you just have a single dimension where things are random, then they are spread out all over. Whereas if it is a square, then for a point to sit in the middle it cannot be on the edge of either dimension, so it is a little less likely to be away from every edge. Each dimension you add makes it multiplicatively less likely that the point is not on the edge of at least one dimension, so in high dimensions, everything sits on the edge. What that means in theory is that the distance between points is much less meaningful. If that mattered, it would suggest that when you have lots of columns and you just use them without being careful to remove the ones you do not care about, things will not work. This turns out not to be the case, for a number of reasons.
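As a quick illustration of the "everything sits on the edge" claim (a sketch, not from the lecture), sample uniform random points and check how often at least one coordinate lands near a boundary:

import numpy as np

rng = np.random.default_rng(0)
for d in (1, 2, 10, 50):
    pts = rng.uniform(size=(100_000, d))
    # fraction of points within 0.05 of a boundary in at least one dimension
    near_edge = ((pts < 0.05) | (pts > 0.95)).any(axis=1).mean()
    print(f"{d:>3} dims: {near_edge:.2%} of points near an edge")

In one dimension only about 10% of points are near an edge; by 50 dimensions essentially all of them are. That is the mathematical fact — even though, as the lecture says, it rarely matters in practice.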

scikit-learn [42:54]

Most popular and important package for machine learning in Python. It is not the best at everything (e.g. XGBoost is better than scikit-learn's gradient boosted trees), but it is pretty good at nearly everything.

m = RandomForestRegressor(n_jobs=-1)
m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)

Everything in scikit-learn has the same form.
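For example, here is a minimal sketch (not from the lecture, with made-up data) showing that a different estimator follows the same construct → fit → predict pattern:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 3)            # some numeric features
y = X @ np.array([1.0, 2.0, 3.0])     # a made-up target

lm = LinearRegression()               # 1. construct an estimator
lm.fit(X, y)                          # 2. fit it on independent/dependent variables
preds = lm.predict(X)                 # 3. predict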


The RandomForestRegressor fit on df_raw above will result in an error. There was a value inside the dataset, “Conventional”, and it did not know how to create a model using that string. We have to pass numbers to most machine learning models, and certainly to random forests. So step 1 is to convert everything into numbers.

This dataset contains a mix of continuous and categorical variables.

Here is some of the information we can extract from a date — year, month, quarter, day of month, day of week, week of year, is it a holiday? a weekend? was it raining? was there a sporting event that day? It really depends on what you are doing. If you are predicting soda sales in SoMa, you would probably want to know if there was a San Francisco Giants ball game that day. What is in a date is one of the most important pieces of feature engineering you can do, and no machine learning algorithm can tell you whether the Giants were playing that day and that it was important. So this is where you need to do feature engineering.

The add_datepart method extracts particular date fields from a complete datetime for the purpose of constructing categoricals. You should always consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can’t capture any trend/cyclical behavior as a function of time at any of these granularities.

def add_datepart(df, fldname, drop=True):
    fld = df[fldname]
    if not np.issubdtype(fld.dtype, np.datetime64):
        df[fldname] = fld = pd.to_datetime(fld, infer_datetime_format=True)
    targ_pre = re.sub('[Dd]ate$', '', fldname)
    for n in ('Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
              'Is_month_end', 'Is_month_start', 'Is_quarter_end',
              'Is_quarter_start', 'Is_year_end', 'Is_year_start'):
        df[targ_pre+n] = getattr(fld.dt, n.lower())
    df[targ_pre+'Elapsed'] = fld.astype(np.int64) // 10**9
    if drop: df.drop(fldname, axis=1, inplace=True)

fld = df_raw.saledate
fld.dt.year
add_datepart(df_raw, 'saledate')
df_raw.saleYear.head()

Question: [55:40] What is the difference between df['saleYear'] and df.saleYear? It is safer to use square brackets, especially when assigning values and there is a possibility that the column does not already exist.
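A quick illustration (not from the lecture) of why the square brackets matter when the column does not exist yet:

tmp = pd.DataFrame({'a': [1, 2]})
tmp.b = [3, 4]      # sets an attribute on the object; no 'b' column is created
tmp['b'] = [3, 4]   # actually adds the column
tmp.columns         # Index(['a', 'b'], dtype='object')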

After running add_datepart, it added many numerical columns and removed the saledate column. This is not quite enough to get past the error we saw earlier, as we still have other columns that contain string values. Pandas has a concept of a category data type, but by default it does not turn anything into a category for you. Fast.ai provides a function called train_cats which creates categorical variables for everything that is a string. Behind the scenes, it creates a column that is an integer and stores a mapping from the integers to the strings. train_cats is called “train” because it is training-data specific. It is important that the validation and test sets use the same category mappings (in other words, if you used 1 for “high” in the training dataset, then 1 should also be “high” in the validation and test datasets). For the validation and test datasets, use apply_cats instead.
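Roughly, the idea behind train_cats and apply_cats looks like this (a sketch, not the fastai source; the _sketch names are placeholders):

def train_cats_sketch(df):
    # Turn every string column into an ordered pandas categorical,
    # so a tree can later split on its integer codes.
    for col, s in df.items():
        if s.dtype == 'object':
            df[col] = s.astype('category').cat.as_ordered()

def apply_cats_sketch(df, trn):
    # Reuse the training set's category mappings on validation/test data.
    for col, s in df.items():
        if trn[col].dtype.name == 'category':
            df[col] = pd.Categorical(s, categories=trn[col].cat.categories,
                                     ordered=True)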

train_cats(df_raw)
df_raw.UsageBand.cat.categories
Index(['High', 'Low', 'Medium'], dtype='object')

The order does not matter too much, but since we are going to be creating a decision tree that splits things at a single point (i.e. High vs. Low and Medium, or High and Low vs. Medium), having the categories in an arbitrary order is a little bit weird. To order them in a sensible manner, you can do the following:

df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'],
                                    ordered=True, inplace=True)

There is a kind of categorical variable called “ordinal”. An ordinal categorical variable has some kind of order (e.g. “Low” < “Medium” < “High”). Random forests are not terribly sensitive to that fact, but it is worth noting.

display_all(df_raw.isnull().sum().sort_index()/len(df_raw))

The above counts the number of null values in each column, sorts them by column name (pandas.Series.sort_index), and divides by the number of rows in the dataset, giving the fraction of missing values per column.

Reading CSV took about 10 seconds, and processing took another 10 seconds, so if we do not want to wait again, it is a good idea to save them. Here we will save it in a feather format. What this is going to do is to save it to disk in exactly the same basic format that it is in RAM. This is by far the fastest way to save something, and also to read it back. Feather format is becoming standard in not only Pandas but in Java, Apache Spark, etc.

os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/bulldozers-raw')

We can read it back as so:

df_raw = pd.read_feather('tmp/bulldozers-raw')

We will replace categories with their numeric codes, handle missing continuous values, and split the dependent variable into a separate variable.

df, y, nas = proc_df(df_raw, 'SalePrice')
(Screenshot: proc_df in structured.py)

(Screenshot: fix_missing in structured.py)

(Screenshot: numericalize in structured.py)
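Roughly, these two helpers do the following (a sketch of the behavior described above, not the exact fastai source; the _sketch names are placeholders):

def fix_missing_sketch(df, col):
    # For a numeric column with missing values: add a boolean col+'_na'
    # indicator column and fill the gaps with the median.
    if pd.api.types.is_numeric_dtype(df[col]) and df[col].isnull().any():
        df[col + '_na'] = df[col].isnull()
        df[col] = df[col].fillna(df[col].median())

def numericalize_sketch(df, col):
    # For a categorical column: replace strings with their integer codes
    # (+1 so that missing values, coded -1 by pandas, become 0).
    if df[col].dtype.name == 'category':
        df[col] = df[col].cat.codes + 1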
df.head()

Now we have all numerical values. Note that booleans are treated as numbers. So we can create a random forest.

m = RandomForestRegressor(n_jobs=-1)
m.fit(df, y)
m.score(df,y)

Random forests are trivially parallelizable — meaning if you have more than one CPU, you can split up the data across different CPUs and it scales linearly. So the more CPUs you have, the more the time it takes gets divided by that number (not exactly, but roughly). n_jobs=-1 tells the random forest regressor to create a separate job/process for each CPU you have.

m.score will return the r² value (1 is good, 0 is bad). We will learn about r² next week.

Wow, an r² of 0.98 — that’s great, right? Well, perhaps not…

Possibly the most important idea in machine learning is that of having separate training & validation data sets. As motivation, suppose you don’t divide up your data, but instead use all of it. And suppose you have lots of parameters:

(Figure: underfitting and overfitting — three models fit to the same red data points, from too simple on the left to too complex on the right)

The error for the pictured data points is lowest for the model on the far right (the blue curve passes through the red points almost perfectly), yet it’s not the best choice. Why is that? If you were to gather some new data points, they most likely would not be on that curve in the graph on the right, but would be closer to the curve in the middle graph.

This illustrates how using all our data can lead to overfitting. A validation set helps diagnose this problem.

def split_vals(a,n): return a[:n].copy(), a[n:].copy()

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)
X_train.shape, y_train.shape, X_valid.shape

((389125, 66), (389125,), (12000, 66))

Base Model

By using a validation set, you can see that the r² is 0.88 on data the model was not trained on.

def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train),
           rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

m = RandomForestRegressor(n_jobs=-1)
%time m.fit(X_train, y_train)
print_score(m)

CPU times: user 1min 3s, sys: 356 ms, total: 1min 3s
Wall time: 8.46 s
[0.09044244804386327, 0.2508166961122146,
 0.98290459302099709, 0.88765316048270615]

*[training rmse, validation rmse, r² for training set, r² for validation set]

If you check the Kaggle competition's public leaderboard, an RMSE of 0.25 would fall around the top 25%. Random forests are insanely powerful, and this totally standardized process is insanely good for any dataset.

Before the next class

Please try this process on as many Kaggle competitions as possible. What will likely happen is that you will be pleasantly surprised at how well you can do with just an hour of lecture.


Lessons: 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 · 9 · 10 · 11 · 12
