Deep Learning Part 1 — fast.ai - Rossman Notebook

Chunduri
12 min read · Oct 14, 2018


This blog covers the application of Deep Learning to structured data. I will explain the preprocessing part of the code from the fast.ai Deep Learning Part 1 course, Lesson 4, the Rossmann notebook.

To run the notebook in Colab, you should install the following versions of "fastai" and "torchtext":

!pip install fastai==0.7.0
!pip install torchtext==0.2.3

With this setup in place, we can run the notebook on Google Colab without any issues. The supported versions change so fast that you might still face problems by the time you run the notebook. In that case, leave a comment below with your issue and I will help you out.

The dataset contains the sales records of a German supermarket chain. Using this data, we should be able to predict future sales by training a model on the given training data.

This dataset is from an old and very popular Kaggle challenge. The implementation covered here is inspired by the third-place winner of the competition. The teams that won the first two places were domain experts who did a lot of feature engineering. The third-place winner did the minimum possible feature engineering, which makes this the right candidate for applying neural networks. Unlike traditional machine learning techniques, which need a lot of feature engineering, a DL model will figure out feature importance by itself.

There are other very good blogs by @timlee and @hiromi_suenaga, who wrote about this notebook very well. Their focus was mostly on explaining the Deep Learning part; they only briefly touched the preprocessing part of the implementation.

This article will focus more on explaining the preprocessing part of the code, and only briefly on the DL part. I will do my best to explain the parts of the code that I found difficult to understand and skip over the easy parts.

Major preprocessing steps:

  1. Create dataset
  2. Data cleaning/Feature engineering
  3. Durations with respect to Date-Time column

Creating the dataset:

From Kaggle, the training data, test data and store details are made available. In addition to that, we will use additional datasets put together by Kaggle participants.

table_names = ['train', 'store', 'store_states', 'state_names',
               'googletrend', 'weather', 'test']

Apart from the original tables, store_states, state_names, googletrend and weather are used to include additional features. These additional tables provide more features that can influence training and help the model generalize. The higher the number of relevant features considered, the better the function approximation ability of our DL model.

For example, weather patterns will determine the flow of customers and their buying preferences.

It is a good idea to search for as many tables as possible that can influence the sales. The tricky part is putting them together into one single table and feeding it as input to a DL model. This has been done very well by the fast.ai notebook.

tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names]
for t in tables: display(t.head())

The above two lines load the datasets and display the first few rows of each table.

Data cleaning/Feature engineering

The following 10 steps walk through the notebook, from preprocessing to the DL model:

1. Replace binary categorical columns with a boolean dtype, which is much more convenient to handle.

train.StateHoliday = train.StateHoliday!='0'
test.StateHoliday = test.StateHoliday!='0'

2. Join the tables together one by one using the pandas merge operator. In the end we should be left with a single table to which we will apply the DL model.

"join_df" is a function that joins two tables. It uses the pandas merge operator with a left (outer) join to do the job. The left join is preferred over an inner join because rows of the left table with no match in the right table still appear in the result, with the missing values shown as "NaN". This dataset is structured such that there are no missing values after joining the tables.

After each join, the notebook also checks whether any missing values were introduced.

We could use an inner join, since there are no missing values here. But it is good practice to do a left join and then check for missing values. That way the same approach can be readily applied to other datasets, which might have missing values and need some strategy to fill them.
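As a quick illustration of why this matters, here is a toy example (the store IDs and columns below are made up for illustration, not taken from the Rossmann data). A row with no match in the right table survives the left join, with NaN where the information is missing, and the isnull() count is exactly the style of check the notebook runs after every join.

import pandas as pd

left = pd.DataFrame({'Store': [1, 2, 3], 'Sales': [5263, 6064, 8314]})
right = pd.DataFrame({'Store': [1, 2], 'StoreType': ['a', 'c']})   # store 3 has no entry

merged = left.merge(right, how='left', on='Store')
print(merged)                                   # store 3 keeps its row, StoreType is NaN
print(len(merged[merged.StoreType.isnull()]))   # 1 -> same check as in the notebook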

def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on,
                      suffixes=("", suffix))

weather = join_df(weather, state_names, "file", "StateName")

googletrend['Date'] = googletrend.week.str.split(' - ', expand=True)[0]
googletrend['State'] = googletrend.file.str.split('_', expand=True)[2]
googletrend.loc[googletrend.State=='NI', "State"] = 'HB,NI'
trend_de = googletrend[googletrend.file == 'Rossmann_DE']

store = join_df(store, store_states, "Store")
len(store[store.State.isnull()])

joined = join_df(train, store, "Store")
len(joined[joined.StoreType.isnull()])

joined = join_df(joined, googletrend, ["State", "Year", "Week"])
len(joined[joined.trend.isnull()])

joined = joined.merge(trend_de, 'left', ["Year", "Week"], suffixes=('', '_DE'))
len(joined[joined.trend_DE.isnull()])

joined = join_df(joined, weather, ["State", "Date"])
len(joined[joined.Mean_TemperatureC.isnull()])

joined_test = test.merge(store, how='left', left_on='Store', right_index=True)
len(joined_test[joined_test.StoreType.isnull()])

for c in joined.columns:
    if c.endswith('_y'):
        if c in joined.columns: joined.drop(c, inplace=True, axis=1)

3. Identify the date column in each table and create additional categorical columns that are derived from it.

4. Also using the date column, create additional columns that indicate the time until some relevant event and the time elapsed since some relevant event. These columns will also be categorical.

"add_datepart" is a fast.ai function that does this job, expanding the date into features such as year, month, week, day of the week and so on, which play an important role in predicting sales. (The time since the last promo and the time left until the next promo are handled later by "get_elapsed".)

"drop=False" means we are not dropping the original date column. Even though we managed to extract a lot of relevant trends out of it, there could still be hidden patterns in the data that the DL model might identify.

add_datepart(weather, "Date", drop=False)
add_datepart(googletrend, "Date", drop=False)
add_datepart(train, "Date", drop=False)
add_datepart(test, "Date", drop=False)
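For intuition, here is a rough sketch of the kind of columns such a datepart expansion produces using plain pandas. This is an approximation of the idea, not the fast.ai source, and the attribute list is only a subset of what add_datepart actually creates.

import numpy as np
import pandas as pd

def add_datepart_sketch(df, fldname, drop=False):
    # Expand a datetime column into several derived columns (Year, Month, Week, ...).
    fld = pd.to_datetime(df[fldname])
    for attr in ['year', 'month', 'week', 'day', 'dayofweek', 'dayofyear',
                 'is_month_start', 'is_month_end']:
        df[attr.capitalize()] = getattr(fld.dt, attr)   # dt.week exists in the pandas versions of this era
    df['Elapsed'] = fld.astype(np.int64) // 10**9       # seconds since the epoch, as a continuous feature
    if drop: df.drop(fldname, axis=1, inplace=True)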

5. Replace missing values and extreme values with other values of our choice.

"fillna()" is a pandas function that fills missing values with values of our choice. Below, the missing values in the year columns are filled with the year 1900, while "CompetitionOpenSinceMonth" and "Promo2SinceWeek" are filled with the value one.

Extreme values are clipped to a cut-off minimum and maximum using pandas boolean indexing with .loc.

joined.CompetitionOpenSinceYear = joined.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)
joined.CompetitionOpenSinceMonth = joined.CompetitionOpenSinceMonth.fillna(1).astype(np.int32)
joined.Promo2SinceYear = joined.Promo2SinceYear.fillna(1900).astype(np.int32)
joined.Promo2SinceWeek = joined.Promo2SinceWeek.fillna(1).astype(np.int32)

joined["CompetitionOpenSince"] = pd.to_datetime(dict(year=joined.CompetitionOpenSinceYear,
                                                     month=joined.CompetitionOpenSinceMonth, day=15))
joined["CompetitionDaysOpen"] = joined.Date.subtract(joined.CompetitionOpenSince).dt.days
joined.loc[joined.CompetitionDaysOpen<0, "CompetitionDaysOpen"] = 0
joined.loc[joined.CompetitionOpenSinceYear<1990, "CompetitionDaysOpen"] = 0

joined["CompetitionMonthsOpen"] = joined["CompetitionDaysOpen"]//30
joined.loc[joined.CompetitionMonthsOpen>24, "CompetitionMonthsOpen"] = 24
joined.CompetitionMonthsOpen.unique()

joined["Promo2Since"] = pd.to_datetime(joined.apply(lambda x: Week(
    x.Promo2SinceYear, x.Promo2SinceWeek).monday(), axis=1).astype(pd.datetime))
joined["Promo2Days"] = joined.Date.subtract(joined["Promo2Since"]).dt.days
joined.loc[joined.Promo2Days<0, "Promo2Days"] = 0
joined.loc[joined.Promo2SinceYear<1990, "Promo2Days"] = 0

joined["Promo2Weeks"] = joined["Promo2Days"]//7
joined.loc[joined.Promo2Weeks<0, "Promo2Weeks"] = 0
joined.loc[joined.Promo2Weeks>25, "Promo2Weeks"] = 25
joined.Promo2Weeks.unique()

joined.to_feather(f'{PATH}joined')

6. Separate categorical and continuous columns. Make sure to identify as many columns as possible as categorical. Unless a column clearly holds floating-point values, we can make it categorical.

cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen',
            'Promo2Weeks', 'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
            'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw',
            'SchoolHoliday_fw', 'SchoolHoliday_bw']

contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
               'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h',
               'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE',
               'AfterStateHoliday', 'BeforeStateHoliday', 'Promo', 'SchoolHoliday']

n = len(joined); n

The following lines of code convert the columns listed as categorical into the pandas "category" type. The continuous columns are given the 'float32' type, which is the standard float type for PyTorch.

In the last line of code, all the categorical columns, continuous columns, the dependent variable (Sales) column and the date-time column are put together into a single DataFrame (joined).

for v in cat_vars: joined[v] = joined[v].astype('category').cat.as_ordered()
for v in contin_vars: joined[v] = joined[v].astype('float32')

dep = 'Sales'
joined = joined[cat_vars+contin_vars+[dep, 'Date']]

7. Sample a subset (150,000 rows) of the training data and apply "proc_df" to the subset.

In the following three lines of code, we sample a subset (150,000 rows) of the final DataFrame. We set "Date" as the index of the DataFrame; the reason is that we will separate the train and validation sets based on time rather than by random selection.

idxs = get_cv_idxs(n, val_pct=150000/n)
joined_samp = joined.iloc[idxs].set_index("Date")
samp_size = len(joined_samp); samp_size

"proc_df" does the following things:

  1. Pulls out the target (y, here Sales) and deletes it from the original DataFrame.
  2. Scales the DataFrame (do_scale=True).
  3. Creates a mapper object that keeps track of the mean and standard deviation of each column, so the same transformation can be applied to the test set.
  4. Handles missing values, filling them with the median (a rough sketch of the idea follows the code below).

df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
yl = np.log(y)
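For intuition, the essence of proc_df can be sketched with plain pandas. This is a simplification under assumptions, not the fast.ai source: the real function also adds _na indicator columns and returns the nas and mapper objects so the test set can be processed consistently.

import pandas as pd

def proc_df_sketch(df, y_fld, do_scale=True):
    df = df.copy()
    y = df[y_fld].values
    df.drop(y_fld, axis=1, inplace=True)              # 1. pull out the target and delete it
    for col in df.columns:
        if df[col].dtype.name == 'category':          # categorical columns become integer codes (+1, so 0 means "missing")
            df[col] = df[col].cat.codes + 1
        else:
            if df[col].isnull().any():                # 4. fill missing continuous values with the median
                df[col] = df[col].fillna(df[col].median())
            if do_scale:                              # 2./3. standardize using the column's mean and std
                df[col] = (df[col] - df[col].mean()) / df[col].std()
    return df, y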

8. Separate the sampled data into train and validation sets.

The train and validation sets are separated, with 75% of the data taken for training. The last 25% of the time-indexed data is taken as the validation set, in the last line of the code below.

For any dataset where date/time is an important column, it is better to split train and validation into chunks of earlier and later time. Even the future predictions, which are like a test set, arrive as a chunk of data that follows the training data in time.

Random selection is not a good practice here, unlike in other applications.

train_ratio = 0.75
train_size = int(samp_size * train_ratio); train_size
val_idx = list(range(train_size, len(df)))

9. Create embeddings for the categorical variables.

Categorical columns are identified from the "cat_vars" list. Each categorical variable represents its values with an embedding vector. The embedding dimension is different for each categorical variable and is based on the cardinality of that variable.

Cardinality is the number of levels or possible values a categorical variable has. For example, the day-of-week variable has a cardinality of 7 and gets an embedding vector of length 4.

The second line of code below uses floor division to decide the vector length, capped at a maximum of 50.

cat_sz = [(c, len(joined_samp[c].cat.categories)+1) for c in cat_vars]
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]
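To make the sizing rule concrete, here is the arithmetic for the day-of-week example (7 observed values, plus the extra slot fast.ai reserves for unknown/missing categories):

cardinality = 7 + 1                       # 7 days plus one "unknown" category
emb_dim = min(50, (cardinality + 1)//2)   # floor division, capped at 50
print(emb_dim)                            # 4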

Traditionally, one-hot encodings are used to represent categorical variables, but they have two major disadvantages. First, a very long vector is needed to represent each value. Second, the representation is not rich, in the sense that it does not capture any characteristics of the value it represents or the relationships between different values of the category.
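The contrast is easy to see in a few lines of PyTorch. This is only an illustration of the idea, not the fast.ai internals; the category index and sizes reuse the day-of-week example from above.

import torch
import torch.nn as nn

n_categories, emb_dim = 8, 4                  # cardinality (incl. unknown) and embedding length

one_hot = torch.eye(n_categories)[3]          # one day as a sparse 8-dim 0/1 vector
emb = nn.Embedding(n_categories, emb_dim)     # learned dense representation
dense = emb(torch.tensor([3]))                # the same day as a trainable 4-dim vector

print(one_hot)   # fixed pattern, carries no notion of similarity between days
print(dense)     # 4 learned numbers that can encode relationships between days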

10. Now the data is ready to be given as input to the DL model.

Now we have a DataFrame (df) and target values (yl), with a clear indication of which columns are categorical and which are continuous. With this information, we create a model data object 'md', as seen in the line of code below. This is fast.ai functionality, a wrapper written on top of PyTorch.

This is one of the three lines of code that are all that is needed to implement the DL model. Of the other two, one creates a learner from the model data object and the other is the fit call that trains the model on the given data.

md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl, cat_flds=cat_vars, bs=128)

To visualize the inputs, each continuous variable is fed directly to the model and takes a single input neuron, while each categorical variable takes as many input neurons as the length of its embedding vector.

The following figure shows this clearly: each continuous variable takes a single input neuron, but each categorical variable takes a number of neurons equal to the length of its embedding vector.

This line of code creates the model (learner) object. Its parameters are the embedding sizes (emb_szs), the number of continuous variables, the dropout applied to the embedding layer (0.04), the number of outputs (1), the sizes of the hidden layers [1000, 500], the dropout per hidden layer [0.001, 0.01] and the allowed range of the target (y_range).

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)

The final step is the following, which fits the model to the data and starts training. It takes the learning rate "lr", the number of epochs and the metric to report. The metric used here is RMSPE (Root Mean Square Percentage Error), which is the metric used for the Kaggle scoring.

def inv_y(a): return np.exp(a)

def exp_rmspe(y_pred, targ):
    targ = inv_y(targ)
    pct_var = (targ - inv_y(y_pred))/targ
    return math.sqrt((pct_var**2).mean())

max_log_y = np.max(yl)
y_range = (0, max_log_y*1.2)

m.fit(lr, 3, metrics=[exp_rmspe])

Important functions used for preprocessing and fast.ai internal functions:

In this section, I will give each function's definition, show how it is used in the notebook and add some explanation of what is going on.

  1. join_df

Refer to point number 2 above for a detailed explanation of this function.

2. add_datepart

Refer to point number 4 above for a detailed explanation of this function.

3. get_elapsed

This function takes a field name "fld" and another parameter with two options, "Before" and "After". Depending on the option selected, it computes how many days have passed since the event in "fld" last occurred, or how many days remain until it next occurs. It uses the Date column to get this information.

def get_elapsed(fld, pre):
    day1 = np.timedelta64(1, 'D')
    last_date = np.datetime64()
    last_store = 0
    res = []
    for s,v,d in zip(df.Store.values, df[fld].values, df.Date.values):
        if s != last_store:
            last_date = np.datetime64()
            last_store = s
        if v: last_date = d
        res.append(((d-last_date).astype('timedelta64[D]') / day1).astype(int))
    df[pre+fld] = res

Refer to point number 4 for a detailed explanation, alongside add_datepart, the other function that uses the date column to extract additional categorical columns.

columns = ["Date", "Store", "Promo", "StateHoliday", "SchoolHoliday"]
df = train[columns]

fld = 'SchoolHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

fld = 'StateHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

fld = 'Promo'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

In the lines of code above, "columns" selects the columns for which we measure durations using the Date column. The repeated get_elapsed calls create additional columns whose names are formed by prepending "Before" or "After" to the selected column names; the double for loop below then fills their missing values and casts them to integers.

Two pandas DataFrames, "bwd" and "fwd", are created for backward in time and forward in time. The "df" DataFrame is grouped by store, and every window of 7 rows is summed as a rolling window, once backward in time and once forward in time; the results are saved as the new DataFrames "bwd" and "fwd" respectively.

df = df.set_index("Date")
columns = ['SchoolHoliday', 'StateHoliday', 'Promo']

for o in ['Before', 'After']:
    for p in columns:
        a = o+p
        df[a] = df[a].fillna(0).astype(int)

bwd = df[['Store']+columns].sort_index().groupby("Store").rolling(7, min_periods=1).sum()
fwd = df[['Store']+columns].sort_index(ascending=False
      ).groupby("Store").rolling(7, min_periods=1).sum()

Both the "bwd" and "fwd" DataFrames are merged back into the original DataFrame "df". What we essentially did was create new columns that carry additional information about the time duration before and after some important events, both forward and backward in time. All this information is added back to the original DataFrame "df" to enrich the feature space of the dataset.

In the last line of code below, we are dropping the originally selected columns.

bwd.drop('Store', 1, inplace=True)
bwd.reset_index(inplace=True)
fwd.drop('Store', 1, inplace=True)
fwd.reset_index(inplace=True)
df.reset_index(inplace=True)

df = df.merge(bwd, 'left', ['Date', 'Store'], suffixes=['', '_bw'])
df = df.merge(fwd, 'left', ['Date', 'Store'], suffixes=['', '_fw'])

df.drop(columns, 1, inplace=True)
df.head()

4. apply_cats

Changes any columns of strings in df into categorical variables using trn as a template for the category codes.
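A minimal sketch of the idea, under the assumption that trn is the training DataFrame whose categories we want to reuse (a simplification, not the fast.ai source):

import pandas as pd

def apply_cats_sketch(df, trn):
    # Reuse trn's category definitions on df so that the integer category codes line up.
    for col in df.columns:
        if col in trn.columns and trn[col].dtype.name == 'category':
            df[col] = pd.Categorical(df[col],
                                     categories=trn[col].cat.categories,
                                     ordered=True)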

5. get_cv_idxs

Refer to point number 7 for a detailed explanation of this function.

By default this function samples 20% of the rows of the DataFrame; here val_pct is set to 150000/n so that 150,000 rows are sampled. The sampled data is used as input to the DL model.
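The underlying idea is a seeded random permutation of the row indices, from which the first val_pct fraction is kept. A minimal sketch (an approximation, not the fast.ai source):

import numpy as np

def get_cv_idxs_sketch(n, val_pct=0.2, seed=42):
    np.random.seed(seed)
    n_val = int(val_pct * n)
    return np.random.permutation(n)[:n_val]

# usage mirroring the notebook: idxs = get_cv_idxs_sketch(n, val_pct=150000/n)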

6. proc_df

Refer to point number 7 for a detailed explanation of this function.

7. ColumnarModelData.from_data_frame

8. md.get_learner

9. m.fit

The three lines of code in points 7, 8 and 9 above are the actual DL model implementation. fast.ai makes life easy for people who want to apply DL models to their own datasets; it really is as simple as these three lines of code.
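For reference, here are those three calls gathered from the code shown earlier. The learning rate lr is not set anywhere in the snippets above; the notebook chooses it with the learning rate finder, and a value on the order of 1e-3 is assumed here.

md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl, cat_flds=cat_vars, bs=128)
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)

lr = 1e-3  # assumed value; in practice pick it with m.lr_find()
m.fit(lr, 3, metrics=[exp_rmspe])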

But for a serious DL practitioner, life only starts here: we should take a deep dive into what is going on beneath these three lines of code, the parameters passed to these objects and the internals of the other supporting functions.

The effort I made here is a simple one: reading the code and understanding each line from the parameters passed down to the lower-level functions and objects. Understanding the inner workings of the objects used here helped me understand the code much better, but I could not capture all of that understanding in this blog.

With the emergence of DL, we are seeing field after field being overwhelmed by DL techniques replacing traditional ML algorithms. The major reasons this is happening are the following:

  1. DL models can approximate much more complex patterns than traditional ML models.
  2. DL models are good at finding hidden patterns in the data and can figure out feature importance by themselves, with little or no preprocessing.
  3. With the advent of GPU and TPU compute, along with a huge data footprint, it is no surprise that DL is replacing other ML models.

Computer Vision was the first field to be taken over by DL, and DL has now made big strides in NLP as well. For structured, columnar data, DL is a recent entrant; most models built here are still based on traditional ML techniques like Random Forests, Logistic Regression, SVMs and so on. For DL on structured data, there are no standard models that can be applied to raw datasets, which is why we had to do a lot of preprocessing in the above implementation. As time progresses, better DL models will emerge that avoid most of this preprocessing. This is also an opportunity for people in the community to make use of an early start in this area.

Even though DL techniques have proved to be much more accurate on structured data, the lack of explainability, or more precisely of feature explainability, in DL models is a big challenge. In applications like financial data, health records and banking data, there is a need for feature explainability, and the models cannot be treated as black boxes.

At this stage, DL models are seen as black boxes, lacking explainability in feature selection and feature importance. Even experienced engineers and scientists do a lot of hyperparameter tuning without clear logic. Although a lot of effort is being made in this direction, we are not yet at a point where subject matter experts and customers are in a position to trust us.

I have been thinking of writing an article for a very long time, at least since I attended Jeremy Howard's fast.ai Deep Learning Part 2 course in May 2018. With a lot of hesitation, difficulty and a push from a good friend of mine, I am finally writing this article. I know it might have a lot of mistakes and you might have difficulty following it as I intended; there is a lot of scope for improvement. I will be more than happy to answer questions and take suggestions to improve the article.

Thanks to Jeremy Howard for teaching the course and for being a great source of inspiration and motivation.

I have taken inspiration, content and style from the following sources:

  1. https://github.com/fastai/fastai/blob/master/courses/dl1/lesson3-rossman.ipynb
  2. http://forums.fast.ai/t/wiki-lesson-4/9402
  3. http://forums.fast.ai/t/deeplearning-lec4notes/8146
  4. https://medium.com/@hiromi_suenaga/deep-learning-2-part-1-lesson-4-2048a26d58aa
  5. https://www.youtube.com/watch?v=gbceqO8PpBg&feature=youtu.be
