Finding Data Block Nirvana (a journey through the fastai data block API) — Part 2

Published in Analytics Vidhya, Oct 31, 2019

This article describes how to train the custom fast.ai ItemList (and other custom DataBlock API bits) we built in part 1 of this series. If you haven’t done so already, make sure you read the first article and can run the companion code located here in the yelp-00 notebook.

The code for part 2 is here (see the yelp-01 notebook).

What I said here in the first article bears repeating:

I am a firm believer that one will learn more about using this framework by first reading and running the associated code, and then coding it themselves by hand (no copy&paste), than by any other means. This doesn’t mean reading the docs, highlighting and underlining important concepts isn’t important (believe me I do more than my fair share of this), only that for it to take a solid hold in your brain you have to do it. So get the code, run the code, and love the code. Utilizing it and the contents of this article, make it your goal to demonstrate individual understanding by coding everything up yourself.

With that said, let me highlight some of the more interesting parts of the part 2 code.

Keep things DRY

I moved all the DataBlock API related code into the utils.py file, and I import everything from it at the top of the notebook. Since I’m likely to want to reuse this code elsewhere, it’s good to remember one of the golden maxims of programming: Don’t Repeat Yourself.

Fixes to the DataBlock API bits

When I first attempted training my model, I noticed that the tensors weren’t being grouped correctly in the mixed_tabular_pad_collate function. I fixed that in both the part 1 notebook and the utils.py file, but I also left the old, wrong code in the part 1 file so you can review what the output should, and should not, look like. I recommend you run the yelp-00-custom-itemlist notebook with both the corrected and the previous version to see the difference yourself.

Fine-tuning the LM

Since we are working with text, I figured it made sense to fine-tune the AWD LSTM based ULMFiT model using the target text available in our dataset. See the LM Fine-tuning section of the notebook. I illustrate the basic steps required to do this, and I’m sure it’s one of the many places where improvements can be made.

Building the MixedTabular DataBunch

Remember all that code we wrote in part 1? Well, all that hard work has made it as simple as this to actually use it in our modeling task at hand.

data_cls = (MixedTabularList.from_df(
                            train_df, cat_cols, cont_cols, txt_cols,
                            vocab=data_lm.train_ds.vocab, 
                            procs=procs, path=PATH)
          .split_by_rand_pct(valid_pct=0.1, seed=42)
          .label_from_df(dep_var)
          .databunch(bs=32))

This should look very familiar to anyone using the fast.ai framework. Notice how we are using the vocab from our fine-tuned language model above.

TabularTextNN

How do we use this DataBunch? Well, I’m sure there are better ways than the one I present here, but I was able to get decent results by simply reusing the models created by the tabular_learner and text_classifier_learner fast.ai Learners. I believe the approach is novel (at least I haven’t seen it anywhere else) and can likely be improved upon.

As for the configuration required by both learners above, I decided to use a simple dictionary for each to make experimentation easy. See the respective tabular_args and text_args variables declared just above the definition of the TabularTextNN module.

The module’s init is where all the interesting things are:

def __init__(self, data, tab_layers, tab_args={}, text_args={}):
        super().__init__()
        
        # tabular model, minus its final (label-predicting) linear layer
        tab_learner = tabular_learner(data, tab_layers, **tab_args)
        tab_learner.model.layers = tab_learner.model.layers[:-1]
        self.tabular_model = tab_learner.model
        
        # text classifier: load the fine-tuned encoder and keep only it
        text_class_learner = text_classifier_learner(data, AWD_LSTM, 
                                                     **text_args)
        text_class_learner.load_encoder('lm_ft_enc')
        self.text_enc_model = list(text_class_learner.model.children())[0]
        
        self.bn_concat = nn.BatchNorm1d(400*3+100)
        
        self.lin = nn.Linear(400*3+100, 50)
        self.final_lin = nn.Linear(50, data.c)

If you look at the model returned by the tabular learner here, you’ll see that the last layer is a linear layer that outputs one value per label the model needs to predict. As we’re going to be merging the outputs of this model with the text outputs before getting the probabilities for our labels, we just chop it off by setting the model layers equal to tab_learner.model.layers[:-1].

Similarly, we only need the text classification learner’s encoder for our purposes here, and so we drop the PoolingLinearClassifier head by keeping only the first child via list(text_class_learner.model.children())[0]. Learning how to manipulate PyTorch models as I have here is extremely helpful, and I’ve included a few resources below that were instructive for me.
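To make the two bits of model surgery above concrete, here is a minimal sketch in plain PyTorch. ToyTabular and toy_classifier are hypothetical stand-ins I made up for illustration (they are not the fast.ai models), and the sizes are arbitrary assumptions. The point is only to show what slicing the layers and indexing into children() does.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a tabular model: hidden layers followed by a
# final linear layer that maps to the label count (5 labels is an assumption).
class ToyTabular(nn.Module):
    def __init__(self, n_in=8, n_hidden=100, n_labels=5):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_in, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_labels),  # the layer we want to drop
        )

    def forward(self, x):
        return self.layers(x)

tab = ToyTabular()
x = torch.randn(4, 8)
out_before = tab(x)   # shape (4, 5): one score per label

# Slicing an nn.Sequential returns a new nn.Sequential without the last layer.
tab.layers = tab.layers[:-1]
out_after = tab(x)    # shape (4, 100): the hidden features

# Hypothetical stand-in for a classifier built as Sequential(encoder, head).
toy_classifier = nn.Sequential(
    nn.Linear(8, 16),   # plays the role of the encoder
    nn.Linear(16, 5),   # plays the role of the classification head
)

# children() yields the direct sub-modules in order; index 0 is the "encoder".
encoder = list(toy_classifier.children())[0]
enc_out = encoder(x)  # shape (4, 16)
```

After the slice, the toy tabular model emits its 100 hidden features instead of label scores, which is the kind of output you want when merging with another model.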

The final step in our forward() function is to concatenate the results from both models, run them through a batch normalization layer and a couple of linear layers to get our predicted values. Notice how I also employ the concat pooling trick used in fast.ai to take advantage of all the information returned by the text encoder.
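As a rough sketch of that last step (not the notebook’s actual forward()), the toy module below concat-pools a stand-in for the encoder output and merges it with stand-in tabular features. It assumes the encoder emits 400-wide hidden states and the chopped tabular model emits 100 features, which matches the 400*3+100 in the init above. ToyHead, the ReLU between the linear layers, and the 5 labels are my assumptions, and unlike fast.ai’s own pooling this version ignores padding.

```python
import torch
import torch.nn as nn

class ToyHead(nn.Module):
    """Hypothetical head: merge tabular features with concat-pooled text features."""
    def __init__(self, enc_sz=400, tab_sz=100, n_labels=5):
        super().__init__()
        self.bn_concat = nn.BatchNorm1d(enc_sz * 3 + tab_sz)
        self.lin = nn.Linear(enc_sz * 3 + tab_sz, 50)
        self.final_lin = nn.Linear(50, n_labels)

    def forward(self, tab_out, enc_out):
        # enc_out: (batch, seq_len, enc_sz) hidden states from a text encoder
        last = enc_out[:, -1]              # hidden state at the final time step
        max_pool = enc_out.max(dim=1)[0]   # max over the time dimension
        avg_pool = enc_out.mean(dim=1)     # mean over the time dimension
        # concat pooling: three enc_sz-wide views of the same sequence
        text_feats = torch.cat([last, max_pool, avg_pool], dim=1)

        x = torch.cat([tab_out, text_feats], dim=1)  # (batch, enc_sz*3 + tab_sz)
        x = self.bn_concat(x)
        x = torch.relu(self.lin(x))        # the ReLU is an assumption
        return self.final_lin(x)

head = ToyHead()
tab_out = torch.randn(4, 100)       # stand-in for the tabular model output
enc_out = torch.randn(4, 12, 400)   # stand-in for the text encoder output
logits = head(tab_out, enc_out)     # shape (4, 5)
```

The factor of 3 comes from concatenating the last, max-pooled, and average-pooled hidden states, each 400 wide under these assumptions.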

Training

Guess what? You train this just like any other fast.ai model. That means there really isn’t anything new to learn here. You can just stick this model in a Learner like so:

model = TabularTextNN(data_cls, tab_layers, tabular_args, text_args)
learn = Learner(data_cls, model, metrics=[accuracy])

Nice, huh?

Next Steps

Can you beat my best accuracy of .673?

I’m happy to accept any and all pull requests with your own notebooks that utilize the MixedTabular ItemList to improve upon my results (and they can surely be improved upon). Maybe some of you would like to submit a notebook with some EDA work? Or maybe a notebook that demonstrates a solid approach in determining what features should, and shouldn’t, be included based on feature importance? Or maybe someone can illustrate something helpful in the way of feature engineering and/or data augmentation that helps to improve upon my results?

Either way, I’d love to see some work from those in the community who have benefited from these articles … work that can in turn benefit me and everyone else. That’s my challenge to you.

As always, I hope the two articles benefit you in your work, and feel free to follow me on Twitter at @wgilliam. Over the next couple of months I’ll be publishing a series of articles showing you how to do all this with the upcoming version 2 of the fast.ai framework, so stay tuned.