Stumble Upon Classification Challenge with Transformers: TensorFlow + Hugging Face

SUMEET SAWANT
Published in The Startup
Feb 6, 2021

NLP is a domain that is changing rapidly. The advent of the transformer architecture has brought NLP tasks much closer to human-level accuracy. I too wanted to explore this development, which is making machines parse text more like humans, but just reading research papers was not enough. So I tried to apply my newly gained skills to an old Kaggle competition.

The Transformer architecture was born out of the "Attention Is All You Need" paper from Google. A nice explanation of the Transformer can be found in this article.

Hugging Face’s transformers library builds and maintains the different kinds of transformer architectures and abstracts away most of the inner workings of the models. It also has many pretrained models available which we can fine-tune for our task.

Competition Background:

StumbleUpon is a user-curated web content discovery engine that recommends relevant, high quality pages and media to its users, based on their interests. While some pages we recommend, such as news articles or seasonal recipes, are only relevant for a short period of time, others maintain a timeless quality and can be recommended to users long after they are discovered. In other words, pages can either be classified as “ephemeral” or “evergreen”. The ratings we get from our community give us strong signals that a page may no longer be relevant — but what if we could make this distinction ahead of time? A high quality prediction of “ephemeral” or “evergreen” would greatly improve a recommendation system like ours.

Many people know evergreen content when they see it, but can an algorithm make the same determination without human intuition?

For a machine to classify this correctly is a big deal because, unlike humans, machines do not have any preconceived notion of language. For example, if I say "I flew from NY to LA", it is understood that I took a flight from NY to LA, but a machine working on this text segment won't have that idea. Humans assume facts that are never stated during a conversation.

The competition dataset had a lot of metadata columns which could have been used to classify the article along with the text column, but I decided to use just the text, together with the BERT architecture from Hugging Face’s transformers library and TensorFlow 2.0.

Various columns of the dataset
Category distribution

The above graph shows that each document category contains both evergreen and non-evergreen articles.

My Approach

I would like to highlight how I went about tackling this problem. The evaluation metric was ROC-AUC. I had decided on the Transformer architecture and specifically chose BERT (Bidirectional Encoder Representations from Transformers), but even then I could either use a completely pretrained model directly for prediction on this dataset, or train a model from scratch. I took the intermediate option: I used the embeddings (encoder layers) from the BERT architecture and fine-tuned a custom feed-forward neural network (decoder) on top of those embeddings. This method is also called transfer learning.

Using the transformers library, you can load the BERT model with its pretrained weights in a couple of lines of code.

Import BERT from Transformers
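A minimal sketch of that loading code is shown below; I am assuming the bert-base-uncased checkpoint here, but any other checkpoint name works the same way.

```python
from transformers import BertTokenizer, TFBertModel

# The model name string selects the architecture and pretrained checkpoint.
# Weights, vocabulary and tokenizer files are downloaded on first use.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = TFBertModel.from_pretrained("bert-base-uncased")
```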

You can choose any type of architecture by specifying its name in quotes; a detailed list can be found on the Hugging Face website. This downloads the model, its corresponding weights, the vocabulary and the corresponding tokenizer from an AWS S3 bucket.

The model code used for classification

The first two lines show the input format for BERT. It requires input_ids, which is the text tokenized (broken up into distinct tokens, lowercased and put through all pre-processing steps) and then converted into numbers using a pre-defined vocabulary. The other input it requires is the mask tokens.

Mask tokens help the model distinguish between actual tokens (words) and the padding that is added to make all inputs to the model the same length. Padding is done by adding zero tokens at the end of the sentence.

source: Jay Alammar

The above picture shows how the sentence "A visually stunning rumination on love" is pre-processed and converted into numbers; after this, if the length of the sentence is less than the required length, it is padded.

input_ids:[101,1037,17453,14726,19379,12758,2006,2293,102,0,0,0,0,0]

mask_tokens: [1,1,1,1,1,1,1,1,1,0,0,0,0,0]
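A short sketch of how the tokenizer produces these two arrays is shown below. The max_length of 14 is just for this illustration, and note that the transformers library calls the mask tokens the attention_mask.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize, add the special [CLS]/[SEP] tokens, map tokens to vocabulary ids
# and pad with zeros up to the fixed length
encoding = tokenizer(
    "A visually stunning rumination on love",
    padding="max_length",
    max_length=14,
)

print(encoding["input_ids"])       # token ids, zero-padded at the end
print(encoding["attention_mask"])  # 1 for real tokens, 0 for padding
```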

If you want to use the other columns of the dataset, you would have to create another input layer here.

After this, both input layers are fed into a pre-trained BERT model whose weights are frozen. This gives a vector representation of the input sequence, which is then fed into a feed-forward neural network.

We optimize the binary cross-entropy loss function using the Adam optimizer.
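Putting those pieces together, a minimal sketch of such a model in TensorFlow 2.0 could look like the following. The sequence length, hidden layer size, dropout rate and learning rate are illustrative assumptions rather than the exact values from my notebook, and train_ids, train_masks and train_labels are placeholders for the tokenized training data.

```python
import tensorflow as tf
from transformers import TFBertModel

MAX_LEN = 128  # assumed maximum sequence length

def build_model():
    # The two inputs described above: token ids and the padding mask
    input_ids = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

    # Pre-trained BERT encoder with frozen weights
    bert = TFBertModel.from_pretrained("bert-base-uncased")
    bert.trainable = False

    # Pooled [CLS] representation as a fixed-length embedding of the page text
    bert_output = bert(input_ids, attention_mask=attention_mask)
    pooled = bert_output.pooler_output

    # Feed-forward head (the "decoder") trained on top of the frozen embeddings
    x = tf.keras.layers.Dense(128, activation="relu")(pooled)
    x = tf.keras.layers.Dropout(0.2)(x)
    output = tf.keras.layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model

model = build_model()
# model.fit([train_ids, train_masks], train_labels, validation_split=0.2, epochs=20)
```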

Results

After running the model for just 20 iterations on a GPU, I was able to get a private leaderboard AUROC score of 0.85.

Conclusion

The Transformer is surely the state-of-the-art architecture, and libraries like Hugging Face's transformers are making it easy and straightforward to apply it to practical problems.

Future Work

More work can surely be done on this dataset: with more careful pre-processing of the text, a different transformer architecture, or by incorporating the other data columns, it should be possible to improve the AUROC score to 0.88.

Feel free to comment if you like the article.

LinkedIn: https://www.linkedin.com/in/sawantsumeet/

Kaggle Notebook : https://www.kaggle.com/sumeetsawant/stumble-upon-challenge-auc-private-lb-0-85
