Bridging the gap between online courses and Kaggle: experience from Jigsaw Unintended Bias in Toxicity Classification

Don Don · Published in The Startup · Jun 27, 2020 · 8 min read


Photo by Fab Lentz on Unsplash

It’s easy to find all kinds of online courses about machine learning and deep learning nowadays. But after finishing them, many people still find it hard to start real work and are unsure how to apply what they learned. The best way to advance is to work on projects and participate in competitions like the ones hosted on Kaggle. However, the first competition can be very difficult. You don’t know what to do with the given dataset. You quickly run out of ideas to try because of the limits of your knowledge. Or maybe you waste a lot of time on fundamentals or hardware issues and don’t have enough time left to experiment with your ideas.

I recently finished my one-month modelling challenge with Jigsaw Unintended Bias in Toxicity Classification. The competition ended around a year ago, and even though it looks outdated, I still think it is worth checking out. Over time, some algorithms might lose popularity, but the importance of fundamental skills never fades, such as how to preprocess text or how to build a model with the library of your choice (TensorFlow or PyTorch).

Within a month, I picked up the PyTorch library, built different models such as LSTM and BERT, and learned about transfer learning, multi-task learning, various NLP tricks and much more. After a lot of trial and error of my own, I want to share my experiences and lessons with people who are also beginners in NLP, know not much beyond online courses, and want to start a project. I hope this blog helps new people bridge the gap between courses and projects as quickly as possible.

Three things made the strongest impression on me: apex, bucket sequencing and setting the random seed. I chose to talk about them because they are fundamental but also easy for a beginner to neglect. I don’t cover many NLP techniques here since there are already a lot of resources online.

Set up apex

This is the first thing you should do before writing any code. In this modelling challenge all my code is based on the PyTorch library. Ten days into the challenge, my model’s best result was only 0.66 on the competition metric. I knew I had an error in the code, since I was nowhere near the benchmark kernel. I troubleshot the problem by comparing with public kernels line by line, and in the end I found the magic piece of code:

MyModel, optimizer = amp.initialize(MyModel, optimizer, opt_level="O1", verbosity=0)

After this change, my model started making sense and the result jumped to around 0.9.

Instructions for setting up apex can be found in the official GitHub repository. Apex is a PyTorch extension with NVIDIA-maintained utilities to streamline mixed precision and distributed training. It uses mixed precision for model parameters to save memory and speed up training without losing much accuracy. The whole implementation is only three lines of code. Before the training loop:

from apex import amp
MyModel, optimizer = amp.initialize(MyModel, optimizer, opt_level="O1", verbosity=0)

And in the training loop, instead of calling ‘loss.backward()’ directly, do this:

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

This guards against possible underflow in the gradients by scaling the loss before the backward pass. It helps prevent infs/NaNs in the gradients and lets the model capture small gradient values, making mixed-precision training more reliable.
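To show how these pieces fit together, here is a minimal sketch of a training step with apex. The model, optimizer, loss and train_loader are placeholders, not my actual setup; only the amp.initialize and amp.scale_loss calls are the apex API described above.

import torch
from torch import nn
from apex import amp

# Placeholder model and optimizer; any PyTorch module works the same way.
model = nn.Linear(768, 1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

# Wrap the model and optimizer once, before the training loop.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1", verbosity=0)

criterion = nn.BCEWithLogitsLoss()

for inputs, targets in train_loader:  # train_loader is assumed to yield (float inputs, float targets)
    optimizer.zero_grad()
    loss = criterion(model(inputs.cuda()).squeeze(-1), targets.cuda())
    # Scale the loss so small gradients survive in half precision.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()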

Bucket Sequencing

This is a trick I learnt after going through some public kernels on Kaggle. It made quite an impression on me since it greatly reduced my training and inference time, and I wish I had known it earlier.

Neural networks are notorious for being time and resource consuming to train. Reducing that time is important because it allows you to experiment more and include more models in the final inference. One reason text models are slow is that traditionally all texts are padded to the same length. For example, suppose the universal length is set to 6; a batch can then have samples such as [1, 2, 3, 0, 0, 0] and [2, 4, 0, 0, 0, 0]. However, the padded part is not useful for training or inference, and getting rid of it saves time. This is where bucket sequencing helps. The small change bucket sequencing makes is that instead of padding to a universal maximum length (6 in the above example) set by yourself for all training data, padding is done according to the maximum length within each batch. Using the same example, the batch becomes [1, 2, 3] and [2, 4, 0]. This trick reduced my training time by 30% and my inference time by 80%. The details of the implementation are as follows; I will explain the code block by block.

def clip_to_max_len(batch):
    # Unstack the batch into its individual tensors.
    inputs, Target, Target_aux, Target_identity, weight, lengths = map(torch.stack, zip(*batch))
    # Longest text in this batch.
    max_len = torch.max(lengths).item()
    # Clip the padded inputs to the batch maximum length.
    return inputs[:, :max_len], Target, Target_aux, Target_identity, weight

def resort_index(ids_train, num_of_bucket, seed):
    num_of_element = int(len(ids_train) / num_of_bucket)
    ids_train_new = ids_train.copy()
    # Shuffle the length-sorted indices within each bucket.
    for i in range(num_of_bucket):
        copy = ids_train[i * num_of_element:(i + 1) * num_of_element].copy()
        random.Random(seed).shuffle(copy)
        ids_train_new[i * num_of_element:(i + 1) * num_of_element] = copy
    return ids_train_new

# Length of each text, assuming padding tokens (0) come at the end.
lengths = np.argmax(sequences == 0, axis=1)
lengths[lengths == 0] = sequences.shape[1]

train_dataset = data.TensorDataset(inputs_train, Target_train, Target_train_aux, Target_train_identity, weight_train, Lengths_train)

# Sort indices from shortest to longest text, then shuffle within buckets.
ids_train = lengths_train.argsort(kind="stable")
ids_train_new = resort_index(ids_train, t_config.num_of_bucket, s_config.seed)

train_loader = torch.utils.data.DataLoader(data.Subset(train_dataset, ids_train_new), batch_size=t_config.batch_size, collate_fn=clip_to_max_len, shuffle=False)

The first block of code defines the clip_to_max_len function. In its first line, the variables from inputs to lengths are loaded from the dataset; lengths holds the lengths of the texts in a batch. Then max_len is the maximum length in the batch, and the returned inputs are clipped according to max_len.

I will skip the second block for now and come back to it later. In the third and fourth blocks, the variable lengths is defined and loaded into the dataset together with the input texts and targets. lengths records the length of each input text. This way of computing lengths assumes the padding comes at the end of a sentence.

In the fifth block of code the texts are sorted. ids_train holds the indices of the input texts ordered from shortest to longest. Sorting is necessary, otherwise the maximum length in a batch is not guaranteed to be much smaller than the preset universal maximum length. For training I then divided the sorted indices ids_train into two buckets and shuffled the texts within each bucket. This adds randomness to training while keeping the benefit of shorter training time. The details are in the resort_index function in the second block of code. I only used the bucket idea during training; for inference you can simply go from the shortest text to the longest to save as much time as possible. The number of buckets cannot be too large, otherwise the training data becomes too ordered, its distribution drifts far from the actual data distribution, and training lacks the randomness needed to keep the model general. Also, for inference, remember to keep the old indices so that you can map each prediction back to the right row. I didn’t show this part of the code here, but it should not be too hard to figure out; a minimal sketch is included after the next paragraph.

The last block of code builds the dataloader that pads each batch to its own maximum length. data.Subset(train_dataset, ids_train_new) is the dataset reindexed by the shuffling within the two buckets. The collate_fn parameter is set to the clip_to_max_len function to clip the texts in each batch. shuffle is set to False since we don’t want to mess up the ordering given by ids_train_new.
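For completeness, here is a minimal sketch of the inference-side bookkeeping mentioned above: sort the test texts by length, predict batch by batch, then put the predictions back into the original row order. The names inputs_test, lengths_test, Lengths_test and model are placeholders assumed to be analogous to the training variables; this is not my exact code.

import numpy as np
import torch
from torch.utils import data

def clip_test_batch(batch):
    # Same idea as clip_to_max_len, but at inference there are only inputs and lengths.
    inputs, lengths = map(torch.stack, zip(*batch))
    max_len = torch.max(lengths).item()
    return inputs[:, :max_len]

# Sort test texts from shortest to longest (lengths_test computed as for training).
ids_test = lengths_test.argsort(kind="stable")
test_loader = data.DataLoader(
    data.Subset(data.TensorDataset(inputs_test, Lengths_test), ids_test),
    batch_size=512, collate_fn=clip_test_batch, shuffle=False)

model.eval()
preds_sorted = []
with torch.no_grad():
    for inputs in test_loader:
        preds_sorted.append(torch.sigmoid(model(inputs.cuda())).cpu().numpy())
preds_sorted = np.concatenate(preds_sorted).ravel()

# Put each prediction back at its original row: preds_sorted[j] belongs to row ids_test[j].
preds = np.empty_like(preds_sorted)
preds[ids_test] = preds_sorted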

Fix the seed. Tune it maybe

This is more of a standard procedure. Neural network training can be pretty random, so when you get a golden model you don’t want to lose it. To be able to reproduce the model later, it is important to fix the seed at the beginning. You can fix the seed with a few lines of code:

def seed_everything(SEED):
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed(SEED)
    torch.backends.cudnn.deterministic = True
    torch.cuda.set_rng_state(torch.manual_seed(SEED).get_state())

Another piece of advice regarding the seed: you can treat the random seed as a hyperparameter when you really run out of ideas. Changing the seed to produce a new model and using that model in a later blending step can give a better result.
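As an illustration of this idea, here is a minimal sketch of seed-based blending: train the same architecture under several seeds and average the predicted probabilities. The train_and_predict function and the seed values are placeholders for whatever training and inference pipeline you already have.

import numpy as np

def blend_over_seeds(train_and_predict, seeds):
    # train_and_predict is assumed to fix the seed (e.g. with seed_everything),
    # train one model, and return a 1-D array of test predictions.
    preds = [train_and_predict(seed) for seed in seeds]
    # Simple blend: average the predicted probabilities across seeds.
    return np.mean(preds, axis=0)

# Example usage with three different seeds (my_train_and_predict is hypothetical):
# blended = blend_over_seeds(my_train_and_predict, seeds=[42, 2019, 7777])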

Summary of my methods in the competition

This part is for people who are interested in working on this competition. I will list the things I tried and what I think were the most effective, to help you get an idea of how to improve your modelling results. The three points that gave the biggest boost to my final score are:

  1. Use more data. This is always the first thing to think about. I tried to feed in as much data as possible given the time limit I set for myself. Another route, which I didn’t try, is to bring in external data and think about how to utilize it for your task.
  2. Train BERT-based models for 2 epochs and LSTM models for 3 epochs. Even though the models can take a long time to train, don’t be afraid of throwing more time at them. I found that 2 epochs for BERT and 3 epochs for LSTM usually give the best result; after that the models start to overfit.
  3. Try different varieties of models and blend them. Since I only had one month, I only tried BERT-base-cased, BERT-base-uncased and LSTM models, then trained them with different hyperparameters. The main hyperparameter I tuned was the random seed. Different types of models have different inductive biases and capture different information in the data. Combined with different random seeds, all the models help the prediction generalize better and give better results.

Things I implemented but whose impact I am not sure about:

  1. Different methods of preprocessing, for example translating other languages, replacing slang and so on. My feeling is that this preprocessing matters more for LSTM models, since my LSTM vocabulary is built from public word embeddings that contain standard words. It is therefore important to make sure that the words appearing in a sentence can be recognized and found in public word embeddings such as fastText. I haven’t tried other types of embeddings myself, but I think they might not require such elaborate preprocessing.
  2. Tuning the weights of different samples. Personally I couldn’t figure out the best weight setting. I tried a lot, but the results were either worse or only slightly better (0.001) than the original. Maybe I didn’t find the right weights.
  3. Multi-task training. I tried classifying not only the toxicity of a comment but also its identity mentions and the ways in which it is offensive (all labels provided in the training set), as sketched after this list. But the improvement in the final result was not very significant.
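Here is a minimal sketch of what such a multi-task head could look like, assuming a shared encoder (BERT or LSTM) that produces a pooled feature vector. The layer sizes, names and equal loss weighting are illustrative assumptions, not my exact architecture.

import torch
from torch import nn

class MultiTaskHead(nn.Module):
    # Shared features feed three heads: main toxicity, auxiliary toxicity
    # subtypes, and identity mentions (sizes are illustrative).
    def __init__(self, hidden_size=768, num_aux=6, num_identity=9):
        super().__init__()
        self.toxicity = nn.Linear(hidden_size, 1)
        self.aux = nn.Linear(hidden_size, num_aux)
        self.identity = nn.Linear(hidden_size, num_identity)

    def forward(self, features):
        return self.toxicity(features), self.aux(features), self.identity(features)

# The three losses are combined into a single training objective.
head = MultiTaskHead()
criterion = nn.BCEWithLogitsLoss()
features = torch.randn(4, 768)  # stand-in for the encoder output of a batch of 4 texts
target, target_aux, target_identity = torch.rand(4, 1), torch.rand(4, 6), torch.rand(4, 9)
logit, logit_aux, logit_identity = head(features)
loss = (criterion(logit, target)
        + criterion(logit_aux, target_aux)
        + criterion(logit_identity, target_identity))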

None of these things worked out that well, but they are good practice and I learned a lot while setting them up. With all the above steps, my final score was 0.94276, ranking 145 out of 3165 on the private leaderboard.

Conclusion

In this blog I talked about three things a beginner should learn to do in an NLP task: apex, bucket sequencing and fixing the random seed. I also summarized what I did during the competition, and I hope it provides clear guidance for anyone who wants to work on this project. If you are also a beginner, I hope that after this blog you have a better idea of what to do, and that by following my advice and the instructions in Kaggle’s top solutions you will find these competitions less formidable and quickly gain confidence. My code can be found on GitHub.


Don Don is a physics PhD student interested in data science, machine learning, deep learning and AI. https://www.linkedin.com/in/yhtang-96833b1a8