Semantic Code Search Using Transformers and BERT- Part II: Converting Docstrings to Vectors

Shashank Ramesh
The Startup

--

Intro

This article is a continuation of Part I, linked below, which walked you through gathering the required data and preprocessing the raw data so that our models can learn from it.

In this article we continue from where we left off in Part I and discuss in detail the methodology used to convert function docstrings to vectors.

Converting the docstrings to vectors

The docstrings are converted to vectors using a pretrained ALBERT model which is fine-tuned on our data set.

ALBERT vs BERT

ALBERT, which stands for “A Lite Bidirectional Encoder Representations from Transformers (BERT)”, has a few advantages over BERT that are relevant to our task:

i) Parameter Sharing — Although ALBERT has the same number of encoder layers as BERT, the weights are shared across all of them, unlike BERT where each encoder has its own weights. You can think of ALBERT as a single encoder layer applied repeatedly to the word embeddings. Sharing the weights greatly reduces the number of unique parameters in the model, so it occupies far less GPU memory, and training and inference can be faster since one set of weights is reused instead of fetching a new set for every encoder. Parameter sharing also acts as a form of regularization that stabilizes training and helps with generalization.

ii) Smaller Word Embeddings — In ALBERT, the initial word embeddings are only 128-dimensional, about 1/6th the size of BERT's, which drastically reduces the number of parameters. The intuition is that these initial embeddings need to be general: the single embedding for the word “bank” has to serve the finance term bank, a river's bank, to bank on a person, and so on, regardless of the sentence being processed. Because the embedding layer produces such a general vector, it is left to the encoders to refine it into a context-dependent representation. This speeds up training, reduces the number of stored weights and helps the model learn better representations.

iii) Sentence Order Prediction — ALBERT trains on the harder task of Sentence Order Prediction: given two consecutive sentences, predict whether they appear in the correct order or have been swapped. BERT, in comparison, trains on Next Sentence Prediction, a binary classification of whether the second sentence actually follows the first. The authors of ALBERT argue that the latter is the easier task. For example,

Sentence 1:- My friend came home.

Sentence 2:- He helped with homework.

In these sentences, predicting that Sentence 2 follows Sentence 1 is relatively easy: both are in the past tense and “He” refers to some person, presumably the friend mentioned in Sentence 1. Predicting which sentence comes first and which follows is harder, because the model has to actually understand the statements; this is the task ALBERT is trained on. The authors show in the paper that this leads to better generalized performance.

iv) LAMB Optimizer — A small learning rate with small batch sizes tends to give the best model performance, since the gradients are not very noisy and learning proceeds in the right direction, but training is slow. With larger batches we use the GPU to its full potential and training times drop, because a larger batch size means fewer training iterations per epoch, which has to be compensated with a larger learning rate. Larger learning rates, however, tend to make training unstable due to noisy updates, which hurts the final model. LAMB addresses the unstable updates that come with large batches, so we can use larger batch sizes without a big hit to performance. More about LAMB in this link.

v) n-gram Masked Language Model — ALBERT is trained on a masked language modeling (MLM) task similar to BERT. The only difference is how it chooses which tokens to mask: BERT randomly picks 15% of all tokens to mask out, while ALBERT masks contiguous sequences of tokens (referred to as “n-grams”). Predicting a contiguous sequence of words is more challenging than predicting a single word. For example, in the sentence “The President Donald ______ has an important meeting in the White House”, it is very easy to predict “Trump”, compared to predicting “President Donald Trump” in “The ______ ______ ______ has an important meeting in the White House”. Solving the more difficult task leads to better generalized performance.

The points above show that ALBERT is faster to train, lighter on memory and trains on harder tasks than BERT, while giving comparable performance, which is why we chose it for this project.

Why a Pre-trained Model?

Using a pre-trained model saves us the time and resources of building an entire architecture from scratch and optimizing it for our task. Even though we have a large amount of data, it is nowhere near the billions of sentences ALBERT was trained on, so ALBERT's representations will be far superior because it has seen far more context. ALBERT is also trained so that it can be adapted to a wide variety of general tasks. Most importantly, our docstrings are plain English descriptions, which is exactly what ALBERT is trained on.

Why fine-tune our model?

Even though ALBERT is trained on a large English corpus, the data sets used for training are the Wikipedia corpus, a large collection of books and so on. Code description text is slightly different from the sentences in these documents: certain words carry special meanings that do not match their everyday English usage. For example, the statement “close the port and catch exceptions” means something completely different in colloquial English than it does in a programming context, because it uses programming jargon. To solve this problem we can fine-tune ALBERT, i.e. fine-tune the word embedding weights and the encoder weights, so that it understands these words and infers their meaning in a computer programming context.

How to fine-tune?

Now that we are convinced of why to fine-tune, the next question is how. To fine-tune the weights of ALBERT we could use a similar data set, such as a Stack Overflow questions data set, and train a model that uses pre-trained ALBERT as an embedder with a few dense layers on top for some classification task. An alternative is incremental training of the pre-trained ALBERT model with the same training objectives, i.e. n-gram Masked Language Modelling and Sentence Order Prediction, using the docstrings from our data set. The latter is more suitable for our task because it fine-tunes the weights on our own docstrings rather than on a merely similar data set.

For incremental training on a pre-trained ALBERT model we need to process our docstrings into a form the model can train on: we have to generate masks and create sentence pairs for the n-gram Masked Language Modelling and Sentence Order Prediction tasks, which is non-trivial. Thankfully, the team at Google that created ALBERT provides scripts to process custom data and carry out incremental training.

Fine-tuning ALBERT

We now begin the process of fine-tuning ALBERT to create richer representations of our docstrings.

Creating training data

A few steps before you begin training are —

i) You need to download a pre-trained ALBERT model from tensorflow-hub from this link. The link shared is for ALBERT-base, which is used for this project. You can also download other variants from the tensorflow-hub models page.

ii) Next, download two scripts from the ALBERT repository on GitHub: create_pretraining_data.py and run_pretraining.py.

iii) Write the docstrings into a text file.

Writing all docstrings in a .txt file
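
A minimal sketch of this step is given below. It assumes the preprocessed docstrings from Part I are available in a Python list named train_docstrings (a hypothetical name) and writes one docstring per line, with a blank line after each so the pretraining script treats every docstring as a separate document.

```python
# Sketch: dump the preprocessed docstrings to a plain .txt file.
# `train_docstrings` is assumed to be the list of docstrings prepared in Part I.
with open("docstrings.txt", "w", encoding="utf-8") as f:
    for docstring in train_docstrings:
        # one docstring per line, blank line between documents
        f.write(docstring.strip() + "\n\n")
```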

iv) Lastly, install the ‘albert-tensorflow’ module using pip, and we are ready for training.

Before you begin training, you need to run the create_pretraining_data.py script, which converts the text file into the format ALBERT can train on.

Creates training data for our ALBERT model
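
A sketch of the invocation is shown below; all paths are placeholders for wherever you keep the docstring file and the extracted tensorflow-hub model, and the flag values are only illustrative.

```python
import subprocess

# Sketch of running the ALBERT pretraining-data script; paths are placeholders.
subprocess.run([
    "python", "create_pretraining_data.py",
    "--input_file=docstrings.txt",                          # the .txt file of docstrings
    "--output_file=albert_pretrain_data.tfrecord",          # processed examples (tf-record)
    "--vocab_file=albert_base/assets/30k-clean.vocab",      # ALBERT vocabulary
    "--spm_model_file=albert_base/assets/30k-clean.model",  # SentencePiece model
    "--max_seq_length=128",
    "--max_predictions_per_seq=20",                         # roughly 0.15 * max_seq_length
], check=True)
```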

A few things to note about the above snippet: the input file must be in .txt format. The ‘spm_model_file’ field must contain the path to the ‘30k-clean.model’ file, found in the assets folder after extracting the ALBERT model file downloaded from tensorflow-hub. The ‘vocab_file’ field must contain ‘30k-clean.vocab’, the vocabulary of the pre-trained ALBERT model, also found in the assets folder. The ‘max_predictions_per_seq’ value is generally set to 0.15 times max_seq_length. The ‘output_file’ field is where the processed examples are written, in the form of a tf-record file.

Debugging tip: if you get errors running the script, try using tensorflow version 1.x just for this script and see if that fixes them.

A question which arises is: why use the same vocab file as the pre-trained ALBERT, and not one customised to the data set we are going to train on?

The argument has merit: using our own vocabulary would let ALBERT learn representations even for programming jargon, like ‘SQL’ or ‘csv’, that might not be present in its vocabulary. However, ALBERT's word embedding weights are trained to work with the words in its own vocabulary. To elaborate, every word has an id, which is simply the index of the word in the vocabulary, and text is passed to the model as ids, not words. ALBERT has already learnt an embedding for the word at each specific id. If we created a custom vocab we would break this ordering by assigning ids that do not match those in the ALBERT vocab, and we could no longer use the pre-trained model. Therefore, if we use the pre-trained weights we must use ALBERT's vocabulary; only if you train ALBERT from scratch can you use a vocabulary of your own. Nevertheless, much programming jargon consists of frequently used English words that simply mean something different in a programming context, so fine-tuning on ALBERT's 30,000-word vocabulary should still let the model understand them.
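
As a toy illustration of this point (with made-up vocabularies, not the real ALBERT one): a token's id is just its position in the vocabulary file, and the pre-trained embedding matrix is indexed by those positions, so swapping in a different vocabulary would make the ids point at rows trained for different words.

```python
# Toy illustration: ids are vocabulary positions, and the pre-trained embedding
# matrix rows were learnt for ALBERT's own positions, not a custom vocabulary's.
albert_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "the", "close", "port"]  # made-up
custom_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "sql", "csv", "close"]   # made-up

token = "close"
print(albert_vocab.index(token))  # 5 -> the embedding row ALBERT learnt for "close"
print(custom_vocab.index(token))  # 6 -> would fetch a row trained for a different word
```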

Run incremental training

Once we have generated the training data from the above process, we can begin training by running the run_pretraining.py script.

Executes the training process for the ALBERT model
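
The sketch below shows one way to launch the script; the paths, checkpoint name and step counts are placeholders, and the flags mirror the fields discussed in the text.

```python
import subprocess

# Sketch of launching incremental pretraining; paths and values are placeholders.
subprocess.run([
    "python", "run_pretraining.py",
    "--input_file=albert_pretrain_data.tfrecord",           # output of create_pretraining_data.py
    "--output_dir=albert_output/",                          # checkpoint files are written here
    "--init_checkpoint=albert_base/model.ckpt-best",        # pre-trained weights from tensorflow-hub
    "--albert_config_file=albert_base/albert_config.json",
    "--do_train=True",
    "--max_seq_length=128",                                 # must match create_pretraining_data.py
    "--max_predictions_per_seq=20",
    "--num_train_steps=1000000",
    "--save_checkpoints_steps=100000",
], check=True)
```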

About the inputs for this snippet: the ‘input_file’ field must be the output file generated by create_pretraining_data.py, and the ‘output_dir’ field is the folder where you want your output, the model checkpoint files, to be saved. Use the ‘init_checkpoint’ field if you want to continue training from a checkpoint. If you are training on your data for the first time and want to start from the pre-trained weights, provide the path to the checkpoint file inside the ALBERT model file downloaded from tensorflow-hub. Parameters that also appear in create_pretraining_data.py must contain the same values as before. The ‘num_train_steps’ field should be set to a hundred thousand steps or more. Use ‘save_checkpoints_steps’ to save your progress to checkpoint files after a certain number of training steps. After training you will get loss and accuracy values for the training tasks, as shown below. The metrics below are after training for a million steps.

Metric values after our training process

A masked language modelling accuracy above 35% shows that the model has a good understanding of the language of the data set. After training we have checkpoint files storing the weights of our model, which can be loaded back into the model so that it generates better representations for our docstrings.

Note: the same guidelines can be used to train any model in the BERT family on your custom data, with some additions depending on the model you are using.

Using our trained ALBERT model as a docstring vectorizer

The training process leaves us with a checkpoint file. Using this checkpoint we could load the weights into a tensorflow model and generate embeddings with it. To do so we would have to provide input in the format ALBERT expects: generate ids for the sentence from the vocabulary and create padding and segment masks for our inputs. Finally, we could import the ALBERT architecture from tensorflow-hub and fill in the weights from the checkpoint generated by run_pretraining.py. This, however, is the cumbersome way of doing things; instead, I'll walk you through an easier route using the HuggingFace library.

Converting Tensorflow Checkpoint file to Pytorch dump

To use the HuggingFace library with our trained ALBERT model, we need to convert our tensorflow checkpoint file to a pytorch dump. The snippet below does this.

Converting tensorflow checkpoint file to a pytorch dump
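
The sketch below uses the conversion utility that ships with the HuggingFace transformers repository (convert_albert_original_tf_checkpoint_to_pytorch.py); the import path shown is for older releases and may have moved in newer ones, and all file paths are placeholders.

```python
# Sketch: convert the tensorflow checkpoint to a pytorch dump.
# The module path may differ across transformers versions; paths are placeholders.
from transformers.convert_albert_original_tf_checkpoint_to_pytorch import (
    convert_tf_checkpoint_to_pytorch,
)

convert_tf_checkpoint_to_pytorch(
    tf_checkpoint_path="albert_output/model.ckpt-1000000",   # checkpoint from run_pretraining.py
    albert_config_file="albert_base/albert_config.json",     # config from the tensorflow-hub model
    pytorch_dump_path="albert_finetuned/pytorch_model.bin",  # keep the name pytorch_model.bin
)
```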

To the function ‘convert_tf_checkpoint_to_pytorch’ shown above we pass the path to the checkpoint file generated after training, the albert_config file from the tensorflow-hub model folder, and the output path where pytorch_model.bin will be saved. Note: do not change the name of the output file; it must remain pytorch_model.bin, otherwise you will run into errors.

Creating the trained ALBERT model

Now that we have our pytorch_model.bin file, a few more steps give us a functional model that can generate embeddings for our docstrings. The code below does this.

Before you run the snippet below, place the pytorch_model.bin file, the albert_config.json file (renamed to config.json) and the ALBERT vocab file, 30k-clean.vocab (renamed to vocab.txt), from the model folder downloaded from tensorflow-hub into the same folder, and provide the path to this model folder. Renaming the files is essential; without it you will face errors.

Realizing a functional ALBERT model
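
A minimal sketch of this step is given below, assuming the folder layout described above; the folder path and the ALBERT variant name are placeholders, and depending on your transformers version the tokenizer files it expects may differ slightly.

```python
from transformers import AlbertModel, AlbertTokenizer

# Placeholder: folder containing pytorch_model.bin, config.json and the vocab file.
model_dir = "albert_finetuned/"

# Downloads ALBERT's vocabulary and tokenizer; use the variant matching the
# tensorflow-hub model you fine-tuned (albert-base-v1 or albert-base-v2).
albert_tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")

# Builds the ALBERT architecture and fills it with our fine-tuned weights.
albert_model = AlbertModel.from_pretrained(model_dir, output_hidden_states=True)
albert_model.eval()  # we only use the model for inference
```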

Running the above snippet downloads the vocabulary and the tokenizer used by ALBERT into albert_tokenizer and creates the ALBERT model using our trained weights. The same methodology can be used to create any model in the HuggingFace transformers library from a tensorflow checkpoint.

Generating Embeddings

Now that our ALBERT model is ready, we can churn out some docstring vectors. It is as easy as encoding the input via the albert_tokenizer and then sending it as input to the model.

Encoding inputs using the ALBERT tokenizer
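
A sketch of this step, using a hypothetical docstring; albert_tokenizer and albert_model are the objects created in the previous snippet.

```python
import torch

docstring = "read a csv file into a pandas dataframe"  # hypothetical example

# encode() adds the [CLS] and [SEP] tokens and maps each token to its vocabulary id
input_ids = torch.tensor([albert_tokenizer.encode(docstring)])

with torch.no_grad():
    outputs = albert_model(input_ids)  # contains the arrays described below
```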

The output received contains 3 arrays -

The first array has dimensions (batch_size, input_seq_length, 768) and contains the vectors for each of the words/sub-words tokenized from the input sequence. These vectors are the outputs of the hidden units of the last layer of the ALBERT model.

ALBERT model architecture

The second array is a pooled output of size (batch_size, 768): the last-layer hidden state of the first token (the [CLS] token), further processed by a linear layer and a tanh activation function. The linear layer weights are trained on the sentence order prediction objective during pre-training.

The third array (obtained only if output_hidden_states is set to True while loading the config) contains the token vectors output by the embedding layer and by each of the 12 encoder layers. The first entry is the output of the embedding layer and the following entries are the outputs of the encoder layers in order, ending with the last encoder layer. Its dimensions are (13, batch_size, input_seq_len, 768).

According to the HuggingFace documentation, the pooled output is usually not a good summary of the semantic content of the input; it is better to average the sequence of token vectors output by the last encoder layer to get the sentence vector. The first array of the model output gives a 768-dimensional vector for every token in the docstring, and we need a single 768-dimensional sentence vector for our docstring, which we obtain by averaging all the token vectors.

Note that while tokenizing, the tokenizer adds a [CLS] token at the beginning of the input and a [SEP] token at the end. We do not include these two embeddings in the average when computing the sentence vector. The sentence vector is obtained by the code given below.

Generating sentence vectors for our docstrings
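
A sketch of the sentence-vector computation, under the same assumptions as the previous snippets (the helper name is hypothetical):

```python
import torch

def docstring_to_vector(docstring):
    """Average the last-layer token vectors, skipping the [CLS] and [SEP] tokens."""
    input_ids = torch.tensor([albert_tokenizer.encode(docstring)])
    with torch.no_grad():
        last_hidden_state = albert_model(input_ids)[0]  # shape (1, seq_len, 768)
    token_vectors = last_hidden_state[0, 1:-1, :]       # drop [CLS] and [SEP]
    return token_vectors.mean(dim=0)                    # single 768-dimensional vector
```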

The above snippet shows how to get the embedding for a single sentence. We can scale this to convert all the docstrings in the train set to vector representations, and then save them to a .tsv file so they can be fetched comfortably for later use.

Generating vectors for the train data set and saving them in a .tsv file
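
A sketch of scaling this up and writing the vectors to a .tsv file; train_docstrings is again the hypothetical list of preprocessed docstrings from Part I, and the output file name is a placeholder.

```python
import pandas as pd

# Vectorize every docstring in the train set (uses `docstring_to_vector` from above).
docstring_vectors = [docstring_to_vector(d).numpy() for d in train_docstrings]

# Save as a tab-separated file so the vectors can be loaded comfortably later.
pd.DataFrame(docstring_vectors).to_csv(
    "docstring_vectors.tsv", sep="\t", index=False, header=False
)
```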
