Semantic Code Search Using Transformers and BERT - Part III: Converting Functions to Vectors & Deploying the Search Engine

Shashank Ramesh
The Startup

--

Intro

This article is a continuation of Part II, linked below, which took you through the process of training a BERT model to convert function docstrings into vectors.

In the current article we continue from where we left off in Part II and discuss in detail the methodology used to convert functions to vectors, build the search engine and, finally, deploy it using Flask.

Converting the functions to vectors

We have converted our docstrings into 768-dimensional vectors; next we need to convert the functions into 768-dimensional vectors such that the function vectors and the docstring vectors lie in a shared vector space.

What is a shared vector space?

If, on converting the functions to vectors, their values come out very similar to the vectors generated from their corresponding docstrings, we can say that the function vectors and the docstring vectors are in a shared vector space. Achieving this lets us interpret code and generate vectors for it as if we were generating vectors for its English description.

To understand the concept more clearly, suppose two gentlemen, one German and one French, want to declare their love to an English girl. Each approaches a translator, one to translate his German proposal into English and the other to translate from French, and in both cases the translators produce the phrase 'I love you' for him to tell the girl. Who the girl chooses is not our concern; let's instead see how this maps onto shared vector spaces. In the example, treat French as the docstring input and German as the code input, and the two translators as the BERT and Transformer models respectively. Each translator received input in a different language, yet both produced a similar phrase because the intent behind the inputs was alike. Similarly, if a function's behaviour and its docstring's description are similar, our models should output similar vectors for them. That is the concept of a shared vector space.

Now that we have understood what a shared vector space is, we can proceed to our next task: generating function vectors that share their vector space with the docstring vectors from the ALBERT model.

How will we use the Transformer model to generate function vectors?

The Transformer is widely popular for its state-of-the-art results in machine translation, so we decided to train a Transformer model to translate functions into their English descriptions (docstrings). After training, we remove the decoder from the Transformer architecture and use the trained encoder to produce an encoded representation of the function. This representation is then passed through LSTM and Dense layers to output a 768-dimensional vector that is trained to share its vector space with, and be similar to, the docstring vectors.

Translating functions into their English descriptions (docstrings) using the Transformer

The Transformer architecture and code we use for translating functions into docstrings is the one from the TensorFlow tutorial 'Transformer model for language understanding'. I made a few modifications to the tutorial's preparation of the training data set, and only those are discussed here.

To begin training we need to prepare and format our data so that the Transformer can train on it. The steps are -

i) Generating a vocabulary — We begin by generating two separate vocabularies, one from our function tokens and the other from the docstring tokens, and subsequently use these vocabularies to convert our function and docstring tokens into ids (a token's id being its index position in the vocabulary). We do not use the SubwordTextEncoder suggested in the tutorial; instead we use the BertWordPieceTokenizer, which in our analysis gave a better vocabulary, with fewer common words broken into subwords.

It is always good to use sub-word or word-piece tokenizers when creating a vocabulary from custom text, as they handle out-of-vocabulary words more gracefully by breaking them into sub-words, which can still convey what the unknown word means. For example, suppose our vocabulary contains the words bat and man but not the word batsman. A subword tokenizer would split batsman into 'bat', 's', 'man', which still helps the model understand what the unknown word batsman could have meant, as it understands the words bat and man, which are part of the vocabulary.

Using tokenizers to generate vocabulary and encode the input
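Since the embedded snippet is not reproduced here, below is a minimal sketch of how a vocabulary can be built with the BertWordPieceTokenizer from the `tokenizers` library; the file names and vocabulary size are assumptions (each file is expected to hold one whitespace-joined token sequence per line).

```
# Sketch: building two vocabularies with BertWordPieceTokenizer (assumed file names).
from tokenizers import BertWordPieceTokenizer

def build_tokenizer(corpus_file, vocab_size=2**13):
    tokenizer = BertWordPieceTokenizer(lowercase=True)
    tokenizer.train(
        files=[corpus_file],
        vocab_size=vocab_size,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    )
    return tokenizer

# One vocabulary for function tokens, one for docstring tokens.
tokenizer_func = build_tokenizer("train_function_tokens.txt")   # assumed path
tokenizer_doc = build_tokenizer("train_docstring_tokens.txt")   # assumed path

# Encoding returns ids already wrapped with [CLS] (id 2) and [SEP] (id 3).
print(tokenizer_func.encode("def add ( a , b ) : return a + b").ids)
```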

ii) encode function — Since we modified the tokenizer, we need to make a few changes to the encode function from the tutorial.

Modified encode function from that in the tutorial
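A minimal sketch of what the modified encode step can look like, assuming the tokenizers built above and a tf.data pipeline like the tutorial's (the function names here are illustrative, not the author's exact code):

```
# Sketch: encode a (function, docstring) pair into id sequences.
import tensorflow as tf

def encode(func, docstring):
    # BertWordPieceTokenizer already wraps the ids with [CLS] (2) and [SEP] (3),
    # so unlike the tutorial we do not prepend/append vocab_size-based tokens.
    func_ids = tokenizer_func.encode(func.numpy().decode("utf-8")).ids
    doc_ids = tokenizer_doc.encode(docstring.numpy().decode("utf-8")).ids
    return func_ids, doc_ids

def tf_encode(func, docstring):
    # Wrap the Python-level encode so it can be used inside dataset.map().
    func_ids, doc_ids = tf.py_function(encode, [func, docstring], [tf.int64, tf.int64])
    func_ids.set_shape([None])
    doc_ids.set_shape([None])
    return func_ids, doc_ids
```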

iii) filter_max_length function values — A graph was plotted of the sorted counts of function tokens per entry, from which a maximum length of 300 function tokens was chosen. The same process was repeated for the docstring tokens, giving a maximum length of 30.

iv) Do not shuffle the train data set — The last step of the preprocessing phase was to sort the data set by the count of function tokens, and we can now reap the benefits: when creating batches, padding tokens are added up to the length of the longest entry in each batch, so keeping similarly sized inputs together in a batch drastically reduces the padding added to every input. Shuffling the train data set would lose this benefit, giving varied lengths of function tokens in every batch and hence more padding tokens per batch. Therefore, we must not shuffle the data set.

v) Values of the start and end tokens — Since we used a different tokenizer, the start token in our case is '[CLS]' with id 2 and the end token is '[SEP]' with id 3, whereas the tokenizer used in the tutorial uses the last two ids of the vocabulary for these. We must replace these values wherever they are used. (A consolidated sketch covering steps iii to v is shown below.)
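A consolidated sketch of steps iii) to v): filtering by the chosen maximum lengths and batching the already sorted data set without shuffling, so similarly sized examples land in the same batch. The name `train_examples` (a tf.data.Dataset of raw function/docstring strings) and the batch size are assumptions.

```
# Sketch: dataset pipeline with the chosen max lengths and no shuffling.
MAX_FUNC_LEN, MAX_DOC_LEN = 300, 30
BATCH_SIZE = 64

def filter_max_length(func_ids, doc_ids,
                      max_func=MAX_FUNC_LEN, max_doc=MAX_DOC_LEN):
    return tf.logical_and(tf.size(func_ids) <= max_func,
                          tf.size(doc_ids) <= max_doc)

train_dataset = (train_examples                     # assumed dataset of raw strings
                 .map(tf_encode)
                 .filter(filter_max_length)
                 # no .shuffle() here: the data set is sorted by function length
                 .padded_batch(BATCH_SIZE, padded_shapes=([None], [None]))
                 .prefetch(tf.data.experimental.AUTOTUNE))

# Start/end token ids with this tokenizer, replacing the tutorial's values:
START_TOKEN_ID, END_TOKEN_ID = 2, 3   # '[CLS]' and '[SEP]'
```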

Once the above is done we can follow the tutorial for the rest of the steps and train the Transformer for about 10–15 epochs to learn to translate functions into their English descriptions (docstrings).

Debugging Tip:- If you run into out-of-memory errors while training, reduce your batch size.

Transformer translating a function into English

Examples of the output obtained from our trained Transformer model are shown below.

Translations made by our model

Using transformer to generate function vectors

After the above steps we have a trained Transformer that can take a function as input and translate it into English. We performed this task as a precursor to converting functions into 768-dimensional vectors: it initializes the encoder weights and teaches the encoders to understand functions in terms of English, which eases the task of converting functions to vectors in a shared vector space with the docstring vectors.

This style of training is very similar to how ALBERT itself is trained: you first train on a different set of tasks, such as masked language modelling and sentence-order prediction, to gain an understanding of the language being dealt with, and then fine-tune for specific purposes such as generating embeddings or classification. This is also what we did with our docstrings: we took a pretrained ALBERT (trained on masked language modelling and sentence-order prediction) and fine-tuned it further to generate custom embeddings for our docstrings.

Let's begin making our trained Transformer convert functions into vectors which, most importantly, share their vector space with the docstring vectors.

Generating the data set

We need to generate a data set of functions paired with their corresponding docstring vectors produced by the ALBERT model. With this we can train our model to map each function to a vector similar to its corresponding docstring vector, so that the function vectors and the docstring vectors end up in a shared vector space.

The main challenge with creating this data set is that each function has a 768-dimensional docstring vector and we have more than a million of them! Storing such a huge data set in a variable for processing and training is impossible on average systems with limited RAM. So how do we create this data set and train on it?

The answer lies in understanding how the training process works. In each step of training we draw one batch from the data set; these batches are what get sent through the model for the forward pass and the weight updates. So, at any given point during training, all we need is a single batch of training data, and once that batch is processed it can be discarded so that it doesn't take up memory and the next batch can be fetched. The entire data set never needs to sit in RAM. We can fetch the required docstring vectors from the text files on the fly during training and generate batches from them. This is exactly what the TensorFlow function 'tf.data.Dataset.from_generator' achieves: it executes the generator function provided as an argument to fetch batches on the fly during the training process.

Generator to read a file and create batches for input during training
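Below is a minimal sketch (not the author's exact snippet) of such a generator: it reads docstring vectors from a text file on the fly and yields one training pair at a time, while tf.data handles batching. The file format and names are assumptions: `docstring_vectors.txt` holds one comma-separated 768-dimensional vector per line, aligned with the rows of `function_token_ids` (an in-memory list of id lists).

```
# Sketch: generator feeding tf.data.Dataset.from_generator on the fly.
import numpy as np
import tensorflow as tf

def data_generator():
    with open("docstring_vectors.txt") as f:           # assumed file
        for i, line in enumerate(f):
            target = np.array([float(x) for x in line.strip().split(",")],
                              dtype=np.float32)        # 768-d docstring vector (ground truth)
            yield function_token_ids[i], target        # (input ids, target vector)

dataset = tf.data.Dataset.from_generator(
    data_generator,
    output_types=(tf.int64, tf.float32),
    output_shapes=(tf.TensorShape([None]), tf.TensorShape([768])),
)
dataset = dataset.padded_batch(64, padded_shapes=([None], [768]))
```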

Another problem I faced was that I had multiple docstring-vector files, since I split the docstrings in the train data into chunks and used both my computer and Google Colab to generate docstring vectors from the ALBERT model for different chunks in parallel. The snippet above reads only a single file and extracts data to form a batch. What if we have to read data from multiple files? The fileinput module comes to our rescue: it lets us read multiple files one after another and extract data from them sequentially.

Generator reading from multiple files and creating batches for input during training
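A sketch of the same generator extended to several vector files with the standard-library fileinput module, which chains the files transparently; the file names are assumptions and `function_token_ids` is the same assumed in-memory list as above.

```
# Sketch: one generator reading several docstring-vector files in sequence.
import fileinput
import numpy as np

VECTOR_FILES = ["docstring_vectors_part1.txt",   # assumed file names
                "docstring_vectors_part2.txt",
                "docstring_vectors_part3.txt"]

def multi_file_generator():
    with fileinput.input(files=VECTOR_FILES) as lines:
        for i, line in enumerate(lines):
            target = np.array([float(x) for x in line.strip().split(",")],
                              dtype=np.float32)
            yield function_token_ids[i], target
```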

One more thing to note is not to shuffle the data, as shuffling would slow things down given the use of a generator. Also, shuffling would lose the benefit of having a sorted data set: we would have varied input lengths in every batch, increasing the padding tokens added per batch.

Modifying the architecture

Now that we have made the necessary arrangements to fetch our training data, we need a model to process it. We build another Transformer just as before, but with a few modifications: we drop the decoder layers and instead feed the output of the Transformer's encoder layers into an LSTM layer, which passes its output to dense layers that finally output a 768-dimensional vector. Below is the architecture of our modified Transformer. The number of dense layers and the number of hidden units in the LSTM are hyper-parameter tuned using the validation set and Keras Tuner.

Modified Transformer to generate 768-dimensional vectors
Modified architecture of a transformer to generate function vectors
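A sketch of what this modified model can look like, assuming the `Encoder` class built earlier from the tutorial code; the layer sizes here are illustrative placeholders (the real values were chosen with Keras Tuner).

```
# Sketch: encoder-only Transformer followed by LSTM and dense layers.
import tensorflow as tf

class FunctionVectorizer(tf.keras.Model):
    def __init__(self, num_layers=4, d_model=128, num_heads=8, dff=512,
                 input_vocab_size=8192, lstm_units=256, rate=0.1):
        super().__init__()
        # Encoder stack from the translation Transformer (decoder dropped).
        self.encoder = Encoder(num_layers, d_model, num_heads, dff,
                               input_vocab_size, 10000, rate)
        self.lstm = tf.keras.layers.LSTM(lstm_units)
        self.dense = tf.keras.layers.Dense(512, activation="relu")
        self.out = tf.keras.layers.Dense(768)       # shared-space function vector

    def call(self, inp, training=False, enc_padding_mask=None):
        enc_output = self.encoder(inp, training, enc_padding_mask)  # (batch, seq, d_model)
        x = self.lstm(enc_output)                                    # (batch, lstm_units)
        x = self.dense(x)
        return self.out(x)                                           # (batch, 768)
```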

Choosing the loss function to minimize

Our model needs to learn, during training, by comparing the 768-dimensional vector it generates for the input function against the ground truth (the docstring vector generated by the ALBERT model) and computing a loss. This loss is back-propagated and the weights are updated. We cannot use cross-entropy loss for this purpose; instead we need a loss function that measures the similarity between the vector generated by the model and the ground truth. Optimizing this loss should increase the similarity between the function vectors and the docstring vectors, ensuring they lie in a shared vector space. The loss that fulfils these requirements is the cosine loss, based on cosine similarity.

Why Cosine Similarity?

Let's understand why cosine similarity fits the role. Suppose we have a word like 'current', which has many different meanings. A model learning a vector representation for such a word will be tugged in various directions by the varying contexts the word appears in, so the resultant vector will have a lower magnitude. Now consider a word similar in meaning to 'current', say 'electricity'. 'electricity' doesn't have many interpretations and is used in very similar contexts, so it will have a vector with a much larger magnitude, reflecting the model's confidence in that vector.

An example vector space for the word ‘current’

If we used a metric like Euclidean distance to find vectors similar to 'electricity', we would very likely not get 'current' among them, even though it is close in meaning. This is because Euclidean distance depends on the magnitudes of the vectors, and there is a huge difference in magnitude due to the difference in the model's confidence in the two vectors. The magnitude of a vector also depends on how often the word occurs in the data set: the model becomes more confident about a vector after seeing many occurrences of a word, so there is a further disparity in magnitude between under-represented and well-represented words. All of this suggests that magnitude should not be used to compare vectors, which leaves us with direction. Cosine similarity is nothing but the dot product of normalized vectors. The normalization removes the impact of magnitude and compares vectors by direction, more specifically by the cosine of the angle between them. This is a good measure of semantics, since words that are similar in meaning, i.e. used in similar contexts, point in similar directions, and optimizing the cosine loss gives us a more semantic way of measuring similarity between vectors.

We know that cosine similarity ranges from -1 to 1, with 1 denoting complete similarity between the vectors. All our optimizers work on gradient descent, which reduces the loss; here, reducing raw cosine similarity would push values towards -1, i.e. towards completely dissimilar vectors, which is not what we want. Keeping this in mind, the TensorFlow implementation of the cosine loss is the negative of the usual cosine-similarity formula, so -1 becomes the value for highly similar vectors and we can use gradient descent to reduce our loss.

TensorFlow has two implementations of the cosine loss: 'tf.keras.losses.CosineSimilarity' and 'tf.keras.losses.cosine_similarity'. They behave very similarly, but 'tf.keras.losses.CosineSimilarity' is preferred as it has an extra reduction parameter. The reduction parameter gives a single loss value for a batch by averaging the loss over the data points in the batch, unlike 'tf.keras.losses.cosine_similarity', which outputs an array of loss values, one per data point.

We can update the loss function in our transformer to use cosine similarity by using the below snippet of code.

Loss function code
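A minimal sketch of such a loss function, assuming a training loop like the tutorial's where `real` is the ALBERT docstring vector and `pred` is the model's function vector:

```
# Sketch: cosine loss with batch-level reduction.
import tensorflow as tf

loss_object = tf.keras.losses.CosineSimilarity(
    axis=-1,
    reduction=tf.keras.losses.Reduction.AUTO)   # averages the per-example losses

def loss_function(real, pred):
    # CosineSimilarity returns the *negative* cosine similarity, so a perfect
    # match gives -1 and gradient descent can minimise it directly.
    return loss_object(real, pred)
```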

Training the transformer to generate 768-dimensional function vectors

Now that we have the right loss function in place we can begin training. Before we do, we need to initialize the encoders in our modified Transformer with the weights of the encoders from the Transformer trained to translate functions into English.

Transferring weights of the encoders from the transformer we trained

Debugging Tip:- The above code can cause errors due to lazy building of the Transformer model. The Transformer we built is a sub-classed model, and for such models model.build() is only invoked when the model's __call__() method is executed; __call__() in turn calls build() and creates the model's weights. So if you face errors, run the model for an epoch (or on a single batch) first and then set the weights, after which you can continue with further training.

The training process first involves training without updating the weights of the encoder layers, i.e. freezing the encoder layers and updating only the LSTM and dense-layer weights. We run this training for about 10 epochs. Then all layers are made trainable and we train again for another 10 epochs. Finally, we have a trained Transformer which can be used to convert functions to vectors. A sketch of the weight transfer and this two-phase schedule is shown below.
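A minimal sketch of the weight transfer and two-phase schedule, assuming `translator` is the function-to-docstring Transformer trained earlier, `vector_model` is the FunctionVectorizer sketched above (with a matching encoder configuration), and `sample_batch`/`dataset` are the batch and dataset built earlier; all of these names are assumptions.

```
# Sketch: copy encoder weights, then train in two phases (freeze, then unfreeze).

# Sub-classed models build their weights lazily, so run one batch through the
# model first (see the debugging tip above) before copying weights.
_ = vector_model(sample_batch, training=False)

# Copy the trained encoder weights from the translation Transformer.
vector_model.encoder.set_weights(translator.encoder.get_weights())

# Phase 1: freeze the encoder, train only the LSTM and dense layers.
vector_model.encoder.trainable = False
vector_model.compile(optimizer="adam",
                     loss=tf.keras.losses.CosineSimilarity(axis=-1))
vector_model.fit(dataset, epochs=10)

# Phase 2: unfreeze everything and fine-tune end to end.
vector_model.encoder.trainable = True
vector_model.compile(optimizer="adam",
                     loss=tf.keras.losses.CosineSimilarity(axis=-1))
vector_model.fit(dataset, epochs=10)
```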

Creating the semantic search engine

It's time to put our model, trained to generate function vectors, to the test by searching through the generated vectors for the results a search query intends.

For this we will use the corpus of functions without docstrings, which we had set aside all along, and see whether our model can make sense of these functions and produce semantic search results. Before we test, we need to convert the functions without docstrings into vectors using our trained model, which can be done with the snippet below.

Generating and saving function vectors for functions without docstrings
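A sketch of vectorising the no-docstring corpus with the trained model; `tokenizer_func`, `vector_model` and the file names are assumptions carried over from the earlier sketches.

```
# Sketch: convert functions without docstrings into 768-d vectors and save them.
import numpy as np
import tensorflow as tf

def encode_function(code_tokens, max_len=300):
    return tokenizer_func.encode(code_tokens).ids[:max_len]

with open("functions_without_docstrings.txt") as f:      # assumed file
    functions = [line.strip() for line in f]

function_vectors = []
for start in range(0, len(functions), 256):               # small batches to fit in memory
    batch = [encode_function(fn) for fn in functions[start:start + 256]]
    batch = tf.keras.preprocessing.sequence.pad_sequences(batch, padding="post")
    function_vectors.append(vector_model(batch, training=False).numpy())

np.save("function_vectors.npy", np.vstack(function_vectors))   # (num_functions, 768)
```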

Now, with our function vectors generated, we are ready to search through them for the functions a query is after. To do so we use the Non-Metric Space Library (nmslib), which does the heavy lifting of constructing a scalable and efficient search infrastructure for us, as the code below shows.

Creating the search graph of our datapoints using nmslib
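A minimal sketch of building the HNSW index with nmslib over the saved vectors; the file names are assumptions.

```
# Sketch: build and persist the nmslib HNSW index over the function vectors.
import nmslib
import numpy as np

function_vectors = np.load("function_vectors.npy")        # (num_functions, 768)

search_index = nmslib.init(method="hnsw", space="cosinesimil")
search_index.addDataPointBatch(function_vectors)
search_index.createIndex({"post": 2}, print_progress=True)
search_index.saveIndex("code_search.nmslib")               # reloaded later with loadIndex
```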

How does nmslib find nearest neighbors?

The snippet above is highly abstracted code that creates the search framework we use to scan the vector space of functions for those semantically similar to a given query. But what happens behind the scenes; how does nmslib build this search framework? It builds a Hierarchical Navigable Small World (HNSW) graph from the function vectors, which is later used for searching.

Let's understand what Hierarchical Navigable Small World graphs are, how they make nearest-neighbor search efficient, and, intuitively, how nmslib builds them.

What is a Hierarchical Navigable Small World graph?

Before we understand HNSW we need to know a few concepts.

Small-world graphs are a type of mathematical graph in which the typical distance between two randomly chosen nodes (the number of steps required) grows proportionally to the logarithm of the number of nodes N in the network.

Example of a Small World Graph (Source: Wikipedia)

To find the node nearest to our query point we can take advantage of the small-world property to navigate quickly through the graph and find an approximate nearest neighbor efficiently.

Searching for the nearest neighbor using greedy approach (Source: Website)

A greedy approach is used: we choose a node in the graph at random (the entry point), compute the distance from the query point to each of its neighboring nodes, and find the neighbor nearest to the query. If that neighbor is closer to the query point than the chosen node, we move to it and repeat the process, finding a neighbor of that node nearest to the query and moving further in that direction. This continues until we reach a node none of whose neighbors is closer to the query point; that node is declared the nearest neighbor to the query. An example is shown in the image above, and this is the 'Navigable Small World' (NSW) approach to finding nearest neighbors. For a more accurate result we can repeat this process multiple times from different entry points and keep the best of all the search results, at the cost of speed. If we want multiple nearest neighbors we maintain a priority queue (a max-heap) that stores the top-k nearest neighbors encountered during the traversal, as the sketch below illustrates.
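A plain-Python sketch of this greedy walk. All names are illustrative, not nmslib internals: `graph` maps a node id to its neighbor ids, `vectors` holds the data points, `dist` is any distance function, and `entry` is the random entry point.

```
# Sketch: greedy NSW search keeping a max-heap of the top-k closest nodes seen.
import heapq

def greedy_nsw_search(graph, vectors, query, dist, entry, k=5):
    current = entry
    visited = {entry}
    top_k = [(-dist(vectors[entry], query), entry)]   # max-heap via negated distances
    while True:
        best_neighbor, best_d = None, dist(vectors[current], query)
        for nbr in graph[current]:
            if nbr in visited:
                continue
            visited.add(nbr)
            d = dist(vectors[nbr], query)
            heapq.heappush(top_k, (-d, nbr))
            if len(top_k) > k:
                heapq.heappop(top_k)                  # drop the farthest of the k+1
            if d < best_d:
                best_neighbor, best_d = nbr, d
        if best_neighbor is None:                     # no neighbor is closer: local optimum
            return sorted((-neg_d, node) for neg_d, node in top_k)
        current = best_neighbor
```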

We've learnt how to search a Navigable Small World graph, but how do we build the graph in the first place? We start with a graph containing a single node; then, for every new element, we find the k closest objects already in the structure (using the greedy approach above) and connect them to the new element with bidirectional edges. The authors have also shown theoretically, and confirmed experimentally, that the undirected graph constructed by this algorithm has the properties of a small-world network if elements arrive in random order.

HNSW builds upon the NSW concept by creating a hierarchy of NSW graphs. You can draw parallels between HNSW and probabilistic skip lists, the only difference being that skip lists are built from a hierarchy of linked lists while HNSW is built from a hierarchy of Navigable Small World graphs.

Skip List structure

HNSW extends the NSW algorithm by building multiple layers of interconnected NSW-like graphs. For every element to be inserted we probabilistically choose a layer number 'L' between the minimum and the maximum. We then find the element's nearest neighbors in the NSW graph at layer 'L' and connect the element to them, and this insertion is repeated at every layer from layer L down to the bottom layer. We choose an exponentially decaying probability function so that lower layer numbers are far more likely than higher ones. This gives us sparse top layers, with density increasing at every layer down and the bottom layer containing an NSW graph over every data point. A small sketch of this layer assignment follows.
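In the HNSW paper the layer for a new element is drawn as floor(-ln(u) * mL) with u uniform in (0, 1], which makes high layers exponentially unlikely; a small sketch (the mL value and layer cap are illustrative):

```
# Sketch: exponentially decaying layer assignment for HNSW insertion.
import math
import random

def random_layer(m_l=1 / math.log(16), max_layer=6):
    u = 1.0 - random.random()                 # uniform in (0, 1]
    return min(int(-math.log(u) * m_l), max_layer)

# With mL = 1/ln(16), roughly 94% of points stay on layer 0 and only a small
# fraction reach the sparse upper layers.
```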

HNSW searching example (Source: Paper)

To find the approximate nearest neighbors to a query, the search starts at the top layer, finds the nearest neighbors to the query in that layer's graph, and uses them as the entry points for the next layer, continuing until it reaches the bottom layer and finds the nearest neighbors to the query there. Refer to the image, which portrays this process for a 3-layer HNSW graph.

This strategy results in a nearest-neighbor search that runs in time logarithmic in the number of data points. nmslib builds an HNSW graph over all our function vectors and finds the nearest neighbors to a query quickly and efficiently using the algorithm described.

Searching Process

With the HNSW search graph created by nmslib, all that remains is to search through it for the most similar results. The process is to encode the search query into a vector using our trained ALBERT model and search for the function vectors most similar to it; nmslib returns the index values and distances of the five nearest neighbors to the query. Each index value corresponds to the function's position in the data set, so we use it to extract the function's details and display the results.
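A sketch of this end-to-end search step. `embed_query` stands in for our ALBERT docstring encoder and `functions_df` for the corpus of function definitions and URLs; both names are assumptions.

```
# Sketch: load the index, embed the query, and look up the five nearest functions.
import nmslib

search_index = nmslib.init(method="hnsw", space="cosinesimil")
search_index.loadIndex("code_search.nmslib")

def search(query, k=5):
    query_vector = embed_query(query)                       # 768-d ALBERT vector (assumed helper)
    ids, distances = search_index.knnQuery(query_vector, k=k)
    return [(functions_df.iloc[i]["function"],              # function definition
             functions_df.iloc[i]["url"],                   # its repository URL
             float(d))                                      # cosine distance to the query
            for i, d in zip(ids, distances)]
```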

Deploying our semantic search engine

To deploy our search engine we use Flask. For deployment all we need is to import our trained ALBERT model, our data corpus comprising only the function definitions and their corresponding URLs, and the search graph created by nmslib (saved using the saveIndex method), loaded back with the loadIndex method.

Once we receive the input query from the browser via the form's GET request, we convert it to a vector using our ALBERT model. Using the search graph we find the 5 nearest neighbors. The search function returns index values, and using these indices the function definitions and URLs of the nearest neighbors are gathered from the data corpus. Their cosine distances from the search query are also captured, and these results are displayed on the results page. A demo of our final model, deployed using Flask on AWS, can be seen below.
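A minimal Flask sketch of this deployment, reusing the `search` helper from the previous snippet; the routes and template names are assumptions.

```
# Sketch: Flask app wiring a GET search form to the nmslib-backed search helper.
from flask import Flask, request, render_template

app = Flask(__name__)

@app.route("/")
def home():
    return render_template("index.html")                   # page with the search form

@app.route("/results")
def results():
    query = request.args.get("query", "")                  # form GET request
    hits = search(query, k=5)                               # (function, url, distance) tuples
    return render_template("results.html", query=query, results=hits)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```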

A demo of our semantic search engine

Evaluating the performance of the search engine

With our search engine built, it is crucial to measure its performance. One of the best metrics for evaluating search-engine performance is Normalized Discounted Cumulative Gain (NDCG). NDCG is better than many of its alternatives because it takes into account graded relevance scores for the documents fetched for a query, unlike metrics such as Mean Average Precision or Precision@k, which score merely on the number of relevant documents fetched.

NDCG is a good fit when relevance scores for queries are available in the data set. Generating relevance scores requires man-power, and since this project was built by me alone, that was hard to do. Instead we evaluate the engine on the speed of processing a query and on the semantic similarity of the search results to the query. Running multiple queries, our search engine (the Python search function, which converts the search query into a vector and finds k nearest neighbors in a corpus of a million Python functions) returned results in around 200 ms on average, on an NVIDIA GTX 1070 GPU-enabled system with an Intel Core i7 8th-gen processor. We also found the semantic similarity of the results to be on point.

For the web app deployed using Flask on an AWS EC2 instance (single CPU, 2 GB RAM), we measured throughput as queries answered per second and found that it could handle around 4 requests per second. To increase throughput we used multi-threading with 5 workers (chosen by measuring the latency reduction obtained with each unit increase in the number of workers), which should let us respond to close to a maximum of 20 requests per second; but under high concurrency, when many requests arrive at the same time, we could handle only about 10 queries per second due to miscellaneous network delays (mainly buffering of requests). This testing in simulated environments was done with Apache JMeter.

Testing results from JMeter on sending 10 requests concurrently

Final Remarks

I hope you gained something from this series of tutorials and learnt what it takes to build a semantic code search engine. Thank you for reading, and kudos for making it this far. Finally, make sure you write clean code with good variable names and comments, so that your peers understand it better and machines can search through it with ease.

The source code for the project is available on GitHub

My LinkedIn profile link

That’s all folks!
