Semantic Code Search Using Transformers and BERT - Part I: Overview and Data Preprocessing

Shashank Ramesh
The Startup
Jul 10, 2020


Intro

Programs must be written for people to read, and only incidentally for machines to execute.

This has long been the prevailing notion about source code. Now, with the evolution of deep learning, it's time to change the norm and have machines read and understand source code as well.

Why venture down this path, you ask? Machines that understand source code would let us find the snippet we want in large repositories in a jiffy. This is unlike the tedious traditional experience of finding code sections through word search using Ctrl+F. Instead, we can give a short description of what we intend to find and let the machine, using its understanding of code, find the snippet matching that description. The GIF below shows a demo of our app, where we search for a query and are served functions which implement the functionality described in the query.

Demo of search engine

Through this series of blogs I will take you through my journey of building the above semantic code search engine end-to-end, from all the AI techniques used to its final deployment. To see a demo of our web app and to get an overview of the important topics covered in the blogs, you may view the video below.

Video containing demo and important topics covered

Summary

To summarize, we will build a search engine which takes a search query as input and finds, from a large pool of source code, a function whose functionality semantically matches what is being queried. To implement the semantic search we pose it as a supervised learning problem, where we extract the function definition and its docstring (refer to the image below) from the source code for training.

Elements of a python function

During training, our aim is to convert a function and its corresponding docstring into vectors such that the two end up close together in a shared vector space. To convert the docstrings to vectors we fine-tune a pre-trained BERT model, and to understand the function definition and convert it to vectors we use the Transformer model. Once we have vectorized our functions, we use nmslib to search through the function vectors and return functions similar in functionality to our search query. The model is deployed as a web app using Flask. The methodology flow chart for our project is shown below.

This series of blogs will take you through the entire process in detail, with a thorough analysis of every part of our methodology. It aims to introduce the reader to state-of-the-art NLP techniques like the Transformer and BERT, to efficient ways of training on large data sets, and to tools like nmslib and Flask, and shows how to combine all of them to solve a complex problem like building a semantic code search engine.
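As a small preview of the final search step, the sketch below shows how nmslib can build an index over function vectors and answer a nearest-neighbour query. The vectors here are random placeholders; in the actual project they come from the models built in Parts II and III.

import numpy as np
import nmslib

# Placeholder embeddings: in the real project these come from the
# fine-tuned BERT / Transformer models described later in the series.
function_vectors = np.random.rand(1000, 768).astype(np.float32)
query_vector = np.random.rand(768).astype(np.float32)

# Build an approximate nearest-neighbour index over the function vectors.
index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(function_vectors)
index.createIndex({'post': 2}, print_progress=True)

# Retrieve the ids and distances of the 5 functions closest to the query.
ids, distances = index.knnQuery(query_vector, k=5)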

The blogs explain in detail how to build a semantic code search engine, but the learnings can be applied to build search engines for any other arbitrary object, such as images, videos, etc.

The current article is Part I of the series, where we discuss gathering the data for training our model and preprocessing the raw data so that our models can gather insights from it.

Part II picks up from the foundations laid in Part I and is a deep dive into the process used to convert docstrings to vectors.

Part III is the final stretch, which involves converting the functions to vectors, using these vectors to build the search engine, and deploying it with Flask.

The source code for the project is available on GitHub

Data Collection

The data required for building our semantic search engine is a collection of large code bases made publicly available by GitHub at this link. There are 10 such files, which can be downloaded by incrementing the last digit in the link. The data set contains code from repositories in diverse domains, which helps in building a general-purpose search engine.

Once downloaded, we get a corpus containing 1,241,664 files, and we can use the snippet below to read all the files.

Code to read all the files into a data frame
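The original snippet is embedded as a gist; a minimal sketch of the idea, assuming the downloaded dumps are CSV files sitting in a local data/ folder (both the format and the path are assumptions here), could look like this:

from pathlib import Path
import pandas as pd

# Hypothetical location of the 10 downloaded dump files.
DATA_DIR = Path('data')

# Read every dump and stack them into one data frame, with one row per
# source file (repository, file path and raw file content).
frames = [pd.read_csv(path) for path in sorted(DATA_DIR.glob('*.csv'))]
df = pd.concat(frames, ignore_index=True)

print(df.shape)
print(df.head())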

This is how the data looks after running the above steps.

Dataset Description
Snapshot of the head of the data set

Data Preprocessing

Extracting Function-Docstring Pairs

The next step is to extract functions and their respective docstrings from the source code content in the files. For this task we will use the ast library, which, just like the Python compiler, converts source code into an abstract syntax tree for analysis. It identifies all the different components in the source code and helps us extract them for processing. We are interested in extracting the function definitions along with their corresponding docstrings, removing other information like comments, decorators and function signatures. After we extract the function definition and its docstring, we tokenize each of them, remove punctuation and decorators, and convert all tokens to lower case.

Parsing the code files and extracting function-docstring pairs
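The extraction code itself is embedded above as a gist; a simplified sketch of the approach using Python's ast module (the tokenizer below is a stand-in regular expression, and ast.unparse requires Python 3.9+) might look like this:

import ast
import re

def tokenize(text):
    # Lower-case the text and keep only word-like tokens, dropping punctuation.
    return re.findall(r'[a-z0-9_]+', text.lower())

def extract_functions(source):
    # Return (name, function_tokens, docstring_tokens) for every function
    # found in a source string; files that fail to parse are skipped.
    pairs = []
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return pairs
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            docstring = ast.get_docstring(node) or ''
            node.decorator_list = []       # drop decorators
            code = ast.unparse(node)       # comments are not preserved by ast
            code = code.replace(docstring, '')  # keep the docstring out of the code tokens
            pairs.append((node.name, tokenize(code), tokenize(docstring)))
    return pairs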

Once we have extracted our function-docstring pairs and their tokens, free from decorators and other unwanted elements, we stack our findings in a data frame with every row containing details about a function and its corresponding docstring.

Data set after extracting function-docstring pairs

Further Preprocessing

We further process the function tokens and docstring tokens by:

i) Removing duplicates — All entries which were duplicates with respect to function definitions or function tokens were removed; 1,199,426 duplicates were dropped. This was done to avoid giving more weight/bias to certain function-docstring pairs during training, as having multiple copies of a data point could skew training.

ii) Separating functions without docstrings — On analysis we found that only a docstring of 4 or more tokens does justice to explaining its function, so all functions with a docstring of three or fewer tokens were separated out. This gives us a data set of function-docstring pairs which can be used for supervised learning of a shared vector space for vectorizing the code and its docstring. The number of functions without usable docstrings was 4,013,756. We do not discard these functions; rather, our objective is to learn from the functions with docstrings and use that ability to make this code searchable as well.

iii) Removing __init__() entries — Function entries used for initialization of objects, as well as functions with fewer than 5 function tokens, were removed. On analysis these were found to be intermediate functions called for insignificant operations/processing before handing the data to the main logic.

After completing the above processing steps we are left with 1,227,268 entries in our data set, which need to be divided into train, validation and test sets.
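A rough pandas sketch of these three filters, assuming the extracted data frame (called pairs_df below) has function, function_name, function_tokens and docstring_tokens columns (the names are placeholders for whatever the extraction step produced):

# i) Drop duplicates with respect to the function definition.
pairs_df = pairs_df.drop_duplicates(subset='function').reset_index(drop=True)

# ii) Keep pairs with at least 4 docstring tokens for supervised training;
#     the rest are set aside but will still be made searchable later.
has_docstring = pairs_df['docstring_tokens'].str.len() >= 4
with_docstring = pairs_df[has_docstring]
without_docstring = pairs_df[~has_docstring]

# iii) Drop __init__ entries and functions with fewer than 5 tokens.
keep = (with_docstring['function_name'] != '__init__') & \
       (with_docstring['function_tokens'].str.len() >= 5)
train_pool = with_docstring[keep].reset_index(drop=True)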

Train-Validation-Test Split

We decided not to split the data set randomly, but instead to group the entries by the repository they belong to and then split by repository. This prevents leakage of data between training and testing: every repository has its own style, variable-naming choices, etc., and if code from the same repository appeared in both the training and test sets we would not get a true picture of how well our model generalizes to code from other repositories.

An 80-10-10 train-validation-test split was done, with the train set consisting of 996,341 entries, the validation set 125,377 entries, and the test set 105,550 entries.
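One way to realise such a repository-grouped split is scikit-learn's GroupShuffleSplit. The sketch below assumes the filtered data frame (train_pool from the previous sketch) has a repo column identifying the repository of each entry; exact proportions will vary slightly because whole repositories stay together.

from sklearn.model_selection import GroupShuffleSplit

def grouped_split(frame, group_col='repo', test_size=0.1, seed=42):
    # Split off `test_size` of the rows while keeping every row of a given
    # repository on the same side of the split.
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(frame, groups=frame[group_col]))
    return frame.iloc[train_idx], frame.iloc[test_idx]

# 80-10-10: carve off 10% for test, then one ninth of the rest for validation.
train_val, test = grouped_split(train_pool, test_size=0.10)
train, valid = grouped_split(train_val, test_size=1/9)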

Sort the datasets

After the split, each data set was sorted in ascending order by the number of function tokens in each entry. The benefit of sorting the data sets will be discussed later in the series.
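Sorting each split is then a small helper, again assuming a function_tokens column holding lists of tokens:

def sort_by_length(split_df):
    # Order the entries by the number of function tokens, shortest first.
    order = split_df['function_tokens'].str.len().sort_values().index
    return split_df.loc[order].reset_index(drop=True)

train, valid, test = (sort_by_length(s) for s in (train, valid, test))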

With this final step we finish the preprocessing of our data set.
