Build your own StackOverflow.

Carlos Ortiz Urshela · Published in Sounds klever · Nov 3, 2022

Use codeBERT to create a search engine for your private code repos.

This image was generated by DALL·E.

Companies that develop closed-source digital products on top of large or medium-sized code repositories may want to try AI language models trained on source code to speed up the onboarding of new developers.

Using these language models, software companies can build tools that help their new (or not-so-new) developers figure out how the system works and find common code patterns for maintaining or implementing features based on the project’s modules and libraries.

Let’s say this is your second week at a new job, a cutting-edge banking platform, and you were assigned to develop a new feature: “emit a security event if the customer has made more than five withdrawals in the last two minutes.”

There are generally two ways to address this situation: ask a more experienced developer (usually very busy), or spend time navigating the documentation and the existing codebase to find a similar use case.

But wait a minute: wouldn’t it be cooler to have a search engine where you just type “How to get the last n transactions of a customer in a time window?” and get back code snippets that already implement that functionality? Something like this:

Customer c = context.current_customer()
transactions = c.list_transactions(2,time.MINUTES,sortOrder.DESC)
----- There are five more related examples...

Well, it turns out that building a search engine for source code like the one above is not that difficult. Below I’ll prototype a collaborative code search engine based on the CodeBERT model developed by Microsoft.

CodeBERT is a model for programming languages pre-trained on NL (natural language) and PL (programming language) pairs in six programming languages (Python, Java, JavaScript, PHP, Ruby, Go). CodeBERT can be used in several use cases, like generating natural-language documentation from source code, implementing a code reviewer, and extracting embeddings from source code and natural-language sentences.

In this case, I will use the microsoft/unixcoder-base model to extract embeddings for source code snippets and description sentences and then feed them into a vector database like Pinecone or Faiss.

Well, let’s start writing some code to import the required libraries and set up the model:

!wget 'https://raw.githubusercontent.com/microsoft/CodeBERT/master/UniXcoder/unixcoder.py'

import torch
from unixcoder import UniXcoder

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = UniXcoder("microsoft/unixcoder-base")
model.to(device)

The first step is to create a function to extract the embeddings from code snippets and natural language sentences.

"""
Extract embeddings from a code snippet or a natural language query.
"""
def get_embeddings(text):

tokens_ids = model.tokenize([text],max_length=512,mode="<encoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
tokens_embeddings,nl_embedding = model(source_ids)
norm_nl_embedding = torch.nn.functional.normalize(nl_embedding, p=2, dim=1)
norm_nl_embedding = norm_nl_embedding.detach().cpu().numpy()[0]
return norm_nl_embedding

To keep this example within the scope of a simple prototype, I will create a list with only four code fragments. Then we will extract the vector embedding of each fragment and save it into an in-memory “vector database” that we can search later.

To build a production-level solution, we could create a web-based collaborative workflow for adding the most relevant code snippets from the company’s repositories. Each snippet is processed to extract its embedding vector, which we can then add to a vector database such as Pinecone.
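As a sketch of what that in-memory store could look like, here is a minimal, hypothetical stand-in for a vector database (the class name and API are my own invention, not Pinecone’s or Faiss’s): it keeps L2-normalized vectors so that a plain dot product equals cosine similarity.

```python
import numpy as np

class InMemoryVectorDB:
    """Minimal stand-in for a vector database such as Pinecone or Faiss.

    Stores L2-normalized vectors, so a dot product equals cosine similarity.
    Hypothetical API for illustration only.
    """

    def __init__(self):
        self.vectors = []   # normalized embedding vectors
        self.snippets = []  # the code snippet each embedding came from

    def add(self, embedding, snippet):
        v = np.asarray(embedding, dtype=np.float32)
        self.vectors.append(v / np.linalg.norm(v))
        self.snippets.append(snippet)

    def search(self, query_embedding, k=3):
        q = np.asarray(query_embedding, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vectors) @ q            # cosine similarities
        top = np.argsort(scores)[::-1][:k]             # best k indices
        return [(self.snippets[i], float(scores[i])) for i in top]
```

A real deployment would swap this class for a Pinecone index or a Faiss index without changing the surrounding workflow.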

code_corpus = ["""
Customer c = context.current_customer()
transactions = c.list_transactions(2,time.MINUTES,sortOrder.DESC)
""",
"""
ticket = Tickes.from(description,prority.MEDIUM)
ticket.save()
""",
"""
Customer c = context.current_customer()
c.sendMsg(alarmMsg)
""",
"""
customer_categories = [value1, value2, value3]
plt.hist(x, bins = 5)
plt.show()
"""
]vector_database = []for code in code_corpus:
vector_database.append(get_embeddings(code))

Now we are ready to start querying our code database!

nl_query = """
list the last N transactions of a customer in a time window.
"""
nlq_emb = get_embeddings(nl_query)cos_scores = util.cos_sim(nlq_emb, vector_database)[0]
top_results = torch.topk(cos_scores, k=3)
top_results--------------------torch.return_types.topk( values=tensor([0.4847, 0.3150, 0.2843]), indices=tensor([0, 3, 2]))
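For intuition, here is roughly what util.cos_sim plus torch.topk compute, sketched with plain numpy on toy vectors (the 3-dimensional vectors below are made up and only stand in for the real, much higher-dimensional embeddings):

```python
import numpy as np

def cos_sim(query, corpus):
    """Cosine similarity between one query vector and each corpus vector."""
    q = query / np.linalg.norm(query)
    m = np.stack([v / np.linalg.norm(v) for v in corpus])
    return m @ q

# Toy 3-dimensional vectors standing in for real embeddings.
corpus = [np.array([1.0, 0.0, 0.0]),   # index 0
          np.array([0.0, 1.0, 0.0]),   # index 1
          np.array([0.7, 0.7, 0.0])]   # index 2
query = np.array([0.9, 0.1, 0.0])

scores = cos_sim(query, corpus)
top_k = np.argsort(scores)[::-1][:3]   # indices by decreasing similarity
```

Here index 0 wins because the query points almost entirely in the same direction as the first corpus vector.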

So, in this case, the index of the code fragment most similar to the question “list the last N transactions of a customer in a time window.” is zero (0):

Customer c = context.current_customer()
transactions = c.list_transactions(2,time.MINUTES,sortOrder.DESC)

Let’s try the following question: “Plot a histogram of customer categories”.

nl_query = 'Plot a histogram of customer categories'
nlq_emb = get_embeddings(nl_query)
cos_scores = util.cos_sim(nlq_emb, vector_database)[0]
top_results = torch.topk(cos_scores, k=3)
top_results
----------
torch.return_types.topk(values=tensor([0.6671, 0.2543, 0.1562]), indices=tensor([3, 0, 2]))

In this case, the index of the code fragment most similar to the question above is three (3):

customer_categories = [value1, value2, value3]
plt.hist(x, bins = 5)
plt.show()

Awesome!

This blog post is a small sample of what you can create with language models specialized in source code. Building this prototype took me only a few hours; creating a more robust solution with features such as detecting repeated questions and recommending relevant content (as StackOverflow and Quora do) is feasible.
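One of those features, detecting repeated questions, can be sketched with the same embeddings: flag a new question as a duplicate when its cosine similarity to a previously asked one crosses a threshold. The threshold below is a made-up value for illustration; in practice it would be tuned on real question pairs.

```python
import numpy as np

SIM_THRESHOLD = 0.85  # hypothetical cutoff; tune on real data

def find_duplicate(new_emb, stored_embs):
    """Return the index of a previously asked question close enough
    to count as a duplicate, or None if there is no match."""
    new_emb = np.asarray(new_emb, dtype=np.float32)
    new_emb = new_emb / np.linalg.norm(new_emb)
    for i, emb in enumerate(stored_embs):
        emb = np.asarray(emb, dtype=np.float32)
        emb = emb / np.linalg.norm(emb)
        if float(new_emb @ emb) >= SIM_THRESHOLD:
            return i
    return None
```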

We have reached the end of the post. I hope this article has been helpful. Feel free to DM me if you want to know more details or want my help in developing a prototype. Please add your comments if you have any questions.

Thanks for reading!

Stay tuned for more content about GPT-3, NLP, System design, and AI in general. I’m the CTO of an Engineering services company called Klever, you can visit our page and follow us on LinkedIn too.


Machine Learning Engineer | Enterprise Solutions Architect — Interested in AI-based solutions to problems in healthcare, logistics, and HR. CTO of Klever.