GitHub Copilot: Data Collection, Training, and Evaluation for Large Scale Code Generation

Maryam Farooq
Aggregate Intellect
5 min read · Sep 10, 2021

*Check out the full Copilot Recipe here*

Thoughts on GitHub Copilot have been buzzing throughout the AI community since the tool came out in June. How does it work, what are its challenges and limitations, and where can I find a comprehensive list of resources for doing ML on source code? (Spoiler alert: here!)

We brought in two of our most prominent NLP experts (Suhas Pai, CTO & NLP researcher @ Bedrock AI, and Ehsan Amjadian, Director of AI & Technology @ Royal Bank of Canada) to discuss these questions and more. We also put together this Copilot Recipe (a step-by-step guide with annotated resources), which shows you how to collect the data you need, how to train a model on it, and how to evaluate it. Don’t forget to check out the additional resources and the concept graph so you can see what kinds of concepts you need to know for this topic!

Speaker Bios:

Suhas Pai is the CTO & NLP Researcher @ Bedrock AI (YC S21). His interests in NLP include controlled text generation, privacy-preserving NLP, and neural style transfer. He has been the NLP lead at AISC for over 1.5 years and hosts an NLP Discussion Group every Sunday.

Ehsan Amjadian is the Director of AI & Technology @ Royal Bank of Canada (RBC). He has extensive background in NLP — both academically (he did his PhD in the field), and industrially (he has led various initiatives in developing & deploying NLP algorithms as well as software).

Selected Highlights:

[Question] What are some common ML techniques people use on source code, and what are their objectives? Highlight some of the differences between applying NLP-like techniques to natural language versus a formal language like source code.

Suhas: “One of the main ones is code search, where you have a particular query presented either in natural language or in the form of a code snippet, and you want something that is similar to it (or something that is the code equivalent of what you need). There are other tasks in that realm, for example defect prediction and bug detection: you can use ML techniques to identify features that indicate the presence of security vulnerabilities or just defects in the code. You can also use ML for test case generation, and there have been papers where they’ve tried type inference for certain programming languages. Another common task is clone detection, where you try to see whether two snippets of code do the same thing functionally. There are also comment generation tasks, where the model tries to generate efficient and readable doc strings based on the code, and code repair tasks, where an ML model suggests what changes would need to be made to the code to fix a bug present in it. The most important task gaining prominence these days is program synthesis, or code generation. You can call it natural-language-to-source-code generation: you provide a doc string of what you need, and the code generated is the implementation of what you specified.”
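Clone detection in particular has simple non-neural baselines that help explain why ML is needed. As a hypothetical illustration (not how Copilot or any production system works), a token-level Jaccard similarity catches surface-level near-clones; functional clones with heavily rewritten code are exactly what learned representations are for:

```python
import re

def tokens(code):
    """Crude lexer: identifiers, numbers, and single punctuation marks."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))

def jaccard_similarity(a, b):
    """Jaccard similarity between the token sets of two snippets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

snippet_a = "def add(x, y):\n    return x + y"
snippet_b = "def add(a, b):\n    return a + b"  # same function, renamed args
snippet_c = "import os\nprint(os.getcwd())"     # unrelated code

print(jaccard_similarity(snippet_a, snippet_b))  # high: surface-level clone
print(jaccard_similarity(snippet_a, snippet_c))  # low: unrelated
```

Note how the renamed arguments already lower the score even though the two functions are identical in behavior; a syntax- or semantics-aware model would score them as true clones.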

Ehsan: “Programming languages are more regular than natural language (in layman’s terms, natural languages are richer and therefore noisier). Natural languages are harder to interpret and much more ambiguous because of that complexity. One might imagine that applying NLP algorithms to code is relatively overpowered, especially the ones that directly target code per se. Some of the tasks Suhas mentioned sit at the interface of natural language and code; using NLP to form representations to generate code, those are absolutely hard tasks. But getting to the point: one might imagine that applying analytics techniques only to programming languages is easier because of the complexity just mentioned, but there is also less data. When it comes to natural language, any corpus out there is part of our training set, especially with self-supervised learning algorithms these days, whereas the body of code out there is much smaller.”

[Question] How does GitHub Copilot really work & does it manage to solve some of the challenges that prior attempts (at solving this problem) fell short of?

Ehsan: “They started advertising it as an AI pair programmer, which I think is a fitting brand given the variety of tasks it does. It focuses on the interface Suhas and I hinted at; the natural language side makes things very simple. Basically, it does a few tasks that you would do with a pair programmer, or ones an AI pair programmer would do for you. One is generating code from your comments. It can autocomplete your code, it can write test cases for you, it can autofill repetitive code, and so on. It’s a brilliant use of natural language capabilities. To be honest, I think the capability has been there in the past (at least theoretically), but it’s a great showcase of it.”

Suhas: “These types of models have existed for a long time, and the problems have been around for a decade now. Copilot is a vast improvement over what we’ve seen in the past decade. They use Codex, a model developed by OpenAI. Codex is based on GPT-style language models and has specifically been trained on source code that is publicly available on GitHub. The training data comes from 54 million public software repositories, with 179 gigabytes of unique Python files, which is an enormous amount of data to train on. We don’t really know the final production version of the Codex model being used; they only released a paper on a predecessor of the actual Copilot model currently in production. In that paper, they described a 12-billion-parameter model, which was already showing really good numbers on the metrics when compared to other models that were not specifically trained on source code. For example, GPT-3 (which is not trained on source code at all) was able to solve exactly zero percent of the problems in the test set they produced. On the other hand, GPT-J (built by EleutherAI) was trained on a dataset called the Pile, which contains a subset of GitHub code, and because of that it was actually able to perform better than GPT-3.”

For more, watch the full video below & check out the Copilot Recipe.

Key Video Highlights:

00:57 Introductions
02:17 What are common ML tasks on source code?
06:58 What’s GitHub / OpenAI Copilot and how does it work?
11:22 Resources for ML on Source Code (RECIPE)
12:39 Data Collection and Filtering
16:36 Data Formatting and Training
19:37 Evaluation
22:40 Copilot Criticism and Challenges

Aggregate Intellect

Aggregate Intellect is a Global Marketplace where ML Developers Connect, Collaborate, and Build. Connect with peers & experts at https://ai.science or Join our Slack Community.

  • Check out the user-generated Recipes that provide step-by-step, bite-sized guides on how to do various tasks
  • Join our ML Product Challenges to build AI-based products for a chance to win cash prizes
  • Connect with peers & experts through the ML Discussion Groups or Expert Office Hours
