Image source: vectorpocket / Freepik

Exploring GitHub Copilot’s Vulnerabilities

This article introduces GitHub Copilot and reviews the service from a security standpoint, with the aim of improving current and future usage. A significant portion of the article is based on published research; see the references at the end.

You may skip directly to “The Rising Concerns” for the security issues, but it is worth starting from the Introduction.

Introduction

Software today forms an integral part of our personal and business lives, and we have never depended on it as heavily as we do now. This has caused a surge in demand for software developers who can deliver quality code in limited time. It has also driven considerable interest in improving the tools used in the software development process, a goal shared by much of the research published lately. The design and release of GitHub Copilot, a machine learning (ML)-based code generator, is one of the recent outcomes of these efforts. By definition, GitHub Copilot is an “AI pair programmer” that generates code in a variety of languages given some context such as comments, function names, and surrounding code. At its core, large models originally designed for natural language processing (NLP) are trained on vast quantities of code and attempt to provide sensible completions as programmers write. Because it is trained on open data from GitHub, Stack Overflow, and other publicly accessible portals, it is the largest and most capable such model today.

At a high level, Copilot is simple, and this is how it works: a software developer (the user) edits code in a plain text editor while working on an application. Copilot continually scans the program as the developer adds lines of code, periodically uploading a portion of those lines, the user’s cursor position, and some metadata, and then produces code options for the user to insert.

Figure 1 Copilot in action using Python as the language of choice. Source: https://copilot.github.com

In Figure 1 above, the code in grey is what GitHub Copilot has suggested. The buttons above it give users control over their preferred option: the Next button shows the next best alternative, the Previous button shows the previous one, and the Accept button inserts the active suggestion.
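To make that flow concrete, the sketch below shows the kind of prompt-to-suggestion cycle described above. The comment and signature are what a developer might type; the body is an invented but plausible completion, not an actual Copilot output.

# The developer writes the comment and signature; Copilot proposes a body.
# (The body shown is a hand-written, plausible stand-in for a suggestion.)
from datetime import date

# compute the number of days between two ISO-formatted dates
def days_between(d1, d2):
    a = date.fromisoformat(d1)
    b = date.fromisoformat(d2)
    return abs((b - a).days)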

How Does Copilot Compare to Others?

GitHub Copilot is powered by a production variant of Codex, a Generative Pre-trained Transformer (GPT) language model from OpenAI. In one study2, Codex solved a staggering 70.2% of HumanEval problems when 100 samples were generated per problem in a repeated-sampling approach. HumanEval (Hand-Written Evaluation Set) is a benchmark of hand-written programming problems designed by OpenAI to evaluate the functional correctness of model-generated code. Certainly, GitHub Copilot is not the first “AI-powered” program synthesis tool. In 2018, GitHub released its Natural Language Semantic Code Search3, allowing users to search for code samples using plain English descriptions, and Tabnine has offered “AI-powered” code completion for years. Copilot is unique in that it can generate entire multi-line functions, as well as documentation and tests, based on the entire context of a code file. Currently, Copilot is at a beta testing stage: a technical preview is available but with limited access, and individuals wanting to use the service must sign up, join a waitlist, and wait for approval.
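For readers unfamiliar with the benchmark, a HumanEval-style problem looks roughly like the sketch below (a paraphrased illustration, not an actual item from the set): the model is given the signature and docstring and must produce the body, which is then checked against held-out unit tests.

# Paraphrased, HumanEval-style problem (illustrative only).
def is_palindrome(text: str) -> bool:
    """Return True if text reads the same forwards and backwards.

    >>> is_palindrome("level")
    True
    >>> is_palindrome("hello")
    False
    """
    # A correct completion the hidden tests would accept:
    return text == text[::-1]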

How It Started

The power of GitHub Copilot can be traced back to 2017 and the release of a paper titled Attention Is All You Need4, which introduced a simple network architecture, the Transformer, for processing sequential data in parallel. Unlike traditional RNN and LSTM models, this architecture is based entirely on attention mechanisms, dispensing with recurrence and convolutions entirely. In 2018, two promising pre-trained NLP models (GPT and BERT) were built using this architecture5,6. Fast forward to 2020, when an improved variant (GPT-3) was released7; GitHub Copilot’s Codex model was built on GPT-3 and then fine-tuned on code from GitHub. Codex’s tokenization step is nearly identical to that of GPT-3: byte pair encoding is used to convert the source text into a sequence of tokens, but the vocabulary has been extended with dedicated tokens for runs of whitespace (a token for two spaces, a token for three spaces, and so on up to 25 spaces). This allows the tokenizer to encode source code, which contains a great deal of whitespace, both more efficiently and with more context. An important trait that Codex and Copilot inherit from GPT-3 is that, given a prompt, they generate the most likely completion for that prompt based on what was seen during training1. The model does not necessarily generate the best code; it produces the most probable continuation of the code preceding it. As a result, the quality of the generated code can be strongly influenced by semantically irrelevant features of the prompt.
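The effect of those whitespace tokens can be seen with OpenAI’s open-source tiktoken library, under the assumption that its “r50k_base” encoding approximates GPT-3’s vocabulary and “p50k_base” approximates Codex’s whitespace-extended one:

# Comparing GPT-3-style and Codex-style tokenization of indented code.
import tiktoken

gpt3_enc = tiktoken.get_encoding("r50k_base")   # no dedicated whitespace-run tokens
codex_enc = tiktoken.get_encoding("p50k_base")  # adds tokens for runs of spaces

snippet = "def add(a, b):\n        return a + b\n"

# The whitespace-extended vocabulary can encode the 8-space indent as a
# single token, so the same snippet needs fewer tokens overall.
print(len(gpt3_enc.encode(snippet)))
print(len(codex_enc.encode(snippet)))

On indented code the second count is noticeably smaller, which is exactly the efficiency gain the extra whitespace tokens were meant to provide.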

The Rising Concerns

There have been rising concerns about this technology since its release. One that has been largely overlooked is code security: the model is trained on public code, which may contain insecure implementations. A study by Pearce et al.1 systematically experimented with Copilot to gain insight into this issue, designing scenarios for Copilot to complete and analyzing the produced code for security weaknesses. Copilot’s completions were checked against a subset of MITRE’s “2021 CWE Top 25 Most Dangerous Software Weaknesses”, a list that is updated yearly to rank the most dangerous software weaknesses as measured over the previous two calendar years. The work attempted to characterize Copilot’s tendency to produce insecure code, giving a gauge of how much scrutiny a human developer should apply to its suggestions.

How Secure Is Copilot?

The results revealed that Copilot’s overall response to the test scenarios was mixed from a security standpoint, given the large number of generated vulnerabilities (across all axes and languages, 39.33% of the top options and 40.48% of all options were vulnerable). This stems from Copilot’s dependence on community-provided code: where certain bugs are more prevalent in open-source repositories, those bugs will be reproduced more often. In the study, Copilot’s security was gauged with a mix of automated analysis and manual code inspection. The automated analysis used GitHub’s CodeQL tool, which can scan for a wider range of security weaknesses than comparable tools; CodeQL is open-source and supports the analysis of software written in languages such as Java, JavaScript, C++, C#, and Python. The various classes of insecure code share common patterns which, according to the Common Weakness Enumeration (CWE) database, can be considered weaknesses.
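As a flavor of what such a weakness looks like in practice, the sketch below (my own minimal illustration, not an example from the study) shows CWE-798, “Use of Hard-coded Credentials”, one of the Top 25 classes, alongside a safer variant. The mysql.connector import (the real mysql-connector-python package) is just a stand-in dependency for the example.

import os
import mysql.connector  # stand-in dependency for illustration

def connect_insecure():
    # CWE-798: credentials embedded in source code; a static analyzer that
    # tracks string literals flowing into authentication calls can flag this.
    return mysql.connector.connect(user="admin", password="hunter2", host="db")

def connect_safer():
    # Secrets pulled from the environment stay out of the repository, and
    # therefore out of any model's future training data.
    return mysql.connector.connect(
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        host=os.environ["DB_HOST"],
    )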

Determining whether code is vulnerable requires understanding its context, and may require framing the code or scenario from an attacker’s point of view. The study therefore constrained itself to determining whether specific code snippets generated by Copilot are vulnerable: that is, whether they definitively contain code exhibiting the characteristics of a given CWE.

Figure 2 shows the security evaluation steps used in the research.

Figure 2 General Copilot evaluation methodology.

In steps 1 and 2, several ‘CWE scenarios’ were written for each CWE. In step 3, Copilot was asked to generate up to 25 options for each scenario. In step 4a, each option was combined with the original program snippet to form a set of programs, and options with significant syntax issues were discarded in step 4b. Then, in step 5a, each program was evaluated. Wherever possible, CodeQL performed this evaluation (step 5b), using either built-in or custom queries; some CWEs required additional context or could not be expressed as properties examinable by CodeQL, so the authors performed those evaluations manually (step 5c). Importantly, CodeQL was configured to look only for the specific CWE each scenario was intended to elicit, and the evaluation was limited to vulnerability rather than correctness. Finally, in step 6, the results of the evaluations of each Copilot-completed program were collated. Steps 1 through 3a were performed manually; automated Python scripts completed steps 3b, 4a, and 5, with the manual analysis of step 4b done as needed.
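A rough Python sketch of steps 3 through 5 for a Python scenario might look like the following. The generate_options and run_codeql wrappers are hypothetical names, not from the paper, standing in for the Copilot interface and the CodeQL CLI respectively.

import ast

def evaluate_scenario(prompt_snippet, generate_options, run_codeql, cwe_query):
    """Hypothetical reconstruction of the paper's steps 3-5 for one scenario."""
    results = []
    # Step 3: ask the model for up to 25 completions of the scenario prompt.
    for option in generate_options(prompt_snippet, n=25):
        program = prompt_snippet + option   # Step 4a: rebuild the full program
        try:
            ast.parse(program)              # Step 4b: drop syntactically broken options
        except SyntaxError:
            continue
        # Step 5: scan only for the one CWE this scenario was designed to elicit.
        results.append((program, run_codeql(program, query=cwe_query)))
    return results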

The first evaluation checked Copilot’s performance when prompted with several different scenarios whose completion could introduce a software CWE. For each CWE, the authors developed three different scenarios. The results showed that Copilot generated vulnerable code around 44% of the time, with some CWEs far more prevalent than others.

The second evaluation checked how Copilot’s performance changes for a specific CWE given small changes to the provided prompt. For this experiment, CWE-89 (SQL injection) was chosen. Copilot did not deviate significantly from its overall answer confidences or from the control scenario’s performance. The authors hypothesized that the presence of either vulnerable or non-vulnerable SQL elsewhere in a codebase is the best predictor of whether other SQL in that codebase will be vulnerable, and thus has the most influence on whether Copilot generates SQL code vulnerable to injection. That said, although they did not significantly affect the overall confidence scores, small changes to Copilot’s prompt were observed to impact the safety of the top-suggested program option, even when the changes had no semantic meaning (they were only changes to comments).
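For context, the sketch below (my own illustration, not one of the study’s scenarios) shows the difference CWE-89 turns on: string interpolation versus a parameterized query, here with Python’s built-in sqlite3 module.

import sqlite3

def get_user_vulnerable(conn: sqlite3.Connection, username: str):
    # CWE-89: the username is interpolated straight into the SQL string, so
    # an input such as "x' OR '1'='1" rewrites the query's logic.
    return conn.execute(
        f"SELECT * FROM users WHERE name = '{username}'"
    ).fetchall()

def get_user_safe(conn: sqlite3.Connection, username: str):
    # A parameterized query passes the value separately from the SQL text,
    # so it can never be parsed as SQL syntax.
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (username,)
    ).fetchall()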

The final evaluation focused on hardware, using the more recently introduced hardware CWEs. Specifically, it tested how Copilot performs when tasked with generating register-transfer level (RTL) code in the hardware description language Verilog. The observation was that Copilot struggled to generate syntactically correct and meaningful Verilog, mostly because far less training data is available for Verilog than for more popular languages. The aim here, however, was not to test for correct code generation but for the frequency of insecure code, and on that measure Copilot performed relatively well.

Conclusion and Contribution

Overall, Copilot’s response to the scenarios was mixed from a security standpoint, given the large number of generated vulnerabilities (across all evaluations and languages, 39.33% of the top options and 40.48% of all options were vulnerable). Novice developers, who are most likely to accept the top suggestion uncritically, are particularly at risk. Since Copilot is trained on open-source code, it is reasonable to theorize that its security quality stems from the nature of the community-provided code: where certain bugs are more visible in open-source repositories, there is a higher tendency for Copilot to reproduce the same or similar bugs. Having said that, one should not draw conclusions from this about the security quality of open-source repositories stored on GitHub.

There is no doubt that GitHub Copilot is a great tool that will aid the rapid development of applications, but developers will need to be more vigilant when using it. It is advisable to pair Copilot with security-aware tools during both its training and its usage. Another way to reduce the level of security vulnerability would be for the Copilot training team to assign greater weight to well-reviewed and widely accepted public code samples, to counterbalance the more prevalent vulnerable code during training and evaluation.
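As one concrete way to act on that advice today, a developer could run the open-source Bandit linter over Copilot-assisted files before committing them. The helper below is a hypothetical sketch of such a check, not a tool from the study.

import subprocess
import sys

def scan_with_bandit(path: str) -> bool:
    """Return True if Bandit finds no high-severity issues in the file."""
    # -q: quiet output; -lll: report only high-severity findings.
    result = subprocess.run(
        ["bandit", "-q", "-lll", path],
        capture_output=True, text=True,
    )
    if result.returncode != 0:  # Bandit exits non-zero when issues are found
        print(result.stdout, file=sys.stderr)
    return result.returncode == 0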

For in-depth information on the tests and results, refer to the references below.

References

1. Pearce H, Ahmad B, Tan B, Dolan-Gavitt B, Karri R. An Empirical Cybersecurity Evaluation of GitHub Copilot’s Code Contributions. Published online August 20, 2021. Accessed September 29, 2021. https://arxiv.org/abs/2108.09293v2

2. Chen M, Tworek J, Jun H, et al. Evaluating Large Language Models Trained on Code. Published online July 7, 2021. Accessed September 29, 2021. https://arxiv.org/abs/2107.03374v2

3. Towards Natural Language Semantic Code Search | The GitHub Blog. Accessed September 29, 2021. https://github.blog/2018-09-18-towards-natural-language-semantic-code-search/

4. Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need. In: Advances in Neural Information Processing Systems. Vol 2017-December. Neural Information Processing Systems Foundation; 2017:5999–6009.

5. Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training (2018). Published online 2018.

6. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019. Published online 2018. Accessed September 29, 2021. https://github.com/tensorflow/tensor2tensor

7. Brown TB, Mann B, Ryder N, et al. Language Models are Few-Shot Learners. Published online May 28, 2020. Accessed September 29, 2021. http://arxiv.org/abs/2005.14165
