GitHub Copilot

Davy De Waele
Published in Ixor
6 min read · Oct 27, 2023

Understanding the implications of Code Generation with GitHub Copilot

What is it?

GitHub Copilot is powered by a generative AI model developed by GitHub, OpenAI, and Microsoft.

With the advent of LLMs and systems like ChatGPT, it has gained a lot of traction within the developer community.

An IT developer using an AI tool to generate code with a futuristic dystopian backdrop (Stable Diffusion XL 1.0)

Aided by its tight IDE integration and its ability to retrieve context straight from that IDE (open files, directory structures, project layouts), Copilot can be a useful tool to get coding suggestions from, based both on your input prompts and on the code that it sees.

GitHub Copilot making a suggestion

In this article we are not going to be talking about:

  • The dangers related to the quality of the suggestions it provides (buggy code, vulnerabilities in the code, …)
  • How developers incorporate these suggestions into existing codebases (simply copy-pasting without thinking, …)
  • Potential legal issues that could arise from using code snippets suggested by Copilot.

Rather, we’re going to focus on the data you are potentially sharing and the consequences of doing so.

How it works

Most developers don’t really think about it, but a lot has to happen for a tool like GitHub Copilot to work:

  • It needs to be trained on data, primarily source code. Luckily, GitHub has plenty of it: every public repository can be harvested by anyone, including, obviously, GitHub itself.
  • You need to give it input (prompts). These prompts are sent to the GitHub servers together with additional context (your own source code).
  • That additional context (your own source code) is of course part of your intellectual property, sometimes even protected under copyright law or other license agreements.
  • The code suggestions that GitHub generates based on your prompts are sent back to you (see the sketch below).
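
To make that round trip a bit more concrete, here is a purely illustrative Python sketch of the flow. The endpoint, payload shape and field names are hypothetical, made up for this article; the real protocol is internal to the GitHub Copilot IDE extensions.

```python
# Purely illustrative: what the prompt/context/suggestion round trip
# conceptually looks like. The endpoint and payload shape below are
# hypothetical and do NOT reflect Copilot's actual wire protocol.
from pathlib import Path

COMPLETION_ENDPOINT = "https://copilot.example.com/v1/completions"  # hypothetical


def build_payload(prompt: str, open_files: list[Path]) -> dict:
    """Bundle the user's prompt with context harvested from the IDE."""
    return {
        "prompt": prompt,
        # This is where your own source code (your IP) leaves your machine:
        # open files, neighbouring files, project layout, ...
        "context": {str(path): path.read_text() for path in open_files},
    }


if __name__ == "__main__":
    payload = build_payload(
        prompt="# write a function that parses an ISO-8601 date",
        open_files=[],  # e.g. the files currently open in your editor
    )
    # The IDE extension would send `payload` to the completion endpoint and
    # render the suggestion that comes back inline in your editor.
    print(payload)
```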

What has it been trained on?

To get a better understanding of what we are sharing and why, we need to understand how GitHub Copilot is able to provide all of these interesting suggestions.

GitHub Copilot has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub.

GitHub Octocat sitting in a library reading books with source code (Stable Diffusion XL 1.0)

Note that it hasn’t been trained on your source code (yet). In most cases, depending on how you use GitHub Copilot, you probably don’t want it to be trained on your own copyright-protected source code.

What data does it collect?

GitHub Copilot gives you choices about how it uses the data it collects. This is outlined in the GitHub Copilot FAQ, in the section “How can users of Copilot for Individuals control use of their Code Snippets Data”.

The data that GitHub Copilot collects is:

User Engagement Data (which includes pseudonymous identifiers and general usage data) is required for the use of GitHub Copilot and will continue to be collected, processed, and shared with Microsoft as you use GitHub Copilot.

Prompts and Suggestions: Users of GitHub Copilot for Individuals can choose whether Prompts and Suggestions are retained by GitHub and further processed and shared with Microsoft by adjusting user settings.

When GitHub refers to User Engagement Data, this typically means data needed for:

  • Evaluating GitHub Copilot, for example, by measuring the positive impact it has on the user
  • Fine tuning ranking and sorting algorithms and prompt crafting
  • Detecting potential abuse of GitHub Copilot or violation of Acceptable Use Policies.
  • Conducting experiments and research related to developers and their use of developer tools and services.

None of this data contains your actual code. It is, however, required for the use of GitHub Copilot and will continue to be collected, processed, and shared with Microsoft as you use it.

But what about your code?

So far we haven’t talked about the actual code snippets sitting in your IDE, or in an active window where you are prompting Copilot and receiving suggestions.

The fact remains that, as a developer working for a company that writes software for a client, you typically wouldn’t want to send that source code to a third party.

a software developer wrapping source code as a present (Stable Diffusion XL 1.0)

According to the GitHub Copilot for Business page, in the Code Snippets Data section:

GitHub Copilot transmits snippets of your code from your IDE to GitHub to provide Suggestions to you. Code snippets data is only transmitted in real-time to return Suggestions, and is discarded once a Suggestion is returned. Copilot for Business does not retain any Code Snippets Data.

Users of GitHub Copilot for Individuals can request deletion of Prompts and Suggestions associated with their GitHub identity by filling out a support ticket.

This seems to suggest that only GitHub Copilot for Business does not retain your code snippets. But what about GitHub Copilot for Individuals?

Well, for that we need to look at the GitHub Copilot configuration page:

Once you have an active GitHub Copilot trial or subscription, you can adjust GitHub Copilot settings for your personal account on GitHub in the GitHub Copilot settings. The settings apply anywhere that you use GitHub Copilot. You can configure the suggestions that GitHub Copilot offers and how GitHub uses your telemetry data.

As you can see on the settings page, there is a checkbox (enabled by default) that allows GitHub to use your code snippets for product improvements.

Allowing GitHub to use your code snippets for product improvements, aka training their Codex AI model, is probably something you don’t want to do in a corporate setting.

This can trigger a whole chain of events that you don’t want. What if your source code or IP gets served up as a suggestion to a competitor that is also using GitHub Copilot?

However, as noted in the GitHub documentation, GitHub does offer a solution for this: a checkbox that you can uncheck to ensure that no code snippets are retained and used for training purposes.
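
That checkbox lives in your personal Copilot settings on GitHub.com, not in your editor. Separately, the IDE extensions expose their own controls over where Copilot is allowed to offer suggestions at all. As a minimal sketch, assuming the VS Code extension and its documented github.copilot.enable setting (exact keys and defaults may differ between extension versions), your settings.json could look like this:

```jsonc
{
  // Editor-side control over *where* Copilot offers suggestions.
  // This does not replace the GitHub.com checkbox discussed above, which
  // governs whether your snippets may be used for product improvements.
  "github.copilot.enable": {
    "*": true,          // on by default for all languages
    "plaintext": false, // no suggestions in plain text files
    "yaml": false       // e.g. keep it away from config files that may contain secrets
  }
}
```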

Conclusions

Legal issues around copyright in the software world, which you probably already considered complex, have only become more intricate with the introduction of LLMs.

a legal maze (Stable Diffusion XL 1.0)

Potential legal issues involving copyright aside, according to GitHub and its policies, code sitting in your IDE is only transmitted to generate suggestions; provided the code snippets setting discussed above is disabled, it will not be retained by GitHub, nor used to train its internal models.

That being said, we will continue to provide training and education to our developers, come up with internal policies regarding the usage of these tools, and keep monitoring the space of AI-assisted tools.

We already have a peer review process in place for every line of code generated within the company, and we are convinced that GitHub Copilot can be a great addition to a developer’s existing toolset.
