Why is GitHub Copilot embroiled in so much controversy?

Yongtai Zhang
SI 410: Ethics and Information Technology
9 min read · Feb 20, 2022
Photo by Mohammad Rahmani on Unsplash

GitHub, the company that owns one of the largest open-source code platforms in the world, announced the technical preview of an AI pair programmer product on 29 June 2021. The product’s name is self-explanatory: GitHub Copilot. Although GitHub has published plenty of evidence showing how much GitHub Copilot can improve programmers’ coding efficiency, the product has been mired in controversy ever since it launched. Rather than questioning whether the product is helpful in practice, most critics are concerned with whether its use of public data counts as “fair use.” After all, just because data is available to someone does not mean it is ethical for them to use it.

Before we dig into this topic, we need to be clear about the general mechanism behind GitHub Copilot, and most other AI pair programmers, and about how it takes advantage of public data.

GitHub Copilot is powered by OpenAI Codex, an AI system created by OpenAI on the basis of GPT-3. Both Codex and GPT-3 are language models that use deep learning to produce text. If you follow AI products, you may have seen tools that automatically generate human-readable stories from given text snippets; some of them are powered by GPT-3.

An example of GPT-3 producing human-readable text

Compared with its predecessor, Codex is a specialized version of GPT-3 that focuses on producing machine-readable text, or, as programmers would say, code.

Here, some readers might ask: “Okay, I can see what kind of system supports GitHub Copilot, but what is its relationship to public data?” Well, the connection lies in the term “language model.” In general, to make a language model work, you first need to feed it enough data through its training algorithm. Just as you might help a baby learn what a “bird” is by showing it photos of many species of birds, “feeding” data to the algorithm lets it infer rules from that data, which it can then apply to the tasks it meets in the future. Obviously, learning rules that are detailed enough requires a huge amount of data, and one of the biggest problems is how to gather enough of it. For GitHub, the owner of the large open-source code platform of the same name, the most straightforward solution is to take advantage of the projects stored on that platform. And that is exactly what GitHub has done.

But wait a second: even if those projects are open source, is it legal to use them in products like GitHub Copilot? In June 2021, Nat Friedman, then GitHub’s CEO, said that:

In general: (1) training ML systems on public data is fair use (2) the output belongs to the operator, just like with a compiler. (link)

Although Friedman’s statement sounds confident, the product’s behavior is not as reassuring as the statement itself.

Recitation

Even though GitHub Copilot uses public data to produce code snippets, GitHub claims that the product is not simply reciting code, and thus will not hurt originality. On the product website, they say:

GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. (link)

However, GitHub has also admitted that a small portion of the code GitHub Copilot generates is copied verbatim from other sources. For example, in the document describing their research on recitation, they show GitHub Copilot printing “The Zen of Python” exactly, line by line.

“The Zen of Python,” automatically generated by GitHub Copilot

Besides, cases where GitHub Copilot’s output looks like plain copying and pasting have been reported from time to time. People claim to have seen GitHub Copilot produce code snippets famously belonging to other products, links to personal websites, and even the usernames of specific accounts.

Screenshot claiming that GitHub Copilot recites the famous “fast inverse square root” function from Quake III

As a result, the answer to the question of whether GitHub Copilot is making “fair use” of public data remains unclear: at the very least, the outputs it generates risk being criticized for infringing on originality.

Ownership

Even if we could guarantee that GitHub Copilot will never produce code snippets identical to parts of existing projects, it is still hard to say who should own a code snippet that GitHub Copilot produces.

If we regard GitHub Copilot, the AI-powered product, as a programmer who writes code, then, just as an author owns the articles they write, the code it generates should by default belong to GitHub Copilot. But what if we instead regard GitHub Copilot as a tool that helps programmers produce code faster, like a good keyboard? In that case, the programmer using the service GitHub Copilot provides would seem to be the default owner of the output.

On this question, GitHub “kindly” assigns the ownership to the programmer, along with all of the corresponding responsibility. On their website, they claim that:

GitHub Copilot is a tool, like a compiler or a pen. The suggestions GitHub Copilot generates, and the code you write with its help, belong to you, and you are responsible for it. (link)

At first glance, the statement seems to benefit programmers from the perspective of ownership, but it also helps GitHub avoid any responsibility for helping programmers fix their products when GitHub Copilot outputs code that causes errors. Just think about whether companies that design and sell cars have to cover the financial losses of car accidents. Although this analogy is not entirely apt, I hope it shows that GitHub can also benefit from this provision.

Setting aside whether this provision ought to be more detailed, it might seem that GitHub’s provision settles the scope of the problem. But what if the code snippets belong to neither GitHub Copilot nor the programmer? As we have seen, the outputs a language model generates are based on its inputs, the data that has been fed to the algorithm. Saying that models “learn” rules from input data and “apply” them to outputs can make the process sound like a writer “reading” articles and then “creating” new ones. But keep in mind that models and algorithms can equally be regarded as mere numbers and formulas. We do not fully understand how humans learn, so we cannot know whether a model’s “learning” should be regarded the same way as a person’s. From another perspective, a language model might only produce a hodgepodge of its inputs, which sounds far removed from any sense of “creating.” In that case, wouldn’t the ownership of the output actually belong to the owners of the input data?

Pushing further on this issue would take the conversation into the realm of unresolved philosophical questions. At this point, though, the question of who owns GitHub Copilot’s outputs seems hard to answer with the explanation on the product’s website alone.

Difficult Questions About the Use of Public Data in Big Data Research

For now, let’s say GitHub Copilot has earned everyone’s trust in the output it generates. Would that free it from being critiqued or questioned? In fact, regardless of the outcome, the very act of using public data already contains plenty of problems worth discussing.

According to Critical questions for Big Data (boyd & Crawford, 2012), how to avoid harming people’s privacy when using public data is a question that is hard to answer but still necessary to consider.

Should someone be included as a part of a large aggregate of data? What if someone’s ‘public’ blog post is taken out of context and analyzed in a way that the author never imagined? What does it mean for someone to be spotlighted or to be analyzed without knowing it? Who is responsible for making certain that individuals and communities are not hurt by the research process? What does informed consent look like? (boyd & Crawford, 2012)

If any of these questions from Critical questions for Big Data (boyd & Crawford, 2012) sounds hard to answer, then you already have a sense of how hard it is to determine whether a specific use of public data is ethical.

As for GitHub Copilot, the model it uses was trained on public GitHub repositories under any license, amounting to billions of lines of public code contributed by more than 73 million developers on the GitHub platform. First, GitHub has stated that all the data sources it used are public. But second, we do not know whether some of those licenses forbid uses such as big data analysis or language model training. What’s more, although users of the GitHub platform may have accepted terms that allow GitHub to take advantage of their projects before publishing their code on the platform, it is still unclear whether those terms also apply to GitHub Copilot, a new product distinct from the platform itself.

Is “Public Data” equally “public” to everyone?

Beyond the general issue of using public data, commercial use makes the case even thornier. According to Critical questions for Big Data (boyd & Crawford, 2012), different people face different abilities and restrictions when trying to access public data. Thus, even though “public data” sounds “public” to everyone, the degree to which the data is open actually differs from group to group.

But who gets access? For what purposes? In what contexts? And with what constraints? (boyd & Crawford, 2012)

When we create a post on social media, we are using a service provided by a social media company. As a result, the company has full access to the “public data” on its platform, along with the ability to run computations over that data and extract results. Ordinary users, by contrast, can only reach parts of the “public data,” due to the design of recommendation systems and application interfaces, and have no practical way to compute over data at that scale. The same goes for the GitHub platform and its users: although everyone can benefit from the open-source code platform, GitHub itself may always be among the biggest beneficiaries.

In conclusion, although GitHub seems confident about GitHub Copilot when it comes to information ethics, there are several questions it has so far failed to answer: it has to convince people that the product is not copying existing code, and that the code it provides will legally belong to its users.

Going further, it also has to handle the problems most companies face when trying to show that they are using Big Data “ethically.”

Until most of these questions are answered, it would be extremely hard for me to say that I am confident of acting ethically when using GitHub Copilot.

References:

Lui, H. (2021, April 20). 9 examples of writing with OpenAI’s GPT-3 language model. Herbert Lui. Retrieved March 10, 2022, from https://herbertlui.net/9-examples-of-writing-with-openais-gpt-3-language-model/

Analyzing the legal implications of GitHub Copilot. Hacker News. (n.d.). Retrieved March 10, 2022, from https://news.ycombinator.com/item?id=27846324

GitHub Copilot · Your AI pair programmer. GitHub Copilot. (n.d.). Retrieved March 10, 2022, from https://copilot.github.com/

Research recitation. GitHub Docs. (n.d.). Retrieved March 10, 2022, from https://docs.github.com/en/github/copilot/research-recitation

GitHub Copilot and the rise of AI language models in programming automation. Exxact. (n.d.). Retrieved March 10, 2022, from https://www.exxactcorp.com/blog/Deep-Learning/github-copilot-ai-pair-programmer

boyd, danah, & Crawford, K. (2012). Critical questions for Big Data. Information, Communication & Society, 15(5), 662–679. https://doi.org/10.1080/1369118x.2012.678878
