Measuring code authorship in GitHub projects

aserg.ufmg
Aug 27, 2017

2/3 of the contributors of GitHub projects do not make significant contributions to the implementation of any source code file (median value considering a sample of 133 popular GitHub projects).

Source code authorship is fundamentally different from authorship in other contexts, such as books or scientific papers, where the authors are explicitly stated and do not change over time. In software, a code artifact is created by one developer but may later be changed by hundreds of others. Therefore, after years of maintenance, the creator of a file may have very limited knowledge of its code.

In 2015, we conducted a large-scale study on code authorship, using a dataset of 133 GitHub systems implemented in six popular programming languages (Java, C/C++, PHP, Python, Ruby, and JavaScript). In previous work, we used this dataset to evaluate an algorithm that estimates the truck factor of GitHub projects (see this full paper or its short version).

To assess authorship, we rely on a Degree-of-Authorship (DOA) metric that considers not only first authorship events, but also further code changes and recency data.

We use this metric to infer the authors of a file, which is defined as follows:

The authors of a source file are the developers who made substantial and relevant contributions to its code.
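To make the idea of a DOA score more concrete, here is a minimal Python sketch. It assumes the linear DOA model commonly cited in the authorship literature, which combines first authorship, the developer's own changes (deliveries), and changes made by others (acceptances). This post does not list the formula or its weights, so the coefficients and the function name `degree_of_authorship` are assumptions for illustration, and this simplified form does not model recency explicitly.

```python
import math

# Coefficients commonly cited for the DOA model in the literature.
# This post does not state them, so treat them as an assumption.
INTERCEPT = 3.293
W_FIRST_AUTHORSHIP = 1.098
W_DELIVERIES = 0.164
W_ACCEPTANCES = 0.321


def degree_of_authorship(first_author: bool, deliveries: int, acceptances: int) -> float:
    """Estimate one developer's DOA for one file.

    first_author: True if the developer created the file.
    deliveries:   number of changes the developer made to the file.
    acceptances:  number of changes other developers made to the file.
    """
    return (
        INTERCEPT
        + W_FIRST_AUTHORSHIP * (1 if first_author else 0)
        + W_DELIVERIES * deliveries
        - W_ACCEPTANCES * math.log(1 + acceptances)
    )


# A creator who never touched the file again, while others changed it 200 times...
creator = degree_of_authorship(first_author=True, deliveries=0, acceptances=200)
# ...versus a later maintainer with 30 changes of their own.
maintainer = degree_of_authorship(first_author=False, deliveries=30, acceptances=5)
print(creator < maintainer)
```

Under these assumed weights, the maintainer ends up with a higher score than the original creator, which matches the intuition that the creator of a heavily maintained file may no longer know its code well.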

Based on the DOA metric, we provide answers for the following questions:

  1. How many contributors of a system are authors of at least one file?
  2. How (un)equal are the authorship distributions?
  3. How frequent are single authorship files?

How many contributors are authors of at least one file?

The authors ratio is the number of authors of at least one source code file divided by the total number of contributors of an open source project. Suppose a project has 100 contributors, but only 10 are authors of at least one source code file. The other 90 developers made only minor contributions to the files they changed. In this system, the authors ratio is 10%.
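The ratio defined above can be sketched in a few lines of Python. The function name `authors_ratio` and the input shape (a mapping from file path to the set of its authors) are assumptions made for this illustration, not part of the study's tooling.

```python
def authors_ratio(file_authors: dict[str, set[str]], contributors: set[str]) -> float:
    """Fraction of contributors who are authors of at least one file."""
    if not contributors:
        return 0.0
    # Union of the author sets of all files, restricted to known contributors.
    authors = set().union(*file_authors.values()) & contributors
    return len(authors) / len(contributors)


# The example from the text: 100 contributors, 10 of them authors.
contributors = {f"dev{i}" for i in range(100)}
file_authors = {f"file{i}.py": {f"dev{i}"} for i in range(10)}
print(f"{authors_ratio(file_authors, contributors):.0%}")
```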

We found systems with a high authors ratio. These systems usually have a considerable number of paid developers, and some are supported by commercial organizations. This is the case for four of the top-10 systems with the highest authors ratio:

  • v8/v8 (75% of the contributors are authors, i.e., made relevant contributions to at least one source code file)
  • Jetbrains/IntelliJ-community (73% of the contributors are authors)
  • WordPress/WordPress (67% of the contributors are authors)
  • Facebook/osquery (62% of the contributors are authors).

We also detected two language interpreters among the top-10 systems: ruby/ruby (72%) and php/php-src (59%).

At the other extreme, there are systems with a very low authors ratio, like

  • sstephenson/sprockets (only 3% of the contributors are authors, i.e., made relevant contributions to at least one source code file)
  • jashkenas/backbone (2% of the contributors are authors).

For example, jashkenas/backbone has six authors among its 248 developers. These authors account for 67% of the commits in the system. A similar situation occurs in sprockets (a Ruby library for compiling and serving web assets). Although 61 developers have made commits to the system, 95% of the commits were made by the two developers classified as authors. Moreover, 27 developers made a single commit that modified a single line of code.

How (un)equal are the authorship distributions?

In the previous question, we did not consider the number of files authored by each author; a developer counts as an author regardless of the number of files for which she has this status. In this question, we investigate the relative importance of a system's authors by analyzing the number of files they have authored. The goal is to reveal how equal (or unequal) the authorship distributions are.

The following figure shows the percentage of files authored by the authors of each system in our dataset. Each bar represents a system and the colors represent specific percentiles. Red bars represent the percentage of files authored by the top-10% of authors, i.e., those with the most authored files; yellow bars represent the percentage of files authored by the next 10% of authors; finally, the gray bars represent the percentage of files authored by the remaining authors of a system. Because a file can have multiple authors, the bars of some systems exceed 100%.

Percentage of files authored per author. The red bars represent the percentage of files authored by the top-10% most productive authors of a system.

This figure shows a high concentration of files in the hands of the top-10% authors, ranging from 45% (fzaninotto/Faker) to 100%, which happens in the case of three systems: getsentry/sentry, ratchet-php/Ratchet, and nicolasgramlich/AndEngine.
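A rough Python sketch of how the red-bar value could be computed for one system follows. It rests on two assumptions that the text does not spell out: the top-10% group is rounded up to at least one author, and a file counts toward the group if at least one of its authors belongs to it. The function name `top_authors_file_share` is hypothetical.

```python
import math
from collections import Counter


def top_authors_file_share(file_authors: dict[str, set[str]],
                           top_fraction: float = 0.10) -> float:
    """Percentage of files with at least one author among the top authors."""
    # Count how many files each developer is an author of.
    files_per_author = Counter(
        author for authors in file_authors.values() for author in authors
    )
    if not files_per_author:
        return 0.0
    # Size of the top group, rounded up (rounding first avoids float noise).
    top_n = max(1, math.ceil(round(top_fraction * len(files_per_author), 9)))
    top_authors = {author for author, _ in files_per_author.most_common(top_n)}
    covered = sum(1 for authors in file_authors.values() if authors & top_authors)
    return 100.0 * covered / len(file_authors)


example = {
    "f1": {"ann"},
    "f2": {"ann"},
    "f3": {"ann", "bob"},
    "f4": {"bob"},
    "f5": {"cat"},
}
print(top_authors_file_share(example))
```

Because `f3` has two authors, it would count toward both the top group and the remaining authors, which is why stacked bars can add up to more than 100%.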

How frequent are single authorship files?

Our definition of authors accepts multiple authors per file; any developer with a normalized DOA greater than 0.75 for a given file is considered one of its authors. Therefore, in this section, we analyze how common single-authorship files are in open source software.
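The threshold rule and the resulting single-authorship rate can be sketched as follows. The sketch assumes that DOA values are normalized per file by dividing by the highest DOA in that file (the text only says "normalized") and that files without any author are left out of the rate. Both function names are hypothetical.

```python
def file_authors_from_doa(doa_by_file: dict[str, dict[str, float]],
                          threshold: float = 0.75) -> dict[str, set[str]]:
    """Map each file to the developers whose normalized DOA exceeds the threshold."""
    result = {}
    for path, doa_by_dev in doa_by_file.items():
        # Assumed normalization: divide by the highest DOA for this file.
        best = max(doa_by_dev.values(), default=0.0)
        result[path] = {
            dev for dev, doa in doa_by_dev.items()
            if best > 0 and doa / best > threshold
        }
    return result


def single_authorship_rate(file_authors: dict[str, set[str]]) -> float:
    """Percentage of authored files that have exactly one author."""
    authored = [authors for authors in file_authors.values() if authors]
    if not authored:
        return 0.0
    return 100.0 * sum(1 for authors in authored if len(authors) == 1) / len(authored)


doa = {
    "dns.c": {"ann": 8.0, "bob": 7.0, "cat": 3.0},  # bob: 7/8 = 0.875 > 0.75
    "util.c": {"ann": 9.0, "bob": 2.0},             # only ann passes
}
authors = file_authors_from_doa(doa)
print(authors["dns.c"] == {"ann", "bob"}, single_authorship_rate(authors))
```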

The top-3 systems with the highest rate of single-authorship files are koush/ion (100%), paulasmuth/fnordmetric (99.5%), and nicolasgramlich/AndEngine (99.5%). In these projects, a single person has developed almost all the code, and the contributions of other developers are restricted to a few files. For the first system, the main developer has made 93% of the commits and she is not the author of a single file. For the second system, the main developer has made 85% of the commits and he is the author of 96% of the files.

On the other hand, the repositories with the lowest rate of single authorship files are caskroom/homebrew-cask (24%), moment/moment (25%), and spring-projects/spring-framework (48%).

We counted only 894 files with more than three authors in the whole dataset (less than 0.4%). The largest number of authors per file is 11, found in ext/standard/dns.c in php/php-src. This file is a library of functions for DNS features. It has 1,117 lines of code, which were changed in 243 commits by 40 different developers. The second highest number of authors per file is 9, found both in a cryptographic library in php/php-src and in a file that generates code for double arithmetic binary operations in v8/v8.

Conclusion

Based on the results of this study, we draw three major conclusions:

  1. The percentage of developers classified as file authors reaches 70% in some open source systems, e.g., in systems with close links with commercial organizations. However, the median value is 23%.
  2. The distributions of code authorship in GitHub systems are highly unequal. The top-10% authors are responsible for more than 50% of the code in 129 out of 133 systems.
  3. The one-file-one-author model is the norm in the analyzed systems; 100 systems (75%) have at least 66% of the files with a single author.
