Analyzing the evolution of the Linux kernel using code authorship measures

In a recent study, we investigated the evolution of the Linux kernel using a code authorship measure. We provide answers to four major questions:

  1. What is the proportion of authors/developers? We call authors the developers who made significant changes to at least one Linux source code file. This may include the original file creator, as well as those who subsequently change it. Therefore, we focus on relevant developers (authors), instead of developers with minor or casual contributions.
  2. What is the number of files per author? The goal is to use authorship as a measure of the amount of work performed by Linux developers.
  3. What is the proportion of specialist authors (i.e., authors who work in a single subsystem) vs generalists? The idea is to provide insights into how the development work is organized in the Linux kernel.
  4. What are the main properties of the Linux co-authorship network? In our model, a source code file can have multiple authors. So, it is possible to produce a co-authorship network, where the nodes are authors and the edges connect authors who are co-authors of at least one source code file. We reveal the main properties of such networks, like mean degree.

Our analysis covers 56 stable releases (v2.6.12–v4.7), spanning a period of over 11 years of development (June 2005–July 2016).

Study Design

The study relies on the Degree-of-Authorship (DOA) measure to define the authors — i.e., the key developers — of each file in a system. DOA values are computed from commit histories, as follows:

  • the creation of a file f by a developer d initializes the value of DOA(d, f );
  • further commits on f by d increase DOA(d, f );
  • finally, commits by other developers decrease DOA(d, f ).

The DOA values are normalized per file; the developer with the highest DOA value in a file f has a normalized DOA equal to 1. A developer is considered an author of a file if their normalized DOA is greater than 0.75. To arrive at this threshold, we manually validated the DOA results produced for six popular systems (more details in this paper).
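The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not give the DOA weights or the exact functional form, so the constants `W_BASE`, `W_CREATION`, `W_OWN`, and `W_OTHERS`, the logarithmic damping, and the function names are all hypothetical. Only the direction of each effect, the per-file normalization, and the 0.75 threshold come from the text.

```python
import math
from collections import Counter

# ASSUMPTION: the post does not state the DOA weights. These values are
# hypothetical placeholders that only reproduce the direction of each effect.
W_BASE, W_CREATION, W_OWN, W_OTHERS = 3.0, 1.0, 0.2, 0.3


def raw_doa(commits):
    """Raw DOA per developer for one file.

    commits: list of (developer, is_creation) tuples for that file.
    """
    creators = {dev for dev, is_creation in commits if is_creation}
    changes = Counter(dev for dev, is_creation in commits if not is_creation)
    total_changes = sum(changes.values())
    scores = {}
    for dev in creators | set(changes):
        by_others = total_changes - changes[dev]
        scores[dev] = (
            W_BASE
            + W_CREATION * (dev in creators)      # creation initializes DOA
            + W_OWN * changes[dev]                # own commits increase it
            - W_OTHERS * math.log(1 + by_others)  # others' commits decrease it
        )
    return scores


def file_authors(commits, threshold=0.75):
    """Developers whose DOA, normalized by the file's maximum, exceeds threshold."""
    scores = raw_doa(commits)
    top = max(scores.values(), default=0.0)
    if top <= 0:
        return set()
    return {dev for dev, score in scores.items() if score / top > threshold}


history = [("alice", True), ("alice", False), ("bob", False),
           ("carol", False), ("alice", False)]
print(file_authors(history))
```

With these placeholder weights, the creator who keeps committing remains the only author, while the two occasional contributors fall below the threshold.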

Proportion of authors/developers

Table 1 shows the proportion of authors in each Linux subsystem.

In the last release, the Linux kernel has 13,436 developers, but only 3,459 (26%) are authors of at least one file.

Throughout the kernel's development, the proportion of authors is nearly constant (standard deviation: ±0.83%). Thus, the heavy-load Linux kernel maintenance has been kept in the hands of fewer than a third of all developers.

Table 1 — Linux subsystems size and authors proportion

Number of files per author

Figure 1 presents boxplots of files per author across the Linux releases.

With the exception of one release (v2.6.24), 50% of the authors are responsible for at most three files (median); the 75th percentile ranges from 11 to 16 files across releases.

The share of authors with more than 100 files never exceeds 7%, falling from 7% in the first release to 3% in the last one.
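For readers who want to reproduce this kind of summary on their own data, the following is a minimal sketch. The function name `summarize` and the sample counts are hypothetical, and the quartile method is Python's default, which may differ slightly from the one used for the boxplots.

```python
import statistics


def summarize(files_per_author, heavy=100):
    """Median, third quartile, and share of authors with more than `heavy` files."""
    median = statistics.median(files_per_author)
    q3 = statistics.quantiles(files_per_author, n=4)[2]  # third cut point
    share_heavy = sum(1 for n in files_per_author if n > heavy) / len(files_per_author)
    return {"median": median, "q3": q3, "share_heavy": share_heavy}


# Hypothetical file counts for eight authors in one release.
sample = [1, 1, 2, 3, 3, 5, 12, 150]
print(summarize(sample))
```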

Fig. 1. Distribution of the number of files per author in each release

Therefore, file authorship follows a pyramid-like shape of increasing authority.

At the top, Linus Torvalds acts as a “dictator”, centralizing authorship of most of the files (after all, he did create the kernel!). Below him are his hand-picked “lieutenants”, often chosen on the basis of merit. This organization directly reflects the Linux kernel contribution dynamics, which is itself a pyramid. However, as the kernel evolves, we see that Torvalds is becoming more “benevolent”.

The percentage of files authored by Torvalds has dropped from 45% (first release) to 10% in v4.7 (see Figure 2).

Fig. 2. Percentage of files authored by the top-10 authors over time. The line represents Linus Torvalds (top-1) and the bars represent the accumulated number of files of the next top-9 authors.

We also use the Gini coefficient to analyze the distribution of the number of files per author (Figure 3). In all releases, the coefficient is high, confirming the skewness. However, we notice a decreasing trend, from 0.88 in the first release to 0.78 in v4.7. This trend further strengthens our notion that authorship in the Linux kernel is becoming less centralized.
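As a reminder of how the coefficient is obtained, here is a minimal sketch that computes it from a list of per-author file counts using the standard sorted-rank formula. The function name `gini` and the sample inputs are illustrative and not taken from the study.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Sorted-rank form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


print(gini([5, 5, 5, 5]))   # every author holds the same number of files
print(gini([0, 0, 0, 10]))  # one author holds every file
```

For a finite sample of n authors, the largest value this formula can reach is (n - 1)/n, which approaches 1 as n grows.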

Fig. 3. Gini coefficients. The coefficient ranges from 0 (perfect equality) to 1 (perfect inequality).

Specialists (i.e., authorship in a single subsystem) vs Generalists

We call authors specialists if they author files in a single subsystem. Generalists, in turn, author files in at least two subsystems.
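This definition translates directly into code. The sketch below assumes authorship is available as (author, subsystem) pairs; the function name `classify` and the sample names are hypothetical.

```python
from collections import defaultdict


def classify(authorship):
    """Split authors into specialists (one subsystem) and generalists (two or more).

    authorship: iterable of (author, subsystem) pairs, one per authored file.
    """
    subsystems = defaultdict(set)
    for author, subsystem in authorship:
        subsystems[author].add(subsystem)
    specialists = {a for a, subs in subsystems.items() if len(subs) == 1}
    generalists = set(subsystems) - specialists
    return specialists, generalists


pairs = [("ana", "Driver"), ("ana", "Driver"), ("bo", "Net"), ("bo", "Fs")]
specialists, generalists = classify(pairs)
print(specialists, generalists)
```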

The number of specialists exceeds the number of generalists (see Figure 4). In the Linux kernel (All), any given release has at least 61% of specialist authors, with a maximum of 64%; at most 39% of the authors are generalists.

Fig. 4. Percentage of specialists and generalists

Looking at the work specialization in each subsystem also provides a means to assess how much the Linux kernel architectural decomposition fosters specialized work.

The architectural decomposition plays a key role in fostering specialists inside the Driver subsystem (more than 50% of specialists), but less so elsewhere.

Linux co-authorship network

Many files in the Linux kernel result from the work of different authors. We therefore set out to investigate this collaboration through the properties of the Linux kernel co-authorship network. We model the network as follows: vertices stand for Linux kernel authors; an edge connects two authors vi and vj if ∃f such that {vi, vj} ⊆ authors(f). In other words, an edge represents a collaboration. The figure that opens this post shows the Linux co-authorship network (note the central role of Linus Torvalds).
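The edge rule can be sketched as follows, assuming a mapping from file path to its set of authors is already available. The function name `coauthorship_edges` and the file names are hypothetical.

```python
from itertools import combinations


def coauthorship_edges(authors_by_file):
    """Undirected edges {vi, vj} for every pair of authors sharing a file.

    authors_by_file: dict mapping a file path to an iterable of its authors.
    Each edge is stored once, as a tuple in sorted order.
    """
    edges = set()
    for authors in authors_by_file.values():
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return edges


files = {
    "a.c": {"linus", "ana"},
    "b.c": {"linus", "bo", "ana"},
    "c.c": {"cy"},  # a single-author file adds no edge
}
print(sorted(coauthorship_edges(files)))
```

Storing each pair in sorted order keeps the network undirected and avoids counting the same collaboration twice when two authors share several files.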

We analyze the latest co-authorship network, as given in release v4.7 (Table 2). The number of vertices (authors) determines the size of a co-authorship network. The mean degree, in turn, measures the average number of co-authors that an author is connected to.

At the system level (All), the mean vertex degree is 3.64, i.e., on average, a Linux author collaborates with 3.64 other authors.
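In an undirected network, every edge contributes to the degree of two vertices, so the mean degree equals 2|E|/|V|. A minimal sketch, with a hypothetical function name and toy data, and assuming that authors without any co-author still count as vertices:

```python
def mean_degree(vertices, edges):
    """Mean vertex degree of an undirected graph: 2 * |E| / |V|."""
    if not vertices:
        return 0.0
    return 2 * len(edges) / len(vertices)


authors = {"linus", "ana", "bo", "cy"}           # "cy" has no co-author
links = {("ana", "linus"), ("bo", "linus")}
print(mean_degree(authors, links))
```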

At the subsystem level, Driver forms the largest network (2,604 authors, 75%), whereas Misc results in the smallest one.

Table 2. Co-authorship network properties (release v4.7)

The third property, the clustering coefficient, reveals the degree to which the neighbors of a given vertex tend to be connected to each other. In a co-authorship network, the coefficient gives the probability that two authors who have a co-author in common are also co-authors themselves. A high coefficient indicates that the vertices tend to form dense clusters. The clustering coefficient of the Linux kernel is small (0.080). Nonetheless, Net, Misc, and Fs exhibit a higher tendency to form dense clusters (0.205, 0.188, and 0.175, respectively) in comparison to other subsystems.
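The probability reading above corresponds to the global clustering coefficient (transitivity): the fraction of connected triples that are closed into triangles. The sketch below assumes that variant; the post does not say whether the global or the averaged local coefficient was used, and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def transitivity(edges):
    """Global clustering coefficient: closed triples / all connected triples."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    closed = 0
    triples = 0
    for center, around in neighbors.items():
        k = len(around)
        triples += k * (k - 1) // 2  # pairs of co-authors sharing `center`
        # A pair is closed when the two co-authors are also connected.
        closed += sum(1 for x, y in combinations(around, 2) if y in neighbors[x])
    return closed / triples if triples else 0.0


# A triangle (a, b, c) with one extra author d attached to c.
print(transitivity([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]))
```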

Last, but not least, we compute the assortativity coefficient, which correlates the number of co-authors of an author (i.e., its vertex degree) with the number of co-authors of the authors it is connected to [26]. Ranging from −1 to 1, the coefficient shows whether authors with many co-authors tend to collaborate with other highly connected authors (positive correlation). In v4.7, all subsystems have negative assortativity coefficients, ranging from −0.146 in Fs to −0.029 in the Net subsystem.
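One common way to compute this coefficient is the Pearson correlation between the degrees found at the two ends of every edge, with each undirected edge counted in both directions. The sketch below assumes that formulation; the function name and the toy graphs are illustrative.

```python
from collections import Counter


def degree_assortativity(edges):
    """Pearson correlation of the degrees at the two ends of each edge."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # Count every undirected edge in both directions so the measure is symmetric.
    xs, ys = [], []
    for a, b in edges:
        xs += [degree[a], degree[b]]
        ys += [degree[b], degree[a]]
    n = len(xs)
    if n == 0:
        return 0.0
    mean = sum(xs) / n  # xs and ys hold the same values, so they share a mean
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var if var else 0.0


# A hub with three otherwise unconnected co-authors is maximally disassortative.
print(degree_assortativity([("hub", "x"), ("hub", "y"), ("hub", "z")]))
```

A star-shaped network like this one mirrors the pattern described in the text: one highly connected author linked to authors who have no other collaborations.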

Linux kernel developers often divide work among experts who help less experienced developers. These experts (i.e., highly connected vertices), in turn, usually do not collaborate among themselves (i.e., the networks have negative assortativity coefficients).

More Info:

Guilherme Avelino, Leonardo Passos, Andre Hora, Marco Tulio Valente. Assessing Code Authorship: The Case of the Linux Kernel. In 13th International Conference on Open Source Systems (OSS), pages 1–12, 2017.