A journey at the heart of 2.4 million Maven artifacts

Benoit Baudry
Feb 22, 2019


Today we launched the third and last stage of a software research rocket that took us on a journey to the heart of Maven Central. The rocket was launched in July 2018, triggered by one question: what are the most popular libraries in Maven Central? Over the last 6 months we have wandered deep into the extraordinary network of software artifacts of Maven Central to answer this question and discover the vibrant interactions between software artifacts. We brought back large amounts of data, a software infrastructure and a couple of surprising and actionable insights about the practice of software reuse.

Stage 1: Build the graph of Maven artifacts

In September 2018, Maven Central included 2.4 million artifacts and 9.7 million dependencies

The first stage of the rocket was meant to set the foundations for our journey: collect the complete graph of artifacts stored in Maven Central. In September 2018, we built the Maven Dependency Graph, composed of 2.4 million nodes and 9.7 million edges. Each node represents a software artifact and each edge represents a dependency between two artifacts. These artifacts represent 2.1 million releases of 223,478 unique libraries. As part of this first step, we also built a dedicated infrastructure to query and analyze this large graph.
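To make the node-and-edge model concrete, here is a minimal in-memory sketch in Java. It is not the infrastructure we built for the Maven Dependency Graph; the class name, the method names and the org.example coordinates are all hypothetical, and a graph with millions of nodes would need a proper graph database rather than hash maps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy dependency graph: nodes are artifact coordinates, edges are declared dependencies. */
final class DependencyGraph {
    // artifact -> artifacts it depends on (outgoing edges)
    private final Map<String, Set<String>> dependencies = new HashMap<>();
    // artifact -> artifacts that depend on it (incoming edges)
    private final Map<String, Set<String>> dependents = new HashMap<>();

    /** Registers a node; registering the same artifact twice is harmless. */
    void addArtifact(String coordinates) {
        dependencies.computeIfAbsent(coordinates, k -> new TreeSet<>());
        dependents.computeIfAbsent(coordinates, k -> new TreeSet<>());
    }

    /** Adds a directed edge meaning "client depends on library". */
    void addDependency(String client, String library) {
        addArtifact(client);
        addArtifact(library);
        dependencies.get(client).add(library);
        dependents.get(library).add(client);
    }

    int nodeCount() {
        return dependencies.size();
    }

    int edgeCount() {
        int edges = 0;
        for (Set<String> outgoing : dependencies.values()) {
            edges += outgoing.size();
        }
        return edges;
    }

    /** Number of artifacts that directly depend on the given one. */
    int directDependents(String coordinates) {
        Set<String> incoming = dependents.get(coordinates);
        return incoming == null ? 0 : incoming.size();
    }

    /** Artifacts sorted by decreasing number of direct dependents (ties broken by name). */
    List<String> mostDependedUpon(int limit) {
        List<String> all = new ArrayList<>(dependents.keySet());
        all.sort((a, b) -> {
            int byCount = Integer.compare(directDependents(b), directDependents(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(all.subList(0, Math.min(limit, all.size())));
    }

    public static void main(String[] args) {
        DependencyGraph graph = new DependencyGraph();
        // Made-up coordinates in the groupId:artifactId:version form.
        graph.addDependency("org.example:app:1.0", "org.example:lib:2.0");
        graph.addDependency("org.example:tool:1.0", "org.example:lib:2.0");
        graph.addDependency("org.example:app:1.0", "org.example:util:1.1");
        System.out.println(graph.nodeCount() + " nodes, " + graph.edgeCount() + " edges");
        System.out.println("Most depended upon: " + graph.mostDependedUpon(1));
    }
}
```

Counting direct dependents is only one possible reading of "popularity"; transitive dependents would give a different ranking.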

Excerpt of 1% of the whole graph of Maven artifacts

Stage 2: A pool of software diversity

The immutability of Maven artifacts favors software diversity

The second stage of our rocket captured our analysis of the diverse versions of artifacts and how they are adopted by other programs. We made the following hypothesis: the immutability of Maven artifacts and the ability to choose any version naturally fuel the emergence of software diversity within Maven Central. We analyzed the Maven Dependency Graph, focusing on the dependencies towards the different versions of artifacts. We observed that a significant share of the libraries have multiple versions that are actively used. In the case of popular libraries, more than 50% of their versions are used. We also observed that more than 17% of libraries have several versions that are significantly more used than the other versions. This is illustrated in the figure below with the example of JUnit: two different versions are very widely used. These observations support our hypothesis that Maven Central supports a sustained level of diversity among versions of libraries in the repository. We delivered a package to reproduce these results.
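The kind of measurement described above can be sketched in a few lines of Java. This is an illustration only, not the reproduction package: the class, the method names and the version strings are hypothetical, and the real analysis runs over the whole dependency graph rather than a small list.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy version-usage analysis for a single library. */
final class VersionUsage {

    /** Counts how many dependency edges point to each released version. */
    static Map<String, Integer> usagesPerVersion(List<String> releasedVersions,
                                                 List<String> dependencyTargets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String version : releasedVersions) {
            counts.put(version, 0);
        }
        for (String target : dependencyTargets) {
            // Ignore edges that point to versions we do not know about.
            if (counts.containsKey(target)) {
                counts.merge(target, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Fraction of released versions that are used by at least one client. */
    static double shareOfUsedVersions(Map<String, Integer> counts) {
        if (counts.isEmpty()) {
            return 0.0;
        }
        long used = counts.values().stream().filter(c -> c > 0).count();
        return (double) used / counts.size();
    }

    public static void main(String[] args) {
        // Made-up data: four releases, five dependency edges.
        List<String> released = List.of("1.0", "1.1", "2.0", "2.1");
        List<String> targets = List.of("1.0", "2.0", "2.0", "2.0", "9.9");
        Map<String, Integer> counts = usagesPerVersion(released, targets);
        System.out.println(counts);
        System.out.println("Share of used versions: " + shareOfUsedVersions(counts));
    }
}
```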

Popularity of the different versions of Apache Commons IO, JUnit and XML APIs

Stage 3: The essential core of APIs

Maven APIs include a small, essential core of members that are used by most of their clients

The third stage of our rocket led us into the bytecode of Maven artifacts to verify the following hypothesis: a small portion of public APIs is essential for all users, while the rest of the API is seldom used. The figure below visually captures the intuition for this hypothesis: two classes of slf4j-api are used by a majority of clients (yellow nodes in the graph), while the rest of the API is used by far fewer clients (purple nodes).

Here, we studied the 2.3 million dependencies that exist between the 99 most popular libraries of Maven Central (present in 5,431 versions) and their 865,560 clients. Our key findings are as follows: 43.5% of the dependencies declared by the clients are actually not used in the bytecode; all APIs contain a large part of rarely used types and a few frequently used types, and the ratio varies according to the nature of the API, its size and its design; we can systematically extract a reuse-core from APIs that is sufficient to serve most clients: the median size of this subset is 17% of the API, and it can serve 83% of the clients.
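The two measurements above, unused declared dependencies and a compact core of API types, can be illustrated with a small Java sketch. Note the assumptions: the greedy ranking below is a simple heuristic chosen for illustration, not necessarily the procedure used in our study, and the class name, method names and client usage data are all made up (only the slf4j type names are real).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy analysis: input maps each client to the API types referenced in its bytecode. */
final class ReuseCore {

    /** Share of clients that declare the dependency but use none of its types. */
    static double unusedShare(Map<String, Set<String>> usedTypesByClient) {
        if (usedTypesByClient.isEmpty()) {
            return 0.0;
        }
        long unused = usedTypesByClient.values().stream().filter(Set::isEmpty).count();
        return (double) unused / usedTypesByClient.size();
    }

    /** Number of clients that use the API and whose used types all belong to the core. */
    private static long servedClients(Map<String, Set<String>> usedTypesByClient, Set<String> core) {
        return usedTypesByClient.values().stream()
                .filter(types -> !types.isEmpty() && core.containsAll(types))
                .count();
    }

    /**
     * Greedy heuristic (an assumption of this sketch): add API types from most to
     * least used until the core fully serves at least targetShare of the clients
     * that actually use the API.
     */
    static List<String> greedyCore(Map<String, Set<String>> usedTypesByClient, double targetShare) {
        Map<String, Integer> clientsPerType = new TreeMap<>();
        int activeClients = 0;
        for (Set<String> types : usedTypesByClient.values()) {
            if (types.isEmpty()) {
                continue; // declared but unused dependency
            }
            activeClients++;
            for (String type : types) {
                clientsPerType.merge(type, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(clientsPerType.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(clientsPerType.get(b), clientsPerType.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        List<String> core = new ArrayList<>();
        Set<String> coreSet = new HashSet<>();
        for (String type : ranked) {
            if (servedClients(usedTypesByClient, coreSet) >= targetShare * activeClients) {
                break; // enough clients are already fully served
            }
            core.add(type);
            coreSet.add(type);
        }
        return core;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> usage = new LinkedHashMap<>();
        usage.put("client-1", Set.of("org.slf4j.Logger"));
        usage.put("client-2", Set.of("org.slf4j.Logger", "org.slf4j.LoggerFactory"));
        usage.put("client-3", Set.of("org.slf4j.Logger", "org.slf4j.LoggerFactory"));
        usage.put("client-4", Set.of("org.slf4j.Logger", "org.slf4j.MDC"));
        usage.put("client-5", Set.of()); // declares the dependency, never uses it
        System.out.println("Unused share: " + unusedShare(usage));
        System.out.println("Core for 75% of clients: " + greedyCore(usage, 0.75));
    }
}
```

A client counts as served only when every type it uses is in the core, which is a deliberately strict criterion; a looser one would produce an even smaller core.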

Excerpt of the usage graph of slf4j-api. Dark blue nodes are API types that are used by the majority of clients and light blue nodes are API types with regular usages. Yellow nodes are clients that depend on the most popular API types, while purple nodes depend only on other types. Links are API usages from client types to API types. Node size represents the number of calls to the API type

Key insights from this journey in Maven Central

  • Maven Central is massively used to distribute and release software artifacts: it contained 2.4 million artifacts as of September 2018.
  • Maven Central holds a treasure of extraordinary software development, from a very small API with only 7 annotations that is used by thousands of other programs, to some giant APIs that clients use in a very focused way.
  • The immutability of Maven artifacts sustains diversity radiation in this software ecosystem. More than 17% of the libraries have several versions that are actively used by a large number of clients.
  • 1.3 million declared dependencies are actually not used. This calls for novel techniques to detect and remove useless dependencies (to reduce the size of the built jar).
  • A vast majority of APIs can be reduced to a small, compact core and still serve most of their clients. The majority of the APIs we analyzed could be reduced to 17% of their types and still serve 83% of their clients.
  • Modules in the latest versions of Java provide language support to fine-tune APIs. A module can export the list of packages that are accessible from outside the module, and it can provide service implementations to be consumed by other modules.
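As an illustration of the last insight, here is a minimal sketch of two Java module declarations. The `exports`, `provides ... with`, `requires` and `uses` directives are standard Java module syntax; the module, package and type names are hypothetical, and each declaration would live in its own module-info.java file.

```java
// module-info.java of a hypothetical library module
module org.example.logging {
    // Only this package is visible to other modules; everything else stays internal.
    exports org.example.logging.api;

    // Offers an implementation of a service interface to consumer modules.
    provides org.example.logging.api.LogBackend
        with org.example.logging.internal.ConsoleBackend;
}

// module-info.java of a hypothetical client module (separate file)
module org.example.app {
    requires org.example.logging;

    // Declares that this module looks up LogBackend implementations via ServiceLoader.
    uses org.example.logging.api.LogBackend;
}
```

In this sketch, a library author who knows the essential core of their API could export only the packages that hold it and keep the rest unexported.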

The crew

Amine Benelallam, Nicolas Harrand and César Soto were the fearless pilots for this rocket, while Olivier Barais and I staffed the mission control center. The journey benefited from the dynamic and extremely valuable advice of our "Trench" colleagues, Martin Monperrus, Zimin Chen, He Ye, Oscar Luis Vera Perez and Manuel Leduc. We relied on great open source technology to navigate through the Maven network: R, Neo4j, ASM, Gephi and the extraordinary JVM.

If you would like to be notified of other cool results that we have, shoot an email to software-research.subscribe@4open.science

References

The Maven Dependency Graph: a Temporal Graph-based Representation of Maven Central

The Emergence of Software Diversity in Maven Central

Analyzing 2.3 Million Client-Library Maven Dependencies to Reveal an Essential Core in APIs
