Analyzing the NPM dependency network
This is not a knee-jerk action. I love open source and believe that open source community will eventually create a truly free alternative for NPM.
The removal of these modules immediately affected many thousands of dependent projects. As NPM reports:
Shortly after 2:30 PM on Tuesday, March 22, we began observing hundreds of failures per minute, as dependent projects — and their dependents, and their dependents… — all failed when requesting the now-unpublished package.
Further questions can be raised regarding the dependencies:
- What was the affected projects’ dependency network like?
- Aside from the popular modules like Babel and Node, what other central modules were affected because they are also dependent upon the removed modules?
- Which of the removed modules had the highest impact on the npm network? Were there other single points of failure aside from the now-famous left-pad module?
We may not have the answers to these questions after the fact, but we can map the current software package dependencies, which allows us to measure network metrics, develop a sense of structural patterns, and evaluate possible future risks.
Mapping the NPM dependency network
npm-dependency-network is a Python script that starts from a package, crawls links from the npm registry, and generates an interactive NPM dependency graph. The graph below shows the 100 most depended-upon npm packages and their dependencies, to 4 levels of depth.
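To make the crawling step concrete, here is a minimal sketch of a depth-limited, breadth-first crawl. It is not the npm-dependency-network script itself. The function names `fetch_dependencies` and `crawl` are hypothetical, and the sketch assumes that the public registry serves a package's latest manifest at `https://registry.npmjs.org/<name>/latest` and that scoped names can be passed with the slash percent-encoded.

```python
import json
import urllib.parse
import urllib.request
from collections import deque

REGISTRY = "https://registry.npmjs.org"  # assumed public registry endpoint


def fetch_dependencies(package):
    """Return the dependency names declared by the latest version of a package."""
    # Assumption: scoped names such as "@scope/name" are sent as "@scope%2Fname".
    name = urllib.parse.quote(package, safe="@")
    with urllib.request.urlopen(f"{REGISTRY}/{name}/latest") as response:
        manifest = json.load(response)
    return list(manifest.get("dependencies") or {})


def crawl(root, max_depth=4, fetch=fetch_dependencies):
    """Breadth-first crawl from root; returns a set of (package, dependency) edges."""
    edges = set()
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        package, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand packages beyond the depth limit
        for dependency in fetch(package):
            edges.add((package, dependency))
            if dependency not in seen:
                seen.add(dependency)
                queue.append((dependency, depth + 1))
    return edges
```

Calling `crawl("express")` would hit the live registry. Passing a different `fetch` function makes it possible to try the traversal on a small in-memory graph first.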
In such a graph, the direction of the connections is critical. Consider the chain of dependencies below: if Module A is removed, then Module B will be affected, which in turn will affect Module C. But if Module B is removed, only Module C is affected, and Module A will be fine.
Module A <-DEPENDS- Module B <-DEPENDS- Module C
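The same chain can be sketched in a few lines of Python. This is an illustrative example only: the function name `affected_by_removal` and the package names are made up, and the graph is a plain dictionary mapping each package to the packages it depends on.

```python
from collections import defaultdict


def affected_by_removal(depends_on, removed):
    """Return every package that breaks, directly or transitively, if `removed` disappears."""
    # Reverse the edges: for each package, who depends on it?
    dependents = defaultdict(set)
    for package, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents[dependency].add(package)

    affected = set()
    stack = [removed]
    while stack:
        current = stack.pop()
        for dependent in dependents[current]:
            if dependent not in affected:
                affected.add(dependent)
                stack.append(dependent)
    affected.discard(removed)  # only relevant if the graph contains a cycle
    return affected


# Module B depends on Module A, and Module C depends on Module B.
chain = {"module-a": [], "module-b": ["module-a"], "module-c": ["module-b"]}

print(sorted(affected_by_removal(chain, "module-a")))  # ['module-b', 'module-c']
print(sorted(affected_by_removal(chain, "module-b")))  # ['module-c']
```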
You will quickly notice that some modules have no dependencies and many dependents, others have many dependencies and no dependents, and many in between.
Needless to say, many properties contribute to the reliability of a software package, from download counts to update frequency. But just by looking at the structure of the dependency network, we can start inferring metrics that help us evaluate possible risks.
Which software packages have the most impact in a dependency network?
The most critical modules have a cascading impact on other modules. In other words, just counting the incoming dependency connections (as in the chart above) is not enough to find the most depended-upon packages. Because dependency relationships cascade, the removal of a package, as seen in Azer’s case, affects some packages, which affect other packages, which affect many other packages, and so forth. So we should look at the incoming tree of dependents for a given module to understand its impact in a dependency graph.
The async module has 12 incoming dependents and its tree of dependents contains 31 modules in total.
The number-is-nan module has just 3 incoming dependents, but it has a deeper tree of dependents, which contains a total of 55 modules.
The lodash module has 26 incoming dependents and a fairly broad tree of dependents, which contains 79 modules in total across its breadth and depth.
By looking at the incoming tree of dependents, we can understand which software packages have the most impact, and therefore may carry more risk than others, in a software package dependency network.
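As a rough sketch of this measurement, the snippet below counts both the direct dependents and the full tree of dependents for every package in a small graph. The function name `impact` and the package names are invented for illustration, and the numbers are toy data, not measurements from the npm registry.

```python
from collections import defaultdict


def impact(edges):
    """Map each package to (direct dependents, size of its whole tree of dependents)."""
    dependents = defaultdict(set)
    packages = set()
    for package, dependency in edges:
        dependents[dependency].add(package)
        packages.update((package, dependency))

    result = {}
    for package in packages:
        seen = set()
        stack = [package]
        while stack:
            for dependent in dependents.get(stack.pop(), ()):
                if dependent not in seen:
                    seen.add(dependent)
                    stack.append(dependent)
        seen.discard(package)  # only relevant if the graph contains a cycle
        result[package] = (len(dependents.get(package, ())), len(seen))
    return result


# Hypothetical (package, dependency) pairs: "helper" depends on "tiny-util", and so on.
toy_edges = [
    ("helper", "tiny-util"),
    ("framework", "helper"),
    ("app-1", "framework"),
    ("app-2", "framework"),
    ("app-3", "framework"),
]

ranking = sorted(impact(toy_edges).items(), key=lambda item: -item[1][1])
for name, (direct, total) in ranking:
    print(f"{name}: {direct} direct dependents, {total} in the whole tree")
```

In this toy graph, `framework` has the most direct dependents (3), but `tiny-util`, with a single direct dependent, sits under the largest tree (5 packages). That is the same pattern as the number-is-nan example above.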
Which software packages are the most vulnerable in a dependency network?
We would only scratch the surface if we looked at just the immediate outgoing dependencies (as in the graph above) of a software package to assess its vulnerability. We should consider the modules it depends upon, and their dependencies of dependencies of dependencies… Compare the following packages, which vary in their number of dependencies and tree size:
The engine.io module has 5 outgoing dependencies and its tree of dependencies contains 18 modules in total.
The uglify-js module has only 4 immediate outgoing dependencies, but it has a deeper tree of dependencies, which contains a total of 49 modules.
The cheerio module has 6 outgoing dependencies, but it has a fatter tree of dependencies, which contains a total of 99 modules.
So, the most vulnerable packages in a software package network can be discovered by traversing the connections and measuring the outgoing dependency tree.
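One possible way to measure the outgoing tree is a breadth-first traversal that records how many packages it reaches and how many levels deep it goes. The sketch below assumes a plain dictionary from each package to its declared dependencies. The function name `dependency_tree_stats` and all package names are hypothetical.

```python
from collections import deque


def dependency_tree_stats(depends_on, package):
    """Return (direct dependencies, total packages in the tree, depth of the tree)."""
    direct = len(set(depends_on.get(package, ())))
    seen = {package}
    queue = deque([(package, 0)])
    depth = 0
    while queue:
        current, level = queue.popleft()
        depth = max(depth, level)
        for dependency in depends_on.get(current, ()):
            if dependency not in seen:  # count shared dependencies once
                seen.add(dependency)
                queue.append((dependency, level + 1))
    return direct, len(seen) - 1, depth


# Toy data, not taken from the npm registry.
toy_graph = {
    "web-app": ["router", "renderer"],
    "router": ["path-parser"],
    "renderer": ["html-parser", "escape"],
    "html-parser": ["entities"],
    "cli-tool": ["args", "colors", "escape"],
}

for name in ("web-app", "cli-tool"):
    direct, total, depth = dependency_tree_stats(toy_graph, name)
    print(f"{name}: {direct} direct, {total} total, {depth} levels deep")
```

Here `cli-tool` declares more direct dependencies than `web-app` (3 versus 2), yet `web-app` pulls in a tree twice as large (6 versus 3 packages) and three levels deep, so it is exposed to more potential removals.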
Which software packages are most likely to be a single point of failure?
By looking at the interconnections between nodes, we can find organic clusters in a dependency network (the colored clusters in the first graph image above). Among these clusters, a few nodes act as bridges and are therefore more likely to be a single point of failure if a package is removed. To find such nodes, we can look for nodes with high betweenness centrality (bridge quality) and relatively low degree centrality (number of connections). In this case, the debug module seems to have a more critical location than others in the npm dependency network.
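To illustrate the idea, here is a small pure-Python sketch of betweenness centrality using Brandes' algorithm, applied to a toy graph of two clusters joined by one bridge node. It treats connections as undirected, which is an assumption made for this example and not necessarily how the analysis above was computed. The ranking by betweenness divided by degree is also just one hypothetical heuristic for "high betweenness, low degree".

```python
from collections import deque


def undirected(edges):
    """Build a symmetric adjacency map {node: set(neighbors)} from edge pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def betweenness(adjacency):
    """Unnormalized betweenness centrality (Brandes) for an unweighted, undirected graph."""
    score = dict.fromkeys(adjacency, 0.0)
    for source in adjacency:
        order = []                                   # nodes in order of discovery
        predecessors = {node: [] for node in adjacency}
        paths = dict.fromkeys(adjacency, 0)          # number of shortest paths from source
        paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    paths[neighbor] += paths[node]
                    predecessors[neighbor].append(node)
        dependency = dict.fromkeys(adjacency, 0.0)
        for node in reversed(order):                 # accumulate from the farthest nodes back
            for pred in predecessors[node]:
                dependency[pred] += paths[pred] / paths[node] * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # Each undirected pair was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}


# Two triangles ("a" and "b" clusters) connected only through a node named "bridge".
toy = undirected([
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("a1", "bridge"), ("bridge", "b1"),
])
scores = betweenness(toy)
candidates = sorted(toy, key=lambda node: scores[node] / len(toy[node]), reverse=True)
for node in candidates[:3]:
    print(node, "betweenness:", scores[node], "degree:", len(toy[node]))
```

In this toy graph the `bridge` node has only two connections but lies on every shortest path between the two clusters, so it comes out on top.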
These network mapping and analysis methods can be applied to other software package ecosystems, such as RubyGems, Python's PyPI, and many others. For other software package networks, you can check out Andrei Kashcha’s fantastic Software Galaxies project, where you can fly through constellations of packages.
The mapping and analysis in this article were done on the Graph Commons platform. I recommend exploring these interactive graphs yourself.
This article was originally published at the Graph Commons Journal.