Analyzing Software Dependencies With deps.dev — Discover AuraDB Free (Week 49)
This week we looked at software dependencies, an important software analytics use case for graph databases. A dependency graph shows you not only which libraries your software uses, directly and indirectly, but also how you're affected by software vulnerabilities.
If you missed it: the call for papers for our online developer conference NODES 2023 is open until June 30th, and if you submit early you might be selected as a featured speaker.
Two years ago, Google launched https://deps.dev, an open source package dependency database that makes package information from these systems available:
- npm (JavaScript)
- PyPI (Python)
- Maven (Java / JVM)
- Cargo (Rust)
- NuGet (.NET)
- Go
It even talks about dependency graphs in its "How it works" section:
The service repeatedly examines sites such as github.com, npmjs.com, and pkg.go.dev to find up-to-date information about open source software packages. Using that information it builds for each package the full dependency graph from scratch—not just from package lock files—connecting it to the packages it depends on and to those that depend on it. And then does it all again to keep the information fresh. This transitive dependency graph allows problems in any package to be made visible to the owners and users of any software they affect.
If you'd rather watch the recording of the livestream, you can find it here:
Back then, I threw together a quick script to load the data via the unofficial REST API that powered the site, and tweeted about it.
In the meantime, they have published an API that we can use to access the data. The API docs are minimal, but good enough for our purposes.
The basic API call for getting information about a package is straightforward but doesn't give us a lot of data. More interesting is the information per version, which also lists licenses, security vulnerabilities, and links (homepage, repo, issue tracker).
Here is the example for React (no security vulnerabilities):
https://api.deps.dev/v3alpha/systems/npm/packages/react/versions/18.2.0
{
  "versionKey": {
    "system": "NPM",
    "name": "react",
    "version": "18.2.0"
  },
  "isDefault": true,
  "licenses": [
    "MIT"
  ],
  "advisoryKeys": [],
  "links": [
    {
      "label": "HOMEPAGE",
      "url": "https://reactjs.org/"
    },
    {
      "label": "ISSUE_TRACKER",
      "url": "https://github.com/facebook/react/issues"
    },
    {
      "label": "ORIGIN",
      "url": "https://registry.npmjs.org/react/18.2.0"
    },
    {
      "label": "SOURCE_REPO",
      "url": "git+https://github.com/facebook/react.git"
    }
  ]
}
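To make the per-version fields a bit more tangible, here is a minimal Python sketch that pulls the interesting parts out of such a response. It works on a trimmed copy of the React response above instead of calling the API, and the function name summarize_version and the shape of the summary are our own assumptions for illustration, not part of deps.dev.

```python
import json

# Trimmed copy of the React 18.2.0 response shown above (no network call).
SAMPLE = """
{
  "versionKey": {"system": "NPM", "name": "react", "version": "18.2.0"},
  "isDefault": true,
  "licenses": ["MIT"],
  "advisoryKeys": [],
  "links": [
    {"label": "HOMEPAGE", "url": "https://reactjs.org/"},
    {"label": "SOURCE_REPO", "url": "git+https://github.com/facebook/react.git"}
  ]
}
"""


def summarize_version(doc):
    """Hypothetical helper: condense a version response into a flat dict."""
    key = doc["versionKey"]
    # Turn the list of {label, url} entries into a label -> url lookup.
    links = {link["label"]: link["url"] for link in doc.get("links", [])}
    return {
        "package": f'{key["system"].lower()}:{key["name"]}@{key["version"]}',
        "licenses": doc.get("licenses", []),
        "advisories": doc.get("advisoryKeys", []),
        "source_repo": links.get("SOURCE_REPO"),
    }


summary = summarize_version(json.loads(SAMPLE))
print(summary)
```

An empty advisories list is what we would expect for a version without known vulnerabilities, as in the React example.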
But we’re more interested in the graph, so let’s go directly for the package dependencies.
Dependencies of a package
You can find the dependencies of a package (like TensorFlow) in the UI.
Loading the data for the TensorFlow package via the API uses the system, name, and version of a package in the URL:
https://api.deps.dev/v3alpha/systems/pypi/packages/tensorflow/versions/2.12.0:dependencies
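If you build these URLs outside of Cypher, a small helper keeps the three parts apart. The following Python sketch is one way to do it; the function name dependencies_url is made up for this example, and percent-encoding each part is our assumption to keep names that contain a slash (for instance scoped npm packages) from breaking the path.

```python
from urllib.parse import quote

BASE = "https://api.deps.dev/v3alpha"


def dependencies_url(system, name, version):
    """Hypothetical helper: build the ':dependencies' URL for one package version.

    Assumption: every path part is percent-encoded (safe='' also encodes '/').
    """
    return (
        f"{BASE}/systems/{quote(system, safe='')}"
        f"/packages/{quote(name, safe='')}"
        f"/versions/{quote(version, safe='')}:dependencies"
    )


tensorflow_url = dependencies_url("pypi", "tensorflow", "2.12.0")
scoped_url = dependencies_url("npm", "@angular/core", "16.0.0")
print(tensorflow_url)
print(scoped_url)
```

For plain names like tensorflow the result is identical to the URL above; only names with special characters change.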
The API responds with JSON that is already in a graph format:
{
  "nodes": [
    {
      "versionKey": {
        "system": "PYPI",
        "name": "tensorflow",
        "version": "2.12.0"
      },
      "bundled": false,
      "relation": "SELF",
      "errors": []
    },
    {
      "versionKey": {
        "system": "PYPI",
        "name": "absl-py",
        "version": "1.4.0"
      },
      "bundled": false,
      "relation": "DIRECT",
      "errors": []
    },
    ...
  ],
  "edges": [
    {
      "fromNode": 0,
      "toNode": 1,
      "requirement": ">=1.0.0"
    },
    {
      "fromNode": 0,
      "toNode": 2,
      "requirement": ">=1.6.0"
    },
    {
      "fromNode": 0,
      "toNode": 6,
      "requirement": ">=2.0"
    },
    ...
  ]
}
The response contains data in a graph format: first a list of nodes, then a list of edges with fromNode and toNode (based on the index in the nodes array) and the semantic version requirement.
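The index-based lookup is the one part of this format that is easy to get wrong, so here is a small Python sketch of it. It uses a cut-down response with only the two nodes and the one edge that are fully visible above; the function name resolve_edges is our own choice for illustration.

```python
# Cut-down response: only the two nodes and the one edge fully shown above.
GRAPH = {
    "nodes": [
        {"versionKey": {"system": "PYPI", "name": "tensorflow", "version": "2.12.0"},
         "relation": "SELF"},
        {"versionKey": {"system": "PYPI", "name": "absl-py", "version": "1.4.0"},
         "relation": "DIRECT"},
    ],
    "edges": [
        {"fromNode": 0, "toNode": 1, "requirement": ">=1.0.0"},
    ],
}


def resolve_edges(graph):
    """Hypothetical helper: replace node indexes in the edges by package names."""
    # The position in this list is exactly what fromNode / toNode refer to.
    names = [node["versionKey"]["name"] for node in graph["nodes"]]
    return [
        (names[edge["fromNode"]], names[edge["toNode"]], edge["requirement"])
        for edge in graph["edges"]
    ]


resolved = resolve_edges(GRAPH)
print(resolved)
```

This is the same trick the Cypher import relies on: keep the nodes in their original order so that the edge indexes still point at the right entries.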
To load the data from the API we use apoc.load.json to provide the response as a nested Cypher structure.
call apoc.load.json("https://api.deps.dev/v3alpha/systems/pypi/packages/tensorflow/versions/2.12.0:dependencies")
yield value as r
return r
We can now import the data by creating the nodes first and then collecting them into a list again to provide the index lookup for the edges. We encode the system (here "pypi") as an additional label :PyPi on our :Package nodes, and put the uniqueness constraint on name on that label:
create constraint package_pypi if not exists for (p:PyPi) require (p.name) is unique
In a real system we would create separate version nodes for each package and link to those; here, for simplicity, we stick with the :Package nodes only.
We then iterate over the nodes with UNWIND within a CALL subquery to create the nodes, and use a second subquery for the relationships.
with "pypi" as system, "tensorflow" as name, "2.12.0" as version
call apoc.load.json("https://api.deps.dev/v3alpha/systems/"+system+"/packages/"
  +name+"/versions/"+version+":dependencies")
yield value as r
// create nodes
call { with r
  unwind r.nodes as package
  merge (p:Package:PyPi {name:package.versionKey.name})
    on create set p.version = package.versionKey.version
  return collect(p) as packages
}
// create relationships by linking nodes
call { with r, packages
  unwind r.edges as edge
  with packages[edge.fromNode] as from, packages[edge.toNode] as to, edge
  merge (from)-[rel:DEPENDS_ON]->(to)
    on create set rel.requirement = edge.requirement
  return count(*) as numRels
}
return size(packages) as numPackages, numRels
Now we can visualize the data in the Query UI by running MATCH path=(:PyPi {name:"tensorflow"})-[:DEPENDS_ON*]->() RETURN path
Or we can head over to "Explore" and visualize it in the hierarchical layout and also find the shortest paths between packages visually.
We can also use the packages that we already have imported into our graph to fetch their dependencies.
To achieve that we replace the hardcoded initial data for package and version with data from the graph. We also set an additional property (or label) to indicate which packages have already been loaded.
match (root:Package:PyPi) where root.imported is null
set root.imported = true
with "pypi" as system, root.name as name, root.version as version
call apoc.load.json("https://api.deps.dev/v3alpha/systems/"+system+"/packages/"
  +name+"/versions/"+version+":dependencies")
yield value as r
// create nodes
call { with r
  unwind r.nodes as package
  merge (p:Package:PyPi {name:package.versionKey.name})
    on create set p.version = package.versionKey.version
  return collect(p) as packages
}
// create relationships by linking nodes
call { with r, packages
  unwind r.edges as edge
  with packages[edge.fromNode] as from, packages[edge.toNode] as to, edge
  merge (from)-[rel:DEPENDS_ON]->(to)
    on create set rel.requirement = edge.requirement
  return count(*) as numRels
}
return size(packages) as numPackages, numRels
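The "mark as imported, then fetch what is new" loop can also be sketched outside the database. The Python version below is only an illustration under a few assumptions: the function name crawl and the max_packages cap are ours, fetch is passed in so the sketch runs without network access, and the packages a, b and c in the fake data are invented. Like the Cypher above, it keeps one version per package name.

```python
from collections import deque


def crawl(root, fetch, max_packages=100):
    """Hypothetical helper: expand dependencies starting from root = (name, version).

    fetch(name, version) must return a dict in the nodes/edges format shown above.
    """
    versions = {root[0]: root[1]}  # name -> first version seen (like ON CREATE SET)
    edges = set()                  # (from_name, to_name) pairs
    imported = set()               # plays the role of the 'imported' property
    queue = deque([root[0]])
    while queue and len(imported) < max_packages:
        name = queue.popleft()
        if name in imported:
            continue
        imported.add(name)
        graph = fetch(name, versions[name])
        keys = [node["versionKey"] for node in graph["nodes"]]
        for key in keys:
            if key["name"] not in versions:
                versions[key["name"]] = key["version"]
                queue.append(key["name"])
        for edge in graph["edges"]:
            edges.add((keys[edge["fromNode"]]["name"], keys[edge["toNode"]]["name"]))
    return versions, edges


def node(name, version):
    return {"versionKey": {"system": "PYPI", "name": name, "version": version}}


# Invented data standing in for API responses: a -> b -> c.
FAKE = {
    ("a", "1.0"): {"nodes": [node("a", "1.0"), node("b", "2.0")],
                   "edges": [{"fromNode": 0, "toNode": 1, "requirement": ">=2.0"}]},
    ("b", "2.0"): {"nodes": [node("b", "2.0"), node("c", "3.0")],
                   "edges": [{"fromNode": 0, "toNode": 1, "requirement": ">=3.0"}]},
    ("c", "3.0"): {"nodes": [node("c", "3.0")], "edges": []},
}

versions, edges = crawl(("a", "1.0"), lambda name, version: FAKE[(name, version)])
print(versions, sorted(edges))
```

The cap on the number of fetched packages is there because a popular package can pull in a very large graph; in Cypher, a LIMIT on the initial match would serve a similar purpose.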
Loading Dependents
The UI also shows dependents (i.e. packages that use the current package), which we could also infer from our imported data by following the relationships in the opposite direction. Unfortunately, there is no API call for this, so we need to use the REST call that the UI itself makes, which is the following:
https://deps.dev/_/s/pypi/p/tensorflow/v/2.12.0/dependents
It has a different response format and only lists 100 results, but that's better than nothing for demonstration purposes. We can pick the directSample list of entries and connect them to the root package that we started with.
with "pypi" as system, "tensorflow" as name, "2.12.0" as version
// use both labels so the root matches the nodes created by the dependency import
merge (root:Package:PyPi {name:name}) on create set root.version = version
with *
call apoc.load.json("https://deps.dev/_/s/"+system+"/p/"+name+"/v/"+version+"/dependents")
yield value as r
unwind r.directSample as entry
merge (dep:PyPi:Package {name:entry.package.name})
  on create set dep.version = entry.version
merge (dep)-[:DEPENDS_ON]->(root)
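Inferring dependents from data we already imported only covers packages that are in our graph, but the idea is simple: read every dependency edge backwards. A minimal Python sketch, where the function name dependents_index is ours and my-app is an invented package used purely for illustration:

```python
from collections import defaultdict


def dependents_index(edges):
    """Hypothetical helper: map each package to the packages that depend on it."""
    index = defaultdict(set)
    for source, target in edges:
        # source DEPENDS_ON target, so source is a dependent of target
        index[target].add(source)
    return index


# (from, to) pairs as produced by the dependency import; 'my-app' is invented.
EDGES = [
    ("tensorflow", "absl-py"),
    ("my-app", "tensorflow"),
    ("my-app", "absl-py"),
]

index = dependents_index(EDGES)
print(sorted(index["absl-py"]))
print(sorted(index["tensorflow"]))
```

In the graph itself no extra structure is needed for this, since a DEPENDS_ON relationship can be traversed in either direction.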
Question from the viewers — Eshwar: How do I fix relationships that I imported wrongly?
Answer:
- find the relationships and delete them or update their properties, e.g. MATCH ()-[rel:SOME_TYPE]->() DELETE rel
- or use the APOC refactor procedures to rename, change direction, or redirect
- see also call apoc.help("refactor")
That was it for today. Happy graphing!
Don’t forget to share the episode or the "Discover AuraDB Free with Fun Datasets" series with your graph-curious friends and colleagues.