Faster machine learning on larger graphs: how NumPy and Pandas slashed memory and time in StellarGraph

Huon Wilson
May 7 · 6 min read
Image for post
Image for post
Photo by Hassan Amra on Unsplash

This week, StellarGraph released a new version of its open source library for machine learning on graphs. One of the most exciting features of StellarGraph 1.0 is a new graph data structure — built using NumPy and Pandas — that results in significantly lower memory usage and faster construction times.

Consider that a graph created from Reddit, with more than 200 thousand nodes and 11 million edges, required almost 7GB of memory with previous iterations of the library’s graph data structure. It also took 2.5 minutes to construct.

In the new StellarGraph class’s minimal form, that same Reddit graph now uses approximately 174MB. And this isn’t even a classic ‘memory versus time’ balance: the new form takes only 0.9 seconds to construct.

That is, the new StellarGraph class is 40 times smaller to store and hundreds of times faster to construct than the old StellarGraph.

In this article, we’ll explore how a careful focus on what works best for graph machine learning — in this case replacing StellarGraph’s graph data structure from one built on NetworkX to one built using NumPy and Pandas — led to big performance enhancements in the StellarGraph class.

A class above

The core abstraction in the library is the StellarGraph class, which is a graph data structure that manages all the information about the graph or graphs being used for machine learning.

Previous versions of the StellarGraph class were backed by NetworkX, which allowed for quick and effective development of many graph machine learning algorithms because of its convenient and flexible API, built using nested dictionaries.

However, this flexibility meant it wasn’t optimised for graph machine learning: NetworkX has different trade-offs than those best for machine learning, the most notable being the amount of memory required to store a graph.

So, over the releases leading up to 1.0, the NetworkX-backed graph data structure was replaced with a new one built using NumPy and Pandas.

How does it work?

  • Efficient storage of edges
  • Keeping node features available for quick indexing
  • Support for arbitrary node IDs.

Efficient storage of edges

Image for post
Image for post
A NumPy array can consist of a single chunk of memory with values stored inline. A Python list stores pointers to other Python objects, and each of these has extra metadata overhead, dramatically increasing the cost over NumPy.

These arrays lay data out contiguously in memory, so that an array of four byte integers (like the numpy.int32 type) requires only four bytes per element. This is different to a Python list of the same integers, which requires ten times more memory, at 43 bytes per element.

The edges of the graph are conceptually pairs of a source node ID and a target node ID, representing the connection between the two nodes. In the new StellarGraph class, the edges are stored as NumPy arrays containing the source and targets in a “structure of arrays” style.

This explains a significant fraction of the reduction in memory use for the Reddit graph. The edge information can be placed contiguously in memory with no overhead from Python objects. By using NumPy natively, we also get to easily leverage the efficient sparse matrix and graph algorithms provided by the SciPy library.

Keeping node features available for quick indexing

Image for post
Image for post
The graph on the left has two node types; “white background” and “black background”, along with some feature vectors of numbers for each. These get transformed into 2D NumPy arrays on the right, matching the order of the nodes.

The new StellarGraph class stores the node features by having a large rectangular 2D NumPy array for each node type: for a particular node type, the first row of the array represents the features for the first node of that type, similarly for the second row, and so on.

This encoding is very memory efficient, since numeric features can be all laid out with no overhead from Python objects, just like edges. It is also fast, because queries about the features of nodes at given locations in the array can be answered by slicing out rows from the array, which is implemented as native C code in the NumPy library.

Support for arbitrary node IDs

One approach is to just require that all node IDs are sequential integers, starting from 0, so all graphs with three nodes have IDs 0, 1, 2. However, this makes the library hard to use with real world datasets where the IDs might be UUIDs, random values, or some other non-trivial encoding. It quickly gets more complicated in heterogeneous graphs, with more than one type of node.

Image for post
Image for post
Nodes can be keyed with any ID values, by using a pandas.Index as a translation layer into numeric IDs, or ilocs. These ilocs can then be used in the NumPy arrays above.

Thus, the final key piece of the new StellarGraph class is using the pandas.Index types to provide a translation between convenient and conventional node IDs and the sequential integers so that any node ID values can work, and the users of the class don’t have to think about doing their own translation or management.

The Pandas Index types are being used as a time- and memory-efficient dictionary, mapping IDs to sequential integers (which both StellarGraph and Pandas call “ilocs”, meaning integer locations). It leverages Pandas’ optimised C and Cython native code.

These ilocs are used as part of the edge storage discussed above. The ilocs are small integers, and so the contiguous layout of a NumPy array ensures our edges are always stored efficiently. Many of StellarGraph’s algorithms work directly with these ilocs, and so see no overhead due to the “fancy” IDs.

There are some additional details here. For example, the nodes are actually numbered globally in the index, not per type, so the graph above would have A → 2, B → 3, …, but most of these are shielded from the user.

In summary —

The new StellarGraph class delivered in 1.0 uses NumPy and Pandas to reduce memory use and improve speed through careful selection of the moving parts, and by focusing on just what is required for graph machine learning. Building on existing libraries gives us the benefits of native code without the problems.

To get started with graph machine learning with the new StellarGraph class in the StellarGraph library, pip install stellargraph, and then create a new optimised graph object with your data with StellarGraph(nodes, edges). There are demos with all the details.

This work involved major contributions, research and code from Andrew Docherty, Geoff Jarrad, Huon Wilson and Kieran Ricardo. We’re working on even more improvements top of this optimised base, that should improve memory use and speed even further, such as #718.

This work is supported by CSIRO’s Data61, Australia’s leading digital research network.

Image for post
Image for post

stellargraph

We are a team of passionate engineers, designers, data…

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store