Visualizing Node-Link Graphs

An essay on how to make graphs easier to understand. Topics include creating a visual graph language as well as utilizing the third spatial dimension, all with plenty of pretty pictures.

1.0 — Introduction

During Claude Shannon’s tenure at Bell Labs (a research arm of what we now call AT&T, responsible for things like the transistor, Unix, and the cell phone), the Bell Labs patent lawyers tried to figure out why some people were awarded more patents than others. After much consternation and head scratching, the patent lawyers “discerned only one common thread: Workers with the most patents often shared lunch or breakfast with a Bell Labs electrical engineer named Harry Nyquist. It wasn’t the case that Nyquist gave them specific ideas. Rather, as one scientist recalled, ‘he drew people out, got them thinking.’” [1]

In today’s world, it is in every company’s self-interest to know whether it has any Harry Nyquists on staff. The moment a company identified one, that individual would no doubt get a raise and be tasked exclusively with talking to other inventors and engineers. Given the size and complexity of modern companies, though, it is doubtful that a few lawyers could solve the problem.

One way a complex company could identify its Nyquists, so to speak, would be to visualize internal collaborative relationships via node-link graphs.

The only problem is that node-link collaboration graphs are often too complex to understand. For instance, here is a visualization of patent collaboration at Apple:

A node-link graph of Apple Patent collaborators. From FastCompany / Periscopic

If you are like many, while you might find the above graph cool and somewhat interesting, you may have trouble figuring out what it is supposed to mean.

Before you go blaming yourself or figuring that you never really liked data anyways, understand that node-link graphs aren’t easy to understand, almost by default. This, of course, means that node-link graphs can be improved. To see how, let’s dive into where two-dimensional node-link graphs succeed and where they fail.

1.1 — Edward Tufte and the issue of information

If we want to spot the Harry Nyquist in the Apple patent graph, we have to ask a more basic question: What makes a node-link graph, or any data visualization, easier or harder to understand?

Left: Apple Patent Collaboration. Right: Google Patent Collaboration. From Fast Company.

Luckily for us, the answer to this question has been explored by Edward Tufte, a celebrated professor of Data Visualization and Statistics. His idea is that one of the better ways to quantify the quality of any data visualization is to look at the ratio between useful information and the amount of “ink” or “pixels” used to create the visualization.

If you are a stickler, Tufte’s definition works very well as a precise rule for black and white, printed charts. It is more of a guiding principle when talking about different colors and “extraneous” information. Tufte is also fond of using the term Chartjunk to refer to any redundant or non-information carrying “noise” that can be found in a data visualization.

Consider an example he uses:

A plot of 1980’s nominal diamond prices. From The Visual Display of Quantitative Information.

Most of the “ink” is fluff, defensively designed with the assumption that the reader won’t be interested in a plain ol’ line chart.

One of the purposes of any data visualization is to communicate information necessary to convey insights implicit in the data. Let’s call this “insight information.” Both the information-to-ink ratio and Chartjunk reflect the idea that, for any given data visualization, the “insight information” is often a subset of the information communicated by the visualization as a whole.

The closer the amount of insight information is to the total information (aka the higher the information-to-ink ratio) the higher the quality of the chart. If this can then be done aesthetically, all the better.
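For the sticklers, the ratio can be written down explicitly. Tufte's own name for it is the data-ink ratio; the version below simply restates it in this essay's vocabulary, and the wording of the numerator and denominator is my paraphrase rather than a quotation:

```latex
\text{information-to-ink ratio}
  = \frac{\text{ink that carries insight information}}
         {\text{total ink used to draw the visualization}}
```

The ratio sits between 0 and 1. Erasing Chartjunk shrinks the denominator without touching the numerator, which is why it pushes the ratio up.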

2.0 — Methods for better graphs

Assuming no egregious Chartjunk (say, like a top hat, hair, and high-heels), the information-to-ink ratio represents a fundamental tradeoff between informativeness and communicability found across languages.

It turns out that in any language (visual, verbal, musical, and otherwise), there is a tradeoff between how much information is communicated by a particular element (like a chart element, a word, or a musical chord), and how easy it is for a person with little prior knowledge to understand it.

This tradeoff is very likely due to the limit on how much information we can focus on at any given time, and to the fact that most of the world is not conveniently arranged in ways that are easy for our pulsating neurons to understand.

Currently, node-link graphs suffer from the problem that the more information you visualize, the less meaningful it becomes. This makes even moderately complex graphs difficult to understand beyond a gut-feeling, intuitive level about how different aspects of the data connect to each other.

There are several ways to overcome this limitation and increase the information density of node-link graphs. I’ll cover four here. The two traditional approaches are force-directed spring layouts and edge-bundling. Two new approaches involve developing a visual language, and using additional spatial dimensions. The promise is that by utilizing all four in combination, companies can more easily identify the “magic glue” in their collaborative processes.

2.1 — Force-Directed Layouts

Let’s say you had a bunch of entities and some notion of which ones connected to which. If you wanted to make a node-link graph by hand, the primary thing you’d have to figure out is what to do with all of the dots.

You could, if you wanted, figure out some artistic way to spread the dots out. But you’d probably have to go through several different iterations, and it would be very inefficient.

What would really be great is a systematic way of making the important bits more salient than the non-important bits. It turns out that pretending that all of the links connecting the dots are springs is a reasonable approach. Often, this is supplemented by pretending that the nodes repel each other like electric charges, or attract each other via gravity. By calculating the interactions of these forces with each other, a node with many connections can be “pulled to the center” of its connections, more so than a node with fewer connections.
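To make the springs-and-charges idea concrete, here is a minimal sketch in plain Python. It is not the algorithm behind any of the figures in this essay, and every name and constant in it (force_directed_layout, spring_k, repulsion_k, max_step, the tiny "nyquist" graph) is an assumption chosen for illustration. Each round, every pair of nodes pushes apart, every link pulls its two ends toward a preferred length, and each node then takes a small, capped step along its net force.

```python
import math
import random


def force_directed_layout(nodes, edges, iterations=500, spring_length=1.0,
                          spring_k=0.1, repulsion_k=0.1, max_step=0.1, seed=0):
    """Return {node: (x, y)} after a simple spring-and-charge simulation."""
    rng = random.Random(seed)  # fixed seed so the layout is repeatable
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}

    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}

        # Every pair of nodes repels, like two same-sign electric charges.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = repulsion_k / dist ** 2
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist

        # Every link is a spring pulling its endpoints toward spring_length.
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = spring_k * (dist - spring_length)
            force[a][0] += pull * dx / dist
            force[a][1] += pull * dy / dist
            force[b][0] -= pull * dx / dist
            force[b][1] -= pull * dy / dist

        # Move each node along its net force, capping the step for stability.
        for n in nodes:
            fx, fy = force[n]
            size = math.hypot(fx, fy)
            if size > max_step:
                fx, fy = fx * max_step / size, fy * max_step / size
            pos[n][0] += fx
            pos[n][1] += fy

    return {n: (p[0], p[1]) for n, p in pos.items()}


# A hypothetical hub ("nyquist") linked to five colleagues who are not
# linked to each other.
nodes = ["nyquist", "a", "b", "c", "d", "e"]
edges = [("nyquist", other) for other in nodes[1:]]
layout = force_directed_layout(nodes, edges)
```

In this toy graph the hub should settle in the middle of its five colleagues, which is the “pulled to the center” effect described above. Real layout libraries add refinements such as cooling schedules and faster approximations of the repulsion step, none of which are attempted here.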

To see the advantage of a force-directed spring layout, compare B and C below. Keep in mind that both communicate the same basic information.

A comparison of layout techniques, utilizing the same information. C is rendered with a force-directed layout. From Nature Methods.

Clearly, the spring layout approach is successful in increasing the information-to-ink ratio in C, on the right. In fact, force-directed spring layouts are one of the better approaches that have been tried.

However, if the true goal is to increase the information-to-ink ratio, then spring layouts should be considered as one of several techniques that can work in tandem. Other options include bundling edges that travel together and creating a shared vocabulary of different patterns of connections.

2.2 — Edge Bundling: The Lyft-Line / Uber-Pool approach

A second approach that can raise the information-to-ink ratio of node-link graphs is to use Edge Bundling and associate links that travel in similar directions with each other. Consider the following edge-bundled graph of parts of the brain:

Edge Bundling applied to brain connections. From Bottger (2013).

Because graphs often contain links that travel in the same direction for most of their journey, it can be fruitful to bundle them together. This is exactly like what Lyft Line or Uber POOL does when pairing riders who are going to similar destinations together.
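The first half of that idea, deciding which links are “going the same way,” can be sketched in a few lines of Python. This is a simplified heuristic of my own for illustration, not any particular published bundling algorithm: two edges become bundling candidates when they point in nearly the same direction and their midpoints are close. The function names and both thresholds are assumptions. A real bundler would then go on to bend the paired edges toward each other, which this sketch does not attempt.

```python
import math


def direction_similarity(edge_a, edge_b):
    """Absolute cosine of the angle between two edges.

    1.0 means parallel, 0.0 means perpendicular.
    """
    (ax1, ay1), (ax2, ay2) = edge_a
    (bx1, by1), (bx2, by2) = edge_b
    ux, uy = ax2 - ax1, ay2 - ay1
    vx, vy = bx2 - bx1, by2 - by1
    lengths = math.hypot(ux, uy) * math.hypot(vx, vy)
    if lengths == 0:
        return 0.0  # a zero-length edge has no direction
    return abs(ux * vx + uy * vy) / lengths


def midpoint_distance(edge_a, edge_b):
    """Distance between the midpoints of two edges."""
    (ax1, ay1), (ax2, ay2) = edge_a
    (bx1, by1), (bx2, by2) = edge_b
    return math.hypot((ax1 + ax2) / 2 - (bx1 + bx2) / 2,
                      (ay1 + ay2) / 2 - (by1 + by2) / 2)


def bundle_candidates(edges, min_similarity=0.9, max_distance=1.0):
    """Return index pairs of edges that look like they could share a ride."""
    pairs = []
    for i in range(len(edges)):
        for j in range(i + 1, len(edges)):
            if (direction_similarity(edges[i], edges[j]) >= min_similarity
                    and midpoint_distance(edges[i], edges[j]) <= max_distance):
                pairs.append((i, j))
    return pairs


# Made-up coordinates: each edge is a pair of (x, y) endpoints.
edges = [
    ((0.0, 0.0), (10.0, 0.0)),   # west to east
    ((0.0, 0.5), (10.0, 1.0)),   # nearly the same route
    ((5.0, -5.0), (5.0, 5.0)),   # south to north, crossing the others
]
candidates = bundle_candidates(edges)
```

Here only the first two edges pair up; the third crosses them at a right angle and is left alone.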

The graphs below demonstrate the advantages of edge-bundling. Graphs A and B represent the same US flight data. In these graphs, each node represents a US city. Graphs C and D showcase the same poker matches between various players, before and after edge bundling is applied.

These examples also highlight the fact that edge-bundling works best in relatively interconnected graphs. If your graph doesn’t have too many edges and edge crossings, edge-bundling won’t raise the information-to-ink ratio too much:

An example of when edge-bundling a collaboration graph doesn’t get you too far… (We found this out the hard way.)

(It turns out there are a few different strategies for calculating how to bundle edges together. For an in-depth comparison, try this paper on Confluent Bundling.)

2.3 — A visual language for node-link diagrams

In chemistry, scientists use a shorthand system to depict a chemical’s structure. For example, because it is considered background knowledge, researchers will commonly avoid explicitly drawing the hydrogen-carbon connections. In addition, various types of chemical bonds are also given their own symbols.

Consider the structure of Isobutanol, a solvent composed of hydrogen, carbon, and oxygen:

Isobutanol

Clearly, this diagram relies on pre-supposed understandings of what all the symbols mean. But one can take this a step further:

A node-link diagram of Isobutanol, (CH3)2CHCH2OH. The carbon-hydrogen bonds are understood to be implicit.

Leaving out the common “C” and “H” symbols (with the exception of the “OH” group) decreases the total ink on the page, making the information-to-ink ratio higher.

For an even more striking example, consider this diagram of escitalopram:

Shamelessly stolen from Wikipedia.

On Wikipedia, the caption for this diagram is: “The skeletal formula of the antidepressant drug escitalopram, featuring skeletal representations of heteroatoms, a triple bond, phenyl groups and stereochemistry.”

To understand how much information is communicated in these diagrams, consider how many years of study it would take, or how many Wikipedia articles you’d have to read, to fully understand the caption. I don’t know about you, but the last time I thought about chemistry, I was in 10th grade. Clearly, I’d need to learn plenty in order to have more than a superficial understanding of escitalopram.

In other words, the escitalopram diagram consists of abbreviations that only make sense to people already in the know. This means that it communicates a good deal of information, but that most people are not in a position to understand it.

Creating a visual language could be very promising for making node-link graphs easy to understand. It’s a classic chicken and egg problem: most graphs are hard to read, causing them to be underutilized, which in turn causes people to shy away from creating a visual language for them. And because there isn’t an agreed-upon visual language, people can’t learn how to read graphs, making them underutilized.

In my cursory search of the node-graph-visualization literature, I found one prior attempt to create a visual language:

2009 US Senate voting graphs. Senators are connected if they vote the same way more than 70% of the time. The graph on the right replaces large cliques with rounded star shapes.

While the graph on the right is simpler, it can be argued that too much information has been lost — all of the senate Democrats and left-leaning independents have been subsumed into a giant blue blob.


Having given the matter some thought, here are other potential starting points for a node-link graph visual language:

  • Nodes that don’t connect to anything should be represented by a much smaller dot.
  • Leaf-nodes (or the last, terminal link in a chain of connections) should be represented like the hydrogen atoms in chemistry diagrams, where the node is left off of its link. If, after collapsing all of the leaf nodes in this fashion, there are new leaf nodes, use two segments of a line to indicate that a node has two layers of leaf nodes associated with it.
  • Represent the parts where every node is connected to every other node (technically called a maximal clique) with a simple polygon whose shape corresponds to the number of nodes represented. For example, if you have three nodes that are each connected to each other, you could represent this with a small triangle.
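Before any of these symbols can be drawn, a program has to find the structures they stand for. The sketch below shows one way that might look in plain Python, using a tiny made-up collaboration graph. The function names are mine, and the clique search is the textbook Bron–Kerbosch recursion without any of its usual speed-ups, swapped in here purely for illustration.

```python
def build_adjacency(nodes, edges):
    """Map each node to the set of nodes it links to."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def maximal_cliques(adj, min_size=3):
    """List maximal cliques with at least min_size members (Bron-Kerbosch)."""
    found = []

    def expand(clique, candidates, excluded):
        # Nothing left to add and nothing skipped: the clique is maximal.
        if not candidates and not excluded:
            if len(clique) >= min_size:
                found.append(clique)
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return found


def classify(nodes, edges):
    """Find the three structures the proposed visual language would abbreviate."""
    adj = build_adjacency(nodes, edges)
    return {
        "orphans": sorted(n for n in adj if not adj[n]),
        "leaves": sorted(n for n in adj if len(adj[n]) == 1),
        "cliques": [sorted(c) for c in maximal_cliques(adj)],
    }


# Hypothetical inventors: a team of three, a chain hanging off it, and a loner.
nodes = ["ann", "bo", "cy", "di", "ed", "flo"]
edges = [("ann", "bo"), ("bo", "cy"), ("ann", "cy"), ("cy", "di"), ("di", "ed")]
summary = classify(nodes, edges)
```

On this toy graph, "flo" is the orphan, "ed" is the only leaf, and ann, bo, and cy form the one triangle. Orphans and leaves are cheap to find, but enumerating maximal cliques can take exponential time in the worst case, so on large, dense graphs this step is where the computational cost would show up.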

To get a sense for what a visual language of graphs might do, when you look at the graph below, imagine that all of the single dots are hidden. While you are at it, think of all the ink that could be saved by using symbols for maximal cliques, orphan nodes, and leaf nodes. In the graph below, I’ve taken the liberty of highlighting some of the easy to spot maximal cliques. Each clique may indicate small teams or working groups or the outcome of a project with a few related patents.

Select highlighting of maximal cliques on the Fast Company Apple patent graph. The red clique in the upper right could be represented by a nonagon (a polygon with nine sides).

There are many more maximal cliques than I had time to highlight. And due to the force-directed layout, I wasn’t able to identify any of the cliques in the center of the visualization. But the moral of the story is that applying a visual language to this graph could render it much more readable than it currently is.

Another issue with designing a visual language meant to simplify graphs is that it can be very hard for a computer to figure out where the graphs can be simplified. Still, it may be worth the computational effort if the result is a graph that is easier to understand.

2.4 — Three uses for the Third Spatial Dimension

Another strategy for making better node-link graphs is to utilize three spatial dimensions rather than two. This strategy hasn’t been widely used, even though it seems rather promising.

One advantage is that in a purely 3D force-directed layout, the extra dimension makes the layout more efficient: there is more “space” for the springs to contract into, and the result usually becomes less cluttered. (More details will be revealed in a forthcoming post.)

In the following video what initially looks like a messy graph becomes simpler and cleaner as its three-dimensional structure is revealed:

A three-dimensional collaboration graph we produced for one of our clients.

Clearly, virtual reality could be useful for navigating this graph. Enabling a user to interact with the graph structure can also raise the information-to-ink ratio, as it allows the user to literally zoom-in on what they think is important.


In addition to making the force-directed layout easier to compute, and raising the information-to-ink ratio, the third dimension can allow for important differences in the graph to be visually reinforced. More technically, one can use the extra spatial dimension to encode multi-dimensional collaboration information.

Now, in all honesty, I have not been able to find a satisfactory example of what this might look like. So instead, let’s examine a two-dimensional visualization that utilizes multi-dimensional information.

If you head over to http://www.patentsview.org/web/#viz/relationships, you’ll see a neat visualization of information from the U.S. Patent office.

This graph shows the top 100 most cited patents, along with the assignees (usually the company, in orange) as well as the inventors, in yellow.

Imagine if each dot type was confined to a separate plane, and these planes were separated by different amounts of depth. If you look carefully, you’ll observe that a lot of the visual noise in the graph is due to the yellow dots being forced away from what they connect to.
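The bookkeeping for that thought experiment is small. Here is a hedged Python sketch that takes an existing two-dimensional layout and lifts each node onto a plane chosen by its type. The function name, the node names, and the spacing are all hypothetical; nothing here comes from the PatentsView site itself.

```python
def stack_by_type(flat_layout, node_types, plane_order, spacing=1.0):
    """Turn {node: (x, y)} into {node: (x, y, z)}, one z-plane per node type."""
    depth = {kind: index * spacing for index, kind in enumerate(plane_order)}
    return {
        node: (x, y, depth[node_types[node]])
        for node, (x, y) in flat_layout.items()
    }


# Made-up positions and types for three dots.
flat_layout = {"patent-1": (0.0, 0.0), "acme": (1.0, 0.5), "jo": (0.5, 1.5)}
node_types = {"patent-1": "patent", "acme": "assignee", "jo": "inventor"}
stacked = stack_by_type(flat_layout, node_types,
                        plane_order=["patent", "assignee", "inventor"],
                        spacing=2.0)
```

A fuller version would presumably rerun the force-directed layout within each plane, so that inventors can sit directly above or below what they connect to instead of being shoved sideways.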


Using the third spatial dimension can make it easier to compute the force-directed layout, reduce visual clutter, and reinforce different aspects of the data. One could also use the third dimension as an excuse to “store” different two-dimensional planes, each containing part of the graph. Essentially, the graph could be “folded” in various ways to compress it and make it easier to understand.

3.0 — Conclusion

In 1736, Leonhard Euler published the first graph theory paper, in which he reformulated a problem about crossing bridges in Königsberg to be about nodes and the links between them. Yet even in 2017, large scale node-link collaboration diagrams are still in their infancy. As visual vocabularies develop, and people get comfortable with navigating graphs in three dimensions, I predict they will be used more and more frequently. Taking a page out of the Bell Labs playbook, companies will likely use node-link graphs to visualize collaboration and identify which, if any, employees should be paid to have breakfast with everybody else.

Over the past twenty years, the ability to perform computations on large graphs has given rise to many successful projects. Both Facebook and Google are predicated on computing values across large and complex social and web graphs, respectively. But for both of these companies, effectively visualizing these graphs has been close to impossible. With new techniques (explored above) and new technologies (like 3D VR, AR, etc.), node-link graphs may just get their “fifteen minutes of fame” yet. At Kineviz, we work with our clients to explore how to utilize these new techniques and technologies to make graphs easier to understand and thus more insightful.

[1] Jon Gertner, The Idea Factory, pg 135.

Thanks for reading! If you are interested in having Kineviz visualize your data, whether or not it is suitable for node-link graph techniques, drop us a line — hello@kineviz.com

(And if you are interested in, or have ideas about, developing a visual language for graphs, feel free to reach out to me via Twitter: @anotherpianosan)