Using clustering to create a new D3.js color scale

TL;DR Check out d3-scale-cluster on Github and npm

D3 scales are an immensely useful tool for data visualization and many other applications. In visualization, one common use case is mapping numeric values to a discrete set of colors, as shown in the map below:

Obama’s Health Law: Who Was Helped Most — New York Times, Oct 29, 2014

D3’s quantile and quantize scales are frequently used for this purpose. Quantile scales assign roughly an equal number of values to each color.
Quantize scales create segments of roughly equal size between the minimum and maximum observed values. While these scales work well enough in most cases, they can have shortcomings, which we’ll illustrate by looking at the following set of 10 values:

Here’s what it looks like when we use a quantile scale to map these values to six discrete colors:

The quantile scale assigns one or two values to each color. This works alright, but notice that 5 and 12 end up in the same category even though 5 is actually much closer to 1, 2, and 4. Now let’s try quantize:

One might argue that quantize is the most visually accurate scale, as it just linearly maps values to discrete colors. However, the outlier 1244 heavily skews the distribution and assigns 80% of the values to a single color, despite there being considerable variation within that color.

Let’s see what happens when we apply a 1-dimensional clustering algorithm to the same problem:

The results are pretty compelling. The numbers assigned to each color are similar in magnitude, and the full spectrum of colors is utilized.

Finding “natural breaks” in the data is not a new approach by any means. In the 1970s, cartographer George Jenks wanted a similar solution for choropleth visualizations, and proposed the Jenks Natural Breaks Algorithm in a 1977 article “Optimal Data Classification for Choropleth Maps”. While this algorithm has been ported to many different languages over the past couple decades (JavaScript included), Tom MacWright was the first to create a “literate” Jenks implementation in JavaScript for use in his simple-statistics library.

In late July 2015, Tom replaced the Jenks algorithm in simple-statistics with a port of an algorithm called Ckmeans–a 1-dimensional clustering algorithm created by Haizhou Wang and Mingzhou Song that is a slight improvement over Jenks in most cases.

Despite being an improvement over Jenks, the Ckmeans algorithm still had a runtime complexity O(kn²) which would take 300–500ms for visualizations that have thousands of data points such as the US counties choropleth above. This is just too slow to be used reliably in applications where performance is important.

Luckily Wang & Song released Ckmeans 3.4.6 in May 2016, with a new core algorithm that uses a divide & conquer dynamic programming approach to achieve O(kn log(n)) runtime. For web applications this is a huge win–the same US counties choropleth now takes 30–50ms to compute clusters, which is a 10x improvement.

The new core algorithm in Ckmeans 3.4.6 led to a ~10x speed improvement for many sample sizes

In an effort to make it easier for anyone to use this technique in data visualizations, I’ve ported this new algorithm to JavaScript and created a custom d3 scale called d3-scale-cluster. You can find d3-scale-cluster on Github and npm–give it a try and shoot me a tweet @dschnr with your thoughts!