Using clustering to create a new D3.js color scale
D3 scales are an immensely useful tool for data visualization and many other applications. In visualization, one common use case is mapping numeric values to a discrete set of colors, as shown in the map below:
D3’s quantile and quantize scales are frequently used for this purpose. Quantile scales assign roughly an equal number of values to each color.
Quantize scales create segments of roughly equal size between the minimum and maximum observed values. While these scales work well enough in most cases, they can have shortcomings, which we’ll illustrate by looking at the following set of 10 values:
Here’s what it looks like when we use a quantile scale to map these values to six discrete colors:
The quantile scale assigns one or two values to each color. This works alright, but notice that 5 and 12 end up in the same category even though 5 is actually much closer to 1, 2, and 4. Now let’s try quantize:
One might argue that quantize is the most visually accurate scale, as it just linearly maps values to discrete colors. However, the outlier 1244 heavily skews the distribution and assigns 80% of the values to a single color, despite there being considerable variation within that color.
Let’s see what happens when we apply a 1-dimensional clustering algorithm to the same problem:
The results are pretty compelling. The numbers assigned to each color are similar in magnitude, and the full spectrum of colors is utilized.
In late July 2015, Tom replaced the Jenks algorithm in simple-statistics with a port of an algorithm called Ckmeans–a 1-dimensional clustering algorithm created by Haizhou Wang and Mingzhou Song that is a slight improvement over Jenks in most cases.
Despite being an improvement over Jenks, the Ckmeans algorithm still had a runtime complexity O(kn²) which would take 300–500ms for visualizations that have thousands of data points such as the US counties choropleth above. This is just too slow to be used reliably in applications where performance is important.
Luckily Wang & Song released Ckmeans 3.4.6 in May 2016, with a new core algorithm that uses a divide & conquer dynamic programming approach to achieve O(kn log(n)) runtime. For web applications this is a huge win–the same US counties choropleth now takes 30–50ms to compute clusters, which is a 10x improvement.