How to Pick the Optimal Color Palette from Any Image

Color quantization/approximation is a computer graphics process that reduces the number of colors required to represent an image while preserving the overall human-perceived quality.

An image reduced to a palette of 16 colors specifically chosen to best represent the image using color quantization. Left: Original

It draws on both machine learning and color science. In this post, I will share what I learned about quantizing colors using the k-means++ clustering algorithm and the Lab color space. As the implementation part of that learning, I created a color picking app and open sourced the core logic as JS packages for your next color or data science related project:

  • colors picker — a handy mobile-first web app that picks the optimal color palette from any image
  • major-colors — color quantization with k-means++ and CIEDE2000
  • k-means-plus — K-means++ clustering algorithm

Applications of Color Quantization

Color quantization simplifies an image into a more essential representation that is easier to understand, analyze, and share. Extracting an image's essential colors as the human eye perceives them is critical for:

  • displaying images with a limited number of colors
  • enabling efficient compression (compression works much better when there are fewer colors to represent)
  • generating optimal color palettes from images
  • building UIs that adapt to an image
  • dynamic color schemes based on a collection of photos
  • organizing/sorting images by color

Using K-means for Color Quantization

k-means in action clustering data into 3 clusters

One common approach to color quantization is to use a clustering algorithm to group similar colors in a color space and assign each group's center vector as a final color. In my case, I went with k-means, which minimizes the sum of squared distances from each data point to its nearest cluster center. It was a natural fit for color quantization. For the scope of this post, I will assume the audience is already familiar with cluster analysis and the standard k-means algorithm.
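To make the assign-then-update loop concrete, here is a bare-bones sketch of k-means over points such as `[r, g, b]` pixels. The function names are hypothetical and this is not the actual package implementation, just an illustration of the two steps:

```javascript
// Squared Euclidean distance between two equal-length vectors.
function squaredDistance(a, b) {
  return a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0);
}

// Minimal k-means: repeat (assign points to nearest center, move
// each center to the mean of its assigned points) a fixed number of times.
function kMeans(points, centers, iterations = 10) {
  for (let iter = 0; iter < iterations; iter++) {
    // Assignment step: attach each point to its nearest center.
    const groups = centers.map(() => []);
    for (const p of points) {
      let best = 0;
      for (let c = 1; c < centers.length; c++) {
        if (squaredDistance(p, centers[c]) < squaredDistance(p, centers[best])) best = c;
      }
      groups[best].push(p);
    }
    // Update step: move each center to the mean of its group
    // (an empty group keeps its previous center).
    centers = centers.map((center, c) =>
      groups[c].length === 0
        ? center
        : groups[c][0].map((_, dim) =>
            groups[c].reduce((s, p) => s + p[dim], 0) / groups[c].length)
    );
  }
  return centers;
}
```

The returned centers are the quantized palette: every pixel in the image is then rendered with its nearest center's color.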

K-means++’s Guarantee

Because the standard k-means is a heuristic algorithm that chooses its initial cluster seeds randomly, it can generate clusterings arbitrarily worse than the optimum; there is no guarantee that it will converge to the global optimum, and the quality of the approximation depends heavily on the initial seeding of the cluster centers. By using a smarter algorithm for choosing the initial seeds, the k-means++ variant addresses the potential for sub-optimal clustering and guarantees an approximation ratio of O(log k) in expectation (over the randomness of the algorithm), where k is the number of clusters. In short, k-means++ is a way of avoiding the poor clusterings the standard k-means algorithm can fall into. What's left now is calculating the difference between two colors for the actual clustering.
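The seeding idea can be sketched in a few lines: pick the first center uniformly at random, then pick each subsequent center with probability proportional to its squared distance D(x)² from the nearest center already chosen. This is a hypothetical standalone version for illustration, not the k-means-plus package itself:

```javascript
// Squared Euclidean distance between two equal-length vectors.
function squaredDistance(a, b) {
  return a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0);
}

// k-means++ seeding: spread the initial centers out by sampling
// points far from the centers chosen so far.
function seedCenters(points, k, random = Math.random) {
  // First center: uniformly at random.
  const centers = [points[Math.floor(random() * points.length)]];
  while (centers.length < k) {
    // D(x)^2 for each point: squared distance to its nearest chosen center.
    const d2 = points.map(p =>
      Math.min(...centers.map(c => squaredDistance(p, c))));
    // Sample the next center with probability proportional to D(x)^2.
    const total = d2.reduce((a, b) => a + b, 0);
    let r = random() * total;
    let i = 0;
    while (r > d2[i]) r -= d2[i++];
    centers.push(points[i]);
  }
  return centers;
}
```

Points that already coincide with a center have D(x)² = 0 and are never re-picked, which is exactly what pushes the seeds apart.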

Calculating Color Difference

By default, the algorithm measures the distance between two points in a space using the Euclidean distance. In the RGB color space, the distance between two colors would be:

d = √((R₂ − R₁)² + (G₂ − G₁)² + (B₂ − B₁)²)
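In code, that per-channel Euclidean distance is a one-liner (a sketch, with colors as `[r, g, b]` arrays in 0–255):

```javascript
// Euclidean distance between two RGB colors given as [r, g, b] arrays.
function rgbDistance([r1, g1, b1], [r2, g2, b2]) {
  return Math.sqrt((r2 - r1) ** 2 + (g2 - g1) ** 2 + (b2 - b1) ** 2);
}
```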

This works when one color simply needs to be compared with another to decide which of two distances is greater. However, it does not account for differences in human color perception at all: the human eye is more sensitive to some colors than others. Color spaces like HSV and HSL try to do better by arranging colors within a three-dimensional space, but since most of them are just transformations of RGB, they tend to perform on par with a simple Euclidean metric. I wanted a color space that describes and quantifies these non-uniformities in human color perception.

Lab Color Space

The Lab color space is widely used in the color imaging and printing industries: L* represents lightness, while a* and b* represent the green–red and blue–yellow color components. Unlike the RGB and CMYK color models, Lab is designed to be perceptually uniform with respect to human color vision. The Euclidean distance in Lab space is known as Delta E:

ΔE*ab = √((L₂ − L₁)² + (a₂ − a₁)² + (b₂ − b₁)²)
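To use Delta E on image pixels, each sRGB color first has to be converted to Lab (gamma expansion, then the sRGB-to-XYZ matrix, then the XYZ-to-Lab transform against a D65 white point). The sketch below is for illustration, not the package implementation; the matrix and white-point constants are the standard sRGB/D65 values:

```javascript
// Convert an sRGB color (0–255 channels) to Lab (D65 white point).
function srgbToLab([r, g, b]) {
  // 1. Undo sRGB gamma to get linear RGB in 0..1.
  const lin = [r, g, b].map(v => {
    v /= 255;
    return v <= 0.04045 ? v / 12.92 : ((v + 0.055) / 1.055) ** 2.4;
  });
  // 2. Linear RGB -> XYZ (sRGB matrix, D65 white).
  const [R, G, B] = lin;
  const X = 0.4124 * R + 0.3576 * G + 0.1805 * B;
  const Y = 0.2126 * R + 0.7152 * G + 0.0722 * B;
  const Z = 0.0193 * R + 0.1192 * G + 0.9505 * B;
  // 3. XYZ -> Lab, normalized by the D65 reference white.
  const f = t => (t > 0.008856 ? Math.cbrt(t) : 7.787 * t + 16 / 116);
  const [fx, fy, fz] = [X / 0.95047, Y / 1.0, Z / 1.08883].map(f);
  return [116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)];
}

// Original Delta E (CIE76): plain Euclidean distance between two Lab colors.
function deltaE76([L1, a1, b1], [L2, a2, b2]) {
  return Math.sqrt((L2 - L1) ** 2 + (a2 - a1) ** 2 + (b2 - b1) ** 2);
}
```

With this metric plugged into the clustering, the "distance" between pixels tracks perceived color difference rather than raw channel difference.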

Delta E attempts to quantify the difference between two colors in a perceptually uniform way; however, the original formula rates some regions, such as saturated colors, as more different than the human eye actually perceives them. So I went with the latest and most accurate formula, Delta E 2000 (CIEDE2000), which compensates for these non-uniformities. It is more complicated to implement and harder on the CPU; in performance-sensitive situations that don't require high color accuracy, the original Delta E should suffice.

End Thoughts

The correct choice for the number of colors (k) is often ambiguous; it is a frequent and generic problem in data clustering. In the context of colors, the choice of k and the accuracy of Lab's Delta E can ultimately only be judged by human perception. In my implementation, I have left it to the users to select the number of colors/clusters to pick. So go give colors picker a few spins and confirm through your own eyes. Make sure to try out major-colors and k-means-plus in your next project!