# How to Pick the Optimal Color Palette from Any Image

Color quantization/approximation is a computer graphics process that reduces the number of colors required to represent an image while preserving the overall human-perceived quality.

It utilizes both machine learning and color science concepts. In this post, I will share what I learned about quantizing colors using the **k-means++** clustering algorithm and the **Lab** color space. As the implementation part of my learning, I created a **color picking app** and open sourced the core logic as JS packages for your next color or data science related project:

- **colors picker** — a handy mobile-first web app that picks the optimal color palette from any image
- **major-colors** — color quantization with k-means++ and CIEDE2000
- **k-means-plus** — k-means++ clustering algorithm

#### Applications of Color Quantization

**Color quantization** simplifies an image into a more essential representation that is easier to understand, analyze, and share. Extracting an image's essential colors as the human eye perceives them is critical for:

- displaying images with limited colors
- enabling efficient compression (compression works much better when there are fewer colors to represent)
- generating optimal color palettes from images
- building UIs that adapt to an image
- dynamic coloring schemes based on a collection of photos
- organizing/sorting images by color

#### Using K-means for Color Quantization

One common approach to color quantization is to use a clustering algorithm to group similar colors in a color space and assign each group's center vector as a final color. In my case, I went with **k-means**, which minimizes the sum of squared distances from each data point to its nearest cluster center. It is a natural fit for color quantization. For the scope of this post, I will assume the audience is already familiar with cluster analysis and the standard k-means algorithm.
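To make the idea concrete, here is a minimal sketch of one k-means iteration over RGB pixels. This is illustrative only, not the **major-colors** implementation; pixels and centers are assumed to be `[r, g, b]` arrays.

```javascript
// Squared Euclidean distance between two [r, g, b] vectors
function squaredDistance(a, b) {
  return a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0);
}

// Index of the cluster center closest to a given pixel
function nearestCenterIndex(pixel, centers) {
  let best = 0, bestDist = Infinity;
  centers.forEach((c, i) => {
    const d = squaredDistance(pixel, c);
    if (d < bestDist) { bestDist = d; best = i; }
  });
  return best;
}

// One k-means pass: assign each pixel to its nearest center, then move
// each center to the mean of the pixels assigned to it.
function kMeansStep(pixels, centers) {
  const sums = centers.map(() => [0, 0, 0]);
  const counts = centers.map(() => 0);
  for (const p of pixels) {
    const i = nearestCenterIndex(p, centers);
    counts[i]++;
    for (let c = 0; c < 3; c++) sums[i][c] += p[c];
  }
  return centers.map((center, i) =>
    counts[i] ? sums[i].map(s => s / counts[i]) : center
  );
}
```

Repeating `kMeansStep` until the centers stop moving yields the final palette colors.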

#### K-means++’s Guarantee

Because the standard **k-means** is a heuristic algorithm that chooses its initial cluster seeds randomly, it can generate clusterings arbitrarily worse than the optimum; there is no guarantee that it will converge to the global optimum. The quality of the approximation heavily depends on the initial seeding of the cluster centers. By using a smarter algorithm for choosing the initial seeds, the **k-means++** variant addresses this potential for sub-optimal clustering and guarantees an approximation ratio of **O(log k)** in expectation (over the randomness of the algorithm), where k is the number of clusters used. Basically, **k-means++** is a way of avoiding the possibly poor clusterings found by the standard *k*-means algorithm. What's left now is calculating the difference between two colors for the actual clustering.
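The seeding step is the whole trick: each new center is sampled with probability proportional to its squared distance from the nearest already-chosen center, so seeds spread out across the data. A minimal sketch (not the **k-means-plus** package itself; the `random` parameter is injectable only to make the sketch testable):

```javascript
// Squared Euclidean distance between two equal-length vectors
function squaredDistance(a, b) {
  return a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0);
}

// k-means++ seeding: pick the first center uniformly at random, then pick
// each subsequent center with probability proportional to D(x)^2, the
// squared distance from point x to its closest chosen center.
function kMeansPlusPlusSeed(points, k, random = Math.random) {
  const centers = [points[Math.floor(random() * points.length)]];
  while (centers.length < k) {
    const d2 = points.map(p =>
      Math.min(...centers.map(c => squaredDistance(p, c)))
    );
    const total = d2.reduce((a, b) => a + b, 0);
    // Weighted sampling: walk the cumulative D(x)^2 mass until r runs out
    let r = random() * total;
    let next = 0;
    for (let i = 0; i < d2.length; i++) {
      r -= d2[i];
      if (r <= 0) { next = i; break; }
    }
    centers.push(points[next]);
  }
  return centers;
}
```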

#### Calculating Color Difference

By default, the algorithm measures the distance between two points in a space using the **Euclidean distance**. In the **RGB** color space, the distance between two colors would be:

ΔE = √((R₂ − R₁)² + (G₂ − G₁)² + (B₂ − B₁)²)
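In code, that Euclidean distance is a one-liner (colors here are assumed to be `{ r, g, b }` objects with channels in 0–255):

```javascript
// Plain Euclidean distance between two RGB colors.
// Does NOT account for human color perception.
function rgbDistance(c1, c2) {
  return Math.sqrt(
    (c2.r - c1.r) ** 2 +
    (c2.g - c1.g) ** 2 +
    (c2.b - c1.b) ** 2
  );
}
```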

This works when a single color is to be compared to another and we simply need to know which distance is greater. However, it does not account for differences in human color perception at all. The **human eye is more sensitive to certain colors than others**. Color spaces like HSV and HSL try to do better by arranging colors within a three-dimensional space.

But since most of these are just modifications of RGB, they tend to be on par with a simple Euclidean metric. I wanted a color space that accounts for, describes, and quantifies these non-uniformities in human color perception.

#### Lab Color Space

**Lab color space** is widely used in the color imaging and printing industries: ***L*** for lightness, and ***a*** and ***b*** for the green–red and blue–yellow color components. Unlike the RGB and CMYK color models, **Lab color** is designed to approximate human vision and be **perceptually uniform** with respect to it. The Euclidean distance in Lab space is called *Delta E*.

*Delta E* tries to uniformly quantify the difference between two colors to fit human perception; however, the original formula rates some regions, like saturated colors, too high compared to others. So, I went with the latest and most accurate *Delta E 2000* formula, which compensates for these non-uniformities. It is more complicated to implement and harder on the CPU; in performance-sensitive situations that don't require high color accuracy, the original *Delta E* should suffice.
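As a sketch of the simpler path, here is an sRGB-to-Lab conversion (D65 white point, standard sRGB matrices) and the original *Delta E* (CIE76), which is just the Euclidean distance in Lab space. The full *Delta E 2000* formula used by **major-colors** adds lightness, chroma, and hue weighting on top of this and is too long to reproduce here.

```javascript
// Convert an sRGB color [r, g, b] (0-255) to Lab [L, a, b] (D65 white).
function srgbToLab([r, g, b]) {
  // sRGB gamma expansion to linear light
  const lin = [r, g, b].map(v => {
    v /= 255;
    return v > 0.04045 ? ((v + 0.055) / 1.055) ** 2.4 : v / 12.92;
  });
  // Linear RGB -> XYZ using the sRGB matrix
  const [lr, lg, lb] = lin;
  const x = 0.4124 * lr + 0.3576 * lg + 0.1805 * lb;
  const y = 0.2126 * lr + 0.7152 * lg + 0.0722 * lb;
  const z = 0.0193 * lr + 0.1192 * lg + 0.9505 * lb;
  // XYZ -> Lab, normalized by the D65 reference white
  const f = t => (t > 0.008856 ? Math.cbrt(t) : 7.787 * t + 16 / 116);
  const [fx, fy, fz] = [x / 0.95047, y / 1.0, z / 1.08883].map(f);
  return [116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)];
}

// Original Delta E (CIE76): Euclidean distance between two Lab colors.
function deltaE76(lab1, lab2) {
  return Math.sqrt(lab1.reduce((s, v, i) => s + (v - lab2[i]) ** 2, 0));
}
```

Swapping `deltaE76` in as the distance metric of the k-means clustering is all it takes to cluster in Lab space instead of RGB.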

#### End Thoughts

The correct choice of the number of colors (k) is often ambiguous; it is a frequent and generic problem in data clustering. In the context of colors, the choice of k and the accuracy of Lab color's *Delta E* can only be judged by human perception. In my implementation, I let users select the number of colors/clusters to pick. So go give **colors picker** a few spins and confirm through your own eyes. Make sure to try out **major-colors** and **k-means-plus** in your next project!