Plant Similarity Index: Finding similar plants with PCA

Would a rose by any other name smell as sweet? Statistically, the Pink Elf French Hydrangea might work. Meanwhile, the Blue Rose Echeveria is a desert succulent.

Our goal at Bloom is to help gardeners find the best plants for their yard. We offer suggestions about what plants will grow well together, but sometimes a plant is not available, or won’t work with the particular lighting or soil conditions. Or maybe the gardener just hates roses.

To help, we created the Plant Similarity Index, a simple metric designed to help gardeners find similar plants. It is also a good jumping-off point for exploring the data.

You can see this in action with our Similar Plant Finder.

Criteria for Success

To be considered successful, the index should identify plants that would fit in the same place spatially, since plant size plays a big role in our bed design algorithms. Similar plants should also share characteristics like leaf type and color. The results should be presentable in a useful way, without overwhelming the gardener with choices, and should generally ‘make sense’ when seen together.

Failed Attempt: K-Means

Our first attempt used a few clustering algorithms, starting with K-Means. Even after several attempts at scaling the data to bring out appropriate features, the results weren’t satisfactory.

On the surface, it looks like it might be ok:

K-Means with 20 Clusters

But it turns out we have thousands of plants packed into just a few clusters:

K-Means: lots of uneven clusters

Different algorithms and feature scaling produced similar results. Algorithms that determined their own cluster count often produced no more than 3 clusters, fitting thousands of plants into each. A gardener looking for one similar plant would face thousands of candidates, which would be a bad user experience.
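
The imbalance described above can be checked numerically for any clustering run. The sketch below assumes only a list of cluster labels, one per plant. The labels shown are made up for illustration, and `cluster_size_report` is a hypothetical helper, not part of our pipeline.

```python
from collections import Counter


def cluster_size_report(labels):
    """Summarize how evenly items are spread across clusters."""
    sizes = Counter(labels)
    largest_label, largest_size = sizes.most_common(1)[0]
    return {
        "n_clusters": len(sizes),
        "largest_cluster": largest_label,
        # Share of all items that landed in the single biggest cluster.
        "largest_share": largest_size / len(labels),
        "sizes": dict(sizes),
    }


# Hypothetical labels: 100 plants, 90 of them in cluster 0.
labels = [0] * 90 + [1] * 7 + [2] * 3
report = cluster_size_report(labels)
print(report)
```

A largest share close to 1.0 means most plants sit in one cluster, so "same cluster" says very little about similarity.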

Principal Component Analysis

We next tried Principal Component Analysis (PCA), a technique that finds the directions of greatest variance in correlated data and uses them to reduce its dimensionality. It is commonly used for tasks such as image compression.

We used descriptive plant features (height, width, plant type, colors, and so on), about 12 dimensions in all, and reduced them to one dimension. We gave more priority to the size dimensions, which we consider important. We left out the common name: a rose is not a rose.
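
As a rough illustration of the reduction described above, here is a minimal NumPy sketch that weights the size columns and projects each plant onto the first principal component. The feature columns, the weights, the plant names, and the standardization step are all assumptions made for the example. It is not our exact preprocessing.

```python
import numpy as np


def similarity_index(features, weights):
    """Reduce weighted plant features to one score per plant via PCA."""
    x = np.asarray(features, dtype=float)

    # Standardize each column so feet and 0/1 flags are on a comparable scale.
    std = x.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    z = (x - x.mean(axis=0)) / std

    # Assumption: "priority" is applied by multiplying columns by a weight.
    z = z * np.asarray(weights, dtype=float)

    # Rows of vt are the principal axes of the centered data, ordered by
    # explained variance; projecting onto vt[0] keeps only the first one.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[0]


# Hypothetical rows: height (ft), width (ft), is_vine, is_tree
names = ["vine_a", "vine_b", "small_shrub", "large_tree"]
features = [
    [10.0, 3.0, 1, 0],
    [12.0, 4.0, 1, 0],
    [1.5, 1.5, 0, 0],
    [60.0, 50.0, 0, 1],
]
weights = [3.0, 3.0, 1.0, 1.0]  # size columns count three times as much

index = similarity_index(features, weights)
for name, score in sorted(zip(names, index), key=lambda pair: pair[1]):
    print(f"{score:8.2f}  {name}")
```

The sign of a principal component is arbitrary, so only the distances between scores are meaningful, not which end of the scale a plant lands on.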

The results were satisfactory, and best of all, the data is presented in a usable way with just one dimension. This is a huge benefit for indexing, letting us quickly find similar plants in production.
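
To show why a single dimension helps with indexing, the sketch below keeps the scores in a sorted list and uses a binary search to find the closest neighbors of a target score. The scores and plant names are invented for the example, and `nearest_plants` is a hypothetical helper. A production system would more likely rely on a database index over the score column.

```python
from bisect import bisect_left


def nearest_plants(scores, names, target, k=3):
    """Return the k names whose scores are closest to target.

    scores must be sorted ascending, with names in the same order.
    """
    hi = bisect_left(scores, target)  # first position with score >= target
    lo = hi - 1
    found = []
    # Walk outward from the insertion point, always taking the closer side.
    while len(found) < k and (lo >= 0 or hi < len(scores)):
        take_lo = hi >= len(scores) or (
            lo >= 0 and target - scores[lo] <= scores[hi] - target
        )
        if take_lo:
            found.append(names[lo])
            lo -= 1
        else:
            found.append(names[hi])
            hi += 1
    return found


# Hypothetical index values, sorted by score.
scores = [-3.0, 1.0, 1.5, 2.0, 40.0]
names = ["shrub_a", "vine_a", "vine_b", "vine_c", "tree_a"]

print(nearest_plants(scores, names, target=1.4, k=2))
```

The binary search takes logarithmic time in the number of plants, and each extra neighbor costs one comparison.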

Results

The index spans a range from -85.62 (Golden Sunblaze Miniature Rose) to 1356.59 (English Oak).

This short excerpt of the data shows PCA successfully identifying similar ‘Clematis’ vines. It also closely relates Silver Lace Vine and Willamette Raspberry, also vines.

Appropriate clustering of Clematis and a few

It’s important that the algorithm found similar plants based on something other than name. Finding other suitable vines in the example above is a huge show of confidence in this technique. This kind of classification would have taken days by hand.

It’s not perfect, though. We occasionally encounter results we don’t understand. In the example below, the Cloud Nine Tall Switch Grass was an odd result.

An ornamental grass is an odd choice here.

Conclusion

We’re ultimately satisfied with the results, enough to use the index for recommendations. We’ll continue to iterate to reduce the outliers, which aren’t uncommon with machine learning.

You can explore this data in our Similar Plant Finder.

View the entire Similarity Index.

Learn more about the Bloom Landscape Assistant.