An excellent analysis.
Leland McInnes

Hi Leland!

First of all, thank you for authoring HDBSCAN. It is a great clustering algorithm that folks like me (more on the “applied” side of things) can use off the shelf, and it deals flexibly with the messiness of real-world data. I was starting to run out of ideas for handling every one of the millions of data points, and finding HDBSCAN was a lifesaver.

As for your note about cosine similarity, you are absolutely correct that it doesn’t play nicely with HDBSCAN. I was imprecise with my language/recollection, so I’ll go back and correct that. I actually read your notes on GitHub and Stack Exchange and used Euclidean distances between L2-normalized doc vectors as a substitute.
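
For anyone wondering why that substitution is reasonable: for unit-length vectors, squared Euclidean distance equals 2 × (1 − cosine similarity), so both measures rank pairs of documents the same way. Here is a minimal NumPy sketch of that check. The random `doc_vectors` array and the `l2_normalize` helper are made up for illustration and are not the vectors or code from my actual pipeline.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm (all-zero rows are left as zeros)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid dividing by zero
    return vectors / norms


# Hypothetical stand-in for real document vectors: 5 docs, 16 dimensions.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(5, 16))
unit_vectors = l2_normalize(doc_vectors)

# Cosine similarity of unit vectors is just their dot product.
cosine_sim = unit_vectors @ unit_vectors.T

# Pairwise squared Euclidean distances between the unit vectors.
diffs = unit_vectors[:, None, :] - unit_vectors[None, :, :]
sq_euclidean = np.sum(diffs ** 2, axis=-1)

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b)).
identity_holds = np.allclose(sq_euclidean, 2.0 * (1.0 - cosine_sim))
print(identity_holds)
```

In practice, the normalized vectors can then be clustered with a plain Euclidean metric in place of cosine.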

Oh, and I definitely didn’t use the algorithm in anger — I was quite happy to see it working! ;-)
