An excellent analysis. I must admit I’m impressed that you managed to use HDBSCAN with cosine distance on 100,000 samples! Cosine distance doesn’t play nicely as a metric (it fails the triangle inequality), which makes it hard to work with when trying to accelerate the algorithm. I would love to see that fixed, but I only have approaches that approximate the correct result. Since you clearly use the algorithm in anger, I would be interested to know your opinion on the scalability/accuracy trade-off.
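
For anyone who wants to see the triangle inequality failure concretely, here is a minimal sketch in plain Python. The three vectors and the helper name `cosine_distance` are my own illustrative choices, not anything taken from the analysis above.

```python
import math

def cosine_distance(u, v):
    """Cosine distance: 1 - cos(angle between u and v)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

# Illustrative vectors (an assumption for this demo): b sits at 45 degrees
# between a and c, which are orthogonal to each other.
a = (1.0, 0.0)
b = (1.0, 1.0)
c = (0.0, 1.0)

d_ac = cosine_distance(a, c)  # direct "distance" from a to c
d_ab = cosine_distance(a, b)  # first leg of the detour through b
d_bc = cosine_distance(b, c)  # second leg of the detour through b

# A true metric would guarantee d_ac <= d_ab + d_bc.
violates = d_ac > d_ab + d_bc
print(d_ac, d_ab + d_bc, violates)
```

Going via `b` comes out "shorter" than going direct, and that is exactly the property that pruning in tree-based neighbour searches relies on not happening.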