First of all, thank you for authoring HDBSCAN. It was a great clustering algorithm for folks like me (more on the “applied” side of things): I could use it off the shelf, and it dealt flexibly with the messiness of real-world data. I was starting to run out of ideas for exhaustively handling each of the millions of data points, and finding HDBSCAN was a lifesaver.
As for your note about cosine similarity, you are absolutely correct that it doesn’t play nicely. I was imprecise with my language/recollection, so I’ll go back and correct that. I actually read your notes on GitHub and Stack Exchange and used Euclidean distances between L2-normalized doc vectors as a substitute.
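For anyone wondering why that substitution is reasonable, here is a minimal sketch in plain Python. It is not my actual pipeline: the two toy vectors and the helper names (`l2_normalize`, `cosine_similarity`, `euclidean_distance`) are made up for illustration. It just checks that, for unit-length vectors, the squared Euclidean distance equals 2 × (1 − cosine similarity), so ranking pairs by one is the same as ranking them by the other.

```python
import math


def l2_normalize(vec):
    """Scale a vector so that its L2 norm is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two (not necessarily normalized) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Ordinary straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy "doc vectors", invented for this example.
doc_a = [1.0, 2.0, 3.0]
doc_b = [2.0, 0.5, 1.0]

unit_a = l2_normalize(doc_a)
unit_b = l2_normalize(doc_b)

cos_sim = cosine_similarity(doc_a, doc_b)
dist = euclidean_distance(unit_a, unit_b)

# For unit vectors: ||a - b||^2 = 2 - 2 * (a . b) = 2 * (1 - cos_sim)
assert math.isclose(dist ** 2, 2.0 * (1.0 - cos_sim))
```

Because the relationship is monotonic, vectors normalized this way can be handed to a clusterer that uses a plain Euclidean metric, which is the sense in which it stands in for cosine similarity here.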
Oh, and I definitely didn’t use the algorithm in anger — I was quite happy to see it working! ;-)