Machine Learning²: Hierarchical Clustering Algorithm — Part 2

£ha2
Feb 23, 2017 · 4 min read

Well here we are about to start and …

So Start Already!

Alright then! So what is Hierarchical Clustering? Well, essentially you:

  1. Take all the clusters and find the two closest clusters.
  2. Once the two closest are found, make a new cluster from their average and nest the parent clusters inside the newly created merged cluster.
  3. Add the newly merged cluster to the remaining clusters and start the process all over again.
  4. Continue until only one cluster is left: the cluster whose parents have parents who have parents who are all the clusters we had before. Hence our hierarchical clustering.

The goal? To have one cluster that has its parents, who have their parents, and so on and so forth, bonded by their closeness in our notion of distance. It's almost like a hierarchy of the most related clusters, the most inbred (sorry for using that word) being the ones at the bottom.
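If that is easier to follow in code, here is a compact sketch of those four steps in Scala. It assumes a plain sequential loop, a simple average-vector merge and an illustrative Euclidean distance; the actual code in this series is concurrent and uses a different metric, so treat this as an outline, not the real implementation.

```scala
// A cluster holds its vector and, once merged, the two clusters it was built from.
case class Cluster(vector: Vector[Double], parents: List[Cluster] = Nil)

// Illustrative Euclidean distance between two clusters' vectors.
def distance(a: Cluster, b: Cluster): Double =
  math.sqrt(a.vector.zip(b.vector).map { case (x, y) => (x - y) * (x - y) }.sum)

@annotation.tailrec
def agglomerate(clusters: List[Cluster]): Cluster = clusters match {
  case Nil           => sys.error("nothing to cluster")
  case single :: Nil => single                           // step 4: one cluster left, done
  case _ =>
    val pairs = for {                                    // step 1: all candidate pairs
      (a, i) <- clusters.zipWithIndex
      (b, j) <- clusters.zipWithIndex if i < j
    } yield (a, b)
    val (a, b) = pairs.minBy { case (x, y) => distance(x, y) }
    val merged = Cluster(                                // step 2: average, nest the parents
      a.vector.zip(b.vector).map { case (x, y) => (x + y) / 2 },
      List(a, b))
    agglomerate(merged :: clusters.filterNot(c => c == a || c == b)) // step 3: go again
}
```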

How Are We Making it Fast?

Well, let's put it like this: two heads are better than one. The same idea works in computing. In computing we use concurrency and asynchronous construction to distribute the workload to other cores that have some spare computation power. In Scala we do this with something called a Future. Its name is very indicative of what it does: if you construct a Future with a nested computation, you can move on with other things and get the result when you need it. We call this kind of thing non-blocking. In our case we need to compare every cluster to every other cluster, which is n² work. We can put each comparison in a different Future and wait for the results.
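As a tiny illustration of that non-blocking idea (this is a sketch with made-up values, not the article's code):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Illustrative Euclidean distance between two vectors.
def distance(a: Vector[Double], b: Vector[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// The comparison runs on another thread while we carry on with other work.
val comparison: Future[Double] = Future(distance(Vector(1.0, 2.0), Vector(3.0, 5.0)))

// ... other work happens here, nothing is blocked ...

// Only now do we block and collect the result.
val result: Double = Await.result(comparison, 5.seconds)
```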

Generating a List of Futures

With a little magic from foldLeft we can encase each of the computations in a Future, creating a new list containing all the constructed futures.
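Roughly, the shape is something like the following sketch. The generateClusterCalculation and computeDistances shown here are illustrative stand-ins with assumed signatures and a simple Euclidean distance; the real gist in the original post may differ in detail.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

case class Cluster(vector: Vector[Double])

// Stand-in for computeDistances: compare x against the remaining clusters,
// reusing any distance we already know.
def computeDistances(
    x: Cluster,
    rest: List[Cluster],
    distances: Map[(Cluster, Cluster), Double]
): Map[(Cluster, Cluster), Double] =
  rest.foldLeft(distances) { (acc, y) =>
    if (acc.contains((x, y))) acc
    else acc + ((x, y) ->
      math.sqrt(x.vector.zip(y.vector).map { case (a, b) => (a - b) * (a - b) }.sum))
  }

// foldLeft over the clusters: each step wraps one cluster's comparisons in a Future and
// hands the next step an increasingly shorter tail, so no pair is ever computed twice.
def generateClusterCalculation(
    clusters: List[Cluster],
    distances: Map[(Cluster, Cluster), Double]
): List[Future[Map[(Cluster, Cluster), Double]]] = {
  val (futures, _) =
    clusters.foldLeft((List.empty[Future[Map[(Cluster, Cluster), Double]]], clusters.drop(1))) {
      case ((acc, rest), x) =>
        (Future(computeDistances(x, rest, distances)) :: acc, rest.drop(1))
    }
  futures
}
```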

Now, there is a function in here that deserves a little bit of attention, and that is computeDistances. To run it we have been passing it increasingly shorter lists to compare against the designated cluster x. This way we prevent overlapping computations: every pair of clusters is only ever compared in one of the previously constructed futures. distances is a Map (think of it as a list of key-value pairs where the values are all of the same type) that contains all the distances between clusters we currently know. This way computations can share some of the already computed distances between clusters. This becomes quite key later on, but its advantages are already clear in principle.

But What is computeDistance Doing, Though?

Well look..

See, it’s really simple. It essentially runs in three parts. updatedDistances simply takes the current pair into consideration and checks if we have already computed the distance. If we have, we return the current Map as is; if we have not, we update the Map with a newly computed distance. relevantDistance is the distance for the current cluster pair, computed previously or just now. We then have updatedPairs, which compares the distance for this pair to the closest we currently know of. The closest pair wins!
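Sketched out, with the names from the description above but assumed signatures and an illustrative Euclidean distance, the three parts look something like this:

```scala
case class Cluster(vector: Vector[Double])

def euclidean(a: Vector[Double], b: Vector[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

def computeDistance(
    pair: (Cluster, Cluster),
    distances: Map[(Cluster, Cluster), Double],
    closest: ((Cluster, Cluster), Double)
): (Map[(Cluster, Cluster), Double], ((Cluster, Cluster), Double)) = {

  // 1. Only compute the distance if we have not seen this pair before.
  val updatedDistances =
    if (distances.contains(pair)) distances
    else distances + (pair -> euclidean(pair._1.vector, pair._2.vector))

  // 2. The distance for the current pair, computed previously or just now.
  val relevantDistance = updatedDistances(pair)

  // 3. The closest pair wins.
  val updatedPair =
    if (relevantDistance < closest._2) (pair, relevantDistance) else closest

  (updatedDistances, updatedPair)
}
```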

Well That Was Easy

Yeah, right? And I love how functional programming makes the logic easy to see. So now that we have most of the ingredients, we can really construct our clustering algorithm. So let's start…

Baaammm there you go!

You're Doing This A Lot Now, Slow Down…

Well, you were the one so eager before. Let me soften the blow. On lines 20 to 22 we generate an initial distance Map that we pass into our generateClusterCalculation function. calculationsFutures holds all our computations, which we collect on line 26. Await.result takes all the computations and waits for them ALL to finish. After that we pass all the results through updateMostSimilar. This function simply merges all the distances between pairs and finds the pair with the smallest distance. On line 27 we do just that. Now that we have what we want, we can decide what to do with it in the algorithm from lines 30 to 45. We merge the clusters that are closest, update the newly merged cluster with its parents, and remove its parents from the general list.

The next step is simply to decide whether to do it again or not. Our goal is to have one cluster that is related to every other. If we only have one cluster left we return it; if not, we continue.
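Put together, the driver loop looks roughly like the sketch below. It reuses the names from above (generateClusterCalculation, updateMostSimilar, calculationsFutures) but with assumed signatures, a placeholder distance, and a simplified stand-in for the foldLeft version sketched earlier, so it is an outline of the shape rather than the original code.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

case class Cluster(vector: Vector[Double], parents: List[Cluster] = Nil)

def dist(a: Cluster, b: Cluster): Double =
  math.sqrt(a.vector.zip(b.vector).map { case (x, y) => (x - y) * (x - y) }.sum)

// One Future per "row" of comparisons: each cluster against the clusters after it.
def generateClusterCalculation(clusters: List[Cluster]): List[Future[Map[(Cluster, Cluster), Double]]] =
  clusters.tails.toList.collect {
    case x :: rest if rest.nonEmpty =>
      Future(rest.map(y => (x, y) -> dist(x, y)).toMap)
  }

// Merge all the partial distance maps and keep the pair with the smallest distance.
def updateMostSimilar(results: List[Map[(Cluster, Cluster), Double]]): ((Cluster, Cluster), Double) =
  results.reduce(_ ++ _).minBy(_._2)

@annotation.tailrec
def cluster(clusters: List[Cluster]): Cluster = clusters match {
  case Nil           => sys.error("nothing to cluster")
  case single :: Nil => single                      // only one cluster left: we are done
  case _ =>
    val calculationsFutures = generateClusterCalculation(clusters)
    val results = Await.result(Future.sequence(calculationsFutures), 1.minute)
    val ((a, b), _) = updateMostSimilar(results)
    // Merge the closest pair, nest its parents, and drop them from the working list.
    val merged = Cluster(a.vector.zip(b.vector).map { case (x, y) => (x + y) / 2 }, List(a, b))
    cluster(merged :: clusters.filterNot(c => c == a || c == b))
}
```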

I Noticed Something.

The Pearson Correlation score? It's the distance metric we've used all over the place to express the concept of distance. Essentially the Pearson Correlation looks through the vector data and gives a score between 1 and -1, where 1 means high positive correlation and -1 means high negative correlation. Turned into a distance, the score becomes a measure of how strongly correlated something is, from 0 to 1.
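For reference, here is a small sketch of a Pearson-based distance. It uses the common 1 - r convention; the exact way this series maps the score onto a 0-to-1 distance may differ, so take the final line as an assumption.

```scala
// Pearson correlation between two equal-length vectors, in [-1, 1].
def pearson(a: Vector[Double], b: Vector[Double]): Double = {
  val n = a.length.toDouble
  val (sumA, sumB) = (a.sum, b.sum)
  val (sumASq, sumBSq) = (a.map(x => x * x).sum, b.map(x => x * x).sum)
  val productSum = a.zip(b).map { case (x, y) => x * y }.sum
  val numerator = productSum - (sumA * sumB / n)
  val denominator = math.sqrt((sumASq - sumA * sumA / n) * (sumBSq - sumB * sumB / n))
  if (denominator == 0) 0.0 else numerator / denominator
}

// One common way to turn the correlation into a distance: perfectly correlated
// vectors end up at 0, uncorrelated ones further away.
def pearsonDistance(a: Vector[Double], b: Vector[Double]): Double = 1.0 - pearson(a, b)
```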

Next we will talk a bit about how we send and receive our results from a server. See you in Part 3!


Written by

£ha2

software developer | Functional programmer | Consultant
