Machine Learning²: Hierarchical Clustering

I promised myself I would blog about my journey with machine learning. Machine learning learning (hence the square). A year later, here I am writing this sentence. So without further ado, let's get started!

The overarching goal is to implement an algorithm that runs as quickly as possible by taking advantage of all the resources available in a modern computer. Once we have implemented the algorithm, we render the results visually in a browser. I chose the browser for its familiarity and ubiquity. I also believe the browser is a great place for simple data visualization: if you can make a game in the browser, you can render some data.


We will build this using the Scala language. If you are not familiar with Scala, here is a small introduction. Scala is a language created by Martin Odersky, a German computer science professor (among other things) at École Polytechnique Fédérale de Lausanne in Switzerland (I can't say it either). It lets you write programs in both an object-oriented style and a functional programming style (recursion and immutable data instead of mutable state). In this lesson I will be building everything in a functional style. State is the enemy in programming, and that is one of many reasons it is interesting to build this functionally. Let's see if you like it!

Let's Get Started…

Oh yes, I almost forgot! What is a hierarchical clustering algorithm? Well, imagine you have 5 different items in space and you want to determine how close they are to each other, one merge at a time.

If you start off with these objects in different positions,

we simply look for the two that are closest together initially, relative to the rest of the objects.

We then consider those objects one cluster, and look for the object that is closest to that new cluster. The objects being merged are circled in grey, and they merge into a cluster circled in red.

We repeat the process over and over until only a single cluster remains.
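The merge loop described above can be sketched in a few lines of Scala. This is a minimal, hypothetical sketch (the names and single-linkage choice are my assumptions, not the code we will build later): points are vectors of doubles, clusters are vectors of points, and we repeatedly merge the two closest clusters until one remains.

```scala
object ClusteringSketch {
  type Point = Vector[Double]
  type Cluster = Vector[Point]

  // Euclidean distance between two points.
  def distance(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Single-linkage distance between clusters: the closest pair of points.
  def clusterDistance(a: Cluster, b: Cluster): Double =
    (for (p <- a; q <- b) yield distance(p, q)).min

  // Repeatedly merge the two closest clusters until a single one remains.
  @annotation.tailrec
  def cluster(clusters: Vector[Cluster]): Vector[Cluster] =
    if (clusters.size <= 1) clusters
    else {
      // All index pairs (i, j) with i < j.
      val pairs = for {
        i <- clusters.indices
        j <- clusters.indices if i < j
      } yield (i, j)
      // Pick the closest pair of clusters and merge them.
      val (i, j) = pairs.minBy { case (a, b) => clusterDistance(clusters(a), clusters(b)) }
      val merged = clusters(i) ++ clusters(j)
      val rest = clusters.zipWithIndex.collect { case (c, k) if k != i && k != j => c }
      cluster(rest :+ merged)
    }
}
```

Note the recursion instead of a `while` loop: each step produces a new, smaller vector of clusters, with no mutable state.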

What makes this really interesting is that the notions of space and distance are abstracted, which makes clustering something we can use to determine similarities between lots of different things. There is a lot to say about this, and whole mathematical fields (topology and metric theory) make up the meat of it, but this blog is about implementation.
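One way to capture that abstraction in Scala is a small type class for distance. This is just an illustrative sketch (the `Metric` name and instances are hypothetical, not part of the project code): clustering code written against `Metric[A]` never needs to know what it is measuring.

```scala
// A distance function over some type A.
trait Metric[A] {
  def distance(a: A, b: A): Double
}

object Metric {
  // Straight-line (Euclidean) distance on numeric vectors.
  val euclidean: Metric[Vector[Double]] = (a, b) =>
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // City-block (Manhattan) distance: swap it in and nothing else changes.
  val manhattan: Metric[Vector[Double]] = (a, b) =>
    a.zip(b).map { case (x, y) => math.abs(x - y) }.sum
}
```

Swapping the metric changes what "similar" means without touching the clustering logic at all.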

Let's Get Started Already!!

Ok ok, fine! So let's think about what we will need to make all this possible at a high level of abstraction. We need to

  1. Ingest the data from a file (a server endpoint would be fine too, but in this example a file is most suitable).
  2. Structure it so it can be read by algorithms meaningfully.
  3. Create the algorithm with concurrency in mind.

To ingest the data, let's make a simple class called DataImporter.scala. When we import the data, we store the result in an object that acts like an in-memory table for our data.

It is in this “container” that we store what we import from our file.

importFile is a function that imports the file we want and checks that it exists.

On line :9: we then read all the lines of the data file and verify that each line is as long as the header, to make sure no data is missing. We then validate each cell on lines :14 to 20:, returning an Option[Double] in case we find an invalid cell. Now all we have to do is check that our Vector[Option[Double]] always has a value in every position.
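Since the original listing isn't reproduced here, this is a rough reconstruction of the steps just described; the names and CSV details are my assumptions, not the actual DataImporter code.

```scala
import java.io.File
import scala.io.Source
import scala.util.Try

object DataImporterSketch {
  // Parse one cell; None marks an invalid (non-numeric or empty) value.
  def parseCell(cell: String): Option[Double] =
    Try(cell.trim.toDouble).toOption

  // Import the file, checking first that it exists, then keep only rows
  // that are as long as the header and parse every cell to Option[Double].
  def importFile(path: String): Vector[Vector[Option[Double]]] = {
    require(new File(path).exists, s"No such file: $path")
    val lines = Source.fromFile(path).getLines().toVector
    val header = lines.head.split(",")
    lines.tail
      .map(_.split(",", -1).toVector)         // keep empty trailing cells
      .filter(_.length == header.length)       // drop rows with missing cells
      .map(_.map(parseCell))                   // Vector[Option[Double]] per row
  }
}
```

A row survives with `None` in a cell rather than being silently dropped, which is exactly what lets us count failures later.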

One reason we do this is that when the description of an object has missing pieces, we don't want to pretend we know what it looks like. Think of it as a face with the nose or eyes missing. With any part of the face missing, it could be an entirely different person than you thought.

We then do something a little bit crazy. We want to store both the results that were successful and those that failed; we can use this sort of information for analytics and more. The foldLeft creates a 2-tuple of Vector[Vector[Option[Double]]]. The two vectors of vectors in the tuple store the set of successes and the set of failures. In this case we use this information to assert a maximum percentage of dirty data, 10% by default. As we compute, lines :25 to 28: update the failures while lines :17 to 20: update the successes. We use these values to check the failure rate within lines :40 to 45: and pass only when the percentage is below the threshold.
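The shape of that foldLeft looks roughly like the following sketch; again the names and the exact threshold semantics are my assumptions about code not shown here.

```scala
object CleanlinessSketch {
  type Row = Vector[Option[Double]]

  // One foldLeft pass splitting rows into (successes, failures).
  def split(rows: Vector[Row]): (Vector[Row], Vector[Row]) =
    rows.foldLeft((Vector.empty[Row], Vector.empty[Row])) {
      case ((ok, bad), row) =>
        if (row.forall(_.isDefined)) (ok :+ row, bad) // every cell parsed
        else (ok, bad :+ row)                          // at least one missing cell
    }

  // Fail loudly when more than maxFailureRate of the rows are dirty.
  def assertClean(rows: Vector[Row], maxFailureRate: Double = 0.10): Vector[Row] = {
    val (ok, bad) = split(rows)
    val rate = if (rows.isEmpty) 0.0 else bad.size.toDouble / rows.size
    require(rate <= maxFailureRate, f"Too many dirty rows: ${rate * 100}%.1f%%")
    ok
  }
}
```

Keeping the failures around, instead of discarding them, is what makes the failure-rate check (and any later analytics) a one-liner.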

Do You Have Any Intention Of Getting to the Algorithm…

Yes I do, please wait for the next installment, called Machine Learning²: Hierarchical Clustering Part 2. :-D It will be all about the nitty-gritty of clustering algorithms. Trust me, you needed a bit of an intro. You'll appreciate it when you get older.

This series will be in 5 parts.

last edited date: 2017 January 20th.