Opening the Black Box of Machine Learning: let’s see what’s happening

Geke Pals
Published in emptycart
6 min read · Aug 10, 2018

In my last post I discussed how machine learning can improve product curation in e-commerce. I explained how, instead of adding soft specs manually, a clever machine learning algorithm could take over the work and classify products on the basis of a small set of examples. However, one of the downsides of such an algorithm is that we don’t know exactly what it is doing. In this final blog, I discuss my efforts to open up this black box.

The Black Boxes of Machine Learning

The ‘black box’ is a popular metaphor for a system where you can only observe the inputs and the outputs, while the system’s inner workings are unknown to you. The term is most commonly used in computer science, but it also appears in mathematics and physics. Generally speaking, black boxes are most common in abstract systems that are far removed from human intuition.

Input → something happens of which I have no idea → Output

With the rise of machine learning came a rise in mentions of black boxes. There are two kinds of black boxes in machine learning (ML): an ML-specific black box, which concerns how well an ML-developer can understand how an ML-algorithm learns, and a more general computer science black box, which concerns how well users without technical knowledge can understand what a computer system (and thus also an ML-system) does.

The Machine Learning-specific black box

There are two ways of programming an ML-algorithm: you can use existing libraries without extensive knowledge of how the inner algorithms work, or you can code an algorithm from scratch, for which you need some theoretical and mathematical knowledge of machine learning. For our case, let’s assume we have a very knowledgeable ML-developer, who has studied the inner workings of different algorithms. Even for such a well-educated developer, there will still be some parts of machine learning she will never understand: the black boxes.

Let’s return to our example of adding soft specs to a dataset of headphones (check out the previous blogs if you haven’t read them yet!). I presented the ML-algorithm with a set of 20 headphones that fall into the ‘run-and-piano’-headphones category, and a set of 20 headphones that fall into the ‘no run-and-piano’-headphones category. On this basis, the algorithm classifies every item in the set of over 10,000 headphones as belonging to one of these two categories.

This categorization is based on the product’s properties: using statistical analysis, the algorithm can ‘recognize’ that run-and-piano-headphones tend to have the property of being lightweight, which is indeed convenient for running. The combination of all the properties that are found in the positive set, and are explicitly not found in the negative set, allows the algorithm to classify the headphones. We can analyze these properties and conclude that the algorithm has likely selected on them, but we don’t know this for sure! The algorithm won’t hand us a report after classifying, in which it explains that it found these properties to be relevant for this case. It will only hand us the results of the classification.

Broadly speaking, this means that we cannot zoom in on how an ML-algorithm learns and makes predictions. It is a classic example of a black box.

The general computer science-black box: the problem of understanding computer systems

People outside the IT-world usually don’t understand the exact workings of a computer system. That makes sense, since it is not an expertise that they need to master. However, it can be frustrating when it feels like information is being withheld from you. Think, for example, of how certain ads are shown to you, or how Amazon decides what to recommend to you. Users can’t access the information that led to a certain decision. This black box can be resolved quite simply: just be transparent about what information is used. So, for example, Amazon can tell you:

“On the basis of you buying book x and book y, and the fact that a lot of people who bought these books also liked book z, I recommend that you look at book z as well.”

This makes the process transparent, even for users who know nothing about how computer systems work. It is a sign of user-friendliness and fairness towards your user.

Resolving both black boxes

So, in short, we need to make machine learning algorithms understandable for both ML-developers and users who are unfamiliar with ML. For the soft specs-case this means showing on what basis (which properties) the algorithm learns. This is where, with some clever thinking, we can resolve these black boxes. The approach I take can be seen as machine learning statistics in reverse.

Fair warning: there will be (very elementary) maths.

After learning, the ML-algorithm will give me the classified set of headphones. Let’s say that 1,241 out of 10,000 have been classified as ‘run-and-piano’-headphones. This set of headphones will be called the cluster. The task is now to reason backwards to what the algorithm has done. First, I calculate the size of the cluster as a proportion of the whole set: (1,241/10,000)*100 = 12.41%. Next, for each property, I calculate how often the property appears in the cluster as a proportion of how often it appears in the whole set. For example, the property ‘bluetooth’ appears in the cluster 582 times, and in the whole set 4,870 times. This gives a proportion of (582/4,870)*100 = 11.95%. In comparison, the property ‘lightweight’ appears in the cluster 893 times, and in the whole set 1,478 times, which gives a proportion of (893/1,478)*100 = 60.42%.

The proportion of ‘bluetooth’ is close to the cluster proportion (11.95% vs. 12.41%), which means that the number of products with the property ‘bluetooth’ in the cluster is as expected and therefore not remarkable. However, the proportion of ‘lightweight’ is much higher than the cluster proportion (60.42% vs. 12.41%). This means that the number of products with the property ‘lightweight’ in the cluster is much higher than expected, and therefore very remarkable. We can say that the property ‘lightweight’ is over-represented in the cluster, and thus important for this cluster.
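The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch that uses the counts quoted in the text; the names `percentage`, `property_counts` and `ratio` are illustrative and not part of the original system, and the ratio to the cluster proportion is an added convenience for comparing the two numbers.

```python
# Counts taken from the example in the text.
TOTAL_PRODUCTS = 10_000
CLUSTER_SIZE = 1_241

# property name -> (count in cluster, count in whole set)
property_counts = {
    "bluetooth": (582, 4870),
    "lightweight": (893, 1478),
}


def percentage(part, whole):
    """Return `part` as a percentage of `whole`."""
    return part / whole * 100


# Baseline: the share of all products that ended up in the cluster.
cluster_pct = percentage(CLUSTER_SIZE, TOTAL_PRODUCTS)
print(f"cluster: {cluster_pct:.2f}%")

for name, (in_cluster, in_whole_set) in property_counts.items():
    prop_pct = percentage(in_cluster, in_whole_set)
    # Assumption (not in the original text): express the comparison as a
    # ratio, where a value near 1 means "as expected".
    ratio = prop_pct / cluster_pct
    print(f"{name}: {prop_pct:.2f}% ({ratio:.2f}x the cluster proportion)")
```

A ratio close to 1 corresponds to the unremarkable ‘bluetooth’ case, while a ratio well above 1 corresponds to the remarkable ‘lightweight’ case.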

The ideal case would be to find a property with a proportion of 100%. This means that all instances of that property have been classified in that cluster, i.e. every product with that property belongs to the cluster. The soft spec describing the cluster can then be linked to that property. For example, if I want to classify the soft-spec case of ‘headphones suitable for making phone calls’, the property ‘bluetooth’ will likely have a high proportion, and thus be very relevant to the cluster. If that proportion reached 100%, you could conclude that every headphone with the property ‘bluetooth’ is classified as suitable for calling. Note that the reverse does not follow from this number alone: it does not tell you that every headphone suitable for calling has the property ‘bluetooth’.

A simple algorithm would calculate the proportions of all properties, and return the top 10 properties that are most relevant to the given cluster. This serves a double function:

  1. It lets the ML-developer verify whether the most relevant properties make sense. If they do, the algorithm has likely classified most products correctly. If, however, the algorithm returns the property ‘heavyweight’ as most relevant for running-headphones, something has likely gone wrong.
  2. It is educational for a user to see what a certain soft spec actually means: if the property ‘lightweight’ is most relevant for running-headphones, the user knows that when looking for running-headphones, they are actually looking for lightweight headphones (in combination with the other relevant properties). This enables the user to learn about headphone properties in an accessible way.
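Such a ranking could look roughly like the following sketch. It assumes the per-property counts are already available as two dictionaries; the function name `top_properties` and its parameters are hypothetical, and the example data is limited to the two properties quoted earlier in the text.

```python
def top_properties(cluster_counts, total_counts, n=10):
    """Rank properties by the share of their occurrences that fall in the cluster.

    cluster_counts: property name -> number of occurrences in the cluster
    total_counts:   property name -> number of occurrences in the whole set
    Returns a list of (name, percentage) pairs, highest percentage first.
    """
    proportions = {}
    for name, in_cluster in cluster_counts.items():
        in_whole_set = total_counts.get(name, 0)
        if in_whole_set == 0:
            continue  # no total count known: skip to avoid dividing by zero
        proportions[name] = in_cluster / in_whole_set * 100
    ranked = sorted(proportions.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Counts from the example in the text.
cluster_counts = {"bluetooth": 582, "lightweight": 893}
total_counts = {"bluetooth": 4870, "lightweight": 1478}

for name, pct in top_properties(cluster_counts, total_counts):
    print(f"{name}: {pct:.2f}%")
```

One caveat with this design: a property that occurs only a handful of times can reach a very high proportion by chance, so in practice it may be worth requiring a minimum number of occurrences before a property is ranked.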

This simple math-algorithm will not completely resolve the machine learning black box, but it will at least leave the box ajar and give some insight into what is happening in the machine’s brain.

This was the last part of my three-part series about machine learning and product curation. Did you read all of them?

  1. Introduction: The Problem of Product Curation
  2. E-Commerce & Machine Learning
  3. Opening the Black Box of Machine Learning: let’s see what’s happening (you are here)

Want to know more about our curation system? Find us at meetfeli.com.
