Representations: The Most Important Thought Framework in Machine Learning

Sinan Onur ALTINUÇ · Published in Codable · Nov 27, 2021

Thinking with representations, and thinking about how information is processed, is a very useful and productive way to understand and build machine learning models, and even to come up with new ones. And I believe this way of thinking is not used deliberately as often as it should be.

OK, “most important” is really a subjective term, and what is most important depends on what you are doing. But how you can approach and think about problems and potential solutions is not talked about as directly as the technical material, and I believe it is of great importance.

What is this article about?

I am going to try to convince you that what we do in all of machine learning is transform information from one representational space to another. I would go even further and say that this is also how brains and general intelligence work.

Then I will use this idea as a thought framework to re-evaluate some well-known concepts and see if they make better sense in this light.

Representation + Evaluation + Optimization = (Machine) Learning

Looking at machine learning as a combination of representation, evaluation, and optimization is not a new idea, but it is a very useful one.

A very brief overview (a toy sketch follows the list):

Representation: How you (and your model) see the data; basically, the mathematical space the information resides in. (Example: encodings)
Evaluation: How you measure how well you are doing. (Example: loss functions)
Optimization: Your strategy for searching for better solutions. (Example: gradient descent)
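
To make the trio concrete, here is a minimal sketch of linear regression in NumPy with the three parts labeled in comments. The data and hyperparameters are made up purely for illustration.

```python
import numpy as np

# Toy data: y is roughly 3*x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = 3 * x + 1 + 0.1 * rng.normal(size=(100, 1))

# Representation: describe each input as the feature vector [x, 1]
# (the bias column is part of how we choose to represent the data)
X = np.hstack([x, np.ones_like(x)])
w = np.zeros((2, 1))

def evaluate(w):
    # Evaluation: mean squared error loss
    return np.mean((X @ w - y) ** 2)

for step in range(500):
    # Optimization: plain gradient descent on the loss
    grad = 2 * X.T @ (X @ w - y) / len(X)
    w -= 0.1 * grad

print(evaluate(w), w.ravel())  # loss near the noise floor, w close to [3, 1]
```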

For more on this, you can read this article, or the referenced paper (which is a good read as well).

Understanding Representations

In this article, we will focus on representations, which I subjectively consider the most important part of the trio.

Back in the days before deep learning was mainstream, feature engineering was the main factor behind successful machine learning models (it still is sometimes). The reason is that, no matter what data you are working with, if the features are easier to process and the important aspects of the data are more accessible, your model will perform better.

Today we have very successful deep learning models. Essentially, what “deep” in deep learning means is that we stack more layers on top of each other. Each of the layers and modules in the model tries to build a better representation for the solution of our problem. This is what we call Representation Learning (maybe the most important concept in this topic).

In the end, what learning comes down to is this:

Learning is having better representations of the data for the solution of your problems.

There is a theory that says learning is compressing the data. I think transforming the data into a more usable space is more important than just decreasing its volume, but compressing it by keeping the important parts also helps.

Representations in real life

I will give a few examples of how representing the same information in different ways makes a difference, not only for machines but for humans as well. I’m sure at some point in your life you have thought that how you say something is as important as what you are saying. This is a similar situation.

If we were to add two numbers written in Roman numerals:

CCVII + MIX = ?

Your brain would probably approach the problem by first “decoding” the numbers into a format (representation) it can make sense of. That would probably be the decimal system, because we are so used to working with it.

207 + 1009 seems an easier way to look at it, since we have already learned how to add numbers in the decimal system. Plus, you can do the addition with fewer rules and exceptions.
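
As a toy illustration of “decoding” into a friendlier representation, here is a minimal Python sketch that first changes the representation of the numbers and only then does the arithmetic (the helper function is mine, not a standard one):

```python
# Map each Roman symbol to its decimal value
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(numeral: str) -> int:
    """Change representation: Roman numeral string -> integer."""
    total = 0
    for symbol, next_symbol in zip(numeral, numeral[1:] + " "):
        value = ROMAN_VALUES[symbol]
        # A smaller value before a larger one (e.g. the I in IX) is subtracted
        if next_symbol != " " and value < ROMAN_VALUES[next_symbol]:
            total -= value
        else:
            total += value
    return total

print(roman_to_int("CCVII") + roman_to_int("MIX"))  # 207 + 1009 = 1216
```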

Even your personal perspectives and ways of looking at things can be related to how you represent things in your brain. An art major and an average Joe would see different things while looking at the same painting, and this might change how each of them processes that information.

Representations in computers

Let's talk about data structures. Let’s say you have a collection of numbers in a sequence.

If you put the numbers in an array, it is easy to access the 42nd number. This is a good way to represent the collection if you want random access. But if you occasionally want to remove elements from the middle of the sequence while keeping the order, a linked list could be a better solution.

So it all comes down to what information is present and in what form. This can make solving the problem much easier or much harder.
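
A minimal Python sketch of this trade-off, with a tiny linked-list class written just for illustration (not a production data structure):

```python
from dataclasses import dataclass
from typing import Optional

numbers = list(range(1000))
print(numbers[42])   # arrays/lists: random access by index is cheap
numbers.pop(500)     # but removing from the middle shifts every later element

@dataclass
class Node:
    value: int
    next: Optional["Node"] = None

# Same sequence as a singly linked list: no cheap numbers[42],
# but unlinking a node is a single pointer update once you stand next to it.
head = None
for v in reversed(range(1000)):
    head = Node(v, head)

node = head
while node.next is not None and node.next.value != 500:
    node = node.next
node.next = node.next.next  # remove the element holding 500
```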

Machine Learning Examples of Representational Thinking

In this section, we will look at some known methods with a focus on how the information is represented and how new representations are calculated.

Kernels

Kernels can basically be thought of as transformations of the representational space.

Space transformation using kernels. Source

The reason we used them so much with SVMs is that SVMs try to separate the data linearly. So if the data is represented in a space where the classes cannot be separated linearly, we should try to represent it in another space. In the example above, the transformation adds another dimension so that the classes can be separated with a plane.
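
Here is a minimal scikit-learn sketch of the idea, assuming scikit-learn is available: two concentric circles cannot be separated linearly in 2D, but adding x² + y² as a third dimension (an explicit version of what a kernel does implicitly) lets a linear model separate them with a plane.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

# Two concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

linear_2d = LinearSVC().fit(X, y)
print("accuracy in 2D:", linear_2d.score(X, y))   # poor, around chance

# Change the representation: append the squared radius as a third feature
X3 = np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])

linear_3d = LinearSVC().fit(X3, y)
print("accuracy in 3D:", linear_3d.score(X3, y))  # close to 1.0
```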

In most cases, what deep learning does is similar. With each layer, we want to transform the space into a more useful one to work with.

Autoencoders

Now let's try to see the intuition behind autoencoders.

We saw that we can learn better representations layer by layer if we have a problem definition and related data. But what if we don’t have enough labeled data to learn those representations? Can we still learn a better representation of the information, in a way that will hopefully be useful for other tasks? The answer is yes, and that is the intuition behind autoencoders.

What we do is transform the information in a way that still lets us reconstruct nearly the same data.

General architecture of Autoencoders. Source

In this case, we force the data into a smaller and smaller representational space and then try to recover the original information from it. What usually happens is that, to recreate the information correctly, the model is forced to squeeze and summarize the data in a meaningful way. We can then use this compressed, meaningful representation in our own tasks. Because it is a better representation overall, it generally improves performance.
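
A minimal PyTorch sketch of this idea, assuming flattened 28×28 inputs; the layer sizes are arbitrary choices for illustration, not a recommended architecture.

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        # Encoder: squeeze the input into a smaller representational space
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: try to reconstruct the original data from that space
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)          # compressed representation
        return self.decoder(z), z

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)              # stand-in batch; real data would go here
reconstruction, z = model(x)
loss = nn.functional.mse_loss(reconstruction, x)   # reconstruction error
loss.backward()
optimizer.step()
```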

Word Embeddings

Word embeddings work in a similar way to autoencoders. But this time, instead of creating a representation that can recreate the data, we build a representation that captures a word’s relation to its neighbours. Two popular ways to do this are Continuous Bag of Words and Skip-gram, but we will not dive into their details in this article.


An alternative to word embeddings is the bag-of-words representation, which represents a document by the count of each word it contains. However, this is a very large and very sparse vector, and it is not very information-dense. Word embeddings are much smaller and denser in comparison, and they can represent semantic relations between words.

Therefore we can say that word embeddings work because they provide a smaller and much better representational space. We now work in a space where we can process the meanings of words rather than the words themselves.
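
A minimal sketch using gensim (assuming gensim 4.x is installed); the toy corpus is far too small to learn meaningful vectors, it only shows the moving parts.

```python
from gensim.models import Word2Vec

# Toy corpus: a real one would have millions of tokenized sentences
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=1 selects Skip-gram; sg=0 would be Continuous Bag of Words
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

vector = model.wv["cat"]                      # a dense 50-dimensional representation
print(vector.shape)                           # (50,)
print(model.wv.most_similar("cat", topn=3))   # nearest words in that space
```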

CNNs

What do CNNs do in terms of representations?


Why do CNNs work? We could just as well create a fully connected network, connecting every pixel of the image to every node, and stack multiple layers. Theoretically, it has the capacity to do the same processing. But in practice, such networks don’t work nearly as well. Why?

One obvious reason is that with CNNs we don’t have to learn as many parameters; it’s a much more efficient way of using our data. Another way to look at it is, again, how we represent the image in each successive layer. At each layer, we learn a new representation of the image. In the first layers, the representations are more interpretable: the network learns something like Gabor filters, responding to edges and pattern changes in the image. Building on these, it learns to represent more complex, high-level patterns in the image. For example, if we are detecting human faces, it might learn to represent whether an eye is present at some location.

At the end of the layers, we have a much better representational space in which to look at the image, much better than just the intensities of the pixels, and this is what makes most computer vision applications possible.
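
A minimal PyTorch sketch of how each convolutional block re-represents the image: the spatial size shrinks while the number of channels (pattern detectors) grows. The exact sizes here are arbitrary.

```python
import torch
from torch import nn

# Each block re-represents the image: fewer pixels, more pattern channels
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

x = torch.rand(1, 3, 64, 64)           # a stand-in RGB image
for layer in features:
    x = layer(x)
    if isinstance(layer, nn.MaxPool2d):
        # prints [1, 16, 32, 32] -> [1, 32, 16, 16] -> [1, 64, 8, 8]
        print(x.shape)
```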

Transfer Learning

From this perspective, it is very easy to understand why transfer learning works.

Let’s say you have trained a model for general object detection using the ImageNet or COCO datasets. What you want to do is detect another object that is not present in those datasets.

You could use the same architecture with the data you have. But your data will almost always be much smaller than those big datasets. And even if you are detecting different objects, some visual properties are shared by many things in the world.

So you take a model pre-trained on those datasets. You are not interested in its predictions, so you cut off the head. What you have left is a model that can give you a smaller, useful representation of the image. Starting from this representational space is much more effective, because many properties of the image have already been learned and we need less effort to warp the space into a better representation for our problem.
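
A minimal torchvision sketch of the “cut the head” idea, using image classification rather than full object detection and assuming torchvision ≥ 0.13 for the weights API; the number of classes is a placeholder.

```python
import torch
from torch import nn
from torchvision import models

# Backbone pre-trained on ImageNet: already a good representational space
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the learned representation so only the new head is trained at first
for param in model.parameters():
    param.requires_grad = False

# "Cut the head": replace the final ImageNet classifier with our own
num_classes = 5                               # placeholder for your task
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

x = torch.rand(8, 3, 224, 224)                # stand-in batch of images
logits = model(x)                             # (8, 5): predictions in the new space
```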

Conclusion

It’s not an easy task to understand how certain machine learning models work, and it is even harder to decide which one would be a better solution than the others. Since most of the time you do not have the time or resources to try them all, you need to make informed decisions.

This way of thinking about learning and machine learning helped me in that regard.

Asking “What would be a better space to represent this?” or “What would be a good way to learn better representations?” might help you on your machine learning journey.
