Neural Networks: It’s all about the embeddings

Simplifying hybrid and complex models by understanding feature embeddings

Faris Hijazi
10 min read · Jan 24, 2022
Neurons. (source: Pixabay)

Introduction

Neural networks are a type of AI, and understanding neural network embeddings makes it possible to build complex models, from reinforcement learning agents that beat video games to deepfake generators.

Neural networks are often described as black boxes that are laborious to understand, since they automatically extract their own features through multiple layers. I'm here to tell you that it's possible to develop an intuition about how they learn, and how to understand complex and hybrid models.

Who am I?

I’m a BSc. Computer Engineering graduate who later specialized in machine learning and computer vision, currently working as a computer vision researcher.

And this article reflects my learning experience in deep learning with advanced/hybrid models.

What excites me about machine learning (particularly computer vision) is the ability to process and solve real-world problems like face recognition and self-driving cars. This is in contrast to hand-crafted features (classical methods), which work well only in constrained and narrow environments and require many engineering tricks like careful lighting and positioning of cameras and objects.

Example of classical image processing for bottle defects — taken from Parisa Tech

Who is this article for?

I’m writing this article for anyone with a basic background in machine learning or deep learning who is interested in expanding their understanding beyond ImageNet classifiers.

Transfer learning may be a topic that is taken for granted (and I don’t blame you): online courses go over it quickly and act as if it’ll magically work every time, and it doesn’t help that it can be implemented in a few lines of code. In deep learning there’s a lot of cramming at first, since most concepts are new to beginners.

I’ve read many transfer learning articles but I just keep coming across the same information:

use a pre-trained model (like ResNet or VGG16) to extract features learned from a similar dataset, and then retrain the last few layers (the classifier) for our specific use-case.

I just accepted that and took it for granted, and only later realized that I didn’t actually understand what I had read; I didn’t question it enough (we all need to be more critical when taking in information).
I was missing details like: how close does the data need to be? Why does this even work? Why remove only the last few layers?

I concluded that all these questions could be answered by understanding one thing: what on earth are those extracted features? What do the numbers mean and what do they represent?

Understand the numbers, and the rest will come. These values are usually a feature embedding vector.

For the sake of simplicity, most examples will focus on image classification problems, since they are the most familiar to beginners.

Article notes

To keep this article concise, I will try to link to other articles when possible, so that any needed extra reading can be done individually.

This article will focus on:

  • an intuition on what these extracted features/embeddings mean
  • some example applications

and will NOT focus on:

  • Code implementation
  • Re-explaining transfer learning
  • How a network is trained

By the end of this article, I’m hoping complex models will seem a bit less like magic.

Recap: Deep Neural Networks (DNNs)

Deep neural network forward pass — source

Fundamentally, deep neural networks learn pattern recognition. We will focus on the example of CNN (Convolutional Neural Network) classifiers, which are built to automatically extract features from an image.

Intermediate layers can be interpreted as a processed representation of raw data.

1. Dimensionality reduction

Given raw unstructured data (such as pixels from an image), it’s difficult for the network to directly make a decision (classifying dogs or cats).

Extracting an intermediate layer as a representation of the image significantly reduces the original data size, making it more usable and easier for deeper layers to finally decide on an output.

A CNN reduces the dimensionality, extracting only the important features, before making a decision. After the convolution layers have extracted those features, the data is passed to one or more fully connected layers; these receive high-level features, which makes it much easier for them to produce the final output.

These high-level feature vectors typically have a dimension of around 512, compared to the original input image at 224x224 = 50,176 dimensions.
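As a concrete (hedged) illustration, here is a minimal sketch of pulling out such a feature vector by dropping the final classification layer of a pre-trained ResNet-18; the use of PyTorch/torchvision and the exact layer layout are my own choices, and the 512-dimensional output is specific to this architecture.

```python
import torch
from torchvision import models

# Load a pre-trained ResNet-18 and drop its final fully connected (classifier) layer,
# keeping everything up to the global average pooling.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

# A dummy 224x224 RGB image; in practice you would load and normalize a real image.
image = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    features = feature_extractor(image)  # shape: (1, 512, 1, 1)
embedding = features.flatten(1)          # shape: (1, 512)
print(embedding.shape)                   # torch.Size([1, 512])
```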

This is a key distinction between shallow learning (regular machine learning) and deep learning. We can use shallow learning for structured data and high-level features (like CSV files), and deep learning for raw unstructured data like audio, video, images, text and more.

Once the data becomes high-level features, you can process it using shallow learning methods; logistic regression or SVMs work better with a small representation of an image than with raw pixels.

You can think of the fully-connected layers (fc_3, fc_4, …) at the end of the network as logistic regression layers.

2. Projection to a simpler space

Simpler means that it’s easier to deal with for the problem at hand. Simpler could mean: linearly separable, linearly correlated with the output, or simply containing high-level features that could be directly used to make a decision.

Linearly separable just means that it’s possible to draw a line that separates the classes.
A single neuron can draw one line, which is why we need many neurons in a DNN.

A: Linearly separable (can draw line), B: not linearly separable (can’t draw line)

If the starting features (input to the DNN) look like B, then a single neuron can’t separate that, but many neurons can draw something close to a circle if needed.

FYI: it turns out that it’s easier to just make the network deeper instead of having a shallow network with many neurons (as that could lead to overfitting).

What happens is that with a deep network, the input space (B) actually gets projected/transformed to another domain (A) that is easier to deal with (linearly separable). So the neural network is just making things simpler and simpler until a final decision can be made. As mentioned before, at the final layer you can even use simple machine learning methods like logistic regression or SVM.
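To make the "A vs. B" picture concrete, here is a small scikit-learn sketch (my own illustration, not one of the article’s figures): a single linear model fails on concentric circles, while a small multi-layer network separates them by first projecting the data into a friendlier space.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# A dataset shaped like case B: one class sitting inside a ring of the other class.
X, y = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single linear decision boundary (roughly "one neuron") cannot separate the circles.
linear = LogisticRegression().fit(X_train, y_train)
print("linear accuracy:", linear.score(X_test, y_test))  # close to chance (~0.5)

# A small MLP transforms the inputs until they become linearly separable.
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)
print("MLP accuracy:", mlp.score(X_test, y_test))         # close to 1.0
```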

Embeddings

Embeddings, latent vectors, and feature vectors are vectors representing some information about our domain/task; they are outputs of the network or intermediate values inside it.

Information inside embeddings

What the embeddings contain really comes down to the training process and loss function. Networks can be trained with many objectives, be it discriminating classes with a softmax function, pulling classes apart with a contrastive loss, BERT-style pre-training, or a GAN discriminator telling us whether an image is real or fake. Each of these methods (even if they all produce embeddings of the same shape) will give totally different meanings to the values in those embeddings.

The class structure of the data is also important. For example, a face recognition network is made to extract discriminative facial features, meaning it shouldn’t pay attention to facial hair or eyeglasses, since we want it to identify the same person whether or not they shaved their mustache/beard.

Examples

This section explains by example.

If you’re more comfortable with the encoder-decoder paradigm, then just think of embeddings as the encoding

An encoder-decoder will effectively compress the data to the z latent vector (you can still call this an embedding).

Encoder-decoder

If you have any NLP experience, then you have probably heard of word-embeddings.

Word embedding space

Embeddings are a more meaningful way to represent a class than a one-hot vector (one-hot vectors are just a special case of embeddings). The idea of word embeddings is that words similar in meaning should have embedding vectors that are very close together, while being far from other words. "Close" and "far" are defined by a distance metric; this could be any distance metric, such as L2 (Euclidean) distance or cosine distance. One property of cosine distance is that it doesn’t depend on the magnitude of the embedding vector, only on its direction.
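As a small sketch of what those distances look like in code (with made-up 3-dimensional vectors, purely for illustration):

```python
import numpy as np

# Toy 3-dimensional "word embeddings", invented for illustration only.
king  = np.array([0.8, 0.65, 0.1])
queen = np.array([0.75, 0.7, 0.15])
apple = np.array([0.1, 0.2, 0.9])

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    # 1 - cosine similarity; depends only on direction, not magnitude.
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean(king, queen), euclidean(king, apple))              # small vs. large
print(cosine_distance(king, queen), cosine_distance(king, apple))  # small vs. large
```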

An example of movie embedding in two-dimensional space — taken from here

The idea of an autoencoder is to have a domain-specific encoder that compresses the input down to only the most important features. The proof that this works is that the decoder can somehow recreate the input, which means, just as information theory tells us, the fixed-size embedding vector (or latent vector) must have the capacity to hold the information needed to reconstruct the compressed input.
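Here is a minimal PyTorch sketch of that idea for flattened 28x28 images (the layer sizes and the 32-dimensional latent vector are arbitrary choices of mine): the encoder squeezes the input down to a small latent vector z, and the decoder is trained to reconstruct the input from z alone.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=28 * 28, latent_dim=32):
        super().__init__()
        # Encoder: compress the input down to a small latent/embedding vector z.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstruct the input from z alone.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)           # the embedding / latent vector
        return self.decoder(z), z

model = AutoEncoder()
x = torch.randn(16, 28 * 28)                      # a dummy batch of flattened images
reconstruction, z = model(x)
loss = nn.functional.mse_loss(reconstruction, x)  # train by minimizing reconstruction error
print(z.shape, loss.item())                       # torch.Size([16, 32])
```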

Key point: the numbers in the embedding vector aren’t always disentangled, with each dimension corresponding directly to something interpretable like the movie embeddings above (the "blockbuster" and "adult" dimensions).
The embedding doesn’t have to be human-readable, but it should contain the information; as long as the neural network layers after the embedding can make sense of it, it will work.
What does it mean to make sense? Well, this depends on the use case. If the task is facial recognition, then the embeddings of similar faces should be a short distance apart, while visually different faces should be a large distance apart in the embedding space.

Face recognition objective using triplet loss
Extracting face embeddings
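A hedged sketch of that objective using PyTorch’s built-in triplet margin loss (the embedding size and margin are arbitrary choices here; a real face model learns these embeddings with a CNN trained on face images):

```python
import torch
import torch.nn as nn

embedding_dim = 128
triplet_loss = nn.TripletMarginLoss(margin=0.2)

# Dummy embeddings standing in for the output of a face-embedding network:
# anchor and positive are two images of the same person, negative is someone else.
anchor   = torch.randn(8, embedding_dim, requires_grad=True)
positive = anchor + 0.05 * torch.randn(8, embedding_dim)  # close to the anchor
negative = torch.randn(8, embedding_dim)                   # far from the anchor

# The loss pushes d(anchor, positive) + margin below d(anchor, negative).
loss = triplet_loss(anchor, positive, negative)
loss.backward()
print(loss.item())
```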

Using embeddings

Where the fun begins

I’ll go over some examples of slightly more complex models and demonstrate the power of having different types of embeddings.

Multispeaker TextToSpeech toolbox

This toolbox extends a regular text-to-speech model by conditioning it on speaker embeddings, allowing you to choose the voice. It works by having a SpeakerEncoder network trained to extract style embeddings for many voices, forcing it to pick up the features that separate speakers, such as timbre and pitch, while ignoring the content. A separate encoder is in charge of the content only.
The power of having two encoders is that the content and the speaking voice are disentangled, allowing you to swap speaking voices for the same sentence, as sketched after the architecture figure below.

TextToSpeech toolbox architecture. https://arxiv.org/pdf/1806.04558.pdf
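A rough sketch (my own simplification, not the toolbox’s actual code) of what "conditioning on a speaker embedding" can look like: broadcast the speaker embedding across time and concatenate it with the content encoder’s per-timestep outputs before decoding.

```python
import torch
import torch.nn as nn

batch, time_steps = 4, 50
content_dim, speaker_dim = 256, 128

# Stand-ins for the two encoders' outputs: per-timestep content features from the
# text encoder, and one fixed-size speaker embedding per utterance.
content = torch.randn(batch, time_steps, content_dim)
speaker = torch.randn(batch, speaker_dim)

# Broadcast the speaker embedding across every timestep and concatenate.
speaker_per_step = speaker.unsqueeze(1).expand(-1, time_steps, -1)
conditioned = torch.cat([content, speaker_per_step], dim=-1)  # (4, 50, 384)

# A toy "decoder": in a real system this would generate spectrogram frames.
decoder = nn.GRU(content_dim + speaker_dim, 80, batch_first=True)
mel_frames, _ = decoder(conditioned)
print(mel_frames.shape)  # torch.Size([4, 50, 80])
```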

Image translation (FUNIT paper)

This architecture aims to take the pose from one image and apply it to another image.
It works by using two encoders and one decoder: each encoder outputs embeddings, and the decoder (blue) reconstructs an image from both of these embeddings back into pixels.

FUNIT architecture

The content encoder (pink) is in charge of the animal’s pose: it takes in the image on the left and outputs embeddings containing the pose of the animal.
The class encoder (green) is in charge of the animal’s breed/style: it takes one or more images and outputs the average embedding of the breed; you can even merge different animals if they look similar enough. Finally, the decoder takes both of these embeddings and turns them back into an image.
You can even try out the interactive demo here.

FUNIT example output

If you’re interested in GANs (like StyleGAN: the cool AI that generates realistic human faces and deepfakes) I highly advise reading this article: A-new-way-to-look-at-GANs

Bringing everything together

Now, to test your understanding, think about this problem: what if I had many images and I wanted to cluster similar images together? Let’s say these images are 512x512 = 262,144 pixels each; comparing them pixel-wise is neither effective nor efficient.

See if you could come up with a method, take about a minute to think, then continue reading.

Photo by Elijah Hiett on Unsplash

Done thinking? Here’s my take:

One possible solution is to use a pre-trained model to extract features, giving a fixed-length feature vector for each image. We pass each image through the feature extractor network and get out a fixed-length feature vector (for example, 1024 values). We have reduced the image from 512x512 = 262,144 values down to 1024. Now that we have these 1024 features, we can use simple shallow learning (SVM, logistic regression, etc.) or some type of clustering (KMeans, mean shift, etc.).
After all, the original deep learning model only had a fully connected layer after the feature extractor, which is basically a logistic regression classifier.
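A hedged sketch of that pipeline, assuming you already have an array of extracted feature vectors (for example from a pre-trained CNN, as shown earlier): cluster the embeddings with KMeans, and map them to 2D with t-SNE for plotting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Stand-in for real data: 500 feature vectors of length 1024, one per image,
# extracted beforehand with a pre-trained network.
features = np.random.randn(500, 1024).astype(np.float32)

# Group similar images by clustering their embeddings instead of their pixels.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(features)
print(kmeans.labels_[:20])  # cluster assignment per image

# Project the 1024-D embeddings down to 2-D for a similarity map like the one below.
coords_2d = TSNE(n_components=2, random_state=0).fit_transform(features)
print(coords_2d.shape)      # (500, 2)
```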

Doing so would result in an image similar to the below:

Images mapped for similarity using t-SNE, based on Google ML

Summary

  • The last few layers are just logistic regression and you can use SVMs too
  • The meaning of the embeddings depends on the training process and loss function
  • You can use these embeddings in creative ways beyond classification; this is especially useful in GANs


This was my first article; I hope you benefited from it. Feel free to leave feedback and constructive criticism. Have a good day.
