Improving Siamese Network Performance

Disclaimer: I am by no means an expert on Machine Learning or Deep Learning. Everything I state below is based purely on my own experiments.

Let’s start by deciphering the title: Improving, Siamese Network, and Performance. First, what is a Siamese Network?

fig 1. Siamese Network Architecture

For anyone unfamiliar with the diagram above: a Siamese Network is a neural network architecture that compares two input images and decides whether they are the same or not. The definition of ‘same’ may vary. It is commonly used for face recognition.

So, what is wrong with a Siamese Network?

Let’s take an example: a face recognition system. Maybe your company has 100 employees. A bigger company may have 500 or 1000 employees, and a much bigger one may have tens of thousands. As the AI team, you are asked by your employer to build a face recognition system that replaces the usual fingerprint system and name tags for getting into the building. Let’s say you have succeeded in building a state-of-the-art face recognition system that recognizes faces with 99.9% accuracy.

After deploying to production (used daily by the employees), you measure that the system takes 0.005 seconds (5 milliseconds) to compare a pair of images, which is pretty good. But your company has 1000 employees, which means you need a whole 5 seconds (1000 x 0.005) to decide whether a person is really one of your employees and should be given access. In a company of that size, maybe you have 5 entry lanes. So, on average, you will need around 1000 / 5 x 5 = 1000 seconds (around 17 minutes) of fully busy lanes for all of the employees to get through.
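The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and it assumes that every lane is always busy and that each query face is compared against every enrolled face one pair at a time.

```python
# Back-of-the-envelope queue time for the entry-lane example.
SECONDS_PER_COMPARISON = 0.005   # 5 ms to compare one pair of images
EMPLOYEES = 1000                 # enrolled faces to compare against
LANES = 5                        # parallel entry lanes


def seconds_per_person(n_enrolled, t_pair=SECONDS_PER_COMPARISON):
    """One query face is compared against every enrolled face, pair by pair."""
    return n_enrolled * t_pair


def total_entry_seconds(n_people, n_lanes, n_enrolled):
    """Time until everyone is through, assuming all lanes stay busy."""
    people_per_lane = n_people / n_lanes
    return people_per_lane * seconds_per_person(n_enrolled)


per_person = seconds_per_person(EMPLOYEES)
total = total_entry_seconds(EMPLOYEES, LANES, EMPLOYEES)
print(f"{per_person:.1f} s per person, {total:.0f} s in total (~{total / 60:.0f} min)")
```

Note that the per-person time grows linearly with the number of enrolled employees, which is why the problem gets worse as the company grows.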

Okay, enough of that hyperbolic example. But I hope you have grasped the problem: evaluating a Siamese Network against your whole dataset is slow.

How to make it faster?

So, I ran some experiments on this network.

fig 2. Siamese Network Initial Architecture

Credits to for introducing me to the network I drew above. The dataset used in this example (and the following ones) is the INRIA Holidays Dataset. The idea includes vectorizing the images first to make training faster. The vectors are obtained by forward propagating the images through ResNet50, with some modifications to the network. With 0.5 dropout on each FC layer, I get around 97% validation accuracy after 20 epochs.

fig 3. Example Query for The Image Recognition

Let’s evaluate the network by taking a random image from the dataset and making the network find the most similar image. (It should return the same image.)

fig 4. Looping Implementation

The dataset has 1491 images, and the network needs more than 2 seconds to find the most similar image in the dataset. (This is much faster than the hyperbolic example because the images are already vectorized.)
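To make the looping approach concrete, here is a minimal NumPy sketch. It is not the code from fig 4: `score_pair` is a hypothetical stand-in for the trained Siamese head (it just turns a squared distance into a score), and the gallery is random data with much smaller sizes than the real (1491, 3840) vectors. Only the structure matters: one scoring call per pair, then an argmax.

```python
import numpy as np


def score_pair(a, b):
    """Hypothetical stand-in for the trained Siamese head (NOT the real network).

    Returns a score in (0, 1]; identical vectors score exactly 1.0.
    """
    return float(np.exp(-np.mean((a - b) ** 2)))


def most_similar_loop(query, gallery):
    """Score the query against every gallery vector, one pair per call."""
    scores = [score_pair(query, vec) for vec in gallery]
    return int(np.argmax(scores)), scores


rng = np.random.default_rng(0)
gallery = rng.normal(size=(200, 64))   # the article uses (1491, 3840)
query_index = 17                       # pick an image that is in the gallery
best, scores = most_similar_loop(gallery[query_index], gallery)
print(best == query_index)
```

With a real network, each call to `score_pair` would be a separate forward pass, which is where the time goes.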

Someone may scream: why didn’t you use batches? Well, let’s take a look.

fig 5. Batched Implementation

Yeah! This is it: we get a 10x improvement just by changing the implementation to batches. But is this really the best performance? I hope not, and I really don’t think so. Sub 300 ms is still bad for me. There should be some way to get it below 100 ms.
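A batched version of the same search might look like the sketch below. Again, this is not the code from fig 5: `score_batch` is a hypothetical stand-in scorer, the batch size of 64 is an arbitrary choice, and the data is random and small. The point is that each call now scores a whole chunk of pairs instead of a single one.

```python
import numpy as np


def score_batch(query, chunk):
    """Hypothetical stand-in scorer, vectorized over the rows of `chunk`."""
    # `query` has shape (dim,), `chunk` has shape (batch, dim).
    return np.exp(-np.mean((chunk - query) ** 2, axis=1))


def most_similar_batched(query, gallery, batch_size=64):
    """Score the query against the gallery in chunks of `batch_size` rows."""
    scores = np.concatenate([
        score_batch(query, gallery[start:start + batch_size])
        for start in range(0, len(gallery), batch_size)
    ])
    return int(np.argmax(scores)), scores


rng = np.random.default_rng(0)
gallery = rng.normal(size=(200, 64))   # the article uses (1491, 3840)
query_index = 17
best, scores = most_similar_batched(gallery[query_index], gallery)
print(best == query_index, scores.shape)
```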

Convolutional Siamese Network

After doing some research and taking some courses on Coursera (big thanks to Andrew Ng for his Deep Learning class), I stumbled upon a piece of advice which I think is the solution to this problem.

If evaluating with a for loop is slow, then why don’t we do it convolutionally? It will look like this:

fig 6. Convolutional Siamese Network

I am using cubes for the convolution example since I believe most of you who know about convolutions will think of cubes. (In reality, I am using 1D convolutions over 2D planes instead of 2D convolutions over 3D volumes.)

Let’s say you have an image vector of size (1, 3840) (see fig 3). What I do is combine all 1491 vectors into one matrix of size (1491, 3840). That will be our ‘Image 2 Vectorized’ from fig 6. The ‘Image 1 Vectorized’ is (1, 3840), so we can easily tile it (for example with numpy.tile) into (1491, 3840). Then we merge the two along the second axis and apply C1, C2, and the sigmoid. Well, wait. What is C1?

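It is a convolutional layer with a kernel size of 1 and a stride of 1. The number of kernels can follow the sizes we set in the previous Siamese Network experiments. C2 and the sigmoid use the same kind of convolutional layer. Because the kernel covers only one position at a time, every pairing is scored independently with the same weights, which is what the loop did, only now in a single pass.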
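The tiling and merging step described above can be sketched with NumPy. The sizes are shrunk for readability, the data is random, and reading "merge along the second axis" as a concatenation on axis 1 is my interpretation of the text, not something confirmed by the figures.

```python
import numpy as np

N_IMAGES, VEC_DIM = 8, 16   # the article uses 1491 and 3840

rng = np.random.default_rng(0)
gallery = rng.normal(size=(N_IMAGES, VEC_DIM))   # 'Image 2 Vectorized'
query = rng.normal(size=(1, VEC_DIM))            # 'Image 1 Vectorized'

# Repeat the single query row so it lines up with every gallery row.
tiled_query = np.tile(query, (N_IMAGES, 1))      # shape (N_IMAGES, VEC_DIM)

# Merge along the second axis: row i now holds the pair (query, gallery[i]).
pairs = np.concatenate([tiled_query, gallery], axis=1)

print(tiled_query.shape, pairs.shape)
```

Each row of `pairs` is one pairing, so a layer that treats rows independently can score all pairings at once.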

With this architecture, I can feed in both (1491, 3840) inputs at once and get 1491 probabilities, one for each of the 1491 pairings, and the result is pretty impressive!
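To see why a kernel size of 1 gives one probability per pairing in a single pass, here is a small NumPy sketch. It is an illustration under assumptions, not the trained network: the weights are random, the layer sizes are tiny, and the ReLU activations after C1 and C2 are my guess since the activations are not stated. A convolution with kernel size 1 and stride 1 applies the same weights to each position independently, so it can be written as a matrix multiplication over the rows.

```python
import numpy as np


def conv1d_kernel1(x, weights, bias):
    """1D convolution with kernel size 1 and stride 1.

    x: (positions, in_channels), weights: (in_channels, filters), bias: (filters,).
    Every position is transformed independently with shared weights.
    """
    return x @ weights + bias


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def conv_head(pairs, params):
    """C1 -> C2 -> sigmoid, returning one probability per row of `pairs`."""
    h = np.maximum(conv1d_kernel1(pairs, params["w1"], params["b1"]), 0.0)  # C1 + ReLU (assumed)
    h = np.maximum(conv1d_kernel1(h, params["w2"], params["b2"]), 0.0)      # C2 + ReLU (assumed)
    return sigmoid(conv1d_kernel1(h, params["w_out"], params["b_out"]))[:, 0]


rng = np.random.default_rng(0)
n_pairs, pair_dim, hidden = 6, 32, 8   # toy sizes, not the article's
params = {
    "w1": 0.1 * rng.normal(size=(pair_dim, hidden)), "b1": np.zeros(hidden),
    "w2": 0.1 * rng.normal(size=(hidden, hidden)), "b2": np.zeros(hidden),
    "w_out": 0.1 * rng.normal(size=(hidden, 1)), "b_out": np.zeros(1),
}
pairs = rng.normal(size=(n_pairs, pair_dim))

all_at_once = conv_head(pairs, params)   # one pass over every pairing
one_by_one = np.array([conv_head(pairs[i:i + 1], params)[0] for i in range(n_pairs)])
print(all_at_once.shape, np.allclose(all_at_once, one_by_one))
```

The single pass and the per-pair loop give the same probabilities; the only difference is how many forward passes are needed.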

fig 7. Single Evaluation on Conv. Siamese

I finally get below 100 ms for this problem! Can I improve it? Of course I can. Let’s use batches (assuming evaluating all 1491 images at once is hard for the network).

fig 8. Batches!

Wow! Nearly 50 ms. So it really is possible to get below 100 ms. But how is the accuracy?

fig 9. Non Conv. Siamese Network
fig 10. Conv. Siamese Network

Actually, the accuracy is not that different. It hovers between 96% and 98%. (And from the figures above, you can tell that the convolutional one is more stable.)

YOCO (You Only Compare Once)

The idea comes from the YOLO (You Only Look Once) object detection algorithm, which can detect many objects in one forward propagation step.

YOCO can compare many images (even thousands of pairings) in one forward propagation. With this algorithm, you can reduce execution time and make your face recognition work even faster. And hopefully, you will not need 17 full minutes for all of your employees to cross the entry lanes.

That’s the idea for improving your already working Siamese Network. Going from 240 ms to 60 ms (around 4x) may not seem that significant. But in a production-level system, it can make the difference between a good and a bad response from users.