Real World ML: Siamese Networks

Shekhar
Sep 15, 2020


Siamese Networks belong to a particular class of learning called one-shot learning.

One Shot / Few Shot Learning

One-shot learning is the technique of learning representations from a single sample. I will generalize one-shot learning to few-shot learning, which refers to the fact that we are dealing with a few labelled samples rather than a single labelled sample. Tutorials on one/few-shot learning networks have typically concentrated on using them for facial recognition systems (as part of computer vision), but there is a huge class of structured-data learning problems where one/few-shot learning is even more useful.

Let us come to the question of why you would need one/few-shot learning. The most useful application of one-shot learning is the real-world example of fraud detection. In fact, any case where there are very few labelled samples, or where the dataset is highly imbalanced with very few examples of one class, leads to one-shot learning as the natural solution.

Siamese Networks and Triplet Loss

Let me explain it with a small graphic.

Siamese Networks for Image Classification

Here a CNN with shared weights takes three images as input (from top to bottom, Image 1: Homer, Image 2: Bart's face, Image 3: Bart's whole body). It computes a triplet loss, which is essentially a way of indicating that Images 2 and 3 are of the same person (Bart Simpson) and should be closer together, whereas Images 1 and 2 are of different people (Homer and Bart) and should be farther apart.

This is what it looks like mathematically:

L(a, p, n) = max(d(a, p) − d(a, n) + margin, 0)

WARNING: Math ahead. Let us ignore the margin in the above formula for now.

Let us call Image 2 the anchor image (denoted by a in the formula above), Image 3 the positive image (denoted by p) and Image 1 the negative image (denoted by n).

Let us go through this with an example:

a and p are both images of Bart Simpson, so ideally their distance, denoted by d(a, p), should be close to zero. Assume it is ~0 here.

a and n are images of Bart and Homer Simpson respectively, so ideally their distance, denoted by d(a, n), should be high; say 0.8.

So d(a, p) − d(a, n) would be 0 − 0.8, which comes out negative (−0.8).

max(d(a, p) − d(a, n), 0) would be max(−0.8, 0), which is zero.

So the loss is zero, which is correct, as the two images of Bart are of the same person. Hence we can say learning is proceeding in the right direction.
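To make this concrete, here is a minimal NumPy sketch of the triplet loss computation. The 2-D vectors, the Euclidean distance and the margin of 0.2 are made up purely for illustration; they are not the actual embeddings or settings of any particular network.

import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Pull the anchor towards the positive, push it away from the negative
    d_ap = np.linalg.norm(anchor - positive)   # distance anchor <-> positive
    d_an = np.linalg.norm(anchor - negative)   # distance anchor <-> negative
    return max(d_ap - d_an + margin, 0.0)

# Toy 2-D embeddings: two "Bart" vectors close together, one "Homer" vector far away
bart_anchor    = np.array([0.10, 0.12])
bart_positive  = np.array([0.11, 0.10])
homer_negative = np.array([0.90, 0.85])

print(triplet_loss(bart_anchor, bart_positive, homer_negative))  # ~0, learning is on track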

So this neural network, when trained on all such image triplets, learns weights that allow it to group images of the same person together.

What are the possibilities here? Think of face recognition for identification, or, for a real-world situation, the example below, where the authors used a Siamese network to identify whales without enough labelled examples.

https://towardsdatascience.com/a-gold-winning-solution-review-of-kaggle-humpback-whale-identification-challenge-53b0e3ba1e84

Siamese Networks and Structured Data

Let us come back to our original problem: where does this fit in with structured data?

Another digression: embeddings

Let us come to the definition of embeddings.

An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.

Let us look at the diagram below:

Embeddings

The diagram above shows an embedding visualization where kitchen items like refrigerator, microwave and oven are close together and coloured the same (the yellow dots). On the other hand, garden, hose and sprinkler (items related to gardening) are close to each other but far away from the kitchen items.
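As a toy illustration (the 2-D vectors below are made up, not learnt), semantically similar items end up with small distances between their embedding vectors, while unrelated items end up far apart:

import numpy as np

# Toy 2-D embedding: kitchen items cluster together, garden items cluster elsewhere
emb = {
    'refrigerator': np.array([0.90, 0.80]),
    'microwave':    np.array([0.85, 0.90]),
    'oven':         np.array([0.95, 0.85]),
    'garden':       np.array([0.10, 0.20]),
    'hose':         np.array([0.15, 0.10]),
    'sprinkler':    np.array([0.05, 0.15]),
}

def dist(a, b):
    return np.linalg.norm(emb[a] - emb[b])

print(dist('microwave', 'oven'))       # small: both kitchen items
print(dist('microwave', 'sprinkler'))  # large: kitchen item vs garden item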

That's fine, but how does that help?

Imagine if we could learn an embedding of the credit card fraud dataset in which the frauds are grouped away from the valid card transactions. If we could produce such an embedding, would it not be easy to separate frauds from non-frauds visually, or with the simplest of classifiers?

Introducing Ivis

Reference: https://bering-ivis.readthedocs.io/en/latest/metric_learning.html

ivis is a package which runs Siamese networks on structured data. Let us see an example of ivis on the credit card dataset for fraud detection. Fraud detection data in the real world is normally imbalanced, with very few instances of actual fraud.

Below is a high-level explanation of the code. I have added markers in brackets, (1) and (2), to refer to the corresponding parts of the code below.

We will come to the nitty-gritty of the code below, but for now note that it essentially builds an embedding of the data in 2 dimensions (2). Any system that learns an embedding generally requires scaling the inputs beforehand (1), because different features may be on different scales. The package then learns an embedding of the data in a low-dimensional space, as described above. Note that higher-dimensional embeddings might require more advanced classifiers than a simple logistic regression.

Hyperparameter explanations:

(2) also sets n_epochs_without_progress=5, meaning training stops after 5 epochs without progress. supervision_weight refers to the weight attached to labelled examples versus unlabelled ones: the higher the value of supervision_weight, the more weight is given to the labelled examples. Use a high value when you want the classes to be separated more cleanly. These are hyperparameters and need to be searched over the hyperparameter space.

from ivis import Ivis
from sklearn.preprocessing import MinMaxScaler

# train_X, train_Y, test_X, test_Y: train/test split of the credit card fraud dataset

# (1) Scale the features so they are all on a comparable scale
minmax_scaler = MinMaxScaler().fit(train_X)
train_X = minmax_scaler.transform(train_X)
test_X = minmax_scaler.transform(test_X)

# (2) Learn a 2-dimensional embedding with a supervised Siamese network
ivis = Ivis(embedding_dims=2, model='maaten',
            k=15, n_epochs_without_progress=5,
            supervision_weight=0.8,
            verbose=0)
ivis.fit(train_X, train_Y.values)

train_embeddings = ivis.transform(train_X)
test_embeddings = ivis.transform(test_X)

Let us plot the learnt embeddings.
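The original post shows the plot as an image; here is a minimal matplotlib sketch of how such a plot could be produced, assuming train_Y and test_Y are pandas Series of 0/1 labels (the colours, marker sizes and figure layout are arbitrary choices, not part of ivis).

import matplotlib.pyplot as plt

# Scatter the 2-D embeddings, frauds (class 1) in red on top of valid transactions in grey
fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharex=True, sharey=True)
for ax, emb, y, title in [(axes[0], train_embeddings, train_Y.values, 'Training embeddings'),
                          (axes[1], test_embeddings, test_Y.values, 'Testing embeddings')]:
    ax.scatter(emb[y == 0, 0], emb[y == 0, 1], s=1, c='lightgrey', label='valid')
    ax.scatter(emb[y == 1, 0], emb[y == 1, 1], s=8, c='red', label='fraud')
    ax.set_title(title)
    ax.legend()
plt.show()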

With anomalies shown in red, we can see that ivis:

  1. Effectively learnt embeddings on an imbalanced dataset.
  2. Successfully extrapolated the learnt metric to the test subset.

Linear Classifier

Once such an embedding has been learnt, all it takes is a simple classifier to separate the classes.

We can train a simple linear classifier to assess how well ivis learned the class representations.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, average_precision_score, roc_auc_score

clf = LogisticRegression(solver="lbfgs").fit(train_embeddings, train_Y)  # linear classifier on the 2-D embeddings
labels = clf.predict(test_embeddings)
proba = clf.predict_proba(test_embeddings)
print(classification_report(test_Y, labels))
print('Confusion Matrix')
print(confusion_matrix(test_Y, labels))
print('Average Precision: ' + str(average_precision_score(test_Y, proba[:, 1])))
print('ROC AUC: ' + str(roc_auc_score(test_Y, labels)))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    270100
           1       1.00      0.99      1.00       467

    accuracy                           1.00    270567
   macro avg       1.00      1.00      1.00    270567
weighted avg       1.00      1.00      1.00    270567

Confusion Matrix
[[270100      0]
 [     3    464]]
Average Precision: 0.9978643591710002
ROC AUC: 0.9967880085653105

How does it work internally?

Let us look at the architecture below.

Here we see that it is the same concept: we feed in three examples, an anchor, a positive and a negative, and expect the network to learn embeddings while minimizing the triplet loss.

(Note: the diagram above shows the 'szubert' architecture. The 'maaten' architecture used in our example is similar, but has 500–500–2000 units instead of the 128–128–128 above.)
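To make the architecture concrete, here is a minimal Keras sketch of a triplet network with a shared base network, using the 500–500–2000 layer sizes mentioned above. The input width, activation, margin and the rest of the details are illustrative assumptions, not the actual ivis implementation.

import tensorflow as tf
from tensorflow.keras import layers, Model

n_features = 30       # assumed input width, e.g. the credit card dataset
embedding_dims = 2
margin = 1.0          # assumed triplet margin

# Base network with shared weights (maaten-style 500-500-2000 stack)
def make_base_network():
    inp = layers.Input(shape=(n_features,))
    x = layers.Dense(500, activation='selu')(inp)
    x = layers.Dense(500, activation='selu')(x)
    x = layers.Dense(2000, activation='selu')(x)
    out = layers.Dense(embedding_dims)(x)
    return Model(inp, out)

base = make_base_network()

# Three inputs go through the same base network (shared weights)
anchor_in   = layers.Input(shape=(n_features,))
positive_in = layers.Input(shape=(n_features,))
negative_in = layers.Input(shape=(n_features,))
embeddings = layers.Concatenate()([base(anchor_in), base(positive_in), base(negative_in)])
triplet_model = Model([anchor_in, positive_in, negative_in], embeddings)

# Triplet loss computed on the concatenated embeddings
def triplet_loss(_, y_pred):
    a, p, n = tf.split(y_pred, 3, axis=1)
    d_ap = tf.reduce_sum(tf.square(a - p), axis=1)
    d_an = tf.reduce_sum(tf.square(a - n), axis=1)
    return tf.reduce_mean(tf.maximum(d_ap - d_an + margin, 0.0))

triplet_model.compile(optimizer='adam', loss=triplet_loss)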

So what's next?

This only scratches the surface of real-world ML. We will look at other ways to learn from imbalanced datasets in the next parts.
