An Introduction to Semantic Matching Techniques in NLP and Computer Vision

Georgian Impact Blog · Sep 8, 2021

By: Ken Gu and Rohit Saha

Are these the same person? As humans, we can see that they are the same person despite differences in facial hair. But how can a machine learning system come to the same conclusion? With semantic matching!

Semantic matching is a technique to determine whether two or more elements have similar meaning.

While the example above is about images, semantic matching is not restricted to the visual modality. It is a versatile technique that can work for representations of graphs, text data and more. In addition, it is a core component of semantic search. Whenever you use a search engine, the results depend on whether the query semantically matches documents in the search engine’s database.

In this blog post, we give a brief introduction to semantic matching and review how it has evolved in two of the dominant sub-fields of AI: natural language processing (NLP) and computer vision (CV).

Semantic Matching in Natural Language Processing

In NLP, semantic matching techniques aim to compare two sentences to determine if they have similar meaning. Note: A sentence can be a phrase, a paragraph or any distinct chunk of text. This is especially important in search. Let’s bring this to life with an example.

Let’s say we are developing software that leverages NLP techniques to improve our lead qualification process. We want to help our sales team have a more efficient and effective cold outreach process. To do so, we can use semantic matching to find commonalities in target companies’ culture, team and product based on available text sources. Then, our sales team can use these common points for selling!

For instance, say we have a short description about our company as:

Provider of an AI-powered tool designed for extracting information from resumes to improve the hiring process. Our tool leverages novel techniques in natural language processing to help you find your perfect hire.

We can look for relevant materials in our target companies such as blog posts or homepage text that is semantically similar to our company description.

This example also shows the typical workflow of semantic search. We have a query (our company text) and we want to search through a series of documents (all text about our target company) for the best match. Semantic matching is a core component of this search process, as it finds the query-document pairs that are most similar. The same technology can be applied to both information search and content recommendation.

The field of NLP has recently been revolutionized by large pre-trained language models (PLM) such as BERT, RoBERTa, GPT-3, BART and others. These new models have superior performance compared to previous state-of-the-art models across a wide range of NLP tasks. Our focus in the rest of this section will be on semantic matching with PLMs.

Semantic Matching with Transformer-based PLMs

Main Ideas: Bi-Encoder and Cross-Encoder

Methods that aim to find semantically similar text typically fall into one of three categories: Bi-Encoders, Cross-Encoders, or a mix of the two.

With the PLM as a core building block, Bi-Encoders pass the two sentences separately to the PLM and encode each as a vector. The final similarity or dissimilarity score is calculated with the two vectors using a metric such as cosine-similarity.

Cross-Encoders, on the other hand, simultaneously take the two sentences as a direct input to the PLM and output a value between 0 and 1 indicating the similarity score of the input pair.

Typically, Bi-Encoders are faster since we can save the embeddings and employ Nearest Neighbor search for similar texts. Cross-encoders, on the other hand, may learn to fit the task better as they allow fine-grained cross-sentence attention inside the PLM.

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Sentence-BERT is the core method behind Bi-Encoders. Building on the success of BERT, this paper finds an effective embedding method for sentences. It follows the idea that a good sentence embedding means similar sentences are close in vector space. The main contribution is applying the triplet loss function, often used in the vision domain, to sentence embeddings. Given an anchor sentence, a positive (similar) sentence, and a negative (dissimilar) sentence, we want to minimize the distance between the anchor and positive sentence while maximizing the distance between the anchor and negative sentence. This is achieved by minimizing the loss below:
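In the notation of the Sentence-BERT paper, with sₐ, sₚ and sₙ denoting the embeddings of the anchor, positive and negative sentences, ‖·‖ a distance metric and ε a margin, the network is trained to minimize:

max(‖sₐ − sₚ‖ − ‖sₐ − sₙ‖ + ε, 0)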

This loss function, combined with a siamese network structure, forms the basis of Bi-Encoders and allows the architecture to learn semantically meaningful sentence embeddings that can be compared effectively using a metric like cosine similarity.

Sentence-BERT was evaluated on the STS (Semantic Textual Similarity) Benchmark.

Code

The team behind this paper went on to build the popular Sentence-Transformers library. Built on the ideas of this paper, the library is a lightweight wrapper on top of HuggingFace Transformers that provides sentence encoding and semantic matching functionality. This means you can plug in your own Transformer models from HuggingFace’s model hub.

Sentence-Transformers also provides its own pre-trained Bi-Encoders and Cross-Encoders for semantic matching on datasets such as MSMARCO Passage Ranking and Quora Duplicate Questions. Understanding the dataset your model was trained on, including details such as its data sources and the domain of the text, is key to having an effective model for your downstream application.
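To make the two paradigms concrete, here is a minimal sketch using the Sentence-Transformers library. The model names are examples from the model hub, not recommendations; swap in whichever pre-trained Bi-Encoder or Cross-Encoder matches your domain.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "AI-powered tool for extracting information from resumes"
docs = [
    "We build resume-parsing software that uses NLP to speed up hiring.",
    "Our bakery delivers fresh sourdough bread across the city.",
]

# Bi-Encoder: encode the query and documents independently, then compare vectors.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example model
doc_embs = bi_encoder.encode(docs, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
print(util.cos_sim(query_emb, doc_embs))  # one cosine similarity per document

# Cross-Encoder: score each (query, document) pair jointly in a single forward pass.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model
print(cross_encoder.predict([(query, d) for d in docs]))  # one score per pair
```

Because Bi-Encoder embeddings can be precomputed and indexed, a common pattern is to retrieve candidates with the Bi-Encoder and then re-rank the top results with the Cross-Encoder.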

Poly-Encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring

Poly-Encoders aim to get the best of both worlds by combining the speed of Bi-Encoders with the performance of Cross-Encoders. The paper addresses the problem of searching through a large set of documents. Thus, all the documents are still encoded with a PLM, each as a single vector (like Bi-Encoders). When a query comes in and is matched against a document, Poly-Encoders apply an attention mechanism between the query’s token vectors and the document vector.

In the paper, the query is called the context and the documents are called the candidates.

Given a query of N token vectors, we learn m global context vectors (essentially attention heads) via attention over the query tokens.

Next, the document vector attends to these m context vectors. To follow attention definitions, the document vector is the query and the m context vectors are the keys and values.
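A minimal sketch of this scoring step, assuming the token vectors and the single document vector have already been produced by the PLM (all names and shapes here are illustrative, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def poly_encoder_score(query_token_vecs, doc_vec, codes):
    """Poly-Encoder-style scoring sketch.

    query_token_vecs: (N, d) token vectors of the query (the context)
    doc_vec:          (d,)   single vector for the document (the candidate)
    codes:            (m, d) learned code vectors used to form m context vectors
    """
    # Step 1: each code attends over the N query token vectors,
    # producing m global context vectors.
    attn = F.softmax(codes @ query_token_vecs.T, dim=-1)   # (m, N)
    context_vecs = attn @ query_token_vecs                  # (m, d)

    # Step 2: the document vector attends over the m context vectors
    # (document vector = attention query; context vectors = keys and values).
    weights = F.softmax(doc_vec @ context_vecs.T, dim=-1)   # (m,)
    final_context = weights @ context_vecs                   # (d,)

    # Step 3: the final score is the dot product with the document vector.
    return final_context @ doc_vec
```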

The authors of the paper evaluated Poly-Encoders on chatbot systems (where the query is the history or context of the chat and documents are a set of thousands of responses) as well as information retrieval datasets. In every use case that the authors evaluate, the Poly-Encoders perform much faster than the Cross-Encoders, and are more accurate than the Bi-Encoders, while setting the SOTA on four of their chosen tasks.

The paper uses BERT as the PLM and there is an unofficial implementation on Github.

Key Limitation of Transformer-based PLMs

With all PLMs that leverage Transformers, the size of the input is limited by the number of tokens the Transformer model can take as input (often called the max sequence length). For example, BERT has a maximum sequence length of 512 tokens and GPT-3’s is 2,048. If the text exceeds the maximum sequence length, it is cut off. We can, however, address this limitation by introducing text summarization as a preprocessing step. Other alternatives include breaking the document into smaller chunks and computing a composite score using mean or max pooling techniques.
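One rough way to handle documents longer than the maximum sequence length is to split them into chunks, score each chunk against the query, and pool the chunk scores. The sketch below assumes a sentence-transformers Bi-Encoder; the chunk size and pooling choice are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model

def chunk_text(text, max_words=200):
    """Naively split a long document into word-level chunks that fit the PLM."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def long_doc_similarity(query, document, pooling="max"):
    """Score a long document against a query by pooling chunk-level similarities."""
    chunks = chunk_text(document)
    query_emb = model.encode(query, convert_to_tensor=True)
    chunk_embs = model.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, chunk_embs)[0]  # one similarity per chunk
    return scores.max().item() if pooling == "max" else scores.mean().item()
```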

GPT-3 Semantic Search

Although they did not explicitly mention semantic search in their original GPT-3 paper, OpenAI did release a GPT-3 semantic search REST API. While the specific details of the implementation are unknown, we assume it is something akin to the ideas mentioned so far, likely following the Bi-Encoder or Cross-Encoder paradigm.

Semantic Matching in Computer Vision

Computer Vision (CV) has taken great leaps in recent years. From self-checkout stores to self-driving cars, CV is revolutionizing several industries. Check out this blog to learn about the state of Computer Vision in 2021!

To give you a sense of semantic matching in CV, we’ll summarize four papers that propose different techniques, starting with the popular SIFT algorithm and moving on to more recent deep learning (DL)-inspired semantic matching techniques.

Distinctive Image Features from Scale-Invariant Keypoints [3] (SIFT)

Scale-Invariant Feature Transform (SIFT) is one of the most popular algorithms in traditional CV. It explores the idea of semantic matching on images. Given an image, SIFT extracts distinctive features that are invariant to distortions such as scaling, shearing and rotation. Additionally, the extracted features are robust to the addition of noise and changes in 3D viewpoints.

SIFT offers three advantages:

  1. Locality: Extracted features are local, making them robust to occlusion and clutter.
  2. Distinctiveness: Individual features can be matched against a large database of objects.
  3. Quantity: Several features can be guaranteed, even for small objects.
Figure 1 (https://courses.cs.washington.edu/courses/cse455/10au/notes/SIFT.pdf ): Extracted SIFT features for a given pair of images.

Under the hood, SIFT applies a series of steps to extract features, or keypoints. These keypoints are chosen such that they are present across a pair of images (Figure 1). It can be seen that the chosen keypoints are detected irrespective of their orientation and scale. SIFT applies difference-of-Gaussian operations to estimate these keypoints, also known as critical points. To achieve rotational invariance, gradient orientations are computed for each keypoint. To learn more about the intricacies of SIFT, please take a look at this video.

Figure 2: SIFT associates similar keypoints between different viewpoints of the same building.

Once keypoints are estimated for a pair of images, they can be used for various tasks such as object matching. To accomplish this task, SIFT uses the Nearest Neighbours (NN) algorithm to identify keypoints across both images that are similar to each other. For instance, Figure 2 shows two images of the same building captured from different viewpoints. The green dots show the extracted keypoints in the two images, and the lines connect the corresponding keypoints found by the NN algorithm. More precisely, a keypoint in the left image is matched to the keypoint in the right image with the lowest NN distance. If the connected keypoints are a correct match, the line is colored green; otherwise it is colored red. Owing to its rotational and 3D-view invariance, SIFT is able to semantically relate similar regions of the two images. However, despite these invariance properties, it is susceptible to lighting changes and blurring, which can cause keypoints to be falsely matched with each other. Furthermore, SIFT performs several operations on every pixel in the image, making it computationally expensive. As a result, it is often difficult to deploy for real-time applications.

SIFT is available in the OpenCV library. Note: SIFT is patent-protected so please check if the patent is enforceable in your country before using it for commercial purposes.
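As an illustration, here is a short OpenCV sketch of SIFT keypoint extraction and nearest-neighbour matching with Lowe's ratio test. The file paths are placeholders, and depending on your OpenCV version SIFT may live in the opencv-contrib package instead.

```python
import cv2

# Placeholder image paths: two views of the same scene.
img1 = cv2.imread("building_view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("building_view2.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute SIFT descriptors for both images.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force nearest-neighbour matching; keep matches that pass Lowe's ratio
# test to prune ambiguous correspondences.
bf = cv2.BFMatcher()
matches = bf.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# Draw the surviving correspondences between the two images.
result = cv2.drawMatches(img1, kp1, img2, kp2, good, None)
cv2.imwrite("sift_matches.jpg", result)
```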

Siamese Neural Networks for One-shot Image Recognition [4]

Proposed in 2015, SiameseNets is one of the first architectures to use deep Convolutional Neural Networks (CNNs) to score pairs of images based on semantic similarity. Siamese networks contain identical sub-networks whose parameters are shared between them. Unlike traditional classification networks, siamese nets do not learn to predict class labels. Instead, they learn an embedding space where two semantically similar images lie close to each other, while two dissimilar images lie far apart.

Figure 3: SiameseNet architecture overview.

For an image pair <I₁, I₂> (similar to <X₁, X₂> in Figure 3), the CNN extracts features from each image. Next, the features are fed to a multi-layer perceptron to obtain <F(I₁), F(I₂)>, and the component-wise L1 distance between the two feature vectors is calculated. Finally, this L1 distance vector is connected to a one-node output layer that computes the similarity between the two images. A sigmoid function outputs a score in the interval [0, 1], where 1 represents maximum similarity between the two images and 0 represents minimum similarity. To demonstrate the effectiveness of the learned feature space, the authors test the trained network on one-shot learning. More precisely, for a given test image Iₜ and a set of images sampled from C classes, the task is to predict the class whose sample leads to the maximum similarity with Iₜ. On this task, SiameseNet achieved performance comparable to the state-of-the-art (SOTA) method.

The implementation of SiameseNets is available on Github.
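For intuition, below is a minimal PyTorch sketch of this pairwise scoring setup: a shared encoder, a component-wise L1 distance, and a one-node sigmoid output. The layer sizes are illustrative and do not reproduce the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    """Sketch of a siamese scoring network with a shared CNN encoder."""
    def __init__(self):
        super().__init__()
        # Shared encoder: the same weights process both images.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=10), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=7), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(4096)   # embedding F(I)
        self.out = nn.Linear(4096, 1)   # one-node similarity layer

    def forward(self, img1, img2):
        f1 = self.fc(self.encoder(img1))
        f2 = self.fc(self.encoder(img2))
        l1 = torch.abs(f1 - f2)              # component-wise L1 distance
        return torch.sigmoid(self.out(l1))   # similarity score in [0, 1]

# Example usage with random 105x105 grayscale inputs (shapes are illustrative).
net = SiameseNet()
score = net(torch.randn(1, 1, 105, 105), torch.randn(1, 1, 105, 105))
```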

Multi-Image Semantic Matching by Mining Consistent Features [5]

As mentioned earlier, methods like SIFT and [6] have their shortcomings. One drawback of these methods is that they can produce several false matches. To address this issue, this paper proposes a technique for finding the most consistent and repeatable features across multiple images. Here, repeatable means features that are universally present for a particular object class. For instance, whiskers are a repeatable feature of the class cat since they appear consistently across all cats. Furthermore, to find the most repeatable features across all instances of an object class, the proposed method can scale to large datasets.

Conventional methods use graph matching algorithms to solve for the optimal associations between the features (output of CNNs) of a pair of images [7]. The authors build on this and introduce the notion of cycle-consistency to match pairs of images. The associations between all pairs of images are cycle-consistent if the following holds for every image triplet <Iᵢ, Iⱼ, Iₓ>: Pᵢⱼ = Pᵢₓ Pₓⱼ. Here, Pᵢⱼ is the permutation matrix encoding the pairwise feature associations between images <Iᵢ, Iⱼ>, calculated by graph matching algorithms [8]. To generalize this framework from a single intermediate image Iₓ to the entire image collection, the authors replace x with a universe u, where u is the set of all unique features present in the image dataset. However, not all features in the universe are relevant for matching, so the irrelevant ones should be excluded. As a result, only the k most repeatable features are selected, with k smaller than the size of u, which helps reduce false matches. By solving this framework, the proposed method achieves SOTA on several semantic matching tasks.
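The cycle-consistency condition is easy to check on a toy example. The sketch below uses hypothetical 3-feature permutation matrices; the real method jointly optimizes these associations rather than checking them after the fact.

```python
import numpy as np

def is_cycle_consistent(P_ij, P_ix, P_xj):
    """Check the cycle-consistency condition P_ij = P_ix @ P_xj."""
    return np.array_equal(P_ij, P_ix @ P_xj)

# Toy pairwise association (permutation) matrices for 3 features per image.
P_ix = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1]])
P_xj = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0]])
P_ij = P_ix @ P_xj  # consistent by construction

print(is_cycle_consistent(P_ij, P_ix, P_xj))  # True
```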

Figure 4: Comparison of pairwise semantic matching [6] (top row) and the proposed method (bottom row) on two sedan images. True and false matches are indicated by blue and red lines, respectively. The proposed method prunes several false matches because it selects a compact set of k repeatable features.

As an additional experiment, the framework is able to detect the 10 most repeatable features across the first 1,000 images of the cat head dataset without any supervision. Interestingly, the chosen features roughly coincide with human annotations (Figure 5) that represent unique features of cats (eyes, whiskers, mouth). This shows the potential of this framework for the task of automatic landmark annotation, given its alignment with human annotations.

An implementation of this paper is available on Github.

Figure 5: Colored crosses represent feature points. Columns 1–10 show the 10 most repeatable feature points identified by the proposed framework. Last column shows initial feature candidates (top) and human annotations (below).

Semantic Matching Based on Semantic Segmentation and Neighborhood Consensus [9]

Intra-class variations (an object can appear in different shapes and sizes) and the unconstrained nature of images result in false associations. The authors attribute this problem to the tendency of previous methods to match local features without any spatial contextual information from the neighborhood. To this end, the paper introduces an architecture that explores contextual information via 4D convolution operations. For a given pair of images <Iᴀ, Iʙ>, semantic features <Fᴀ, Fʙ> are extracted using a CNN model. Next, a correlation map is estimated between the extracted features: C = Fᴀᵀ Fʙ, where C ∈ ℝ^(h×w×h×w). The correlation map computes similarities between local regions of the two images (Figure 6). Similar features result in a higher correlation, whereas dissimilar features suppress the correlation value. Finally, 4D convolution operations (Figure 7) are applied to aggregate the local information contained in the correlation map and approximate global statistics.

Figure 6: Overview of proposed architecture.
Figure 7: Overview of four-dimensional convolution operation.
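A minimal sketch of the correlation-map step, assuming two CNN feature maps of the same spatial size (the channel and spatial dimensions are illustrative):

```python
import torch

# Feature maps F_a, F_b from a CNN backbone: (channels, height, width).
c, h, w = 256, 16, 16
F_a = torch.randn(c, h, w)
F_b = torch.randn(c, h, w)

# Correlation map C[i, j, k, l] = <F_a[:, i, j], F_b[:, k, l]>:
# every local region of image A is compared with every local region of image B.
C = torch.einsum("cij,ckl->ijkl", F_a, F_b)  # shape (h, w, h, w)
print(C.shape)  # torch.Size([16, 16, 16, 16])
```

In the paper, 4D convolutions are then applied over this (h, w, h, w) tensor to aggregate neighborhood consensus.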

This method is compared with several methods on the PF-PASCAL and PF-WILLOW datasets for the task of keypoint estimation. The percentage of correctly identified key points (PCK) is used as the quantitative metric, and the proposed method establishes the SOTA on both datasets.

Citations:

  1. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
    https://aclanthology.org/D19-1410.pdf
  2. Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring.
    https://arxiv.org/pdf/1905.01969.pdf
  3. Distinctive Image Features from Scale-Invariant Keypoints. https://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf
  4. Siamese Neural Networks for One-shot Image Recognition. https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf
  5. Multi-Image Semantic Matching by Mining Consistent Features. https://openaccess.thecvf.com/content_cvpr_2018/papers/Wang_Multi-Image_Semantic_Matching_CVPR_2018_paper.pdf
  6. Multi-Image Matching via Fast Alternating Minimization. https://openaccess.thecvf.com/content_iccv_2015/papers/Zhou_Multi-Image_Matching_via_ICCV_2015_paper.pdf
  7. A Spectral Technique for Correspondence Problems Using Pairwise Constraints. https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/leordeanu-iccv-05.pdf
  8. Reweighted Random Walks for Graph Matching. https://link.springer.com/chapter/10.1007/978-3-642-15555-0_36
  9. Semantic Matching Based on Semantic Segmentation and Neighborhood Consensus. https://www.mdpi.com/2076-3417/11/10/4648
