Similarity Search: A Crash Course

Shekhar
11 min read · Mar 4, 2024


Search/information retrieval, as discussed in the RAG context, eventually ends up being the linchpin of the entire RAG pipeline. But what is search?

Let us explore classical similarity search first, as it is still very much relevant, alive and kicking. We will first discuss the algorithms, explore the entire search process in Elasticsearch, and then move on to algorithms for type-ahead search. In subsequent articles we will cover semantic search, which is used in RAG.

Similarity Search

Popular algorithms for similarity search include TF-IDF and Okapi BM25.

TF-IDF

Imagine you have 4 documents with these texts.

“The Black Umbrella”

“The Black Car”

“The Black and Blue”

“The Blue Sky”

Suppose we search for the word “Black”. Each of the first three documents contains one occurrence, so all of them should be scored highly and considered relevant results. On the other hand, if a term is too common (the word “The”), we do not want to retrieve every document; we would rather rank documents where that term scores better.

The scoring mechanism is called the TF-IDF score. Mathematically, the formula is given below; it has two parts.

TF:- Term Frequency. The more occurrences of a term in a document, the higher the score. It is a document-specific score for each term in each document.

IDF = log(N/df) - the second part, called the inverse document frequency, which downweights extremely common terms. For example, the IDF for the word “The” is log(4/4) = 0, so that term contributes nothing to any document's score.

TF-IDF(t, d) = tf(t, d) * log(N / df(t))

The TF-IDF of the word “black” in document 1 is 1 * log(4/3): it occurs once in that document, and 3 of the N = 4 documents contain it.
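As a rough illustration, here is a minimal sketch of that calculation over the four example documents (it uses the natural log and naive whitespace tokenization, which are assumptions made purely for illustration; real engines differ in the details):

```python
import math

docs = ["The Black Umbrella", "The Black Car", "The Black and Blue", "The Blue Sky"]
tokenized = [d.lower().split() for d in docs]   # naive whitespace tokenizer
N = len(docs)

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)                    # term frequency in this document
    df = sum(term in d for d in tokenized)         # documents containing the term
    return tf * math.log(N / df) if df else 0.0    # idf = log(N / df)

print(tf_idf("black", tokenized[0]))   # 1 * log(4/3) ≈ 0.288
print(tf_idf("the", tokenized[0]))     # 1 * log(4/4) = 0.0
```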

Okapi BM25

The BM25 score of a document D for a query q (in the form used by Elasticsearch/Lucene) is

score(D, q) = ∑ [ IDF(qi) * f(qi, D) * (k1 + 1) / (f(qi, D) + k1 * (1 − b + b * fieldLen/avgFieldLen)) ] (for each term qi in query q)

where IDF(qi) = ln(1 + (docCount − f(qi) + 0.5) / (f(qi) + 0.5))

Salient features of the formula:-

First term

The first term is the IDF of the ith query term.

f(qi) is the number of documents which contain the ith query term.

docCount is the total number of Documents.

So if a term occurs in, let us say, all 100 documents, docCount − f(qi) goes to zero, and the first term becomes

ln(1 + (100 − 100 + 0.5)/(100 + 0.5)) = ln(1 + 0.5/100.5) ≈ ln(1) ≈ 0.

Second Term

The second term is analogous to the term frequency.

f(qi,D) is the number of occurrences of qi in Document D

The ratio fieldLen/avgFieldLen measures how much longer the document (field) is than average. If fieldLen > avgFieldLen, the field is longer than average and this ratio is larger; since the ratio sits in the denominator, the second term ends up smaller. Similarly, a document shorter than average is scored higher.

The parameter b controls how strongly the ratio fieldLen/avgFieldLen penalizes the score.

Similarly, the parameter k1 is used to place an upper bound on the contribution of term frequency, which did not exist in TF-IDF.
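A minimal sketch of this scoring for a single query term, using the symbols above (k1 = 1.2 and b = 0.75, Elasticsearch's defaults), may make the behaviour easier to see:

```python
import math

def bm25_term_score(term_freq, doc_count, docs_with_term,
                    field_len, avg_field_len, k1=1.2, b=0.75):
    # First term: IDF of the query term.
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Second term: saturated, length-normalized term frequency.
    tf_part = (term_freq * (k1 + 1)) / (
        term_freq + k1 * (1 - b + b * field_len / avg_field_len))
    return idf * tf_part

# A term that occurs in all 100 documents contributes almost nothing:
print(bm25_term_score(term_freq=1, doc_count=100, docs_with_term=100,
                      field_len=10, avg_field_len=10))   # ≈ 0.005
```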

So when are these scores calculated?

These scores are calculated, as shown above, for each term and document (document field, actually) combination. The entire process is called indexing, and for Elasticsearch it looks as shown below.

Elasticsearch indexing

As shown above,

  • The coordinator node is what gets fed the documents for indexing.
  • The translog works similarly to a WAL (write-ahead log) in a database and allows re-indexing in case indexing fails.
  • Each document (or the parts of a document) is routed to a specific shard, and possibly to multiple indexes.
  • The ingestion process in each shard consists of a tokenizer and a filter. As a user you determine which parts of a document should be processed by which tokenizer and filter.
  • The tokenizer splits the document parts by whitespace, n-grams, etc. For example, the sentence “The Black Dog” in a document gets split into 3 tokens: [The, Black, Dog]. There are also more advanced options that use patterns to split tokens.
  • This is followed by a filtering process. The filtering process can expand synonyms, stem words, remove stop words, etc.
  • Each token is then replaced by its vocabulary index (token id) and each document by its posting (document id).
  • The resultant document is then used to create an inverted index segment buffer in memory. You can think of this as a 3-tuple containing the token id, the document id, and the token count in that document (see the sketch after this list). It can also contain additional information, such as offsets used to highlight matches in the search response.
  • The coordinator node then returns. The in-memory index segment is synced to disk in the background.
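Here is a minimal sketch of the tokenize → filter → inverted-index steps from the list above; the analyzer (whitespace split, lowercasing, a tiny stop-word list) is an assumption chosen just for illustration:

```python
from collections import defaultdict

STOP_WORDS = {"the", "and"}

def analyze(text):
    tokens = text.lower().split()                        # tokenizer: split on whitespace
    return [t for t in tokens if t not in STOP_WORDS]    # filter: drop stop words

docs = {0: "The Black Umbrella", 1: "The Black Car",
        2: "The Black and Blue", 3: "The Blue Sky"}

vocab = {}                        # token -> token id
postings = defaultdict(list)      # token id -> list of (doc id, term count) tuples
for doc_id, text in docs.items():
    counts = defaultdict(int)
    for token in analyze(text):
        counts[token] += 1
    for token, count in counts.items():
        token_id = vocab.setdefault(token, len(vocab))
        postings[token_id].append((doc_id, count))

print(postings[vocab["black"]])   # [(0, 1), (1, 1), (2, 1)]
```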
Conceptual Representation: An index is split into multiple mini-indexes, called segments, which are independently searchable. Each segment contains the document ids indexed for that segment, which further link to a structure containing fields and their values. Each field further links to a dictionary, which is the inverted index for that field. Note that updating an index is an expensive operation because the underlying Lucene segments are immutable: older segments need to be soft-deleted and newer ones created, which also means a merge + compaction is required. The segments are kept immutable so the OS can cache them effectively, and because immutability avoids race conditions and gives high multithreaded performance.

Multiple data structures are stored for each term. One is an inverted postings list for each term, which indicates which documents contain the term. The second is a score for each term and document pair, which is the term frequency for that document. The third is the IDF, which is calculated across all documents for each term. (*1)

Furthermore, field configurations (field types) control how each field is indexed. They specify:

  1. Whether the field value should be stored (y/n)
  2. Whether to store frequencies, frequencies + positions, or frequencies + positions + offsets (positions and offsets are needed for highlighting)
  3. Whether fields are boosted at query time or index time (Elastic recommends query-time boosting)
  4. Whether the field value should be tokenized (y/n)

This is just building the index. What about search?

Sure. The search phrase is also run through an analyzer and a filter, which may differ from the analyzers and filters applied to indexed documents, producing the resultant query terms. The search diagram is the same, except that the search for the query happens in parallel across the shards.

The resultant terms are then looked up in the postings lists to find the documents of interest for every term in the query. Once these documents are found for every term, the precomputed tf and idf scores for each term and document combination are used in the formula below.

score(q, d) = coord(q, d) * queryNorm(q) * ∑(tf(t in d) * idf(t) * t.getBoost() * norm(t, d)) (for each term t in query q)

In the above formula,

  • tf and idf are precomputed as in (*1) above. tf-idf values for different fields (title and content) are typically calculated separately and added together, as a larger score for the title, for example, indicates higher relevance.
  • norm(t, d) = 1 / (sqrt(dl(d)) * avgFieldLength(f)). Larger documents tend to have larger term frequency because of their size and this helps prevent them from unfairly dominating the ranking. Similarly the second term further normalizes the score based on the average length of the field containing the term. Terms in shorter fields generally carry more weight compared to those in longer fields, as their presence is proportionally more significant in shorter contexts.
  • t.getBoost() is a boost configuration which boosts certain fields. For example you might want to score a term in the title higher than the content. This can be specified during the search or the indexing phase.
  • queryNorm(q) is a normalization factor calculated at search time as the inverse square root of the sum of the squared weights of the query terms; it does not affect the relative ranking, but makes scores comparable across queries.
  • coord(q, d) is the coordination factor, which reflects how many query terms are present in the document and is calculated at search time.

The documents with the highest scores are then returned.
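As a toy illustration of how these pieces combine, here is a sketch of the formula above with made-up precomputed tf, idf, boost, and norm values (this is not Lucene's actual code, just the arithmetic):

```python
import math

def practical_score(query_terms, doc_id, tf, idf, boost, norm):
    matched = [t for t in query_terms if tf.get((t, doc_id), 0) > 0]
    coord = len(matched) / len(query_terms)                            # coord(q, d)
    query_norm = 1 / math.sqrt(sum(idf[t] ** 2 for t in query_terms))  # queryNorm(q)
    summed = sum(tf[(t, doc_id)] * idf[t] * boost.get(t, 1.0) * norm[(t, doc_id)]
                 for t in matched)
    return coord * query_norm * summed

# Hypothetical precomputed values for document 1 and the query "black umbrella":
tf = {("black", 1): 1.0, ("umbrella", 1): 1.0}
idf = {"black": 0.29, "umbrella": 1.39}
norm = {("black", 1): 0.58, ("umbrella", 1): 0.58}
print(practical_score(["black", "umbrella"], 1, tf, idf, boost={}, norm=norm))
```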

So one way to do an efficient search would be to store the terms in the term dictionary in sorted order. Would an LSM tree serve this purpose?

Credits: https://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html

LSM trees are typically optimized for write-heavy workloads. In Lucene, since segments are immutable, the workload is read-heavy, so Lucene instead uses an automaton called an FST (finite state transducer). It works similarly to a trie (below) but allows sharing both prefixes and suffixes. The output is the term ordinal, computed as the sum of the numbers along a specific path; for example, the term ordinal for “stop” is 4, which is used as the entry into the term inverted index.

What about Google's search-as-you-type? [2]

There are also advanced search features like type-ahead search. Options available for implementation include using a trie data structure [5] to suggest completions.

Trie data structure (image from Wikipedia). When someone types “t”, the search narrows to the candidates in the left branch; “te” narrows it to all children of the node “te”, and so on. The more characters the user types, the more precise the search. The response is normally the most popular completions under the matched node.
Type Ahead Search
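To make the trie idea concrete, here is a minimal in-memory sketch; the stored queries and popularity counts are made up for illustration:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.popularity = 0          # > 0 marks the end of a stored query

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, popularity=1):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.popularity += popularity

    def suggest(self, prefix, k=3):
        node = self.root
        for ch in prefix:            # walk down to the node matching the prefix
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Collect every completed query under this node, then return the k most popular.
        results, stack = [], [(node, prefix)]
        while stack:
            cur, word = stack.pop()
            if cur.popularity:
                results.append((cur.popularity, word))
            for ch, child in cur.children.items():
                stack.append((child, word + ch))
        return [w for _, w in sorted(results, reverse=True)[:k]]

trie = Trie()
for query, count in [("tea", 50), ("ted", 30), ("ten", 10), ("to", 70)]:
    trie.insert(query, count)
print(trie.suggest("te"))            # ['tea', 'ted', 'ten']
```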

How do we store such tries?

One way to do so is to store the root of the trie as a key in a document database and the trie itself as a JSON structure. The disadvantage is that the trie may be too large to store against a single key.

The other option is to store the non-leaf trie nodes (prefixes) as keys in a key-value store, with their leaf nodes (completions) as values.

In practice however FSTs are used instead of tries for type ahead search.

What if the user misspells their query?

The user query might often contain misspellings, or there may be an inexactness to the query itself that needs correction.
Levenshtein and Jaro–Winkler similarity algorithms are generally used whenever we want to introduce some level of fuzziness into an Elasticsearch search.

Levenshtein distance

The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

For example, the Levenshtein distance between “smitten” and “sitting” is 3, since the following 3 edits change one into the other, and there is no way to do it with fewer than 3 edits:

  1. smitten → sitten (Deletion of “m”),
  2. sitten → sittin (substitution of “i” for “e”),
  3. sittin → sitting (insertion of “g” at the end).
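A minimal dynamic-programming sketch of this distance (the standard textbook approach, not tied to any particular library) confirms the example:

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between the current prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("smitten", "sitting"))   # 3
```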

Note there is a variant of Levenshtein called the Damerau–Levenshtein distance [3], which adds transpositions (swaps of adjacent characters) to the insertions, deletions, and substitutions, and is what is used in practice.

For example, for the two strings CA and ABC, the Damerau–Levenshtein distance LD(CA, ABC) = 2, because CA → AC → ABC.

On the other hand, the classic Levenshtein distance would be 3, e.g. CA → A → AB → ABC.

Jaro Similarity

Jaro similarity [4] is defined as

sim_j = 0 if m = 0, otherwise (1/3) * (m/|s1| + m/|s2| + (m − t)/m)

where m is the number of matching characters and t is the number of transpositions (half the number of matching characters that appear in a different order). The Jaro similarity score is 0 if the strings do not match at all, and 1 if they are an exact match. In the first step, each character of s1 is compared with all its matching characters in s2. Two characters from s1 and s2 respectively are considered matching only if they are the same and not farther than ⌊max(|s1|, |s2|)/2⌋ − 1 characters apart.
For example, let us say we have two strings.

  • String 1: “MARTHA”
  • String 2: “BARITHA”
  1. Lengths: String 1 has 6 characters; String 2 has 7.
  2. Matching characters:
  • ‘A’ matches ‘A’ (2nd character)
  • ‘R’ matches ‘R’ (3rd character)
  • ‘T’, ‘H’, ‘A’ match ‘T’, ‘H’, ‘A’ in String 2 because they are not farther apart than ⌊max(6, 7)/2⌋ − 1 = 2; they are only one position apart.
  • Number of matching characters m in the above formula = 5
  • Number of transpositions, i.e. matching characters not in the same order = 0, so t = 0.
  • Jaro similarity = (1/3)*(5/6 + 5/7 + 5/5) ≈ 0.849.
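A small sketch of the Jaro similarity, following the definition above (not a library call), reproduces the worked example:

```python
def jaro_similarity(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1           # maximum distance for a "match"
    s2_used = [False] * len2
    s1_matched, s2_positions = [], []
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not s2_used[j] and s2[j] == c:
                s2_used[j] = True
                s1_matched.append(c)
                s2_positions.append(j)
                break
    m = len(s1_matched)
    if m == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order, halved.
    s2_matched = [s2[j] for j in sorted(s2_positions)]
    t = sum(a != b for a, b in zip(s1_matched, s2_matched)) / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

print(round(jaro_similarity("MARTHA", "BARITHA"), 3))   # 0.849
```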

Jaro Winkler distance

The Jaro similarity is a number between 0 and 1.

Jaro–Winkler similarity uses a constant scaling factor p, which gives more favorable ratings to strings that match from the beginning, for a common prefix of length l. Given two strings s1 and s2, their Jaro–Winkler similarity is

sim_w = sim_j + l * p * (1 − sim_j)

where:

  • sim_j is the Jaro similarity of the two strings
  • l is the length of the common prefix at the start of the strings, up to a maximum of 4 characters
  • p is a constant controlling how much the score is adjusted upwards for having a common prefix (the standard value is p = 0.1)
  • The Jaro–Winkler distance is then defined as d_w(s1, s2) = 1 − sim_w(s1, s2).

I have heard that semantic search is all the rage now. Why am I learning similarity search then?

Well, in a lot of problem domains, like e-commerce product search, similarity search is used as the primary mechanism, with semantic search as an adjunct or run in parallel.
A text search for “nike shoes” with a boost for “nike” would give a better match using similarity search.

On the other hand, a text search for the phrase “shoes used for running” would yield better results with semantic search techniques.

So no Machine Learning? :-(

SPLADE

One popular use of machine learning in similarity search is a newer algorithm called SPLADE, which uses the BERT transformer architecture.

BERT

Let us first understand how BERT works. At its heart, BERT is a stack of transformer encoder layers.

While feeding the inputs, BERT takes a single segment or a pair of segments. Each segment is split into tokens. Additionally, two special tokens are passed to the input:

  • [CLS] — placed before the first segment, indicating its beginning.
  • [SEP] — passed between segments to indicate the end of the first segment and the beginning of the second.

The tokenization process consists of finding the position of each word in BERT’s vocabulary, in addition to adding the special tokens above (101 for [CLS], for example).
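As a quick sanity check of this tokenization, here is a small sketch assuming the Hugging Face transformers package and the bert-base-uncased vocabulary are available (neither is required by anything above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The black umbrella")
# e.g. ['[CLS]', 'the', 'black', 'umbrella', '[SEP]'] (rarer words split into word pieces)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(encoded["input_ids"])   # starts with 101 ([CLS]) and ends with 102 ([SEP])
```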

BERT Embedding Layer.

After this, a positional embedding is added, which indicates the position of each token in the sequence.

This is further added to a token-type embedding, which identifies the segment each token belongs to, and the sum is normalized. The result is the input embedding matrix fed to the encoder layers.

Each encoder layer of BERT then outputs a contextualized embedding for every token.

Masked Language Modelling

The authors of the original BERT paper propose training BERT by masking a certain fraction of the tokens in the input text and predicting them. For example, the sentence “John loves to draw fruits on a canvas” might become “John loves to [MASK] fruits on a canvas”.

For each masked position, BERT outputs a probability distribution over all the words in its vocabulary. In this case, BERT might assign high probability to tokens such as “draw” and “paint” for the [MASK] position.
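A hedged illustration of this masked-token prediction, again assuming the Hugging Face transformers package and a pre-trained bert-base-uncased model:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("John loves to [MASK] fruits on a canvas."):
    print(prediction["token_str"], round(prediction["score"], 3))
# Expect words like "paint" or "draw" near the top of the distribution.
```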

This helps with term/query expansion, where the query can be expanded semantically to allow for better searches: a sort of semantic search without requiring an LLM, provided the term expansion is done as part of indexing itself.

[1] https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up

Note: for ingesting logs into Elasticsearch, a popular strategy is splitting logs into time-based indexes, which confines each index to only the documents from that period. This is generally used when you want to use Elasticsearch for searching through logs.

[2] https://www.elastic.co/guide/en/elasticsearch/reference/7.17/search-suggesters.html

[3] https://en.wikipedia.org/wiki/Levenshtein_distance

[4] https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

[5] https://en.m.wikipedia.org/wiki/Trie
