Getting started with semantic search

Learn about this rapidly developing capability

David Mezzetti
NeuML
6 min read · Nov 12, 2022


Semantic search is a new category of search built on recent advances in Natural Language Processing (NLP). Large-scale general language models have rapidly pushed the field forward in ways unimaginable only a few years ago.

This article gives a brief overview of semantic search and how it contrasts with keyword search. Examples are backed by txtai, an open-source framework for building semantic search applications. See the links below for more on txtai.

Keyword search

Before semantic search, systems would typically build a keyword-based index to help find data. Examples of this are TF-IDF and BM25. Apache Lucene and Elasticsearch are enterprise-grade, large-scale implementations of keyword search.

At a very high level, these methods tokenize text, splitting it into terms, and calculate metrics on how important each term is. This process can get quite complex when accounting for case sensitivity, stemming, stop word removal and more. Keyword search has held up very well over time and isn’t going anywhere anytime soon.
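
To make that concrete, here is a minimal, illustrative sketch of term scoring in plain Python. It is not how Lucene, Elasticsearch or txtai implement BM25, just a simplified TF-IDF style weight showing why a term that is frequent in one document but rare across the collection scores highest.

import re
from collections import Counter
from math import log

docs = ["US tops 5 million confirmed virus cases",
        "Maine man wins $1M from $25 lottery ticket"]

# Tokenize: lowercase and split each document into terms
tokens = [re.findall(r"\w+", doc.lower()) for doc in docs]

# Term frequency per document, document frequency per term
tf = [Counter(t) for t in tokens]
df = Counter(term for t in tokens for term in set(t))

# Simplified TF-IDF style weight: high when a term is frequent in a document
# but rare across the collection
def weight(term, doc):
    return tf[doc][term] * log(1 + len(docs) / df[term])

print(weight("lottery", 1), weight("lottery", 0))

Running this prints a positive weight for lottery in the second document and zero in the first.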

Let’s cover a keyword search example. The following code creates a BM25 index over a list of text elements and runs a series of searches.

from txtai.scoring import ScoringFactory

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

# Create a BM25 index
scoring = ScoringFactory.create({"method": "bm25", "terms": True})
scoring.index(((x, text, None) for x, text in enumerate(data)))

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("lottery winner", "canadian iceberg", "number of cases",
              "rising tensions", "park service"):
    # Get index of the section that best matches the query
    results = scoring.search(query, 1)
    match = data[results[0][0]] if results else "No results"

    print("%-20s %s" % (query, match))

Notice that each query has a term found in the matching result. For example, lottery appears in "Maine man wins $1M from $25 lottery ticket". Also notice that not all of the query terms need to appear in the results. Overall, keyword search does a pretty good job, as the calculated token metrics help produce the best matching result.

But what if we want to find conceptual matches where the query term isn’t a part of the result? See below.

from txtai.scoring import ScoringFactory

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

# Create a BM25 index
scoring = ScoringFactory.create({"method": "bm25", "terms": True})
scoring.index(((x, text, None) for x, text in enumerate(data)))

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "public health story",
              "war", "wildlife", "asia", "lucky", "dishonest junk"):
    # Get index of the section that best matches the query
    results = scoring.search(query, 1)
    match = data[results[0][0]] if results else "No results"

    print("%-20s %s" % (query, match))

No results for any of the queries. Why is that? Let’s look at the first query, feel good story. None of the terms feel, good or story appear in any of the input text elements.
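
A quick check over the same data list used above confirms this (plain Python, nothing txtai-specific):

# None of the query terms appear anywhere in the indexed text
for term in ("feel", "good", "story"):
    print(term, any(term in text.lower() for text in data))

Every check prints False, so there is nothing for a keyword index to match. Let’s see how semantic search can help!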

Semantic search

Semantic search overview

Semantic search applications have an understanding of natural language and identify results that have the same meaning, not necessarily the same keywords. The diagram above illustrates how semantic search works.

The first step is using a large language model to vectorize input content. Vectorization transforms inputs into arrays of numbers (vectors). Similar concepts produce similar vectors.
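
As a rough sketch of what vectorization looks like in code, the snippet below calls the sentence-transformers library directly with the same model used in the txtai examples later in this article; the example texts are arbitrary.

from sentence_transformers import SentenceTransformer, util

# Load a vectorization model
model = SentenceTransformer("sentence-transformers/nli-mpnet-base-v2")

# Each input becomes an array of numbers (an embedding)
vectors = model.encode(["feel good story",
                        "Maine man wins $1M from $25 lottery ticket"])
print(vectors.shape)

# Cosine similarity measures how close two vectors (and concepts) are
print(util.cos_sim(vectors[0], vectors[1]))

The two texts share no keywords, yet the model places them relatively close together in vector space, which is why the lottery headline surfaces for feel good story in the examples below.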

Next, the vectors need to be stored somewhere. Vector databases are systems that specialize in storing these numerical arrays and finding matches. They are typically backed by Approximate Nearest Neighbor (ANN) indexes. txtai supports writing ANN indexes directly to disk as well as integrating with vector databases.
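
As a small sketch of that storage step, a txtai embeddings index can be saved to and later reloaded from a local directory (the directory name "index" here is just an example):

from txtai.embeddings import Embeddings

# Build a small index
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})
embeddings.index([(0, "Maine man wins $1M from $25 lottery ticket", None)])

# Write the vectors and ANN index to disk
embeddings.save("index")

# Later: reload the saved index and query it
embeddings = Embeddings()
embeddings.load("index")
print(embeddings.search("lottery winner", 1))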

Once vector indexes are created, the last step is search. Vector search transforms an input query into a vector, then runs an ANN query to find the stored vectors closest to it, i.e. the best conceptual matches.

Now that the process is explained, it’s time for an example. The code below is almost identical to the previous example, except it uses a txtai embeddings instance.

from txtai.embeddings import Embeddings

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})
embeddings.index(((x, text, None) for x, text in enumerate(data)))

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "public health story",
              "war", "wildlife", "asia", "lucky", "dishonest junk"):
    # Get index of the section that best matches the query
    uid = embeddings.search(query, 1)[0][0]

    print("%-20s %s" % (query, data[uid]))

Notice the matches here. For almost all of the queries, the query text doesn’t appear anywhere in the matched text sections, yet the best conceptual result is still found. This is the true power of semantic search over keyword-based search.

Semantic search doesn’t just support conceptual matches. For completeness, the last example runs the earlier keyword-based queries against a semantic index.

from txtai.embeddings import Embeddings

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})
embeddings.index(((x, text, None) for x, text in enumerate(data)))

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("lottery winner", "canadian iceberg", "number of cases",
              "rising tensions", "park service"):
    # Get index of the section that best matches the query
    uid = embeddings.search(query, 1)[0][0]

    print("%-20s %s" % (query, data[uid]))

Same results as with the keyword index.

Wrapping up

This article gave a brief introduction to semantic search and how it helps with conceptual matching. Keyword search still has its place. It is less computationally intensive and can be quite effective. However, semantic search is rapidly growing in popularity. Models continue to improve in both speed and accuracy. If you haven’t tried semantic search, now is the time to give it a look!

Founder/CEO at NeuML. Building easy-to-use semantic search and workflow applications with txtai.