Published in NeuML

Getting started with semantic search

Learn about this rapidly developing capability

Keyword search

Before semantic search, systems typically built a keyword-based index to help find data. TF-IDF and BM25 are two common scoring methods for such indexes. Apache Lucene and Elasticsearch are enterprise-grade, large-scale implementations of keyword search.
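
To make the scoring idea concrete, here is a minimal pure-Python sketch of the Okapi BM25 formula. It is an illustration only, not how txtai, Lucene or Elasticsearch implement it: the regex tokenizer, the `bm25_scores` helper and the `k1=1.2`, `b=0.75` values are assumptions chosen for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplistic tokenizer (an assumption for this sketch): lowercase runs of word characters
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokens = [tokenize(doc) for doc in documents]
    total = len(tokens)
    avgdl = sum(len(t) for t in tokens) / total

    # Document frequency: number of documents that contain each term
    df = Counter(term for t in tokens for term in set(t))

    scores = []
    for doc in tokens:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                # Terms missing from the document contribute nothing
                continue
            # Smoothed inverse document frequency, always positive
            idf = math.log(1 + (total - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency, dampened by k1 and normalized by document length
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


documents = ["Maine man wins $1M from $25 lottery ticket",
             "US tops 5 million confirmed virus cases"]

scores = bm25_scores("lottery winner", documents)
best = max(range(len(documents)), key=scores.__getitem__)
print(documents[best])
```

Only the first document shares a token ("lottery") with the query, so it is the only one with a non-zero score. A document that shares no tokens with the query always scores zero, no matter how related its meaning is.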

from txtai.scoring import ScoringFactory

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

# Create a BM25 index
scoring = ScoringFactory.create({"method": "bm25", "terms": True})
scoring.index(((x, text, None) for x, text in enumerate(data)))

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("lottery winner", "canadian iceberg", "number of cases",
              "rising tensions", "park service"):
    # Get the index of the section that best matches the query
    results = scoring.search(query, 1)
    match = data[results[0][0]] if results else "No results"

    print("%-20s %s" % (query, match))

from txtai.scoring import ScoringFactory

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

# Create a BM25 index
scoring = ScoringFactory.create({"method": "bm25", "terms": True})
scoring.index(((x, text, None) for x, text in enumerate(data)))

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "public health story",
              "war", "wildlife", "asia", "lucky", "dishonest junk"):
    # Get the index of the section that best matches the query
    results = scoring.search(query, 1)
    match = data[results[0][0]] if results else "No results"

    print("%-20s %s" % (query, match))
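
The two sets of keyword queries differ in one important way: the first set shares words with the data, while the second set describes concepts. The sketch below checks that directly by listing the tokens each query has in common with each entry. The `tokenize` and `shared_terms` helpers are hypothetical names for this illustration, and the regex tokenizer is an assumption that may not match the tokenization txtai applies.

```python
import re

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]


def tokenize(text):
    # Simplistic tokenizer (an assumption for this sketch): lowercase runs of word characters
    return set(re.findall(r"\w+", text.lower()))


def shared_terms(query):
    """Return {entry index: shared tokens} for every entry that overlaps with the query."""
    terms = tokenize(query)
    overlaps = {}
    for index, text in enumerate(data):
        common = terms & tokenize(text)
        if common:
            overlaps[index] = common
    return overlaps


for query in ("lottery winner", "feel good story", "war"):
    print("%-20s %s" % (query, shared_terms(query) or "no shared terms"))
```

"lottery winner" overlaps with the lottery ticket entry, but "feel good story" and "war" share no tokens with any entry. A purely token-based index has nothing to match on for those queries, which is the gap semantic search aims to close.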

Semantic search

[Figure: Semantic search overview]

from txtai.embeddings import Embeddings

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})
embeddings.index(((x, text, None) for x, text in enumerate(data)))

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "public health story",
              "war", "wildlife", "asia", "lucky", "dishonest junk"):
    # Get the index of the section that best matches the query
    uid = embeddings.search(query, 1)[0][0]

    print("%-20s %s" % (query, data[uid]))

from txtai.embeddings import Embeddings

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})
embeddings.index(((x, text, None) for x, text in enumerate(data)))

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("lottery winner", "canadian iceberg", "number of cases",
              "rising tensions", "park service"):
    # Get the index of the section that best matches the query
    uid = embeddings.search(query, 1)[0][0]

    print("%-20s %s" % (query, data[uid]))
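
The embeddings examples rank text by how close its vector is to the query vector rather than by shared tokens. Below is a rough sketch of that ranking step, assuming cosine similarity as the measure. The 3-dimensional vectors are made up by hand purely for illustration. They are not output from any model, and real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-picked vectors (NOT real model output)
vectors = {
    "Maine man wins $1M from $25 lottery ticket": [0.9, 0.1, 0.0],
    "US tops 5 million confirmed virus cases": [0.1, 0.9, 0.1],
}

# Hypothetical vector standing in for the embedding of a query such as "feel good story"
query_vector = [0.8, 0.2, 0.1]

# Rank entries from most to least similar to the query vector
ranked = sorted(vectors, key=lambda text: cosine(query_vector, vectors[text]), reverse=True)
print(ranked[0])
```

The entry whose vector points in nearly the same direction as the query vector ranks first, even though the two share no words. In practice, a library such as txtai handles both the vector generation and the similarity search.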

Wrapping up

This article gave a brief introduction to semantic search and how it can help with conceptual search. Keyword search still has its place: it is less computationally intensive and can be quite effective. However, semantic search is rapidly growing in popularity, and models continue to improve in both speed and accuracy. If you haven’t tried semantic search, now is the time to give it a look!

David Mezzetti

Founder/CEO at NeuML — applying machine learning to solve everyday problems. Previously co-founded and built Data Works into a successful IT services company.