Introducing txtai, semantic search and workflows built on Transformers

Add Natural Language Understanding to any application

David Mezzetti
NeuML
7 min read · Aug 19, 2020

An updated version of this article is available.

Search is at the core of many applications. Once data starts to pile up, users want to be able to find it. It’s the foundation of the internet and an ever-growing challenge that is never fully solved.

The field of Natural Language Processing (NLP) is evolving rapidly, with a number of new developments. Large-scale general language models are an exciting new capability, letting us quickly add amazing functionality with limited compute and people. Innovation continues, with new models and advancements arriving on what seems like a weekly basis.

This article introduces txtai, an open-source semantic search platform that enables Natural Language Understanding (NLU) based search in any application.

Introducing txtai

txtai is an open-source platform for semantic search and workflows powered by language models.

Traditional search systems use keywords to find data. Semantic search has an understanding of natural language and identifies results that have the same meaning, not necessarily the same keywords.

txtai builds embeddings databases, which are a union of vector indexes and relational databases. This enables vector search with SQL. Embeddings databases can stand on their own and/or serve as a powerful knowledge source for large language model (LLM) prompts.

txtai is open-source (Apache 2.0 licensed) and available on GitHub.

The following is a summary of key features:

  • 🔎 Vector search with SQL, object storage, topic modeling, graph analysis, multiple vector index backends (Faiss, Annoy, Hnswlib) and support for external vector databases
  • 📄 Create embeddings for text, documents, audio, images and video
  • 💡 Pipelines powered by language models that run LLM prompts, question-answering, labeling, transcription, translation, summarization and more
  • ↪️️ Workflows to join pipelines together and aggregate business logic. txtai processes can be simple microservices or multi-model workflows.
  • ⚙️ Build with Python or YAML. API bindings available for JavaScript, Java, Rust and Go.
  • ☁️ Cloud-native architecture that scales out with container orchestration systems (e.g. Kubernetes)

Integrate conversational search, retrieval augmented generation (RAG), LLM chains, automatic summarization, transcription, translation and more.

txtai is built with Python 3.8+, Hugging Face Transformers, Sentence Transformers and FastAPI.

The following applications are powered by txtai.

  • txtchat — Conversational search and workflows for all
  • paperai — Semantic search and workflows for medical/scientific papers
  • codequestion — Semantic search for developers
  • tldrstory — Semantic search for headlines and story text

In addition to this list, there are also many other open-source projects, published research and closed proprietary/commercial projects that have built on txtai in production.

Install and run txtai

The following code snippet shows how to install txtai and create an embeddings model.

pip install txtai

Next, we can create a simple in-memory model with a couple of sample records to try out txtai.

Basic Embeddings Instance

Running the code above will print the following:

Embeddings query output

The example above shows that, for almost all of the queries, the query terms don’t appear in the matched text sections. This is the true power of transformer models over token-based search. What you get out of the box is 🔥🔥🔥!

Build an Embeddings index

For small lists of texts, the method above works. But for larger repositories of documents, it doesn’t make sense to tokenize and vectorize every record for each query. txtai supports building pre-computed indexes, which significantly improves performance.

Building on the previous example, the following example runs an index method to build and store the text embeddings. In this case, only the query is converted to an embeddings vector for each search.

Build an Embeddings Index

Once again the same results will be returned; the only difference is that the embeddings are pre-computed.

Embeddings query output

Save and load an Embeddings index

Embeddings indexes can be saved to disk and reloaded.

Save and load an Embeddings Index

The results of the code above:

Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg

Update and delete from an Embeddings index

Updates and deletes are supported for Embeddings indexes. The upsert operation inserts new data and updates existing data.

The following section runs a query, then updates a value, changing the top result, and finally deletes the updated value to revert to the original query results.

Update and delete from an Embeddings index

The results of the code above:

Initial:      Maine man wins $1M from $25 lottery ticket
After update: See it: baby panda born
After delete: Maine man wins $1M from $25 lottery ticket

With a limited amount of code, we’re able to build a system with a deep understanding of natural language. The amount of knowledge that comes from Transformer models is phenomenal.

Sentence Embeddings

txtai builds sentence embeddings to perform similarity search. txtai takes each text record, tokenizes it and builds an embeddings representation of that record. At search time, the query is transformed into a text embedding and compared to the repository of text embeddings.

txtai supports two methods for creating text embeddings: sentence transformers and word embedding vectors. Sentence Transformers should be used in most cases.

Sentence Transformers

  • Creates a single embeddings vector via mean pooling of vectors generated by the Transformers library.
  • Base models require significant compute (GPU preferred). Smaller models that trade off some accuracy for speed are available.
  • Supports models stored on Hugging Face’s model hub or locally.

Word Embeddings

  • Creates a single embeddings vector via BM25 scoring of each word component.
  • Faster inference times with default models on CPUs.
  • Good option for low-resource languages with limited training data.

Similarity search at scale

As discussed above, txtai uses similarity search to compare a sentence embedding against all sentence embeddings in the repository. The first question that may come to mind is how that would scale to millions or billions of records. The answer is with Approximate Nearest Neighbor (ANN) search. ANN enables efficient execution of similarity queries over a large corpus of data.

A number of robust libraries are available in Python that enable ANN search. txtai has a configurable index backend that allows plugging in different ANN libraries. At this time, txtai supports Faiss, Hnswlib and Annoy.

txtai uses sensible default settings for each of the libraries above, to make it as easy as possible to get up and running. Faiss is the default index backend.
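Switching backends is a configuration change. The sketch below shows the idea; the key names follow txtai's configuration format, but treat the exact values as assumptions to verify against the documentation for your version.

```python
# Configuration sketch: "backend" selects the ANN library
config = {
    "path": "sentence-transformers/all-MiniLM-L6-v2",  # vector model (assumption)
    "backend": "hnsw",  # ANN backend: "faiss" (default), "hnsw" or "annoy"
}

# Pass this dict to Embeddings(config) to build an index on that backend
```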

The libraries above either don’t have a method for associating embeddings with record ids or assume the id is an integer. txtai takes care of that and keeps an internal id mapping, which allows any id type.

Benchmarks for each of the supported systems (and others) can help guide what ANN is the best fit for a given dataset.

Question-Answering

In addition to similarity search, txtai supports question-answering over returned results. This powerful feature enables asking a series of questions against a list of search results.

The following shows how to create a QA component within txtai.

Extractive QA Model

The next step is to load a set of results to ask questions of. The following example has text snippets with sports scores covering a series of games.

Extractive QA Example

Results for the section above.

Extractive QA results

We can see the extractor was able to understand the context of the sections above and answer related questions.

The Extractor component works with a txtai Embeddings index as well as with external data stores. Both extractive question-answering and large language models (LLMs) are supported.

Further reading

More detailed examples and use cases for txtai can be found in the documentation.

Wrapping up

NLP is advancing at a rapid pace. Things that weren’t possible even a year ago now are. This article introduced txtai, an open-source semantic search platform that enables quick integration of robust models with a deep understanding of natural language. Hugging Face’s model hub has a number of base and community-provided models that can be used to customize search for almost any dataset. The possibilities are limitless and we’re excited to see what can be built on top of txtai!

Founder/CEO at NeuML. Building easy-to-use semantic search and workflow applications with txtai.