Semantic search on the cheap

For the runtime-performance-aware developer

David Mezzetti
NeuML
4 min read · Oct 20, 2021


Photo by Marc Sendra Martorell on Unsplash

Transformer networks have dramatically changed the landscape of Natural Language Processing (NLP) over the last couple of years. The release of BERT has transformed search for the better, leading to a host of new semantic-driven applications. Data can now be found by topic, concept or similarity instead of purely by keyword.

Innovative models are being released at a blistering pace, with different architectures and better scores against the benchmarks. The models are almost always bigger networks, with billions of parameters, requiring more and more GPU power. These models are extremely expressive, dynamic and can be fine-tuned to solve a multitude of problems.

This is great when it’s needed. But what if our problem is straightforward? Do we need a model with so much generalized knowledge? What if we don’t have unlimited cloud compute budgets or access to large GPU networks? Maybe we care about power consumption and are energy conscious.

This article will cover building a sentiment-based semantic search index using txtai and simple, fast machine learning models.

Install txtai

txtai is an open-source platform for semantic search and workflows powered by language models. While txtai has robust support for Transformers models and is built with that in mind, it also has support for less complex models. More background on txtai can be found in the article below.

txtai can be installed from PyPI via pip as follows. This also installs the datasets package along with txtai's optional pipeline and similarity components.

pip install txtai[pipeline,similarity] datasets

Define the model

The section below defines a simple PyTorch model to compute semantic similarity.
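A minimal sketch of such a model, assuming an nn.EmbeddingBag layer that pools token embeddings into a single vector feeding a linear classifier, could look like this:

from torch import nn

class Simple(nn.Module):
    def __init__(self, vocab, dimensions, labels):
        super().__init__()

        # Embeddings layer: averages token embeddings into one vector per input text
        self.embedding = nn.EmbeddingBag(vocab, dimensions)

        # Linear classifier: maps the pooled embedding vector to label scores
        self.classifier = nn.Linear(dimensions, labels)

    def forward(self, indices, offsets):
        return self.classifier(self.embedding(indices, offsets))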

This model contains two layers, an embeddings layer and a linear classifier. Next let’s train this model on some data!

Train the model

The code below trains the model using the emotion dataset. The emotion dataset contains sentences labeled with six basic emotions: anger, fear, joy, love, sadness, and surprise.
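A rough training loop, assuming the Simple model above, a naive whitespace tokenizer and the Hugging Face emotion dataset (the tokenizer and hyperparameters here are illustrative assumptions), could look like this:

import torch

from datasets import load_dataset

# Load the emotion dataset (text and label columns, labels 0-5)
dataset = load_dataset("emotion")

# Naive whitespace tokenizer and vocabulary built from the training split
def tokenize(text):
    return text.lower().split()

vocab = {"<unk>": 0}
for row in dataset["train"]:
    for token in tokenize(row["text"]):
        vocab.setdefault(token, len(vocab))

def encode(texts):
    # Flatten token ids and track per-text offsets for nn.EmbeddingBag
    indices, offsets = [], [0]
    for text in texts:
        ids = [vocab.get(token, 0) for token in tokenize(text)]
        indices.extend(ids)
        offsets.append(offsets[-1] + len(ids))
    return torch.tensor(indices), torch.tensor(offsets[:-1])

model = Simple(len(vocab), 128, 6)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(5):
    for start in range(0, len(dataset["train"]), 32):
        batch = dataset["train"][start:start + 32]
        indices, offsets = encode(batch["text"])

        optimizer.zero_grad()
        loss = criterion(model(indices, offsets), torch.tensor(batch["label"]))
        loss.backward()
        optimizer.step()

# Measure accuracy on the validation split
with torch.no_grad():
    indices, offsets = encode(dataset["validation"]["text"])
    predictions = model(indices, offsets).argmax(dim=1)

accuracy = (predictions == torch.tensor(dataset["validation"]["label"])).float().mean()
print("Accuracy = ", round(accuracy.item(), 4))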

Accuracy =  0.883

88% accuracy, not bad for such a simple model. Now that the model is trained, we’ll isolate just the embeddings layer for semantic search.

Build an Embeddings index

Now let’s use the model to build an embeddings index and test a couple queries.
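One way to wire the trained embeddings layer into txtai is its external vectors method. The sketch below assumes a txtai version that supports that method; the data list is illustrative.

import numpy as np
import torch

from txtai.embeddings import Embeddings

def transform(texts):
    # Vectorize text with the trained embeddings layer (mean-pooled token embeddings)
    with torch.no_grad():
        indices, offsets = encode(texts)
        return model.embedding(indices, offsets).numpy().astype(np.float32)

# Illustrative data to index
data = ["What a cute picture", "Glad you found it", "Happy to see you",
        "I'm angry", "That is so troubling", "That's upsetting",
        "Never thought I would see that", "Didn't see that coming",
        "A shocking development right now"]

# Build the index with externally computed vectors
embeddings = Embeddings({"method": "external", "transform": transform})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

for query in ("happy thoughts", "mad", "wow"):
    print(query)
    for uid, score in embeddings.search(query, 3):
        print(data[int(uid)], score)
    print()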

happy thoughts
What a cute picture 0.6930891275405884
Glad you found it 0.6846731305122375
Happy to see you 0.6515571475028992

mad
I'm angry 0.766613781452179
That is so troubling 0.7032073736190796
That's upsetting 0.2715720236301422

wow
Never thought I would see that 0.4499872624874115
Didn't see that coming 0.42482495307922363
A shocking development right now 0.39596471190452576

Solid results. The embeddings layer looks to do a good job of understanding basic sentiment! Note that the queries and results often don’t have the same words in them but words with similar meaning (i.e. semantic similarity).

No neural networks

Let’s say PyTorch is still too much for us and we’d rather not rely on anything that performs best on GPUs. That can be done! Next we’ll look at using a scikit-learn model to rank text semantically.
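A sketch of this approach using a scikit-learn pipeline (TF-IDF features feeding logistic regression, trained on the same emotion dataset) could look like this:

from datasets import load_dataset

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

dataset = load_dataset("emotion")

# TF-IDF features feeding a logistic regression classifier
classifier = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("lr", LogisticRegression(max_iter=1000))
])

classifier.fit(dataset["train"]["text"], dataset["train"]["label"])

print("Accuracy =", classifier.score(dataset["validation"]["text"], dataset["validation"]["label"]))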

Accuracy = 0.8595

Once again, not bad: 86% accuracy. txtai can use standard text classification models for similarity queries. The only caveat is that the queries must be pre-canned (determined at model training time).

Next we’ll run similarity queries for a couple of the trained labels.
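The ranking idea can be illustrated directly with scikit-learn by scoring the indexed text against the predicted probability of a label. The label order below follows the emotion dataset; the exact txtai similarity wiring in the notebook may differ.

# Label order from the emotion dataset
labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]

for label in ("joy", "anger", "surprise"):
    print(label)

    # Score every indexed text against the selected label and print the top 3
    scores = classifier.predict_proba(data)[:, labels.index(label)]
    for index in scores.argsort()[::-1][:3]:
        print(data[index], scores[index])
    print()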

joy
What a cute picture 0.9270861148834229
Glad you found it 0.9023486375808716
Happy to see you 0.8416115641593933

anger
I'm angry 0.9789760708808899
Didn't see that coming 0.2017391473054886
That's upsetting 0.16476769745349884

surprise
That is so troubling 0.04044938459992409
That's upsetting 0.03875105082988739
Never thought I would see that 0.030828773975372314

Not quite as good as the simple embeddings model but not too bad either. Remember that this is just a simple TF-IDF + Logistic Regression model!

This model can be put on top of a traditional search system to filter or re-rank results based on sentiment. Additionally, the same methodology can be applied to a different dataset with different labels, opening up a lot of possibilities.

Performance

Everything up to this point has been results driven: how good is the output? But we already know that complex Transformers models will outperform any of the methods discussed here. The upside of these simpler methods should be better runtime performance. Let’s validate that.
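A sketch of the timing portion of the benchmark, run over the 2000-item validation split, could look like the following; the Transformers model path is a hypothetical placeholder for a text classification model fine-tuned on the emotion dataset.

import time
import torch

from transformers import pipeline

texts = list(dataset["validation"]["text"])
print("Testing speed of", len(texts), "items")

# Hypothetical placeholder for a Transformers model fine-tuned on emotion
transformer = pipeline("text-classification", model="path/to/emotion-model")

start = time.time()
transformer(texts)
print("Transformers time =", time.time() - start)

start = time.time()
with torch.no_grad():
    indices, offsets = encode(texts)
    model(indices, offsets)
print("PyTorch time =", time.time() - start)

start = time.time()
classifier.predict_proba(texts)
print("TF-IDF + Logistic Regression time =", time.time() - start)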

Testing speed of 2000 items
Transformers time = 17.116953372955322
PyTorch time = 2.049492835998535
TF-IDF + Logistic Regression time = 1.1091313362121582

The benchmark trains a Transformer text classification model on the emotion dataset and tests the performance of all the models discussed here. As expected, the simpler methods are faster: the simple PyTorch model is over 8 times faster and the TF-IDF + Logistic Regression model is over 15 times faster! The Transformer model trained here is actually on the small side; the performance delta is even larger with many other models.

Wrapping up

This article demonstrated how to build a simple sentiment-based embeddings index. Make no mistake, these models will not generalize past the data they were trained on and will not capture sentiment as well as a more complex model. But they are an option and a template for building task-specific models that may be good enough, with the upside of significantly better runtime performance. 🚀

Code based on these methods can be found in the following notebook.
