Finding the best affordable NLP model for semantic search

We have benchmarked popular latent encoder models for k-NN text search. Here are the results.

Jubin Jose
Aquila Network
2 min read · Dec 29, 2021


[Screenshot: Aquila X, Matthew Ball bookmarks]

With today’s advancements in NLP architectures, we can now build consumer-quality semantic search engines that are very good at matching sentences by their meaning. At Aquila Network, we decided to benchmark the popular pre-trained models that are freely available. Here are the results:

[Table: NLP sentence encoder models for semantic search (accuracy benchmark)]

Interpretation of the data above:

  • We’re measuring the retrieval accuracy of a document (a Wikipedia article) for a given query. We used the SQuAD dataset for convenience.
  • Cells marked green yield better accuracy by comparison. We report top-1 | top-3 | top-5 retrieval accuracies, separated by pipes.
  • Normalization has no effect on Fasttext when Annoy’s angular or euclidean metric is used (when the angular metric is chosen, Annoy internally normalizes the vectors and then runs the euclidean metric).
  • Increasing Annoy’s number of trees (from 100) improves accuracy but slows down search; 500 is the maximum affordable in terms of speed.
  • The manhattan metric produces the same results as euclidean, and both differ only slightly (±1%) from angular, which suggests the vectors cluster on an approximately unit-radius noisy sphere.
  • Dot product (vectors must be normalized) is slightly better than all the other metrics for Fasttext.
  • Cleaning the text makes results worse for transformer models.
  • Refer to our post-ranking algorithm here.
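The metric observations above can be verified directly: for unit-norm vectors, euclidean distance is a monotone function of cosine similarity, so cosine (normalized dot) and euclidean rank documents identically, which is also why Annoy's angular metric can be implemented as euclidean-after-normalization. A quick numpy check (illustrative only, not our benchmark code):

```python
import numpy as np

rng = np.random.default_rng(42)
docs = rng.normal(size=(100, 64))   # stand-ins for document embeddings
query = rng.normal(size=64)

# Normalize everything to unit length, as Annoy's angular metric does internally.
docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)

cos_sim = docs_n @ query_n                        # dot == cosine on unit vectors
euclid = np.linalg.norm(docs_n - query_n, axis=1)

# On the unit sphere: ||a - b||^2 = 2 - 2*cos(a, b),
# so euclidean distance is monotone in cosine and both rank identically.
assert np.allclose(euclid ** 2, 2 - 2 * cos_sim)

rank_by_cos = np.argsort(-cos_sim)   # most similar first
rank_by_euclid = np.argsort(euclid)  # closest first
assert np.array_equal(rank_by_cos, rank_by_euclid)
```

This is why, once vectors sit near a unit sphere, the choice among angular, euclidean, and normalized dot barely moves the accuracy numbers.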

Summary:

Our experiments show that MSMarco models yield better accuracy while using resources a personal computer can afford. Accordingly, we’re upgrading our default model from Fasttext to MSMarco in the Aquila Hub v0.1.0 release. Feel free to check it out and send us your feedback.
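For context, the top-1 | top-3 | top-5 accuracies compared above amount to the following measurement. This is a minimal numpy sketch with exact cosine search standing in for the Annoy index; `topk_accuracy` is a hypothetical helper name, not Aquila's actual code:

```python
import numpy as np

def topk_accuracy(query_vecs, doc_vecs, true_ids, ks=(1, 3, 5)):
    """Fraction of queries whose true document appears among the top-k
    nearest neighbours under cosine similarity (exact search here;
    Annoy's angular metric approximates the same ranking)."""
    # Normalize so that dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                      # (n_queries, n_docs)
    ranked = np.argsort(-sims, axis=1)  # best match first
    return {k: float(np.mean([true_ids[i] in ranked[i, :k]
                              for i in range(len(true_ids))]))
            for k in ks}
```

In the benchmark, the query vectors come from encoding SQuAD questions and the document vectors from encoding the corresponding Wikipedia articles; a model scores higher when the article a question was drawn from lands in its top-k results.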
