3 simple tricks to get the most out of your BERT-based Text Similarity system

To Bert or not to Bert? A practical approach.

Amnon Lotanberg
Analytics Vidhya
10 min read · Jan 20, 2020


Semantic Search, where we compute the similarity between two texts, is one of the most popular and important NLP tasks. It is a crucial instrument in Summarization, Question Answering, Entity Extraction, and more.

Nowadays, it’s easier than ever to build a Semantic Similarity application in a few lines of Python, using pre-trained Language Models (LMs) like Universal Sentence Encoder (USE), Bert, RoBERTa, XLNet and co. But between your proof-of-concept, based on the amazing online example you copied, and deploying to production, you will probably encounter several hurdles lurking in dark corners. In this post I’d like to shed some light on them.

Start with a (great) baseline Textual Similarity system

Let’s take a textbook Python example of a modern Text Similarity (TS) function, following the example set up by Sentence-Transformers:

Installation:
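The package is on PyPI, so a plain pip install should be all you need (it pulls in PyTorch and the transformers library as dependencies):

```
pip install sentence-transformers
```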

We first generate an embedding for all sentences in a corpus:
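A minimal sketch, along the lines of the sentence-transformers example (the corpus sentences here are illustrative, and the model name is one of the package’s pre-trained Bert models; we’ll revisit the model choice below):

```python
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence-embedding model (more on choosing one below)
embedder = SentenceTransformer('bert-base-nli-mean-tokens')

# An illustrative corpus -- replace with your own texts
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A cheetah is running behind its prey.']
corpus_embeddings = embedder.encode(corpus)
```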

Then, we generate the embeddings for different query sentences:
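Continuing the same sketch, with a couple of illustrative queries:

```python
queries = ['A man is eating pasta.',
           'A cheetah chases prey across a field.']
query_embeddings = embedder.encode(queries)
```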

We then use scipy to find the most-similar embeddings for queries in the corpus:
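One way to do that is with scipy’s cosine distance (a sketch, continuing the snippet above):

```python
from scipy.spatial import distance

# For each query, print the 5 corpus sentences with the smallest cosine distance
closest_n = 5
for query, query_embedding in zip(queries, query_embeddings):
    distances = distance.cdist([query_embedding], corpus_embeddings, 'cosine')[0]
    ranked = sorted(enumerate(distances), key=lambda pair: pair[1])

    print('\nQuery:', query)
    print('Top %d most similar sentences in corpus:' % closest_n)
    for idx, dist in ranked[:closest_n]:
        print('  %s (score: %.4f)' % (corpus[idx], 1 - dist))
```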

The output lists each query, followed by its closest corpus sentences and their similarity scores.

So these functions work very well, right out of the box, as advertised.

However, you may still hit some bumps when running this on your own real data: out-of-vocabulary (OOV) issues, time/memory constraints, and, of course, performance issues (i.e., you still get spurious similarities here and there). So let’s consider some potential challenges, and solutions.

“Language, please!”

What’s the problem?

If your function accepts some form of user generated text (a text box, free style web pages, a local DB…), you’re gonna encounter sketchy, misspelled text; the kind your LM didn’t train on. This can cause painful results. Suppose your corpus includes a mixture of product categories, foods and company names (I made up):

Now, suppose a user searches for a legitimate product category like “sleeping bag”. You’d get:

While it’s great that Bert scored the relevant texts higher, why would you want any unrelated proper nouns to appear in the result set at all?

In other words, in most real corpora, there will be many texts that you never want to match, and should therefore try to filter out of the embedding scheme altogether.

Not convinced? Suppose your user typed a misspelled category name like “Sliping bag”. This would thwart the pre-trained LM even more, and you’d get:

Much worse. Now the most relevant result (“beds”) appears at the bottom, after a mediocre hit and three junk hits. You can read all about how wacky LMs behave when presented with texts unlike those they trained on, in this great analysis.

What can we do?

At the very least, we can easily identify and retain only the texts composed of real English dictionary words that aren’t proper nouns. This would weed out all the confusing texts composed of proper names. Without those, the results would look more like:

Voilà, all the most confusing hits are gone, and no search query will ever miss them. Unfortunately, the similarity void they left behind has been filled by unrelated whitelisted words (vegetables, sweet potatoes, french fries). I want to discuss ways to remove those in a future post.

How can we do that?

We could, of course, spell-check our texts and retain only those whose words check out. A good, publicly available option is spacy_hunspell. But we’ve learned that spell checkers, by and large, do not cover enough vocabulary to recognize something like “all proper English words”.

Fortunately, we have a pretty comprehensive list of all English words (lemmas, actually) at our disposal, which we can use as a white list. It’s in the amazing Wiktionary. So, to download all the words in the “English lemmas” category, you can run this code:
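Something along these lines should work. It’s a sketch against the public MediaWiki API (be polite about request rates, and set a descriptive User-Agent if you run it for real):

```python
import requests

# Page through Wiktionary's "Category:English lemmas" via the MediaWiki API
API_URL = 'https://en.wiktionary.org/w/api.php'

def fetch_english_lemmas():
    lemmas = []
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': 'Category:English lemmas',
        'cmlimit': 500,
        'format': 'json',
    }
    while True:
        response = requests.get(API_URL, params=params).json()
        lemmas.extend(m['title'] for m in response['query']['categorymembers'])
        if 'continue' not in response:        # no more pages in the category
            break
        params.update(response['continue'])   # carries the cmcontinue token
    return lemmas

with open('english_lemmas.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(fetch_english_lemmas()))
```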

Or, you can just download my whitelist containing > 570K lemmas. It doesn’t matter much that the list contains a lot of junk. What matters is that most of our domain words (English lemmas) are in it, and nearly all possible junk words, or Out Of Domain (OOD) words, are not.
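With the whitelist in hand, the filtering itself is trivial. Here’s a rough sketch (in_domain is a hypothetical helper, and you may want smarter tokenization than split()):

```python
# Load the whitelist of English lemmas, lowercased for easy matching
with open('english_lemmas.txt', encoding='utf-8') as f:
    whitelist = {line.strip().lower() for line in f if line.strip()}

def in_domain(text, whitelist):
    """Keep only texts whose every token appears in the whitelist."""
    tokens = text.lower().split()
    return bool(tokens) and all(token in whitelist for token in tokens)

# Filter the corpus before embedding it
corpus = [text for text in corpus if in_domain(text, whitelist)]
corpus_embeddings = embedder.encode(corpus)
```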

What if my domain is different/narrower?

Then you might be able to cover most of your domain vocabulary with a few other Wiktionary categories like accounting, sports or climate. Or look to other language resources. If you suffer from many OOD terms in your corpus, it will be worth the effort.

Oh, and don’t forget to give back to Wikipedia :)

Choose the right package and LM for Semantic Similarity, and for your application

This is the more obvious part of your planning, and I’m only getting into it now, because switching frameworks and LMs has proven less impactful on results than the above mechanical tricks.

So what is a good python package for Text Similarity? You may have already guessed that I recommend the above mentioned sentence-transformers, especially with the ‘distilbert-base-nli-stsb-mean-tokens’ LM (which I’ll explain below). I arrived at this conclusion after testing many popular packages and LMs on my Semantic Similarity app, including: averaged GloVe embeddings, FastText (averaged-word- and phrasal- embeddings), Universal Sentence Encoder, Bert as a Service, Spacy Transformers (which support Bert, RoBERTa, XLNet and DistilBert).

One good reason for choosing this package is its ease of use, allowing us to switch the LM in the above example by simply replacing this line:
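That’s the model-construction line from the sketch above:

```python
embedder = SentenceTransformer('bert-base-nli-mean-tokens')
```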

With something like:
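```python
# Same code as before -- only the model name changes
embedder = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
```

Everything else in the example stays the same.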

See the details on DistilBert here.

However, the more important motivation for my choice is that it shines on all of the following criteria, which I suggest you consider.

Best zero-shot performance

A zero-shot setup is one where you apply a pre-trained LM without fine-tuning it on training examples. And zero-shot is what most of us in industry want: to get good inferences, and get them now, without the extra work of tuning.

Notice that in the case of UKPLab’s distilbert-base-nli-stsb-mean-tokens, they did a lot of tuning for us at no extra cost (along with 14 other models), as detailed here. Training on the SNLI, MultiNLI and STS datasets is probably one of the reasons DistilBert works so much better for me in this package than, for instance, the DistilBert model in the Spacy package mentioned above.

How to best evaluate candidate LMs?

You must test their performance on a test set of solved text-similarity cases, like the ones demonstrated above. Ideally you’d compose a test set of hundreds or thousands of examples, so that you could robustly and automatically score each LM (I’d like to write a future post on this). However, if time and/or test examples are scarce, even a few dozen examples will do.
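For a quick-and-dirty comparison, you can measure how well each candidate’s similarity scores correlate with your human judgments. A sketch (the test cases and their gold scores here are made-up placeholders):

```python
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

# Solved test cases: (text A, text B, human similarity in [0, 1]) -- illustrative values
test_cases = [
    ('sleeping bag', 'beds', 0.7),
    ('sleeping bag', 'sweet potatoes', 0.05),
    # ... a few dozen of these already helps
]

def score_model(model_name):
    model = SentenceTransformer(model_name)
    gold, predicted = [], []
    for text_a, text_b, human_score in test_cases:
        emb_a, emb_b = model.encode([text_a, text_b])
        gold.append(human_score)
        predicted.append(1 - cosine(emb_a, emb_b))  # cosine similarity
    # Rank correlation between the model's similarities and human judgments
    return spearmanr(gold, predicted).correlation

for name in ('bert-base-nli-mean-tokens', 'distilbert-base-nli-stsb-mean-tokens'):
    print(name, score_model(name))
```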

So, in my error analysis, most LMs committed painful errors, attributing noticeable similarity to pairs of texts that have nothing to do with each other, and these would look really bad to my users. But the DistilBert LM makes the fewest such errors, and when it does err, it assigns such low similarity scores that I can automatically flag those hits as sketchy and drop them. This was a pleasant, counter-intuitive surprise, since DistilBert is, well, a distilled version of Bert, and gets lower scores on the General Language Understanding Evaluation (GLUE) benchmark. But you go where the data leads.

About comparing similarity scores

Remember that you should interpret similarities, like those in the above example, in terms of the relative rank of a text, i.e.,

sim(X, Y) > sim(X, Z), therefore X and Y are more closely related than X and Z

And you should not infer much from the absolute value, i.e.,

sim(X, Y) = 0.85, therefore X and Y are similar/dissimilar

A score of 0.85 could be high for some pivot texts and low for others. The reason is that similarity functions (usually cosine) give us a distorted signal: given X and Y, some dimensions of their vector embeddings encode more significant differences than others, but cosine gives every dimension uniform weight.

Most lightweight and fast

This should be a no-brainer. If time or memory is a concern (it usually is), then you’ll probably want to use a lightweight LM like ALBERT, Q8Bert or DistilBert. These mini-LMs achieve scores on GLUE in the same ballpark as the larger Bert LMs, so, as our experience shows, you probably won’t notice much difference in performance.

For illustration, here’s the RAM footprint of a few popular packages and models:


On the other hand, if you can afford to throw lots of memory at all your LM instances, or prefer to have one fat, centralized Semantic Similarity server and many thin clients, you could also consider the heavier ones.

Easiest to fine-tune

If you don’t get good enough performance from the start, you may want to adjust the model to your data. Remember that, as a rule of thumb, one model is better than two chained models. Therefore, it’s wiser to try to improve a mediocre model than, say, to train an extra classifier to better interpret the former’s scores or embeddings.

But wanting to fine-tune isn’t enough. You need to compose a text-similarity dataset and write the code to train the model on it, in a <text A, text B, score> task setting. And here again the sentence-transformers package shines as the most user-friendly, thanks to its Siamese network architecture built for the Text Similarity task, as opposed to programming interfaces written with Text Classification in mind, where the input is one text instead of two.
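To give you a feel for it, here’s roughly what fine-tuning looks like in sentence-transformers (a sketch; exact arguments may vary between versions, and the example pairs and labels are placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

# <text A, text B, score> examples, with scores scaled to [0, 1] -- placeholders
train_examples = [
    InputExample(texts=['sleeping bag', 'beds'], label=0.8),
    InputExample(texts=['sleeping bag', 'sweet potatoes'], label=0.1),
    # ... your own labeled pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)  # Siamese setup: two texts in, one similarity out

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1,
          warmup_steps=100,
          output_path='my-tuned-similarity-model')
```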

As for dataset building, this is normally a tedious chore, especially if your application is required to accept user-inputted texts, which can vary greatly. So I want to address this in a future post. For now, I encourage you to experiment and test your strength, as even a few dozen examples can sometimes make a difference, as discussed here.

What doesn’t matter?

Surprisingly, in our experiments it didn’t matter much, performance-wise, which particular LM we tested: Bert, USE, XLNet, DistilBert, Albert or Roberta. The other factors discussed above moved the needle much more.

Summary

If I’ve done my job well, you’ve learned some important dos and don’ts that will help you productionize your Semantic Similarity function right away, such as:

  1. Filter the domain/vocabulary inputted to your function, to avoid garbage-in-garbage-out situations
  2. Choose the right package and LM, suited for your problem, production environment, fine-tuning and maintenance needs. These are more important than choosing the highest-scoring model.

Happy searching!
