On Making A Multilingual Search Engine

Pratik Bhavsar | @nlpguy_ · Published in Modern NLP · Dec 20, 2019

In my previous posts I have been talking theory around semantic search, so I thought: why not also write some starter code for a multilingual search engine, one that understands the semantics of a language and doesn't need any machine translation engine.

Want to know more about semantic search?

I am using the following components for the POC:

Model — Multilingual Universal Sentence Encoder

Vector search — FAISS

Data — Quora Question Pairs from Kaggle

You can read more about USE in this paper. It supports 16 languages.

STEP 1. LOAD DATA

Let's first read the data. Because the Quora dataset is huge and takes a long time to process, we will take only 1% of the data, which comes to about 4,000 questions and takes around 3 minutes to encode and index.
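A minimal sketch of this step, assuming the train.csv file from the Kaggle Quora Question Pairs dataset; the file path, sampling fraction, and variable names are illustrative rather than the exact original code.

```python
import pandas as pd

# Read the Quora question pairs and keep roughly 1% of the rows
df = pd.read_csv("train.csv")
df = df.sample(frac=0.01, random_state=42)

# Collect unique, non-empty questions to index
questions = df["question1"].dropna().unique().tolist()
print(f"Questions to index: {len(questions)}")
```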

STEP 2. CREATE ENCODER

Let's make encoder classes that load a model and expose an encode method. I have created classes for several models which you can use. All of the models work with English, but only multilingual USE works with other languages.

USE encodes text into a fixed-size vector of 512 dimensions.

I am using TF Hub to load USE and Flair to load BERT.
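Here is a rough sketch of the multilingual USE encoder, loaded from TF Hub. The class name and interface are my own choices, not the exact original code; note that tensorflow_text must be imported to register the ops the multilingual model needs.

```python
import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers ops needed by multilingual USE)

class USEMultilingualEncoder:
    """Encodes text into 512-dimensional vectors with multilingual USE."""

    def __init__(self):
        self.model = hub.load(
            "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"
        )

    def encode(self, texts):
        # Returns a float32 array of shape (len(texts), 512)
        return np.asarray(self.model(texts))
```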

STEP 3. CREATE INDEXER

Now we will create a FAISS indexer class which stores all the embeddings efficiently for fast vector search.
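A minimal FAISS indexer sketch. I assume inner-product search over L2-normalized vectors, which is equivalent to cosine similarity; the class and method names are illustrative.

```python
import faiss
import numpy as np

class FaissIndexer:
    """Stores embeddings in a flat FAISS index and searches by cosine similarity."""

    def __init__(self, dim=512):
        self.index = faiss.IndexFlatIP(dim)  # exact inner-product index
        self.texts = []

    def add(self, embeddings, texts):
        vectors = np.asarray(embeddings, dtype="float32")
        faiss.normalize_L2(vectors)          # inner product equals cosine after this
        self.index.add(vectors)
        self.texts.extend(texts)

    def search(self, query_embedding, k=5):
        query = np.asarray(query_embedding, dtype="float32")
        faiss.normalize_L2(query)
        scores, ids = self.index.search(query, k)
        return [(self.texts[i], float(s)) for i, s in zip(ids[0], scores[0])]
```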

STEP 4. ENCODE AND INDEX

Let's create embeddings for all the questions and store them in FAISS. We also define a search method which returns the top-k most similar results for a query.
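Putting the two sketched classes above together; the batch size here is an arbitrary choice to keep memory in check.

```python
encoder = USEMultilingualEncoder()
indexer = FaissIndexer(dim=512)

# Encode and index the questions in batches
batch_size = 256
for start in range(0, len(questions), batch_size):
    batch = questions[start:start + batch_size]
    indexer.add(encoder.encode(batch), batch)

def search(query, k=5):
    """Return the top-k questions most similar to the query."""
    return indexer.search(encoder.encode([query]), k=k)
```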

STEP 5. SEARCH

Below we can see the model's results. We first write a question in English and it gives the expected results. Then we translate the query into other languages using Google Translate, and the results are great again. Even though I misspelled 'lose' as 'loose', the model still understands the query, since it works at the subword level and is contextual.
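For illustration, the queries might look something like this; the exact questions and translations from the original screenshots are not reproduced here.

```python
# English query, with the 'loose'/'lose' misspelling mentioned above
print(search("How do I loose weight fast?"))

# The same query translated with Google Translate
print(search("Comment perdre du poids rapidement ?"))   # French
print(search("¿Cómo puedo bajar de peso rápido?"))      # Spanish
```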

As you can see, the results are impressive enough that the model is worth putting into production.

Want more?

Find different models for encoding text for semantic search over here!
