On Making A Multilingual Search Engine
In my previous posts I covered the theory behind semantic search, so I thought: why not also write some starter code for a multilingual search engine, one that understands the semantics of a language and doesn't need any machine translation engine?
Want to know more about semantic search?
I am using the following components for the POC:
Model — Multilingual Universal Sentence Encoder
Vector search — FAISS
Data — Quora Question Pairs dataset from Kaggle
You can read more about USE in this paper. It supports 16 languages.
STEP 1. LOAD DATA
Let's first read the data. Since the Quora dataset is large and encoding all of it would take a long time, we will use only 1% of it, roughly 4,000 questions, which takes around 3 minutes to encode and index.
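The loading step might look like the sketch below. It assumes the Kaggle CSV has been downloaded as `train.csv` with `question1`/`question2` columns (as in the dataset); the `load_questions` helper name and the exact sampling logic are my own.

```python
import pandas as pd

def load_questions(path: str, frac: float = 0.01, seed: int = 42) -> list:
    """Read the Quora Question Pairs CSV and return a sample of unique questions."""
    df = pd.read_csv(path)
    # Take a small random fraction so encoding stays fast.
    sample = df.sample(frac=frac, random_state=seed)
    # Pool both question columns into one flat list of distinct strings.
    questions = pd.concat([sample["question1"], sample["question2"]])
    return questions.dropna().drop_duplicates().tolist()
```

Calling `load_questions("train.csv")` on the full Kaggle file then yields roughly 4,000 unique questions at the 1% fraction.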
STEP 2. CREATE ENCODER
Let's create encoder classes that load a model and expose an encode method. I have written classes for several models which you can use. All of them work with English, but only the multilingual USE works with other languages.
USE encodes text into a fixed-size vector of 512 dimensions.
I am using TF Hub to load USE and Flair to load BERT.
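A minimal sketch of the USE wrapper is below. The TF Hub handle is the published multilingual USE v3 URL; the optional `model` constructor argument is my own addition so the class can be exercised without downloading the model. One gotcha: `tensorflow_text` must be imported before loading, because it registers the SentencePiece ops the model needs.

```python
import numpy as np

class USEEncoder:
    """Multilingual Universal Sentence Encoder loaded from TF Hub."""

    MODEL_URL = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"

    def __init__(self, model=None):
        # `model` can be any callable mapping a list of strings to an
        # (n, 512) array; by default the real model is pulled from TF Hub.
        if model is None:
            import tensorflow_hub as hub
            import tensorflow_text  # noqa: F401 -- registers SentencePiece ops
            model = hub.load(self.MODEL_URL)
        self._model = model

    def encode(self, texts: list) -> np.ndarray:
        """Encode a list of sentences into a (len(texts), 512) float32 matrix."""
        return np.asarray(self._model(texts), dtype=np.float32)
```

Because the model takes raw strings in any of its 16 languages, the same `encode` call works for English, Hindi, French, and so on.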
STEP 3. CREATE INDEXER
Now we will create a FAISS indexer class that stores all the embeddings and supports fast vector search.
STEP 4. ENCODE AND INDEX
Let's create embeddings for all the questions and store them in FAISS. We also define a search method that returns the top-k most similar results for a given query.
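As a library-free illustration of this step, here is the same encode-then-search flow with plain NumPy cosine similarity standing in for FAISS. The `encode` callable is whatever encoder you built in step 2; `build_search` is a name I made up for this sketch, and it assumes no embedding is the zero vector.

```python
import numpy as np

def build_search(encode, questions: list):
    """Encode all questions once and return a search(query, k) function."""
    matrix = np.array(encode(questions), dtype=np.float32)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

    def search(query: str, k: int = 5):
        q = np.array(encode([query]), dtype=np.float32)[0]
        q /= np.linalg.norm(q)
        scores = matrix @ q  # cosine similarity via dot product
        top = np.argsort(-scores)[:k]
        return [(questions[i], float(scores[i])) for i in top]

    return search
```

Encoding everything up front is the expensive part; each subsequent query only costs one encoder call plus a matrix-vector product.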
STEP 5. SEARCH
Below we can see the model's results. We first ask a question in English and get the expected results. We then translate the query into other languages using Google Translate, and the results remain strong. Even though I misspelled 'lose' as 'loose', the model still understands the query, since it works at the subword level and is contextual.
As you can see, the results are impressive enough that this approach is worth considering for production.
Want more?
Find different models for encoding text for semantic search over here!