Building a Multilingual (Cross-Language) Semantic Search Engine using Cohere

Building a multilingual semantic search engine is an old problem in NLP that used to take a lot of work to solve. With the emergence of the latest Large Language Models (LLMs), building one has become simple and efficient.

Ahmad Anis

Published in Red Buffer

9 min read · Jul 3, 2023

The Idea of Semantic Search

An example of semantic search using Cohere.

Traditional search works on the basis of “matching”. You input a query, and it matches the query keywords against the items in the database and returns the items with the most matches. While this is super fast, the approach has a very big problem: it does not take the meaning or the context of the input into account. Let’s take an example:

Query: Famous Person

Dataset:

Well known personality
Popular personality
A person who is not famous

If the search works on the basis of keyword matching, it is going to match the keywords with the 3rd item in the dataset, which conveys the exact opposite of what we originally wanted to search for in our query.

Results: “A person who is not famous”
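The failure above can be reproduced with a few lines of Python. This is only a minimal sketch of keyword matching, assuming a naive scorer that counts shared words; the helper name keyword_score is made up for illustration and real keyword search engines are more sophisticated.

```python
def keyword_score(query: str, document: str) -> int:
    """Count how many query words also appear in the document (case-insensitive)."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    return len(query_words & document_words)


query = "Famous Person"
dataset = [
    "Well known personality",
    "Popular personality",
    "A person who is not famous",
]

# Rank documents by the number of shared keywords, highest first.
ranked = sorted(dataset, key=lambda doc: keyword_score(query, doc), reverse=True)
best_match = ranked[0]
print(best_match)  # A person who is not famous
```

The two items that actually mean “famous person” share no words with the query, so they score 0 and lose to the item with the opposite meaning.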

The idea of semantic search is to compare the semantics or the understanding of the given query with the semantics or the information that is available in the database.

In the above example, “Famous Person” has similar semantics to “Well known personality” or “Popular personality”, hence semantic search will match the query with these results instead of “A person who is not famous”, which has the totally opposite meaning.

The problem of semantic search comes under the field of

How Semantic Search Works

Semantic search uses the idea of text embedding, which can be understood as an intelligent representation of a piece of text that can be a word, a sentence, a paragraph, or multiple paragraphs. This intelligent representation is a list of numbers of a specific size. So the sentence “A person who is not famous” will be converted to a sequence of numbers, also known as a vector, of some specific size such as [1, 512] or [1, 1024], etc.

We train a neural network that learns these representations and places them in a vector space, which is a home for embeddings. All the sentences/words/paragraphs which are “semantically” related live together as neighbors. Those which are semantically most similar to the query are ranked best in the search. So the sentences “Well known personality” and “Popular personality” are going to reside close together in the vector space, while our 3rd sentence “A person who is not famous” is going to be far away from those 2 because its semantics are completely different.

A common way to perform semantic search is:

1. Calculate the embeddings of all the texts in your database (1 time compute).
2. Each time a new query comes, calculate the embedding of the query.
3. Perform a similarity search between the input query and the database and return the closest matching results.
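The three steps can be sketched end to end as follows. This is a rough illustration, not a real semantic search: toy_embed is a hypothetical stand-in for an embedding model that only hashes character trigrams, so it captures spelling overlap rather than meaning. The size DIM and the function names are assumptions made for this sketch.

```python
import zlib

import numpy as np

DIM = 64  # arbitrary size for the toy vectors


def toy_embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a real embedding model.

    Hashes character trigrams into a fixed-size, unit-length vector. It captures
    spelling overlap only, not meaning; swap in a real model for semantic results.
    """
    vec = np.zeros(DIM)
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        vec[zlib.crc32(padded[i:i + 3].encode("utf-8")) % DIM] += 1.0
    return vec / np.linalg.norm(vec)


documents = ["Well known personality", "Popular personality", "A person who is not famous"]

# Step 1: embed the whole database once.
doc_matrix = np.stack([toy_embed(doc) for doc in documents])  # shape [3, DIM]


def search(query: str) -> list[str]:
    # Step 2: embed the incoming query.
    query_vec = toy_embed(query)
    # Step 3: score every document and return them from closest to farthest.
    scores = doc_matrix @ query_vec
    return [documents[i] for i in np.argsort(scores)[::-1]]


print(search("popular personality"))
```

The structure stays the same when toy_embed is replaced by a real model: the database is embedded once, and only the query is embedded at search time.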

An example of a semantic search represented as a scatter plot, where each dot is an embedding vector of some size.

Multilingual Search

The idea of multilingual search is simple: as the name suggests, it means searching across multiple languages. But the real question is, how are you going to do so? A traditional way is to use word mapping (map all possible words to their equivalents in the other language) and then apply the same keyword-matching approach. This approach has a lot of problems.

Word-to-word Translation is not Accurate:
Word-to-word translation is often used in practice, and the results can be very bad. Google Translate fails are a well-known source of examples.
The Problem of Semantics:
Matching keywords (even if the translation is accurate) is not the best approach as it does not consider the understanding or semantics of the keywords as discussed above.

How Multilingual Semantic Search Works

Multilingual semantic search works similarly to how semantic search works, but the difference is in the training data for the neural network and how the embeddings are trained. The input data contains multiple languages and is organized in such a way that the network learns to place semantically similar texts really close together in the vector space, regardless of language.

Here is an example of the dataset:

{
"id": 28,
"language": "en",
…
"PRODUCT_TYPE": "pants",
…
},
{
"id": 28,
"language": "ru",
…
"PRODUCT_TYPE": "брюки",
…
}

In this example, we have the same item with the same id in multiple languages. When our neural network (a Large Language Model in these types of problems) is being trained, it is told that these 2 examples with the same id are to be treated as the same, hence the model learns similar embeddings across languages. There are a lot more details involved in training a multilingual model, which are out of the scope of this article.
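To make the “same id means same meaning” idea concrete, here is a small sketch that groups records by id and lists the text pairs a model could be asked to treat as equivalent. It only illustrates the pairing step, not how any particular model is actually trained. The record with id 31 is invented for the example, and the elided fields of the dataset are left out.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records in the same spirit as the dataset excerpt above.
records = [
    {"id": 28, "language": "en", "PRODUCT_TYPE": "pants"},
    {"id": 28, "language": "ru", "PRODUCT_TYPE": "брюки"},
    {"id": 31, "language": "en", "PRODUCT_TYPE": "shirt"},
]

# Collect all texts that share an id.
by_id = defaultdict(list)
for record in records:
    by_id[record["id"]].append(record["PRODUCT_TYPE"])

# Every pair of texts with the same id is a candidate "these mean the same" pair.
positive_pairs = [
    pair
    for texts in by_id.values()
    for pair in combinations(texts, 2)
]
print(positive_pairs)  # [('pants', 'брюки')]
```

The record with id 31 has no translation, so it contributes no pair.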

Multilingual Semantic Search using Cohere

Cohere is a company that is focusing on the power of language understanding, creating super cool Large Language Models that we can simply use via an API.

Cohere gives us the multilingual-22-12 model, which is focused on multiple languages (it supports over 100 languages) and works really well compared to other models by Google and UKPLab (Sentence Transformers).

Step 1: API Key

Cohere provides a free trial API key which gives 100 API calls per minute. You can sign up for a paid account if you want more.
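Rather than pasting the key into your code, one option is to read it from an environment variable. The sketch below assumes a variable named COHERE_API_KEY, which is a name chosen for this example and not something the article or the library prescribes. The helper load_api_key is hypothetical as well.

```python
import os


def load_api_key(env=os.environ, name: str = "COHERE_API_KEY") -> str:
    """Return the API key stored under `name`, or raise a clear error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable to your API key.")
    return key
```

You would then call api_key = load_api_key() before creating the client, which keeps the key out of notebooks and version control.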

Step 2: Installations

You can install the libraries required using pip.

$ pip install cohere
$ pip install numpy
$ pip install pandas
$ pip install altair
$ pip install umap-learn

Step 3: Dataset

We will go with a dummy dataset which contains examples from multiple languages. Let’s create a CSV file with a few examples. It has 2 sentences,
1. Islamabad is the capital of Pakistan.
2. Delhi is the capital of India.
each written in three different languages (Urdu, Spanish, and Arabic).

These are answers to a few questions, hence we named the column answers.

answers
پاکستان کا دارالحکومت اسلام آباد ہے
انڈیا کا دارالحکومت دہلی ہے
Islamabad es la capital de Pakistán
Delhi es la capital de la India
إسلام أباد هي عاصمة باكستان
ديلهي هي عاصمة الهند

Step 4: Dataset Creation

Cohere provides a few endpoints that we can use via the API; each endpoint is linked to a model or service that performs a different task. The embed endpoint can be used to calculate the embeddings of the dataset. English-only models perform really well on English text, and for multiple languages we can use multilingual models. We are going to use the multilingual-22-12 model, one of the best multilingual models out there, to calculate the embeddings.

import pandas as pd

df_multi = pd.read_csv('dataset.csv')
df_multi = df_multi[['answers']]  # Remove extra columns

df_multi.head()

We can calculate the embeddings using the co.embed endpoint. We have to pass the text column as a list to this endpoint.

import cohere

api_key = "XXXXXXXXXXXXXXXXXXXXXXXXXXXX"  # Get your own API key

co = cohere.Client(api_key)

# Pass the texts by keyword; the response carries one embedding per input text
embeddings = co.embed(texts=list(df_multi["answers"]), model="multilingual-22-12").embeddings

The embed endpoint returns a list of embeddings of shape [n, 768], where n is the number of examples in your input list. Each 768-dimensional vector encapsulates the semantics of one input text.

import numpy as np

embeddings_np = np.array(embeddings)  # Convert the list to a numpy array
print(embeddings_np.shape)  # (6, 768) for our 6 examples

Step 5: Visualize Embeddings

Visualizing embeddings is really interesting and super useful for building intuition about your input data. We are going to use the umap library to compress the embeddings down to 2 dimensions and altair for an interactive visualization.

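import umap

# The default n_neighbors (15) is larger than our 6 examples, so set it explicitly
reducer = umap.UMAP(n_neighbors=5)

umap_embeds = reducer.fit_transform(embeddings_np)  # Shape [6, 2]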

Now we will create a dataframe of the 2 columns that we obtained via umap and visualize the x and y-axis.

import altair as alt

df_explore = pd.DataFrame(data={'answers': df_multi['answers']})
df_explore['x'] = umap_embeds[:,0]
df_explore['y'] = umap_embeds[:,1]

chart = alt.Chart(df_explore).mark_circle(size=60).encode(
    x= alt.X('x',
        scale=alt.Scale(zero=False)
    ),
    y= alt.Y('y',
        scale=alt.Scale(zero=False)
    ),
    tooltip=['answers']
).properties(
    width=700,
    height=400
)
chart.interactive()

Each dot in the scatter plot represents one text example (6 in our case), that is, a single embedding calculated by our multilingual LLM. If you look closely, there is a specific pattern in the examples, which you can explore by hovering over the dots and seeing how related examples sit together in the vector space.

Step 6: Semantic Search

Now that you have embeddings, you can simply calculate the similarity between the query embedding (calculated the same way as the other embeddings) and our dataset embeddings.

A simple and widely used similarity measure is Cosine Similarity, which measures how closely 2 vectors point in the same direction. If 2 vectors reside close together in the vector space, their cosine similarity will be higher than that of vectors which are far apart.

In NumPy and PyTorch, the “@” operator computes the dot product (matrix multiplication). For vectors normalized to unit length, the dot product equals the cosine similarity; for unnormalized vectors it is still a usable similarity score, and it is what we use here. Our query vector has shape [1, 768] and our dataset vectors have shape [6, 768]. To multiply them in vectorized form, the number of columns of the first matrix must equal the number of rows of the second (High School Linear Algebra 😄). Hence, we are going to take the transpose of the dataset matrix.

query = "What is the capital of India?"
query_embedding = co.embed(texts=[query], model='multilingual-22-12').embeddings

results = np.array(query_embedding) @ embeddings_np.T  # Shape [1, 768] @ [768, 6] = [1, 6]

Each element in the resultant vector [1, 6] is going to show the similarity of the query with all the 6 items in the database.
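The scores computed with “@” are dot products, which coincide with cosine similarity only when the vectors have unit length. The sketch below shows the difference on made-up 2D vectors chosen to keep the arithmetic easy; the helper cosine_similarity is written for this example and is not part of the article's code.

```python
import numpy as np


def cosine_similarity(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """Cosine similarity between each row of `query` [1, d] and each row of `docs` [n, d]."""
    query_unit = query / np.linalg.norm(query, axis=1, keepdims=True)
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return query_unit @ docs_unit.T  # shape [1, n]


# Made-up 2D vectors, chosen only to make the arithmetic easy to follow.
query = np.array([[3.0, 4.0]])
docs = np.array([[6.0, 8.0],    # same direction as the query, twice as long
                 [4.0, -3.0]])  # perpendicular to the query

print(query @ docs.T)                  # raw dot products: 50 and 0
print(cosine_similarity(query, docs))  # direction only: 1 and 0
```

Both scores rank the first vector highest here, but the dot product also grows with vector length, while cosine similarity depends on direction alone.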

We can get the indices that sort these scores using the argsort function from numpy, and then view the dataset items ordered from the best match to the worst.

sorted_index = np.argsort(results)[0][::-1]  # Indices sorted by similarity, in descending order
df_multi['answers'].iloc[sorted_index].reset_index(drop=True)

So the top 3 examples returned by our semantic search are the correct answer (Delhi is the capital of India) in all 3 languages (Urdu, Spanish, and Arabic), while the next 3 are not relevant (since we don’t have any other correct examples).
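If you only need the best few matches, the sort-and-index step can be wrapped in a small helper. This is a sketch with a hypothetical function name, top_k, demonstrated on made-up 3D vectors rather than real embeddings.

```python
import numpy as np


def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, texts: list[str], k: int = 3):
    """Return the k (text, score) pairs with the highest dot-product score."""
    scores = doc_matrix @ np.ravel(query_vec)   # shape [n]
    best = np.argsort(scores)[::-1][:k]         # indices, highest score first
    return [(texts[i], float(scores[i])) for i in best]


# Made-up 3D vectors standing in for real embeddings.
texts = ["doc A", "doc B", "doc C", "doc D"]
doc_matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([[1.0, 0.2, 0.0]])

print(top_k(query_vec, doc_matrix, texts, k=2))
```

With the variables from this article, the call would look like top_k(query_embedding, embeddings_np, list(df_multi["answers"])).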

Now we can change the query and see the other results as well. This time, we are going to use a query that shares no keyword at all with our examples. We are going to write it in Hebrew, hence there won't be any chance of keyword matching.

We are going to use the same piece of code again.

# Hebrew for "What is the capital of Pakistan"
query = "מהי בירת פקיסטן"

query_embedding = co.embed(texts=[query], model='multilingual-22-12').embeddings

results = np.array(query_embedding) @ embeddings_np.T

sorted_index = np.argsort(results)[0][::-1]  # Indices sorted by similarity, in descending order

df_multi['answers'].iloc[sorted_index].reset_index(drop=True)

And now you can see the results. The top 3 examples are the right answers even though there is no chance of keyword matching, due to the semantic search.

If you visualize the Hebrew embedding together with our dataset, you’ll find it very near to the right answers in the vector space (that’s something for you to do).
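As a starting point for that exercise, the query embedding can be appended to the dataset embeddings before running the dimensionality reduction. The helper add_query_point and the "dataset"/"query" labels are assumptions made for this sketch, and random vectors stand in for real embeddings so that the snippet runs without an API key.

```python
import numpy as np


def add_query_point(doc_embeds, doc_texts, query_embed, query_text):
    """Append the query embedding to the dataset embeddings and label each row."""
    points = np.vstack([np.asarray(doc_embeds), np.asarray(query_embed).reshape(1, -1)])
    texts = list(doc_texts) + [query_text]
    kinds = ["dataset"] * len(doc_texts) + ["query"]
    return points, texts, kinds


# Random stand-ins with the article's shapes: 6 dataset vectors and 1 query vector.
rng = np.random.default_rng(0)
points, texts, kinds = add_query_point(
    rng.normal(size=(6, 768)), [f"answer {i}" for i in range(6)],
    rng.normal(size=(1, 768)), "query",
)
print(points.shape)  # (7, 768)
```

From there, you can pass points to the same reducer as before and use the kinds labels as a color channel in the chart to make the query stand out.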

Learning Outcomes

In this article, you gained an idea of:

Drawbacks in traditional keyword-based search.
Semantic Search and how it works.
Multilingual Semantic Search and how it works.
Performing Multilingual Semantic Search using a Large Language Model by Cohere.

Ahmad Mustafa Anis is a Machine Learning Engineer at Red Buffer. You can reach out to Ahmad on Twitter and LinkedIn.
