Similarity Search Engine for Millions of Math Problems

Vlado Galić
Photomath Engineering
May 4, 2022 · 7 min read
Photo source: https://www.pexels.com/photo/magnifying-glass-on-textbook-4494641/

If you are a developer, business person, or tech company curious about lowering costs in processes that require manual review of documents or images, or you are looking for a quick and easy way to build your own text-based similarity search engine, this is the right place for you.

At Photomath we had one such optimization problem, and we solved it with a similarity search engine. The result was a reduction in the time (and money) spent on the math task-solving process: we accelerated the entire task-solving system ~10 times and reduced the cost ~100 times. I know, it sounds unreal, but it’s possible.

Once upon a time…

…there was one problem that needed to be optimized. Suppose we have millions of math word problems to solve and hire a team of mathematicians to solve those problems. In fact, that’s exactly what we did at Photomath. We started naively by solving every math task, but soon the question popped up: Is it possible to optimize the whole process?

Solving each math task costs N money and M time. So what can we, as ML engineers, do to minimize the money and time spent? Drop table math_word_tasks? I’m just kidding, the problem isn’t exactly trivial. Like most things in ML, solving it starts with observing the data distribution. We quickly came to the conclusion that the vast majority of math tasks are repetitive, and that fact gave us plenty of material for optimization.

Tasks #1, #2 and #4 are identical, so they share the same solution

The idea was quite simple: if you find a task identical to an already solved one, you don’t have to solve it again. So for each new task we search for an identical task among those solved so far and ask a human to confirm that the two really are identical. What do we gain from this? Quite a lot:

  • It is much faster to check whether two math tasks are identical than to solve an entire math task
  • We don’t need as many skilled mathematicians to compare math problems as we do to solve them (in fact, we need completely different people for that: data annotators)

Putting these together, the cost of comparing tasks is much lower than the cost of solving one. In this way we save money and speed up the task-solving process.

Therefore, we have built a system that will find pairs of semantically similar tasks and suggest them to our mathematicians as potentially identical tasks.

Similarity search engine

Isaac Newton once said: “If I have seen further it is by standing on the shoulders of Giants”.

Therefore, let us introduce the giants on this project. Since we work with the texts of math problems, we decided to use a transformer architecture, and we found great support in the SentenceTransformers library, our first giant.

Our second giant is FAISS, a popular library for efficient similarity search. By combining these two technologies, we created the system shown in the image below.

Architecture of similar math task search system

Sentence-BERT

The first and most important component of our system is a sentence encoder whose job is to assign an appropriate vector to each math task text. In other words, this model extracts the task’s features and represents them as a vector. There are literally hundreds of ways to extract features (SIFT, ORB, SURF, etc. on the image side), but we chose this approach because it was SOTA at the time and relatively easy to implement.

So what did we do?

We took the pretrained model ‘stsb-roberta-base’ and fine-tuned it on our large corpus of math task texts. Simple as that. We named the resulting model Math-STSB-RoBERTa. If you are building a model like this, there is a collection of all pretrained SBERT models to start from. Training such a model takes about a week on an NVIDIA V100 GPU. A basic example of training such a model is shown in the code below, adapted from the official SBERT website; the library also supports many other models and functionalities, so I highly recommend it.

FAISS

After we assign an appropriate vector to each math task text, it’s FAISS’s turn to shine. FAISS (Facebook AI Similarity Search) is a library developed primarily at Facebook AI Research. It is written in C++ but has Python bindings. Its main purpose is efficient similarity search and clustering of dense vectors. It offers a large collection of indexes and additional functionality such as normalization, quantization, IDMap, etc. FAISS also supports GPUs and on-disk indexes.

We used FAISS to store all our task vectors in a FAISS index and to search that index for similar tasks efficiently. One of our most important criteria was scalability, i.e. that the system can support millions of tasks, and FAISS handles this with ease. The best way to learn FAISS is to write your own small script like the one below; working with the library is relatively simple and the instructions are pretty straightforward.

You can find many more examples, as well as good documentation, on the official FAISS GitHub page.

One step further

Before we move on to the results, let’s go one step further in designing our system. We wanted to see what the system would look like with image features added as extra information, so we put an (ImageNet-)pretrained ResNet34 model in parallel with our Math-STSB-RoBERTa text encoder. We combined and normalized the embeddings at the output of the two models, so the system now looks like this.

Final architecture for feature extraction

In this case it would probably be better to use an architecture similar to CLIP, but it was easier for us to simply add a ResNet in parallel. This is just one interesting way to design such a system; you can be creative in how you combine different architectures, models and approaches to extract features and merge them into the final solution.

And the results are …

Very good. As a comparison, we calculated the recall of the old system (Elasticsearch) and of the new one. For the old system we had to approximate the metric. Can you guess why? Because that system is too slow at the scale of millions of tasks, so computing the exact metric was infeasible.

So the progress is significant: the new system is both much faster and more accurate than the old one. Interpreting the results, there is a 96% probability that at least one of the four proposed tasks is actually identical to the task in the query. Can it get any better than this? Probably yes; maybe in one of the following articles we will reveal how :)
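That 96% figure is recall@4: the fraction of queries for which the true duplicate appears among the top four suggestions. A tiny illustrative computation (the ids below are made up):

```python
import numpy as np

def recall_at_k(retrieved_ids: np.ndarray, true_id: np.ndarray, k: int) -> float:
    """Fraction of queries whose true duplicate appears among the top-k results."""
    hits = (retrieved_ids[:, :k] == true_id[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 4 queries, top-4 retrieved task ids each (hypothetical data)
retrieved = np.array([[7, 1, 3, 9],
                      [2, 5, 8, 0],
                      [4, 4, 6, 1],
                      [3, 2, 7, 5]])
truth = np.array([1, 0, 6, 9])   # ground-truth duplicate id per query

r = recall_at_k(retrieved, truth, 4)
print(r)   # 0.75 — the last query's duplicate (id 9) is not in its top 4
```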

What have we learned

So this system is not only interesting and relatively easy to implement and run; as this article shows, it can be a very powerful production system that saves both time and money. In our case, we accelerated the entire task-solving system ~10 times and reduced the cost ~100 times.

The project went smoothly; we made good decisions (and had some luck) when fine-tuning the model, because we found good parameters relatively quickly. The only area where we faced problems was refining our sentence data.

I learned a lot from this project, especially how to handle this much data quickly without everything falling apart, and that solutions can be relatively simple. Complex architectures are not always a guarantee that something will work better; you should always start with a baseline architecture.

Technology is evolving at a remarkable pace, so there are already many improvements that could make this kind of system even better. Follow us to find out how in future posts.

Like what you’ve read? Learn more about #LifeAtPhotomath and check out our job postings: https://careers.photomath.com/
