From Solr to Elasticsearch — A story by TheFork

Telmo Gomes
TheFork Engineering Blog
7 min read · Sep 21, 2021

A long time ago in a company far far away…

Actually, it was a few months ago, and TheFork already operates in a lot of countries, so it should not be that far away.

We are going to take you through our journey of discovery and learning, and our battle to defeat an existing slow search solution that was making our restaurants unhappy.

What do we need to search for?

As you might know (or maybe not), TheFork allows a customer to book a restaurant. Alongside this B2C app, we provide our restaurants with a B2B app that lets the restaurant user manage the reservations made by their clients. One important aspect of this app is that it lets restaurants know who their customers are, so a feature where the restaurant can search for them is something they value.

Given this, it is clear we are dealing with two related entities: Reservation and Customer.

We use a microservice architecture, and each of these entities lives in its own microservice.

When a restaurant user wants to search for a customer or a reservation, we want to display the results as fast as possible, so certainly not from a database but from a dedicated search instance.

The old ways (Solr)

We had an instance of Solr indexing customers and reservations every time the user created a reservation or added a new customer.

Customer search on our B2B application

So why did we want to move away from Solr? We had suffered two outages, the index size was huge (~256 GiB), and response times were too high (a mean of ~500 ms), which led to other functional issues.

We also had a lot of new functional needs and a desire to start fresh. Most of the developers who worked on the previous search solution had already left the company, so the existing solution and the decisions behind it were no longer clear to the rest of us.

For instance, we had a lot of unneeded fields in our index that were there just for retrieval. Most of these fields were constant boolean values like isOnline or hasOffer. We already had these fields in our database, and if they were not searchable, we did not need to duplicate them. So why were those fields there?

Yes, we could have removed the fields from Solr, but we were not confident it would improve anything at all, and, as we said, our desire to start fresh kept growing.

The feature itself was not working perfectly: it took something like 3 minutes to index a customer or reservation. Just imagine the restaurant user adding a reservation and having to wait 3 minutes before being able to search for it! Of course the restaurant user could not wait, so when creating a new reservation they would end up creating a duplicate customer instead. We definitely had to improve this…

Things were getting worse and worse, so eventually we ran a Solr optimization. Things did improve, but not even close to what we needed.

So we dreamed of a brand new index with only the essential fields, a fresh new technology to learn, and, on top of this, one major goal: to be faster.

What to do?

At TheFork we work in 6-week implementation cycles. After each cycle, we have 2 weeks of cooldown (controlling tech debt and shaping up the next cycle).

During the shaping phase, stakeholders decide what to do in the next cycle in the form of bets. From a pool of pitches, a bet is made on one of them, and that pitch goes into the next cycle to be built.

In our case, a bet was made and the following OKRs were defined:

OKRs for the new search bet on TFM3

Learning with Elasticsearch

But there was a tiny issue: we had no one with knowledge of this technology, and our knowledge of the current behavior of the project was very limited. Ah, and we were almost forgetting: we had 6 weeks to complete the project.

The first thing we did, how surprising, was to study the Elasticsearch 101 through online courses. We learned that we had to create an index to store documents, following a mapping we would have to define. Then we would have to define a search query to retrieve relevant documents from this index.
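In practice, that basic workflow boils down to three calls. Here is a minimal sketch using the official Python client (7.x era); the index name, field names, and document are illustrative, not our actual production setup.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Create an index whose mapping describes the documents it will store.
#    (Index and field names are illustrative.)
es.indices.create(index="customers", body={
    "mappings": {
        "properties": {
            "firstName": {"type": "text"},
            "lastName": {"type": "text"},
        }
    }
})

# 2. Index (store) a document that follows the mapping.
es.index(index="customers", id="1", body={"firstName": "Jane", "lastName": "Doe"})

# 3. Search the index for relevant documents.
results = es.search(index="customers", body={"query": {"match": {"lastName": "doe"}}})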

As soon as we understood the basics, we took a hands-on approach and sketched the first mapping. As we said, we wanted a slim approach: store only the fields the search would use, and nothing else. We came up with a quite simple mapping: 6 or 7 fields, with only one nested type.
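We never published the actual mapping, so the sketch below is a hypothetical reconstruction in that spirit: a handful of searchable fields and a single nested type (here, phone numbers). All field names are illustrative.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical slim customer mapping: only searchable fields,
# roughly half a dozen properties, one nested type.
slim_mapping = {
    "mappings": {
        "properties": {
            "restaurantId": {"type": "keyword"},
            "firstName": {"type": "text"},
            "lastName": {"type": "text"},
            "email": {"type": "keyword"},
            "createdAt": {"type": "date"},
            # The single nested type: each phone becomes its own hidden
            # sub-document, so its number and type stay paired when queried.
            "phones": {
                "type": "nested",
                "properties": {
                    "number": {"type": "keyword"},
                    "type": {"type": "keyword"},
                },
            },
        }
    }
}

es.indices.create(index="customers-v1", body=slim_mapping)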

Now, what kind of search should we build? Well, to answer this, we defined some relevance rules: we typed some example queries and defined what the order of the results should be. Once we had this specification, we started to work on a search query that gave us the documents we searched for in the correct order, just as we had previously defined.

Relevance rules with fake data by our awesome PM

We did not have much trouble nailing it. It was mainly about understanding the types of queries, how to search within nested types, the n-grams topic, how to match either prefixes or full names, scoring with the right gauss function, and playing around with some weights. We were doing well.
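To make those ingredients concrete, here is a hedged sketch of the kind of query they add up to; our actual production query differs. It assumes the name fields carry a sub-field analyzed with edge n-grams for prefix matching, uses a nested query for phones, weights fields via boosts, and applies a gauss decay that favors recently created customers.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

query = {
    "query": {
        "function_score": {
            "query": {
                "bool": {
                    "should": [
                        # "lastName.ngram" is assumed to use an edge_ngram
                        # analyzer, so "dup" matches "Dupont" as a prefix;
                        # boosts weight name matches over phone matches.
                        {"match": {"lastName.ngram": {"query": "dup", "boost": 3}}},
                        {"match": {"firstName.ngram": {"query": "dup", "boost": 2}}},
                        # Nested fields require a dedicated nested query.
                        {
                            "nested": {
                                "path": "phones",
                                "query": {"term": {"phones.number": "33612345678"}},
                            }
                        },
                    ],
                    "minimum_should_match": 1,
                }
            },
            # Gauss decay: the further createdAt is from "now",
            # the lower the final score.
            "functions": [
                {"gauss": {"createdAt": {"origin": "now", "scale": "365d", "decay": 0.5}}}
            ],
            "boost_mode": "multiply",
        }
    }
}

results = es.search(index="customers-v1", body=query)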

Then eventually we did some cheating: we scheduled a meeting with people outside the company who could help us improve what we already had. We did some prep: we wrote a brief document explaining our decisions on both the index and the search, and delivered it to them before the meeting. This saved us some time in a very well-paid meeting…

Surprise, surprise (or maybe not): they found some mistakes in our approach. We took the time not only to discuss our mistakes, but also to learn about mechanisms of Elasticsearch that we either didn't know about or only had a vague notion of how they worked. So it was not only about finding issues and fixing them, or finding more suitable solutions; it was also about validating and gaining confidence in what we were building. Like a peer review. So if you can afford it, totally go for it. Return on time invested: 5/5.

An example customer on our Elasticsearch index

Then we explored the options the meeting had opened up for us and made some improvements, not only in the search but also in the mapping itself. The next step was to simulate the real environment and see how the search performed.

Then came the introspection phase: we had to think about what was happening.

Remember we said we didn’t want to duplicate data into our Elasticsearch index if it was not searchable? Well, we ended up doing it for some of the fields. Why?

Our search results are displayed in a list format, and although some fields are not searchable, they are fetched from the database and displayed in this list. We found out that it wasn't Elasticsearch that was being slow, but rather this fetching from databases, which live across other microservices, that was slowing everything down. To fix this, we had to include this data in our index, even though we never search on it. We were pretty selective about it, so we ended up with only 2 or 3 additional fields. Still, nothing compared to the big mapping we previously had on the Solr instance.
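Elasticsearch makes this possible without bloating the search structures: a field mapped with "index": false is kept in the stored document, so it can be returned with each hit, but it cannot be searched on. A sketch of the idea, with illustrative field names rather than our actual additions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Add display-only fields to the existing index: kept in _source for
# rendering the results list, but not searchable.
es.indices.put_mapping(
    index="customers-v1",
    body={
        "properties": {
            "lastReservationDate": {"type": "date", "index": False},
            "reservationCount": {"type": "integer", "index": False},
        }
    },
)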

Yes, it was not only this… we also fixed some more flaws in our search, but all of them were the kind of mistakes you would expect from newbies, and none of them is worth mentioning.

Results

So after all these fixes, did we get better performance overall?

Comparison of response times

We completely overachieved our goal! We were able to get to a median call time of 19 ms!

Here are some more details on our new customer search call times:

searchCustomers response times

And here is a quick comparison of the new reservation index vs. our old one on Solr:

On 27/07 we migrated fully to Elasticsearch 💯
Oh, so sweet response times after the migration 🎉

We can see that this bet was a total success!

We went from knowing nothing about how our search implementation worked to completely changing our search infrastructure to a different technology and greatly improving its performance.

What did we learn?

We hypothesised that there was a better technology to achieve what we were already doing. We considered some aspects, selected what we thought was most suitable for us, learned it, and tried to apply it.

We could have failed, but in this case, it turned out to be the right move, and we could not be happier about it.

This article was written in collaboration with the great Francisca Teixeira.
