Lambda Architecture in Search

A case for how realtime and batch indexing go hand in hand

Mayank Jaiswal
Search & Discovery in 21st Century
4 min readJul 29, 2019

--

Lambda What !?

Lambda Architecture is a term borrowed from the world of Big Data. Nathan Marz came up with this concept when he invented Storm back in 2011. This architecture refers to a certain strategy of designing systems so that you can take advantage of both Batch and Stream Processing.

Wikipedia says: “This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data.”

Question is — How is this relevant in the context of Search ?

TL;DR

For the impatient souls like me, here goes the summary of what’s ahead.

I will make an argument that you need both realtime and batch Indexing in your Architecture. Read “Lambda in Search”

Secondly, I will talk about how to implement both Batch and Realtime Indexing.

Third, I will objectively list Pros and Cons of the two Architecture.

Lambda in Search

Realtime Use Case

Let’s say you are an e-commerce company and your Elasticsearch Cluster has an index called “production” which is serving the production traffic.

If a product goes out of stock, it makes sense to not show this product as a top search result anymore. Hence, you will want to update the in_stock status to False in realtime. This in_stock flag then can be used to decrease it’s relevance while searching. If a new product is added to the catalog, you might want that this product should not show up in Search and in Category Listing immediately. Hence, you will want to add the document in the index in real time.

If a product is permanently removed from the system, you might want to delete the document itself in realtime.

Hence, we need to support CRUD: Create, Read, Update and Delete a document. In realtime.

What’s the catch in Realtime?

One word answer — efficiency.

Problem is that every operation that you might want to do on ES is NOT an O(1) operation. For example, let’s say your document looks like this:
{
“sku”: 12345,
“colour_id”: 7
“colour_name”: “red”
}

For some reason, the colour_id of all the colours in the system is changing. You will have to update nearly all the documents for this change to reflect. Above document will become:

{
“sku”: 12345,
“colour_id”: 17
“colour_name”: “red”
}

You will have to write custom code to update all the document so that they reflect correct colour_id and colour_code mapping. This is just one example, think of any such metadata whose source of truth is some other database. Keeping ES upto data can be realtime can be challenge computationally as well it can lead to a lot of “unnecessary code” which will run very rarely i.e. only in such special cases. Even if you write code to cover all such cases, it will be fragile - it will be easy for you miss a boundary case only to later notified by QA team or worse — your users. Notice that I am not even talking about load on ES; because developer time is more precious!

Whats the way out ? — Batch Indexing.

Batch Indexing

Simply put, let’s say you have two identical indices in your system — “yin” and “yang”. One of them will serve the production traffic at a given time. Let’s call this one as online index. You can do anything to the other one i.e. offline index.

You can take a full dump of your data from other databases, join the information between them and then create an index from scratch. After Indexing, a sanity check can be done on this offline_index. Then start redirecting traffic from the old index to this newly created index.

How does Batch Indexing solve the O(1) problem mentioned above ?

Since the processing is done in batch mode, we can run heavy processes to create a set of documents that will be uploaded. We can bulk-write into this index without creating an impact on online index.

Risks with Realtime Index

  • Realtime index can get corrupted. Ability to quickly create a fresh index gives us a cushion against disasters.
  • Realtime index can get out of sync.
  • Bulk write i.e. O(n) operations can impact production response times due to high I/O on index. For example — change weightage of every document in index. So, non-time-critical can be delayed and done in batch mode instead of realtime mode.

Benefits of Ability to create offline Index

  • Disaster recovery. In case anything goes wrong with production index, you can create a fresh one.
  • A new algorithm for scoring can be implemented on the alternate index. This can power AB testing.
  • Any crazy experiment which has a potential to blow things up — you can do that using batch processing and then gently try it out on production traffic. That’s a superpower in my View!!!

Batch (Offline) Modes

  • Fast-Mode.
    You should be able to create an offline index in a very short time i.e. 10 minutes. This can then serve as a disaster recovery tool.
  • Slow-Mode
    You can have another mode of creating an offline index which is a fully-loaded index but may take time to create. Index creation time is not a prime concern in this case.

Companies that embrace Lambda Architecture

  • LinkedIn — They have a name for this- Galene! Learn more

Other Interesting Search Infrastructure Architectures

Euler Systems

#Shameless-self-promotion

My company provides search and recommendation consultancy as well as implementation. We have built search and recommendations for one of the biggest names in India and abroad. Few customers worth mentioning are — Nykaa, Hopscotch, Simpplr, Disney, Nerdwallet, Gojek and more.

Feel free to get in touch at mayank@euler-systems.com

--

--

Mayank Jaiswal
Search & Discovery in 21st Century

Software Engineer. Studied Computer Science from Indian Institute of Technology, Kharagpur. Work Interests - Search. None Work Interest - Personal Finance.