How to scale up by dumbing down

Sudarshan Muralidhar
Nov 16 · 6 min read

Growing a search index from 100 million to 100 billion files

This article is part of a series about building Igneous Data Discover, a searchable file index. Click here for the overview.

Go to the first post in this series for an overview of this diagram

Reading and Transporting Metadata

The first major parallelization decision is how to break the data down into manageable chunks. Each chunk becomes its own queryable metadata index and can be processed independently of everything else. At one extreme, we could treat the entire data center as a single entity and create just one index, but this doesn’t scale well, because we would be unable to split the total work across many machines.
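
To make the chunking idea concrete, here is a minimal sketch in Python. It assumes chunks correspond to volumes such as /data/volume1, identified by the first two path components; that depth, the function names, and the sorted-list "index" are all illustrative assumptions, not a description of how Data Discover actually works.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath


def chunk_key(path, depth=2):
    """Return the first `depth` directory components, e.g. /data/volume1.

    Assumption: every file lives at least `depth` directories deep.
    """
    parts = PurePosixPath(path).parts  # ('/', 'data', 'volume1', 'a.txt')
    return str(PurePosixPath(*parts[: depth + 1]))


def split_into_chunks(paths, depth=2):
    """Group file paths by the chunk (volume) they belong to."""
    chunks = defaultdict(list)
    for path in paths:
        chunks[chunk_key(path, depth)].append(path)
    return dict(chunks)


def build_index(paths):
    """Hypothetical stand-in for index construction: a sorted path list."""
    return sorted(paths)


def index_all(paths, depth=2, workers=4):
    """Build one independent index per chunk, several chunks at a time."""
    chunks = split_into_chunks(paths, depth)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            key: pool.submit(build_index, members)
            for key, members in chunks.items()
        }
        return {key: future.result() for key, future in futures.items()}
```

Because no chunk depends on any other, the thread pool here could just as well be a fleet of separate machines, which is the property the single-index extreme lacks.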

Microservices in the Cloud

In the cloud, we have broken the process of indexing and querying data into several components, each handled by a separate microservice. For the unfamiliar, the microservice pattern involves creating separate, independently deployed services (for example, executable binaries in Docker images), each of which performs a single subtask and interacts with the others to perform the overarching task. In contrast, a monolith is a single service that does all the work itself.

Credit: imgur
The Indexing, Querying, and Namespacing Services allow us to process in parallel

Indexing

Our indexing service is in charge of creating search indices. As described in earlier posts, it receives and combines data into a sorted LSM tree. Then, it uses this tree to build a search index. Each indexing service instance can create a few indices at a time; if we have many indices to create, we can spin up more instances to go faster.
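
The two steps above (combine incoming data into sorted order, then build an index from it) can be sketched roughly as follows. This is a simplified, in-memory illustration of an LSM-style merge, not the actual Igneous implementation: the record shape (path, size), the sequence numbers, and the file-suffix "search index" are assumptions made for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter
from pathlib import PurePosixPath


def make_run(batch, seq):
    """Sort one incoming batch of (path, size) records into a run.

    `seq` is the batch's arrival order. It is stored negated so that,
    for the same path, the newest record sorts first.
    """
    return sorted((path, -seq, size) for path, size in batch)


def merge_runs(runs):
    """K-way merge of sorted runs, keeping only the newest record per path."""
    merged = heapq.merge(*runs)
    for path, group in groupby(merged, key=itemgetter(0)):
        _, _, size = next(group)  # first entry in the group is the newest
        yield path, size


def build_suffix_index(records):
    """Toy search index: map each file suffix to its paths, in sorted order."""
    index = {}
    for path, _size in records:
        index.setdefault(PurePosixPath(path).suffix, []).append(path)
    return index
```

For example, merging a first batch containing /v/b.txt with size 10 and a later batch containing /v/b.txt with size 20 yields a single record for that path with size 20, since the later batch wins.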

Querying

After indices are created, the indexing service is no longer responsible for them. Instead, a separate querying service is in charge of opening the indices and performing queries.

Namespacing

The catch with our strategy of breaking the data down into per-volume indices is that, eventually, we’ll need to recombine the data. If we have a file system containing /data/volume1 and /data/volume2, we will create a separate index for each. A query to one index will return results only for the volume it represents. However, users will still want to be able to perform a global query on /data, meaning that we will need to merge results from multiple indices.

A query to the Namespacing service is split into many parallel queries to the Querying Service nodes
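
This fan-out-and-merge pattern (often called scatter-gather) can be sketched as below. The VolumeIndex class is a hypothetical in-memory stand-in for a Querying Service node, and the substring match is a placeholder for a real search; neither reflects the actual service API.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


class VolumeIndex:
    """Hypothetical stand-in for one per-volume index on a querying node."""

    def __init__(self, root, paths):
        self.root = root
        self.paths = sorted(paths)

    def query(self, substring):
        """Return matching paths in sorted order."""
        return [p for p in self.paths if substring in p]


def indices_under(indices, prefix):
    """Select the indices whose volume root is at or below `prefix`.

    Simplification: a prefix that points inside a volume matches nothing.
    """
    prefix = prefix.rstrip("/")
    return [
        ix for ix in indices
        if ix.root == prefix or ix.root.startswith(prefix + "/")
    ]


def namespace_query(indices, prefix, substring, workers=8):
    """Query every relevant index in parallel, then merge the sorted results."""
    targets = indices_under(indices, prefix)
    if not targets:
        return []
    with ThreadPoolExecutor(max_workers=min(workers, len(targets))) as pool:
        partials = list(pool.map(lambda ix: ix.query(substring), targets))
    return list(heapq.merge(*partials))
```

With indices for /data/volume1 and /data/volume2, a call such as namespace_query(indices, "/data", "report") queries both volumes concurrently and returns one combined, sorted result list, while a query on /data/volume2 touches only that one index.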

Conclusion

Our parallelizable architecture allows us to grow to enormous scale before we hit bottlenecks. For our largest customers, we are able to process tens or hundreds of billions of files within a weekend. This enables us to reindex customer data at least once a week in order to keep our visualizations fresh. And because of our parallelized query architecture, we can respond to most queries within seconds, keeping our UI responsive and usable.

Acknowledgements

Thank you to Lily Bowdler and Carolyn Hughes, without whom this series would not have been possible. Data Discover represents the collaborative work of many people across the Igneous Engineering team and wider organization. To learn more about it or try it yourself, check out our website.

Nerd For Tech

From Confusion to Clarification

Sudarshan Muralidhar

Written by

Software engineer at Igneous. Cofounder of Upbeat Music App. I do cloud things.

Nerd For Tech

We are tech nerds because we believe in reinventing the world with the power of Technology. Our articles talk about some of the most disruptive ideas, technology, and innovation.
