So, You Want to Scan a Billion Files in a Day?

How to Build a Searchable File Index at Scale

How many files do you have on your computer? How much total space do they take up? Can you, armed with only the file name, find a specific file that you worked on 3 years ago? If your disk is getting full, which files would you delete or move first, in order to free up space?

Chances are that you know where to go to answer these questions, especially if you have file management software. These tools usually build a metadata index: a data structure that maps file names to metadata such as access time, modified time, and file size, and that supports aggregation and filter queries over this metadata. For example, a metadata index could quickly answer the question “what is the total size of all .mp3 files in this directory?”. Unfortunately, the algorithms that work well for millions of files on a laptop don’t do as well for billions of files in a datacenter.
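To make that concrete, here is a minimal, purely illustrative sketch in Go of an in-memory metadata index and one aggregation query over it. The FileMeta type and totalSizeByExt helper are made-up names for this example, not part of any particular tool:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
	"time"
)

// FileMeta is the kind of record a metadata index stores for each file.
type FileMeta struct {
	Path     string
	Size     int64
	Modified time.Time
	Accessed time.Time
}

// totalSizeByExt answers "what is the total size of all .mp3 files in this
// directory?" by filtering on directory prefix and extension, then summing.
func totalSizeByExt(index []FileMeta, dir, ext string) int64 {
	var total int64
	for _, m := range index {
		if strings.HasPrefix(m.Path, dir) && filepath.Ext(m.Path) == ext {
			total += m.Size
		}
	}
	return total
}

func main() {
	index := []FileMeta{
		{Path: "/music/a.mp3", Size: 5 << 20},
		{Path: "/music/b.mp3", Size: 7 << 20},
		{Path: "/docs/c.pdf", Size: 1 << 20},
	}
	fmt.Println(totalSizeByExt(index, "/music/", ".mp3")) // prints 12582912 (12 MiB)
}
```

A laptop tool can get away with a linear scan like this; the rest of this post is about what changes when the index holds billions of entries.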

At Igneous, we build data-management software for enterprise customers, and it turns out that they have many of the same questions. And just as with personal computing, there are dozens of tools that a systems administrator can use to answer questions about the files in their datacenter.

However, when our customers need to scan and index datacenters with billions of files and petabytes of data, existing tools simply don’t scale. Here are some of the challenges with building a metadata index at scale:

  • Scanning data is slow and resource-intensive. Most existing tools need to be deployed on physical hardware inside the datacenter in order to scan large systems. That is a pain to set up and run: systems administrators don’t want more hardware to manage, and installing hardware at multiple datacenters around the world is often not possible.
  • Even after scanning, creating an index can take days or weeks when dealing with billions of files. The data structures used by many tools either sacrifice write speed or involve a long post-processing phase. By the time your index is ready to query, the data within it is already outdated.
  • Complex queries over very large indices can take a long time and use a lot of memory. To keep response times low, an application must be able to break a query into many parts that can run in parallel (see the sketch below).
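To illustrate that last point, here is a rough sketch of the scatter-gather pattern in Go: the same query runs concurrently against independent shards of an index, and the partial results are merged at the end. The shard layout and the countMatches/fanOutQuery helpers are hypothetical stand-ins, not DataDiscover internals:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// countMatches is a stand-in for the per-shard work: scan one partition of
// the index and return a partial result (here, a count of matching names).
func countMatches(shard []string, ext string) int {
	n := 0
	for _, name := range shard {
		if strings.HasSuffix(name, ext) {
			n++
		}
	}
	return n
}

// fanOutQuery runs the same query against every shard concurrently and then
// merges the partial counts, so latency is bounded by the slowest shard
// rather than by the sum of all shards.
func fanOutQuery(shards [][]string, ext string) int {
	partials := make([]int, len(shards))
	var wg sync.WaitGroup
	for i, shard := range shards {
		wg.Add(1)
		go func(i int, shard []string) {
			defer wg.Done()
			partials[i] = countMatches(shard, ext)
		}(i, shard)
	}
	wg.Wait()

	total := 0
	for _, p := range partials {
		total += p
	}
	return total
}

func main() {
	shards := [][]string{
		{"a.mp3", "b.txt"},
		{"c.mp3", "d.mp3"},
	}
	fmt.Println(fanOutQuery(shards, ".mp3")) // prints 3
}
```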

Our engineering team at Igneous loves solving such challenges, so we got to work. We came up with Igneous DataDiscover, a system capable of scaling to the largest of enterprise data centers. DataDiscover scans the files in a data center to create a searchable index of all the file metadata that it can find.

At a high level, here’s how it works:

  1. A customer has many different NAS (Network-Attached Storage) devices, like NetApp, Isilon, and Pure FlashBlade, in their data center. They are running low on space, and the systems administrator (admin) is tasked with moving unused data to the cloud in order to avoid having to buy more storage. Without knowing which data is safe to move offsite, it’s very hard for the admin to build a strategy.
  2. The admin learns about Igneous DataDiscover, and signs up for a trial. After downloading a VM image from our website, they set it up in their environment.
  3. When the VM boots, it registers itself with the Igneous Cloud and downloads software from it (Step 1 in the diagram above). Then, it requests user permission to access storage devices.
  4. The VM runs an ultra-fast, parallelized scan of the customer data using purpose-built user-space NFS and SMB clients written in Go (golang). We use Go for almost everything because of its excellent concurrency primitives, and using our own clients lets us control and optimize scan performance. (Step 2 in the diagram, and more detail in post 2 of this series)
  5. Metadata is written to a Log-Structured Merge (LSM) tree, a data structure built for fast writes and sorted output. Small batches of metadata are uploaded to the cloud from the VM, where a Table Creation service merges them into an LSM tree (see the sketch after this list). (Step 3 in the diagram, and more detail in post 3 of this series)
  6. After all metadata is written to the cloud, an Indexing Service steps in to create an index. The index allows searching over the metadata — by modify/access time range, extension, name, etc. (Step 3 in the diagram, and more detail in post 4 of this series)
  7. Within minutes to hours, the admin is greeted with a visualization of their billion-file datacenter, served by querying the metadata index (Step 5 in the diagram). They can search for a particular file, or run a query for specific questions, like “how much of my imaging data has not been read in the last year?”. They can also explore their datacenter in a file browser and see a heatmap describing usage patterns for each directory.
DataDiscover showing a file explorer view over 1.4 billion files, broken down by last modified time
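Step 5 is worth a closer look. The sketch below illustrates the general LSM idea rather than the actual Table Creation service: small in-memory batches are flushed as sorted, immutable runs, and a later pass merges those runs into one large sorted table. Entry, flushBatch, and mergeRuns are hypothetical names used only for this example:

```go
package main

import (
	"fmt"
	"sort"
)

// Entry is one piece of file metadata, keyed by path.
type Entry struct {
	Path string
	Size int64
}

// flushBatch sorts an in-memory batch and emits it as an immutable sorted
// run, the basic building block of an LSM tree.
func flushBatch(batch []Entry) []Entry {
	run := make([]Entry, len(batch))
	copy(run, batch)
	sort.Slice(run, func(i, j int) bool { return run[i].Path < run[j].Path })
	return run
}

// mergeRuns merges already-sorted runs into one sorted table, the way an
// LSM compaction (or a table-creation step) would.
func mergeRuns(runs [][]Entry) []Entry {
	var out []Entry
	idx := make([]int, len(runs)) // read position within each run
	for {
		// Simple k-way merge: repeatedly take the smallest remaining head.
		best := -1
		for r := range runs {
			if idx[r] >= len(runs[r]) {
				continue
			}
			if best == -1 || runs[r][idx[r]].Path < runs[best][idx[best]].Path {
				best = r
			}
		}
		if best == -1 {
			return out // every run is exhausted
		}
		out = append(out, runs[best][idx[best]])
		idx[best]++
	}
}

func main() {
	run1 := flushBatch([]Entry{{Path: "/b", Size: 2}, {Path: "/a", Size: 1}})
	run2 := flushBatch([]Entry{{Path: "/c", Size: 3}})
	for _, e := range mergeRuns([][]Entry{run1, run2}) {
		fmt.Println(e.Path, e.Size) // prints /a, /b, /c in sorted order
	}
}
```

In a streaming implementation, the merge only ever needs to look at the head of each run, which is what makes this kind of approach attractive when the output is billions of rows.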

The end result is DataDiscover, an experience that feels a lot like Apple Spotlight (Cmd+Space on a Mac) or Windows Explorer’s search feature — except that, instead of searching one computer, it’s searching an entire datacenter (or even multiple datacenters owned by the same company).

Behind the scenes, there’s a lot of awesome software powering this and ensuring that the system is scalable and performant. The rest of this series will go into more detail about how we made all of this possible. Links will be added as the other stories are published. Read:

1. What is DataDiscover? (this post)

2. How we improved NFS performance by 50x with one weird trick

3. You won’t believe how we efficiently exported a billion sorted rows!

4. Search 101: Google, but for inside your datacenter

5. How to scale up by dumbing down: growing a search index from 100 million to 100 billion files
