How to build a scalable backend to query and search billions of records?

Saiful Khan
Elucidata
May 5, 2022

Every day, Elucidata’s Polly ingests gigabytes (GB) of biomolecular data and transforms it into an analysis-ready format for biologists around the world. This rate of data ingestion presents us with problems of scale and storage. We have recently written about how Elucidata handles biomolecular data at scale with highly effective ETL pipelines. In this article, we shall look at how engineers at Elucidata handled the challenges of storage and figured out a way to store data effectively and make it consumable.

Historically, Polly’s data platform has relied solely on Elasticsearch to store consumable data, and it has worked well. Response times for most queries were swift. We were even storing large numerical matrices, serialized as documents, in Elasticsearch. The AWS Elasticsearch service (now the OpenSearch service) lets users run SQL queries on indices through the SQL plugin.
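
To make this concrete, here is a minimal sketch of what such a query might look like through the Open Distro SQL endpoint; the domain URL, index name, and field names are placeholders, not Polly’s actual schema.

```python
import requests

# Hypothetical example: run a SQL query against an Elasticsearch/OpenSearch
# domain through the SQL plugin's REST endpoint. The domain URL, index name,
# and field names below are placeholders.
ES_ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"

sql = """
SELECT dataset_id, disease, organism
FROM dataset_metadata
WHERE organism = 'Homo sapiens'
LIMIT 50
"""

# Open Distro exposes SQL at /_opendistro/_sql; newer OpenSearch versions
# also accept /_plugins/_sql.
response = requests.post(
    f"{ES_ENDPOINT}/_opendistro/_sql",
    json={"query": sql},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```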

Moreover, Elasticsearch, just like other NoSQL data stores, does not require a predefined schema for data indexing. But there is a flip side to this feature. Elucidata deals with a lot of semi-structured data coming from heterogeneous sources, yet we eventually ended up choosing something that mandated the definition of schemas before starting ingestion. We slowly realized that Elasticsearch has several limitations that make it unsuitable to serve as our query engine. But before we go further, we need to explain what data means in this context.

What do we mean when we say “data”?

A majority of data on Polly exists in the form of datasets. A typical dataset comprises the following:

  • A single set of dataset-level metadata
  • A set of n samples, each with its own metadata
  • A set of m features measured per sample, each feature with its own metadata
  • An (n × m) data matrix storing the actual measurements

Each entity’s metadata is a dictionary of key-value pairs.

Layout and contents of a typical dataset

For most Polly datasets, the data matrix constitutes more than 95% of the file contents and information volume. One final concept to remember is “repositories,” which are collections of datasets. Datasets in a repository share some common attributes, such as a common origin (source), the same data type, or even the presence of a particular tissue.
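
To make the layout concrete, here is a rough sketch, in Python, of how such a dataset could be represented; the field names are illustrative and not Polly’s actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List

Metadata = Dict[str, str]  # each entity's metadata is a dictionary of key-value pairs

@dataclass
class Dataset:
    dataset_metadata: Metadata        # a single set of dataset-level metadata
    sample_metadata: List[Metadata]   # one dictionary per sample (n entries)
    feature_metadata: List[Metadata]  # one dictionary per feature (m entries)
    data_matrix: List[List[float]]    # the (n x m) matrix of actual measurements

# A "repository" is simply a collection of datasets sharing some attribute,
# e.g. the same source, the same data type, or the same tissue.
Repository = List[Dataset]

example = Dataset(
    dataset_metadata={"dataset_id": "DS-0001", "organism": "Homo sapiens"},
    sample_metadata=[{"sample_id": "S1", "tissue": "liver"}],
    feature_metadata=[{"feature_id": "TP53", "type": "gene"}],
    data_matrix=[[12.7]],
)
```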

Our requirements

Ultimately, our users need to consume the data present in these datasets. And we have three primary use cases here:

  • Search metadata to find datasets and samples of interest.
  • Query the metadata of datasets, samples, and features using SQL.
  • Query the data matrices of datasets and features of interest using SQL.

We need an API that accepts full-text search (FTS) queries and an API that accepts SQL queries. The data these queries run on is already in the terabytes and is growing every day.
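
For a feel of the two interfaces, the sketch below pairs a full-text search query with the kind of SQL a user might run; the index, table, and field names are hypothetical.

```python
# Hypothetical examples of the two consumption interfaces described above.

# 1. Full-text search on metadata (Elasticsearch query DSL) to find datasets
#    and samples of interest.
fts_query = {
    "query": {
        "multi_match": {
            "query": "hepatocellular carcinoma liver",
            "fields": ["description", "disease", "tissue"],
        }
    }
}

# 2. SQL on metadata and on a data matrix (table and column names are
#    illustrative, not Polly's actual schema).
metadata_sql = """
SELECT dataset_id, organism, disease
FROM sample_metadata
WHERE tissue = 'liver'
"""

matrix_sql = """
SELECT sample_id, measurement
FROM data_matrix_ds_0001
WHERE feature_id = 'TP53'
"""
```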

Before you guess whether we addressed all of these requirements with just Elasticsearch, let’s first take a look at its limitations.

Elasticsearch scaling issue

For the past few months, Polly has been ingesting hundreds of datasets every day. Some of these datasets have hundreds of thousands of features. The combined size of our Elasticsearch indices quadrupled (from 1 TB to 4 TB) very quickly. At Elucidata, we anticipated that this trend would continue.

Usually, cloud storage is not very expensive. But with the AWS Elasticsearch service (now the OpenSearch service), there is a limit on the size of the EBS volume one can attach to each node of a given instance class. When that limit is reached, users either have to increase the number of nodes or upgrade the nodes to a more powerful and costlier instance class.

Typically, an Elasticsearch cluster scales with the surge in data to seamlessly handle the increasing search load. But in Elucidata’s case, data volumes were increasing at an alarming rate while a majority of the data matrices ended up never being queried. A user would typically narrow down to a small number of datasets of interest and only then fetch the measurement data in their matrices. We concluded that keeping the data matrices in Elasticsearch was sub-optimal.

This meant that while we were scaling our Elasticsearch cluster horizontally or vertically to meet storage demands, we were not actually using the compute that came with it. We were paying for both storage and compute when we needed only one.

Limitations in querying

Elasticsearch has a reputation for being an excellent search engine, with the added advantage of vast community support. Various plugins augment the abilities of an Elasticsearch cluster, enabling indices to be queried using SQL as if they were traditional database tables. The engineers at Elucidata leveraged these plugins to query Elasticsearch using SQL. Elasticsearch served both the FTS and SQL APIs until we realized that Elasticsearch is an FTS engine at heart and is not the best choice for SQL workloads. A few of the challenges we faced while running SQL queries:

  • Fetching more than 10,000 rows for any non-trivial query.
  • Limited scope for joins and aggregations on joins.
  • Querying nested data such as the data matrix, which was stored as one document per row with each sample’s name and measurement in an array of nested documents (see the sketch below).
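
For illustration, a data-matrix row serialized as an Elasticsearch document might look like the hypothetical structure below, which is what makes plain row-and-column SQL access awkward.

```python
# Hypothetical shape of one data-matrix row stored as an Elasticsearch document.
# Each sample's name and measurement sits inside an array of nested documents,
# so a simple "give me this feature across all samples" request becomes a
# nested query rather than a flat SQL SELECT.
row_document = {
    "feature_id": "TP53",
    "samples": [
        {"sample_name": "S1", "measurement": 12.7},
        {"sample_name": "S2", "measurement": 9.3},
        # ... one nested object per sample
    ],
}
```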

These limitations were very problematic for a platform whose purpose is to add value and expose large amounts of analysis-ready biomedical data.

Absence of schemas

Elasticsearch allows users to dump data and make it easily queryable, but with a limit on how many fields an index can have. And for good reason, too: otherwise, there would be no upper bound on how much memory a query needs, which is not a good thing for a reliable distributed system.
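
The relevant knob is the index.mapping.total_fields.limit setting, which defaults to 1,000 fields per index. Raising it is possible, as in the sketch below (with a placeholder domain and index name), but that only postpones the problem.

```python
import requests

ES_ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder domain

# Raise the per-index mapped-field limit for a hypothetical index. This buys
# headroom, but the memory cost of very wide mappings does not go away.
response = requests.put(
    f"{ES_ENDPOINT}/dataset_metadata/_settings",
    json={"index.mapping.total_fields.limit": 2000},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```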

Polly’s repositories can have datasets coming in from heterogeneous sources, each bringing its own set of metadata fields. The union of these fields can easily exceed Elasticsearch’s field limit, and indexing everything is not recommended anyway. This issue can be managed through proper processes.

However, without schemas, writing structured queries (SQL) on Polly’s data turned out to be the real challenge. The system we initially put into use didn’t need any schemas, and writing SQL queries against an undocumented schema is hard: users had to tediously figure out columns and their types from thousands of available fields.

Our solution

While our search APIs were working very well, it soon became apparent that we needed a separate backend to store structured data and serve SQL on it. Polly needed something that:

  • has native SQL support,
  • scales well with very large amounts of data per table,
  • scales with a large number of tables,
  • enforces a schema for each table.

A distributed file-system

First, the engineers at Elucidata figured that serializing our tables in a data-lake format on a distributed file system would help with scalability, since storage there is cheap. S3 is a great choice here, given that the rest of our cloud platform already runs on AWS.

The data format

Second, we needed to select a data format that supports on-disk SQL querying and enforces a schema. We evaluated a few choices:

  • TileDB Embedded — Looks great but is yet to be widely adopted, and community support is limited.
  • Apache Iceberg — An Apache project (originally developed at Netflix) with decent community support and adoption. But it is not a good choice for workloads that involve frequent updates and deletions to tables, which is our case.
  • Apache Hudi — Another Apache project with decent community support and adoption. Hudi was almost the right fit.
  • Delta Lake — Developed and maintained by Databricks; the open-source version checked all the boxes. Delta Lake is widely adopted and also provides an option to move to the more robust enterprise version if the need arises.

We concluded that Delta Lake was the right fit for our data storage needs.

Delta is based on Parquet and augments it
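
As a rough sketch, writing a table to S3 in Delta format with PySpark looks something like this; the bucket, paths, and table contents are placeholders, and the Spark session is assumed to be configured with the Delta Lake dependencies.

```python
from pyspark.sql import SparkSession

# Minimal sketch: write one dataset's data matrix as a Delta table on S3.
# Assumes the delta-spark package and S3 credentials are already configured.
spark = (
    SparkSession.builder.appName("delta-write-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

df = spark.createDataFrame(
    [("TP53", "S1", 12.7), ("TP53", "S2", 9.3)],
    schema="feature_id STRING, sample_id STRING, measurement DOUBLE",
)

# Each data matrix lives as its own table under a placeholder S3 prefix;
# the Delta transaction log (_delta_log) and Parquet files are written together.
df.write.format("delta").mode("overwrite").save(
    "s3://example-bucket/delta/data_matrix_ds_0001"
)
```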

Separate backends for search and query

That said, a complete migration to Delta Lake (running on S3) and ditching Elasticsearch wasn’t the plan either. We still wanted to keep our search interface based on Elasticsearch; it works exceptionally well. We decided to separate the search and query backends, even at the cost of duplicating data between them, so that each backend could be optimized separately for its role in the consumption journey.

Moreover, SQL queries were the only thing that needed to run on the data matrices; there was no need to search them. This allowed the engineers to keep the data matrices in Delta Lake as individual tables and not index them in Elasticsearch, which greatly reduced the storage requirements of Polly’s Elasticsearch cluster.

Separation and duplication of data in Elasticsearch and Delta Lake

Running SQL on Delta Lake

Consuming data through SQL requires a query engine that supports Delta Lake on S3. There are a few options to choose from, such as Athena or a self-managed deployment of Presto or Spark.

We chose to go with Athena because it requires the least effort to get started. Athena is based on Presto (v0.217), and both require the definitions of all external queryable tables to be present in a Hive-compatible metastore; this is how the query engine knows which fields (and their types) are available for querying and which files on S3 should be read to get the actual data. The complete documentation for the table registration process is available here. Furthermore, if the current service ever reaches its threshold for simultaneous query executions, we can move to self-managed infrastructure running Presto or Spark.

Once again, instead of setting up and maintaining a Hive instance, we used the AWS Glue Data Catalog. The Glue Data Catalog is supported by both Delta Lake and Athena and has a pay-per-object pricing model, saving us the effort and upfront cost of maintaining our own Hive metastore. If the Glue Data Catalog ever becomes too restrictive or expensive, there is always the option to move to a self-managed solution.
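
Once a table is registered in the Glue Data Catalog, querying it through Athena from Python is straightforward. Below is a sketch using boto3, with placeholder database, table, and output locations.

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Placeholder database/table names; Athena writes results to the given S3 prefix.
query = athena.start_query_execution(
    QueryString="""
        SELECT sample_id, measurement
        FROM data_matrix_ds_0001
        WHERE feature_id = 'TP53'
    """,
    QueryExecutionContext={"Database": "polly_demo_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)

# Poll for completion, then fetch the rows.
execution_id = query["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=execution_id)
    for row in results["ResultSet"]["Rows"]:
        print(row["Data"])
```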

Final Architecture

The final architecture of our storage and consumption infrastructure

As shown in the figure above, two different backends serve Polly’s search API and SQL query API. Both go through a Lambda function (sketched after the list below) whose job is to perform some common tasks such as:

  • Repository-level access control — each user has access to a certain set of read-only datasets.
  • Capturing usage metrics for getting insights on the usage of data repositories.
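
Here is a minimal sketch of what such a gateway Lambda might look like; the helper functions for access control, metric capture, and routing are hypothetical stand-ins for Polly’s internal services.

```python
import json

# Hypothetical helpers; the real implementations live in Polly's own services.
def user_can_access(user_id: str, repository: str) -> bool:
    """Placeholder for repository-level access control."""
    return True

def record_usage(user_id: str, repository: str, query_type: str) -> None:
    """Placeholder for capturing usage metrics."""

def forward_to_backend(query_type: str, payload: dict) -> dict:
    """Placeholder: route FTS requests to Elasticsearch, SQL to Athena/Delta Lake."""
    return {"rows": []}

def lambda_handler(event, context):
    body = json.loads(event["body"])
    user_id = event["requestContext"]["authorizer"]["user_id"]
    repository = body["repository"]
    query_type = body["type"]  # "fts" or "sql"

    # Repository-level access control: users only see repositories they may read.
    if not user_can_access(user_id, repository):
        return {"statusCode": 403, "body": json.dumps({"error": "forbidden"})}

    # Capture usage metrics before forwarding the request.
    record_usage(user_id, repository, query_type)

    result = forward_to_backend(query_type, body)
    return {"statusCode": 200, "body": json.dumps(result)}
```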

With Polly’s new storage infrastructure, entity metadata is present in both Elasticsearch and Delta Lake. From our initial estimates, the resulting increase in storage cost amounts to just ~$3 per terabyte of data ingested. Though a bit expensive, this duplication has eliminated the risk of having to frequently scale Elasticsearch vertically or horizontally just to accommodate the disproportionately sized data matrices.

Polly’s new consumption infrastructure allows users to make powerful queries on biomolecular data. It also pushed our engineers and scientists to set up processes for defining a schema for each source and type of data, which led to better documentation, better management of data complexity, and a significantly improved querying experience for users.

What do you think about our scalable backend storage solution? Do you think you have a better solution? Then come work with us at Elucidata. We are hiring!
