2023 Summer of Learning— Data Consolidation

Troy Curtsinger
Published in mainframe-careers
Dec 1, 2023 · 4 min read

This blog post is about the OLAP database research I completed during my summer employment with Broadcom’s Mainframe Software Division (MSD).

Data Consolidation: Handling Data Storage at Scale for Multiple Use Cases

Data at scale quickly becomes difficult to store and query efficiently. Hundreds of thousands of events are expected to arrive every minute, which calls for a datastore that can handle billions of rows of data.

For our problem, we are trying to support two separate use cases, both of which work with similar data. One needs to see the historical record of values for a given column with consistently fast response times, while the other needs to fire off multiple queries in succession, each using multiple where clauses, group bys, and generally “slicing and dicing” the data.

During my time at Broadcom’s Mainframe Software Division this summer, my objective was to identify a robust database solution to facilitate the needs of both use cases.

For the first use case, a NoSQL columnar solution is ideal, but it provides no support for the second. For the second use case, a traditional row-based database or an OLAP database makes sense, but these often cannot provide the speed at scale that the first use case requires. This points us toward either a time-series database or a real-time OLAP database, which can give us the response time and scale we need for both use cases.

A time-series database is a system where data is stored in rows with a time or timestamp, alongside other values such as readings or labels, providing a datastore tailored for time-series data. Examples include TimescaleDB, InfluxDB, and Prometheus.

OLAP, or OnLine Analytical Processing databases, are optimized to support analytical queries, often “slicing and dicing” the data with multiple where clauses, group bys, etc. These databases are specially made to support dashboarding and other analytical use cases. Examples include Clickhouse, Snowflake, and Apache Pinot.
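To make “slicing and dicing” concrete, here is a minimal Python sketch of the kind of filtered group-by aggregation an OLAP database answers. The column names and rows are invented for illustration; a real OLAP store would express this as SQL over billions of rows rather than an in-memory loop.

```python
from collections import defaultdict

# Hypothetical event rows: (region, status, latency_ms)
rows = [
    ("us-east", "ok", 12),
    ("us-east", "error", 90),
    ("eu-west", "ok", 20),
    ("us-east", "ok", 18),
    ("eu-west", "error", 75),
]

# Equivalent of: SELECT region, COUNT(*), AVG(latency_ms)
#                FROM events WHERE status = 'ok' GROUP BY region
groups = defaultdict(list)
for region, status, latency in rows:
    if status == "ok":  # the WHERE clause
        groups[region].append(latency)

result = {r: (len(v), sum(v) / len(v)) for r, v in groups.items()}
print(result)  # → {'us-east': (2, 15.0), 'eu-west': (1, 20.0)}
```

An OLAP engine answers many ad-hoc variations of this shape (different filters, different group-by columns) quickly, which is exactly what dashboards demand.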

Apache Pinot:

Source: Apache Pinot (https://pinot.apache.org/)

In this case, I researched Apache Pinot. It has a few features that make it attractive for our problem, including a variety of indexes and options to optimize queries.

Each index is a trade-off: it consumes more disk space but improves query performance. Pinot supports a variety of indexes based on data type, such as timestamp, text, JSON, and geospatial indexes, as well as the ability to create inverted or sorted indexes on columns.
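As an illustration, Pinot indexes are declared in the table configuration. The sketch below shows the general shape; the column names are hypothetical, and the exact field names should be checked against the Pinot documentation for your version.

```json
{
  "tableIndexConfig": {
    "sortedColumn": ["tenant_id"],
    "invertedIndexColumns": ["status", "event_type"],
    "rangeIndexColumns": ["event_time"]
  },
  "fieldConfigList": [
    { "name": "message", "encodingType": "RAW", "indexType": "TEXT" }
  ]
}
```

Because indexes are declared per column, each one can be weighed individually: the disk it costs against the query shapes it accelerates.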

Apache Pinot also supports a unique index, the star-tree. This index stores pre-aggregated values for certain slices of the data, navigated like a tree. The root node has multiple child nodes below it, each covering a range of values, or a single value, for a column. Each node holds aggregations such as avg, max, and min for the relevant columns in the index. Below it are further nodes, split on other columns, providing the same aggregations for their narrower slices of data. A query with multiple where clauses can therefore navigate the star-tree based on the clauses in the query and read the pre-computed aggregations instead of calculating them at query time.
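The idea can be sketched in a few lines of Python. This is a toy model of the concept, not Pinot’s actual implementation: each node stores pre-aggregated values for its slice of the data, so a filtered aggregate query is answered by walking the tree rather than scanning rows. The column names and numbers are made up for illustration.

```python
# Tree keyed by dimension values; the root covers all rows, children split
# first on "region", then on "status".
star_tree = {
    "agg": {"count": 6, "sum_latency": 240},  # root: all rows
    "children": {
        "us-east": {
            "agg": {"count": 4, "sum_latency": 120},
            "children": {
                "ok":    {"agg": {"count": 3, "sum_latency": 45}, "children": {}},
                "error": {"agg": {"count": 1, "sum_latency": 75}, "children": {}},
            },
        },
        "eu-west": {
            "agg": {"count": 2, "sum_latency": 120},
            "children": {},
        },
    },
}

def lookup(node, filters):
    """Walk the tree following the query's where-clause values."""
    for value in filters:
        node = node["children"][value]
    return node["agg"]

# SELECT COUNT(*), SUM(latency) ... WHERE region='us-east' AND status='ok'
print(lookup(star_tree, ["us-east", "ok"]))  # {'count': 3, 'sum_latency': 45}
```

The query never touches the raw rows: the aggregations were paid for once at ingestion time, which is the disk-for-speed trade-off the star-tree makes.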

To see these indexes in action, I set up an environment, loaded mock data, and used Gatling, a load testing tool, to run some of the queries our solution needs to serve. The table uses the default star-tree index, as well as sorted, text, and timestamp indexes on various columns. The test ran against an environment containing roughly 3 billion rows of data, with 105 concurrent sessions, each firing one query at a time.
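The shape of that load test can be sketched in Python: sessions run in parallel, each issues queries sequentially, and latencies are summarized as percentiles. This is a stand-in for the Gatling scenario, not the actual test code; `run_query` is a placeholder where a real test would make an HTTP call to the Pinot broker, and the simulated latencies are arbitrary.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor

def run_query():
    # Placeholder for a real query to the Pinot broker; here we just
    # simulate a skewed response-time distribution.
    return random.lognormvariate(3.0, 1.0)  # latency in ms

def session(queries_per_session):
    # Each session fires its queries one at a time, in sequence.
    return [run_query() for _ in range(queries_per_session)]

# 105 concurrent sessions, mirroring the test setup described above.
with ThreadPoolExecutor(max_workers=105) as pool:
    results = pool.map(session, [10] * 105)

latencies = sorted(l for per_session in results for l in per_session)
cuts = statistics.quantiles(latencies, n=100)
p50, p75 = cuts[49], cuts[74]
print(f"min={latencies[0]:.1f}ms p50={p50:.1f}ms "
      f"p75={p75:.1f}ms max={latencies[-1]:.1f}ms")
```

Reporting percentiles rather than averages matters here: a handful of very slow queries can hide behind a healthy mean, but they show up clearly at p75 and above.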

This test showed that the database could complete all of the queries, but some performed poorly under load. In the best cases, minimum values showed that some queries were served in only a couple of milliseconds, while 75th-percentile values showed that it was not uncommon for queries to take over ten seconds, with some taking up to 55 seconds.

Summary

Data consolidation requires solving multiple distinct problems and combining the solutions into one option that supports all requirements. In this case, I tested Apache Pinot, which has a promising set of features. In my test, it handled most of the queries in just a couple of milliseconds, but others needed multiple seconds to complete, and under load those times ballooned. A more heavily optimized or scaled-out system may be able to support these use cases.

Hopefully you found this post interesting. Feel free to leave a comment if any of the things mentioned here piqued your curiosity!
