The Origins of OLAP Databases and Where Are They Heading in 2022?
An attempt to capture the entire OLAP and data warehousing landscape into a single post, compare different technologies and predict what will come next in this space.
In this post, I will go back to the roots of OLAP systems and attempt to summarise different OLAP and data warehousing strategies from their early days to modern data space. At the end of the post, you will be able to capture the entire OLAP landscape.
This analysis is purely based on my research. If you have anything to share, I’m all ears.
In the beginning, there was no OLAP
Back in the day, especially in the 1960s to 70s, people didn’t know about OLAP databases or data warehouses. They have just been switched to disk-based databases from tapes and started running the transactional workloads on the same database (OLTP). SQL was the new kid in the block back then.
But the business data grew in size over time. Business needs became complex and time-sensitive. Then people started to realize that running workloads like getting monthly sales reports on the same OLTP database impacted the business transactions.
There was a pressing need to separate complex and expensive SQL queries from OLTP databases. It has subsequently paved the path to inventing a purpose-built database to run those analytical workloads.
People used to call these databases data warehouses.
In light of data warehouses
The data warehouse served as the single source of truth of an organization’s business data. It was engineered to load data from multiple operational systems and run complex queries that filter and aggregate millions and billions of records. BI tools and internal reporting tools consumed the analysis results.
Back then, there was no cloud. All the data warehouses were primarily distributed deployments based on on-premise data centers. Not to mention the staggering amount of licensing costs as well.
Constrained by computing and storage capacities
They were constrained by the amount of compute and storage capacity. You can buy additional hardware to speed up queries if you are rich. Most data warehouses (DWH) were built on MPP principles (massively parallel processing systems). Compute, and storage was often coupled to benefit from the data locality during query execution.
ETL and data modeling came into the picture
Due to the lack of computing and storage, data had to be pre-joined and normalized before loading into the DWH because that would speed up the queries. ETL pipelines were employed to do that.
Data in the DWH was arranged in a particular way to reduce the storage footprint. Tables were laid out in two main strategies, star and snowflake schemas. The star schema followed a model where large fact tables were surrounded by many dimension tables that were smaller in size.
Many joins were required to compute the final result when executing a query, resulting in increased query latency.
Major DWH vendors back then
The data warehouse was a mandatory item in most digital corporations back in the 90s. But they were slow. It wasn’t a significant problem since they were used by a handful of analysts in the organizations, and they had a lot of patience dealing with slow queries.
Passing through the Big Data era
We passed the 90s, early 2000s and then stepped into the second decade of the 2000s, and boom! There we had big data and the cloud.
Hello, cloud computing!
Cloud computing had made a grand entrance around the early 2000s, and cloud vendors like AWS were beginning to make some waves. The Cloud computing movement has made organizations unlock many data sources. Organizations started migrating their classic on-premise data warehouses to the cloud, targetting scalability, reliability, and cost reduction.
The hype, the Big Data
In the meantime, companies like Google, Yahoo, and Facebook had been operating with billions of users across the globe. They dealt with Internet-scale traffic that often challenged them to use traditional data warehouse technologies anymore.
The volume, variety, and veracity of data also increased over time. The traditional warehouses were designed to work with structured relational data in mind. They were no longer a good fit to process that big data.
That has led the industry to invent new distributed computing technologies like map-reduce to process the ever-growing big data. A lot of open-source projects emerged during this era. Apache Hadoop, Hive, Pig, Flume, Impala, Drill, and Spark are a few notable projects that took a lot of attention.
The birth of data lakes
Since the data had many variations, it was no longer possible to store them on traditional relational databases. So, a new class of reliable, scalable, and cheap storage systems was born after being influenced by HDFS.
HDFS was primarily a distributed file system that could store any data type, including structured, semi-structured, and unstructured data. Having a schema was optional at the point of writing data to it. The primary goal was the scalability, durability, and the economy of storing data arriving in high velocities.
AWS also launched S3 by this time, and people quickly embraced it because there was no infrastructure to manage, unlike HDFS. This new storage was known as the data lake.
Data lakes vs. data warehouses
There were debates when choosing between these two. Some favored the data houses due to their relational nature and robust data integrity, while others preferred the data lake due to simplicity and flexibility.
In the meantime, data lake engines came into the picture where they made it possible to query the raw data in the data lake directly. These engines promoted a federated data processing architecture where compute and storage is no longer coupled. That gives you the ability to bring an engine of your choice to analyze the data in the lake.
From Hive, Drill, to Presto
Facebook open-sourced their analytics engine Presto which made big waves in the ad-hoc data analytics space. Engines like Presto and Drill were instrumental in running federated SQL queries across multiple disparate data sources.
Redshift, BigQuery, and Snowflake
Data warehouse vendors didn’t like to lose their market share as well. AWS, Google, and Snowflake launched their cloud-native versions of data warehouses to the market, and it was a huge success.
At the end of the big data era, many organizations had realized the value of having their data platform on the cloud and started making preparations for the lift and shift.
What is the problem now?
All this fancy machinery gave us great computational powers and storage capacities. Compared to the early days, the time to analyze a terabyte-scale data set has been reduced to months to days to single-digit seconds.
What was the problem then? Was that not good enough? Not really.
The need for real-time user-facing analytics
In fact, the major problem was the latency.
Analytics was no longer an inward-facing need. Many data-driven applications and operational analytics systems were demanding low-latency real-time analytics. Companies like Facebook, Uber, Netflix have set the bar high for a greater user experience over the Internet.
Ad-hoc analytical databases like Presto or even modern data warehouses can respond to queries with seconds’ latency. But modern applications demand it to be in the millisecond’s range.
Can pre-aggregations help?
Data warehouses traditionally used pre-aggregations and rollup cubes to speed up the queries. Projects like Apache Kylin excelled at building cubes.
But pre-cubing had a significant footprint on the storage and computational power. Moreover, building cubes was an expensive process, leading to delays in data ingestion. Traditionally pre-aggregation was done using ETL tools based on Spark, Hadoop, or Hive, which has impacted the data freshness.
Real-time OLAP databases to save the world
A new breed of OLAP databases was invented to address the gaps we discussed above.
Their primary goal is to:
- Ability to answer around 100k-200k queries per second (QPS).
- Respond to queries with stringent latency SLA of milliseconds. Anything that goes beyond a second won’t be tolerated.
- Eliminate the need for expensive pre-aggregations and allow users to feed raw data in multiple formats, including JSON.
- Maintain data freshness (must allow data to be queried as they are ingested)
- Combine both streaming and historical data to answer queries more accurately while maintaining data freshness.
- Use expressive SQL queries that often have complex aggregates and filter predicates.
To get a good understanding on these needs, I suggest you read this article by my friends Chinmay Soman, and Neha Pawar.
How To Pick Your Real-Time OLAP Platform - Part 1
Real-time analytics has transformed the way companies do business. It has unlocked the ability to make real-time…
Druid, Pinot, ClickHouse, and Rockset
These are the key players in the real-time OLAP space that more or less satisfies the above requirements. These databases use a combination of intelligent indexing and segment placement and query pruning strategies to bring down the query execution time.
The compute and storage layers of these databases are often tightly coupled. The ingested data will be broken into smaller segments and laid out in the disk in a columnar format. While in operations, segments are memory-mapped, enabling high performant query processing.
Moreover, the tightly-coupled storage enables the query engine to evolve with storage in tandem.
With more organizations embracing user-facing outward analytics, real-time OLAP will be the ultimate choice for delivering low-latency data-driven products.
The OLAP movement is not over yet. There are other movements in this space that I haven’t got time to explore.
- Delta lakes
- Lakehouse architectures
- Data mesh
Are some of them. I will write about them in another article, maybe?
The bottom line is there’s no one-size-fits-all type of OLAP database that exists today. You can still live with an old data warehouse if you are okay with dealing with slow queries. If you try to instill a data-driven culture with many experiments, machine learning, and data science, a data lake strategy would be ideal. To deal with latency-sensitive use cases like user-facing dashboards and personalization, you can consider a real-time OLAP database.
If you don’t want any of the above but want to get done with everyday operational BI needs, use a cloud-based modern data warehouse for simple day-to-day operational BI needs.