Open Lakehouse — A journey
SQL on anything with Dremio
Nowadays every company relies on data to make decisions, so every company needs a data infrastructure that can handle its use cases over time. As those use cases evolved, so did the solutions: the data warehouse, Hadoop, data marts, the data lake, cloud data warehouses, and now the lakehouse.
On-prem Data Warehouse
The data warehouse was the first attempt to integrate disparate data silos into one platform, acting as the single source of truth. It consolidated historical transactions, records, and operational data from applications to make data consumable for business users’ analytics.
Over time, companies realized every organizational unit (sales, marketing, finance) needs a customized view of the data, which gave life to data marts. This in turn led to multiple copies of the data feeding different BI cubes and reporting tools.
Scaling on-premises data warehouses was a big challenge: the license and operating costs of these systems outgrew their returns.
On-prem Data Lake
The data lake also started on-prem, as a cost-effective storage alternative to the data warehouse for sources with large volume, velocity, and variety of data. Data lakes prioritized flexibility of analytic methods. Hive attempted to provide a way to run SQL efficiently on data lakes: it introduced the concepts of databases, schemas, and tables for describing the structure of a data lake in a way that let BI tools traverse the data efficiently, and it came out as the winner of the SQL-on-Hadoop race. But this solution soon ran into the same infrastructure shortcomings as the data warehouse.
Cloud data lake + data warehouse
The first attempt to remove the operational cost was BigQuery, where the infrastructure was managed by GCP. Soon after, AWS offered Redshift and dominated the cloud data warehouse market until Snowflake's launch. All these products provided faster queries using managed storage. In each of them, an ETL process would move source data into the cloud data lake, and a curated, trimmed version was loaded into the warehouse. Cloud object storage like S3 served as a cost-effective data lake. But here, too, we had the issue of duplicate copies of data, and data loaded into a warehouse was only queryable through that product.
Data Mesh & Data Fabric
Both Data Mesh and Data Fabric tried to overcome the shortcomings of the data lake.
Data Mesh, an analytical data architecture, does this by organizing data as products and scoping them down to domains in a distributed fashion. Governance and architecture are federated, with a centralized catalog of the different data products.
Unlike Data Mesh, Data Fabric takes a centralized, metadata-driven, technology-centric approach. Governance and architecture are centralized; some of the data is virtualized, but most of it is centralized, and delivery is done through APIs. See open-data-fabric.
Open Lakehouse
The launch of open table formats started a new chapter in data management. These table formats brought the performant SQL capabilities of warehouses to the data lake, introducing the following features:
- Data as a product — versioned data, rollback
- Transactional consistency — Atomic updates to files with full read isolation.
- Schema evolution to track table changes over time.
- Time travel to query historical data.
- Open formats — non-proprietary formats
- Open Interfaces — ANSI SQL, JDBC/ODBC connectivity
- Cloud agnostic — Deployment using containers
- Open Source — non-proprietary code of the software
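The first few features above all fall out of one design idea shared by open table formats: every commit produces a new immutable snapshot of the table's metadata. Here is a toy, stdlib-only sketch of that idea (illustrative names only, not the API of Iceberg or any real table format):

```python
# Toy sketch of snapshot-based table metadata (hypothetical, not a real
# table-format API): each commit creates a new immutable snapshot, which
# yields atomic updates, read isolation, time travel, and rollback.
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple  # immutable set of data files in this table version


class Table:
    def __init__(self):
        self.snapshots = []       # full history, oldest first
        self.commit(files=())     # snapshot 0: empty table

    def commit(self, files):
        """Atomically publish a new table version."""
        snap = Snapshot(snapshot_id=len(self.snapshots), files=tuple(files))
        self.snapshots.append(snap)
        return snap

    def current(self):
        return self.snapshots[-1]

    def time_travel(self, snapshot_id):
        """Read the table as of an older snapshot."""
        return self.snapshots[snapshot_id]

    def rollback(self, snapshot_id):
        """Make an older snapshot current again (as a new commit)."""
        return self.commit(self.time_travel(snapshot_id).files)


t = Table()
t.commit(["part-0.parquet"])                    # snapshot 1
t.commit(["part-0.parquet", "part-1.parquet"])  # snapshot 2
assert t.time_travel(1).files == ("part-0.parquet",)
t.rollback(1)                                   # snapshot 3 mirrors snapshot 1
assert t.current().files == ("part-0.parquet",)
```

Because readers pin a snapshot and files are never mutated in place, a long-running query sees a consistent table even while new commits land.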
Using these features, Dremio built an open lakehouse platform with the following benefits:
- SQL on any data
- Reduce infrastructure management costs
- Query your data where it lives
- No unnecessary data movement and copies
- No vendor lock-in as it uses open source components
Technical Components of Dremio
- Apache Arrow — columnar in-memory data processing and interchange
- Apache Arrow Flight — gRPC and HTTP/2 based data interoperability
- Apache Calcite — SQL parser & cost-based optimizer (CBO)
- RocksDB — Metadata storage on coordinator
- Gandiva — LLVM-based kernel for vectorized expression evaluation
- Apache Parquet — file format used for on-disk spilling
- Nessie — catalog for Iceberg
- Production deployment on Kubernetes — Elastic scaling platform
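Arrow's columnar layout is the thread running through several of these components: the values of one column sit contiguously in memory, so a scan or aggregate touches only the bytes it needs. A stdlib-only sketch of the row-vs-column difference (Arrow itself adds typed buffers, null bitmaps, and zero-copy interchange on top of this idea):

```python
# Row vs. columnar layout, sketched with the stdlib `array` module.
# Columnar engines like Arrow store each column as a contiguous typed
# buffer, which enables vectorized scans and SIMD-friendly kernels.
from array import array

# Row-oriented: each record is a tuple; summing one field walks every record.
rows = [(1, 10.0), (2, 20.5), (3, 30.5)]
row_sum = sum(price for _, price in rows)

# Column-oriented: one contiguous buffer per column; the `ids` column is
# never touched when aggregating `prices`.
ids = array("q", [1, 2, 3])              # int64-like column
prices = array("d", [10.0, 20.5, 30.5])  # float64 column
col_sum = sum(prices)

assert row_sum == col_sum == 61.0
```

The same layout is what makes Arrow Flight cheap: batches of column buffers can be shipped over gRPC without a row-by-row serialization step.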
Logical components of Dremio
- Coordinator — Serves UI, Powers API Connections, Manages metadata
- Executor — Execute queries
- Distributed Storage — uses external storage such as HDFS/S3
- Admin — Does housekeeping & background tasks
- Zookeeper — Cluster coordination (use an external ensemble)
Overall workflow in Dremio
Dremio Optimizations from 5000 feet
- Data Reflections
- Ultra high-performance, vectorized Parquet reader
- Columnar Cloud Cache (C3) (NVMe based)
- Predictive Pipelining
- High-Performance Parallel readers for S3 compatible storage
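Of these, Data Reflections are the least self-explanatory: they are precomputed materializations (e.g. aggregations) that the optimizer transparently substitutes for raw-data scans when a query can be answered from them. A toy, stdlib-only sketch of the idea (Dremio's matching and refresh logic is far richer):

```python
# Toy sketch of a "reflection": a precomputed GROUP BY aggregation
# served in place of a raw-data scan. Illustrative only.
from collections import Counter

raw_rows = [
    {"region": "EU", "amount": 10},
    {"region": "US", "amount": 25},
    {"region": "EU", "amount": 5},
]

# Build the reflection once: SUM(amount) GROUP BY region.
reflection = Counter()
for row in raw_rows:
    reflection[row["region"]] += row["amount"]


def sum_by_region(region):
    # The query is rewritten to hit the reflection, not scan raw_rows.
    return reflection[region]


assert sum_by_region("EU") == 15
assert sum_by_region("US") == 25
```

The win is that the scan cost is paid once at refresh time instead of on every query, at the price of some staleness between refreshes.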
Note: I came across LakeSoul, a new table format, but couldn’t find much documentation around it. Also, it targets the Spark engine rather than a standalone SQL engine. Just mentioning it here for future reference.
Data architecture evolution
Operational Data Store → Data Warehouse → Data Marts → Multi-Dimensional Databases (Oracle Express, Hyperion Essbase) → Appliances (Teradata, Netezza) → Data Lake (S3) → Data Fabric (HPE Ezmeral) → Data Virtualization (Denodo) → Lakehouse → Delta Lake (Databricks) → Data Mesh (ThoughtWorks)
Happy Reading !!