Open Lakehouse — A journey

SQL on anything with Dremio

Nowadays every company relies on data to make decisions, so every company needs a data infrastructure that can handle its use cases. Those use cases have evolved over time, and with them we have seen a succession of solutions: the data warehouse, Hadoop, data marts, the data lake, cloud data warehouses, and now the lakehouse.

On-prem Data Warehouse

The data warehouse was the first attempt to integrate disparate data silos into one platform that acted as the single source of truth. It pulled historical transactions, records, and data from operational and other applications to make data consumable for business users' analytics.

Over time, companies realized that every organizational unit (sales, marketing, finance) needs a customized view of the data, and this gave life to data marts. It also led to multiple copies of the data for different BI cubes and reporting tools.

Scaling on-premise data warehouses was a big challenge, and their license and operating costs eventually outgrew the return they delivered.

Data Lake

The data lake started on-prem as a cost-effective storage alternative to the data warehouse for sources with large volume, velocity, and variety of data. Data lakes prioritized flexibility in analytic methods. Hive attempted to provide a way to run SQL efficiently on data lakes: it introduced the concepts of databases, schemas, and tables for describing the structure of a data lake in a way that let BI tools traverse the data efficiently, and it came out as the winner in this space. But soon this solution ran into the same infrastructure shortcomings as the data warehouse.

Cloud data lake + data warehouse

The first attempt to remove the operational cost came with BigQuery, where the infrastructure was managed by GCP. Soon AWS offered Redshift and dominated the cloud data warehouse market until Snowflake's launch. All these products delivered faster queries on top of managed storage. In all of them, an ETL process moved source data into a cloud data lake, and a curated, trimmed version was loaded into the warehouse. Cloud object storage like S3 served as a cost-effective data lake. But here, too, we had the issue of duplicate copies of data, and once data was loaded into the warehouse it was queryable only from that product.

Data Mesh & Data Fabric

Both Data Mesh and Data Fabric tried to overcome the shortcomings of the data lake.

Data Mesh, an analytical data architecture, does this by organizing data as products and scoping them to domains in a distributed fashion. Governance and architecture are federated, with a centralized catalog of the different data products.

Unlike Data Mesh, Data Fabric takes a centralized, metadata-driven, and technology-centric approach. Governance and architecture are centralized, part of the data is virtualized, most of the data is centralized, and delivery is done using APIs. See open-data-fabric.

Open Lakehouse

The launch of open table formats started a new chapter in data management. These table formats brought the performant SQL capabilities of warehouses to the data lake and introduced the following features:

  • Data as a product — versioned data, rollback
  • Transactional consistency — Atomic updates to files with full read isolation.
  • Schema evolution to track table changes over time.
  • Time travel to query historical data (a short sketch follows this list).
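
To make time travel concrete, here is a minimal sketch using PyIceberg (not part of the original post); the catalog name, URI, table identifier, and snapshot choice are all illustrative assumptions.

```python
# Minimal time-travel sketch with PyIceberg (illustrative; names/URIs are assumptions).
from pyiceberg.catalog import load_catalog

# Assumed REST catalog endpoint and table name -- adjust for your environment.
catalog = load_catalog("demo", **{"uri": "http://localhost:8181"})
table = catalog.load_table("analytics.orders")

# Every committed change creates a snapshot; list them to pick a point in time.
for snap in table.snapshots():
    print(snap.snapshot_id, snap.timestamp_ms)

# Read the table as of an older snapshot (time travel) into an Arrow table.
old_version = table.scan(snapshot_id=table.snapshots()[0].snapshot_id).to_arrow()
print(old_version.num_rows)
```

The same snapshot mechanism is what enables atomic commits and rollback: readers always see one consistent snapshot, and rolling back simply points the table at an earlier one.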

Openness Pillars

  • Open formats — non-proprietary file and table formats (a short sketch follows this list)
  • Open interfaces — ANSI SQL, JDBC/ODBC connectivity
  • Cloud agnostic — deployment using containers
  • Open source — non-proprietary code for the software
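
As a small illustration of the open-formats pillar, any engine or library can read the same Parquet files sitting in object storage. Below is a minimal PyArrow sketch; the bucket, path, and column names are made-up examples.

```python
# Reading an open-format (Parquet) dataset directly from object storage with PyArrow.
# The bucket/path below is hypothetical; any S3-compatible store works the same way.
import pyarrow.dataset as ds

dataset = ds.dataset("s3://my-datalake/sales/", format="parquet")

# Column pruning and predicate pushdown happen at the format level,
# so any engine reading these files gets the same benefits.
table = dataset.to_table(
    columns=["order_id", "amount", "region"],
    filter=ds.field("region") == "EMEA",
)
print(table.num_rows)
```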

Dremio

Using these building blocks, Dremio launched an open lakehouse platform. This platform has the following benefits:

  • SQL on any data
  • Reduce infrastructure management costs
  • Query your data where it lives
  • No unnecessary data movement and copies
  • No vendor lock-in as it uses open source components

Technical Components of Dremio

  • Apache Arrow — columnar in-memory data processing and interchange
  • Apache Arrow Flight — gRPC- and HTTP/2-based data interoperability (a client sketch follows this list)
  • Apache Calcite — SQL parser & cost-based optimizer (CBO)
  • RocksDB — metadata storage on the coordinator
  • Gandiva — LLVM-based expression compiler (kernel)
  • Apache Parquet — file format used for on-disk spilling
  • Nessie — catalog for Iceberg
  • Production deployment on Kubernetes — Elastic scaling platform
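
To see how the Arrow and Arrow Flight pieces look from a client's point of view, here is a hedged PyArrow Flight sketch. The host, port, credentials, and query below are illustrative assumptions, not values taken from a real deployment.

```python
# Hedged sketch of querying a Flight-speaking SQL endpoint (e.g., Dremio) with
# PyArrow Flight. Host, port, credentials, and table names are assumptions.
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://localhost:32010")

# Basic auth returning a bearer-token header to reuse on subsequent calls.
token = client.authenticate_basic_token("dremio_user", "dremio_password")
options = flight.FlightCallOptions(headers=[token])

query = 'SELECT region, SUM(amount) AS total FROM "lake"."sales" GROUP BY region'
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)

# Results stream back as Arrow record batches.
reader = client.do_get(info.endpoints[0].ticket, options)
table = reader.read_all()
print(table)
```

Because results travel as Arrow record batches end to end, the client skips the row-by-row serialization overhead that ODBC/JDBC drivers typically incur on large result sets.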

Logical components of Dremio

  • Coordinator — serves the UI, powers API connections, manages metadata
  • Executor — executes queries
  • Distributed storage — uses external storage such as HDFS/S3
  • Admin — handles housekeeping & background tasks
  • ZooKeeper — cluster coordination (can use an external ensemble)

Overall workflow in Dremio

As shown in the picture above, Dremio leverages storage in two key ways: data lakes, which hold the source data being queried, and the distributed store, which holds Dremio's internal artifacts such as reflections and job results.

Dremio Optimizations from 5000 feet

  • Data Reflections (a conceptual sketch follows this list)
  • Ultra high-performance, vectorized Parquet reader
  • Columnar Cloud Cache (C3) (NVMe based)
  • Predictive Pipelining
  • High-Performance Parallel readers for S3 compatible storage
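
Data Reflections are Dremio-managed materializations. As a rough conceptual analogy only (not Dremio's actual mechanism), the sketch below precomputes an aggregate once and answers later summary queries from the much smaller materialization.

```python
# Conceptual analogy for Data Reflections (NOT Dremio's implementation):
# precompute an aggregate once, persist it in a columnar format, and answer
# repeated summary queries from the smaller materialization.
import pyarrow as pa
import pyarrow.parquet as pq

raw = pa.table({
    "region": ["EMEA", "EMEA", "APAC", "APAC"],
    "amount": [100.0, 250.0, 75.0, 125.0],
})

# "Create the reflection": group by once and persist the result.
agg = raw.group_by("region").aggregate([("amount", "sum")])
pq.write_table(agg, "sales_by_region.parquet")

# Later queries read the small aggregate instead of rescanning the raw data.
reflection = pq.read_table("sales_by_region.parquet")
print(reflection.to_pydict())
```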

Note: I came across LakeSoul, a new table format, but couldn't find much documentation around it. Also, it targets the Spark engine rather than standalone SQL engines. Just mentioning it here for future reference.

Appendix

Data architecture evolution

Operational Data Store → Data Warehouse → Data Marts → Multi-Dimensional Databases (Oracle Express, Hyperion Essbase) → Appliances (Teradata, Netezza) → Data Lake (S3) → Data Fabric (HPE Ezmeral) → Data Virtualization (Denodo) → Lakehouse → Delta Lake (Databricks) → Data Mesh (ThoughtWorks)

Happy Reading!!
