Navigating the Data Deluge: A Journey to Crafting a Sophisticated Lakehouse Platform

Meni Shakarov
Eleos Health
May 26, 2024 · 14 min read

It was late on a Thursday afternoon when the realization dawned on us. Gathered around the flickering screens in our dimly lit office, my two colleagues, both sharp data analysts, a dedicated team of data scientists, and I, a data engineer, were deep into attempting to join data from our PostgreSQL and MySQL databases. The task should have been straightforward, but instead we found ourselves wrestling with mismatched data structures and conflicting schemas from different departments. As we manually tried to align these disparate data sets, the complexity felt overwhelming, and with each failed attempt to synchronize the chaotic information, the frustration grew. “We can’t work like this,” I muttered, the stress palpable in my voice. It was clear that our current system was woefully inadequate: we needed a unified platform that could not only contain but also make sense of this sprawling data jungle. This was the moment we knew a significant change was imperative.

Embarking on this journey led to the creation of a robust data platform architecture that now underpins our company’s decision-making process. In this blog post, I will explore the insights and challenges we faced while building a data infrastructure that not only consolidates diverse data into a single, powerful engine for analysis but also remains flexible and scalable, continuously evolving and integrating more data sources over time.

Data lakes have emerged as a vital tool for businesses that need to adapt quickly and scale efficiently, allowing the storage of large volumes of raw data in its native format. This flexibility is a game changer for data integration. As we progressed, however, we recognized the need for a more structured approach to data management, leading us to adopt a lakehouse architecture. This hybrid model combines the scalability and flexibility of data lakes with the management capabilities of data warehouses, enhancing our data strategy. This narrative will explore the practical steps we took to harness these technologies, shedding light on how the integration of a data lake into a lakehouse framework can be a transformative asset for any data-driven organization.

Our Data Strategy Before the Lakehouse

Before we developed our lakehouse, our data management was simple but became inefficient as our needs grew. We managed three main data sources using Apache Superset for BI, which initially met our needs with minimal complexity.

However, as our data requirements grew, the limitations of our system began to show. A significant challenge arose when we needed to join data across two different sources. Our solution at the time was far from efficient: we manually replicated the necessary data from one source to another. This process was not only time-consuming but also prone to errors, as it required frequent updates to keep the data synchronized.

Moreover, as different teams and projects developed, multiple datasets were created within Superset, each tailored for specific analytical needs. Unfortunately, this led to redundant transformations being coded into multiple datasets. Not only did this redundancy complicate our data processes, but it also made maintaining and updating these transformations increasingly burdensome.

Deciding on a Data Architecture: Lake, Warehouse, or Lakehouse?

Choosing the right architecture for our data infrastructure was a critical decision. We deliberated among three main options: a data lake, a data warehouse, and a lakehouse. Here’s a brief overview of each:

Data Lake: A data lake stores vast amounts of raw data in its native format. It’s ideal for handling large volumes of diverse data from various sources, offering high flexibility and scalability. However, it lacks some of the processing efficiencies of structured data environments.

Data Warehouse: This is a system that stores data in a structured format, optimized for querying and analysis. Data warehouses are excellent for fast query performance on structured data but can be less flexible in accommodating changes and new data types.

Lakehouse: A hybrid model that combines the best of both data lakes and data warehouses. It provides the vast storage and flexibility of data lakes, along with the efficient querying capabilities of data warehouses.

After careful consideration, we chose to implement a lakehouse architecture for several reasons:

1. Flexibility: The lakehouse architecture offered the adaptability we needed. Traditional data warehouses can be restrictive and slow to adapt to changes, making it challenging to integrate new data sources or types quickly.

2. Simplified Architecture: Initially, we considered implementing a traditional ETL process to a data warehouse alongside a data lake, but we found no compelling reason to maintain two separate systems. The lakehouse model offers a streamlined approach, combining the robust querying and storage capabilities into a single, more manageable system.

3. Query Performance: Our lakehouse setup provides excellent query performance that meets our current needs. While it may not support real-time millisecond response times, it’s more than adequate for our analytical requirements without the overhead of real-time processing.

Our Data Lakehouse Stack

Our lakehouse architecture is designed to harness the best of AWS technology and open-source solutions to manage and analyze our data effectively. Here’s an overview of the core components:

Data Storage:

We utilize AWS S3 for our data storage, organizing our environment into two dedicated buckets for development and production. Embracing the medallion architecture, we’ve established three distinct layers (folders) within each bucket: bronze, silver, and gold, each serving a specific purpose in our data management lifecycle.
The medallion architecture is a tiered data processing model used in lakehouse systems that categorizes data into three layers:

  • Bronze Layer (Raw Layer): At this foundational level, we store data exactly as it arrives from various sources, preserving it in its native formats such as JSON, CSV, and others. This layer is essential for us, primarily utilized by our data engineering team who access it for debugging and ensuring data integrity. It’s also where our data scientists start their exploratory analysis to gauge data quality and gather initial insights, which are crucial for guiding further data processing strategies.
  • Silver Layer (Cleansed Layer): Once data moves to the silver layer, we perform necessary cleansing and transformation processes. Here, we correct any inconsistencies and enrich the data, applying specific business rules to make it more structured and useful. Our analytics engineer frequently works with this cleansed data, executing complex transformations and generating detailed reports that drive our internal analytics. Moreover, this layer provides a cleaner, more organized data environment that our data scientists rely on for building sophisticated models.
  • Gold Layer (Aggregated Layer): This is where our data achieves its highest value, transformed into business-level aggregates and key performance metrics that are directly used for decision-making and strategic planning. Optimized for rapid retrieval and high-speed analysis, the gold layer is predominantly accessed by our data analysts on behalf of decision-makers, who depend on this refined data for actionable insights that influence company-wide strategies and operations. Additionally, this layer serves as the backbone for our enterprise-level reporting and dashboarding.
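
To make this layout concrete, below is a minimal sketch of how the layers map onto S3 prefixes. The bucket name, source names, and helper function are hypothetical placeholders for illustration, not our actual naming.

# Hypothetical prefix layout for the medallion layers (bucket name is a placeholder).
MEDALLION_LAYERS = {
    "bronze": "s3://example-datalake-prod/bronze/",  # raw data in native formats (JSON, CSV, ...)
    "silver": "s3://example-datalake-prod/silver/",  # cleansed, conformed Parquet/Iceberg tables
    "gold": "s3://example-datalake-prod/gold/",      # business-level aggregates and KPIs
}

def layer_path(layer: str, source: str, table: str) -> str:
    """Build the S3 prefix for a table in a given layer, e.g. bronze/postgres/users/."""
    return f"{MEDALLION_LAYERS[layer]}{source}/{table}/"

print(layer_path("bronze", "postgres", "users"))
# s3://example-datalake-prod/bronze/postgres/users/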

In our silver and gold layers, data files are stored as Parquet files to optimize query performance due to Parquet’s efficient columnar storage format. Over these Parquet files, we utilize Apache Iceberg, an open table format that brings several critical features to our lakehouse architecture.
With Apache Iceberg, we can treat our data lake much like a traditional database but with greater flexibility and scalability. It supports complex operations such as snapshotting, transactions, upserts, and deletions, transforming our data lake into a more dynamic and versatile system.
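
To illustrate what “treating the lake like a database” looks like in practice, here is a simplified sketch that runs an Iceberg MERGE (upsert) through Athena using boto3. The database, table, column, and workgroup names are hypothetical; this is a sketch rather than our production code.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Upsert changed session rows from a silver staging table into a gold Iceberg table.
merge_sql = """
MERGE INTO gold.sessions AS target
USING silver.sessions_updates AS source
ON target.session_id = source.session_id
WHEN MATCHED THEN UPDATE SET duration = source.duration
WHEN NOT MATCHED THEN INSERT (session_id, day, duration)
VALUES (source.session_id, source.day, source.duration)
"""

athena.start_query_execution(
    QueryString=merge_sql,
    QueryExecutionContext={"Database": "gold"},  # hypothetical database name
    WorkGroup="sql-workgroup",                   # hypothetical workgroup name
)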

Data Catalog:

For our data cataloging needs, we chose AWS Glue Catalog, primarily for its cost-effectiveness, deep integration with other AWS services, and straightforward metadata management capabilities. AWS Glue Catalog serves as a central metadata repository, which makes it easier for us to manage and access our data assets across various AWS services. Utilizing the AWS Glue Crawler, we are able to automatically discover and categorize data stored in S3, creating and updating the data catalog tables effortlessly.
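
For illustration, kicking off a crawler run programmatically is a small boto3 call; the crawler name below is a hypothetical placeholder.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Start a crawler that scans new bronze-layer objects in S3 and
# creates or updates the corresponding Glue Catalog tables.
glue.start_crawler(Name="bronze-postgres-crawler")  # hypothetical crawler name

# Optionally check its state (READY / RUNNING / STOPPING).
state = glue.get_crawler(Name="bronze-postgres-crawler")["Crawler"]["State"]
print(f"Crawler state: {state}")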

However, we recognize that AWS Glue Catalog has its limitations, especially when it comes to facilitating easy data discovery in a way that suits our operational needs. While it integrates well and is cost-effective, it does not support some of the more nuanced features of a sophisticated data catalog, like the enhanced search and discovery tools that are vital for larger-scale data operations. This can impact various team members, including data scientists who require quick access to diverse datasets for modeling, data analysts who need to generate rapid insights for business decisions, and business intelligence professionals who depend on streamlined data exploration to build comprehensive reports. Moving forward, we plan to investigate more robust data cataloging solutions that can provide more convenient and comprehensive support for data discovery, catering to the needs of these critical roles in our organization.

Data Access and Query Engine:

AWS Athena serves as our primary query engine, offering a seamless integration with the AWS Glue Catalog. This straightforward setup allows us to efficiently query our data lake, making Athena a crucial component of our data architecture. One of the key advantages of using Athena is its cost-effectiveness. Since Athena charges based on the amount of data scanned during queries, and our current queries do not scan excessive amounts of data, we have managed to keep our costs quite low.
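
Because billing tracks bytes scanned, it is easy to keep an eye on cost per query via the statistics Athena returns. Here is a minimal sketch with boto3; the query, database, and workgroup names are assumptions for illustration.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Run a query and report how much data it scanned (the basis of Athena pricing).
query_id = athena.start_query_execution(
    QueryString="SELECT day, COUNT(*) FROM gold.sessions GROUP BY day",  # example query
    QueryExecutionContext={"Database": "gold"},                          # hypothetical database
    WorkGroup="sql-workgroup",                                           # hypothetical workgroup
)["QueryExecutionId"]

# Poll until the query finishes, then read its execution statistics.
while True:
    execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
    if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

scanned_mb = execution["Statistics"]["DataScannedInBytes"] / 1024 / 1024
print(f"Query scanned {scanned_mb:.1f} MB")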

However, we are aware that as we expand our use of the data lake — especially with the upcoming integration of direct data queries via charts in our application — the costs associated with Athena could increase. To prepare for this potential scenario, we are considering a transition to Trino, which would run on EKS and connect to the AWS Glue metastore. This migration is expected to be straightforward due to the underlying similarities between Athena and Trino.

Currently, we utilize two different workgroups in Athena — one dedicated to SQL queries and the other to Spark (Python) operations. Looking ahead, we plan to refine this setup by creating separate workgroups for different business needs, such as transformations, customer analytics, etc., to enhance operational efficiency and cost management.

Data Governance:

AWS Lake Formation is integral to our data governance, significantly boosting data security and permission management in our lakehouse architecture. It helps us enforce strict access controls, crucial for handling PHI and sensitive data.

Implementing LF-Tags for Robust Access Control: To ensure data is accessed securely and in compliance with our strict policies, we utilize LF-tags to finely control permissions at the database and table levels. Our tagging strategy is methodically designed, where databases are tagged generally, and more granular permissions are managed at the table level based on specific requirements. The possible tags we use include:

  • Environment: dev, prod
  • Department: app, internal, devops, hr, customers, infra, sales, ds
  • PHI: true
  • Data Lake Layer: gold, silver, bronze
  • Client Facing: true (indicates whether the data can be exposed to clients)

Multiple tags can be applied to each database and table, allowing for nuanced, role-based access control. For example, a role designed for customers querying data from our application might be granted access through a combination of tags such as client_facing: true, data_lake_layer: gold, and environment: prod.
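
As an illustration of this tag-based model, the sketch below assigns LF-tags to a table and grants a role SELECT on everything matching a tag expression. It assumes the LF-tags themselves already exist; the database, table, and role ARN are hypothetical placeholders that simply mirror the scheme above.

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Tag a gold-layer table so it can be matched by tag-based policies.
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "gold", "Name": "sessions"}},  # hypothetical table
    LFTags=[
        {"TagKey": "environment", "TagValues": ["prod"]},
        {"TagKey": "data_lake_layer", "TagValues": ["gold"]},
        {"TagKey": "client_facing", "TagValues": ["true"]},
    ],
)

# Grant a customer-facing role SELECT on every table matching the tag expression.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/customer-query-role"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [
                {"TagKey": "environment", "TagValues": ["prod"]},
                {"TagKey": "data_lake_layer", "TagValues": ["gold"]},
                {"TagKey": "client_facing", "TagValues": ["true"]},
            ],
        }
    },
    Permissions=["SELECT"],
)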

Initially, setting up AWS Lake Formation was not straightforward. The platform, while powerful, was not intuitive, and it took considerable effort and time to align the permission behaviors with our governance requirements. Overcoming these challenges involved a steep learning curve, where we had to experiment with various configurations to understand how to effectively implement and manage the detailed access controls we needed.

Data Ingestion:

We sought a platform that was straightforward to use and supported a wide variety of connectors, anticipating the future need to ingest data from across the organization. After conducting a POC comparing two platforms, Airbyte (an open-source solution) and Rivery (a SaaS solution), we opted for Airbyte for several compelling reasons.

Firstly, our decision leaned towards a cost-effective solution that wouldn’t charge based on data volume, as we preferred to ingest data freely without worrying about escalating costs. Additionally, Airbyte’s development tools particularly impressed us. The platform’s Connector Builder SDK, which includes both a Connector Builder UI and a low-code connector development environment, offered the simplicity and flexibility we needed. This feature allowed us to easily build and customize data connectors as per our specific requirements.

While Airbyte boasts a large and active community, we did face some challenges with the product. Initially, data ingestion speeds were slow, particularly noticeable when it took over two days to transfer 20GB from PostgreSQL. I first tried using the AWS-data-lake destination, but it was slow and did not support long-running syncs. To address this, I submitted a pull request to fix this issue, though it took three months to merge. In search of a better solution, I experimented with several other destinations. The S3 destination, while functional, had an annoying issue where timestamps were formatted as a struct when using Parquet. This specific issue has been open for two years without any resolution, highlighting a significant gap in support. The Iceberg destination, though promising, did not support the AWS Glue Catalog. I then tried the AWS-Glue destination, but found it inefficient as it only supported JSON output.

Ultimately, none of these options fully met our needs, leading us to develop our own custom AWS-data-lake destination. We forked the original code and customized it specifically for our requirements, creating a tailored solution that significantly improved our data ingestion process.

Despite these hurdles, Airbyte has met all our requirements effectively. Today, we have successfully integrated about 15 different data sources into our data lake using Airbyte, significantly enhancing our data ingestion capabilities and overall data strategy.

Data Processing:

Our data processing workflows are driven by dbt Core, following an ELT (Extract, Load, Transform) approach in which all our data starts in the bronze layer and is progressively transformed into the silver and gold layers.

We are using the dbt Athena adapter, which supports both SQL and Python (PySpark) models. This versatility is crucial for handling more complex transformations effectively. The dbt-Athena adapter benefits from a vibrant community and continues to evolve with regular updates. Initially, we were somewhat hesitant to adopt the Python Athena integration due to its novelty and the limited track record at the time. However, after thorough testing and validation, we have successfully implemented it in our production environment without any issues, confirming its stability and efficiency.

In dbt, configuring table properties is straightforward and flexible, which greatly enhances our data management capabilities. For instance, we extensively use incremental tables, which are crucial for efficiently ingesting only the new or changed data. Leveraging the Iceberg table format, we adopt a merge incremental strategy, allowing us to seamlessly update our datasets without redundant processing. Furthermore, managing data partitions is also simplified in dbt; partitions can be declared directly within the table properties. Below is an example of table properties configuration for a table in our gold layer, showcasing how we specify the materialization strategy, partitioning, and data format:

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key=['session_id'],
    partitioned_by=['day'],
    table_type='iceberg',
    format='parquet'
) }}
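
For the Python (PySpark) models mentioned earlier, the shape is similar, with the configuration moving into dbt.config(...) inside the model function. Below is a minimal sketch of such a model; the upstream ref, column names, and aggregation are hypothetical.

# models/gold/daily_sessions.py -- hypothetical dbt Python (PySpark) model
import pyspark.sql.functions as F

def model(dbt, session):
    # Equivalent of the Jinja config block, expressed in Python.
    dbt.config(
        materialized="table",  # could also be an incremental model, as in the SQL example
        table_type="iceberg",
        format="parquet",
    )

    # Read an upstream silver model as a Spark DataFrame (model name is hypothetical).
    sessions = dbt.ref("silver_sessions")

    # Aggregate to a business-level daily metric for the gold layer.
    return (
        sessions.groupBy("day")
        .agg(F.countDistinct("session_id").alias("session_count"))
    )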

We are extremely satisfied with dbt; it has revolutionized how we manage our data transformations. The tool provides robust version control, reusability of code, and clear documentation of data flows, which greatly simplifies managing complex transformations and maintaining data integrity across our lakehouse architecture.

Building on our success with dbt, we’re currently exploring new tools to further enhance our data management capabilities. A standout addition has been Montara.io, which integrates directly with our dbt Git repository and brings an array of powerful features to streamline our workflows. Montara offers automatic CI/CD, a user-friendly UI that simplifies writing and testing models — even for team members without deep dbt expertise — and provides valuable tools such as data lineage display, a data catalog, and observability.

We’ve been impressed with Montara; it significantly simplifies our dbt workflow, making data transformation more accessible and manageable across our team. Although the tool is relatively new and still evolving, meaning we occasionally encounter minor issues and gaps in functionality, our experience has been overwhelmingly positive. The Montara team provides exceptional support, working closely with us to quickly address any challenges and integrate our feedback into ongoing product enhancements. This collaborative approach not only helps us resolve issues swiftly but also ensures that Montara.io continues to evolve in a way that perfectly aligns with our data infrastructure needs.

Analytics and BI Tools:

Apache Superset is our analytics and business intelligence tool of choice, renowned for its open-source and powerful data visualization capabilities. We opted for Superset primarily because of its flexibility and cost-effectiveness compared to other BI tools. Its extensive customization options and user-friendly interface allow our teams to tailor dashboards and reports to their specific needs, making it particularly suited for our unique analytical environment where we primarily use Athena as our data source.

Our data analysts mainly use Superset to create dashboards for different departments within our company. Furthermore, we leverage Superset’s capabilities to embed charts directly into our applications, providing valuable insights to our customers.

Currently, some of our charts directly access data from the bronze layer, performing transformations in real time. However, we are in the process of modifying this approach to reduce the load on raw data queries and plan to eventually restrict access to the bronze layer using LF-tags.

While we find Superset to be a convenient and effective tool for dashboarding, it can become somewhat messy due to the proliferation of datasets created over time. Each dataset in Superset is configured individually, which can lead to redundancy and management challenges as the number and complexity of dashboards grow. Despite these challenges, Superset meets our requirements well, providing a versatile platform for visualizing and interacting with our data across the organization.

Orchestration and Workflow Management:

Apache Airflow plays a crucial role in orchestrating and managing workflows within our data environment. As an open-source tool, Airflow offers flexibility, scalability, and robust community support, which are essential for our operational needs. We leverage Airflow to ensure all data pipelines into our data lake are triggered accurately, maintaining the freshness and reliability of our data.

Currently, we manage three main DAGs (Directed Acyclic Graphs) that are integral to our lakehouse operations. The first DAG is tasked with orchestrating data ingestion into the bronze layer, utilizing the AirbyteOperator to trigger all necessary syncs. A second DAG is responsible for executing our dbt transformations, ensuring our data is processed and moved to the silver and gold layers efficiently. The third DAG oversees the entire workflow, triggering the ingestion DAG and subsequently the dbt DAG in sequence to maintain a smooth flow of data processing.
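
To give a feel for the third, orchestrating DAG, here is a simplified sketch built on TriggerDagRunOperator. The DAG ids and schedule are hypothetical, and the real DAGs carry more configuration than shown.

from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# Orchestrating DAG: run ingestion (Airbyte syncs) first, then the dbt transformations.
with DAG(
    dag_id="lakehouse_orchestrator",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",             # assumed nightly schedule
    catchup=False,
):
    trigger_ingestion = TriggerDagRunOperator(
        task_id="trigger_ingestion_dag",
        trigger_dag_id="airbyte_ingestion",  # DAG that fires the Airbyte syncs
        wait_for_completion=True,            # block until ingestion finishes
        poke_interval=60,
    )

    trigger_dbt = TriggerDagRunOperator(
        task_id="trigger_dbt_dag",
        trigger_dag_id="dbt_transformations",  # DAG that runs the dbt models
        wait_for_completion=True,
        poke_interval=60,
    )

    trigger_ingestion >> trigger_dbt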

Additionally, we have integrated Slack notifications within these workflows. This setup provides real-time alerts whenever a DAG fails, enabling immediate monitoring and response to maintain continuous operation and data integrity.
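
There are several ways to wire this up; one minimal approach is an on_failure_callback that posts to a Slack incoming webhook, sketched below. The webhook handling and message format are placeholders rather than our exact setup.

import os
import requests

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # placeholder: a Slack incoming-webhook URL

def notify_slack_on_failure(context):
    """Airflow on_failure_callback: post the failed task's details to Slack."""
    ti = context["task_instance"]
    message = (
        f":red_circle: DAG `{ti.dag_id}` failed on task `{ti.task_id}` "
        f"(logical date {context['logical_date']})."
    )
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

# Attach it via default_args so every task reports failures:
# default_args = {"on_failure_callback": notify_slack_on_failure}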

Conclusion: Harnessing Data for Strategic Advantage

In conclusion, our journey to build and refine a data lakehouse architecture has been both challenging and rewarding. We have successfully integrated a suite of tools and technologies that have transformed our data management capabilities, enabling us to unify diverse data into a robust analytical engine that supports dynamic decision-making. Our use of Apache Superset, AWS S3, AWS Glue Catalog, Apache Airflow, and dbt demonstrates our commitment to employing cutting-edge technology to address complex data challenges.

These tools have not only improved our operational efficiency but have also paved the way for more insightful data analysis and reporting across the company. As we continue to evolve our data infrastructure, we remain committed to exploring new technologies and approaches that can further enhance our data capabilities.

I am open to feedback and questions from readers who are on a similar journey or considering such initiatives. Feel free to connect with me on LinkedIn for further discussion or to exchange ideas.
