From Traditional BI to Lake House, a Data Architecture Evolution

Felipe Miquelim
Blog Técnico QuintoAndar
10 min read · Jul 10, 2023

How QuintoAndar evolved its Data Platform to a Distributed Strategy.

Although QuintoAndar was born as a tech startup, its Data area started out following the traditional Business Intelligence framework. Our early-stage Data Architecture was based on batch ETL (Extract, Transform and Load) pipelines for extracting and treating data; an Operational Data Storage (ODS) serving as a staging area; and a final Data Warehouse structure, subdivided into Data Marts, to support the decision-making process.

QuintoAndar's early architecture, based on the Business Intelligence Architecture.

As time went by, the company grew, and at the same pace its internal and external data sources became wider and more diverse. Data treatment and consolidation were also becoming more and more complex, and as a side effect, the existing Data Architecture was no longer enough to deliver ready-to-use data to its stakeholders.

We revisited our Data Architecture in search of a more efficient way to deal with huge volumes and different types of data, and of something easier to maintain and scale (otherwise we would soon hit the ceiling again). That search led us to a Big Data architecture able to handle a gigantic amount of data and to provide a very scalable environment for the long road.

The traditional BI architecture

The Data area in QuintoAndar was created to better organize data from a single source, in our case — and the case of most new companies — the company's monolith. The main goal was to analyze the product's data without the need to access the company's single-application database directly. Given this simple requirement, a Business Intelligence approach was applied, in which daily batch jobs would extract data from the source, treat it accordingly, and load it into its final location. This Extract-Transform-Load process, or ETL, would provide data ready for ad-hoc analysis, report extraction, and basic data visualization in a Data Warehouse platform.

The ETL and the Data Warehouse (DW) are the core concepts in the Business Intelligence architecture:

  • The ETL process is responsible for extracting data from sources; cleansing transactional data; treating data according to our business rules; and modeling data, organizing it into segments (or datamarts) within the Data Warehouse. We followed the Star-Schema dimensional modeling framework, which provided data arranged as datamarts by business processes/areas.

The final step of the ETL process is the loading of data. Loading in the data warehouse environment usually takes the form of presenting the quality-assured […] tables to […] each data mart.
Ralph Kimball, in The Data Warehouse Toolkit.

  • The Data Warehouse is the frontier between usage-ready data and the users eager to use it. The DW is frequently accessed for data retrieval, especially by reporting and data visualization tools. That frequent access requires well-assembled data — which was covered by our star-schema-modeled datamarts — and software optimized for the job. For the latter, we were using AWS Redshift, which provided huge computational power and quick response times.

Still, there is an important middle part to mention within a BI architecture, the ODS:

  • The ODS, or Operational Data Storage, is the middle ground between the source databases (containing transactional data) and the Data Warehouse (containing analytical, modeled data). In our architecture, the pipelines in Airflow dumped data into the ODS, in our case a Postgres database deployed as an RDS instance on AWS. Afterward, subsequent Airflow tasks kept cleaning, treating, and joining data within the relational database, across different schemas and tables, until it was organized enough to be modeled into Redshift's star schema, as sketched in the example below.
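
To make that flow concrete, below is a minimal sketch of how such a daily batch pipeline could be wired in Airflow, with the extraction into the ODS and the star-schema modeling as separate tasks. The connection IDs, tables, columns, and SQL are hypothetical placeholders for illustration only, not QuintoAndar's actual pipelines, and the final load into Redshift is omitted for brevity.

```python
# Minimal, illustrative Airflow DAG for a nightly batch ETL: a full extraction
# from the source monolith into the ODS, followed by star-schema modeling.
# Connection IDs, schemas, tables, and columns are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def extract_to_ods(**_):
    """Copy every row of a source table into the ODS staging schema (full load)."""
    source = PostgresHook(postgres_conn_id="monolith_db")  # hypothetical connection
    ods = PostgresHook(postgres_conn_id="ods_postgres")    # hypothetical connection
    rows = source.get_records("SELECT id, property_id, status, rent_amount FROM contracts")
    ods.run("TRUNCATE TABLE staging.contracts")
    ods.insert_rows(table="staging.contracts", rows=rows)


def build_star_schema(**_):
    """Clean and join staged tables, then load dimension and fact tables."""
    ods = PostgresHook(postgres_conn_id="ods_postgres")
    ods.run("""
        INSERT INTO dw.dim_property (property_id)
        SELECT DISTINCT property_id FROM staging.contracts
    """)
    ods.run("""
        INSERT INTO dw.fact_contracts (contract_id, property_id, status, rent_amount)
        SELECT id, property_id, status, rent_amount FROM staging.contracts
    """)


with DAG(
    dag_id="daily_batch_etl",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",  # nightly batch run, as in the original setup
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_ods", python_callable=extract_to_ods)
    model = PythonOperator(task_id="build_star_schema", python_callable=build_star_schema)
    extract >> model
```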

The Achilles' heel

There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.
Eric Schmidt when he was CEO at Google, in 2010.

Our pipelines in Airflow were good enough to deliver data to stakeholders through their daily batch jobs. However, we started facing SLA (service level agreement) problems, that is, we were not able to deliver the ready-to-use data on time every single day. Crashes in the pipelines, which ran overnight, were becoming common, and in extreme situations some of them could delay delivery by half a day.

One issue delaying data delivery was the technology behind the ETL. Data extraction was done with Python, normally with the petl library, which would load data into Table objects in memory and would not work well with large datasets. In some of the more complex pipelines, data was constantly delivered late due to the huge volume that kept growing in the source database and could not be processed fast enough with Python.
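
As an illustration only (connection strings, tables, and columns below are hypothetical, not the original pipelines), an extraction in that style looked roughly like the sketch below. Everything runs in a single Python process on the Airflow machine, so large tables quickly become a memory and runtime bottleneck.

```python
# Illustrative single-machine extraction in the style of the old pipelines.
# Connection strings, table names, and columns are hypothetical placeholders.
import petl as etl
import psycopg2

source_conn = psycopg2.connect("host=source-db dbname=monolith user=etl_user")
ods_conn = psycopg2.connect("host=ods-db dbname=ods user=etl_user")

# Full extraction: every row of the table is pulled, every day.
contracts = etl.fromdb(source_conn, "SELECT * FROM contracts")

# Typical treatment steps, all executed in a single Python process.
cleaned = (
    contracts
    .select(lambda row: row["status"] is not None)  # drop incomplete rows
    .convert("rent_amount", float)                  # normalize types
    .sort("contract_id")                            # sorting requires buffering rows
)

# Load the treated table into the ODS staging area.
etl.todb(cleaned, ods_conn, "contracts_staging")
```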

In the past, source databases at QuintoAndar had a big problem: the lack of a fully trusted audit architecture. Not all tables had auditing implemented, since it was a manual process, and for those that had, we usually found inconsistencies. Because of that, most database extractions were full loads — redundantly loading every single row of every single table, every day.
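
This is exactly what a trustworthy audit column would have avoided: with, say, a reliable `updated_at` timestamp (hypothetical here), the extraction could be incremental instead of full. A rough sketch of the two query shapes:

```python
# Hypothetical illustration of full vs. incremental extraction.
# A trusted audit column (e.g. updated_at) is what makes the incremental form safe.
FULL_EXTRACTION = "SELECT * FROM contracts"

INCREMENTAL_EXTRACTION = """
    SELECT *
    FROM contracts
    WHERE updated_at >= %(last_successful_run)s
"""


def build_extraction_query(has_trusted_audit_column: bool) -> str:
    """Pick the extraction strategy for a table.

    Without trustworthy auditing there is no reliable way to tell which rows
    changed, so the only safe option is to reload the whole table every day.
    """
    return INCREMENTAL_EXTRACTION if has_trusted_audit_column else FULL_EXTRACTION
```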

Furthermore, our Airflow instance had memory and processing limitations, since it was deployed on an AWS EC2 instance. These limitations were starting to become a pain: the heavy memory usage on the Airflow instance, caused by multiple (and huge) petl Tables held in memory, was affecting the whole environment, since various DAGs running concurrently would require an ever more powerful EC2 instance. Pipeline crashes due to a lack of usable memory in Airflow were becoming constant and lengthened the list of SLA detractors.

Illustration by David Fletcher on CloudTweaks.com

The shift to a distributed file system

Two approaches helped us handle our challenges around processing and consolidating a large variety and volume of data: data lake storage and a parallel data processing framework.

I. Data Lake

A data lake is a central repository that stores both structured and unstructured data at scale. It is a game changer for handling large amounts and a wide diversity of data, backed by low storage costs.

We started implementing the Data Lake architecture to replace the ODS framework. The Postgres database, which demanded a lot of processing and time to structure data and was hard (and costly) to scale, gave way to a data lake strategy based on AWS S3 storage. We followed the data lake layers framework, defining layers in our storage with different goals and for different audiences:

  • raw layer — stores unprocessed and unmodified data, keeping the original file format (usually JSON or CSV). This layer was not supposed to be accessed by analysts or services; it was meant for internal data processing only.
  • clean layer — stores data converted to a file format optimized for consumption, in our case Parquet. The data in this layer had basic cleaning and standards applied. This layer was supposed to be accessed for exploratory analysis and used as a source for the Star Schema.
  • enriched layer — stores well-processed data, which can also go through aggregation and joining processes to generate one or more tables per original table. This layer was also available for exploratory analysis and used as a source for the Star Schema (a minimal sketch of the raw-to-clean promotion follows this list).
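
As a rough illustration of how data is promoted between the first two layers (bucket names, paths, and columns below are hypothetical), a Spark job reading raw JSON and writing clean Parquet could look like this:

```python
# Illustrative PySpark job promoting a table from the raw layer (JSON)
# to the clean layer (Parquet). Buckets, paths, and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("raw_to_clean_contracts").getOrCreate()

# Raw layer: data exactly as extracted, in its original format.
raw = spark.read.json("s3://data-lake-raw/monolith/contracts/2023-07-10/")

# Basic cleaning and standardization applied before promotion.
clean = (
    raw.dropDuplicates(["contract_id"])
       .withColumn("rent_amount", F.col("rent_amount").cast("double"))
       .withColumn("created_at", F.to_timestamp("created_at"))
       .withColumn("created_date", F.to_date("created_at"))
)

# Clean layer: columnar, compressed, query-friendly Parquet.
(clean.write
      .mode("overwrite")
      .partitionBy("created_date")
      .parquet("s3://data-lake-clean/monolith/contracts/"))
```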

The Data Lake approach gave us the flexibility to work with different types of data and to supply products that we were not able to support before, such as machine learning algorithms and predictive analysis, on top of empowering the analytical backbone of the company and improving the quality of KPIs and reports. With cloud object storage and Spark, we were able to separate our storage needs from our processing needs, whereas in the ODS they were all tied up in the same Postgres architecture.

II. Spark

Spark was the other pillar of our distributed architecture. It is a distributed data processing engine that can process enormous amounts of data. It uses in-memory caching and optimized query execution for fast queries against data of any size. Spark is one of the most effective data processing frameworks thanks to its handling of large datasets, its speed, and its overall flexibility. It fit our requirements at the time like a glove: processing large volumes of data in a feasible time.

Databricks was our choice for running Spark. Databricks is a cloud-based platform on which we can quickly create and deploy Spark clusters and run our ETLs, written in PySpark. Databricks provides us with the Spark infrastructure and the management of clusters. It integrates very well with AWS, and it was very easy to plug into our pipelines in Airflow, since Airflow has native operators for that communication.
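
For instance, submitting a PySpark job to a transient Databricks cluster from Airflow can look roughly like the sketch below; the connection ID, cluster spec, and script location are hypothetical, and the task would live inside a DAG like the one sketched earlier.

```python
# Illustrative Airflow task submitting a PySpark job to a transient Databricks
# cluster. Connection ID, cluster spec, and script path are hypothetical.
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

run_raw_to_clean = DatabricksSubmitRunOperator(
    task_id="raw_to_clean_contracts",
    databricks_conn_id="databricks_default",
    new_cluster={
        "spark_version": "11.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 4,
    },
    spark_python_task={
        "python_file": "dbfs:/pipelines/raw_to_clean_contracts.py",
    },
)
```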

The shift to Data Lakehouse

At that point, our Data Platform was based on storage in AWS S3, data processing in Databricks via Spark, pipeline orchestration in Airflow, and Data Warehousing in AWS Redshift. Although Redshift was theoretically the last layer, users and data services might still need to access data in previous layers (such as the data lake's clean and enriched layers), which was done via Athena or Redshift's Spectrum feature. Our goal was to find a way to unify the data access strategy and not depend on three different tools… and that is where the Data Lakehouse architecture comes into play!

The Data Lakehouse is a fresh architecture designed to combine the data structure and data management features of a data warehouse with the low-cost storage of a data lake, by adding a metadata layer on top of the data stored in the data lake and unifying data access, regardless of the layer the data is in.

"What is a Data Lakehouse?", by Taygan Rifat, on Data Lakehouse.

To implement the Data Lakehouse architecture, we would need to add a Metadata Layer on top of the data stored in S3. This layer would be responsible for mapping the data's metadata and would give us a handful of options:

  • By plugging a query engine into the Metadata Layer, we would have a centralized access point, which would allow us to perform federated queries.

The term query federation describes a collection of features that enable users and systems to run queries against multiple siloed data sources without needing to migrate all data to a unified system.
Databricks on What is query federation?

  • The Metadata Layer would also provide us with the capability needed to evolve our Security and Governance platform. From the Governance point of view, we would be able to add data documentation, map data lineage, and add custom metadata, such as tags to identify PII (Personally Identifiable Information). From the Security point of view, we could use the metadata to map users' access profiles.

Here you can read about the first challenges of the Data Governance Team in QuintoAndar.

Redshift had limited scaling capabilities: it only scaled horizontally, with compute and storage coupled in the same nodes. On average, our memory and CPU usage was reaching 70%, while storage usage was only around 1%. Thus, to implement the Data Lakehouse architecture, we chose to replace Redshift (and its Spectrum feature) and Athena with a new stack built on Hive Metastore and Trino.

The Hive Metastore is a data catalog responsible for managing and persisting metadata in a relational database. Trino is a highly parallel, distributed query engine capable of querying petabytes of data.
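
With the lake's tables registered in the Hive Metastore, a single Trino endpoint can answer queries over them and even join them with other catalogs in one federated query. Below is a hedged sketch using the trino Python client; the host, catalogs, schemas, and table names are hypothetical.

```python
# Illustrative federated query through Trino. Host, catalogs, schemas,
# and table names are hypothetical placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.internal.example.com",
    port=8080,
    user="analytics",
    catalog="hive",      # data lake tables registered in the Hive Metastore
    schema="enriched",
)

cursor = conn.cursor()
cursor.execute(
    """
    SELECT ci.name AS city, count(*) AS rentals
    FROM hive.enriched.contracts AS c
    JOIN postgres.public.cities AS ci   -- a second catalog: federated access
      ON c.city_id = ci.id
    GROUP BY ci.name
    ORDER BY rentals DESC
    LIMIT 10
    """
)
for row in cursor.fetchall():
    print(row)
```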

Building a SQL Engine Infrastructure at QuintoAndar, by @kenjihiraoka, shows QuintoAndar's new Data stack based on Trino and Hive Metastore.

If you want to know more details regarding QuintoAndar's Hive & Trino stack, check out the following Stories: Old but gold: implementing a Hive Metastore Infrastructure; The Solo Career of Hive Metastore; and Building a SQL Engine Infrastructure at QuintoAndar.

Benefits

Data Lakehouse, by Databricks, shows the evolution of data architecture, which matches how the Data area in QuintoAndar changed over the years.

To sum it up, we migrated our storage to a distributed file system, organized in layers with well-defined responsibilities (data lake on S3; data warehouse and data marts on Redshift), with ELT flows via Airflow. Afterward, we went full cloud by adding Cloud Composer (GCP's fully managed Airflow) to our stack, which orchestrates and manages our pipelines, all relying on parallel processing with Spark via Databricks.

Even though the improved architecture was very eye-catching, the analytical layer designed on top of Redshift still did not handle our frequent need to scale well. Thus, by replacing the Data Warehousing layers on Redshift with a Lake House strategy based on S3, Hive, and Trino, we took one step further toward a top-notch Lake House architecture.

The changes were remarkable! Having every single layer decoupled from the others allowed us to customize and scale layers individually, simply, and quickly, reducing both the effort to manage infrastructure and overall costs. Furthermore, we could process more data in less time, shortening our time to deliver and making reprocessing and backfilling easier.

As a business-related consequence (though no less important), the new infrastructure allowed us to empower our users with more complex data products, like real-time analytics based on streaming and modern Machine Learning products. And let's bear in mind: everything wrapped up in an LGPD (Brazilian General Data Protection Law) compliant environment.


Data Engineer @ QuintoAndar. ❤ Data, Football, NBA, NFL & Gaming Enthusiast ❤