A Serverless Approach to Data Warehousing

Maximilian Stoehr
GAMMA — Part of BCG X
8 min read · Oct 28, 2020

Authors: Maximilian Stoehr, Rajesh Yanamandra, Niels Freier, and Aaron Arnoldsen

Gone are the days when companies could be competitive without strong digital capabilities. Whether your company is in retail, manufacturing, travel, or construction, or is a city government, gathering, monitoring, tracking, and modeling data, and deploying the resulting insights, creates your competitive advantage. The problem is that building and maintaining a scalable, resilient data platform to manage all this is a daunting, complicated, and often expensive task. Despite its critical role in business success, creating the right data platform is something many companies overlook entirely.

In the modern business world, data made relevant through analytics and insights acts as a form of currency. As they “spend” their data, companies must strike a balance between performance and cost, while at the same time giving the data platform the resources it needs to scale its processing. This problem has become even more challenging amid the COVID-19 pandemic, during which companies are looking for the latest data and insights to drive decision-making. We at BCG faced these challenges as we deployed Lighthouse by BCG, our new high-frequency data and modeling platform built to drive real-time insights. Within a matter of weeks, Lighthouse was able to process over 200 billion rows of data and scale without any downtime, enabling us to quickly provide our project teams with the analytical horsepower they needed. And it did so without undue drain on resources.

This article walks you through the journey we took to create a decoupled, high-performing, resilient, and scalable data platform, all while under tight deadlines and a mandate to keep costs in check. We believe that our approach can be replicated and easily integrated into many existing pipelines.

Choosing the Right Data Warehouse (DWH)

One of the most significant challenges in data management is choosing the right platform: one capable of hosting petabytes of data without compromising on compute. After much deliberation, we chose Snowflake because it provides multiple benefits over comparable DWH solutions:

1. Scalable: Snowflake allows us to scale based on actual project needs. With distinct computing resources for every team, teams no longer have to compete with one another for resources; in fact, each team can leverage the platform’s full analytical capability.

2. Handles Semi-Structured Data: Snowflake also allows the loading of semi-structured data like Parquet and JSON files, which sets it apart from AWS’s Redshift database. Data can also be uploaded via the web interface, which simplifies ad-hoc tasks.

3. Team Connectability: Our commonly used tools, including Tableau, Alteryx, Excel, and Power BI, have native connectors to Snowflake. This reduces the effort our project teams must expend to connect to our data platform, since each team can use the tool it prefers.

4. Powerful Data Sharing Model: We leverage this model by creating a dedicated read-only account (also called a reader account) for every project team. The account allows us to assign dedicated access rights and compute resources, and to share only the data the project team needs. Combined with usage reporting, the reader account enables us to comply with our contractual obligations.

5. Vendor Connectability: Some of our data vendors use Snowflake as well, which allows them to share their data directly with us without any intermediate ETL.

Snowflake Data Sharing Model [1]
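
Purely as an illustration, the sharing setup for one project team boils down to a handful of SQL statements, which could be executed from Python via the Snowflake connector. Database, schema, table, and account names below are placeholders, not the actual Lighthouse objects.

def share_data_with_project(cursor, project: str) -> None:
    """Sketch: share only the tables a project team needs with its reader account."""
    share = f"{project.upper()}_SHARE"
    cursor.execute(f"CREATE SHARE IF NOT EXISTS {share}")
    cursor.execute(f"GRANT USAGE ON DATABASE lighthouse TO SHARE {share}")
    cursor.execute(f"GRANT USAGE ON SCHEMA lighthouse.curated TO SHARE {share}")
    # Grant access only to the tables this team needs (placeholder table name)
    cursor.execute(
        f"GRANT SELECT ON TABLE lighthouse.curated.mobility_data TO SHARE {share}"
    )
    # Make the share visible to the project's reader account
    cursor.execute(f"ALTER SHARE {share} ADD ACCOUNTS = {project.upper()}_READER")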

With the usage reporting capabilities of Snowflake, we can use Lighthouse to bring transparency for our project teams by providing them with a real-time view of the cost of their analytical workflows.
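
A rough sketch of the kind of query such reporting can be built on, using Snowflake’s ACCOUNT_USAGE schema (connection parameters are placeholders):

import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>", role="ACCOUNTADMIN"
)

# Credits consumed per warehouse over the last 30 days; with distinct
# warehouses per team, this maps roughly to per-team cost.
usage = conn.cursor().execute(
    """
    SELECT warehouse_name,
           DATE_TRUNC('day', start_time) AS usage_day,
           SUM(credits_used)             AS credits_used
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY 1, 2
    ORDER BY usage_day DESC
    """
).fetchall()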

Transparent Usage Reporting with Snowflake

Serverless ETL

You wouldn’t keep the lights on in a room you never enter, so why manage a permanent infrastructure you don’t really need? Designing a serverless architecture to handle the data processing enabled us to save resources and avoid having this extra burden — without compromising on performance or deliverables. We created the ETL to seamlessly integrate into our Snowflake architecture, while also being able to scale and run multiple processes on-demand at the click of a button. By leveraging this serverless, decoupled architecture, we were able to reduce costs by approximately 60% compared to a long-running, classical architecture.

Architecture

While our Lighthouse architecture is based on AWS, it can be replicated easily for other cloud service providers. For the sake of demonstration, we’ll use the AWS nomenclature to explain our process.

At first, we tried to leverage AWS Glue, but we needed more flexibility. We ended up using a mix of AWS Lambda as an entry function, AWS ECR and ECS Fargate as our ETL execution layer, and S3 as our intermediate staging area. To monitor our containers, we used Datadog as our centralized monitoring tool. For exploring the data platform, we provided our project teams with a data catalog based on Metabase.

The following image represents the final Lighthouse architecture:

Final Lighthouse Architecture

Entry Function

As described above, the Lambda function serves as an entry point. Given that we have various sources, from machine learning model output and OSINT data to data delivered by third-party vendors, the Lambda function creates the configuration for the execution layer.

The Lambda function is triggered either by new files being created in an S3 bucket, by CloudWatch events (time-based triggers), or directly by an AWS SNS message sent by our teams’ machine learning models.

To start the ECS Fargate task (our execution layer), we leverage the Boto3 Python library. We create the task definition manually beforehand, which leads to the resources being shared among all running containers. While this imposes a limit on the number of tasks that can run in parallel, it simplifies container management. To overcome this limitation, we can create new task definitions on the fly, although this ad hoc approach means that the ECS Fargate task needs to be removed again after the execution has finished.

We are using a Lambda function similar to the following code snippet:

import os

import boto3


def lambda_handler(event: dict, context) -> str:
    """Entry function for AWS Lambda.

    Parameters
    ----------
    event: dict
        AWS Lambda uses this dict to pass in event data to the
        handler
    context: LambdaContext
        Context object that provides methods and properties with
        information about the invocation, function, and execution
        environment

    Returns
    -------
    response: str
        String representation of the ECS run_task response
    """
    config = _get_config(event)  # Custom function to retrieve configuration
    client = boto3.client("ecs")
    response = client.run_task(
        cluster=os.environ["FARGATE_CLUSTER"],
        launchType="FARGATE",
        taskDefinition=os.environ["FARGATE_TASK"],
        count=1,
        platformVersion="LATEST",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": [
                    os.environ["SUBNET_1"],
                    os.environ["SUBNET_2"],
                ],
                "assignPublicIp": "DISABLED",
            }
        },
        overrides={
            "containerOverrides": [
                {
                    "name": os.environ["CONTAINER_NAME"],
                    "environment": config,
                    # The memory override must be an integer (in MiB)
                    "memory": int(os.environ["MEM_LIMIT"]),
                }
            ]
        },
    )
    return str(response)
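
The _get_config helper above is specific to our data sources. Purely as a hypothetical illustration, it could map the three trigger types described earlier to the container’s environment overrides roughly like this (the key names are invented):

def _get_config(event: dict) -> list:
    """Hypothetical sketch: translate the incoming trigger into environment
    variables for the ETL container (the real implementation differs)."""
    records = event.get("Records", [{}])
    if "s3" in records[0]:
        # A new file landed in the S3 bucket
        s3 = records[0]["s3"]
        return [
            {"name": "TRIGGER", "value": "s3"},
            {"name": "SOURCE_BUCKET", "value": s3["bucket"]["name"]},
            {"name": "SOURCE_KEY", "value": s3["object"]["key"]},
        ]
    if "Sns" in records[0]:
        # A message from one of the teams' machine learning models
        return [
            {"name": "TRIGGER", "value": "sns"},
            {"name": "PAYLOAD", "value": records[0]["Sns"]["Message"]},
        ]
    # Otherwise assume a scheduled CloudWatch event (time trigger)
    return [{"name": "TRIGGER", "value": "schedule"}]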

Execution Layer

For the Lighthouse execution layer, we chose ECS Fargate with Fargate Spot. This serves a dual purpose: ECS manages the container orchestration layer for us, while using Fargate Spot as the capacity provider keeps compute costs low.
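
As a sketch, assuming FARGATE_SPOT is enabled as a capacity provider on the cluster, the run_task call from the entry function only needs a capacity provider strategy in place of the launchType parameter:

import os

import boto3

client = boto3.client("ecs")

response = client.run_task(
    cluster=os.environ["FARGATE_CLUSTER"],
    taskDefinition=os.environ["FARGATE_TASK"],
    count=1,
    platformVersion="LATEST",
    # launchType must be omitted when a capacityProviderStrategy is given
    capacityProviderStrategy=[
        {"capacityProvider": "FARGATE_SPOT", "weight": 1},
    ],
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": [os.environ["SUBNET_1"], os.environ["SUBNET_2"]],
            "assignPublicIp": "DISABLED",
        }
    },
)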

Using containers instead of traditional infrastructure accelerates development, allows local execution for testing, and lets us implement additional features that would otherwise not be compatible. One of these features is the Great Expectations Python library, which we use as a quality gate: for example, when we test the source data’s schema and one of the checks fails, we raise an error to our monitoring system.
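
A minimal sketch of such a quality gate follows; the column names and expectations are illustrative, and the result interface may differ between Great Expectations versions.

import great_expectations as ge
import pandas as pd


def quality_gate(df: pd.DataFrame) -> None:
    """Validate the schema of an incoming source file before transformation."""
    ge_df = ge.from_pandas(df)

    checks = [
        ge_df.expect_column_to_exist("country_code"),
        ge_df.expect_column_to_exist("reported_date"),
        ge_df.expect_column_values_to_not_be_null("reported_date"),
    ]

    failed = [check for check in checks if not check.success]
    if failed:
        # The caller reports this error to the monitoring system (see the
        # Datadog helper below) and aborts the ETL run.
        raise ValueError(f"{len(failed)} quality check(s) failed: {failed}")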

We decided to use one container for all our ETL processes. Based on the incoming configuration from the entry function, the data-source-specific ETL process is then executed. We decided against distinct containers for each data source so that we can easily reuse standard components and simplify container management.

Each data source implements an abstract base class and thus follows a clear, predefined structure. The ETL process takes the input data, executes the quality checks with Great Expectations, cleans and transforms the data, and then loads it to our intermediate staging area on S3 in Parquet format. We use Parquet wherever possible to save space, preserve data types, and simplify the loading process into Snowflake.
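
A simplified sketch of what such a base class could look like (the method names and the Parquet write path are assumptions rather than the actual Lighthouse implementation):

from abc import ABC, abstractmethod

import pandas as pd


class DataSource(ABC):
    """Base structure every data-source-specific ETL process follows."""

    @abstractmethod
    def extract(self) -> pd.DataFrame:
        """Read the raw input data for this source."""

    @abstractmethod
    def validate(self, df: pd.DataFrame) -> None:
        """Run the Great Expectations quality gate; raise on failure."""

    @abstractmethod
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Clean and transform the raw data."""

    def load(self, df: pd.DataFrame, bucket: str, key: str) -> None:
        """Write the result as Parquet to the S3 staging area."""
        # Requires s3fs (or fsspec with an S3 backend) to be installed
        df.to_parquet(f"s3://{bucket}/{key}", index=False)

    def run(self, bucket: str, key: str) -> None:
        df = self.extract()
        self.validate(df)
        self.load(self.transform(df), bucket, key)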

We load the Parquet files from our staging area into Snowflake. Some cleaning and transformation operations are performed directly in Snowflake, especially when the data is too large to process inside the container or when it needs the context of other tables in the data warehouse.
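
A rough sketch of that loading step, assuming an external stage named s3_staging_area points at the S3 staging bucket and cursor is a snowflake.connector cursor (table and path names are placeholders):

def load_parquet_into_snowflake(cursor, table: str, prefix: str) -> None:
    """Copy staged Parquet files from the external stage into a table."""
    cursor.execute(
        f"""
        COPY INTO {table}
        FROM @s3_staging_area/{prefix}
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        """
    )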

Monitoring

Another vital part of our architecture is monitoring with Datadog. Datadog allows us to monitor our ETL processes without managing any infrastructure, which aligns with our efforts to create a serverless data platform.

The integration of Datadog into the container was seamless. Instead of integrating the agent, we chose the API to send monitoring events directly to the Datadog console. While this doesn’t allow us to send performance metrics, it satisfies our requirement of being able to send customized monitoring events.

Sending events to Datadog is reasonably straightforward, as shown here:

from typing import List

from datadog import api, initialize

# The Datadog API client has to be initialized once before events can be
# sent, e.g. initialize(api_key=..., app_key=...)


def create_log_event(
    title: str,
    text: str,
    tags: List[str],
    alert_type: str = "info",
) -> None:
    """Helper function to create a Datadog monitoring event

    Parameters
    ----------
    title: str
        Title of the event
    text: str
        Detail text of the event
    tags: List[str]
        Custom tags to filter events in the Datadog console
    alert_type: str
        Alert type of the event, "info" by default
    """
    api.Event.create(
        title=title,
        text=text,
        tags=tags + ["source:etl_container"],
        alert_type=alert_type,
    )
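
For example, when the quality gate fails, the container can report it like this (the tags and wording are illustrative):

create_log_event(
    title="ETL quality gate failed",
    text="Schema check failed for an incoming source file",
    tags=["env:prod", "stage:quality_gate"],
    alert_type="error",
)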

Data Catalog

One common problem for data warehouse users is the inability to get an overview of the data. To address this, we provided our users with a data catalog based on Metabase, a web interface that allows them to view and explore the data on their own. By utilizing Metabase’s built-in visualization and query builder, our project teams can quickly derive insights or get a first impression of the data. We also provide our teams with pre-made dashboards to jumpstart their analysis.

Self Service Visualizations with the Data Catalog

Automation

To be as efficient and standardized as possible, we automated the onboarding of new teams to Snowflake by creating our own CLI. The CLI allows us to create a new reader account using our naming conventions, which complements our efforts to develop standardized processes around our data platform.

Fortunately, Snowflake makes building such a CLI simple, since every necessary operation can be executed via SQL. As part of the process, we share every credential via an API-based password manager, eliminating the need for manual steps. With this type of automation, we can onboard new project teams in a matter of seconds.
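
A minimal sketch of the reader-account step, assuming cursor is a snowflake.connector cursor with the required privileges (the naming convention shown here is illustrative):

def create_reader_account(cursor, project: str, admin_password: str) -> None:
    """Create a reader account for a new project team."""
    account_name = f"{project.upper()}_READER"
    cursor.execute(
        f"""
        CREATE MANAGED ACCOUNT {account_name}
            ADMIN_NAME = {project.lower()}_admin,
            ADMIN_PASSWORD = '{admin_password}',
            TYPE = READER
        """
    )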

Conclusion

The final Lighthouse architecture enables us to create and maintain a resilient, scalable, and decoupled data platform capable of supporting 100+ TB of data, while simultaneously being cost-efficient and straightforward to manage. Without the need to spend time on managing any infrastructure, we can concentrate on creating value by adding more features and data to the platform. This flexible architecture can efficiently serve as a blueprint for upcoming projects and lower our project teams’ and clients’ time-to-value.

1. Source: https://docs.snowflake.com/en/_images/data-sharing-reader.png
