Principled Data Engineering, Part I: Architectural Overview

Hussein Danish
Published in SSENSE-TECH
Apr 25, 2019 · 10 min read

The first of a two-part series exploring the fundamental themes of architecting and governing a data lake. Co-authored by David Lum & Hussein Danish.

Click here for Part II


Data engineering is not the most well-understood profession. From a high-level perspective, the problems it tackles can seem almost trivial: aggregate company data and make it accessible. While this mandate may seem straightforward, the devil is in the details, as they say. It is challenging to build a system that accepts a wide variety of data types and formats, accounts for their change over time, and enables accessibility and governance, all while remaining robust. Furthermore, since companies can be at very different stages of maturity, there is no one-size-fits-all solution. This article will start off by covering some basic concepts and terminology to make sure everyone is on the same page, then move on to more specific topics about data infrastructure, with an emphasis on the concept of a data lake. Finally, we’ll end with how the SSENSE data engineering team has applied these concepts.

The Storage Layer

Let’s assume you’ve just started working as a Data Engineer at an organization. One of the first things you’ll want to do is get familiar with the organization’s data architecture. More specifically, you’ll want to know what kind of data exists and where. For example, in a microservices architecture, your microservices might be generating event data. You might also have a primary database that supports your organization’s core application. Different sources might generate different types of data, which might change at different rates and be stored in different ways.

Chances are that your organization will, at the very least, have a transactional database supporting the key parts of your production architecture. Perhaps there will also be an analytical database to enable data analysis. These systems are commonly referred to as OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing). The differences between them are well documented (see here and here) and are out of this article’s scope. That said, it’s important to emphasize that these systems should be separate from one another. Analytical databases usually support large, resource-intensive, and infrequent read queries with very few, predictable write queries, while transactional databases are designed to support large volumes of frequent but small read and write queries. The two have completely different architectural needs. More importantly, analytical queries can easily bottleneck production database infrastructure due to the size and latency of their expected responses. Needless to say, this should be avoided at all costs.

Traditional data systems usually place constraints on the types of data you can store. What kind of database would be best suited to store the different events generated by various microservices, and how would you best store any logs they generate? When dealing with a variety of formats like this, a data lake can be a good solution. While traditional database systems offer more structured but rigid ways to store data, data lakes are far less opinionated, yet provide enough control and information to let users examine and govern their data and dive deeper to answer more fine-grained questions.

A data lake cannot be conceptualized in the same way as a database. A database contains data in a refined, consumable format, whereas a data lake ingests data that is raw, heterogeneous, and generally harder to use. Much of the challenge involved in building a data lake is managing the data’s transformation from its original raw state to a final, consumable format. Concepts such as storage, governance, and access, which are well understood in traditional databases, need to be planned carefully. Determining processes for ingestion and cataloging is necessary, and often requires applying schemas and transformations to unstructured data in order to render it accessible to end-users. In the same vein, the choices you make about partitioning, whether by time or by type of content, will heavily influence the way in which you process your data.
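As a purely illustrative example of time-based partitioning, consider a Hive-style layout in which each dataset is split into date partitions; the bucket and dataset names below are hypothetical.

```python
# Hypothetical Hive-style layout, partitioned by date:
#   s3://interim-data-bucket/orders/date=2019-04-24/part-000.parquet
#   s3://interim-data-bucket/orders/date=2019-04-25/part-000.parquet
# With this layout, a daily job only needs to read and rewrite the partition
# for the day it is processing, rather than scanning the entire dataset.

def partition_prefix(dataset: str, run_date: str) -> str:
    """Build the S3 prefix for a single date partition (illustrative only)."""
    return f"s3://interim-data-bucket/{dataset}/date={run_date}/"

print(partition_prefix("orders", "2019-04-25"))
# s3://interim-data-bucket/orders/date=2019-04-25/
```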

At SSENSE, we are currently building a data lake from scratch. At this stage, our primary mandate is to aggregate, refine, and make available much of the company’s data in a centralized location that can serve as the single source of truth for the organization. However, blindly collecting data can quickly turn a data lake into a data swamp. A data strategy is required so that solving business problems remains a guiding principle for building the data lake. More information about data strategies can be found here and here. This sentiment is captured by Rob Casper, Chief Data Officer of JPMorgan Chase (as published here):

The best advice I have for senior leaders trying to develop and implement a data culture is to stay very true to the business problem: What is it and how can you solve it? If you simply rely on having huge quantities of data in a data lake, you’re kidding yourself. Volume is not a viable data strategy. The most important objective is to find those business problems and then dedicate your data-management efforts toward them. Solving business problems must be a part of your data strategy.

Our data lake is split across the following three AWS S3 buckets (a cloud-based flat-file storage service), each of which plays a very important role in the lifecycle of our datasets:

  1. A “raw data” bucket which contains the raw, untransformed, and lossless incoming data from all our sources, such as event streams from microservices, transactional database snapshots, log dumps, and data from third-party sources via FTP or API calls. Data here can be in various formats (CSV, JSON, flat text files, etc.) and may not have defined schemas. This bucket plays the key role of guaranteeing no data loss and replayability of our pipelines. In other words, if our pipelines downstream change or fail, our raw data store guarantees that all the original source data remains intact and available for re-processing.
  2. An “interim data” bucket which imports the aforementioned raw data and performs the most minimal transformations required to homogenize its structure, impose schemas, and allow cataloging. We use Parquet as the format for this stage and we recommend using a format that enforces a schema; this eliminates a lot of the dangerous ambiguities of schemaless data, which can lead to data loss and poor governance. The minimal transformations performed at this step also allow for some critical type management, such as homogenizing date formats and number types (decimals, floats, doubles, etc.) and handling null values (a minimal sketch of this step follows the list).
  3. A “business data” bucket which presents transformed datasets to end-users. Data here conforms to semantically meaningful naming conventions and each dataset corresponds to a specific business need. Furthermore, the datasets here have more refined schemas that make sense to our end-users — the consumers of this data.
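To make the raw-to-interim step more concrete, here is a minimal sketch using Pandas. The bucket names, dataset, and columns are hypothetical, and reading s3:// paths this way assumes the s3fs package and a Parquet engine such as pyarrow are installed.

```python
import pandas as pd

# Hypothetical locations in the raw and interim buckets.
RAW_PATH = "s3://raw-data-bucket/third_party/orders/2019-04-25/orders.csv"
INTERIM_PATH = "s3://interim-data-bucket/orders/date=2019-04-25/orders.parquet"

# Read the raw CSV exactly as it arrived, with every column as a string.
df = pd.read_csv(RAW_PATH, dtype=str)

# Minimal transformations: impose types, homogenize dates, surface nulls.
df["order_id"] = df["order_id"].astype("int64")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")   # bad values become NaN
df["created_at"] = pd.to_datetime(df["created_at"], utc=True)

# Write Parquet so the schema travels with the data and can be catalogued.
df.to_parquet(INTERIM_PATH, index=False)
```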

By organizing data this way, problems can be compartmentalized, all data can be governed and traced to its source, and all pipelines can be replayed if necessary. Beyond this, organizations can also consider separating hot data, with short- to medium-term access needs, from cold data, which simply needs to be archived. Cloud providers often offer different services for these, with different pricing strategies for storage and access.
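As an illustration of hot/cold separation on AWS, a lifecycle rule along the following lines could transition aging objects to an archival storage class; the bucket name and retention window are hypothetical, not our actual policy.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical rule: raw objects older than 90 days move to Glacier,
# keeping recent (hot) data on standard storage for frequent access.
s3.put_bucket_lifecycle_configuration(
    Bucket="raw-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-cold-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```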

The Pipeline

We’ll begin this section by considering what characteristics the ideal data architecture might exhibit, and then, accounting for real-world encumbrances, see how this manifests itself in practice. At SSENSE, our data pipeline and data lake architecture are closely coupled. In breaking down an ETL (Extract, Transform, Load) pipeline, the simplest aspect is the T. Although the transformation might be where the majority of your code lies, it is also the part of the system you have the most control over. You do not rely on external APIs being functional, or on connections to external systems (such as a remote SFTP server) being operational, as you often do in the E and L phases.

The transformation can be modeled as a simple function from the raw data to the transformed data. In a perfect world, that would be the end of it: the client has data in one form, needs it in another, and we’re done. The reality is, of course, that things break, requirements change, and pipelines need to be retried. This leads to the first desirable characteristic of a data pipeline: replayability. Ideally, the input, i.e. the data we are extracting, should be immutable. Another desirable characteristic is idempotence. An idempotent operation can be applied several times without changing the result beyond the initial application; formally, f(f(x)) = f(x). In practice, this means that running your pipeline more than once will not duplicate or corrupt data. In cases where pipelines are rerun, automatic retries for example, this is essential.
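To make idempotence concrete, here is a toy sketch of a load step with a hypothetical partition layout: because each run fully overwrites its own date partition, replaying the run yields exactly the same state, whereas an append-style load would duplicate rows on every retry.

```python
from pathlib import Path

import pandas as pd

def load_partition(df: pd.DataFrame, base_path: str, run_date: str) -> None:
    """Idempotent load: each run fully replaces its own date partition."""
    partition_dir = Path(base_path) / f"date={run_date}"
    partition_dir.mkdir(parents=True, exist_ok=True)
    # Overwrite semantics: f(f(x)) = f(x); rerunning changes nothing.
    df.to_parquet(partition_dir / "data.parquet", index=False)

# Example usage, with a local path standing in for a bucket.
orders = pd.DataFrame({"order_id": [1, 2], "amount": [19.99, 35.00]})
load_partition(orders, "/tmp/business-data/orders", "2019-04-25")
load_partition(orders, "/tmp/business-data/orders", "2019-04-25")  # safe to replay
```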

Although we do interact with a stream of event data produced by our microservices, most of the sources we deal with can only be processed in batches. Batch processing is typically divided into two general workflows: extract, transform, load (ETL) and extract, load, transform (ELT).

In an ETL pipeline, data is extracted from a source, transformed into the required shape, and inserted into the target. The advantage of ETL is that data enters your system in the shape you want it to be in, and can easily be modeled for analytics. This works exceedingly well when the data comes from consistent and trusted sources, but can quickly become too brittle when there is a possibility of a schema change, or the data cannot be re-extracted. Imagine a scenario wherein data is extracted from a third-party API, transformed, and then loaded into a data warehouse. Some time later, the process needs to be rerun, but the source data is no longer available because the third party aggregates its data after a certain amount of time (to reduce storage fees). In other words, your system does not offer immutable data.

ELT addresses this by extracting and loading the raw data into storage immediately. From there, you can rerun the transformation portion of the pipeline to your heart’s content. The drawback is that the raw data can be schemaless and unstructured. Managing this raw data becomes the main difficulty in this configuration, and in an era of ever-increasing compliance, proper cataloging is critical.
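To make the contrast concrete, here is a deliberately toy, self-contained sketch in which in-memory dictionaries stand in for real storage; the only point is where the raw data gets persisted in each workflow.

```python
raw_bucket = {}  # stands in for immutable raw storage
warehouse = {}   # stands in for the analytical target

def extract(source):
    # Pretend this calls a third-party API that may stop serving old data.
    return [{"order_id": "1", "amount": "19.99"}]

def transform(rows):
    return [{"order_id": int(r["order_id"]), "amount": float(r["amount"])} for r in rows]

def run_etl(source, key):
    # ETL: only the transformed shape survives. If the source disappears,
    # the pipeline cannot be replayed.
    warehouse[key] = transform(extract(source))

def run_elt(source, key):
    # ELT: an untouched raw copy lands first, so the transformation can be
    # rerun at any time from storage we control.
    raw_bucket[key] = extract(source)
    warehouse[key] = transform(raw_bucket[key])

run_elt("orders_api", "orders/2019-04-25")
```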

At SSENSE, we have developed a system that can be considered a hybrid of the ETL and ELT paradigms. As explained earlier, we load raw data into our first bucket, which gives us immutability. For extracting and managing batch data, we use Apache Airflow, a robust pipelining framework for Python. To ingest our event stream, we use Amazon Kinesis Data Firehose, which dumps date-partitioned raw data. From there, we perform the minimum transformation required to allow AWS Glue to catalog the data. Finally, we transform the data into its consumable form and make it available via AWS Athena, a distributed SQL query engine built on Presto. Depending on the size of the data being transformed, the transformations are handled either by PySpark scripts hosted as Glue jobs or by Pandas. All our transformation scripts are encapsulated in an internal, version-controlled pip package. Throughout the process, we also track data lineage, which, combined with cataloging, allows us to retrace our steps to the original source, providing better data governance than raw ELT alone (more on this in Part II of this series).
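To give a feel for the orchestration layer, here is a stripped-down sketch of how such a three-stage pipeline could be declared in Airflow. The DAG id and task callables are placeholders rather than our actual code, and the import paths follow the Airflow 1.x layout current at the time of writing.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def land_raw(**context):
    """Extract from the source and write untouched data to the raw bucket."""

def normalize_to_interim(**context):
    """Impose schemas and write Parquet to the interim bucket for cataloging."""

def publish_business(**context):
    """Apply business transformations and publish the consumable datasets."""

with DAG(
    dag_id="example_three_tier_pipeline",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    raw = PythonOperator(task_id="land_raw", python_callable=land_raw)
    interim = PythonOperator(task_id="normalize_to_interim", python_callable=normalize_to_interim)
    business = PythonOperator(task_id="publish_business", python_callable=publish_business)

    # Enforce ordering: raw landing, then normalization, then publishing.
    raw >> interim >> business
```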

An architectural map of the SSENSE data lake

Our Airflow pipelines are responsible for ensuring that each of the aforementioned steps is carried out on time and in order, thus ensuring temporal consistency. Airflow is also responsible for handling task failures, retries, and external connection management. This setup can be thought of as a three-tiered system which moves from raw data, to normalized and cataloged data, and finally to consumable business data, gradually refining the schema and homogenizing the data at each stage. Our system offers immutability, temporal consistency, and idempotence, granting us fearless replayability with no risk of data duplication.

Apart from this, we also rely on and maintain certain auxiliary systems: Terraform modules to manage our infrastructure as code, Jenkins pipelines for continuous integration and deployment, GitHub for source control management, Docker for containerization, and Kubernetes with Helm charts for orchestrating containerized deployments.

Although we have highlighted the benefits of a data lake architecture, there are a number of cases for which it is not suitable. If you need mutable data, for example, use a database; the advantages of a data lake are built around having an immutable source of truth. Furthermore, data lake management is less well defined and has less industry-standard tooling built around it, although this may change over time. If you are operating a relational database and need some sort of visibility or reporting on it, there are many robust, open-source solutions you can turn to immediately. This isn’t necessarily the case with a data lake. If your data engineering team has limited bandwidth and scarce resources, a more traditional data warehousing approach might be wiser.

In this article, we have covered two of the most important and tightly coupled elements of data engineering: storage and pipelining. We discussed the apparent simplicity of these systems while also highlighting some of the key challenges a data engineer might face when implementing them, particularly in the context of a data lake. However, these aren’t the only important concepts. A critical aspect of every data engineer’s role is data governance. The data you store is a source of both tremendous value and liability; ensuring that all such data is properly managed, protected, cataloged, tracked, and accounted for is of the utmost importance. We briefly touched on subjects such as cataloging and lineage. Stay tuned for the second installment of the Principled Data Engineering series, in which we will delve deeper into these subjects and share our perspective on the art of data governance.

Editorial reviews by Deanna Chow, Liela Touré & Prateek Sanyal.

Want to work with us? Click here to see all open positions at SSENSE!
