Building a data lake in S3 using an event-driven serverless architecture

This article is about the journey from:

  • A data warehouse to a data lake.
  • Batch to near real-time processing.
  • The ability to query all data, raw and processed, from the same repository.

Let’s start by understanding some terms.

Wikipedia defines data warehouses as: “…central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons.”

And Wikipedia’s entry for “data lake” states: “…A data lake is a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms, usually object blobs or files. The idea of data lake is to have a single store of all data in the enterprise ranging from raw data (which implies exact copy of source system data) to transformed data …”

Until recently, the de facto standard for handling big data was a set of technologies built to cope with huge amounts of data, such as those in the Apache Hadoop ecosystem (deployed and managed with Cloudera or Hortonworks, for example). These are very powerful tools, but they require a lot of DevOps maintenance and tuning.

For most use cases you really can’t go wrong with them performance-wise, but there can be a big difference between the ongoing operational burden of running your own infrastructure and letting AWS/Google/Microsoft do it for you with their managed solutions. Having said that, if your team requires special tuning, portability or an on-premise installation, a non-managed solution may still be the better fit.

For those who can use managed solutions, AWS, Google and Microsoft have come to the rescue, offering many data services as PaaS (Platform as a Service) that provide the required functionality and scale without the complexity of maintaining the infrastructure behind them.

Our needs

At Natural Intelligence we manage and analyze data delivered by our comparison sites. A critical part of our day-to-day is the ability to store that data and query it from our data warehouse.

Our ETLs currently crunch the data in batches every couple of hours, so data availability and monitoring capabilities are limited. We wanted to be able to get insights and alerts in near real-time.

When we analyzed our use cases, we found that we needed the ability to query:

  • Raw unfiltered and unstructured data (near real-time analytics, alerts on anomaly detection, machine learning).
  • Structured and optimized data (dashboards, analytics, daily aggregations).

Our Data Platform Architecture

We use AWS all over and Redshift as our data warehouse with Tableau as the visualization tool, so it was only natural to look into what AWS has to offer us to solve our new challenges.

For the unstructured data we decided to build our data lake in S3 and query it with a distributed SQL engine for big data (for example, Redshift Spectrum, AWS Athena or Presto), using AWS Kinesis Firehose for streaming data and AWS Lambda for serverless processing.

S3 for storage

Amazon S3 is an object storage built to store and retrieve any amount of data from anywhere. It is designed to deliver 99.999999999% durability, and stores data for millions of applications.

AWS Kinesis Firehose and AWS Lambda for data ingestion

Kinesis Firehose pairs Kinesis streams with out-of-the-box Lambda functions, so it can deliver streams of data to pre-defined destinations such as S3, Redshift and Elasticsearch. Kinesis streams allow for real-time data processing, continuously collecting data as it is generated.
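To make this concrete, here is a minimal sketch of sending an event to a Firehose delivery stream with boto3. The stream name `events-to-s3` is a hypothetical placeholder, and the newline-delimited JSON encoding is our assumption about how records should be laid out so that Firehose’s concatenated S3 output stays one event per line:

```python
import json


def encode_record(event: dict) -> bytes:
    """Serialize an event as newline-delimited JSON.

    Firehose concatenates records when writing to S3, so the trailing
    newline keeps each event on its own line in the output files.
    """
    return (json.dumps(event) + "\n").encode("utf-8")


def send_event(event: dict) -> None:
    """Put a single event onto the delivery stream (sketch)."""
    import boto3  # AWS SDK for Python; imported here to keep the rest testable offline

    firehose = boto3.client("firehose")
    firehose.put_record(
        DeliveryStreamName="events-to-s3",  # hypothetical stream name
        Record={"Data": encode_record(event)},
    )
```

For higher throughput, `put_record_batch` can deliver up to 500 records per call instead of one at a time.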

AWS Lambda provides a strong FaaS solution. FaaS stands for “function as a service”, which means that your code is executed in containers that are invoked in response to events, like an HTTP call to a REST API or a file created in an S3 bucket.
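As an illustration of the “file created in an S3 bucket” trigger, the sketch below shows a Lambda handler that extracts the bucket and key from the standard S3 notification event; what you then do with each object is application-specific:

```python
import urllib.parse


def handler(event, context):
    """Sketch of a Lambda entry point for S3 ObjectCreated events.

    The event shape below follows the standard S3 notification format:
    a list of Records, each carrying the bucket name and object key.
    """
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = urllib.parse.unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects
```

Deployed behind an S3 bucket notification, this handler runs once per created file, with no servers to manage.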

A distributed SQL engine for big data

We are currently evaluating three options:

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Amazon Redshift Spectrum enables you to run Amazon Redshift SQL queries against exabytes of data in Amazon S3. With Redshift Spectrum, you can extend the analytic power of Amazon Redshift beyond data stored on local disks in your data warehouse to query vast amounts of unstructured data in your Amazon S3 “data lake”.

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
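Whichever engine we pick, all three can prune partitions when the S3 data is laid out in Hive-style partitioned paths, scanning far less data per query. The prefix scheme below (`raw/<event_type>/dt=YYYY-MM-DD/`) is an illustrative assumption, not our actual layout:

```python
from datetime import datetime


def partitioned_key(event_type: str, ts: datetime, filename: str) -> str:
    """Build a Hive-style partitioned S3 key (dt=YYYY-MM-DD).

    Athena, Redshift Spectrum and Presto can all use path segments
    like dt=2017-08-01 as partition columns, so a query filtered on
    dt only reads the matching prefixes.
    """
    return f"raw/{event_type}/dt={ts:%Y-%m-%d}/{filename}"
```

A query such as `SELECT count(*) FROM events WHERE dt = '2017-08-01'` would then touch only one day’s worth of files.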

Our Solution

We use S3 as our data lake, storing both the raw and unstructured data and querying it with a distributed SQL query engine for big data.

Our applications send events to AWS Kinesis Firehose. Firehose writes the events into files located in S3 buckets. We use AWS Lambda functions written in Python to process the streaming events in the S3 files.
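The processing step can be sketched as follows. The Lambda would fetch the object body with boto3 (`get_object`); here only the parsing is shown, assuming Firehose wrote newline-delimited JSON events, and the decision to raise on a malformed line (rather than log and skip) is an assumption:

```python
import json


def parse_events(body: str):
    """Parse newline-delimited JSON events from an S3 file body.

    Blank lines are skipped; a malformed line raises a ValueError,
    which you may prefer to log and skip in production.
    """
    for line in body.splitlines():
        if line.strip():
            yield json.loads(line)
```

Each parsed event dict can then be filtered, enriched or forwarded for near real-time alerting.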

We will continue using ETLs to process the raw data into structured formats and aggregations that are visualized using Tableau.

We are now in the process of implementing Redshift Spectrum on top of our S3 data lake, which is showing great performance so far (more on this in the next part of this series). We will also benchmark other distributed SQL query engines for big data and keep you posted.

Next steps

  • Roll out the Redshift Spectrum implementation to be queried by relevant users.
  • Implement alerts and monitoring over the streaming data for near real-time processing using anomaly detection.
  • Create a near real-time dashboard.
  • Tie it all together as a robust solution.

I will be posting more articles here, including “how to” guides and “lessons learned”, so stay tuned.