My Data Lake Story: How I Built a Serverless Data Lake on AWS

Seth Saperstein
Aug 21, 2019 · 9 min read


In this post, I’ll explain the design decisions, trade-offs, and complications of building a cloud-native, serverless data lake on AWS rather than on a traditional relational database.

Before we begin, I should explain the type of data we’re handling, because it impacts the design decisions of the architecture. The primary data is ingested from a location-gathering mobile SDK. Every couple of hours, the SDK batch-uploads visitation points as JSON to an AWS endpoint. Each record looks a little something like this:

{
  "arrival": "2019-08-01 20:08:44 +0000",
  "departure": "2019-08-01 20:13:49 +0000",
  "duration": 5,
  "horizontalAccuracy": 10,
  "latitude": 42.27582191586706,
  "longitude": -83.73760169379761,
  "uuid": "7A538D9D-9DB9-4FE3-9B1B-A9557B2293C8"
}

At the beginning of the design process, the solution seemed fairly straightforward, but certain constraints made the architecture somewhat irregular. I wanted to ingest, decorate, store, and query this data with cost and simplicity as the main concerns.

In this blog post, I’ll show you:

  • Why I chose a serverless data lake
  • An architectural diagram
  • End result

Architectural and Design Concerns

Here are some of the things that I’ve considered:

  • Minimal infrastructure maintenance. The cloud, especially serverless services, provides multiple levels of abstraction above the actual hardware. No servers sitting idle. No hypervisor maintenance. No problem scaling for a potential 25+ million DAUs (daily active users).
  • Cost. The largest concern with regard to cost is the long-term storage of potentially petabytes’ worth of data. For reference, the average adult book is about 100,000 words, and the average English word is about 4.5 letters. Assuming each character is 1 byte, we’re looking at storing the equivalent of roughly 10 billion books (see the quick back-of-the-envelope calculation after this list)!
  • Updates to data formats. As mentioned earlier, the ingested data is location-related and comes from a mobile SDK. However, Apple is constantly updating its framework for collecting this data, often adding new fields. As a result, our data lake has to react by altering its schema while still accepting data that lacks these fields.
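
To make the storage estimate concrete, here is the back-of-the-envelope arithmetic behind the book comparison, as a rough sketch. The ~4.5 PB total is an illustrative assumption, not a measured figure:

# Rough sketch of the storage comparison above; the total data size is assumed
words_per_book = 100_000
bytes_per_word = 4.5                 # average English word, 1 byte per character
bytes_per_book = words_per_book * bytes_per_word       # ~450 KB per book

total_bytes = 4.5e15                 # assume ~4.5 petabytes of location data
print(f"{total_bytes / bytes_per_book:,.0f} book equivalents")   # ~10,000,000,000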

These considerations led me to the following AWS services for my serverless data lake:

  • Amazon API Gateway
  • Amazon Kinesis Data Streams
  • Amazon Kinesis Data Firehose
  • Amazon S3
  • AWS Lambda
  • Amazon Athena
  • Amazon CloudWatch Events

At a high level, data flows through API Gateway into Kinesis Data Streams, is delivered to S3 by Kinesis Data Firehose, is decorated and transformed by Lambda, and is queried with Athena.

Tradeoffs

While these services meet the requirements of the data lake, tradeoffs were made to increase simplicity and minimize dependencies between the SDK and our data lake.

  • Decreased interdependency vs. efficiency. To keep the data lake independent of the SDK for simplicity, API Gateway is used to accept HTTP requests without authorization. While this data is stream-like, API Gateway can handle a load of 5,000 requests per second. So, let’s do some math. At max capacity, we’ll assume 25 million DAUs (rounded down to 24 million to make the math easier). Assuming uniformly distributed batch upload times between devices and 4 batches per device per day, we’re looking at (24 million devices × 4 requests) / (24 hours × 60 minutes × 60 seconds) ≈ 1,111 requests per second. This is well below the 5,000 requests per second API Gateway can handle, even with some normal variance thrown in. The alternative involves writing directly to Kinesis Data Firehose; however, writers need an IAM role with an attached policy that permits writes to Firehose. To obtain such access from iOS, we would need to create Amazon Cognito user pools and use the AWS SDK for iOS to create a phony login/password for each user before writing to Kinesis Data Firehose. Again, simplicity over efficiency in this case. At a price of $3.50 per million requests in API Gateway, it was a no-brainer.
  • Consistency vs. simplicity. As you may have noticed, Lambda alone is used for extract, transform, load (ETL) jobs rather than AWS Glue. AWS Glue is a fully managed data catalog and ETL service for simplifying and automating data discovery, conversion, and job scheduling. It uses crawlers to detect data that has not been transformed and can label that data to indicate that it is now transformed and in the proper form to be queried by Athena. However, an essential step in this data lake is an API call to Google Places to decorate the data, obtaining the name of the place where the visit occurred based on the latitude and longitude of each point. Since Lambda is already being used to extract the latitude and longitude of each point from the JSON, it takes only a few extra lines to write this data to S3 in a format suitable for building tables in Athena (a sketch of this transform follows this list). Thus, Lambda is triggered by S3 writes to our raw data bucket and transforms the data in a way that still requires Athena to construct and repair tables (due to partitioning) before querying.
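
To make that transform step concrete, here’s a minimal sketch of the Lambda function. It assumes a hypothetical decorate_with_place_name helper wrapping the Google Places lookup and an assumed bucket layout, and it is not the exact production code (for instance, it treats each raw object as a JSON array and ignores the GZIP compression discussed later):

import json
import urllib.parse

import boto3

s3 = boto3.client("s3")
CURATED_BUCKET = "curated-visits"   # assumed bucket name

def decorate_with_place_name(lat, lon):
    # Hypothetical helper: call the Google Places API and return the
    # name of the nearest place for (lat, lon). Stubbed out here.
    return "PLACE_NAME"

def handler(event, context):
    # Triggered by S3 writes to the raw data bucket
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        decorated = []
        for visit in json.loads(raw):
            visit["placeName"] = decorate_with_place_name(
                visit["latitude"], visit["longitude"])
            decorated.append(json.dumps(visit))

        if not decorated:
            continue

        # Write newline-delimited JSON under a date partition so Athena
        # can repair and query the table
        date = json.loads(decorated[0])["arrival"][:10]   # e.g. "2019-08-01"
        s3.put_object(
            Bucket=CURATED_BUCKET,
            Key=f"dt={date}/{key.rsplit('/', 1)[-1]}",
            Body="\n".join(decorated).encode("utf-8"),
        )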

Building Data Pipelines with API Gateway, Kinesis Data Streams, and Kinesis Data Firehose

There are two main ways to deliver data from API Gateway to S3: the first is to trigger a Lambda function; the second, which is used here, is Kinesis Data Streams plus Kinesis Data Firehose. Kinesis Data Streams sequences and stores incoming events on disk, while Kinesis Data Firehose consumes the stream and serializes it into S3 in batches of N minutes or N megabytes. The API’s integration request issues a POST with the “PutRecord” action against Kinesis, using a mapping template with the following syntax:

{
  "Data": blob,
  "PartitionKey": "string",
  "StreamName": "string"
}
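
Those fields mirror the parameters of the Kinesis PutRecord API. For reference, here’s a rough sketch of the equivalent call in Python with Boto3 (the stream name is an assumption):

import json

import boto3

kinesis = boto3.client("kinesis")

visit = {
    "latitude": 42.27582191586706,
    "longitude": -83.73760169379761,
    "uuid": "7A538D9D-9DB9-4FE3-9B1B-A9557B2293C8",
}

kinesis.put_record(
    StreamName="visit-events",   # assumed stream name
    Data=json.dumps(visit),      # the SDK handles blob encoding on the wire
    PartitionKey=visit["uuid"],  # spreads records across shards
)

In this architecture, though, the devices never make this call themselves; API Gateway performs it on their behalf via the mapping template above.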

It’s worth noting that Kinesis Data Streams is what I like to refer to as “serverless-ish”: Kinesis throughput scales without limits by increasing the number of shards within a data stream. However, each shard can only support up to 1,000 PUT records per second. For my use, 1 shard should be enough, and 2 at full capacity, but it is possible to automatically scale shards based on current throughput using CloudWatch metrics and Lambda (a sketch follows).
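
A rough sketch of that scaling Lambda, assuming a stream named visit-events and a simple utilization threshold (it also glosses over UpdateShardCount’s own scaling limits):

from datetime import datetime, timedelta

import boto3

kinesis = boto3.client("kinesis")
cloudwatch = boto3.client("cloudwatch")

STREAM = "visit-events"          # assumed stream name
RECORDS_PER_SHARD_LIMIT = 1000   # PUT records per second per shard

def handler(event, context):
    # Average IncomingRecords over the last 5 minutes
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kinesis",
        MetricName="IncomingRecords",
        Dimensions=[{"Name": "StreamName", "Value": STREAM}],
        StartTime=datetime.utcnow() - timedelta(minutes=5),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Sum"],
    )
    total = sum(point["Sum"] for point in stats["Datapoints"])
    records_per_second = total / 300

    current = kinesis.describe_stream_summary(StreamName=STREAM)[
        "StreamDescriptionSummary"]["OpenShardCount"]

    # Target enough shards to stay under ~70% of the per-shard limit
    target = max(1, int(records_per_second / (RECORDS_PER_SHARD_LIMIT * 0.7)) + 1)

    if target != current:
        kinesis.update_shard_count(
            StreamName=STREAM,
            TargetShardCount=target,
            ScalingType="UNIFORM_SCALING",
        )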

Using S3 for data storage provides high availability, scalability, and durability out of the box. As noted earlier, though, petabytes’ worth of data can quickly become expensive to store and access. There are two ways in which we make this more affordable.

  • GZIP compression. S3 Standard storage costs $0.023 per GB for the first 50 TB per month, so the smaller the data, the cheaper S3 storage becomes. For this reason, we configure Kinesis Data Firehose to apply GZIP compression before delivering to S3. While GZIP can compress data by up to 95%, standard English text compresses only roughly 65%. However, because compression exploits repeated strings and this JSON data is quite redundant, the result is roughly an 86% reduction in both size and storage cost (a quick way to measure this is sketched after this list).
  • S3 storage classes. As mentioned earlier, S3 Standard storage costs $0.023 per GB. However, based on how frequently our data is accessed, we may want to transition to S3 Standard-Infrequent Access, which costs $0.0125 per GB, or even Glacier, which costs $0.004 per GB. However, data retrieval becomes more costly, and retrieval times can take up to 5 hours with Glacier. At this time I’m not entirely sure how often I’ll need to access the data, as that depends primarily on client demand. For this reason, S3 Intelligent-Tiering will be used to optimize costs by automatically moving data to the most cost-effective access tier, without performance impact or operational overhead.
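
To sanity-check a compression ratio on your own data, here’s a quick sketch using Python’s gzip module. Identical records compress far better than real, varied data, so treat the printed number as illustrative only:

import gzip
import json

record = {
    "arrival": "2019-08-01 20:08:44 +0000",
    "departure": "2019-08-01 20:13:49 +0000",
    "duration": 5,
    "horizontalAccuracy": 10,
    "latitude": 42.27582191586706,
    "longitude": -83.73760169379761,
    "uuid": "7A538D9D-9DB9-4FE3-9B1B-A9557B2293C8",
}

# Simulate one Firehose delivery: a batch of newline-delimited JSON records
batch = "\n".join(json.dumps(record) for _ in range(10_000)).encode("utf-8")
compressed = gzip.compress(batch)

print(f"compression: {1 - len(compressed) / len(batch):.0%}")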

Gathering insights from our data lake using Athena, CloudWatch, and Lambda

To gain insights from our data lake, we need some way of querying our data in S3 without the need to load or further prepare our data. There are two ways to do this.

  • Amazon Redshift Spectrum. Redshift Spectrum queries data in S3 with the benefits of open data formats and the ability to scale out to thousands of Redshift nodes to pull, filter, project, aggregate, group, and sort data. You pay for the resources you consume and get the benefits of standard SQL. Redshift Spectrum is built primarily for business intelligence queries, with performance and cost-effectiveness in mind.
  • Amazon Athena. Athena is similar in that it queries data in S3, you pay only for the resources you use, and it accepts standard SQL. However, rather than relying on leader nodes and clusters, Athena runs ad-hoc queries and is built on Presto, originally developed at Facebook.

While both serve the same function at a similar cost of $5 per terabyte scanned, Redshift Spectrum was built to make life easier for Redshift users; many analytics tools supported Redshift but not Athena at the time. The catch is that creating Redshift clusters and storing data there is a bottleneck, especially since Redshift isn’t readily horizontally scalable and adding new machines may involve some downtime.

For this reason, Redshift Spectrum is more beneficial to those who are existing Redshift users, also giving them the ability to move colder data to external tables while still having the flexibility of joining with Redshift tables. As a user without the preexisting Redshift clusters needed for Redshift Spectrum, Athena appeared to be the simplest solution.

The following methods leveraged Athena to run business intelligence queries:

  • Use cron scheduling in CloudWatch Events to trigger a Lambda function every hour
  • Use Boto3, the AWS SDK for Python, to execute Athena queries and check their status (a sketch of this follows the list)
  • Extract the query result in Lambda and write to an S3 bucket in a format readily compatible with the admin dashboard
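
Here’s a minimal sketch of that Boto3 piece; the database name and results location are placeholders:

import time

import boto3

athena = boto3.client("athena")

def run_query(sql, database="visits_db", output="s3://query-results-bucket/"):
    # Kick off the query; Athena writes the results to the output location
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )["QueryExecutionId"]

    # Poll until the query finishes (fine for short queries; long-running
    # ones are better handled with Step Functions, as discussed below)
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {query_id} finished with state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)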

It’s worth noting that Lambda functions have a maximum timeout of 15 minutes, so long-running queries mean timeouts and paying for Lambda compute time that goes unused. The solution: Step Functions. Step Functions split the process into multiple Lambda functions and work as follows:

  • Execute a long-running query.
  • Wait x seconds and check the query status.
  • If the query has not yet finished, return to step 2.
  • Process the query result or handle the failed query.

While Step Functions aren’t used in my architecture (yet), they are worth strongly considering.

Building a REST API using API Gateway’s Lambda Proxy to access S3 files

Now that the data is sitting in S3, it’s time to deliver it to the admin dashboard. While most people use Amazon QuickSight or Tableau to display insights from data queried in S3, I chose the non-traditional route of building an admin dashboard from scratch in ReactJS, for two reasons:

  • More customizable with regards to UX/UI
  • Allows the flexibility to easily turn the admin dashboard into a client-facing web application for the ad-tech industry, returning advertiser IDs based on store names, visit durations, and other geographic factors

By passing the S3 file corresponding to the Athena query in the URL path, API Gateway’s Lambda proxy integration can return the file with the help of our good friend, Boto3 (a sketch of the handler follows the sample response below). As a result, data like the following gets delivered from our API:

{
  "data": [{
    "label": "DAUs Over Time",
    "data": [3, 1, 1, 3],
    "color": "#e74c3c",
    "labels": ["Thursday", "Friday", "Monday", "Wednesday"]
  }]
}
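
A minimal sketch of that proxy handler, assuming the S3 key arrives as a path parameter; the bucket and parameter names are placeholders:

import boto3

s3 = boto3.client("s3")
RESULTS_BUCKET = "dashboard-results"   # assumed bucket name

def handler(event, context):
    # The Lambda proxy integration passes URL path parameters through as-is;
    # "file" is an assumed parameter name for the query-result object key
    key = event["pathParameters"]["file"]
    obj = s3.get_object(Bucket=RESULTS_BUCKET, Key=key)

    # Proxy integrations expect this response shape
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": obj["Body"].read().decode("utf-8"),
    }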

End result: the admin dashboard

For more details on the development of this dashboard, the code is available on my GitHub.

Summary

As a “newbie” to AWS who recently earned the AWS Solutions Architect certification, I was thrilled at how gentle the learning curve was for creating a fully serverless data lake. In just a matter of weeks, I was able to build a low-maintenance pipeline with the ability to scale to tens of millions of devices.

I hope to continue building out interesting data-related applications and also hope to inspire others to build out a data lake of their own!
