No-Code Data Collect API on AWS

Dima Statz · Published in The Startup · May 17, 2020
Services for building Big Data pipelines on AWS

Introduction

This article is all about moving data into Big Data pipelines running on AWS. Most data pipelines share five steps: collection -> storage -> processing -> analysis -> visualization, and AWS has a very solid foundation for building each of them. For example, for the data collection step you can use the following services:

  1. Real-Time pipeline: Kinesis Data Streams, IoT, Simple Queue Service, and Managed Streaming for Apache Kafka.
  2. Batch pipeline: Snowball, Database Migration Service.

There are some cases that aren’t covered, though. Consider a batch-processing data pipeline with many different data sources that are spread across different geographic regions and do not necessarily run on the AWS cloud. For example, some of them are web apps running in browsers, some are mobile applications, some are external data pipelines, and so on.

AWS has no built-in solution for such a scenario, but it is possible to construct such a Data Collection mechanism by using AWS managed services only.

Objectives

So, our Data Collect API achieves the overall goal through the following objectives:

  1. Correctness: Data Collect API should provide a reliable way to deliver data to AWS S3. File sizes vary from 10 KB to 10 GB. Files can be uploaded from any data source, such as web applications running in browsers, mobile applications, external data pipelines, and more.
  2. Scalability: The scalability target is thousands of concurrent upload sessions per second.
  3. Performance: Data Collect API should support traffic of up to 500 MB/sec to handle spikes. Data-generating sources should experience minimal latency.
  4. Maintainability: Use managed services only, with nearly 0 lines of code.
  5. Security: Data Collect API should support encryption in transit, API keys, IP whitelisting, request rate limiting, a write-only interface, and immutability of writes.
  6. Cost: Data Collect API should be reasonably priced for 100 TB/month (1,000,000 files).

Solution

Obviously, when we think about sending files to AWS, the first thing that comes to mind is to send the data directly to S3. AWS S3 is a solid managed service with industry-leading performance, scalability, and availability. It also supports the HTTP protocol, so any given piece of code can perform an HTTP PUT/POST and send data to S3. The problems with such an approach are security and performance.

  1. Performance: an AWS S3 bucket is local to a specific AWS region. For example, if you create your bucket in Oregon, all data-generating sources located in Singapore will suffer from high latency.
  2. Security: S3 does not support SSL with a custom domain name, so you would need something in the mix that terminates SSL between the data-generating sources and S3. S3 speaks HTTP, but on its own it cannot do API key validation, IP whitelisting, request rate limiting, etc.

In order to solve these problems, 3 additional services can be used: Route53, CloudFront, and AWS WAF.

Let’s see how it works.

AWS CloudFront

The core component of this mechanism is AWS CloudFront. CloudFront is a scalable, easy-to-use web service for content delivery with a pay-as-you-go pricing model. You probably know that a CDN can be used to efficiently distribute content from the ‘center’ (the static or dynamic origin) out to the edges, where the customers are located. AWS CloudFront does this perfectly and can also transfer information from the end user back to the origin. When users perform an HTTP PUT/POST, the data is uploaded to the nearest CloudFront edge and then synced with the origin (AWS S3).

AWS S3 + CloudFront

Creating a CloudFront distribution is easy. Just navigate to the CloudFront console, open the distribution wizard, define the S3 bucket as the origin, and allow PUT requests. If you need encryption in transit, use the ‘HTTPS Only’ or ‘Redirect HTTP to HTTPS’ option.

Create distribution on CloudFront
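
The distribution accesses the bucket through a CloudFront Origin Access Identity (OAI), which is the principal referenced in the bucket policy below. If you prefer the CLI over the console, creating one might look like the following sketch (the CallerReference and Comment values are placeholders):

# Create an Origin Access Identity for the distribution
$ aws cloudfront create-cloud-front-origin-access-identity \
    --cloud-front-origin-access-identity-config \
    CallerReference=data-collect-oai-1,Comment=data-collect-oai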

Also, you have to allow CloudFront in the S3 bucket policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "2",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity ******"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::your_s3_bucket/*"
        }
    ]
}
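
If you manage the bucket from the command line, the same policy can be attached with the AWS CLI. A minimal sketch, assuming the policy document above is saved as bucket-policy.json (the bucket and file names are placeholders):

# Attach the policy document to the bucket
$ aws s3api put-bucket-policy \
    --bucket your_s3_bucket \
    --policy file://bucket-policy.json

# Check what is currently attached
$ aws s3api get-bucket-policy --bucket your_s3_bucket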

Now we can verify that everything is working as expected. We can use the following curl command:

$ curl -v --progress-bar -X PUT -T test500MB.gz http://d3ev1uq5j****.cloudfront.net/orange/test500MB.gz | tee /dev/null
* Trying 13.226.6.139...
* TCP_NODELAY set
* Connected to d3ev1uq5j4****.cloudfront.net (1*.2**.*.*) port 80 (#0)
> PUT /orange/test500MB.gz HTTP/1.1
> Host: d3ev1uq5j4****.cloudfront.net
> User-Agent: curl/7.64.1
> Accept: */*
> Content-Length: 659916732
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
} [65536 bytes data]
################################################ 43.7%

After a while, you will find the test file in your S3 bucket.
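
You can also confirm from the command line that the object landed in the bucket. The bucket name and key below are placeholders matching the curl example above:

# List the uploaded object and check its size
$ aws s3 ls s3://your_s3_bucket/orange/

# Or inspect the object’s metadata directly
$ aws s3api head-object --bucket your_s3_bucket --key orange/test500MB.gz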

AWS WAF

Having security requirements like API key enforcement, IP whitelisting, request rate limiting, a write-only interface, and immutability of writes is pretty common, and here AWS WAF can help. AWS WAF allows you to control how traffic reaches the S3 bucket. In AWS WAF you can define security rules that block common attacks and rules that filter out specific traffic patterns. For example, you can use Managed Rules that are created and maintained by AWS and AWS Marketplace sellers:

AWS WAF Managed Rules

In the AWS Managed Rules set you can find rules such as IP Reputation, Anonymous IP, Known Bad Inputs, and more, so you can get started quickly just by using Managed Rules for AWS WAF. Another approach is to create custom rules yourself. For example, in order to achieve a write-only interface, a rate limit, and API key enforcement, we can block all requests that don’t match any of the following rules.

AWS WAF Custom Rules

The rules are prioritized in the order they appear. Once a request matches any of the given rules, AWS WAF takes the corresponding action. So you can see that, by default, we block everything except HTTP PUT calls that contain the right API key in the request’s header and don’t exceed 10 RPS from the same IP. Here you can see an example of the http_method rule.

{
    "Name": "http_method",
    "Priority": 1,
    "Action": {
        "Block": {}
    },
    "VisibilityConfig": {
        "SampledRequestsEnabled": true,
        "CloudWatchMetricsEnabled": true,
        "MetricName": "http_method"
    },
    "Statement": {
        "NotStatement": {
            "Statement": {
                "ByteMatchStatement": {
                    "FieldToMatch": {
                        "Method": {}
                    },
                    "PositionalConstraint": "EXACTLY",
                    "SearchString": "PUT",
                    "TextTransformations": [
                        {
                            "Type": "NONE",
                            "Priority": 0
                        }
                    ]
                }
            }
        }
    }
}
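
The other two custom rules can be sketched in the same style, as additional entries in the web ACL’s rules list. The api-key header name and key value below are placeholders, and since WAF rate-based rules count requests over a rolling 5-minute window, roughly 10 RPS per IP translates to a limit of about 3,000 requests:

{
    "Name": "api_key",
    "Priority": 2,
    "Action": {
        "Block": {}
    },
    "VisibilityConfig": {
        "SampledRequestsEnabled": true,
        "CloudWatchMetricsEnabled": true,
        "MetricName": "api_key"
    },
    "Statement": {
        "NotStatement": {
            "Statement": {
                "ByteMatchStatement": {
                    "FieldToMatch": {
                        "SingleHeader": {
                            "Name": "api-key"
                        }
                    },
                    "PositionalConstraint": "EXACTLY",
                    "SearchString": "your-api-key-value",
                    "TextTransformations": [
                        {
                            "Type": "NONE",
                            "Priority": 0
                        }
                    ]
                }
            }
        }
    }
}
{
    "Name": "rate_limit",
    "Priority": 3,
    "Action": {
        "Block": {}
    },
    "VisibilityConfig": {
        "SampledRequestsEnabled": true,
        "CloudWatchMetricsEnabled": true,
        "MetricName": "rate_limit"
    },
    "Statement": {
        "RateBasedStatement": {
            "Limit": 3000,
            "AggregateKeyType": "IP"
        }
    }
}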

Once all rules are defined, we can test the security mechanism that we have just set up. For example, you will see that any GET request will be blocked:

$ curl -H "api-key: b3JhbmdlIGNkbg=="  -X GET https://**********.cloudfront.net/orange/test5MB.gz<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The request could not be satisfied</TITLE>
</HEAD><BODY>
<H1>403 ERROR</H1>
<H2>The request could not be satisfied.</H2>
<HR noshade size="1px">
Request blocked.

You will see the same result when sending a PUT request with the wrong api-key. But when sending a PUT request with the right api-key, everything works just fine:

$ curl -H "api-key: ********" -v --progress-bar -X PUT -T test5MB.gz https://********.cloudfront.net/orange/test5MB.gz| tee /dev/null###########################################################   99.4% * We are completely uploaded and fine
###########################################################. 100.0%
< HTTP/2 200...

Now we have pretty solid protection: encryption in transit, a request rate limit, a write-only interface, and API key enforcement. Here we support static API keys, but it is easy to extend this solution to use API key rotation. The last requirement is the immutability of writes, meaning that once an object is created on S3 it cannot be overwritten. It can be achieved simply by enabling versioning on the S3 bucket.
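
Enabling versioning is a one-liner with the AWS CLI (the bucket name is a placeholder):

# Keep every version of every object, so a new upload can never overwrite existing data
$ aws s3api put-bucket-versioning \
    --bucket your_s3_bucket \
    --versioning-configuration Status=Enabled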

AWS Route53

CloudFront assigns a domain name to each newly created distribution, such as d***********.cloudfront.net. You can use this domain name in the URLs for your content, for example https://d*.cloudfront.net/test5MB.gz. But you will probably prefer to use your own domain name in URLs, such as https://your-domain.com/test5MB.gz.

If you want to use your own domain name, use Amazon Route53 to create an alias record that points to your CloudFront distribution. When Route53 receives a DNS query that matches the name and type of an alias record, it responds with the addresses of your CloudFront distribution. If you want to deliver your content over HTTPS using your own domain name and your own SSL certificate, you can use one of the AWS custom SSL certificate support features.
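
For reference, the alias record can also be created with the AWS CLI. This is a sketch: the record name and your hosted zone ID are placeholders, while Z2FDTNDATAQYW2 is the fixed hosted zone ID used for CloudFront alias targets.

alias.json:

{
    "Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "collect.your-domain.com",
                "Type": "A",
                "AliasTarget": {
                    "HostedZoneId": "Z2FDTNDATAQYW2",
                    "DNSName": "d3ev1uq5j****.cloudfront.net",
                    "EvaluateTargetHealth": false
                }
            }
        }
    ]
}

$ aws route53 change-resource-record-sets \
    --hosted-zone-id YOUR_HOSTED_ZONE_ID \
    --change-batch file://alias.json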

Analysis

Now we are all set. We have a solid log ingestion mechanism that can serve Big Data pipelines running on AWS. Let’s check that all the objectives are achieved.

Correctness: Data Collect API should provide a reliable way to deliver data to AWS S3. File sizes vary from 10 KB to 10 GB. Files can be uploaded from any data source like web applications running in browsers, mobile applications, external data pipelines, and more — here we can put a ‘V’, since CloudFront supports file sizes of up to 20 GB and any data-generating source can upload data to the nearest CloudFront edge just by using HTTP PUT.

Scalability: The scalability target is thousands of concurrent upload sessions per second — a big ‘V’ here. CloudFront supports 100,000 RPS by default. This limit can be increased on demand.

Performance: Data Collect API should support the traffic of up to 500 MB/sec to handle spikes. Data generating sources should experience minimal latency — CloudFront supports up to 40Gbps of data rate per distribution and this quota can be increased on demand. Again a big ‘V’ here.

Maintainability: Use managed services only, with nearly 0 lines of code — no code is required, except for the JSON definitions of the rules.

Security: Data Collect API should support encryption in transit, API keys, IP whitelisting, request rate limit, write-only interface, the immutability of writes — all requirements are met by using AWS WAF and S3 features.

Cost: Data Collect API should be reasonably priced for 100 TB/month — now let’s calculate the cost. So we have traffic of ±100 TB/month. File sizes range between 10 KB and 10 GB. In order to simplify the cost estimation, let’s assume that 50% of the traffic is in 10 KB files and the other 50% is in 10 GB files.

The price is calculated using the AWS on-demand pricing model:

  1. Number of requests: 50TB/10GB + 50TB/10KB = 5,000 + 5,000,000,000 = 5,000,005,000
  2. Cost of HTTPS requests: 5,000,005,000 / 10,000 ≈ 500,000 units => 500,000 * $0.01 = $5,000
  3. Regional Data Transfer Out to Origin: 100 TB = 100,000 GB => 100,000 * $0.02/GB = $2,000
  4. WAF Web ACL: $5/month + 3 rules * $1 = $8
  5. WAF requests: $0.60 per 1M requests * 5,000M requests = $3,000
  6. Optional: if your data-generating sources don’t support SNI, you might need the Dedicated IP Custom SSL feature: $600/month
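
As a quick sanity check on the arithmetic above (same prices and the same 50/50 split between 10 KB and 10 GB files as assumed before):

# Back-of-the-envelope monthly cost check
$ awk 'BEGIN {
    requests = 50e12 / 10e9 + 50e12 / 10e3        # 5,000 large + 5,000,000,000 small files
    https    = requests / 10000 * 0.01            # CloudFront HTTPS request charge
    transfer = 100000 * 0.02                      # 100,000 GB regional transfer out to origin
    waf      = 5 + 3 * 1 + requests / 1e6 * 0.60  # web ACL + rules + request inspection
    printf "https: $%.0f  transfer: $%.0f  waf: $%.0f  total: $%.0f\n", https, transfer, waf, https + transfer + waf
}'
https: $5000  transfer: $2000  waf: $3008  total: $10008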

The total cost of our data collection mechanism is roughly $10,000 per month, or about $10,600 with the optional dedicated IP. It is not cheap, but it is reasonable for fully managed services that handle 100 TB of monthly traffic.

Summary

Setting up a data collection mechanism using Route53, CloudFront, and WAF can be done in an hour. Such a mechanism has solid performance, a wide network of CDN edges that reaches all around the world, and provides great security. Another advantage is that you get access to all the regular AWS features like monitoring, alerting, etc. The cost is reasonable in our case.

When isn’t it a good choice? When your data-generating sources produce a lot of small requests. For example, 100 TB of monthly traffic with an average request size of around 1 KB will cost you around $100,000 just for HTTPS requests. In this case, you should definitely look for another approach. For example, you can use this solution, which is self-hosted and extremely cheap but will require effort and time to set up, maintain, and monitor. If you are all about AWS managed services, you can also take a look at Kinesis Data Streams with cross-region replication.
