ETL (Extract, Transform, Load) in Insights

Why AWS Step Functions

While working on the next-generation ETL platform at Compass, we looked for something with these features:

  1. Scalable
  2. Reliable (no single point of failure)
  3. Supports both serverless workers and self-managed workers
  4. Is considered an industry standard with open source community support

After weeks of research, we narrowed the field to two major candidates: Apache Airflow and AWS Step Functions. Both are good options, with different pros and cons.

Airflow

Pros:

  1. Airflow has a large and active open source community.
  2. Its UI is arguably the best among the ETL platforms on the market.
  3. It is open source, so it works with any cloud provider.

Cons:

  1. The scheduler is a single point of failure. This is a deal breaker at Compass, since we are aiming to remove anything that depends on a single-point-of-failure component.
  2. At the time of our evaluation, AWS did not provide a native managed service for Airflow, so Compass engineers would have to maintain the Airflow server ourselves.

AWS Step Functions

Pros:

  1. Fully managed by AWS. No self installation or maintenance needed for the server.
  2. No single point of failure. The scheduler is clustered on the AWS side natively.
  3. Easy and native integration with the most cutting edge serverless implementations including Lambda.
  4. Also supports connecting to a self-managed micro-service using the AWS API.

Cons:

  1. Not open source. A Step Functions workflow configuration cannot be reused in a non-AWS environment.
  2. The community support is not very active.

It was a tough decision. After weighing all of these factors, we decided to go with AWS Step Functions. Because it is a managed service, it frees our engineers from maintaining the platform's basic infrastructure and lets us focus on building the data pipeline.

Source Control & Deployment: Serverless Framework

Lambda is great, and Step Functions is awesome. But editing and deploying code through the AWS console is not the development experience most engineers prefer. Luckily, the Serverless Framework lets us edit, debug, and deploy the code from the familiar command line. Plus, it gives us a natural way to keep the code under source control.

Sample commands for Serverless

With Serverless, you can edit your Lambda code in your favorite IDE. In addition, Serverless lets you test and deploy the code from the command line.

Test/invoke remote lambda function

sls invoke --function {functionName} --stage {stage} --data '{"foo":"bar"}'

Test/invoke local lambda function

sls invoke local --function {functionName} --data '{"foo":"bar"}'

Invoke step function state machine (workflow)

sls invoke stepf --name {functionName} --data '{"foo":"bar"}'

Deploy to staging/production (applies to both step function and lambda)

sls deploy --stage {stage}
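
For context, here is a minimal sketch of a Python handler that the invoke commands above could exercise. The name `handler` and the echo behavior are illustrative assumptions, not Compass code; Lambda only needs a callable that accepts `event` and `context`.

```python
import json


def handler(event, context):
    """Hypothetical Lambda entry point: echoes the input payload back."""
    # `event` is the parsed JSON passed with --data, e.g. {"foo": "bar"}.
    return {
        "statusCode": 200,
        "body": json.dumps({"received": event}),
    }


if __name__ == "__main__":
    # Local smoke test with the same sample payload used in the commands.
    print(handler({"foo": "bar"}, None))
```

Returning a plain dict keeps the function easy to chain in a workflow, since the output of one state becomes the input of the next.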

Why do we still need managed workers?

Lambda scales by itself, but it also has its own limitations.

In general, the biggest concern with Lambda is the 50 MB size limit on the zipped deployment package. If a job involves a lot of code, particularly large libraries such as numpy or pandas, it is easy to hit that limit. In that case, we have no choice but to fall back to a self-managed service.
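
One way to catch this early is to measure the zipped size of a package directory before deploying. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and treating 50 MB as 50 * 1024 * 1024 bytes is an approximation of the documented limit.

```python
import io
import os
import zipfile

# Approximate limit for a zipped package uploaded directly (assumed binary MB).
ZIPPED_LIMIT_BYTES = 50 * 1024 * 1024


def zipped_size(directory):
    """Return the size in bytes of `directory` when zipped in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(directory):
            for name in files:
                path = os.path.join(root, name)
                # Store paths relative to the package root.
                archive.write(path, os.path.relpath(path, directory))
    return buffer.getbuffer().nbytes


def fits_in_lambda(directory, limit=ZIPPED_LIMIT_BYTES):
    """True if the zipped directory is within the given size limit."""
    return zipped_size(directory) <= limit
```

Running `fits_in_lambda("path/to/package")` against the build output gives a quick signal before a deploy fails on size.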

Specifically at Compass, integrating Lambda with Thrift/gRPC is not yet an industry or Compass standard. If a job needs to talk to a Compass micro-service through gRPC, a self-managed service is the only choice.

Both of these limitations are resolvable, and the serverless approach is still encouraged. However, it is nice to have a fallback solution when we need it.

A sample step function workflow in Compass

At Compass, a typical Step Functions workflow for digital marketing data is:

  1. Load source data from various digital marketing vendors.
  2. Normalize the data and load it into a relational database.
  3. Join the third-party data with Compass internal listing data and generate the complete digital marketing data report.
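
The three steps above map naturally onto a chain of Task states. Below is a rough sketch that builds such a state machine definition in Amazon States Language from Python. The state names, function names, account ID, and region are all hypothetical placeholders, not the actual Compass configuration.

```python
import json

# Placeholder ARN prefix: account ID, region, and function names are hypothetical.
ARN_PREFIX = "arn:aws:lambda:us-east-1:123456789012:function:"


def task(function_name, next_state=None):
    """Build one Task state that invokes a Lambda function."""
    state = {"Type": "Task", "Resource": ARN_PREFIX + function_name}
    if next_state is None:
        state["End"] = True  # last state in the chain
    else:
        state["Next"] = next_state
    return state


definition = {
    "Comment": "Digital marketing report workflow (illustrative)",
    "StartAt": "LoadVendorData",
    "States": {
        "LoadVendorData": task("load-vendor-data", "NormalizeData"),
        "NormalizeData": task("normalize-data", "BuildReport"),
        "BuildReport": task("build-report"),
    },
}

if __name__ == "__main__":
    print(json.dumps(definition, indent=2))
```

Each state passes its output to the next, so the load, normalize, and report steps can stay small and be tested independently.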

Workflow chart

Final output — an aggregated line chart in our Insights app

Sample YAML config