ETL (Extract, Transform, Load) in Insights

Shiyang Fei
Mar 5, 2019 · 4 min read

Why AWS Step Function

While working on the next-generation ETL platform at Compass, we looked for something with these features:

After weeks of research, we narrowed the field to two major candidates: Apache Airflow and AWS Step Function. Both are good, with different pros and cons.


Airflow

Pros:

Cons:

AWS Step Function

Pros:

Cons:

It was a tough decision. After weighing all the factors, we decided to go with AWS Step Function. Because it is a managed service, it frees our engineers from maintaining the platform's basic infrastructure and lets us focus on building the data pipeline.

Source Control & Deployment: Serverless Framework

Lambda is great, and Step Function is awesome. But editing and deploying code through the AWS console is not the development experience most engineers prefer. Luckily, the Serverless Framework lets us edit, debug, and deploy the code from the familiar command line. It also gives us a natural way to keep the code under source control.

Sample commands for Serverless

With Serverless, you can edit your Lambda code in your favorite IDE. In addition, Serverless lets you test and deploy the code from the command line.

Test/invoke remote lambda function

sls invoke --function {functionName} --stage {stage} --data '{"foo":"bar"}'

Test/invoke local lambda function

sls invoke local --function {functionName} --data '{"foo":"bar"}'

Invoke step function state machine (workflow)

sls invoke stepf --name {functionName} --data '{"foo":"bar"}'

Deploy to staging/production (applies to both step function and lambda)

sls deploy --stage {stage}
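The --data payload in the invoke commands above reaches the function as its event argument. As a minimal sketch, assuming a Python runtime, a handler that receives that payload could look like the following. The name handler and the fields it returns are hypothetical and are not taken from the Compass codebase.

```python
import json


def handler(event, context):
    """Hypothetical Lambda entry point; `event` is the parsed --data payload."""
    # With --data '{"foo":"bar"}', the JSON arrives here as a dict.
    foo = event.get("foo")
    # Return a JSON-serializable dict. When the function runs as a Step Function
    # task, its return value can be handed on as input to the next state.
    return {"received": foo, "ok": foo is not None}


if __name__ == "__main__":
    # Local smoke test that mirrors the payload used in the commands above.
    print(json.dumps(handler({"foo": "bar"}, None)))
```

Keeping the handler a plain function of (event, context) means it can be exercised the same way by sls invoke local, by a unit test, or by a state machine.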

Why do we still need managed workers?

Lambda scales by itself, but it also has its own limitations.

In general, the biggest concern with Lambda is the 50 MB size limit on the zipped deployment package. If a job involves a lot of code, particularly large libraries such as numpy or pandas, it is easy to hit that limit. In that case, we have no choice but to fall back to the managed service approach.
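One way to catch the size problem early is to check the zipped package before deploying. The sketch below is an illustration only: the helper names are hypothetical, and the limit is assumed to be 50 MiB for a zipped package uploaded directly.

```python
import io
import zipfile

# Assumed Lambda limit for a zipped deployment package uploaded directly (50 MiB).
MAX_ZIPPED_BYTES = 50 * 1024 * 1024


def zipped_size(files):
    """Return the size in bytes of an in-memory zip built from {name: bytes}."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getbuffer().nbytes


def fits_in_lambda(size_bytes, limit=MAX_ZIPPED_BYTES):
    """True if a package of `size_bytes` is within the assumed zipped limit."""
    return size_bytes <= limit


if __name__ == "__main__":
    # A tiny stand-in package; a real one would include the job's dependencies.
    size = zipped_size({"handler.py": b"def handler(event, context): return event\n"})
    print(size, fits_in_lambda(size))
```

For a package already on disk, os.path.getsize on the zip file gives the number to pass to fits_in_lambda. Running such a check in CI would flag a job that needs the managed fallback before a deploy fails.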

Specific to Compass, integrating Lambda with Thrift/gRPC is not yet an industry or Compass standard. If a job needs to talk to a Compass microservice through gRPC, the managed service is the only choice.

Both of these issues are easily resolvable, and the serverless approach is still encouraged. Still, it is nice to have a fallback solution when we need it.

A sample step function workflow in Compass

At Compass, a typical Step Function workflow for digital marketing data looks like this:

Workflow chart


Final output — an aggregated line chart in our Insights app


Sample YAML config

Compass True North

Compass Engineering & Product Blog — An inside glimpse at our technology and tools, brought to you by the engineers of the game-changing real estate platform, Compass. Hiring at https://www.compass.com/careers/