How we do Serverless Big Data ETL/OLAP Queries

There’s been a lot of hype around serverless in recent years. It’s a natural evolution of cloud computing: the ubiquity of on-demand instances and servers has taken customers’ appetite to the next level, that is, on-demand function calls.

Why do we bring up instances? We need to run applications, but we don’t want to maintain actual hardware. The technology and the market are now mature enough that we don’t want to maintain the instances themselves either. In layman’s terms, serverless is to EC2 as Lyft/Uber is to a New Yorker: we want the convenience of a car, but we don’t want to pay for its parking and maintenance.

Serverless is a very common fit for web applications, especially those that handle a large volume of requests where each request is a fairly simple computation task. However, ETL and OLAP queries usually have heavier requirements on hardware and setup (more than a container, say a Hadoop cluster), so they are still a green field for serverless.

One of our product offerings allows customers and our analysts to run all kinds of analytical queries against 20+ years’ worth of US consumer credit data (the data from which everyone’s credit score is calculated). It’s big but not huge data, in the TB range; still, drilling down into it is almost impossible if you just dump it into an RDBMS.

In the rest of this post, I’ll go through how we (at PeerIQ) built a serverless data lake/warehouse and run performant OLAP queries serverlessly on AWS against the data above.

S3 as Serverless Data Lake/Warehouse

At the beginning, I was the one architecting the system and maintaining the cluster, with very minimal devops support. It only took me a couple of days to realize there was no way we could support long-running HDFS on EC2 or even EMR; we had to put the data on some other file system where it could 1. still work with all the distributed query engines (Hive/Spark/Presto) and 2. never need anyone to worry about it. Naturally, there was one very good candidate on the horizon: S3. There’s a more detailed comparison on Databricks. One thing I’d like to highlight: with the other integrations such as Glue and Athena, the savings were huge, not just on the actual AWS services but also in man-hours. This turned out to be an excellent decision.
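To make this concrete, here’s a minimal sketch of what “S3 as the data lake” looks like from Spark on EMR; the bucket name and schema are made up for illustration:

```python
from pyspark.sql import SparkSession

# On EMR, s3:// paths work anywhere an HDFS path would, via EMRFS.
spark = (SparkSession.builder
         .appName("s3-data-lake")
         .enableHiveSupport()
         .getOrCreate())

# An external table whose data lives entirely on S3; dropping the table
# (or tearing down the whole cluster) never touches the underlying files.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS credit_data (
        consumer_id BIGINT,
        score       INT
    )
    PARTITIONED BY (snapshot_month STRING)
    STORED AS PARQUET
    LOCATION 's3://example-data-lake/credit_data/'
""")

# Pick up partitions already written under the S3 prefix.
spark.sql("MSCK REPAIR TABLE credit_data")

# Hive, Presto, and Spark can all query the same table definition.
spark.sql("""
    SELECT snapshot_month, avg(score) AS avg_score
    FROM credit_data
    GROUP BY snapshot_month
""").show()
```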

In the early days, migrating the metastore (where Hive keeps all the table information, and which Spark and Presto reference) from one EMR cluster to another was not the easiest thing to do whenever we bumped the EMR version. Glue (basically an external metastore/data catalog) made it a lot easier (for a developer from the non-big-data world) to run OLAP queries with distributed query engines on any data that’s on S3. There are still caveats with Glue that are either not well documented or are just bugs in the connector; for example, somehow I can drop a partition in spark-sql but not in Hive. Yet the savings from having anyone with minimal understanding of how these systems are set up, and how things work together, be able to run queries on any data on S3 with an EMR cluster are worth way more than working around these caveats.
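Pointing a cluster at Glue is a one-time configuration. Here’s a sketch of launching an EMR cluster with boto3 so that both Hive and Spark use the Glue Data Catalog as their metastore; the cluster name, sizing, and release label are illustrative:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Both Hive and Spark read this property; it swaps the cluster-local
# metastore for the Glue Data Catalog, so table definitions outlive
# any individual cluster.
glue_catalog = {
    "hive.metastore.client.factory.class":
        "com.amazonaws.glue.catalog.metastore."
        "AWSGlueDataCatalogHiveClientFactory"
}

emr.run_job_flow(
    Name="analytics-cluster",          # illustrative
    ReleaseLabel="emr-5.36.0",         # any Glue-aware release works
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],
    Configurations=[
        {"Classification": "hive-site", "Properties": glue_catalog},
        {"Classification": "spark-hive-site", "Properties": glue_catalog},
    ],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```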

Tip: Don’t create your table with its location on S3 if it’s going to be a large table; do your work on HDFS first, then s3distcp the results to S3, as sketched below.
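On EMR, s3-dist-cp runs as a cluster step through command-runner.jar. A minimal sketch with boto3, assuming a running cluster; the cluster id, HDFS path, and bucket are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Write-heavy ETL output stays on HDFS; once it's final, copy it to S3
# in a single parallel pass instead of writing to S3 directly.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",       # placeholder cluster id
    Steps=[{
        "Name": "copy-output-to-s3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "hdfs:///output/credit_data/",
                "--dest", "s3://example-data-lake/credit_data/",
            ],
        },
    }],
)
```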

Serverless ETL and OLAP Queries

We’ve been having great success with AWS Athena. Before Athena was available, one of my daily routines was troubleshooting Hive/Spark/Presto queries for our analysts; there are principles for tuning the memory settings that I have probably shared so many times they’re engraved in my brain. The analysts’ job is to write analytical queries, and my job is to make the barrier for their queries to work as low as possible. With Athena, there’s no cluster deployment involved and most queries work out of the box. Job done! Our analyst is even able to run ETL on his own with Athena (or maybe he’s just too awesome!). I am now more concerned with version controlling our ETL queries than with the analysts being able to iterate faster.
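This is roughly all the “infrastructure” an Athena query needs. A minimal sketch with boto3; the database, query, and results bucket are made up:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the query; there is no cluster to size, deploy, or tune.
execution = athena.start_query_execution(
    QueryString=(
        "SELECT snapshot_month, avg(score) AS avg_score "
        "FROM credit_data GROUP BY snapshot_month"
    ),
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Athena is asynchronous: poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)
    print(rows["ResultSet"]["Rows"][:5])
```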

Of course it’s not done-done. Athena still has concurrency and resource limits (a little unreliable to use for production big data ETL), at times it’s difficult to see why a query is erring out, and sometimes we’d like to implement our own UDFs (that said, it does support lambda expressions). We’ll always need our own EMR cluster from time to time, but once the custom pieces land as processed data on S3, the data engineers can ‘enjoy’ the other projects on their hands.
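On the “why is my query erring out” point: the main diagnostic Athena exposes is the StateChangeReason field on the execution status, and it can be terse. A small sketch, reusing a query id from a call like the one above (the id here is a placeholder):

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

status = athena.get_query_execution(QueryExecutionId="placeholder-query-id")
info = status["QueryExecution"]["Status"]
if info["State"] == "FAILED":
    # Often a one-line engine error; deeper debugging may still mean
    # rerunning the query on your own EMR cluster.
    print(info.get("StateChangeReason", "no reason given"))
```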

Hopefully this gives small orgs some ideas for solving big data ETL and analytics problems cost-effectively! It’s not just a lower AWS bill: the data engineers and devops finally don’t have to be constantly troubleshooting a cluster.