Affordable and Scalable Data Analytics on AWS Using Athena and Data Pipeline

Statehill uses AWS to measure government’s impact and turn legislative data into actionable intelligence, helping you advocate effectively

Karl Roos
A Cloud Guru
4 min read · Feb 18, 2017


Statehill uses AWS Data Pipeline and Athena to efficiently query and ship data from RDS to S3

If you’re using Amazon Web Services, or even just keeping tabs on their service offerings, you can’t have missed the latest addition to their suite of analytics services: Athena. It’s a tool that fits in well with the recent trend of serverless cloud computing: essentially, computing without having to allocate the underlying resources yourself.

AWS Athena, like Google’s BigQuery, is a serverless tool for querying big data sets without having to set up clusters of physical or virtual machines. A somewhat fair comparison in the AWS space would be to say that Athena is to EMR what Lambda is to EC2: there are plenty of limitations, and you can only do a subset of what the non-serverless option offers. But once you architect with those limitations in mind, you can build something that delivers results at a really low price point.

We decided to port some of the calculations we do for legislator ideology in our application to Athena, with the goal of scaling better and cutting costs.

Statehill connects legislative data from around the world to discover, analyze, and measure government’s impact on you. We start by collecting and organizing legislative information into actionable intelligence to facilitate your ability to advocate effectively.

Essentially, what we’re doing is taking historical voting records and determining how Republican- or Democrat-leaning legislators are across a number of categories. We’re currently doing this in a number of US states, including Montana, Texas, and Wisconsin.

Ideology for Alan Redfield (R), Montana legislator in the 65th session

First and foremost, we had to confirm that we could accomplish the task using nothing but SQL, as that’s how you query Athena. Once we had determined that, our first step was shipping data from our RDS instance to S3 in CSV format so we could create a table over it in Athena. And how do you efficiently ship data from RDS to S3 on a regular basis? Obviously using Data Pipeline!

Pipeline for exporting data from RDS to S3
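
Once the CSV exports land in S3, creating a queryable table is a single DDL statement run in Athena. A minimal sketch of what this looks like; the table name, columns, and bucket below are hypothetical stand-ins, not our actual schema:

```sql
-- Hypothetical table over CSV exports in S3; the column names and
-- bucket path are placeholders, not Statehill's real schema.
CREATE EXTERNAL TABLE IF NOT EXISTS votes (
  legislator_id STRING,
  bill_id       STRING,
  category      STRING,
  vote          STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://your-bucket/exports/votes/';
```

From there, any SELECT you run scans the files directly in S3, which is what makes the pay-per-TB-scanned pricing work.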

Data Pipeline is an incredibly powerful, Swiss-army-knife kind of service, and thanks to its scheduling it can be a viable option for running recurring data transformation tasks, or even just plain compute tasks. That’s how we ended up using it to actually query Athena.

Athena may at first glance look like just a GUI. Surely you can still run queries using the AWS SDK? Not so fast: the only way to actually run a query outside of the GUI is to connect over JDBC using the Athena driver. So, to query Athena on a regular basis, we ended up using Data Pipeline again, this time running a ShellCommandActivity that prepares the results of the query in S3, so they can later be shipped back to our RDS instance.

To accomplish this we had to build a custom AMI with Java 1.8 (the default EC2 AMI ships with 1.7) and write a small Java class that connects to Athena, runs a query, and outputs the results to a specified location. Now, I’m not a great Java developer by any means, but I can get around. In the end, we ended up with something like this (loosely based on the sample provided by AWS):
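
(The original gist isn’t reproduced here, so the class below is a minimal sketch of that approach, assuming the 2017-era Athena JDBC driver. The class name, staging bucket, and region are illustrative, not our exact code.)

```java
// Minimal sketch: connect to Athena over JDBC, run the query passed
// in args[0], and write the rows as CSV to the path in args[1].
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;
import java.util.Properties;

public class AthenaQuery {
    public static void main(String[] args) throws Exception {
        String sql = args[0];            // query to run
        String outputLocation = args[1]; // where to write the results

        // Load the (2017-era) Athena JDBC driver; credentials come
        // from the EC2 instance profile. Staging bucket is a placeholder.
        Class.forName("com.amazonaws.athena.jdbc.AthenaDriver");
        Properties props = new Properties();
        props.put("s3_staging_dir", "s3://your-bucket/athena-staging/");
        props.put("aws_credentials_provider_class",
                  "com.amazonaws.auth.InstanceProfileCredentialsProvider");

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:awsathena://athena.us-east-1.amazonaws.com:443", props);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql);
             PrintWriter out = new PrintWriter(outputLocation)) {

            ResultSetMetaData meta = rs.getMetaData();
            while (rs.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    if (i > 1) row.append(',');
                    row.append(rs.getString(i));
                }
                out.println(row);
            }
        }
    }
}
```

The ShellCommandActivity then simply invokes the jar with the query and the output location as its two arguments.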

This worked incredibly well, and since the sql and outputLocation variables actually come from args, we can reuse the same class for future queries and simply duplicate the pipeline.

Final setup, now serving data for our customers in production

Conclusion

Athena is an incredibly powerful tool and I recommend that you take a look at it. It’s a cheap and fast way to run distributed queries, and it’s now powering production data for our application. At $5 per TB of data scanned, the barrier to entry for big data processing has been significantly lowered.

There are certainly a few ways to improve our setup; there almost always are. The main one would probably be swapping the CSVs we’re querying in Athena for Parquet files, to save on costs and improve query performance. For more about that, Mark Litwintschik wrote an excellent post comparing different data formats on Athena that I recommend reading.
