How we built a big data platform on AWS for 100 users for under $2 a month

Amazon Athena is a serverless query service that helped reduce our per user cost from over $74 a month to less than $0.02

Jacob Richter
A Cloud Guru
Apr 11, 2018

A large swathe of the content on serverless deployments is based around CRUD and authentication operations — think lots of Lambda + DynamoDB + Cognito wizardry. Instead, I wanted to share my experience in learning, developing, and using a much wider range of the AWS offerings — the first being Amazon Athena.

This story will focus more on the project than on the development itself — exploring the costs and decisions from a product owner's perspective. I'll save the nitty-gritty for my next blog.

Please note — yes, the true end-to-end total cost of ownership for the entire platform is higher than the title of this story suggests; the analytics layer itself, though, genuinely costs pennies.

Time is money, money is time — we had neither

The customer requirements were fairly straightforward:

  • we want a platform that allows our customers to access and visualize any of their large stores of data at any time of the day
  • we want a machine learning pipeline on this data
  • we want this to run as lean as possible
  • we need it developed in 1–2 months
  • we don’t want to pay for anything our customers aren’t using

Oh — and you will be the only developer.

Breaking down the problem

The product required a large time investment to develop the client-side application and machine learning pipelines, so we couldn't afford to spend too much time on the back-end. We needed a platform that was clean, quick to develop, easy to deploy and update, and extremely robust and secure.

I had developed numerous data management applications with serverless before — so I knew the ins and outs of the Lambda / API Gateway / Cognito stack, connecting it to SPAs hosted on S3 and serving them through CloudFront.

But this challenge was new. I had to analyze financial data spanning billions and billions of rows, sitting in tens of thousands of files — with low latency and short execution times.

Just use Redshift they said …

Redshift is great, Redshift is fast, Redshift is highly scalable, and we even built our MVP on Redshift Spectrum for front-end testing. But you know what else? Redshift is hella expensive — just ask Larry Ellison.

RedShift was apparently named very deliberately as a nod to Oracle’s trademark red branding

For our application we would need at least one of the big fellas: a ds2.8xlarge dense storage instance, costing $10 an hour or $7,440 a month, and supporting up to 100 users — assuming 160GB per user.

But once it fills up, then what? We could create an ETL (Extract, Transform, Load) pipeline and scrubbers to dynamically load and erase data sets based on users' requests — but this would be costly, slow, and difficult to manage. Or we could just boot up another Redshift instance for a further $7,440 per month (nasty).

Additionally, we would still be paying for the data to be held in raw format in a cold-storage S3 bucket — effectively paying for data to just sit there. The true strength of Redshift is the raw power to run a lot of monster queries really, really well. But we simply did not need that kind of beast power.

Given the type of data, queries, and usage patterns we were expecting, this solution just wasn’t a feasible option for our application. Not to mention the costs were well above what the “powers that be” wanted to spend.

Athena to the rescue!

Initially released to the public in late 2016, Amazon Athena is a serverless, SQL-based query service for objects stored in S3. To use it, you simply define a table that points to your data files in S3 and fire SQL queries away! This is pretty painless to set up in a Lambda function, so we never have to run a server again.
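As a rough illustration of that workflow, here is a minimal sketch of a Lambda-side helper that runs an Athena query with boto3 and waits for the result. The database name and results bucket below are placeholders, not the ones from our platform.

```python
import time
import boto3

athena = boto3.client("athena")

# Hypothetical names -- swap in your own database and results bucket.
DATABASE = "customer_data"
OUTPUT_LOCATION = "s3://my-athena-results-bucket/queries/"

def run_query(sql: str) -> list:
    """Start an Athena query, poll until it finishes, and return the rows."""
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    query_id = execution["QueryExecutionId"]

    # Poll for completion (a production Lambda would add a timeout and backoff).
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(0.5)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {query_id} finished in state {state}")

    results = athena.get_query_results(QueryExecutionId=query_id)
    return results["ResultSet"]["Rows"]
```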

The pricing model of Athena is interesting: you are charged only for the amount of data scanned by each query and nothing more. Currently the rate is $5 per TB of data scanned, with a minimum charge of 10 MB per query.

On average the data files provided by customers are between 100MB and 1GB each, in CSV format. So at worst, a full data set scan costs $0.005 per query. With our dashboard requiring approximately 20 queries to be run on each page load, this equates to:

Cost of Page Load = 20 queries * $0.005 per query = $0.10 per Page Load

If a user applies filters, each filter triggers another batch of queries. Say a user clicks on 20 filters each visit — this would require 400 more queries and end up costing $2.00 per visit.

Based on our estimate that users would visit our platform once a day, and with the aim of having approximately 100 users, this would end up costing:

Total Costs = 23 working days * $2.00 per day per user * 100 users = $4,600 per month

At $46 per user per month, this is getting better!

How low can you go? 🎶

Step 1: Compression — Because our data is highly compressible, we were able to achieve 5:1 compression, cutting our query costs by a factor of five. Now we are at $9.20 per month per user.

Step 2: Columnar Data Format — As suggested by AWS, we converted all our data to the Parquet format, massively reducing the amount of data our queries scan. It is also important to note that the type of data we have in our system is highly categorized.

In general, our queries only need to run on data from specific categories (although we do have one full table scan). Given our data is spread roughly evenly across 80 categories, this means we are now looking at 1/80th of the cost. This brings us to approximately $0.12 per month per user.
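The post doesn't spell out the mechanics, but one common way to make Athena scan only a single category's files is to partition the Parquet data by category. A hypothetical sketch, with made-up table and column names, using the run_query helper from earlier:

```python
# Hypothetical DDL: partition the Parquet files by category so that a WHERE
# clause on the partition column only scans that category's data.
# (New partitions still need to be registered, e.g. with MSCK REPAIR TABLE.)
CREATE_TABLE_SQL = """
CREATE EXTERNAL TABLE IF NOT EXISTS financial_records (
    record_id   string,
    amount      double,
    recorded_at timestamp
)
PARTITIONED BY (category string)
STORED AS PARQUET
LOCATION 's3://my-data-lake/parquet/'
"""

# A query that filters on the partition column touches roughly 1/80th of the
# data set, which is where the 1/80th cost figure comes from.
CATEGORY_QUERY_SQL = """
SELECT record_id, amount
FROM financial_records
WHERE category = 'category_17'
"""

run_query(CREATE_TABLE_SQL)  # one-off table definition
rows = run_query(CATEGORY_QUERY_SQL)
```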

We implemented the file conversion by triggering a long-lived Lambda function on file upload; the conversion costs us close to nothing relative to how much it saves over time.
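A minimal sketch of that kind of S3-triggered conversion Lambda, assuming pandas and pyarrow are packaged with the function; the bucket names and layout below are hypothetical:

```python
import os
import urllib.parse

import boto3
import pandas as pd  # pyarrow must be packaged alongside for Parquet support

s3 = boto3.client("s3")

# Hypothetical destination bucket for the converted files.
PARQUET_BUCKET = os.environ.get("PARQUET_BUCKET", "my-data-lake-parquet")

def handler(event, context):
    """Triggered by S3 on CSV upload: convert each file to compressed Parquet."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Small-to-medium files fit in Lambda's memory and /tmp; anything
        # larger would need chunking or a Glue/EMR job instead.
        local_csv = f"/tmp/{os.path.basename(key)}"
        s3.download_file(bucket, key, local_csv)

        df = pd.read_csv(local_csv)
        local_parquet = local_csv.rsplit(".", 1)[0] + ".parquet"
        df.to_parquet(local_parquet, compression="snappy")

        # In practice the output key would land under the partitioned layout
        # (e.g. parquet/category=<category>/<file>.parquet).
        out_key = key.rsplit(".", 1)[0] + ".parquet"
        s3.upload_file(local_parquet, PARQUET_BUCKET, out_key)
```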

Step 3: Caching — The data in the lake was a collection of monthly records, so in any given month users would generally only query the data uploaded for the previous month. We didn't want to rerun the same queries over and over, so we began to systematically store and categorise the results in our S3 data lake as well.

We found that on average a single user's requests would generate 1600 unique queries each month, which we would then store (for next to no cost) and serve back if re-requested. See below for how this also improved the user experience.

Step 4: Running queries together — With the techniques above we quickly hit Athena's 10 MB minimum-charge threshold. So, to make things cheaper still, we began stringing multiple queries together to run as one, then splitting the results apart in Lambda before sending them back to the client.
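The post doesn't detail the batching scheme; one simple approach is to tag each sub-query with a name, UNION ALL them into a single Athena query, and regroup the combined rows in Lambda, roughly like this:

```python
def batch_queries(named_queries: dict) -> str:
    """Combine several small queries into one Athena query by tagging each
    sub-query with its name and UNION ALL-ing them together. All sub-queries
    must return the same column shape (e.g. a label column and a value column)."""
    parts = [
        f"SELECT '{name}' AS query_name, t.* FROM ({sql}) AS t"
        for name, sql in named_queries.items()
    ]
    return "\nUNION ALL\n".join(parts)

def split_results(rows) -> dict:
    """Split the combined Athena result rows back into per-query lists using
    the query_name tag column, before returning them to the client."""
    grouped = {}
    for row in rows[1:]:  # rows[0] is Athena's header row
        values = [col.get("VarCharValue") for col in row["Data"]]
        grouped.setdefault(values[0], []).append(values[1:])
    return grouped
```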

Improving experience — reducing latency on frequent queries

As mentioned above, the visualization dashboard has a lot of charts. These charts need a lot of queries to fill, with each user action requiring a re-fetch of the data.

In the span of a few seconds we could have potentially hundreds of requests, each taking 5 to 10 seconds — making for a very nasty user experience. Users would be viewing the same dashboards quite often, so it doesn't make sense to re-fetch the results of a query that has already been run.

The simple fix we deployed was to cache the results by saving the Athena query ID in a DynamoDB table, with the sort key being a string that represents the type of query and its additional SQL parameters. So now, when someone executes a query from the dashboard, we first check whether it already exists in the table and, if so, simply fetch the stored data.

In fact, we now preload this database with the 1600 unique queries each time new data is added to the platform — allowing us to achieve sub-one-second API response times from first use.
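A minimal sketch of that cache, assuming a hypothetical DynamoDB table keyed by a dataset ID and a query-signature string (the "type of query plus SQL parameters" mentioned above):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical table: partition key "dataset_id", sort key "query_signature".
cache_table = dynamodb.Table("athena-query-cache")

def cached_query_id(dataset_id: str, query_signature: str):
    """Return a previously stored Athena QueryExecutionId, or None on a miss."""
    response = cache_table.get_item(
        Key={"dataset_id": dataset_id, "query_signature": query_signature}
    )
    item = response.get("Item")
    return item["query_execution_id"] if item else None

def store_query_id(dataset_id: str, query_signature: str, query_execution_id: str):
    """Record a finished query so later identical requests reuse its results."""
    cache_table.put_item(
        Item={
            "dataset_id": dataset_id,
            "query_signature": query_signature,
            "query_execution_id": query_execution_id,
        }
    )
```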

Further cost reductions — moving AI to the client side

Our post-processing machine learning models are mostly K-means clustering models to identify outliers. As a further step to reduce our costs, we are now running those models on the client side using TensorFlow.js instead of using an autoscaling EC2 container. While this was only a small change using the same code, it resulted in a proportionally massive cost reduction.
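Our production models are a TensorFlow.js port running in the browser; purely as an illustration of the underlying idea, here is the same outlier-flagging logic sketched in Python with scikit-learn: fit k-means, then flag points unusually far from their cluster centre.

```python
import numpy as np
from sklearn.cluster import KMeans

def flag_outliers(points: np.ndarray, n_clusters: int = 5, percentile: float = 99.0):
    """Fit k-means and flag points unusually far from their cluster centre."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(points)
    centres = model.cluster_centers_[model.labels_]
    distances = np.linalg.norm(points - centres, axis=1)
    threshold = np.percentile(distances, percentile)
    return distances > threshold  # boolean mask of outliers

# Example: 1,000 synthetic 2-D points with a few injected extremes.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(size=(995, 2)), rng.normal(10, 1, size=(5, 2))])
print(flag_outliers(data).sum(), "points flagged as outliers")
```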

The results

We reduced our monthly costs by a factor of 3,700 — for what it previously cost to support a single user, we can now support 3,700 users.

By compressing, converting, caching, and simultaneously executing we managed to reduce our costs from an initial $74 per user per month to less than $0.02 per user per month. Effectively we are now running our entire analytics platform for less than the cost of a cup of coffee a month. Wow!

So what’s next?

That's it for Athena. In upcoming blogs I'll focus on how we have used other serverless services and SPAs to create and deploy a wide range of applications, discussing the challenges we were given and how we broke them down to achieve awesomeness for very little 💰💰💰.

Now go out and explore how you could better leverage serverless to reduce run-costs and heighten scalability!

I’d love to hear your questions and feedback in the comments below.


Jacob is a software engineer, data scientist and technology consultant. Currently the Head of Engineering at Propeller Aero.