AWS Athena or AWS Redshift. What’s right for you?

Published in

affinityanswers-tech

3 min readApr 26, 2021

According to AWS

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

It is based on Prestodb, developed by Netflix and Facebook. Basically, you have your data in AWS S3, in one of the formats like CSV, TSV, Apache Parquet, JSON, etc, Athena can help you analyze the data within it in some minutes of setup and query it using SQL in a few seconds. We have used Athena with several terabytes of data, AWS says it can easily scale to petabytes of data. In the worst case, it took us few hours to get data. Athena charges based on the data it scanned on S3, around $5 per TB. Compressing your data, using columnar file formats like Apache Parquet, and using partitions can save you a lot of query time and money in the long run.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.

A data warehouse is a repository of databases containing all your data, in a central place.

Source: Jerry vlntn at English Wikibooks, Public domain, via Wikimedia Commons

AWS Redshift is based on PostgreSQL, It is a distributed system, basically, you choose an instance type and the number of nodes to be in your cluster, and AWS Redshift will automatically scale it for you. You pay for the number of Redshift nodes you run. Since you have a dedicated cluster for your AWS Redshift task, your tasks can run immediately unlike AWS Athena where it depends on it having resources. AWS Redshift like Athena can query data from S3, using Redshift spectrum and the pricing is also the same, the query speed depends on your cluster. Redshift spectrum allows you to join your data in the cluster and also that of S3, this is awesome.

Why was AWS Athena a better choice for us?

Our data pipeline looks like the following

We use crawlers to collect data from the web, clean it and store it in MySQL in real-time, after a week of this we move the data to AWS Redshift/AWS S3.

Every week we generate a dataset queried from AWS Athena or AWS Redshift, which will be used to generate a model.

Our Redshift Cluster consisted of 8 dc.large instances, costing us around $2000 per month, it would only be on load for less than 72 hours a week, while querying and inserting the data. After implementing the same in AWS Athena for the same process (without S3 storage costs) it would be less than $100 every month and a lot faster since Athena might use a bigger pool.

Data warehouses like Redshift can become a golden hammer because the whole world is talking about it. In many cases, all we need is Athena and a neatly organized data lake — a lot more cost-effective and maintenance-free solution.

AWS Athena or AWS Redshift. What’s right for you?

Why was AWS Athena a better choice for us?

Written by Ashish Kumar