AWS Athena vs Google BigQuery

Vishakha Kulkarni
Published in CloudWithMORE · May 15, 2018

Today we look at data analytics services from Google and AWS. While data analytics is a vast area, let’s start with the first step in the cycle: storing and running queries on your data to get insights. On Google Cloud, we have BigQuery, a data-warehouse-as-a-service offering, to efficiently store and query data. On AWS, there is a choice between Redshift and Athena. For this blog, we will look at Athena because, like BigQuery, it does not need any node or cluster creation.

When you want to run ad hoc queries to better understand your data, performance matters. Both these services claim to query petabytes of data within minutes, so we take them for a spin. First, let’s understand what these services are.

What is BigQuery?

BigQuery is Google’s serverless, highly scalable, low-cost enterprise data warehouse designed to make all your data analysts productive. Because there is no infrastructure to manage, you can focus on analyzing data to find meaningful insights using familiar SQL, and you don’t need a database administrator.

What is Athena?

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

To conclude, both are fully managed services that support SQL-based query execution, are meant to store large amounts of data, and aim to make querying it fast and easy.

Sample Dataset

What common platform do we use to test both these services? We use the much-used 1.1 billion New York taxi rides dataset. A quick Google search for “1.1 billion New York taxi rides” will show you how widely this dataset is used for all sorts of benchmarking on different platforms. We don’t reinvent the wheel, and continue the tradition.

Now, this data is available as a public dataset in Google BigQuery, so instead of hunting elsewhere, this is where we download the data from. The data is approximately 130 GB in size and has 1+ billion rows. Using BigQuery’s data export option, we get the data exported to a GCS bucket in CSV format. This is a simple task and takes just about a minute.
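For reference, here is roughly what that export looks like with the bq command-line tool. The project, dataset, and bucket names below are hypothetical placeholders:

# Hedged sketch: export a BigQuery table to GCS as sharded CSV files.
# For a table this size, a wildcard URI is required so BigQuery can
# split the export across multiple files.
bq extract \
  --destination_format CSV \
  'my-project:taxi.trips' \
  'gs://my-taxi-bucket/csv/trips-*.csv'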

Thanks to gsutil from the Google Cloud SDK, which has an “rsync” command, we do a bucket-to-bucket transfer to get the data from the Google Cloud Storage bucket to an AWS S3 bucket. This operation does take some time, approximately 2 hours, but it is completely hassle-free. Compared to having to download and re-upload data of this size, it is a far better option.
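A minimal sketch of that transfer, assuming AWS credentials have already been added to gsutil’s ~/.boto configuration (bucket names are placeholders):

# Hedged sketch: mirror the exported CSVs from GCS to S3.
# gsutil can address s3:// URLs once AWS keys are set in ~/.boto.
# -m parallelizes the transfer; -r recurses into the prefix.
gsutil -m rsync -r gs://my-taxi-bucket/csv s3://my-taxi-bucket-aws/csv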

Data Ingestion

Now that we have the same data on both clouds, it’s time to “ingest” it into the respective services. To get this data into Athena, there were two options: create a table manually, or use AWS Glue, which would automatically detect the schema and “load” the data.

Obviously, we first chose the automatic route. For this, we created a crawler in AWS Glue whose source was the S3 bucket where all the CSV files were stored and whose destination was the database in Athena. AWS Glue worked like a charm, and the table got created automatically. A simple count(*) confirmed that all 1+ billion rows were present. The only hitch was that most of the columns came out with the string data type. This meant that any query operating on the taxi fares, or on the pickup/dropoff timestamps to find the busiest days and so on, would first have to convert the data to the desired type. This was certainly not ideal and would have a big impact on performance.
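For reference, a crawler along these lines can also be created from the AWS CLI. A minimal sketch, with hypothetical names for the crawler, IAM role, database, and bucket:

# Hedged sketch: a Glue crawler that scans the CSVs on S3 and writes
# the inferred table into the Glue catalog database used by Athena.
aws glue create-crawler \
  --name taxi-csv-crawler \
  --role AWSGlueServiceRole-taxi \
  --database-name taxidb \
  --targets '{"S3Targets": [{"Path": "s3://my-taxi-bucket-aws/csv/"}]}'

aws glue start-crawler --name taxi-csv-crawler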

This meant creating the table manually with the appropriate data types and ingesting the data. This was straightforward and, again, took less than a minute.
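A minimal sketch of such a DDL statement, submitted here through the Athena CLI. The table is trimmed to the columns used in the queries below, and all names and S3 locations are placeholders:

# Hedged sketch: create the Athena table with explicit data types.
# Athena only records schema metadata; the data itself stays on S3.
aws athena start-query-execution \
  --query-string "
    CREATE EXTERNAL TABLE taxidb.taxi_trips (
      pickup_datetime  timestamp,
      dropoff_datetime timestamp,
      trip_distance    double,
      total_amount     double,
      tip_amount       double
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-taxi-bucket-aws/csv/'" \
  --result-configuration "OutputLocation=s3://my-taxi-bucket-aws/athena-results/"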

To be able to compare this process in both services as well, we decided to ingest the data into our own dataset in BigQuery (in spite of it being available as a public dataset). The process of importing data into BigQuery gives us the option of creating a schema manually by specifying the fields, or allowing BigQuery to create the schema automatically. Automatic schema creation did not work. At all. The problem with automatic schema creation, based entirely on my observations, is that the data types get decided from the first few rows (maybe a hundred), and if a field later turns up with a different type of data, say a float in a column that until then held only integers, the import fails. The error reporting is also not very intuitive.

Coming back to data ingestion: for BigQuery too, we ended up manually specifying the fields and data types, and the import then went through smoothly, taking about 80 seconds to complete.
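For comparison, here is roughly what that manual-schema load looks like with the bq tool, again trimmed to a few columns and using placeholder names:

# Hedged sketch: load the CSVs into a native BigQuery table,
# spelling out the schema instead of relying on autodetection.
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  'my-project:taxi.trips_native' \
  'gs://my-taxi-bucket/csv/trips-*.csv' \
  'pickup_datetime:TIMESTAMP,dropoff_datetime:TIMESTAMP,trip_distance:FLOAT,total_amount:FLOAT,tip_amount:FLOAT'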

Conclusion: Schema creation is a one-time process anyway, so go manual.

Note: In the case of Athena, there is no real “ingestion” or “import” happening, simply because the data continues to reside on S3; the “ingestion” process merely creates the schema metadata. In the case of BigQuery, however, data ingestion results in the creation of a “native” table, which means the data is actually copied onto data nodes belonging to BigQuery. Of course, we as users do not need to know about these nodes at all. BigQuery also supports “external tables”, wherein it can access data stored in GCS buckets, but native tables perform better.
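As an aside, an external table of that kind can be created with bq mk. A hedged sketch with placeholder names, reading the CSVs straight from GCS:

# Hedged sketch: a BigQuery external table over the GCS files.
# Format: --external_table_definition=<schema>@<format>=<gcs-uri>
bq mk \
  --external_table_definition='pickup_datetime:TIMESTAMP,dropoff_datetime:TIMESTAMP,trip_distance:FLOAT,total_amount:FLOAT,tip_amount:FLOAT@CSV=gs://my-taxi-bucket/csv/trips-*.csv' \
  'taxi.trips_external'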

Query Performance

We now come to the most important aspect of these services: query performance. We run some queries and pit the two services against each other head-on. Let’s start with a simple count(*). The stats were:

Athena: 11.7 seconds, Data scanned: 133.48 GB

BigQuery : 1.8s elapsed, 0 B processed.

Note: The data processed or scanned matters as well, since both services charge users based on the amount of data processed by each query (cost details are explained separately below). Evidently, on BigQuery the row count is stored as metadata, so no processing is needed.

We now begin to add a little complexity to the queries. Let’s consider this query:

select sum(total_amount) as cost, sum(trip_distance) from [TableName]

This basically sums up the data in two columns over the entire table. At this point, we must mention that both services use a columnar storage format, and hence such queries should need less processing. Let’s see what the results were:

Athena: Run time: 15.85 seconds, Data scanned: 125.1 GB

BigQuery: 4.5s elapsed, 16.5 GB processed

We now run this query with some more complexity:

select avg(total_amount) as cost, avg(tip_amount) as tip,
       month(pickup_datetime) as month, year(pickup_datetime) as year
from [TableName]
group by year(pickup_datetime), month(pickup_datetime)
order by year(pickup_datetime), month(pickup_datetime)

Let’s see what the results were for this query:

Athena: Run time: 22.76 seconds, Data scanned: 125.12 GB

BigQuery: 7.8s elapsed, 24.8 GB processed

The numbers speak for themselves: in terms of query performance, BigQuery is superfast!

Cost

With that, we come to the other most important aspect: cost. Both Athena and BigQuery charge users for the data processed while executing queries, at $5 per TB. In the case of BigQuery, though, the first TB each month is free.
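To put the numbers above in perspective: the aggregation queries scanned roughly 125 GB on Athena, which at $5 per TB works out to about 125/1024 × $5 ≈ $0.61 per run, while BigQuery processed 24.8 GB for the equivalent query, or about $0.12 (before its free tier is applied).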

BigQuery also has a storage cost: data stored in BigQuery’s native tables is charged. For details on pricing, see: https://cloud.google.com/bigquery/pricing

In the case of Athena, the data is stored in S3, so S3 storage costs apply.

Neither service charges for DDL or for loading data, leading us to conclude that cost-wise they are almost equal.

Conclusion

All other things being equal, what matters is query performance, and there is no doubt that BigQuery is amazingly fast.

Originally published at c1oudwars.wordpress.com on May 15, 2018.
