techatwattpad
Published in

techatwattpad

Redshift vs Athena — Clash of The Titans

Redshift vs Athena

Wattpad is a platform that allows writers to publish stories and readers to read millions of such stories for free, all the while supporting their favourite writers. We’re able to accomplish it by showing ads to the users, and the ads team within Wattpad is responsible for building and maintaining the infrastructure that helps us do this. Around the summer of 2020, the team came across a rather exciting challenge. It all started when we received a request to include a new field in our ad event data. We store this data on S3, from where it’s moved once a day to Redshift, a data warehouse we use. It allows you to organize data into tables and perform queries on them efficiently.

Performing queries on a single table is relatively easy. However, there are times when we have to combine data from multiple tables to get relevant information (Say, combining two correlated ad metrics to gain deeper insights into performance). In one such query, whenever we tried to include the country name and combine two tables, Redshift would throw an error and won’t allow us to fetch data. It was due to a Redshift nested data limitation, wherein if it’s joining two or more tables together, it cannot retrieve nested data.

One of our data sources sends us data as a series of objects (nested layers), wherein there is a top layer that contains all the fields. There is a location layer inside the top layer with all the location info, which further has a country-name layer that includes the data we were looking for (See the example below, for instance).

`{I’m the top layer, location: {I contain all location info, countryName: {I’m here!}}}`

This arrangement made it impossible for Redshift to get `country-name` when performing a join, making it difficult to include it in any of our queries. Determined to find a way around it or possible alternatives, we came across Athena, a serverless query service from Amazon that allows you to get data directly from the `S3 buckets`. Athena overcame the nested data limitation and, in many instances, performed queries faster than Redshift. We decided to conduct a proof of concept and copied some of our heavily queried ads tables to Athena, so end users can either use Redshift or Athena to query data. However, this brought up a question as to is Athena a better alternative for managing our data? More specifically, should we migrate our advertisement data to Athena instead of copying it over and storing it in Redshift and are there any merits to this approach? We compared the two query engines for our use case to see how they stack against each other.

Overview

Ease of Partitioning Data: Redshift Wins

Partitioning is quite helpful when dealing with a large amount of data as it helps the database index and organize the data for querying purposes. One way to look at partitioning is organizing your documents in cabinets. You may choose to arrange these documents by the year/month, by the document type or any other custom criteria. Whatever decision you make is highly dependent on the kind of data you’re dealing with.

Redshift doesn’t support direct data partitioning by default as it uses the concept of `distribution keys` (columns in a table) to optimize data for parallel processing. Redshift can determine the keys for us, or we can manually define them; however, improper manual selection can often significantly impact query performance. It is imperative to select the right combination of distribution and sort keys. One can usually make these selections by analyzing the nature of the executed queries, wherein columns that are frequently present in the `WHERE` clause end up as sort keys and the column most commonly used in `JOIN` becomes the distribution key.

On the other hand, Athena provides the capability to define up to 20,000 partitions per table and can automatically load sections from our data if it’s organized in a specific manner (Hive Format). However, we didn’t find this feature to be quite as useful for our use case. We wished to create partitions on certain extra data events that weren’t necessarily present in the file pathname of our data. So, even though Athena was pointing to the correct directory and most of our data complied with the `Hive Format`, it couldn’t load the partitions autonomously since our file pathname was missing the partitioning events we wanted (i.e. pathname `/year/month/day/data` doesn’t contain the location). However, it can actively identify all these missing partitions: all the S3 files that we couldn’t load into the tables due to the `path-not-matching` issue, and we can manually add them if needed.

Supported Data Formats and Types: Athena Wins

When it comes to supporting specific data formats and being able to parse through them, both these query engines can `serialize/de-serialize` popular formats like CSV, JSON and Apache Logs. However, Athena has an edge here as it can parse through many more data formats like Parquet, ORC that Redshift doesn’t have support for yet. Although it’s worth mentioning that, unlike other query engines, Athena doesn’t have its own storage layer and relies on S3 to fill that gap. As a result, its performance is dependent on how the data is stored in S3 buckets and on the complexity of the data format used.

Redshift is very limited in the kind of data it can work with when it comes to data types, which is why we encountered the `nested-data-limitation` error earlier.

User-Defined Functions (UDFs): Redshift Wins

When it comes to defining our own set of user functions and using python data libraries, Redshift has the edge over Athena. It offers a wide variety of support for these functions that we can use for applications in different Analytics areas. Moreover, it can also handle many complex regular expressions that aren’t user friendly enough to be used in queries. As a caveat, all these functions are in Python 2.7, which isn’t supported by many major systems anymore.

Athena, for the moment, doesn’t support user-defined functions.

Security: It’s a Tie

Redshift runs on its cluster, and thus, we need to define a cluster level security group to give users access to the database. It supports IAM level security and can encrypt data during transfer (i.e., When we perform queries). We can also launch our cluster in a virtual environment for added security. Still, since we are setting up the Redshift cluster, we are responsible for ensuring that appropriate security measures are in place to protect it.

Athena, being a serverless service, is protected by AWS’s infrastructure and cloud environment auditors. Access to Athena is managed through IAM security groups wherein if you have access to the S3 data warehouse, you can use Athena to query the data.

Both services provide a robust suite of security measures and are safe when it comes to storing and transferring data.

Scaling and Upgrading: Redshift Wins

Redshift runs on its own server and can be easily scaled as long as we add new nodes to the cluster. The more nodes there are, the more computational power we have when querying and working with data. Since we manage the Redshift server, we can scale it up and down whenever we want, though adding more power does lead to higher costs.

Amazon manages Athena’s scaling and infrastructure, so we don’t have to worry about adding new nodes or resizing if we need more power. Athena is tied to our S3 warehouse and acts as a layer on top of it. However, Amazon has restrictions on organizing data as a user can only define 100 databases, and each database can have at most 100 tables.

Data Warehouse Performance

We compared Athena and Redshift by running many of the common queries that we use for fetching and organizing data in terms of performance. We ran each query three times on two different platforms (Redshift: Redash and Datagrip; Athena: Redash and AWS console). We took their average run time to evaluate the overall performance. For each run, we removed the cache to make sure it wasn’t affecting the run time in any way. Both these engines have a different architectural design when it comes to querying. Redshift is designed to run complex queries and combine data from various sources, whereas Athena focuses more on working with a single source.

Creating/Updating Table Schema: Redshift Wins

Both query engines recorded almost the same time when it comes to creating or updating a table. They use a different format for updating wherein Redshift uses the `SQL syntax` while Athena utilizes a modified `Presto based syntax`. It’s worth mentioning that Redshift was faster than Athena while performing upgrades on these tables, which might be due to their underlying query language (For reference, Redshift is based on PostgreSQL where Athena uses HiveQL).

Simple Query — Get Data from Tables: Athena Wins

Simple Query

We performed a fetch on one of our tables and requested the event data field for this test. Athena recorded almost half the time compared to Redshift in loading the results. Below is the run time of the queries, along with an estimate of the data size:

  • Number of Rows Scanned: 4,378,519,459
  • Redshift ~ 13 minutes
  • Athena ~ 7 minutes

Slightly Complex Query — Get Counts of Data: Athena Wins

Slightly Complex Query

Now we move on to the next stage. This test involves running a slightly more complex version of the previous query. We want to list all the distinct `itemOne` for a particular day and how many rows for each type of `itemOne` are present in the database. The idea behind this is to test aggregation, as many of our ad queries often involve getting aggregated metrics from two or more tables and comparing them against some business metrics. Athena outperforms Redshift in this case as well, where it’s almost 46% faster for the same query. It is worthwhile to mention that Redshift has to do more work in this case as it’s a columnar database system, and since we have constraints on three different columns, it has to do an index scan for each of them.

  • Number of Rows Scanned: 93,839,672
  • Redshift43 seconds
  • Athena23 seconds

Slightly More Complex Query — Get Counts of Data from Two Tables: Athena Wins

Slightly More Complex Query

Going into the semi-finale, we now test how these two query engines perform when combining data from two different tables. Here, we request the number of `tableOne` and `tableTwo` data points for all distinct `items` in a month. The majority of our queries involve fetching data from two different tables and comparing it side by side to get an idea of how they relate (i.e., Get ratio of aggregated metrics, for instance). How these two perform in this case is a good determiner of how they will toil for most of our Ad queries. Athena was around 50% faster than Redshift in all our test runs.

  • Number of Rows Scanned: 283,644,610
  • Redshift11 minute
  • Athena6 minutes

The Mammoth Query — Get Counts of Data from Two Tables with Full Join: Athena Wins

The Mammoth Query

Welcome to the finale. We will be testing a full join of two tables on several different columns to see how these two engines perform against each other. This test involved a fairly complex query, and based on our recent experiences within our Ad databases, we haven’t had to run queries more complex than this so far. We query `tableOne` and `tableTwo` over a short period of time and see how these two query engines perform. Athena again manages to finish the execution faster than Redshift.

  • Number of Rows Scanned: 1,276,400,745
  • Redshift13 minutes
  • Athena8 minutes

Service Cost

Redshift’s cost is determined based on the number of nodes used in the cluster, and it charges an hourly rate for these nodes. Moreover, it has additional costs related to storage and querying whenever we perform any operation on the data. Storage costs are per GB and charged every month. As for querying, a typical Redshift spectrum charges around $5 for every terabyte of data processed in the query.

On the other hand, Athena charges between $5 — $6.75 for every terabyte of data that gets scanned (Depending on the region). We don’t incur any additional storage costs with Athena. How much one might incur depends on the database’s size and how many data-related operations one performs. Often, if we’re dealing with a large amount of data, the querying costs become relatively insignificant compared to the ‘node + storage’ cost in Redshift and S3 storage costs in Athena.

Concluding Thoughts

Athena helped us address the ‘Nested Data Error’ that we encountered, but is it the only solution? Another way to address this problem is within Redshift itself. We do some pre-processing of data from two or more tables and utilise a static reference table to pull the nested data out to the top layer. This approach will allow us to bypass the nested data limitation within Redshift. However, it would have a resource overhead in maintaining a pre-processed table that we might not use that often. Athena provides a prompt solution that requires little setup and can query data to obtain the business insights we want. However, if this error becomes frequent in other areas of our database in the future, it’ll be worthwhile to implement a more robust Redshift based solution to handle it.

Athena does come forth as a very effective query engine, but it’s severely lacking when it comes to integrating it with other data sources and services within Wattpad. Redshift not only allows us to query from several data sources at once, but it also offers a host of integrations with several Analytics tools, which, when taken into consideration, far outweighs the speed advantage and data type support provided by Athena. Moreover, we can do some pre-processing on our more commonly queried Redshift data-sets to help improve the query performance. It all comes down to what you’re looking for; if data storage and querying are all you need, Athena is an excellent service, but Redshift is the tool to go with if you require more widespread integrations.

Even though we didn’t switch over our database to Athena, it helped bring forth a very active discussion within the team. As we move forward in creating an impeccable ad infrastructure, we must keep in mind our short-term requirements and long-term goals. Though Redshift helps satisfy our immediate requirements (With some occasional assistance from Athena ;)), we’re continuing our quest to look for even better tools that can help catapult us towards our goals!

--

--

--

Building at Wattpad: Engineering, Product, Design

Recommended from Medium

What Chance Did I Stand Against Kismet?

Use An IP Geolocation API To Perform Geo-Advertising

Is it Possible to run Docker in Docker and Launch Container of Jenkins?

3 APIs to Get Flight Data of Alaska Airlines

Credential Store using HashiCorp Vault

Getting to Know Watson (Part 2): Earning 2nd Place at Galvanize’s Cognitive Builder Faire

The QBE Window: Training Wheels for Access Developers

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Abhishek Dhakla

Abhishek Dhakla

More from Medium

7 Great Data Engineering Youtube Channels

Big Changes with Big Data: from teaching into Data engineering

Izabella decided to make a move to a Big Data when she was proposed to enter the “201 Big Data Mentoring Program” course by EPAM

Data Engineer Blog 0x0012 (2022–01–22)

Can SQL Be a Library Language?