Rethinking Big Data: Athena Shines as a Streamlined Alternative to Spark for S3 Integration

Pierre-Yves BONNEFOY
6 min read · Oct 19, 2023

In the data ecosystem, Apache Spark has established its dominance, but alternative tools deserve more attention. Amazon Athena, Presto, Trino, and Starburst are emerging as compelling solutions, each bringing unique strengths to the table. This article looks at why and how to use these tools, focusing on Athena for data partition management. Within the AWS cloud in particular, Amazon Athena presents a straightforward and efficient way to analyze vast amounts of data without the hassle of managing infrastructure.

Why Look Beyond Spark?

While Spark is undeniably robust and versatile, it can be overkill for certain use cases, or demand more infrastructure and tuning than the task justifies. Enter tools like Amazon Athena, which offers fast ad hoc queries on data stored in S3 without the expense of managing a full-fledged infrastructure. It also avoids cluster startup delays and infrastructure limits.

The Value of Athena for Data Integration Compared to Spark: A Case Study

Data integration and analysis are essential processes in the realm of data science. Numerous tools are available to assist with these tasks, such as Athena and Spark. In this article, we’ll delve into a practical example to demonstrate how Athena can be an efficient solution for data integration compared to Spark.

Background

Consider an original table containing over 2 million objects with a total size of 32.4 GB. The structure of the table is as follows:

CREATE EXTERNAL TABLE my_database.store_item (
  `site_uid` string,
  `site_id` string,
  `article_id` string,
  `availability_period` array<struct<availability_starts:string,availability_ends:string>>,
  `update_date_tmp` timestamp
)
PARTITIONED BY (partition_date STRING)
LOCATION 's3://…';

The goal was to create a new table from this original table using Athena.

Execution with Athena

Using Athena, the following query was executed:

CREATE TABLE datafactory.store_item_follow_partitioned WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY'
)
AS
SELECT *
FROM datafactory.store_item_follow;

This query successfully created the new table, store_item_follow_partitioned, with the specifications above. It ran in 5 minutes and 2.882 seconds, scanning 27.06 GB of data and writing 30 files totaling 26.7 GB.
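In Athena, a CTAS statement produces a partitioned table only when its WITH clause includes a partitioned_by property. As a minimal sketch, the Python helper below (build_partitioned_ctas, a hypothetical name) assembles such a statement for the tables above. It assumes that partition_date is the partition column and that store_item_follow has the same columns as store_item; neither is confirmed by the query shown.

```python
def build_partitioned_ctas(target, source, columns, partition_column,
                           fmt="PARQUET", compression="SNAPPY"):
    """Build an Athena CTAS statement that writes a partitioned table.

    Hypothetical helper for illustration; it only assembles a SQL string.
    """
    # In an Athena CTAS, partition columns must come last in the SELECT list,
    # so the partition column is appended after the regular columns.
    select_list = ",\n  ".join([*columns, partition_column])
    return (
        f"CREATE TABLE {target} WITH (\n"
        f"  format = '{fmt}',\n"
        f"  write_compression = '{compression}',\n"
        f"  partitioned_by = ARRAY['{partition_column}']\n"
        f") AS\n"
        f"SELECT\n  {select_list}\n"
        f"FROM {source};"
    )


# Assumption: the source table exposes the same columns as store_item above.
sql = build_partitioned_ctas(
    target="datafactory.store_item_follow_partitioned",
    source="datafactory.store_item_follow",
    columns=["site_uid", "site_id", "article_id",
             "availability_period", "update_date_tmp"],
    partition_column="partition_date",
)
print(sql)
```

Listing the columns explicitly, rather than using SELECT *, makes the "partition column last" requirement visible. Keep in mind that a single CTAS query can only write a limited number of partitions (100 at the time of writing, to our knowledge), so a table with many partition_date values may need to be loaded in several passes.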

Why is Athena Effective in this Instance?

  • Simplicity: With a few lines of SQL code, we were able to generate a new partitioned table. This indicates that users don’t need to write complex code for data integration.
  • Performance: In just 5 minutes, Athena processed a substantial amount of data (27.06 GB) to yield the desired output.
  • Cost-Efficiency: Athena's pricing is based on the volume of data scanned. In this example, although the original table was 32.4 GB, only 27.06 GB was scanned. At a rate of $7 per terabyte, the query cost approximately $0.185. This gives a clear and predictable cost structure for processing large datasets.
  • Integration with S3: Athena natively integrates with Amazon S3, allowing direct analysis of data stored in S3 buckets without requiring additional data transfer. In contrast, while Spark can read data directly from S3, when run in traditional Hadoop ecosystems, it often defaults to using HDFS for storage. This setup might necessitate transferring data from S3 to HDFS or entail additional configurations to read directly from S3, potentially introducing performance challenges or complexities.
  • Cluster Overhead with Spark: Handling a dataset like this with Spark carries significant infrastructure overhead. For this volume of files, it is often necessary to spin up a large cluster; in our case that meant provisioning 128 GB of RAM and 10 TB of storage. Fetching the data into this environment and processing it can also extend the run time substantially, sometimes to more than 5 hours. This contrasts starkly with Athena, which requires neither such resource provisioning nor data movement.
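The cost and performance figures above can be checked with a few lines of arithmetic. The sketch below assumes 1 TB = 1024 GB, which is the convention that reproduces the $0.185 quoted in this article; the $7 per terabyte rate is the one used here and may differ in your region. The function name athena_scan_cost is ours, not part of any AWS SDK.

```python
GB_PER_TB = 1024  # assumption: binary convention, which matches the $0.185 above


def athena_scan_cost(scanned_gb, usd_per_tb):
    """Estimate the cost of one Athena query from the data it scanned."""
    return scanned_gb / GB_PER_TB * usd_per_tb


scanned_gb = 27.06            # data scanned by the CTAS query
elapsed_s = 5 * 60 + 2.882    # 5 minutes and 2.882 seconds

cost = athena_scan_cost(scanned_gb, usd_per_tb=7.0)
throughput_mb_s = scanned_gb * 1024 / elapsed_s

print(f"cost: ${cost:.3f}")  # cost: $0.185
print(f"scan throughput: {throughput_mb_s:.1f} MB/s")
```

The throughput number is only a rough indicator, since it divides the scanned volume by the total run time, including the time spent writing the Parquet output.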

In the graph below, we have visually represented the processing times of Athena and Spark when handling data integration tasks on our dataset. The X-axis denotes the tools, while the Y-axis displays the processing time in minutes.

What immediately stands out is the stark difference in execution times. Athena, with its direct integration to Amazon S3, completes the task in just a few minutes, showcasing its efficiency and speed. On the other hand, Spark, while undeniably powerful, requires considerably more time, particularly when factoring in potential data transfers to HDFS or configurations for direct S3 access.

This visual representation underscores Athena’s capability to deliver quick results, emphasizing its potential as a streamlined alternative to traditional big data tools like Spark, especially for datasets housed in Amazon S3.
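For readers who want to redo the comparison from the numbers alone, here is a small sketch that prints a text bar chart from the two durations quoted in this article. The Spark value is a lower bound ("more than 5 hours"), so the resulting ratio is a floor, not a measured speedup.

```python
# Durations in minutes, taken from the figures quoted in this article.
runs_min = {
    "Athena": (5 * 60 + 2.882) / 60,  # 5 min 2.882 s
    "Spark": 5 * 60,                  # lower bound: "more than 5 hours"
}

BAR_WIDTH = 60  # characters used for the longest bar
scale = BAR_WIDTH / max(runs_min.values())

for tool, minutes in runs_min.items():
    # Every tool gets at least one character so short runs stay visible.
    bar = "#" * max(1, round(minutes * scale))
    print(f"{tool:<7}{bar} {minutes:.1f} min")

speedup = runs_min["Spark"] / runs_min["Athena"]
print(f"Athena finished at least {speedup:.0f}x faster on this task")
```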

Comparative Efficiency: Unpacking Why Athena Outpaces Spark for Direct S3 Queries

  • No Cluster Overhead:

Athena: One of Athena’s primary advantages is its serverless architecture. This means there’s no infrastructure to set up or manage, and you can start querying data immediately. There’s no cluster to start, maintain, or scale, and hence no associated startup time.

Spark: Spark, on the other hand, requires the initialization and provisioning of clusters, especially for processing large datasets. The time it takes to start a Spark cluster can be substantial, and it increases the overall processing time, especially if you’re not using a persistent cluster and have to start one up each time you have a task.

  • Data Transfer Overhead:

Athena: Athena integrates natively with Amazon S3, allowing for direct querying without the need for any data transfers.

Spark: In traditional setups, Spark often relies on HDFS as its primary data storage. If your data is in S3, you might need to transfer it to HDFS, introducing a significant delay. Even if you configure Spark to read directly from S3, the connectors and additional configurations involved can still introduce latencies, especially if not optimized correctly.

  • Optimized for the Task:

Athena: Athena is purpose-built for ad-hoc querying of data in S3. It’s optimized for this specific task, which means it can provide rapid results for large datasets without requiring extensive configurations or optimizations.

Spark: While Spark is an incredibly powerful and versatile processing engine capable of handling a variety of big data tasks, it might not always be the most efficient tool for simple data querying tasks, especially when dealing with data in S3. Its general-purpose nature means it isn’t as laser-focused on this singular task as Athena.

  • Technological Suitability:

Athena: Being a managed service, Athena abstracts away many of the complexities, allowing users to focus solely on their queries. It handles most optimizations for you and uses a distributed architecture to deliver quick results.

Spark: Spark is designed for extensive data processing and analytics, handling tasks ranging from batch processing to machine learning. For simpler data querying tasks, especially on S3 datasets, Spark might be overkill, introducing unnecessary complexities and overhead.

In summary, while both Athena and Spark are powerful tools in the world of big data, their inherent designs and optimizations make each more suited for certain tasks. In the case of ad-hoc querying on large datasets in S3, Athena’s direct integration, serverless nature, and specific optimizations give it a distinct speed advantage.

Conclusion

While Spark is a powerful tool for data processing, Athena showcases undeniable advantages in terms of simplicity, performance, and cost, especially for scenarios like the one discussed above. If you’re working with data stored on Amazon S3 and are seeking a straightforward and efficient solution for data integration and analysis, Athena might be an excellent choice for you.

It’s worth noting that this use case is not unique to Athena. Tools like Trino, Presto, and Starburst offer similar capabilities, allowing for efficient querying over large datasets without the need for extensive infrastructure provisioning. Like Athena, they rely on distributed query execution to deliver fast results, which makes them powerful alternatives for data querying and integration, especially when working with very large datasets in distributed storage systems like S3.

Pierre-Yves BONNEFOY

DATA & CLOUD Architect ☁️ • Entrepreneur 💪 • Developer 🔨 • Business Intelligence Expert 📈 • CEO & Co-founder at Olexya • Le Mans Tech Member