Optimizing Glue Scripts for Efficient Data Processing: Part 2

How to Make Your Glue Jobs Run Faster and More Efficiently

Abhishek Saitwal
Globant
10 min read · Nov 7, 2023



In the first part, we covered the essentials: choosing the right way to handle data and tweaking settings for smoother performance, much like finding the best tools for a job. Now, in Part 2, we go a step further and look at selecting data formats, fine-tuning read and write parameters, caching intermediate results, and load testing Glue jobs with mock data. So, let's jump into the next chapter of making AWS Glue super efficient and ready for any task!

Data format selection

When working with AWS Glue, the choice of data format depends on your specific use case and requirements. There are several common data formats, each with its own advantages and considerations. Here are some popular data formats and when to use them in AWS Glue:

1. Parquet

Parquet is often considered one of the best data formats for AWS Glue, especially for analytics and data warehousing use cases. It has several advantages:

  • Columnar storage: Parquet stores data in a columnar format, making it highly efficient for analytics queries that involve reading specific columns.
  • Compression: Parquet provides excellent compression, reducing storage costs.
  • Schema evolution: It supports schema evolution, making it easier to handle changes in data structure over time.
  • Considerations: Parquet is well-suited for read-heavy workloads, but it may not be the best choice if you have frequent write operations or if you need human-readable data files.
  • Use case: Parquet stands out as a preferred choice for data warehousing due to its efficient columnar storage and compression, catering to the needs of storing and querying large datasets. Its adaptability extends to machine learning applications, where frameworks like TensorFlow and PyTorch find seamless integration. Additionally, Parquet’s support for streaming writes makes it a viable option for real-time analytics.
  • Example: An online product retailer might utilize Parquet to swiftly query and generate reports from its extensive product catalog.
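For example, here is a minimal sketch of how a Glue job might write a DynamicFrame to S3 as Parquet; the frame name products_dyf and the bucket path are placeholders rather than anything from a real catalog:

# Write an existing DynamicFrame to S3 in Parquet format with Snappy compression.
# "products_dyf" and the bucket path below are placeholders for your own frame and location.
glueContext.write_dynamic_frame.from_options(
    frame=products_dyf,
    connection_type="s3",
    format="parquet",
    connection_options={"path": "s3://your-bucket/products/"},
    format_options={"compression": "snappy"},
)

Downstream query engines such as Athena or Redshift Spectrum can then read only the columns a report actually needs.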

2. ORC (Optimized Row Columnar)

ORC is a suitable choice for similar use cases as Parquet, particularly in data warehousing and analytics. Its advantages are:

  • Efficient compression: ORC offers high compression rates, reducing storage costs.
  • Columnar storage: Similar to Parquet, ORC’s columnar storage makes it suitable for analytical queries.
  • Lightweight indexes: ORC includes lightweight indexes that can speed up query performance.
  • Considerations: ORC is a format optimized for Hive and may have slightly different performance characteristics compared to Parquet in certain scenarios. Choose between ORC and Parquet based on your specific needs and the tools you use.
  • Use case: Similar to Parquet, ORC is highly regarded in data warehousing and machine learning environments, offering optimal performance and features. Its optimization for Hadoop Distributed File System (HDFS) makes it particularly suitable for storing data on that platform. Notably, ORC’s support for schema evolution enables dynamic changes to datasets, a valuable feature for managing frequently evolving data.
  • Example: A financial services company could leverage ORC to store and analyze transaction data, identifying trends and patterns efficiently.

3. JSON (JavaScript Object Notation)

JSON is a flexible and human-readable format, making it a good choice for semi-structured or nested data. Its advantages are:

  • Human-readable: JSON is easy to read and write, making it suitable for debugging and manual data inspection.
  • Flexible schema: JSON allows for varying schema structures within the same dataset.
  • Considerations: JSON is less space-efficient compared to columnar formats like Parquet or ORC. It may not be the best choice for very large datasets or high-performance analytics queries.
  • Use case: JSON’s popularity lies in its human-readable and easily parseable format, making it a go-to choice for data exchange between various systems and applications. Widely embraced in web development for its universal support across major browsers, JSON is also commonly adopted by NoSQL databases such as MongoDB and Cassandra.
  • Example: A social media company might use JSON to store its user data. This would allow the company to easily share this data with other systems and applications.
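As a rough sketch (the bucket path is a placeholder), reading such JSON from S3 into a DynamicFrame looks like this:

# Read semi-structured JSON from S3 into a DynamicFrame; nested fields are preserved.
# The path below is a placeholder for your own bucket and prefix.
users_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-bucket/users/"], "recurse": True},
    format="json",
)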

4. CSV (Comma-Separated Values)

CSV is a universal format that can be used in a wide range of applications and is suitable for simple tabular data. It has several advantages:

  • Universality: CSV files can be easily imported and exported by many data tools and systems.
  • Human-readable: CSV files are plain text and can be opened with any text editor.
  • Considerations: CSV lacks the advanced compression and columnar storage features of formats like Parquet or ORC. It may not be the most efficient choice for very large datasets or complex hierarchical data.
  • Use case: CSV’s simplicity and straightforward format make it a popular choice for data import/export tasks, facilitating the smooth movement of data between different systems and applications. Its compatibility with popular data analysis tools like Excel and R, along with its use in data visualization tools such as Tableau and QlikView, underscores its versatility.
  • Example: A Data Scientist might use CSV to export data from a database for analysis. This would allow the scientist to use a variety of data analysis tools to explore the data.
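As a small illustrative sketch (the path and options are assumptions to adapt), reading such an export into a DynamicFrame with a header row might look like:

# Read a CSV export from S3, treating the first row as column names.
# The path is a placeholder; adjust the separator if your files use tabs or semicolons.
export_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-bucket/exports/"]},
    format="csv",
    format_options={"withHeader": True, "separator": ","},
)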

5. Avro

Avro is suitable for use cases where data schema evolution is critical, and you need a compact binary format. Its advantages include:

  • Schema evolution: Avro supports schema evolution, making it easy to handle changes in data structure over time.
  • Compact binary format: Avro files are compact and suitable for storing large volumes of data.
  • Considerations: Avro is not as widely supported by all data processing tools compared to formats like Parquet or ORC. Consider its compatibility with your entire data ecosystem.
  • Use case: Avro’s efficiency and scalability make it a strong contender for data serialization, particularly suited for sending and receiving data over networks. Its applicability extends to data storage, especially for dynamic datasets, thanks to its support for schema evolution. Avro’s role in polyglot persistence, the practice of storing different data types in various formats, adds to its versatility.
  • Example: A software company might use Avro to serialize its data before sending it over a network to another system. This would ensure that the data is transmitted efficiently and that it can be easily deserialized on the receiving end.
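A minimal sketch of writing a DynamicFrame to S3 as Avro (the frame name events_dyf and the path are placeholders) could look like this:

# Write a DynamicFrame to S3 in the compact Avro binary format.
# "events_dyf" and the path are placeholders for your own frame and bucket.
glueContext.write_dynamic_frame.from_options(
    frame=events_dyf,
    connection_type="s3",
    format="avro",
    connection_options={"path": "s3://your-bucket/events-avro/"},
)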

The best data format for AWS Glue depends on your specific use case, including data volume, query patterns, and compatibility with other tools in your data pipeline. Parquet and ORC are often favored for analytics and data warehousing due to their efficiency, while JSON and CSV are versatile choices for semi-structured and tabular data, respectively. Avro is a good option when schema evolution is a primary concern. Ultimately, you should select the format that best aligns with your project’s requirements and constraints.

Fine-Tuning Data Input and Output

Fine-tuning the parameters for reading and writing data is essential to achieve peak performance in your AWS Glue jobs. These parameters can be adjusted to align with your specific use case and data processing needs. Below, we’ll explore these crucial parameters along with their default values, providing you with a deeper understanding of how they impact your Glue job performance.

Read Parameters

# Reading the data from "file_paths" to create a Glue DynamicFrame
dynamic_df_input = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format_options={
        "groupSize": "1048576",
        "groupFiles": "1",
        "useS3Implementation": "false",
        "attachFileName": "false",
    },
    format="json",
    connection_options={"paths": file_paths, "recurse": True},
    transformation_ctx="dynamic_df_input",
)

In the provided code snippet, we’re reading data from the specified file_paths and implementing optimizations at the reading level, enhancing the reading speed compared to the usual process. Let's break down each parameter below for a more detailed understanding.

  • Group Size (Default: 1048576 bytes): The Group Size parameter allows you to adjust the size of data groups retrieved during read operations. Tuning this parameter can optimize data retrieval by specifying the desired group size.
  • Group Files (Default: 1): Group Files determines the grouping of files for efficient processing. You can configure it to control how data files are grouped together, impacting the efficiency of your Glue job.
  • useS3Implementation (Default: false): This parameter, when set to true, can significantly improve S3 data access performance. By leveraging the S3 implementation, you can enhance the efficiency of reading data from Amazon S3.
  • Attach File Name (Default: false): As a bonus tip, consider attaching the original file name to each record. While this parameter is not enabled by default, doing so can provide valuable context and insights into your processed data. Note that it only takes effect when the Group Files option is not set.

Write Parameters

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    format="parquet",
    connection_options={
        "path": "s3://s3path",
    },
    format_options={
        "useGlueParquetWriter": True,
        "BlockSize": "134217728",
        "PageSize": "1048576",
    },
)

In this piece of code, we’re using AWS Glue to save data to an S3 location in a super-efficient way. We’re specifically picking the Parquet format, which is great for making things run faster. Now, let’s look at the specific things we set up to speed up the process under format_options.

  • BlockSize (Default: 134217728 bytes): BlockSize allows you to adjust the block size for writing data efficiently. Modifying this parameter can help optimize how data is written to the destination, improving overall write performance.
  • PageSize (Default: 1048576 bytes): The PageSize parameter enables you to optimize the page size for writing data. Adjusting this value can enhance the efficiency of data writing operations in your Glue job.
  • useGlueParquetWriter: Setting useGlueParquetWriter to True switches the job to Glue's optimized Parquet writer, which is not used unless this option is enabled. Leveraging it can noticeably improve the efficiency of data writing tasks.

By understanding and fine-tuning these parameters, you can tailor your AWS Glue job to the unique demands of your data processing tasks, ultimately achieving optimal performance and efficiency.

Caching and Persistence

In the quest for optimizing your AWS Glue scripts, caching and persistence of DataFrames in Spark can play a pivotal role in boosting performance. These techniques allow you to efficiently store intermediate results, reducing the need for costly recomputation. Here, we delve into the benefits of caching and how to implement them in your Glue scripts.

Caching, in the context of Spark, involves storing a DataFrame in memory, which allows for much faster access compared to recomputing the same DataFrame from its source data. This is especially useful when you have operations referencing the same DataFrame multiple times in your script. It can lead to significant performance improvements for two main reasons:

  • Reduction in Recomputation: Without caching, Spark may recompute the same DataFrame multiple times in response to various transformations or actions. Caching eliminates this redundant computation.
  • Faster Access: Data stored in memory can be quickly accessed, resulting in reduced execution times for subsequent operations.

How to Cache DataFrames in AWS Glue

Caching a DataFrame in AWS Glue is straightforward. Because caching is a Spark DataFrame feature, first convert your DynamicFrame to a DataFrame with toDF(), then call the cache() or persist() method on it to indicate that it should be kept in memory.

Here’s an example of caching a DataFrame in an AWS Glue script:

# AWS Glue script snippet
# Create a DynamicFrame from the catalog and convert it to a Spark DataFrame
df = glueContext.create_dynamic_frame.from_catalog(
    database="your-database",
    table_name="your-table",
    transformation_ctx="df"
).toDF()
# Cache the DataFrame in memory
df.cache()  # or df.persist()

In the above example, we create a DynamicFrame from a catalog table, convert it to a Spark DataFrame with toDF(), and cache it in memory using the cache() method. Alternatively, you can pass a storage level to the persist() method to control where and how the data is stored.

Choosing the Right Storage Level

When using persist(), you have the flexibility to choose the storage level that best suits your use case. The storage level determines where and how the DataFrame is cached. Common storage levels include MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and more.

Here’s an example of caching with a custom storage level:

# AWS Glue script snippet
from pyspark import StorageLevel
# Cache the DataFrame with a custom storage level: keep it in memory, spill to disk if needed
df.persist(StorageLevel.MEMORY_AND_DISK)

When to Unpersist

While caching can significantly improve performance, it’s essential to use it judiciously. Cached DataFrames consume memory, and if not managed properly, they can lead to out-of-memory errors. Therefore, it’s recommended to unpersist DataFrames when they are no longer needed, using the unpersist() method.

# AWS Glue script snippet
# Unpersist the DataFrame when done
df.unpersist()
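
Putting the pieces together, here is a minimal sketch of the full workflow: read, convert, cache, reuse, and finally release. The database, table, and status column names are placeholders.

# Read from the catalog, convert to a Spark DataFrame, cache it, reuse it, then release it.
orders_df = glueContext.create_dynamic_frame.from_catalog(
    database="your-database",
    table_name="your-table",
    transformation_ctx="orders_dyf"
).toDF()

orders_df.cache()

# Both actions below reuse the cached data instead of rebuilding it from the source.
total_orders = orders_df.count()
orders_df.groupBy("status").count().show()

orders_df.unpersist()

Because cache() is lazy, the data is materialized on the first action (the count()) and then reused by the aggregation that follows.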

Caching and persistence of DataFrames in AWS Glue scripts can be a game-changer in optimizing performance. By intelligently caching intermediate results, you can reduce redundant computations and speed up data processing. However, it’s crucial to strike a balance and manage your cached DataFrames carefully to avoid excessive memory usage. Incorporating caching into your optimization toolkit can make your Glue jobs even more efficient, particularly in scenarios where DataFrames are reused multiple times during the script execution.

Load and Stress Testing

Getting the most out of AWS Glue means understanding how it performs in different situations, and testing its limits with mock data is a key step in that journey. Running your jobs against generated datasets gives you a clear picture of how well they handle different volumes and shapes of data. It's like a practice run before the big show! This approach helps you find and fix bottlenecks, adjust settings, and confirm that your jobs can handle large amounts of data smoothly. As you explore AWS Glue, these simple dry runs with mock data become go-to tools for discovering what Glue can do and keeping your data processing fast and reliable.
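As one simple way to produce that mock data, the sketch below (row count, columns, and output path are all assumptions you would adapt) generates a synthetic dataset inside the job and pushes it through the same kind of write the production job performs:

from pyspark.sql import functions as F

spark = glueContext.spark_session  # Spark session provided by the Glue context

# Build a synthetic dataset of arbitrary size to stress-test transformations and writes.
mock_df = (
    spark.range(0, 100_000_000)  # scale the row count up or down per test run
    .withColumn("customer_id", (F.col("id") % 1_000_000).cast("string"))
    .withColumn("amount", F.rand() * 500)
    .withColumn("event_ts", F.current_timestamp())
)

# Write to a test bucket using the same format and options as the production job.
mock_df.write.mode("overwrite").parquet("s3://your-test-bucket/mock-load/")

Varying the row count across runs shows how job duration and DPU usage scale, which makes it easier to pick worker counts and tune the parameters covered earlier.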

Summary

In wrapping up, making AWS Glue work really well comes down to picking the right data formats and fine-tuning the right settings, much like choosing the best tools for a job. We looked at how to choose between formats such as Parquet, ORC, JSON, CSV, and Avro, how to tune read and write parameters, and how to use caching to avoid redundant computation. Load testing with mock data acts as a practice run, helping us spot and fix problems before the real work begins. By combining these smart choices, tweaks, and tests, we're all set to make AWS Glue handle our data tasks efficiently!
