Sync Autotuner Reduced Our EMR Costs by 25%

Deniz Parmaksız
Insider Engineering
4 min read · Apr 4, 2023

Amazon EMR is our go-to solution for running Apache Spark for data processing, interactive analytics, and machine learning on AWS. As a fast-growing tech company, we rely on Amazon EMR to run over 10,000 Spark applications every day to fuel our products with up-to-date data and predictions. However, our Amazon EMR costs have grown significantly over time, so we are constantly looking for opportunities to optimize them.


Previous Cost Optimization Efforts

We have undertaken several measures to optimize our Amazon EMR costs, including benchmarking Amazon EMR against Databricks to determine whether a hybrid solution would be more cost-effective. After careful evaluation, we concluded that Amazon EMR was the better fit for our needs. We also found that running newer Spark versions can yield significant cost savings, so we keep our EMR and Spark versions as up-to-date as possible.

Furthermore, we recently migrated from Apache Hive to Apache Iceberg to optimize our data lake performance and costs. This decision was based on Iceberg’s superior features such as ACID transactions and better performance with object storage systems like Amazon S3, allowing us to achieve significant cost savings while improving our data lake management capabilities.

Provisioning Right-Sized Clusters

Our current focus is on a more challenging area to optimize: provisioning right-sized clusters. This means ensuring that the cluster has the optimal number of instances and resources for a specific workload to minimize costs and maximize performance.

Optimizing cluster size is difficult due to the numerous factors influencing a Spark application’s requirements. Different Spark applications may have varying resource requirements based on the size and complexity of the data being processed, as well as the analytics performed. If the cluster is too small, the application may run out of memory or take too long to complete. If the cluster is too large, it may result in underutilization of resources and higher costs.

While we have manually optimized our most costly pipelines periodically, manual optimization is not sustainable at our current scale. For example, we have adjusted instance types, the number of instances, and configured Spark parameters for optimal performance. However, with thousands of Spark applications running every day, we began exploring automation solutions like Sync Autotuner to help us optimize cluster size automatically and at scale.
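To give a sense of the knobs involved, here is a minimal sketch of provisioning an EMR cluster with boto3. The instance types, counts, and Spark properties are illustrative placeholders, not our production values:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Illustrative values only; the right sizes depend entirely on the workload.
response = emr.run_job_flow(
    Name="example-spark-pipeline",
    ReleaseLabel="emr-6.10.0",  # keeping EMR/Spark versions current
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "r5.2xlarge", "InstanceCount": 8},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Configurations=[
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.executor.memory": "12g",
                "spark.executor.cores": "4",
                "spark.dynamicAllocation.enabled": "true",
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```

Every one of these values is a candidate for tuning, which is exactly why doing it by hand for thousands of applications does not scale.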

Introducing Sync Autotuner

Sync is a cloud computing optimization company that offers a unique approach to reducing costs and improving performance. Their Autotuner product is designed to help companies optimize their Amazon EMR costs by finding the best infrastructure configurations for their Spark workloads. Sync’s approach is based on machine learning algorithms that analyze Spark log files and cluster configurations to generate optimized cluster configurations and Spark parameter sets, along with their predicted runtime and cost. You can choose between performant, balanced, or economic solutions, depending on your needs and budget constraints.

Sync Autotuner UI for Prediction Results

Integrating Sync Autotuner into Our Pipeline

Apache Airflow serves as our orchestration tool for data and machine learning pipelines. Airflow tasks create and health-check the EMR clusters. To automate the process of feeding Spark logs to the Autotuner API, we decided to integrate it into our Airflow DAGs. Consequently, we developed a Python application to communicate with the Autotuner APIs.
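The snippet below is a minimal sketch of that wiring, assuming Airflow 2.x. The DAG and task IDs are hypothetical, and the callable only marks where our client application hooks in; its internals are sketched in the next snippet:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def submit_logs_to_autotuner(**context):
    """Kick off an Autotuner prediction for the cluster created upstream."""
    # The upstream cluster-creation task pushes the EMR Job Flow ID to XCom.
    job_flow_id = context["ti"].xcom_pull(task_ids="create_emr_cluster")
    # The real task hands off to our client application (sketched below).
    print(f"Submitting Spark logs for {job_flow_id} to Autotuner")


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="submit_logs_to_autotuner",
        python_callable=submit_logs_to_autotuner,
    )
```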

First, the application takes the EMR Job Flow ID, retrieves the Spark application logs from Amazon S3 for that cluster, removes application-specific parameters, and saves the logs back to Amazon S3. Next, it calls the Autotuner API to initiate a prediction and periodically checks for results. Once the predictions are available, the application applies some business logic to determine if there is a significant cost-saving opportunity.
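Here is a condensed sketch of that flow. The Autotuner endpoint, response fields, bucket name, and log prefixes are placeholders rather than Sync's actual API, and the redaction of application-specific parameters is elided:

```python
import time

import boto3
import requests

s3 = boto3.client("s3")

AUTOTUNER_API = "https://api.example.com/autotuner"  # placeholder endpoint
BUCKET = "example-emr-logs"  # hypothetical log bucket


def sanitize_and_stage_logs(job_flow_id: str) -> str:
    """Copy the cluster's Spark logs to a staging prefix, stripped of app-specific parameters."""
    prefix = f"logs/{job_flow_id}/sparkhistory/"
    staged = f"autotuner-input/{job_flow_id}/"
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix)
    for page in pages:
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            # Redaction logic elided: application-specific parameters are removed here.
            s3.put_object(Bucket=BUCKET, Key=staged + obj["Key"].split("/")[-1], Body=body)
    return f"s3://{BUCKET}/{staged}"


def run_prediction(job_flow_id: str) -> dict:
    """Submit staged logs for prediction and poll until a result is available."""
    log_uri = sanitize_and_stage_logs(job_flow_id)
    job = requests.post(AUTOTUNER_API, json={"logs": log_uri}).json()
    while True:
        result = requests.get(f"{AUTOTUNER_API}/{job['id']}").json()
        if result["status"] in ("COMPLETED", "FAILED"):
            return result
        time.sleep(30)
```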

If such an opportunity exists, the application saves both the previous and recommended configurations to a MySQL database and sends a Slack message to notify the engineers. This allows us to examine the results and apply the configurations if they improve the cost-efficiency of the pipeline.
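A simplified sketch of that last step, assuming a pymysql connection and a Slack incoming webhook; the table schema, hostnames, and webhook URL are illustrative:

```python
import json

import pymysql
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # placeholder URL


def record_recommendation(pipeline: str, current: dict, recommended: dict, saving_pct: float):
    """Persist both configurations and notify the team; the schema is illustrative."""
    conn = pymysql.connect(host="db.internal", user="autotuner", password="...", database="ops")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO autotuner_recommendations"
                " (pipeline, current_config, recommended_config, saving_pct)"
                " VALUES (%s, %s, %s, %s)",
                (pipeline, json.dumps(current), json.dumps(recommended), saving_pct),
            )
        conn.commit()
    finally:
        conn.close()

    requests.post(SLACK_WEBHOOK, json={
        "text": f"Autotuner found a ~{saving_pct:.0f}% saving for `{pipeline}`; review before applying."
    })
```

Keeping a human in the loop here is deliberate: recommendations are reviewed before any production configuration changes.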

Pipeline integration of Sync Autotuner API

Sync Autotuner’s Impact on Our Amazon EMR Costs

After integrating Sync Autotuner, we received numerous alerts, as expected. We ran jobs with both configurations in a staging environment to evaluate the accuracy of the predictions. Ultimately, we applied most of the recommendations. One notable achievement was a 50% cost savings for our most expensive Spark application, accomplished with minimal effort. When we summed all the savings, we found that by simply applying Autotuner’s recommendations, we saved approximately 25% on our Amazon EMR costs. Sync Autotuner is an excellent tool that tackles a challenging problem and delivers real results that you can see in your AWS bill.

Follow the Insider Engineering Blog to read more about our AWS solutions at scale and our engineering stories.


Deniz Parmaksız
Sr. Machine Learning Engineer at Insider | AWS Ambassador