Autotuner API for Apache Spark — Optimize Apache Spark at Scale

How to use the API functionality of the Apache Spark autotuner to optimize your jobs at scale

Sync Computing
4 min read · Sep 19, 2022

Kartik Nagappa — Product Manager @ Sync

The Sync Autotuner API enables you to continuously monitor and tune your Apache Spark jobs at scale by making it easy to harness the capability of the Sync Autotuner in a programmatic manner.

Photo by Gabriel Heinzer on Unsplash

Sync Autotuner — Giving you choice and greater control over cost and runtime of your Apache Spark jobs

The Sync Autotuner has enabled developers, data engineers, and data scientists, from small startups to large enterprises, to easily tune their Spark jobs and reduce costs, improve runtime, or both.

Infrastructure tuning can significantly impact data engineering productivity. Most developers and data engineers will tell you that finding the optimal Spark and cluster configurations for a Spark job is a tedious, time-consuming effort involving a lot of trial and error. The number of possible infrastructure choices is practically infinite, and it isn’t feasible to try each one. And when you finally land on the optimal configuration, a change in the input data size, code, or spot market availability throws all that effort out the window.

The Sync Autotuner quickly provides you with an optimal set of cluster configurations in terms of cost, runtime, and infrastructure selection, and it does this using data from a single run. The Sync Autotuner provides a UI through which you can upload Spark job run information (the Spark event log and cluster details) and receive recommendations in the form of cluster configurations that optimize your Spark job run. A data engineer can then quickly and easily select a recommendation from the list, update the job configuration, and rest assured that the job is tuned.

A few case studies below highlight the impact we’ve had:

The originally released Sync Autotuner UI is a quick and simple way for users to try out the Sync Autotuner on one of their existing Spark jobs and get real metrics on what the Sync Autotuner can do for their Spark jobs. The Sync Autotuner API scales with you — it is there for you whether you want to tune a single job, a few jobs, or all your Spark workloads.

Sync Autotuner API — Programmatic access to the Sync Autotuner

All the power of the Sync Autotuner available to you as REST APIs.

The Sync Autotuner API gives you programmatic access to the Sync Autotuner in the form of REST APIs. With it, you can completely automate the work of generating optimal configurations, and you can do so at scale for all your Spark jobs.

The recommendations returned by the Sync Autotuner API aim to provide the convenience of “plug and play”: optimal configurations for Spark on AWS EMR are returned using the AWS EMR RunJobFlow schema, and optimal configurations for AWS Databricks are returned in a format that makes it easy for you to set your Databricks cluster configuration.

We realize that there are many ways in which clusters can be spun up to run Spark jobs. We’ve designed our recommendations to be as simple and straightforward as possible, making it easy for you to extract optimal configurations and plug them into your workflows.
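To make the “plug and play” idea concrete, here is a minimal Python sketch of overlaying a recommendation onto an existing EMR job definition. It assumes, for illustration only, that the recommendation arrives as a dict using the RunJobFlow top-level keys `Instances` and `Configurations`; the helper name `apply_emr_recommendation` and the sample values are hypothetical and not part of the Sync Autotuner API.

```python
import copy
from typing import Any, Dict


def apply_emr_recommendation(
    base_request: Dict[str, Any], recommendation: Dict[str, Any]
) -> Dict[str, Any]:
    """Return a copy of a RunJobFlow-style request with recommended settings overlaid.

    Assumption: the recommendation uses RunJobFlow keys. Only "Instances" and
    "Configurations" are touched; everything else in the request is preserved.
    """
    request = copy.deepcopy(base_request)

    if "Instances" in recommendation:
        # Shallow-merge so fields such as the subnet survive the update.
        merged = dict(request.get("Instances", {}))
        merged.update(copy.deepcopy(recommendation["Instances"]))
        request["Instances"] = merged

    if "Configurations" in recommendation:
        # Spark settings (for example spark-defaults) are replaced wholesale.
        request["Configurations"] = copy.deepcopy(recommendation["Configurations"])

    return request


# Hypothetical existing job definition and recommendation.
base = {
    "Name": "nightly-etl",
    "Instances": {
        "Ec2SubnetId": "subnet-123",
        "InstanceGroups": [
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 10}
        ],
    },
}
recommended = {
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "CORE", "InstanceType": "r5.2xlarge", "InstanceCount": 4}
        ]
    },
    "Configurations": [
        {"Classification": "spark-defaults", "Properties": {"spark.executor.memory": "8g"}}
    ],
}
tuned_request = apply_emr_recommendation(base, recommended)
```

If you launch clusters with boto3, a dict shaped like this could then be passed along as `boto3.client("emr").run_job_flow(**tuned_request)`, provided it also carries the other fields your job requires.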

A typical Sync Autotuner API workflow for a single Spark job has the following steps:

  • Run your Spark job and wait for it to complete
  • Call the Sync Autotuner API to run a prediction, which generates a list of optimal configurations for your Spark job. The steps for the API are: (1) initiate prediction, (2) check prediction status, and (3) get prediction results.
  • Update your Spark job configuration with a recommended optimal configuration based on your business needs

When you scale the Sync Autotuner API workflow to span all your Spark jobs, you end up with a workflow that looks like the one in the figure below.

Let’s get started!

Check out our user guide and recipes to quickly get up and running with the Sync Autotuner API. If you don’t yet have access to the Sync Autotuner then you can request access through the Autotuner page here.

We’d love to hear from you and how you’re using the Sync Autotuner API — tweet at us, find us on LinkedIn, or email us at support@synccomputing.com.
