User Perspective — How Sync Computing Can Optimize Cloud Spend

Matt Weingarten
4 min read · Apr 14, 2022


Saving money!

Introduction

I don’t read Reddit like I used to (it would be hard to stay employed if I did), but one of the subreddits I try to keep regular tabs on is r/dataengineering. A few months back, I saw a post about a new tool that could take EMR Spark logs and suggest a more optimal configuration that could save on processing time as well as overall spend.

As a data engineer who has gone through the pain and struggle of doing this science on my own, a tool that could do it for me sounded too good to be true. I had to give it a shot to see if the claims were actually accurate.

Methodology

Using Sync’s autotuner is pretty straightforward. You’ll need to use the AWS CLI to grab the configuration of your EMR cluster and then go to the Spark history server to grab the log for the job you want to optimize. After using some tools to clean those logs up (all of which is documented in their examples), they can be passed into the autotuner UI.
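For a rough sense of the first step, here’s a minimal boto3 sketch of pulling an EMR cluster’s configuration. Sync’s docs walk through the equivalent AWS CLI commands; the cluster ID, region, and output file below are placeholders, not part of their workflow.

```python
# Illustrative only: grab an EMR cluster's configuration with boto3,
# roughly equivalent to the AWS CLI step in Sync's documented workflow.
import json
import boto3

CLUSTER_ID = "j-XXXXXXXXXXXXX"  # hypothetical cluster ID

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

cluster = emr.describe_cluster(ClusterId=CLUSTER_ID)["Cluster"]
instance_groups = emr.list_instance_groups(ClusterId=CLUSTER_ID)["InstanceGroups"]

# Save everything the autotuner might need about the cluster shape
with open("cluster_config.json", "w") as f:
    json.dump(
        {"Cluster": cluster, "InstanceGroups": instance_groups},
        f, default=str, indent=2,
    )

print(f"Saved configuration for {cluster['Name']} ({CLUSTER_ID})")
```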

The autotuner will take some time to run depending on the complexity of the job and then return its suggested configurations. The top of the results page displays three options:

Example of autotuner output

Performance is for those who want to favor speed, economy is for those who want to favor cost, and balanced is for those who want to achieve both. There are also more options than these if you take a look at the graph of potential configurations that would save time or money:

Graph of EMR configurations by runtime and cost

The Spark-specific configs for the job itself can be found below the graph:

Suggested Spark configs based on what you chose above

Then, plug these into your cluster/job configuration and watch the savings roll in!
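One way to plug the suggestions in is as --conf flags on a spark-submit step. This is just a sketch under assumptions: the config values, cluster ID, and job path are hypothetical, not the autotuner’s actual output for any of our jobs.

```python
# Minimal sketch: apply suggested Spark configs as an EMR step via boto3.
import boto3

suggested_confs = {
    "spark.executor.memory": "8g",           # hypothetical values --
    "spark.executor.cores": "4",             # use whatever the autotuner
    "spark.sql.shuffle.partitions": "400",   # recommends for your job
}

conf_args = []
for key, value in suggested_confs.items():
    conf_args += ["--conf", f"{key}={value}"]

emr = boto3.client("emr", region_name="us-east-1")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
    Steps=[{
        "Name": "tuned-spark-job",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                *conf_args,
                "s3://my-bucket/jobs/my_job.py",  # hypothetical job location
            ],
        },
    }],
)
```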

Outcome

The first job I put into the autotuner went from processing in around 90 minutes to 25 minutes after I changed the configurations, using only a slightly larger cluster. That time savings more than makes up for the extra nodes, so it definitely worked to our advantage.
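To see why a faster run on a bigger cluster still comes out ahead, here’s a back-of-the-envelope check. The node counts and hourly price are hypothetical; only the 90-minute and 25-minute runtimes come from our actual job.

```python
# Back-of-the-envelope cost comparison (hypothetical prices and node counts).
HOURLY_NODE_COST = 0.50  # assumed EC2 + EMR price per node-hour

before_cost = (90 / 60) * 10 * HOURLY_NODE_COST   # 90 min on 10 nodes
after_cost = (25 / 60) * 12 * HOURLY_NODE_COST    # 25 min on 12 nodes

print(f"before: ${before_cost:.2f}, after: ${after_cost:.2f}")
# before: $7.50, after: $2.50 -- the shorter runtime more than offsets the extra nodes
```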

Afterwards, I tried applying the autotuner to some bigger jobs we have, which process more data and start from a much larger initial cluster configuration. Here, I had to be a little more careful not to exceed our service limit for vCPUs, so I aimed for configurations that used fewer nodes than we had before. While we didn’t see much time savings here (despite what the UI said was possible), the cost went down considerably. That was our main objective anyway, so the lack of a substantial change in processing time wasn’t a big concern.
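The vCPU check itself is simple arithmetic. A sketch like the one below is all it takes; the instance types, node count, and quota here are made up for illustration, not our actual limits.

```python
# Quick sanity check that a candidate configuration stays under the vCPU quota.
VCPUS_PER_INSTANCE = {"r5.4xlarge": 16, "r5.8xlarge": 32}  # example instance types

def total_vcpus(instance_type: str, node_count: int) -> int:
    return VCPUS_PER_INSTANCE[instance_type] * node_count

VCPU_QUOTA = 512                 # hypothetical service limit
candidate = ("r5.8xlarge", 14)   # hypothetical suggested configuration

if total_vcpus(*candidate) > VCPU_QUOTA:
    print("Candidate config exceeds the vCPU quota -- pick a smaller cluster")
else:
    print("Candidate config fits within the vCPU quota")
```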

If our current data load stays consistent for the rest of the year, we’re projected to save around $100k thanks to the autotuner.
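The projection itself is just per-run savings scaled out over the year. The numbers below are hypothetical stand-ins, not the actual figures behind our estimate.

```python
# How a per-run saving turns into an annual projection (all numbers hypothetical).
savings_per_run = 45.0   # assumed $ saved per job run after retuning
runs_per_day = 8         # assumed daily schedule
days_remaining = 300     # rest of the year

projected_savings = savings_per_run * runs_per_day * days_remaining
print(f"Projected savings: ${projected_savings:,.0f}")  # $108,000
```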

Editorial

I started trying out the autotuner very early (thanks to seeing the Reddit post). The team was more than accommodating in getting me set up and answering any questions I had along the way. This team really knows what it’s doing.

Sync Computing also just released an autotuner for Spark jobs on Databricks. I’m looking forward to trying it out, as we use Databricks for our ad-hoc data analysis and a few smaller jobs.

One thing to note is that the autotuner currently only works for AWS. Other cloud platforms are not yet supported (but stay tuned, perhaps?).

Conclusion

I definitely recommend Sync Computing’s autotuner to data engineers who are tired of trying to figure out their optimal Spark configurations by trial and error. It’s worked for us, and I want to make sure this tool gets the recognition it deserves. Beyond the cost and performance optimizations, the habit of paying attention to cloud costs is something engineers should keep up on a consistent basis.
