Cluster? F#*k! One machine is all you need

Simiotics
3 min read · Jan 28, 2020


Distributed computing is hard. It is difficult enough to perform computation across multiple threads or processes on the same machine. You should have an especially good reason to distribute your computations across multiple machines.

I have been to many data science conferences, and have noticed constant pressure on attendees to evaluate data science frameworks on their potential for distribution.

I have worked in and around many data science teams, too. This has taught me something about distribution — YAGNI. You aren’t going to need it. [0]

The costs of ignoring YAGNI are different for data science teams than they are for development teams. They are sometimes invisible to the data scientists themselves, because they are incurred mainly in hiring and infrastructure.

Suppose that a team decides to migrate their data processing to Apache Spark. They have instantly accepted the cost of maintaining a Spark cluster. They pay this cost either by spending effort on maintenance themselves or by delegating maintenance to a company like Databricks or AWS. Anyone who has paid for managed Spark knows that it isn’t cheap.

Frameworks like Spark and Tensorflow signal a certain degree of data science sophistication. This sophistication is very useful at scale. But if you aren’t going to need something that scales like distributed frameworks do, then you aren’t going to need this sophistication either. So, as a company, why pay for it?

Very few data science teams require the scale that frameworks like Spark, Airflow, and Tensorflow offer, especially when most cloud providers will let you spin up a VM with 64 vCPUs and hundreds of gigabytes of RAM for less than $5/hour. The popular frameworks may offer a better developer experience (Databricks notebooks truly are visionary), but overall you will be better served by A Bunch of Scripts (TM).

Photo by Markus Spiske on Unsplash

Avoiding the cluster is simple. For your preferred cloud provider:

  1. Package your execution environment into a Virtual Machine (VM) image
    The easiest way to do this is to create a VM, install whatever you need on it manually, and then create an image from that VM. A more rigorous approach is to use Packer. (A sketch of the image-baking step appears after this list.)
  2. Write a runner script which provisions a VM with:
    - The above VM image
    - The data you wish to process
    - The code that does the processing
    This runner script should create the VM with a startup script that executes your code on launch. You can either make the data processing code responsible for transferring data off the VM or delegate this to an scp or sftp call in your runner. (A sketch of such a runner appears after this list.)
  3. Make the VM type a parameter to your runner script
    This lets you use beefy instances for heavy workloads and wimpy instances for light ones, which makes your runner far more reusable.
    It also allows you to use spot and preemptible VMs on AWS and GCP [1], which can offer a 70% discount over their on-demand counterparts. (The runner sketch below shows both the instance type parameter and the spot option.)
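For step 1, here is a minimal sketch of baking the image with boto3, assuming AWS (on GCP the analogous command is gcloud compute images create). The instance ID, image name, and description are placeholders standing in for whatever you configured by hand:

```python
# Minimal sketch of step 1, assuming AWS and boto3: bake an AMI from a
# hand-configured EC2 instance. Requires AWS credentials to be configured.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",  # the VM you set up manually
    Name="data-processing-env-v1",     # any unique image name
    Description="Execution environment for my data processing scripts",
)
print("Image ID:", image["ImageId"])
```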
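And here is a sketch of the runner itself, covering steps 2 and 3, again assuming AWS and boto3. The image ID, bucket names, and process.py path are hypothetical, and the startup script assumes the baked image has Python and the AWS CLI installed and that the instance has an IAM role allowed to read and write the bucket. The VM type and the spot request are command-line parameters:

```python
#!/usr/bin/env python3
# Runner sketch for steps 2 and 3: provision a VM from the baked image,
# execute a startup script on launch, and let the VM terminate itself.
import argparse

import boto3

# Executed by cloud-init on first boot: pull data and code, run the job,
# push results back to object storage, then shut the machine down.
STARTUP_SCRIPT = """#!/bin/bash
set -euxo pipefail
aws s3 cp s3://my-bucket/input/ /tmp/input/ --recursive
aws s3 cp s3://my-bucket/code/process.py /tmp/process.py
python3 /tmp/process.py /tmp/input /tmp/output
aws s3 cp /tmp/output/ s3://my-bucket/output/ --recursive
shutdown -h now
"""


def main() -> None:
    parser = argparse.ArgumentParser(description="Provision a one-off processing VM")
    parser.add_argument("--instance-type", default="m5.4xlarge",
                        help="beefy for heavy workloads, wimpy for light ones")
    parser.add_argument("--spot", action="store_true",
                        help="request a discounted spot instance")
    args = parser.parse_args()

    ec2 = boto3.client("ec2", region_name="us-east-1")
    params = {
        "ImageId": "ami-0123456789abcdef0",  # the image baked in step 1
        "InstanceType": args.instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "UserData": STARTUP_SCRIPT,
        # Terminate (not just stop) when the startup script calls shutdown,
        # so you are not billed after the job finishes.
        "InstanceInitiatedShutdownBehavior": "terminate",
        # Assumes an IAM role with S3 access is attached by default; pass
        # IamInstanceProfile here if it is not.
    }
    if args.spot:
        params["InstanceMarketOptions"] = {"MarketType": "spot"}

    instance = ec2.run_instances(**params)["Instances"][0]
    print("Launched", instance["InstanceId"], "as", args.instance_type)


if __name__ == "__main__":
    main()
```

For a heavy job you might invoke it as python runner.py --instance-type c5.9xlarge --spot; for a light one, swap in something like a t3.medium.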

This is an approach that has served me well. I hope it will do the same for you.

By Neeraj Kashyap, Simiotics (follow Neeraj on Twitter)

[0] If you are dealing with a LOT of data, like hundreds of terabytes or more, disregard this entire article.

[1] I am most familiar with AWS and GCP, but also researched other cloud providers for this article. I found Digital Ocean to have very attractive pricing for this use case.
