Distributed Data Science for Management

Matthew Rocklin
Coiled
Jul 7, 2020

Scaling Data Science is a Team Sport: Management are Key Players

An increasing number of organizations need to scale data science to larger datasets and larger models. However, deploying distributed data science frameworks in secure enterprise environments can be surprisingly challenging because we need to simultaneously satisfy multiple sets of stakeholders within the organization: data scientists, IT, and management.

Solving simultaneously for all sides of this problem is a cultural and political challenge as much as a technical one. This is the problem that we’re passionate about solving at Coiled, and that we recently spoke about in our PyCon 2020 talk.

In this post, we’ll discuss the pain points felt by data team leads and management when trying to deploy data processing technologies to provide data scientists with distributed computing. In other posts, we do the same for data scientists and for IT professionals.

We often see the pain points felt by team leads and management reduce to three main challenges:

  1. Avoid Costs: What stops a novice from leaving 100 GPUs idling?
  2. Track and Optimize: Where are we spending money, and how can we reduce it?
  3. Enable Collaboration: How do we replicate the experience of our top performers and enable them to raise the output of the entire team?

We’ll call out these challenges in each of the sections below.

Avoid Costs

Be careful what you wish for with “infinite scaling” on the cloud.

Fundamentally, we’re transforming our company by giving as much computing power as possible to every data scientist. To use a military analogy, this is like giving a tank or fighter jet to every person in the military. This is amazingly powerful, but may result in a surprisingly expensive fuel bill.

Most distributed computing costs are avoidable. Here are a few common culprits:

  1. Keeping a cluster on all day: Data scientists often request a large cluster of machines and keep them running all day, even when they’re not using them;
  2. Forgetting machines: Machines are often left running and forgotten, even after their users have moved on (see the sketch after this list);
  3. Allocating unused GPUs: GPUs offer unparalleled performance, but only if you use them correctly, which is unfortunately quite hard today (although see Dask/RAPIDS). We often see individuals request expensive GPU machines, but then never use them properly;
  4. Writing inefficient code: Most of us write inefficient code. When we rent our machines, this inefficiency translates directly into financial waste. As we scale out our analyses, this waste scales with our computation.
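
The first two culprits in particular have mechanical fixes. As a minimal sketch, a Dask cluster can scale itself down when idle; the cluster class, worker counts, and timeout below are placeholders, and any Dask cluster manager that supports the same `.adapt()` interface would look similar.

```python
import dask
from dask.distributed import Client, LocalCluster

# Optionally ask the scheduler to shut itself down after an hour of
# inactivity, which catches clusters that were simply forgotten.
dask.config.set({"distributed.scheduler.idle-timeout": "1h"})

# A local cluster stands in for whatever cluster manager you use.
cluster = LocalCluster(n_workers=0)

# Grow when there is work and shrink back to zero workers when idle,
# so that nobody pays for a cluster that is just sitting there.
cluster.adapt(minimum=0, maximum=20)

client = Client(cluster)
```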

These issues are straightforward to address, but they do need to be addressed. We can implement per-user and per-group usage limits, and expose per-user and per-group usage metrics to management. We can also perform distributed profiling and see how much money we’re spending on every line of code that we’re running. This lets us make better management decisions to reduce costs.
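
For the profiling piece, a hedged starting point is Dask’s `performance_report` context manager, which writes a task-by-task breakdown of where compute time went to an HTML file; the synthetic timeseries below just stands in for a real workload.

```python
import dask
from dask.distributed import Client, performance_report

client = Client()  # or connect to an existing cluster

# A small synthetic timeseries stands in for real workload data.
df = dask.datasets.timeseries()

# Capture which tasks dominated the computation in an HTML report
# that a team lead can open in a browser.
with performance_report(filename="profile.html"):
    df.groupby("name").x.mean().compute()
```

Multiplying the compute time in a report like this by an instance’s hourly price gives a rough per-workload cost, which is usually enough to spot the expensive parts.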

Slide from our PyCon 2020 talk “Challenges of Deploying Distributed Computing”

Track and Optimize

The irony is that the practice of data science is not itself terribly data driven today.

Tracking and profiling aren’t just useful for reducing costs; they also help us make data-driven decisions about our processes. For smaller data science teams this is easy: typically there is a team of 1–5 people with a central team leader who has a firm grip on what everyone is doing. For larger organizations, however, we often find a strong desire to know what is going on, especially in a messy field like data science.

This becomes more challenging when we add scalable data science tools, both because environments and techniques churn quickly, and because tracking and profiling distributed services is hard.

There are great opportunities here though, and we’re really excited about enabling questions like the following:

  1. How much did we spend parsing CSV files last year? What file format would give us the greatest cost savings with our current workloads?
  2. How much faster would this workload be if we ran it on GPUs? How much more or less expensive would it be?
  3. Who is still using the old version of XGBoost?
  4. Is anyone still using an old version with Python 2 that has a known security issue?
  5. Which groups in my company are able to use TensorFlow effectively?
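
Some of these are already answerable with today’s tooling. For questions 3 and 4, for example, one approach is to audit package versions directly on a running cluster; the sketch below assumes a connected Dask client with xgboost installed on the workers, and the scheduler address is a placeholder.

```python
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address

def audit_versions():
    # Runs on each worker and reports the interpreter and library versions.
    import sys
    import xgboost
    return {"python": sys.version.split()[0], "xgboost": xgboost.__version__}

# client.run executes the function on every worker and returns a dict
# keyed by worker address, which is easy to aggregate across teams.
versions = client.run(audit_versions)
for worker, info in versions.items():
    print(worker, info)
```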

These questions let larger organizations tune and optimize their entire data science division. Few organizations do this well today, but it is a common topic of conversation.

Enable Collaboration

How do we turn one 10x engineer into ten 10x engineers?

Collaboration is about enabling bottom-up team management.

Distributed computing services are often brought into organizations by individual highly effective early adopter data scientists (or at least this is our experience with Dask). These individual contributors invariably share the experience with their colleagues, and end up serving as technical leads and informal devops for a while.

But these early adopter individual contributors need help in order to quickly uplevel their teammates. They need to craft software environments for everyone else to use. They need to track what their colleagues are doing to help them diagnose performance issues. They need to connect to their colleagues’ clusters and help them debug sticky situations. This can be tricky, especially in today’s world of remote work.
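
Much of this is possible with today’s tools, though it takes deliberate effort. As one small example, a technical lead can attach a second client to a teammate’s running Dask cluster and open its dashboard to look at the same workload together; the scheduler address below is a placeholder.

```python
from dask.distributed import Client

# Attach to a colleague's running cluster (placeholder address).
client = Client("tcp://colleague-scheduler:8786")

# The dashboard shows live task progress, memory use, and worker logs,
# which is often enough to diagnose a stuck computation together.
print(client.dashboard_link)

# A quick look at the workers that make up the cluster.
print(list(client.scheduler_info()["workers"]))
```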

Collaboration isn’t a single feature. It’s a suite of small features that are designed around this relationship of skill-sharing. Doing it well helps to amplify effective engineers, while also shortening the time it takes for novice data scientists to become experts.

Final thoughts

These are the types of data science problems we’re solving here at Coiled. We’re really excited to be building products for scaling data science in Python to larger datasets and larger models, particularly for organizations and data teams that want a seamless transition from small data to big data. If the challenges we’ve outlined resonate with you, we’d love it if you got in touch with us to discuss our product development.

— Matt, Hugo, and the whole Coiled Team.

Originally published at https://coiled.io.
