Deep Learning and Workflows — It *Can* Get Easier!


If you’ve built out anything with Deep Learning, you pretty much know how the drill goes, right? You:

  1. Start off with a basic workflow. Or, alternatively, copy over a workflow from something you did previously, and mess with it for a while before you realize that it doesn’t quite match what you’re doing, throw it away, and then, well, Start off with a basic workflow 😝.
  2. Start tweaking the workflow. There’s a step over here that takes way too long and needs to be optimized, there’s a step over there where you need to change a parameter, and so on.
  3. Wait a lot. Seriously, that’s pretty much 99% of what you’re doing. Waiting.

Step 3 is the real killer. Every time you tweak the workflow, every single time, you run the whole damn thing again. Which is ridiculous, because the entire pipeline can take quite a while to run.

It’s actually quite a bit worse, because, in reality, you don’t sit around and wait all the time. Especially if you’re playing with Other People’s Money, in which case you start up Job 1, proactively tweak stuff and start Job 2, and so on. Which means you have a pipeline of workflow tweaks going, and unless you’re really, really good, you forgot to version-control all the changes, and now you’re not quite sure which combination of code/data/model your results map to, and you’re in reproducibility hell.

Yeah, there are ways around this. You could go down the DeepDive (•) route and materialize all the intermediate results (so much storage! and so much time to read/write these results!), or cache one-shot workflow executions à la KeystoneML (••), and/or optimize feature selection within each iteration à la Columbus (•••), but the thing is, they’re all focused on individual parts of the workflow. The workflow of developing the workflow (meta-workflow?) is still a complete PITA.

Or, rather, used to be a complete PITA. Enter HELIX (••••), an end-to-end machine learning language/system/environment, which does some seriously cool stuff:

  1. It’s a Scala-based DSL, which means that you can natively tap into all the JVM/Spark libraries out there. Assuming that’s your thing. If it isn’t, cool, cool.
  2. It versions all your workflows (this, all by itself, makes it pretty much worth it! 🤯). Tracking changes to hyperparameters, feature selection, and whatnot, well, it happens by default. And yeah, you could be using git to do all this, but remember, this isn’t as easy as it sounds, especially when you’re pipelining any number of workflows in parallel 😞. And the best part is that you can compare features, models, and performance metrics directly, thus pretty much getting at the heart of reproducibility in this domain!
  3. While you’re at it, you can also visualize the DAGs associated with the workflow, use them to mess around with (“explore” 😝) the execution plans, and, with any luck, optimize the heck out of those!
  4. Last, and most fascinatingly, it’ll figure out how many of the intermediate outputs need to be materialized all on its own, and optimize it across iterations so that you can live within your storage/compute budgets!

OK, the last point is really special. That said, you’re probably asking yourself, “HowTF can it do that?”. And, mind you, if you’re not, you should be, because there are some serious issues associated with doing this optimally.

You see, if you had an ∞ storage and compute budget, you could just materialize every single intermediate result (à la DeepDive), and for subsequent iterations you could just read the result out from storage if nothing has changed.

Given that you don’t have ∞ capacity, you could materialize some of the intermediate results, and recompute the ones that you don’t (in the HELIX DAG, this basically means computing the node and each of its ancestors).
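
To make that trade-off concrete, here’s a toy sketch (mine, not HELIX code) of what one run costs for a given set of materialized results. The DAG, the cost numbers, and the `plan_cost` helper are all made up for illustration, and it assumes you always load a result if it’s sitting in storage, which isn’t necessarily the cheapest choice.

```python
def plan_cost(target, parents, compute_cost, load_cost, materialized):
    """Cost of producing `target`: load whatever is materialized, recompute the rest.

    Assumption: a materialized node is always loaded, and loading it means
    none of its ancestors need to be touched.
    """
    total = 0
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in materialized:
            total += load_cost[node]             # read from storage, stop here
        else:
            total += compute_cost[node]          # recompute this node...
            stack.extend(parents.get(node, []))  # ...which needs its parents too
    return total


# Hypothetical four-step pipeline: raw -> features -> model -> eval
parents = {"features": ["raw"], "model": ["features"], "eval": ["model"]}
compute_cost = {"raw": 10, "features": 30, "model": 60, "eval": 5}
load_cost = {"raw": 4, "features": 8, "model": 2, "eval": 1}

nothing_saved = plan_cost("eval", parents, compute_cost, load_cost, set())
features_saved = plan_cost("eval", parents, compute_cost, load_cost, {"features"})
print(nothing_saved, features_saved)
```

With nothing materialized you pay for the whole chain; with `features` in storage you skip everything upstream of it. A step you just tweaked would simply not be in `materialized`, since its old result is stale.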

Well, figuring out the optimal amount of recompute (the amount of work you need to do to get to that node) is not really all that complicated: it’s basically a variation of Project Selection / Maximum Flow. OK, it’s not so basic, but that’s what it is 😆.
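
If Project Selection doesn’t ring a bell: you have a bunch of projects, each with a profit (possibly negative) and some prerequisites, and you want the most profitable set that respects the prerequisites. The classic trick is to turn it into a min-cut. Below is a minimal sketch of that textbook construction on a toy instance. To be clear, this is not HELIX’s actual encoding of its DAG (that’s in the paper); the project names, profits, and helper functions are invented for illustration.

```python
from collections import defaultdict, deque

INF = float("inf")


def max_flow(cap, s, t):
    """Edmonds-Karp on a residual graph stored as dict-of-dicts (mutates cap).

    Returns (flow value, set of nodes on the source side of a min cut).
    """
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:  # BFS for an augmenting path
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:               # no path left: we have a min cut
            return flow, set(parent)
        bottleneck, v = INF, t
        while parent[v] is not None:      # smallest residual capacity on the path
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:      # push flow, update residuals
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck


def select_projects(profit, prereqs):
    """Pick the max-profit set of projects that is closed under prerequisites."""
    cap = defaultdict(lambda: defaultdict(int))
    src, sink = "_source", "_sink"
    for p, value in profit.items():
        if value > 0:
            cap[src][p] = value           # gain we lose if we skip p
        elif value < 0:
            cap[p][sink] = -value         # cost we pay if we pick p
    for p, needs in prereqs.items():
        for q in needs:
            cap[p][q] = INF               # picking p forces picking q
    cut, source_side = max_flow(cap, src, sink)
    best = sum(v for v in profit.values() if v > 0) - cut
    return source_side - {src}, best


# Toy instance: "b" pays 5 but needs "a" (costs 3); "d" pays 2 but needs "c" (costs 4).
profit = {"a": -3, "b": 5, "c": -4, "d": 2}
prereqs = {"b": ["a"], "d": ["c"]}
chosen, total = select_projects(profit, prereqs)
print(sorted(chosen), total)
```

The infinite-capacity edges are what enforce the prerequisites: a cut can never afford to separate a project from something it depends on, so whatever ends up on the source side is a valid selection.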

OTOH, figuring out whether to materialize a particular intermediate result is quite hard (strictly speaking, which, and how many, intermediate results to materialize).

  • You don’t really know how many iterations the user is going through (as a trivial — and dumb — example, imagine that you go through the entire process of materializing a result, and then end up deciding that you’re done, which means you just wasted all that effort).
  • Even worse, you don’t really know what’s going to happen to the workflows in the future (whether you’re going to use these results, or chuck ’em, etc.). Even in the simplest case, where you assume that the user will carry out only one more iteration and that all results from the current iteration will be reusable in the next, you can prove that the optimization problem is still NP-hard.

So yeah, it’s not just “quite hard”, it’s NP-hard 😡

Based on all this, HELIX decides, node by node, whether to materialize a result or recompute it on demand, by calculating the recompute costs and guesstimating the materialization costs. As it turns out, it’s not a bad guesstimate, given that it results in cumulative run-times that are anywhere from 60% to 10x better than the current state of the art (DeepDive and KeystoneML).
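
Here’s roughly what a cost-based rule of that flavor could look like. This is a sketch under my own assumptions, not the rule from the paper: I’m assuming a node is worth storing when recomputing it (ancestors included) costs more than writing it once and reading it back once, that a write costs about as much as a read (hence the factor of 2), and that you decide greedily in execution order against a storage budget. `Node`, `choose_materializations`, and all the numbers are hypothetical.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    recompute_cost: float  # time to rebuild this node, ancestors included
    est_load_cost: float   # guesstimated time to read it back from storage
    size: float            # storage it would occupy


def choose_materializations(nodes, budget):
    """Greedy, one-pass decision in execution order (we can't see the future)."""
    kept = []
    remaining = budget
    for n in nodes:
        # Assumed rule: write cost ~ read cost, so storing pays off only if
        # recomputing is more expensive than one write plus one read.
        worth_it = n.recompute_cost > 2 * n.est_load_cost
        if worth_it and n.size <= remaining:
            kept.append(n.name)
            remaining -= n.size
    return kept


pipeline = [
    Node("raw", recompute_cost=10, est_load_cost=4, size=50),       # too big for the budget
    Node("clean", recompute_cost=12, est_load_cost=9, size=10),     # cheaper to recompute
    Node("features", recompute_cost=40, est_load_cost=8, size=30),
    Node("model", recompute_cost=100, est_load_cost=2, size=5),
]
kept = choose_materializations(pipeline, budget=40)
print(kept)
```

The point of a rule like this is that it only needs numbers you have (or can estimate) right now, which is exactly what you want when the optimal answer depends on iterations that haven’t happened yet.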

This is seriously cool stuff, and, I can imagine, only the first of a new breed of the meta-modeling/visualization tooling that Deep Learning has been crying out for.

(•) “DeepDive: A Data Management System for Automatic Knowledge Base Construction” by Ce Zhang
(••) “KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics” by Sparks et al.
(•••) “Materialization Optimizations for Feature Selection Workloads” by Zhang et al.
(••••) “HELIX: Accelerating Human-in-the-loop Machine Learning” by Xin et al.

(This article also appears on my blog)