Data Science Workflow Tools
It feels like there is a lot of energy being directed towards “workflow” in data science at the moment. Workflow seems to encompass everything from a detailed development process to the orchestration of pipelines, the management of experiments and data, and deployment to production. This is a write-up of some of my notes on the workflow process and how to navigate what feels to me like a seriously confusing part of data science. It is light on technical details, but it should explain why this is a complicated area and give anyone tangentially involved in data science enough background to orient themselves.
What is Workflow?
By Workflow, I mean a process that involves a Pipeline, or DAG (directed acyclic graph, although sometimes cyclic!) and resembles something like the following figure. The actual stages involved will vary considerably.
Orchestration is how we manage the flow from one stage to the next; Development focuses on the work within each individual stage.
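To make this concrete, here is a minimal sketch in Python of the kind of pipeline I mean. The stage names and the trivial threshold "model" are purely illustrative; a real pipeline would pull from databases, use proper libraries and so on.

```python
# A toy pipeline: each stage feeds the next, forming a simple DAG.
# Stage names and the threshold "model" are illustrative only.

def acquire():
    # Stand-in for pulling raw data from a database or API.
    return [{"feature": 1.0, "label": 0}, {"feature": 2.0, "label": 1}]

def preprocess(raw):
    # Stand-in for cleaning, filtering and deriving features.
    return [row for row in raw if row["feature"] is not None]

def train(rows):
    # Stand-in for fitting a model; here just a mean threshold.
    threshold = sum(r["feature"] for r in rows) / len(rows)
    return {"threshold": threshold}

def evaluate(model, rows):
    # Stand-in for scoring the model against held-out data.
    preds = [int(r["feature"] > model["threshold"]) for r in rows]
    return sum(p == r["label"] for p, r in zip(preds, rows)) / len(rows)

if __name__ == "__main__":
    rows = preprocess(acquire())
    model = train(rows)
    print("accuracy:", evaluate(model, rows))
```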
One of the complexities here is that workflows vary considerably according to the domain, objectives and support available.
It is entirely possible a data scientist will spend much of their time acquiring, exploring and pre-processing data, then applying and evaluating several standard models, for example those from scikit-learn. Elsewhere, another data scientist might have access to standard datasets (e.g. reference images) or a large data engineering team, and spend their time writing low-level code in TensorFlow.
Equally, the resources required might vary, with large compute resources needed at different points in the pipeline, from summarising large datasets to training very high-dimensional models.
The actual tools and libraries used at each point in the pipeline will also vary considerably. Data acquisition and extraction could be a mix of SQL and API calls, with data transformation written in Python / Pandas, Scala or Java. There are many different machine learning libraries, and deployment could mean anything from an academic paper to an online API.
One last complexity I want to add: there is no one right tool for any given job. Databases, from Oracle to BigQuery, include regression modelling and machine learning features and support complex stored procedures. Redshift lets you write user-defined functions in Python with NumPy, SciPy and Pandas. At the other end of the tool spectrum, Jupyter Notebooks emerged as a great way to mix code and results, but they frequently feature embedded SQL, database connections and API calls, can start and run clusters, and can be automated using tools like Papermill. This is incredibly powerful but also a recipe for chaos!
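To give a flavour of that automation, this is roughly what running a parameterised notebook with Papermill looks like. The notebook names and parameters here are hypothetical.

```python
import papermill as pm

# Execute a parameterised notebook and write the executed cells to a new
# notebook file. File names and parameter values are hypothetical.
pm.execute_notebook(
    "daily_report.ipynb",             # input notebook with a tagged "parameters" cell
    "daily_report_2019-01-01.ipynb",  # output notebook containing the results
    parameters={"run_date": "2019-01-01", "table": "events"},
)
```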
Understanding data science workflow is hard because it encompasses many tools and teams from many backgrounds, and it needs to be flexible enough to cover many different domains.
Towards Workflow Standardisation
As these workflows have matured, we are starting to see new tools emerge to help manage them. These, arguably, started with Luigi and Airflow and have grown to include mlflow, DVC and Kubeflow.
The benefit of standardisation is that data scientists can work much more effectively as a team. Any software development team follows some sort of process that enables its members to contribute to a common code base, and the same is true for data science.
A second important feature of workflow frameworks is how they help us move work across platforms.
For development in data science, we often want to work locally on our laptops with a local copy of the dataset; this enables quick identification of bugs and fast iteration over data and code.
Once we have confidence in our code, we want to scale it up and out, which usually means running on a cluster or in the cloud. This means we need to standardise and package all the code from our laptop, but with access to much larger compute, storage and, if required, specialised processors like GPUs or TPUs.
Lastly, for a production system, we probably want to automate the whole process. This means it can be run on a schedule (hourly, daily, weekly) or triggered by pushing the latest, tested version to master, just as we have continuous integration and deployment in the software engineering world.
Workflow Tools and Infrastructure
Probably everybody who has done any work resembling data science has a collection of shell scripts or Jupyter notebooks to run some sort of automation. Anyone trying to standardise beyond that will find a lot of confusing information.
Dependencies and Scheduling First
One step up from shell scripts are Luigi and Airflow, which evolved to coordinate the multiple steps of general data processing in the Big Data world, partly as a response to tools like Oozie, which was closely tied to Hadoop. Luigi grew out of Spotify's need to manage the training of its recommendation systems. Both tools connect to multiple cloud services and enable batch jobs that transfer data from one system to another. They have strong dependency management, so you can define and connect the different steps of a pipeline, as in the sketch below.
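As a rough sketch of what that dependency management looks like in Luigi (the task names, file paths and logic here are made up), each task declares what it requires and what it outputs, and Luigi works out what needs to run:

```python
import datetime
import luigi

class ExtractData(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/raw_{self.date}.csv")

    def run(self):
        # Stand-in for a real extract from a database or API.
        with self.output().open("w") as f:
            f.write("id,value\n1,42\n")

class TrainModel(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        # Declares the dependency: Luigi runs ExtractData first,
        # unless its output already exists.
        return ExtractData(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"models/model_{self.date}.txt")

    def run(self):
        with self.input().open() as f, self.output().open("w") as out:
            out.write(f"trained on {len(f.readlines())} rows")

if __name__ == "__main__":
    luigi.build([TrainModel(date=datetime.date(2019, 1, 1))], local_scheduler=True)
```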
If you are reading up on data science workflow, you will almost certainly come across these tools, but, for daily use, they feel much more like the domain of Data Engineering, with Airflow especially focused on backfill and managing data transfer on a schedule.
Where they feel less strong is the Development phase, where we want to iterate quickly through new ideas rather than run long batch processes. Airflow DAGs are tied to a database and a timestamp and don't deal gracefully with multiple versions. They are very good at Orchestration, especially in a cloud environment, but don't translate so well to local iteration and a seamless transition between local and cloud.
Experimentation First
DVC and mlflow offer platforms for managing data and experimental runs. Yet another complexity of data science workflow is that each new version is tied not just to code but also to the dataset used and the hyper-parameters chosen, in addition to the model created.
Developing, and improving on, a model is a matter of tracking parameters and data alongside improvements in performance metrics. This means that other contributors can recreate your experiments exactly.
Keeping a copy of the exact dataset with each experiment run can lead to a large amount of redundant data, something DVC seeks to minimise by managing data versions for you.
mlflow offers support for various infrastructure (see below) and also keeps track of experiment runs.
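As an illustration, logging a run with mlflow's tracking API looks roughly like this; the parameter and metric names are hypothetical and the accuracy value is a stand-in for a real evaluation.

```python
import mlflow

# Record one experiment run: parameters in, metrics out. By default runs
# are stored in a local ./mlruns directory, or in a remote tracking server.
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 200)

    accuracy = 0.87  # stand-in for a real evaluation result
    mlflow.log_metric("accuracy", accuracy)

    # Artifacts such as a serialised model or a plot can be attached too:
    # mlflow.log_artifact("model.pkl")
```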
Unifying Local and Cloud Infrastructure
New platforms such as mlflow, Kubeflow and, to an extent, TensorFlow Extended offer the ability to write Python to specify pipelines and to run them either locally, by chaining together Docker containers, or remotely on one of the major cloud providers.
Each step of the Pipeline is defined in a Python-based DSL or high-level API, somewhat reminiscent of Airflow operators (see the sketch below). The different frameworks then offer build tools to submit training jobs to cloud platforms and, eventually, to host models serving predictions in real time. The cloud providers offer Jupyter notebooks as part of these offerings.
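For a flavour of what such a DSL looks like, here is a rough sketch using the Kubeflow Pipelines v1 SDK; the container images, paths and step names are hypothetical, and mlflow and TensorFlow Extended have their own, different, APIs.

```python
import kfp
from kfp import dsl

@dsl.pipeline(name="train-pipeline", description="Hypothetical two-step pipeline")
def train_pipeline(data_path: str = "gs://my-bucket/data.csv"):
    # Each step runs as a container; images and arguments here are made up.
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="gcr.io/my-project/preprocess:latest",
        arguments=["--input", data_path, "--output", "/out/features.csv"],
        file_outputs={"features": "/out/features.csv"},
    )
    train = dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/train:latest",
        arguments=["--features", preprocess.outputs["features"]],
    )
    train.after(preprocess)  # explicit ordering; also implied by the data dependency

if __name__ == "__main__":
    # Compile to a package that can be uploaded to a Kubeflow Pipelines cluster.
    kfp.compiler.Compiler().compile(train_pipeline, "train_pipeline.yaml")
```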
There is currently a complexity tradeoff here: you will quickly need someone to support, manage and monitor your cluster infrastructure, and the overhead of actually getting started, beyond the self-contained examples, is pretty high, with dependencies, infrastructure and security to specify across the different backends. Getting access and permissions correct is not trivial!
The implementations also vary in terms of the backends and services they support and are at a very early stage when it comes to supporting all of them. mlflow does not support GCP as a backend and, coming from Databricks, has a strong Spark and Azure bias, although it has excellent support for a variety of hosted models in production. Kubeflow relies on Google Kubernetes Engine for some of its specific Pipelines functionality, and TensorFlow Extended relies on Apache Beam and TensorFlow for its backends, so it is less flexible in the range of libraries it supports.
The trend here seems to be to increase flexibility through configuration, which works well if you can manage within the options provided but adds overhead once your demands fall outside the pre-prepared scope.
Summary
A standard workflow can substantially improve collaboration within a team, the repeatability of code and confidence in production systems. But it comes at a cost and ultimately becomes an architectural question, trading scale and complexity for standardisation. Different platforms make these tradeoffs differently, and only an understanding of your organisation, its domain and strategy, and the skills and scale of the team involved will help in making decisions.
Hopefully, I will follow up this post with some of our own internal approaches, using cookiecutter datascience and Google’s AI Platform.