By James Watkinson, Head of Engineering at DataSparQ
A client recently asked us to take their data science prediction model from development to production in under a month and run it as an ongoing service.
This is quite typical of the requests we get here. Productionising machine learning is what we do at DataSparQ, and I'm always looking for ways to improve the efficiency of our development and ongoing operations.
In this case, as is common in machine learning pipelines, a fairly complex set of operations was needed to turn the raw input data into meaningful output the business could use. We ended up with over 60 features built daily from the raw data, and a full pipeline of over 20 operations split between services hosted in the cloud and on-premises.
Operations need to happen in the right order, with certain stages having multiple upstream dependencies. Typically, we would use a tool like Luigi or Airflow to orchestrate a pipeline like this, but these tools come with the overhead of hosting and the need to run the operations centrally. They also come with a massive toolset of operators which, in my experience, are rarely needed.
For this project, I wanted something simpler: a lightweight way to orchestrate pipeline operations, exposed as an API so it would play nicely with our cross-platform microservice architecture.
Developers, meet Houston.
On a delayed train journey into the office, Houston was born.
Houston is a serverless orchestration tool. It exposes a simple API which is used to navigate through a directed acyclic graph (DAG). Houston has a few core concepts:
Plan: A DAG, describing the operations that need to happen, the order and the dependencies.
Stage: A node in the DAG which represents one of the operations that needs to happen.
Mission: An occurrence of running through a plan.
For each Mission, the Stages in the associated Plan have states. These are simply ‘not started’, ‘started’, ‘finished’ and ‘failed’. Additionally, stages can carry key-value parameters, which I'll come back to later.
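These concepts are easy to picture in code. Here is a minimal sketch, in plain Python, of a plan as a DAG and the rule applied when deciding which stages of a mission can start. The schema and function names are illustrative, not Houston's actual API.

```python
# A Houston-style plan, sketched as a dict: each stage maps to the list of
# upstream stages it depends on. The schema here is hypothetical.
plan = {
    "name": "daily-feature-build",
    "stages": {
        "extract-raw": [],                  # no upstream dependencies
        "build-features": ["extract-raw"],  # waits for extract-raw
        "train-model": ["build-features"],
        "score": ["build-features", "train-model"],
    },
}

def ready_stages(plan, states):
    """Stages that are 'not started' and whose upstreams are all 'finished'."""
    return [
        stage
        for stage, upstream in plan["stages"].items()
        if states.get(stage, "not started") == "not started"
        and all(states.get(dep) == "finished" for dep in upstream)
    ]

# Midway through a mission: the first two stages are done.
states = {"extract-raw": "finished", "build-features": "finished"}
print(ready_stages(plan, states))  # -> ['train-model']
```

Note that `score` is not returned even though one of its upstreams is finished: a stage only becomes valid once every upstream dependency has completed.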
For the client in question, I created components to complete standard tasks such as:
- manipulating BigQuery data tables (inserting, deleting and truncating)
- creating, monitoring and destroying DataFlow jobs
- creating automated test components
- creating and destroying virtual machines
- moving items through storage
By creating isolated, re-usable components, we gain the benefits of microservice architecture for our data pipeline. Each task can be hosted anywhere, and the overall impact of failure of an individual task can be mitigated.
This is where those key-value parameters become useful. Some of our stages manipulate BigQuery data tables. If there were a 1:1 mapping between components and stages, we'd need a separate microservice for every BigQuery stage in the DAG. The params let us reuse one microservice across multiple stages, each performing a different operation on the data as defined by its key-value pairs.
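To make that concrete, here is a sketch of how one microservice can serve many stages by dispatching on a stage's params. The param names (`operation`, `table`, `where`) are illustrative rather than Houston's actual schema, and `run_sql` is a stand-in for a real BigQuery client call.

```python
def run_sql(sql):
    # In production this would go through google.cloud.bigquery; here we
    # just return the statement so the dispatch logic can be demonstrated.
    return sql

def handle_stage(params):
    """Build the right BigQuery operation from a stage's key-value params."""
    operation = params["operation"]
    table = params["table"]
    if operation == "truncate":
        return run_sql(f"TRUNCATE TABLE `{table}`")
    if operation == "delete":
        return run_sql(f"DELETE FROM `{table}` WHERE {params['where']}")
    raise ValueError(f"unknown operation: {operation}")

# Two different DAG stages, one microservice:
print(handle_stage({"operation": "truncate", "table": "ds.staging"}))
print(handle_stage({"operation": "delete", "table": "ds.events",
                    "where": "date < '2020-01-01'"}))
```

The component itself stays generic; adding another table operation to the DAG is just another set of key-value pairs in the plan, not another deployment.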
We have developed a Google Cloud Platform (GCP) plugin for Houston which further streamlines operations when using GCP. In this case, each operation is executed by Google Cloud Functions listening to a pub/sub queue. When a stage completes, a message is sent to Houston which in turn acknowledges the status change and responds with the next valid operation. This instruction is queued for the appropriate component to pick up.
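The round trip above can be sketched as a simple loop. This in-memory version (a `deque` standing in for the Pub/Sub queue, and an inlined "Houston" step standing in for the HTTP call) only illustrates the control flow, not the real wire format.

```python
from collections import deque

# A small plan: b and c both depend on a; d depends on b and c.
plan = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
states = {s: "not started" for s in plan}

# Kick off the mission: stages with no upstream dependencies are queued first.
queue = deque()
for s, deps in plan.items():
    if not deps:
        states[s] = "started"
        queue.append(s)

order = []
while queue:
    stage = queue.popleft()     # a component picks up its instruction...
    states[stage] = "finished"  # ...and does its work
    order.append(stage)
    # "Houston" acknowledges the state change and queues newly valid stages:
    for s, deps in plan.items():
        if states[s] == "not started" and all(states[d] == "finished" for d in deps):
            states[s] = "started"
            queue.append(s)     # in GCP, this would be a Pub/Sub publish

print(order)  # -> ['a', 'b', 'c', 'd']
```

Because each instruction is just a queued message, the components doing the work can live anywhere that can reach the queue, which is what makes the serverless setup possible.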
Houston is now our default workflow orchestration tool, and we will continue to develop and improve it.
Houston is now available at callhouston.io, along with a Python client and the GCP plugin. Give it a try for free!
P.S. I’d welcome any feedback. This has solved problems for us, and I’m keen to hear whether it can for you too. Get in touch with me here: email@example.com