ML Ops: Evolving & Sustaining Innovation

Robert Sibo
Slalom Data & AI
7 min read · Apr 1, 2020

--

Many great examples of a technological advancement becoming a competitive advantage boil down to the ability to re-create it in a scalable, agile and cost-effective way, both to fuel potential new markets and to satisfy the demand it will create. We see this with Tesla’s production difficulties, the commercial space race, and the mass-production challenges that separated Apple from its competitors.

Machine Learning is the same. For most companies, success doesn’t require the latest Deep Reinforcement Learning or Convolutional Neural Network, but rather creating a reproducible string of successes leveraging ML in various parts of the business. It would be massively naïve to believe that Siri or Google’s search recommendations are “one-hit wonders” coming from a small group of isolated engineers.

Most of the time, companies start with the bare minimum pipeline and infrastructure needed to rush to an ML model (often leading to poor results due to poor feature and algorithm selection, but that’s another topic).

But this falls over when it’s time to replicate across problem domains or org units, or to scale to production inference demands. It’s natural for data scientists and data engineers to prefer exploratory development lifecycles, favoring bootstrapping a proof of concept over rigorous software engineering practices. Iteration is the answer, but ensure that, after proving value, efforts can easily transition into production-strength solutions.

I’ve seen firsthand at Apple, Facebook and others in Silicon Valley how their successes stand on the shoulders of massive amounts of work to create an ecosystem and operating model around ML, one that provides data scientists, ML engineers, and data engineers a foundation of:

  • Scalable infrastructure (compute, data, integration)
  • Agile development patterns
  • Automated & standard CI/CD pipelines
  • Re-usable modular code libraries for various parts of the pipelines
  • Extensive training for engineers across the organization on tools and libraries
  • Identified roles and responsibilities, ranging from testing and engineering to scalable infrastructure and product ownership
  • And much much more
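To make the “re-usable modular code libraries” point concrete, here is a minimal sketch of what a shared pipeline library can look like. All names and the structure are illustrative assumptions, not any particular company’s library; the idea is simply that each step is a small, named, individually testable unit rather than notebook code:

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical example: a pipeline "step" is a named, pure function over
# records, so the same library code can be reused across projects and
# unit-tested in a standard CI/CD job.

@dataclass
class Step:
    name: str
    fn: Callable[[list], list]

def run_pipeline(steps: List[Step], records: list) -> list:
    """Apply each step in order, logging row counts for observability."""
    for step in steps:
        records = step.fn(records)
        print(f"step={step.name} rows={len(records)}")
    return records

# Re-usable steps any team can import and compose.
drop_nulls = Step("drop_nulls", lambda rs: [r for r in rs if r.get("value") is not None])
scale = Step("scale", lambda rs: [{**r, "value": r["value"] / 100.0} for r in rs])

if __name__ == "__main__":
    data = [{"value": 50}, {"value": None}, {"value": 200}]
    out = run_pipeline([drop_nulls, scale], data)
    print(out)  # [{'value': 0.5}, {'value': 2.0}]
```

Because each step is a plain function, the library can grow organically: a new project composes existing steps in a different order instead of copy-pasting notebook cells.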

ML Ops is a discipline evolving quickly across open source, commercial and academic groups to leapfrog the barriers to entry most companies face in delivering ML at scale, and to achieve the ML fly-wheel that Amazon and other Silicon Valley companies have built.

Definition of ML Ops

ML Ops attempts to bring software engineering best-practices around DevOps to data and ML pipelines with a goal to de-risk, accelerate and embolden ML initiatives.

Tools alone won’t accomplish this; best practices crystallized as blueprints or standard operating procedures will help adoption and result in:

  • Simpler writing of production-ready code
  • More time spent building data pipelines that are scalable, deployable, reproducible and versioned
  • Standardized ways for teams to collaborate (typically difficult with Jupyter notebooks)
  • A reduced skill gap for junior engineers and data scientists
  • Reduced risk, by building privacy and security into solutions from the start
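On the “reproducible and versioned” point, one lightweight pattern is to derive a version tag deterministically from a run’s inputs and configuration, so identical inputs always yield the same tag and the same result. This fingerprinting scheme is an illustrative assumption, not a specific tool’s API:

```python
import hashlib
import json
import random

def run_version(inputs: list, config: dict) -> str:
    """Derive a deterministic version tag from inputs + config."""
    payload = json.dumps({"inputs": inputs, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

def train_stub(inputs: list, config: dict) -> dict:
    """Stand-in for a training job; seeding makes the run repeatable."""
    random.seed(config["seed"])  # pin all randomness to the config
    score = sum(inputs) / len(inputs) + random.random()
    return {"version": run_version(inputs, config), "score": score}

if __name__ == "__main__":
    cfg = {"seed": 42, "lr": 0.01}
    a = train_stub([1, 2, 3], cfg)
    b = train_stub([1, 2, 3], cfg)
    assert a == b  # same inputs + config -> same version and same result
    print(a["version"])
```

Real platforms track far more (code revision, data snapshots, environment), but the principle is the same: anything that can change the output should feed the version.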

ML Ops cannot be created simply through a tool suite or a process; it’s important to see it as a coordinated and mature set of capabilities that together provide sustainable and widely applicable successes. This is where the ML operations ecosystem comes in: building out these capabilities has historically been a massive undertaking requiring deep engineering capability, but as of 2019 a great number of tools, mapped-out processes and general best practices are emerging.

I’m going to tackle this in four steps:

Step 1 — Identify Ownership & Operating Structure

Figure out if this will be managed centrally (ML COE of sorts) or built up independently across distributed projects (“lighthouse projects”), in which case providing reference blueprints and procedures to follow will be even more important.

Building an organization and operating model to mimic the power and results of companies in Silicon Valley is worth a whole separate discussion; it’s out of scope for this post.

Step 2 — Define the Target ML Ops Capability

Define the need, prioritize investments and create a roadmap to build the ML Ops and delivery ecosystem required. Each capability may be provided by certain technology solutions, but first focus on the functional need; mapping it to tools will come in Step 3.

Below is a reference blueprint for a mature capability:

Every situation is unique. Your organization’s ML delivery capability may focus more on proofs of value and, at this point, not require CI/CD or production-level privacy and security controls, for example, since nothing will be exposed externally. As with anything, a spectrum exists and should be used to align everyone on where the goal is set and how priorities are ordered.

Step 3 — Map to Tools & Services Strategically

Build vs. buy, best-of-breed vs. all-in-one: these concepts will all come up as you work strategically through the operating structure defined in Step 1 above.

My suggestion is to start with one or two platforms, like AWS, and bring in best-of-breed tools only when justifiable. There is still a lot of overlap between niche tools, and new ones pop up in this domain all the time, while AWS, Azure and others consistently innovate on their own suites of services.

Matt Turck’s Data & AI Landscape 2019 illustrates just how crowded things have become if you’re building your ML capability from scratch and looking for best-of-breed, multi-vendor solutions.

As an example, at Slalom we’ve seen AWS very successfully cover most of the technical capabilities required for even a mature practice.

Below illustrates the mapping to Slalom’s ML Ops Delivery Blueprint:

Step 4 — Create a Culture of Sustainable ML — the ML ‘Fly-Wheel’

A modern culture of data is an environment of experimentation, empowerment, curiosity, critical thinking, and collaboration. With the appropriate governance, controls and safety nets to protect and fuel this spirit, a modern company can be set free to explore and do great things. Raising general literacy, and hiring roles in all parts of the business that appreciate and can produce analytical insights when given appropriate tools and training, will produce a fly-wheel effect.

In the end, rich data and powerful machine learning models incorporated into your business can be the difference between growth and being disrupted, but they can also quickly become a failure or even a toxic asset without thinking differently about how they’re managed in today’s environment. Take privacy and compliance seriously and work on embedding privacy by design throughout the company via tools, policies, processes and organizational changes. As part of this, a thorough assessment of how data is used internally and within third-party partner networks is required, and it must be actively managed going forward.

Much like in law, where the burden of proof falls on the disruptor, at Slalom we’re seeing a new set of burdens and obligations arise for data scientists looking to innovate. I believe that for sustained ML success, a company will need to balance bold visions and innovation with guardianship tailored specifically for data science and ML work.

As they say, this topic too is out of scope for this blog, but it’s worth a discussion. Let’s talk more….

--

Slalom is a modern “digital and cloud native” consulting company with a deep appreciation for all that data and analytics can bring to a company. Across our offices globally, we help our clients instill a modern culture of data and learn to respect the role they play as owners and stewards of it.

Robert Sibo is a senior director of data and analytics out of Slalom’s Sydney office, formerly Silicon Valley. Speak with Robert and other Data & Analytics leaders at Slalom by reaching out directly or learn more at slalom.com.

rob.sibo@slalom.com

Slalom Australia
