Effective ML Teams — Lessons Learned

Czeslaw Szubert
8 min read · Aug 22, 2019


Machine Learning and Artificial Intelligence have entered our everyday lives — from virtual assistants built into every smartphone to cybersecurity threat detection at financial institutions, the race is on to tap into this next rich wave of digital disruption. Enterprises across industries have much to gain (or lose) and are investing heavily in ML teams. With demand so high, ML talent is scarce and expensive. In addition, the foundational infrastructure and enterprise processes needed for successful ML initiatives are not yet established in most enterprises. So it is no surprise that about 90% of ML models developed by Data Scientists are never released to production and therefore never generate business value. Only 14% of enterprises succeed in harnessing AI — see this article by Andrew Moore, Head of Google Cloud AI. So how can enterprises get the desired returns from ML initiatives? In this post I’ll present lessons learned on how to set up successful ML teams and what you need to devise an effective enterprise ML strategy.

Focus on Business Value

A successful ML initiative must deliver tangible business value. Just creating a cool, or even a record-breaking, ML model does not benefit the enterprise — unless the primary goal of the initiative is a headline demonstrating the superiority of the company’s AI services. In all other cases, the way to deliver business value is to release the trained and verified model to production, where it can generate a financial benefit, whether in the form of increased revenue, reduced cost or mitigated risk. If ML improves your product or service performance, make sure to estimate the financial benefit of the improvement. You have to quantify and measure the financial benefits, so that you know whether you are making a difference with each successive release of the model. Avoid starting with a cool ML idea as an answer and then looking for a problem to solve — “If one has a hammer one tends to look for nails”. You are better off starting with a list of your enterprise problems and opportunities, then seeing where ML has the potential to make a significant difference.
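Quantifying that benefit can be as simple as a few lines of arithmetic. As a sketch, here is one way to estimate the annual value of a model that lifts a conversion rate — all the numbers below are purely illustrative, not from any real initiative:

```python
def incremental_revenue(volume, baseline_rate, improved_rate, value_per_conversion):
    """Annual financial benefit of lifting a conversion rate on a fixed volume."""
    extra_conversions = volume * (improved_rate - baseline_rate)
    return extra_conversions * value_per_conversion

# Hypothetical figures: 1M customer interactions per year, a model release
# that lifts conversion from 2.0% to 2.3%, and $50 margin per conversion.
benefit = incremental_revenue(
    volume=1_000_000,
    baseline_rate=0.020,
    improved_rate=0.023,
    value_per_conversion=50.0,
)
print(f"Estimated annual benefit: ${benefit:,.0f}")  # ~$150,000
```

Tracking a number like this across model releases tells you whether each release is actually making a difference.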

Assemble an end-to-end cross-functional and self-sufficient team

If a production release of the developed ML model is necessary for deriving business value, then you need the right team in place to make this happen. In an enterprise, the responsibilities for deployment to production span many roles and departments. In addition to the Data Scientists who will select, train and optimize the best ML model for the use case, you will need Database and Data Engineers who will ingest and transform data, making it ready for model training. You will need Developers who will implement the ML model in the business application, and you will also need Security, Privacy, Compliance, Operations and Business Stakeholders. In an enterprise where each of these teams works in its own silo and on its own schedule, the effort to productize an ML model may easily take many months, or even years, if it gets released at all. This not only delays or eliminates the financial benefits that the business could derive, but also cuts off the feedback the team needs in order to keep improving. Therefore, it cannot be overstated that a successful enterprise ML strategy must rely on an end-to-end, cross-functional and self-sufficient team, composed of all the groups necessary to release an ML model to production, working towards the common goal of delivering business value.

Analytics Tools as a Platform

Next, enable your team by providing the right tools for the job. New ML tools are released and updated constantly, and the requirements of an ML initiative may change drastically with each new attempt to develop the right model. Start by giving your Data Scientists access to popular ML frameworks and tools — Scikit-learn, TensorFlow, XGBoost, Jupyter Notebooks, and many more — and access to powerful hardware with GPUs or TPUs to run ML training, which can be scaled up and down as needed. Additionally, you need to decouple storage from compute, and you need to build Security, Privacy and Compliance into the Analytics Platform. It is easy to fall into the trap of creating a unique, purpose-built architecture for each ML initiative. Although this may be the quickest route in the short term to get to your first POC, in the long run this approach will add months before Data Scientists even know whether the architecture will work for the application. To avoid this, create a reusable Analytics Platform for Data Scientists to do their work, rather than deploying one-off architectures for each use case.
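One way to think about such a platform is that each experiment becomes a declarative request for compute, while the data stays in shared storage. The sketch below shows a hypothetical job spec (the schema and the storage URI are my own illustration, not any particular platform’s API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingJobSpec:
    """Declarative training-job request for a shared analytics platform.

    Compute (including accelerators) is requested per job and released
    afterwards; the dataset lives in shared storage, decoupled from the
    compute nodes. This is an illustrative schema, not a real product API.
    """
    framework: str                # e.g. "scikit-learn", "tensorflow", "xgboost"
    data_uri: str                 # shared, durable storage — not a local disk
    accelerator: str = "none"     # "gpu" / "tpu" / "none", scaled per experiment
    accelerator_count: int = 0


# A Data Scientist requests two GPUs for one experiment; the next
# experiment on the same data might request none.
job = TrainingJobSpec(
    framework="xgboost",
    data_uri="s3://analytics-platform/churn/v1",
    accelerator="gpu",
    accelerator_count=2,
)
```

Because the spec is frozen (immutable), a submitted job can be logged and reproduced exactly, which pays off later when comparing experiments.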

Go Beyond DevOps: DataOps & MLOps

ML workflows are very different from software development. In software development you almost always succeed — you might have bugs, but after you fix them the feature will be ready for release. ML, on the other hand, is highly experimental — Data Scientists will routinely test hundreds of models in the hope of finding one that works well enough to be released. Another important difference is that software is created in a development environment and typically tested with only simulated data; ML model training needs access to real, production data well before any model is ready for a production release. How do you manage the highly experimental ML development work, and how do you ensure you have a stable production environment to manage your data and models? You will need solid DevOps practices to manage ML workflows from data ingestion, to model training, to release to production and model performance monitoring in production. Taking a page from the playbook of the giants, enterprises need an ML platform like Facebook’s FBLearner Flow or Uber’s Michelangelo. But enterprises don’t have to build it from scratch — recent years have seen a rise of new tools in the areas of MLOps and ML Workflow Management, which can help automate the tasks of ML experiment management (e.g. MLflow, comet.ml), data versioning (e.g. DVC, Pachyderm), ML workflow pipeline management (e.g. Kubeflow, Databricks), all the way to ML model release and performance monitoring (e.g. DataRobot, Algorithmia). The diagram below shows some of these tools and where they can be applied in the ML workflow. Like much else in ML development, there is no one tool to do it all, and flexibility is important to get the best from each, but enabling flexible ML workflows built on solid DevOps fundamentals is a must for a successful enterprise ML strategy.
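To make the experiment-management idea concrete, here is a minimal file-based tracker — a toy stand-in for what tools like MLflow do at scale, so the API below is purely illustrative:

```python
import json
import time
import uuid
from pathlib import Path


class ExperimentTracker:
    """Minimal file-based experiment log.

    Each run's parameters and metrics are written to an immutable JSON
    record, so that any of the hundreds of experiments can later be
    compared and the best one identified. Illustrative sketch only.
    """

    def __init__(self, root="runs"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def log_run(self, params, metrics):
        run_id = uuid.uuid4().hex[:8]
        record = {"run_id": run_id, "timestamp": time.time(),
                  "params": params, "metrics": metrics}
        (self.root / f"{run_id}.json").write_text(json.dumps(record, indent=2))
        return run_id

    def best_run(self, metric, maximize=True):
        runs = [json.loads(p.read_text()) for p in self.root.glob("*.json")]
        sign = 1 if maximize else -1
        return max(runs, key=lambda r: sign * r["metrics"][metric])


tracker = ExperimentTracker()
tracker.log_run({"model": "xgboost", "max_depth": 6}, {"auc": 0.81})
tracker.log_run({"model": "logreg", "C": 1.0}, {"auc": 0.74})
best = tracker.best_run("auc")
print(best["params"]["model"])  # → xgboost
```

Even this toy version captures the core discipline: no experiment is run without its parameters and results being recorded somewhere queryable.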

Start Accumulating Reusable Data

It should be no surprise to anyone that ML requires data — and lots of it. Access to data needs to be very secure, as the frequent headlines about data breaches, which happen even to large enterprises, make apparent. That data needs to be redundant and backed up also goes without saying. Just as you cannot rebuild the analytics tools for each new ML model, you cannot recreate data for each ML initiative. While each team will be working under time and cost constraints to deliver results, cutting corners on data ingestion will lead to very high-interest technical debt. You must invest in ingesting complete data sets, not just the subset of data you need for the current ML initiative; otherwise you will have to re-ingest from the same data sources over and over again, unnecessarily duplicating the effort. You must monitor and maintain your data ingestion pipelines and resolve issues as they occur, or it will be much harder to fix the problems later and you may not be able to recreate lost data. You must make the data immutable, so that you can reference the same data when retraining an improved model or comparing the performance of different ML techniques. More and more Data Versioning tools are coming onto the market — it may be worthwhile to investigate whether one of them would benefit your enterprise. You must also create an easily searchable data dictionary, so that teams can find what data is available. Once a team finds the data set it needs, there must be a clear and expedient process for granting access to authorized users. Data Security is critical — it is not enough to set and forget these permissions; as requirements change, your permissions must reflect those changes. It’s important to invest in a good process to keep permissions up to date and to monitor actual data usage to ensure it is as secure as expected.
Keep in mind that data must always be encrypted in flight and at rest. Invest appropriately in your data ingestion infrastructure, processes and people, and make sure data ingestion is running smoothly and securely, so your ML initiatives have the data resources to be successful.
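One common way to get immutability is content addressing: each dataset snapshot is keyed by the hash of its contents, so a stored version can never change silently. The sketch below illustrates the idea in a few lines — it is a toy, not DVC or Pachyderm, and the store layout is my own invention:

```python
import hashlib
from pathlib import Path


class ImmutableDataStore:
    """Content-addressed store: every dataset version is keyed by the
    SHA-256 of its bytes, so retraining can reference exactly the data
    a model was originally trained on. Illustrative sketch only."""

    def __init__(self, root="datastore"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        path = self.root / digest
        if not path.exists():          # identical content is stored only once
            path.write_bytes(data)
        return digest                  # record this hash alongside the model

    def get(self, digest: str) -> bytes:
        data = (self.root / digest).read_bytes()
        # Verify integrity on every read: the content must match its key.
        assert hashlib.sha256(data).hexdigest() == digest, "data corrupted"
        return data


store = ImmutableDataStore()
v1 = store.put(b"customer_id,churned\n1,0\n2,1\n")
assert store.get(v1) == b"customer_id,churned\n1,0\n2,1\n"
```

Recording the version hash alongside each trained model is what later makes experiments reproducible and model comparisons fair.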

Use Agile to Quickly Iterate Delivering Incremental Value

The problem of very long, often delayed releases is not new to ML initiatives — software development faced this issue in the nineties, which gave rise to Agile development. ML initiatives need to learn from the lessons of software development. However, ML is much more experimental than software development, so it will not be sufficient to simply implement Scrum for ML. You need to consider the fundamental reasons Agile is so successful in software development and adapt it to ML. Agile provides a mechanism for the team to iterate quickly, delivering business value in small increments. Quick iterations enable the team to adapt to changing requirements. In Machine Learning, changing requirements are not just pushed down from the business; more frequently, they result from what the team learns in model training experiments. With each new model trained, the team learns new things about the requirements for hardware, tools, and data. Each of these new requirements can fundamentally change the solution, requiring the team to adapt, and these changes arrive more frequently than the duration of a typical Sprint. Thus, for a successful ML strategy, you need to not only implement Agile in the ML development cycle, but also adapt it to iterate more frequently, learn from previous iterations and deliver value with each released model.

Conclusions

The race for enterprises to extract value from Machine Learning and Artificial Intelligence is on. To be successful at ML initiatives, enterprises need to start by focusing on the value generated by those initiatives, so that the right models are developed. Enterprises need to establish end-to-end, cross-functional teams that are self-sufficient and enabled to release models to production. ML teams need a reusable and flexible Analytics Platform equipped with up-to-date tools to train and deploy models. The team needs to rely on DevOps, MLOps and DataOps to automate and create repeatable workflows. Model training requires lots of data, which cannot be recreated for each initiative; therefore the data needs to be ingested and made available to Data Scientists in a way that keeps access secure and monitored. Finally, ML teams must adapt Agile processes to fit the experimental nature of the ML lifecycle by considering the fundamental values of Agile, rather than just implementing Scrum, which was created for much more structured software development workflows. With these values in place you will have an enterprise ML strategy with which you can succeed at capturing the next wave of digital disruption.

Originally published at https://www.linkedin.com.
