Published in VALIDIO

The Persistent Peril of Machine Learning

As we discussed in our ML & Data trends post from February, whether or not one believes MLOps has crossed the chasm, the rise of MLOps (i.e. DevOps for ML) signals an industry shift from PoCs (how to build models) to operations (how to run models). We're extremely excited about this shift, but one recurrent bottleneck keeps haunting us year after year: data quality.

Adapted from O’Reilly (2019)
Image courtesy of Rackspace Technologies (2021)

We need data engineers

The results of the January 2021 Rackspace survey highlighted how significant data engineering problems are for companies of all sizes: siloed data, a lack of talent to connect disparate data sources, an inability to process data fast enough to be meaningful… the list goes on.

Let’s show data engineers some love

The pandemic highlighted our ML vulnerabilities

We’ve seen first-hand, from companies such as Uber, Facebook and Amazon, the impact that performance issues can have on data-driven companies running ML models in production in critical operational settings.

Example of a non-critical failure still resulting in a financial loss

Machine Learning vs Software Engineering

The process of developing machine learning models is often compared to the established process of software development. However, a key difference between the two is how strongly the quality of a machine learning model depends on the quality of the data used to train it or to make predictions. In short: traditional software development is deterministic. Machine learning development is probabilistic. This results in a double-edged state of existence for ML models.

Image courtesy of Matei Zaharia/Databricks
Image courtesy of Hawaiian News
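To make the contrast concrete, here is a minimal sketch in Python. Everything in it is hypothetical and chosen only for illustration: the temperature data, the single corrupted label, and the use of a one-variable least-squares fit as a stand-in for "a model". The hand-written function behaves the same no matter what data exists, while the fitted model is only as good as the data it was trained on.

```python
from statistics import mean


def celsius_to_fahrenheit(c: float) -> float:
    # Traditional software: behavior is fixed entirely by the code.
    return c * 9 / 5 + 32


def fit_line(xs, ys):
    # "Model training": ordinary least squares for y = slope * x + intercept.
    # The result is determined by the data passed in, not only by this code.
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


# Hypothetical training data: clean labels vs. labels with one bad record.
celsius = [0.0, 10.0, 20.0, 30.0, 40.0]
clean_labels = [celsius_to_fahrenheit(c) for c in celsius]
corrupted_labels = clean_labels.copy()
corrupted_labels[-1] = 10.4  # assumed entry error: 104.0 typed as 10.4

clean_model = fit_line(celsius, clean_labels)
corrupted_model = fit_line(celsius, corrupted_labels)

print("slope learned from clean data:    ", round(clean_model[0], 3))
print("slope learned from corrupted data:", round(corrupted_model[0], 3))
```

The training code is identical in both runs. A single bad record out of five is enough to turn the learned slope from roughly 1.8 into a slightly negative one, which is the kind of failure a code review alone would not catch.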

Your model was never your IP, it’s your data

ML quality requirements are high, and bad data can backfire twice: first when predictive models are trained on (bad) data, and again when models are applied to new (bad) data to inform future decisions. Poor data quality is the archenemy of widespread, profitable use of machine learning. Together with data drift, poor data quality is one of the top reasons ML model accuracy degrades over time.

Image courtesy of Stanford University/Chip Huyen
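The two points where bad data can bite, training time and prediction time, can both be guarded with very simple checks. The sketch below is an assumption-laden illustration, not a description of any particular tool: the "age" feature, the valid range, the 5% missing-value threshold, and the drift score (a plain mean shift measured in training standard deviations, a crude stand-in for proper drift tests) are all hypothetical.

```python
from statistics import mean, stdev


def quality_issues(values, lower, upper, max_missing_rate=0.05):
    """Return human-readable data quality problems found in one numeric column."""
    issues = []
    missing = sum(v is None for v in values)
    if values and missing / len(values) > max_missing_rate:
        issues.append(
            f"missing rate {missing / len(values):.0%} exceeds {max_missing_rate:.0%}"
        )
    out_of_range = [v for v in values if v is not None and not lower <= v <= upper]
    if out_of_range:
        issues.append(f"{len(out_of_range)} value(s) outside [{lower}, {upper}]")
    return issues


def mean_shift(reference, current):
    """Crude drift score: shift of the current mean, in reference standard deviations."""
    ref = [v for v in reference if v is not None]
    cur = [v for v in current if v is not None]
    return abs(mean(cur) - mean(ref)) / stdev(ref)


# Hypothetical "age" feature at training time and later in production.
training_ages = [23, 35, 41, 29, 52, 38, 46, 31]
serving_ages = [64, 71, None, 68, 230, 75, 66, None]

# Same checks at both points where bad data can backfire.
train_issues = quality_issues(training_ages, lower=0, upper=120)
serve_issues = quality_issues(serving_ages, lower=0, upper=120)
drift = mean_shift(training_ages, serving_ages)

print("training issues:", train_issues)
print("serving issues: ", serve_issues)
print("mean shift (in training std devs):", round(drift, 1))
```

Running the same checks on both the training set and each incoming batch is the point of the design: the training data passes cleanly here, while the serving batch is flagged for missing values, an impossible age, and a mean that has moved several standard deviations away from what the model saw.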

Oliver Molander

Co-founder at Validio and early-stage tech investor at J12 Ventures. Preaching about the realities & possibilities of Data & ML.