The Machine Learning Random Walk

Jason Corso
Oct 10, 2023


Building performant machine learning systems requires a focus on analysis-oriented work long before pipeline-oriented optimization

You’re doing machine learning wrong.

It starts alluringly simple: you need to train a machine learning model to reliably recognize a certain phenomenon in your data. This could be a pedestrian in LIDAR data, an equity’s upward or downward trend, a certain logo in an image, the name of a song in a spoken command, or something else. The modality and problem do not matter. In fact, as a machine learning scientist, academic, and manager for more than two decades, I’ve been lucky enough to work on many of these scenarios: I’ve engineered effective nano-scale image recognition methods and CT-scanner reconstruction code that are currently in global use; I’ve co-authored more than 150 research papers on numerous challenges in computer vision and related areas; I’ve taught machine learning (ML) and computer vision to thousands of students; I’ve advised project teams in autonomous driving, computer vision, library science, robotics, public safety, and more. And what I’ve learned is that, well, it’s never as simple as one expects.

Across these settings, the process is roughly the same. Given an ML problem, start by gathering the basic ingredients: a labeling protocol, sufficient annotated data, a model with sufficient representational capacity, some code, and a lot of compute. Then mix these ingredients together and bake on the GPU cluster for a while. Out pops your model, ready to move on to the next challenge. Perhaps you bake a few different models with various settings, finally selecting the one that smells the best. Anyway, it sounds straightforward, doesn't it?

Well, it’s wrong.

This rose-tinted view of the machine learning process is often called the machine learning pipeline. And, well, I'm done with hearing about the machine learning pipeline. Everyday work in real machine learning could not be further from a pipeline.

Let me start with a concrete example. Some years ago, I was involved in a project whose goal was to detect all possible obstacles, such as signs, barriers, and pedestrians, from dash cameras. We followed this basic pipeline in earnest: defined a reasonable labeling protocol, gathered ample in-domain video, and connected with annotation vendors to handle the labeling. And we were careful: we reviewed the protocol in multiple sessions; we documented it; we generated examples of what we expected to be corner cases. We were set. Once we got the data back from the annotation vendors, we reviewed a random subset to ensure it matched the protocol, split the dataset into training and evaluation sets, then set off to train some models. 70% performance on the evaluation set.

Hyperparameter tuning. 72% performance. Huh. That's not where we want it to be. What happened, we thought, puzzled. SOTA models yield much higher numbers than that. We cannot deploy a model with 72% accuracy. We had invested two to three months of work. Now we needed to work through the demoralizing analysis of what went wrong. Of course, our first angle was inspecting the code. Nope. The models. Nope. Stumped. It wasn't even the data. We repeated our review. Fine. Hmm.

Were there common mistakes in the evaluation dataset? OK, we spent some time spinning up code to define "common mistakes" and to visualize them. This throwaway code took the false positives and false negatives and clustered them by the model embeddings. Then our mistake hit us like a brick wall.
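For concreteness, that kind of throwaway script might look something like the sketch below. This is not our original code; the variable names, error labels, and cluster count are hypothetical, and the only point is to group mistakes by embedding similarity so recurring patterns become visible.

```python
# Hypothetical sketch of the throwaway analysis described above: cluster the
# false positives and false negatives by their model embeddings so that
# recurring mistake patterns (e.g., dense crowds of pedestrians) surface.
# The inputs `embeddings` (N x D array) and `error_types` (N labels, "FP" or
# "FN") are assumed to already exist.
import numpy as np
from sklearn.cluster import KMeans

def cluster_mistakes(embeddings: np.ndarray, error_types, n_clusters: int = 8):
    """Group FP/FN examples by embedding similarity and summarize each cluster."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    cluster_ids = kmeans.fit_predict(embeddings)
    for c in range(n_clusters):
        members = np.where(cluster_ids == c)[0]
        fps = sum(error_types[i] == "FP" for i in members)
        fns = sum(error_types[i] == "FN" for i in members)
        print(f"cluster {c}: {len(members)} mistakes ({fps} FP / {fns} FN)")
    return cluster_ids
```

Visualizing a handful of examples from each cluster is usually what reveals the systematic failure mode, as it did for us here.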

Figure 1 Here is an example image from the Berkeley Deep Drive Dataset that helps explain this protocol challenge. In this scene we find lone pedestrians (on the left). We also find small groups of them in the street, and a large throng of them on the sidewalk. The density of the object of interest correlates with the amount of inside-throng occlusion, making it difficult to accurately delineate boundaries. It is obvious that the lone pedestrians on the left should be annotated independently. After that, however, it is not clear how to approach the small groups or the throngs. BDD chose to annotate individually at all times, but this comes with great annotation cost, the potential for annotation and subsequent detection noise, and the later challenge of aggregating individual detections into actionable groups post hoc. Other groups choose to annotate at distinct density levels and subsequently take care to avoid polluting training and validation sets with semantically overlapping labels.

It was not the data. It was not the model. Instead, it was the protocol. Aside from signs, the other content — barriers, pedestrians, cars, etc. — all appear in various quantities concurrently, in groups. Sometimes, there’s a lone pedestrian, sometimes a few together, sometimes in droves. An example of this phenomenon is above in Figure 1. Not only did this mean that our initial baseline performance numbers were useless, it meant we had to start over and revise the protocol because of this simple oversight.

Does that sound like a pipeline to you? No, me neither. In fact, both as a professor and as a founder of an artificial intelligence startup, I've had the luxury of interacting with literally hundreds of ML teams. In nearly every case, I've heard about the process of going back to the drawing board for the label space, or the inadequacy of the evaluation dataset in terms of measuring production performance, or some other issue. It seems the machine learning process is much more complicated than we thought, and certainly much more complicated than we'd like. And, surprisingly, it seems to rarely involve the model choice or the implementation.

Why are we so stubborn?

Why do we see so much emphasis on the machine learning pipeline? The notion of a pipeline for ML work is literally everywhere on the web. For example, "streamline and speed up the process of tasks… [such as] getting data from the data lake, cleaning and preprocessing the data." There is discussion of how it speeds up work, how it saves money, how it allows scalability, and so on. Figure 2 summarizes the canonical machine learning pipeline.

Figure 2 A canonical machine learning pipeline that dominates the discussion of building machine learning solutions. Typical workflows are entrenched in a separation between data work and model work with limited or no backwards flow of information to adapt to learnings during the development process.

Perhaps it is simply easier to think of the machine learning process as a pipeline, with a beginning, a middle, and an end. Or, more likely, codifying the flow of data and resources through the many types of work that machine learning and data engineers do makes sense from an infrastructure standpoint. Furthermore, infrastructure is well defined; it is natural to build businesses around building and selling infrastructure. And, with an onslaught of marketing around the ML pipeline, what else is one to think they need but such pipeline infrastructure?

Similarly, there seems to be a structured process when it comes to the work machine learning and data engineers do to go from problem statement to deployable model (not even beginning to think about evolving data distributions and model drift). Hence, managers and team leads seek a language to manage that process. Businesses do not thrive with lone developers in basements solving the most critical technical challenges of an organization. No, teams coalesce around modularized technical problems and plans. Planning requires predictability and certainty. So we invent an (unfortunately wrong) predictability around machine learning processes.

Or perhaps it is because the machine learning research community has consistently emphasized modeling as the key thing to focus on, rather than data or a notion of model-and-data co-development. This relaxes the need to consider and analyze the protocol and data quality; it suggests the data is created once and that only after that happens does the interesting work begin, which, well, could not be more wrong.

Doing Better than the Machine Learning Random Walk

I liken the real machine learning process to a random walk. The machine learning problem is often fairly well-defined. Yet we take a random walk through some complex space defined by the cross-product of possible datasets and models. It's a massive and complicated space that evades careful definition. But, at any instant, we are at a point in that space. As we modify a dataset or a model (architecture or parameters), we move through that space. Because the space itself eludes definition, it is impossible to measure a "gradient" of our work in a principled manner.

Conceptualizing it as a pipeline is like choosing a point in that space somewhat at random and hoping for the best. When that fails for one reason or another, we do simple things like hyperparameter tuning or getting more data.

These are akin to small random jaunts from our initial point. In other words, this is a random walk. There is usually limited awareness on the part of the machine learning or data engineer of how one option ranks relative to another. Sometimes these individuals will spin up throwaway scripts to visualize model outputs in certain ways or to probe certain corner cases of their dataset.

Analysis is hence at the heart of the machine learning random walk. The more one can reduce the uncertainty around each decision — each jaunt through the space — the faster one can navigate the random walk toward a performant system. This could mean visualizing the embeddings of a dataset based on the model outputs for only a certain subset of classes. This could mean selecting a subset of the dataset (even the training set) based on false positives and confirming whether they are indeed annotated correctly. This could mean comparing different candidate models' performance across subsets of the dataset, as in the sketch below.
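As one hypothetical illustration of that last kind of analysis, comparing candidate models across slices of an evaluation set can be as simple as the following sketch. All of the names here (the slicing predicates, the prediction dictionaries) are illustrative assumptions rather than any particular tool's API.

```python
# Hypothetical sketch: compare candidate models' accuracy across slices of an
# evaluation set (e.g., "lone pedestrian" scenes vs. "crowded" scenes).
from collections import defaultdict

def sliced_accuracy(samples, predictions_by_model, slicers):
    """
    samples: list of (sample_id, true_label)
    predictions_by_model: {model_name: {sample_id: predicted_label}}
    slicers: {slice_name: predicate(sample_id) -> bool}
    Returns {slice_name: {model_name: accuracy}}.
    """
    report = defaultdict(dict)
    for slice_name, keep in slicers.items():
        subset = [(sid, y) for sid, y in samples if keep(sid)]
        for model_name, preds in predictions_by_model.items():
            correct = sum(preds.get(sid) == y for sid, y in subset)
            report[slice_name][model_name] = correct / max(len(subset), 1)
    return dict(report)
```

A per-slice table like this often reveals that the "better" aggregate model is actually worse on exactly the cases that matter, which is precisely the uncertainty one wants to reduce before taking the next step.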

And, concretely, the more one can reduce the uncertainty around each decision, the less "random" the walk becomes; Figure 3 describes the impact of analysis on the machine learning pipeline, transforming it into a cyclical process.

Figure 3 Analysis transforms machine learning work from an unrealistic pipeline to a cyclical workflow. At each iteration through the cycle, appropriate uncertainty-reducing analysis work ultimately speeds up the process from ML problem to ML solution.

Of course, what analysis needs to be done to best reduce uncertainty at each decision juncture depends on the specific situation. Knowing what to do relies to a large degree on experience and analytical reasoning.

Interestingly, the available content on how to conduct these analyses is rather sparse, both in the technical literature and on the web. Yet there are dissertations that could be written around even small components of this type of analysis. I look forward to seeing this topic evolve in the coming years.

At this point, I'm comfortable accepting the reality that we need to focus more on the analysis part of the cycle. Given that, finding good software tools that make these analyses easier and more fruitful to conduct becomes important.

Interestingly, there is a community of tools growing around this notion of analysis-oriented machine learning workflows. I am a co-founder of the company that develops the open-source FiftyOne artificial intelligence toolset, which is one approach to reducing uncertainty in these local decisions. Of course, I am going to recommend you use FiftyOne. But I'm here more broadly to advocate that machine learning and data engineers use any of these tools: it will lead to better systems and better performance in friendlier delivery cycles.
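To make the idea concrete, here is a short sketch of the kind of uncertainty-reducing workflow such tools enable, using FiftyOne's documented detection-evaluation workflow on its bundled quickstart dataset. Treat it as illustrative and check the current FiftyOne docs for exact signatures.

```python
# Sketch: surface false-positive detections for review using FiftyOne.
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

# Small sample dataset that ships with ground truth and model predictions
dataset = foz.load_zoo_dataset("quickstart")

# Mark each predicted detection as a TP/FP/FN against the ground truth
dataset.evaluate_detections("predictions", gt_field="ground_truth", eval_key="eval")

# Build a view containing only the false positives and inspect it in the App
fp_view = dataset.filter_labels("predictions", F("eval") == "fp")
session = fo.launch_app(fp_view)
```

Whether you use FiftyOne or something else, the point is the same: each of these small, targeted looks at the data shrinks the randomness of the next step.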

Wrapping Up

Optimizing the machine learning process requires an appreciation for the high levels of uncertainty present. Whereas classical computational thinking may lead one to an infrastructure-focused orientation that emphasizes the speed and effectiveness with which data gets processed, optimizing machine learning work instead requires effective mechanisms to uncover false assumptions, mistakes, and other mishaps in the way the problem statement was rendered into a computational system.

I imagine it seems natural to think: ok, fine, once this uncertainty is managed and we have a deployable model, then we can set up our fast, automated pipelines and crunch away in production. To this thought, I’d answer a very clear “maybe.” Sure, one wants to optimize and automate. It certainly makes sense. But, I would still exercise caution: data drift and evolving production expectations create a need to constantly measure and evolve these machine learning systems.

In all of these situations, clear software-enabled analytical decisions are required to optimize the machine learning random walk.

Acknowledgements

I am grateful to the helpful commentary on early drafts of this article provided by Jacob Marks, Dave Mekelburg and Jimmy Guerrero.

Biography

Corso is Professor of Robotics, Electrical Engineering and Computer Science at the University of Michigan and Co-Founder / Chief Science Officer of the AI startup Voxel51. He received his PhD and MSE degrees from The Johns Hopkins University in 2005 and 2002, respectively, and the BS degree with honors from Loyola College in Maryland in 2000, all in Computer Science. He is the recipient of a University of Michigan EECS Outstanding Achievement Award (2018), a Google Faculty Research Award (2015), a SUNY Buffalo Young Investigator Award (2011), the Army Research Office Young Investigator Award (2010), a National Science Foundation CAREER Award (2009), and the Link Foundation Fellowship in Advanced Simulation and Training (2003), and he was a member of the 2009 DARPA Computer Science Study Group. Corso has authored more than 150 peer-reviewed papers and hundreds of thousands of lines of open-source code on topics of his interest, including computer vision, robotics, data science, and general computing. He is a member of the AAAI, ACM, and MAA and a senior member of the IEEE.

Copyright 2024 by Jason J. Corso. All Rights Reserved.

No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law. For permission requests, write to the publisher via direct message on X/Twitter at _JasonCorso_.
