How Developer Tooling Continues to Shape the AI Landscape

Jason Corso
12 min read · Feb 15, 2024


Take a journey through the software and service tools that have provided the scaffolding for the amazing advancements we have and we will continue to see in unstructured AI.

Depiction of the evolving AI developer tool landscape. Generated by DALL-E.

It’s a standard thing to say that the deep learning movement (and by extension the current unstructured AI movement) was made possible by vast quantities of available data, by significant advances in compute through GPUs, and by algorithmic advances. I agree. But I think there is another element to that picture that is somehow left out despite its centrality: the developer tooling, whether software or services, that supports much of the everyday work in research and development for AI.

So, let’s explore what tooling may have made this current AI summer possible.

Although I believe the ideas in this article are applicable to general AI, I focus on unstructured rather than structured AI in this article, as it is my expertise. Structured AI implies the underlying sample space (the data) can easily be captured as rows in a spreadsheet or classical records in a database table; in other words, anything that follows a predefined schema. Think things like bank records, customer profiles, etc. In contrast, unstructured AI implies the underlying sample space is not easily captured as tabular record-value stores, but instead captures a more complex, often high-dimensional space, such as the space of images of a certain size, or the space of natural language strings of a certain length, or similar. Unsurprisingly, unstructured data dominates the globe, comprising as much as 80 to 90 percent of all data generated.

In this exploration, I delve into the indispensable realm of developer tooling that has quietly underpinned the strides made in unstructured AI. By dissecting the various waves of innovation — from the proliferation of open source software frameworks that provide the very bedrock of AI development, to the rise of human annotation and experiment tracking technologies — I discuss how these tools have not just supported but helped to propel the field forward. This article not only lays out a clear picture of the recent history of developer tools for AI, but also paints a set of possible future tooling that will continue to evolve the landscape.

The First Wave: Open Source Software Frameworks

The rise and availability of widespread, well-documented software frameworks for unstructured AI has provided the very chassis on which the current movement is riding.

Open source software frameworks for AI have been around for decades and span focus areas like general machine learning (Scikit-Learn, introduced in 2007), computer vision (OpenCV, introduced in 1999), natural language processing (NLTK, introduced in 2001), and more. With the increasing emphasis on artificial neural networks, multiple early open source frameworks were released to support what we now think of as deep learning: Torch was introduced in 2002, Theano in 2007 (and is now discontinued), and Caffe in 2014.

These early open source frameworks laid the groundwork for two significant investments from industrial leaders in the AI space — TensorFlow and PyTorch — that dominate the AI ecosystem today (see Figure 1 below). TensorFlow, a Google project first released in 2015, emphasizes scalability and optimization in deep learning, whereas PyTorch, developed by Facebook’s FAIR and first released in 2016, takes a complementary approach that seeks flexibility through a dynamic graph structure, which is perhaps more useful in the early stages of a modeling project.
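To make that distinction concrete, here is a minimal sketch of PyTorch’s define-by-run behavior (my own illustration, not from the original article): the computation graph is built as ordinary Python control flow executes, which makes data-dependent branching straightforward.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Toy model whose forward pass uses ordinary Python control flow."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        # The graph is constructed as this code runs, so a data-dependent
        # branch like this is handled naturally by the dynamic graph.
        if x.norm() > 10:
            x = x * 0.5
        return self.fc2(x)

model = TinyNet()
out = model(torch.randn(4, 16))  # the graph is built during this call
out.sum().backward()             # gradients flow through whichever branch ran
```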

Figure 1: Unstructured AI framework adoption as measured by utilization in research papers. Source: https://paperswithcode.com/trends

The Second Wave: Human Annotation

Modern, learning-based AI systems have one or more models at their core; these models require data to estimate their parameters during the learning phase. For the standard practitioner, this learning phase executes a process called supervised learning, which is favored over alternatives like unsupervised learning because it is better understood and proven. During supervised learning, the “data” is a combination of the raw content itself (the images, the audio files, etc.) along with “labels.” These labels represent the target outputs that the models are to predict. For example, in an image classification problem involving dog breeds, the data would be the image and the label would be the dog breed, such as “golden retriever,” “akita,” “havanese,” and so on.
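As a concrete illustration of the data-plus-label pairing (my own sketch with a toy model and hypothetical labels, not code from the article), one supervised training step looks roughly like this in PyTorch:

```python
import torch
import torch.nn as nn

# Hypothetical batch: 8 images (3 x 224 x 224) and their integer breed labels
images = torch.randn(8, 3, 224, 224)
labels = torch.tensor([0, 2, 1, 0, 3, 2, 1, 0])  # e.g., 0 = "golden retriever"

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 4))  # toy classifier
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

logits = model(images)            # model predictions
loss = criterion(logits, labels)  # compare predictions to the human labels
loss.backward()                   # compute parameter updates driven by the labels
optimizer.step()
```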

From where do these labels come? Humans.

Yep. From all corners of the globe, humans are “connected” to the data along with instructions. Typically, the annotation is done without direct interaction with the team building the AI system.

This second wave of tooling comes in two flavors. First are the open source tools. Early contributions from academia, such as Carl Vondrick’s VATIC (see Figure 2 below), and industry, such as CVAT, which was originally a project at Intel, paved the way for in-house annotation workflows, often leading to whole teams of annotators hired to support various unstructured AI project needs in situations where it was deemed too sensitive or too expensive to send the data off-site.

Figure 2: Snapshot of the VATIC interactive video annotation tool, one of the earliest open source annotation tools. Source: https://www.cs.columbia.edu/~vondrick/vatic/

Second, a significant set of vendors, each with their own commercial offerings, grew around the need. From early players that have since been acquired, like Figure Eight (by Appen) and Mighty AI (by Uber), to certified B-corps like Sama, whose mission is “to expand opportunity for low-income individuals through the digital economy,” the need for and impact of annotation vendors quickly grew. Today, among the dozens of vendors, notable players include Scale AI, Labelbox, V7, Sama, and SmartOne.

It makes sense to me that human annotation was the first unique-to-AI wave of enabling capabilities. The need for labels in supervised learning is well-understood, and the engagement is rather transactional: provide data, pay, get labels, done.

The annotation market was measured to be $800 million in 2022. However, as I recently wrote, I am convinced this human annotation need and market will evolve rapidly in the coming years. In fact, some annotation companies are already making efforts to augment or replace human annotation with automated means.

The Third Wave: Experiment Tracking

Combining the core software frameworks with annotated data from the first two waves means that we can actually train the models. However, importantly, there is no clear best model that works in all situations, an instance of the No Free Lunch Theorem. One must consider various model architectures, often based on intuition rather than any specific principled insight. Even worse, for any specific architecture, there are multiple “hyperparameters” that govern the training process, such as learning rate, initialization conditions, etc.

Hence, for any model training scenario, there may be hundreds, thousands, or a practically unbounded number of possible training runs that must be executed to find the “best” model. How do you keep track of all of that work? Experiment tracking software.

Experiment tracking capabilities are the natural next step of machine learning tooling. They are as well-defined and well-understood as annotation and similarly transactional. Although many of us began with spreadsheets or Org Mode for tracking on top of very lightweight monitoring capabilities like TensorBoard, these ad hoc approaches rapidly gave way to more sophisticated tools. Again, these come in two forms, the first being open source. MLflow is probably the most well-known open source experiment tracking tool.
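To show what tracking looks like in practice, here is a minimal MLflow sketch (my own example; the run name, parameters, and the stubbed training function are illustrative): each run logs its hyperparameters and per-epoch metrics so runs can later be compared in the MLflow UI.

```python
import random
import mlflow

def train_one_epoch(lr: float) -> float:
    """Stand-in for a real training loop; returns a fake validation accuracy."""
    return random.uniform(0.7, 0.9)

learning_rate = 1e-3

with mlflow.start_run(run_name="baseline"):
    # Record the hyperparameters that define this experiment
    mlflow.log_param("learning_rate", learning_rate)
    mlflow.log_param("batch_size", 64)

    for epoch in range(10):
        val_accuracy = train_one_epoch(learning_rate)
        # Log a per-epoch metric so this run can be compared against others
        mlflow.log_metric("val_accuracy", val_accuracy, step=epoch)
```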

Commercial alternatives to open source are widespread as well. Notable players include Comet ML and Weights & Biases (see Figure 3 below for a screenshot of their app). Because they don’t rely on any human-in-the-loop capabilities, these tools tend to offer either free or low-cost solutions for individuals, unlike the annotation vendors, which are primarily available only for enterprise-scale work.

Figure 3: A screenshot of the Weights & Biases experiment tracking app. Source: https://wandb.ai

Experiment tracking is a key component of the modern unstructured AI stack. The future of this capability is unclear to me as it seems difficult to differentiate one offering from another. It’s no surprise to me that for at least the commercial offerings, we are beginning to see additional platform capabilities like model registries and feature stores.

What’s next?

What then is the fourth wave?

Advances in generative AI might seem like obvious candidates, including LLMs, diffusion models, foundation models and vector databases. For example, when data is processed through a trained model, it undergoes a projection from its initial representation (e.g., the image) into a lower dimensional representation, commonly called an embedding vector. The properties of this resulting representation depend on how the model was trained, but typically seek to place “similar” vectors nearby in the space, depending on the relevant notion of similarity, such as shape, semantic meaning, or what have you. Being able to quickly and effectively store many of these meaningful embedding vectors is the foundation of numerous capabilities in generative AI.
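As a rough sketch of what a vector store provides, the core operation is a nearest-neighbor lookup over stored embeddings; the tiny NumPy example below (my own illustration, with random vectors standing in for real embeddings) shows the idea, while real vector databases add indexing, persistence, and scale.

```python
import numpy as np

# Assume a trained model has already mapped 1,000 items into 512-d embeddings
embeddings = np.random.randn(1000, 512).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit-normalize

def nearest(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k stored vectors most similar to the query (cosine)."""
    query = query / np.linalg.norm(query)
    scores = embeddings @ query        # cosine similarity via dot product
    return np.argsort(-scores)[:k]

query_vec = np.random.randn(512).astype(np.float32)
print(nearest(query_vec))  # indices of the most similar stored items
```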

However, in my opinion, these are better characterized as capabilities rather than tooling. Sure, one can build wrappers on top of them, like GPT-XYZ, but it’s not clear if these wrappers are independently viable, or if they will be consumed by the underlying tech.

In the sections below, I present some possibilities about how I think a fourth wave may evolve.

One emerging trend in the AI process is the evolving focus on unstructured data, spurring a wide variety of data-centric and advanced model tuning tools. This is happening because the broader machine learning practitioner community is increasingly cognizant of the critical role data plays alongside models in engineering AI systems: getting the right data distribution is key to model and overall system performance. On the model side, the size and resources required for model training put ever more focus on optimization and resource allocation.

Towards Less Supervision

If we change the basic assumption that fully supervised learning is the best way to train unstructured AI systems in practice, that leaves us in a situation where we need to leverage data that is not annotated. Although the software frameworks (first wave) are sufficiently general to support this, tooling is needed to support the effective selection and analysis of large data lakes for utilization in model training workflows. Examples of tools like this that I am aware of are Snorkel, which focuses on semi-supervision, and Lightly AI, which maintains the focus on full supervision but incorporates an active learning mindset.
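One simple instance of the “select the most useful unlabeled data” idea is uncertainty sampling, sketched below (purely illustrative; tools like Lightly AI use considerably more sophisticated selection strategies, and the prediction array here is hypothetical):

```python
import numpy as np

def select_for_labeling(pred_probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the unlabeled samples the current model is least confident about.

    pred_probs: (num_samples, num_classes) softmax outputs on the unlabeled pool.
    Returns indices of the `budget` most uncertain samples to send for annotation.
    """
    confidence = pred_probs.max(axis=1)     # top-class confidence per sample
    return np.argsort(confidence)[:budget]  # least confident first

# Hypothetical predictions from a partially trained 10-class classifier
probs = np.random.dirichlet(np.ones(10), size=5000)
to_label = select_for_labeling(probs, budget=100)
```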

Data and Model Analysis Tools

Building AI systems is an iterative process. You work with data — curation, annotation, indexing, selection, etc. You work with models — discovery, architecture search, hyperparameter tuning, training, evaluation, etc. If performance is above bar for the task, you then deploy.

However, if the performance is not above bar for the task, what do you do? Add more data? Train more models? If you cannot adequately plan the next best step through principled analysis, it is like walking down a dark hallway unsure about every step. So, you need an ability to ask questions about your data and models that unearth the failure modes and suggest possible remediation. These are analysis tools. They support visualization, querying, investigation and other aspects of data and model analysis that are critical to navigating this iterative process throughout the ML engineering life-cycle. One such tool is FiftyOne (disclosure: I wrote part of this tool, as it is from my startup company), which is available both as an open source tool supporting full-stack AI/ML analysis for individuals and as a team-based enterprise version (screenshots below in Figure 4).

Figure 4: Screenshots from FiftyOne, an analysis tool for unstructured AI data. Source: https://voxel51.com
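As a flavor of this kind of querying, here is a minimal FiftyOne session (a sketch based on FiftyOne’s bundled quickstart dataset; the confidence threshold is arbitrary) that pulls up the samples where a model’s detections were least confident for visual inspection:

```python
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

# Load a small demo dataset that ships with precomputed model predictions
dataset = foz.load_zoo_dataset("quickstart")

# Build a view containing only the low-confidence predictions
low_conf_view = dataset.filter_labels("predictions", F("confidence") < 0.3)

# Launch the interactive app to visually inspect potential failure modes
session = fo.launch_app(low_conf_view)
```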

Data Cleaning

Although there is an ongoing debate about whether one needs more data or cleaner data, there are some new tools that offer an ability to automatically clean data prior to its use (in various fully supervised or weakly supervised settings). Data cleaning capabilities lie on a spectrum, with ensuring that the data samples and annotations themselves are not corrupt on one end, and an ability to find sampling gaps in the underlying data distribution on the other. This is considerably more sophisticated than the classical QA one expects from a data annotation vendor. There are numerous tools for the structured AI space, such as Zoho and Osmos, but comparatively few — one is Cleanlab — in the unstructured space due to the higher complexity of the problem. There is also some capability overlap between data cleaning and the two earlier directions, towards less supervision and analysis tools.
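For a concrete flavor of automated cleaning, Cleanlab can flag likely label errors given a model’s out-of-sample predicted probabilities; a minimal sketch follows (the label and probability arrays are hypothetical stand-ins for real model outputs):

```python
import numpy as np
from cleanlab.filter import find_label_issues

# Hypothetical inputs: noisy human labels and out-of-sample softmax outputs
labels = np.random.randint(0, 5, size=2000)
pred_probs = np.random.dirichlet(np.ones(5), size=2000)

# Indices of the samples whose labels are most likely wrong, worst first
issue_idx = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"{len(issue_idx)} samples flagged for human review")
```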

Synthetic Data

It can be dramatically cheaper and more versatile to train AI models with machine-generated data than with real data. With the right tooling, one can create data along myriad situational permutations and, since it is machine generated, it is implicitly annotated as well. Many such capabilities are cropping up, such as Synthesia’s and DeepBrain AI’s video generation tools, and Duality’s ability to create dynamic, realistic 3D scenarios. However, the sim2real gap — the observed phenomenon that models trained on synthetic data do not perform as well as models trained on real data because of differences in the underlying data distributions — is a notable challenge.

Model Optimization

Even after a trained model’s performance reaches a level suitable for deployment, there is more work to be done. The model needs to be optimized to run as efficiently as possible. For example, a “big” model may not be suitable for deployment on a certain platform due to computation and power costs. Model distillation addresses this by training a smaller, more efficient model that replicates the behavior of the larger, more complex model. One seeks to retain the performance of the larger model while reducing the computational burden in the resulting smaller model. Neural Magic, for example, enables its users to deploy large-scale models like LLMs on CPUs. Sometimes the model optimization process is included as an optional step in an AI DevOps workflow, as with offerings from Latent AI and OctoAI.
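At its core, distillation trains the student against the teacher’s softened outputs in addition to the hard labels; a common formulation looks like the sketch below (my own example in PyTorch, with arbitrary temperature and weighting, not tied to any particular vendor’s implementation):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.5):
    """Blend a soft-target KL term (match the teacher) with the usual hard-label loss."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Hypothetical logits from a large teacher and a small student on one batch
teacher_logits = torch.randn(32, 10)
student_logits = torch.randn(32, 10, requires_grad=True)
labels = torch.randint(0, 10, (32,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```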

Figure 5: Diagram showing the workflow of efficient model optimization in the Latent AI Efficient Inference Platform. Source: https://leipdocs.latentai.io/home/content/about/

Remarks on the Coming Fourth Wave

I do not include AutoML-like capabilities in this writeup because I don’t see a clear and commonly accepted definition of AutoML. Either it means a full end-to-end pipeline for conducting model training on provided data, which is just a concatenation of other tools, or it means holy-grail-like capabilities that propose to turn problem specifications into full ML solutions, which is fiction.

As I noted in the descriptions above, some of these are more mature directions than others. It’s not clear to me there needs to be a single fourth wave of tooling, as many of these angles are likely to be intensely valuable in the evolving landscape of unstructured AI.

Closing

Reflecting on the role of developer tooling in the unstructured AI landscape, it becomes clear that the innovations and advances we often celebrate are not solely the products of abstract algorithmic breakthroughs, massive dataset contributions, or the exponential growth of computational power. Rather, they are deeply intertwined with the practical scaffolding provided by the diverse array of tools and services that support the day-to-day endeavors of AI research and development. From the early software frameworks to core capabilities like annotation and experiment tracking, this scaffolding is a critical facet of AI development. Having explored the possible scaffolding to come and named five axes of new developer capabilities that are growing in usage, I eagerly await how this story will unfold in the coming years.

Acknowledgements

Thank you to my friends and colleagues who read early versions of this article and inspired powerful changes, especially Brian Moore, Dave Mekelburg, and Michelle Brinich.

Biography

Jason Corso is Professor of Robotics, Electrical Engineering and Computer Science at the University of Michigan and Co-Founder / Chief Science Officer of the AI startup Voxel51. He received his PhD and MSE degrees at Johns Hopkins University in 2005 and 2002, respectively, and a BS Degree with honors from Loyola University Maryland in 2000, all in Computer Science. He is the recipient of the University of Michigan EECS Outstanding Achievement Award 2018, Google Faculty Research Award 2015, Army Research Office Young Investigator Award 2010, National Science Foundation CAREER award 2009, SUNY Buffalo Young Investigator Award 2011, a member of the 2009 DARPA Computer Science Study Group, and a recipient of the Link Foundation Fellowship in Advanced Simulation and Training 2003. Corso has authored more than 150 peer-reviewed papers and hundreds of thousands of lines of open-source code on topics of his interest including computer vision, robotics, data science, and general computing. He is a member of the AAAI, ACM, MAA and a senior member of the IEEE.

Disclaimer

This article is provided for informational purposes only. It is not to be taken as legal or other advice in any way. The views expressed are those of the author only and not his employer or any other institution. The author does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by the content, errors, or omissions, whether such errors or omissions result from accident, negligence, or any other cause.

Copyright 2024 by Jason J. Corso. All Rights Reserved.

No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law. For permission requests, write to the publisher via direct message on X/Twitter at _JasonCorso_.
