Who Owns the Future? Looking at the next AI disruption post-ChatGPT

The Future of AI in the World of Foundational Models

Jordan Volz
16 min read · Jan 20, 2023

[Author’s Note: This article is not written by Generative AI]

Advancements in Generative AI (GenAI) thrust AI into the mainstream in 2022 in a way we haven't seen since IBM Watson's Jeopardy! victory in 2011. From ChatGPT and Copilot to DALL-E, Midjourney, and Stable Diffusion, there was something of interest for everyone, and many were impressed with the ease of use and sophistication of the models as they rolled out. Whereas Watson ultimately failed to capitalize on its initial excitement, the current stack of technologies is sufficiently varied and robust that it appears we're on the verge of something big. (GenAI has overtaken Web 3.0 as your local VC's favorite hype topic, so it must be doing something right. At the very least, it will be well-funded for the next few years.)

There's also endless mindshare being spent on the possibilities of all this tech, most of it laser-focused on creative applications: How does this tech disrupt the work of a digital artist? Or a screenwriter? Or a film editor? And so on. And, yes, while all of these jobs are definitely going to see some abrupt and likely jarring adjustments in the near future (for better or worse; this is definitely up for debate), I see very little discussion of how this is actually going to disrupt the data community. In many ways, I think the disruption there will be orders of magnitude larger than what we may see with the creative crowd, and it's harder to object to AI being applied broadly to the data practice: while people may object to "art" being created by AI (if you could even consider it art at all), fewer people will shed tears over AI doing all the data work in a company. That is what I want to focus on here. I'll look at this through the lens of the DS/ML crowd, as that is what I know best, but I think the conclusions are broadly applicable across all data roles. Let's dive in.

Tweet by ChatGPT

We can start by asking “What is the state of the ML practice today?” This is something I’ve spent a lot of time covering in previous articles, so I won’t dwell on it too much here, but let’s summarize some highlights:

  1. If you remove high-tech companies building things like ChatGPT, Copilot, et al., most ML teams are re-solving already-solved problems for their business. To put it a little more nicely, they work on "building and implementing ML solutions to extract value from their company's data." But also, you start to lose feeling on the left side of your face when you have to write your 25th time-series forecast model. At least a new framework comes out every year or so, which gives us an excuse to go back into "development mode" for several months while we evaluate options.
  2. The "hard part" for these teams is not the actual "ML work" of building models. It's everything surrounding it (I'm sure we've all seen the page 4 graphic in this report before), such as: dealing with infrastructure, which ML/DS teams are not experts in; accounting for all the metadata surrounding the ML workflow, like model experiments, promoted & deployed versions, model artifacts/binaries, etc. (a minimal tracking sketch follows this list); tracking metrics for production models over time; understanding the proper chain of command and release strategy for new & updated models (i.e. model governance); and, last but not least … data. Data is always the largest problem. It's often not clean, not where you need it to be, or it doesn't exist at all. But it's important that your models be really good nonetheless, so says the business.
  3. Tooling has progressed over the last decade. MLOps is now a thing and comes packed with lots of different solutions for various parts of the ML workflow. However, this is by and large still too complicated for all but the most advanced companies. Everyone else struggles to get production use cases deployed. In my career working with various companies, it's pretty common that a data scientist will send over a Jupyter notebook when they are ready to get something "into production". Software engineering best practices are a world away. As a result, ML teams require supporting teams of systems experts to get reliable results for the business. Or, they simply don't get them.
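
To make the metadata point in item 2 concrete, here's a minimal sketch of what even a single model run asks a team to track, using MLflow as one representative tracking tool. The dataset, parameters, and metric are placeholders, and this covers only one slice of the list above (no infrastructure, governance, or production monitoring).

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in data; in practice, getting the data is where the real pain begins.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="churn-baseline"):
    params = {"n_estimators": 200, "max_depth": 5}
    model = RandomForestClassifier(**params, random_state=0).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

    # The "metadata surrounding the ML workflow": experiment params,
    # metrics, and the model artifact itself, versioned per run.
    mlflow.log_params(params)
    mlflow.log_metric("test_auc", auc)
    mlflow.sklearn.log_model(model, artifact_path="model")
```

Multiply this by every experiment, every promoted version, and every retrain, and the bookkeeping quickly dwarfs the modeling.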

In the last couple of years, the data-centric AI movement has picked up a lot of popularity and it’s not hard to see why. The goal is to shift the field left, focusing the ML workflow on data instead of code and models. By doing so you can (hopefully) simplify the technology stack — or at least the parts of it that are exposed to users — and more easily build, iterate on, and maintain real production ML systems. Andrew Ng and Christopher Ré are two high-profile individuals pushing for this approach, but there are many companies emerging with this strategy.

It's still a bit too early to tell whether the data-centric/Operational AI space turns out to be a big hit or not. My gut feeling is that it is not a large enough simplification of the ML process for most companies out there. Especially in light of what can be accomplished with the ChatGPT-style GenAI stack, data-centric AI still requires companies to do a lot of AI. If I've learned anything in the last half-decade, it's that companies want AI, but they don't want to do AI.

Tweet by ChatGPT

I think that last statement is important. In the tech world, everyone will always agree that "ease of use" is super duper important, but many companies spend countless hours on products that are, frankly, very difficult to use. As long as you're arguably a little easier to use than the closest competition, sales & marketing are happy and prospects have no better option, so there's not much incentive to drastically reduce complexity. A surprisingly large amount of software is created by engineers who are not the target user profile, so it's often a large conceptual leap to really understand what "easy" even means.

AI/ML/DS has suffered from this pretty badly for most of its existence. I love to bash notebooks, but the reality is that notebooks are by far the quickest entry point into this world, and nothing else really touches them from a simplicity standpoint. There are lots of reasons to hate notebooks, but you can start doing work quickly, get a lot accomplished, and make some cool eye candy for your boss to look at, so it's not surprising that they've been a difficult habit to kick. So, we've ended up in a situation where we have a plethora of ML tools, built by experts for experts, but nothing is really moving the needle on ease of use. These tools, by and large, focus on making the process of doing AI simpler or faster, but they generally ignore the experience of using AI.

That is, until tools like ChatGPT, et al. started appearing. The UX is impressively simple: tell me what you want and I’ll just do it for you. Don’t worry about getting data, building models, writing code, tracking models, and so on and so forth. Just tell me what you want. Just use AI. Don’t do AI. It’s unclear to me why OpenAI hasn’t changed its slogan to “We do AI so you don’t have to,” but I think that’s the greatest unlock of value for the company thus far. This is, simply put, unlocking the value of AI by making the UX extremely accessible to anyone who is even faintly interested in it.

But these tools don't solve all your ML use cases (yet). So what use cases are they good for? The two (or three) main use cases that are publicly available are text-to-text (i.e. ChatGPT) and text-to-image (e.g. DALL-E, Midjourney, & Stable Diffusion). Copilot is technically text-to-text, although you could easily convince me without trying too hard that software code is nuanced enough to deserve its own category apart from just "text," and I think it's also different in one fundamental way, which I'll expand on shortly.

What do all of these have in common? First, they train on a lot of data and leverage complex modeling techniques. Getting into the details is outside the scope of this article, but feel free to dig around online, or better yet, just ask ChatGPT yourself.

Chat with ChatGPT. Content naturally courtesy of ChatGPT

The fact that a large corpus of data is used is pretty fascinating. Big Data had its moment in the spotlight, but the data science crowd was often pretty standoffish about actually building models on large amounts of data. You'd often hear that, "statistically speaking," a medium amount of data should suffice for a model, or that sampling was a much better approach than training on the full dataset. But here are some concrete examples where training large models yields some really fascinating results.

Secondly, I would classify two of these problems (ChatGPT and DALL-E) as "human-easy." By that, I mean that these are actually easy problems for humans to do (but, interestingly, they are also "machine-hard"). I can go out on the street and pick a random human. Chances are I will be able to have a conversation with them. They might not have all the answers I am looking for (and ChatGPT may not either), but I'll be convinced that it is a real conversation. Similarly, I could ask them to draw me something. The quality of that drawing will vary wildly based on the individual (as will your DALL-E results), but it'll likely be representative of what I asked for. Sure, these models are many times more efficient than a random human, but in a way, it's less impressive to tackle "human-easy" problems than "human-hard" problems.

On the other hand, Copilot is surely tackling a problem that is both human-hard and machine-hard. Software engineering is one of the most complex disciplines in … the world, and Copilot can convincingly create passable code. It's not perfect (nor is any developer), but it offers a significant performance boost for engineers. I've recently seen some marketing material about "citizen software developers," so it's possible we're at the beginning of the death of software development as we've known it for the last half-century.

Real Tweet Alert! Ok, here’s ChatGPT’s best attempt if you’re curious:
Tweet by ChatGPT

So, what we’ve conclusively learned in 2022 is that:

A) We can build models on large amounts of data that perform well.

B) We can solve human-hard problems with them.

This is relevant. The majority of use cases that non-high-tech ML teams are working on are human-hard problems on not-so-large amounts of data. We've talked about all the challenges they have with this process, and I'm here to propose the following:

Tweet by ChatGPT! Quoting your own tweets is lame, but it’s ok if it’s a fake tweet.

It makes sense to try to apply the techniques we have learned in constructing large language models (LLMs) to the realm of tabular machine learning. We know that:

A) Most of these problems are human-hard. I.e., if I give a random human my customer information and ask which customers are going to churn, they're going to run away. Even if I ask a data scientist to do it, there's a non-trivial chance that I get back several garbage models before something useful surfaces.

B) From an ML perspective, these are actually easy problems to solve, usually handled with simple regression, classification, or time-series forecasting techniques. The hard parts are the data parts: understanding the data we have, what's wrong with it, how it needs to be cleaned up, and what transformations to apply to it (see the sketch after this list). I don't think I've ever looked behind the scenes at a company's "production" data system and not found at least a couple of issues with their "clean" data. This isn't a knock; I'm just highlighting that data is hard and humans are generally ill-prepared to handle its complications efficiently.

C) A + B creates a serious drag on getting value back to the business. Businesses want to use AI, but they don't want to do AI. They'll fund it if they have to, but if they can just get results without having to hire an entire ML team, how is that not more desirable?
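
To illustrate point B, here's a minimal sketch of a typical churn model as it gets built today. All column names and cleaning steps are made up for illustration; the point is simply that the modeling is two lines and the data wrangling is everything else.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# --- The hard part: hypothetical, messy customer data ---
df = pd.read_csv("customers.csv")                         # hypothetical extract
df = df.drop_duplicates(subset="customer_id")             # duplicate rows sneak in
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())
df["tenure_days"] = (pd.Timestamp("2023-01-01") - df["signup_date"]).dt.days
df = df[df["tenure_days"] >= 0]                           # drop rows with bad timestamps
features = pd.get_dummies(df[["plan", "region"]]).join(df[["tenure_days", "monthly_spend"]])

# --- The easy part: a simple classifier ---
X_train, X_test, y_train, y_test = train_test_split(features, df["churned"], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```

Even this toy version skips the messier realities: joins across systems, leakage checks, backfills, and so on.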

So, it appears that this large segment of the market is ripe for AI disruption. By providing a vastly simplified UX for AI inquiries, you can start to deliver on the promise of making these businesses AI-powered or AI-centered, or whatever other term the analysts are hyping these days. Imagine signing up for a service where you simply give it access to your data and start asking questions like "What's our sales forecast for the next 12 months?" or "Which customers will churn in the next 2 quarters?" or "What products are most likely to sell best among people aged 22–40 between Thanksgiving and Christmas?" This is the future we are headed for.

[Aside: A bit outside the scope of what I'm going for here, but hopefully relevant: for years companies have heard from business analysts (HBR, McKinsey, etc.) that to truly excel in AI you need to invest in tooling, people, education, and so on. This is true when you are primarily focused on doing AI, but I believe the future we are now looking at refocuses on using AI, and … well, I think most of that advice will prove to be pretty misguided in the end.]

How do we accomplish this? I won't claim it's simple, but just as we've built AI that can understand human language, we can build AI that understands data. It's not a stretch to think that I could give a complex model access to my data warehouse and a problem to solve, like customer churn, and it would be able to figure out what data is relevant, what issues there are with it, clean it up, and transform it into the right shape before building a predictive model on it. You wouldn't even need to tell it where to look for the relevant data. A lot of these tasks are human-hard, but it's hopefully readily apparent that AI could make short work of them. Instead of building a large language model, we can build large data models (LDMs?). These models will be able to connect to our data sources and understand our data in seconds. Let that sink in.
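
For contrast with the sklearn sketch above, here's a purely hypothetical sketch of what that UX might look like. None of this exists today; the package, class, and method names are invented solely to express the "ask questions of your data" idea as code.

```python
# Purely hypothetical: there is no "ldm" package, and LargeDataModel,
# connect(), and ask() are invented names illustrating the imagined UX.
from ldm import LargeDataModel

model = LargeDataModel(provider="your-cloud-vendor")   # hypothetical provider handle
model.connect("snowflake://analytics")                 # it discovers the relevant tables itself

churn = model.ask("Which customers will churn in the next 2 quarters?")
forecast = model.ask("What's our sales forecast for the next 12 months?")

print(churn.predictions.head())   # hypothetical: ranked customers with churn probabilities
print(forecast.explanation)       # hypothetical: which tables/columns it used and why
```

The data prep, feature engineering, model selection, and tracking from the earlier sketches would all happen behind that interface, which is exactly the point.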

Tweet by ChatGPT

There's one catch: only a few companies could possibly pull off building good LDMs. Whereas LLMs are primarily (or entirely) built on open data, LDMs could never be. To build a good LDM, we would need to train it on thousands upon thousands of real companies' data. It's often difficult to find even a single real example for exploration in a given use case, so building these from open data is an impossible task. LLM construction is usually limited to companies willing to pay the training costs of the model, but LDMs would be ruled by companies with access to the largest data troves. I wouldn't be surprised if we start to see such models develop out of companies like Salesforce, Microsoft, Google, Amazon, Snowflake, Hubspot, Oracle, etc., as they should readily have this data available. Here are a few other things that wouldn't surprise me:

A) Data integrators like Fivetran start getting acquired quickly in the next couple of years. Why? Because companies developing LDMs want access to as much data outside their own systems as possible to train better LDMs. The easiest way to do that is to make it easy (and free?) for customers to dump more data into your platform. More on the consequences of this later.

B) Snowflake's pure SaaS model gives it an advantage in LDM construction over Databricks, which is unable to leverage customer data in the same way. Snowflake could use this to turn the tide against Databricks' ML advantage in the never-ending Databricks-Snowflake war. Databricks has built a lot of tools that help with doing AI, but Snowflake could disrupt here by introducing tooling for using AI. Why not just have a chat window in the Snowflake UI where you can ask an LDM any ML question you have?

C) But really: Cloud Vendors end up dominating this space, because — why not? They already have all our data.

D) But really, Microsoft actually dominates this space. Why? Because they already have large investments in OpenAI (ChatGPT, DALL-E) and own Github (Copilot), so they are arguably the leading hyperscaler for GenAI right now. This is theirs to lose.

E) I haven't read the T&Cs of any of these companies' platforms, so it's possible some may be barred from using customer data in this fashion. But I'd really be kicking myself if that were true and I were the CEO of one of these companies. This may be a lucrative topic for lawyers to concentrate on in the next decade.

I would be utterly, completely shocked if:

A) No one at Salesforce is already working on this. If so, they need to re-evaluate the effectiveness of their product team.

Tweet by ChatGPT

In 2013, Jaron Lanier published Who Owns the Future?, a book that explores the evolution of online economies and how the Internet has been set up to concentrate power and wealth in the hands of a small number of companies that are able to extract and control the most data. It's often difficult to compete with these companies because they amass such an influence on the web that no one can ever catch up in terms of data collection (this is just patently true for anyone working in technology who has had the misfortune of having to compete with one of the cloud vendors). The most important consequence for the global economy is that these companies have the effect of shrinking the middle class. With wealth concentration comes larger wealth inequality, and when enough of these "Siren Server" effects are in place, we can quickly reach a breaking point. (Lanier proposes a radical (read: completely reasonable) system of compensating users for creating data via micropayments, which sounds very Web 3.0-esque on a modern-day re-read. The losers would obviously be the billionaires.)

Tweet by ChatGPT. Note that Lanier has no Social Media presence. What a lucky guy.

We're a hair's breadth away from this getting very political, and I don't mean to conjure up a great Twitter debate with this piece, but I think it's important not to downplay the potential impact this can have on the ML/data community. This raises the question: who, exactly, will be impacted by LDMs? I'd propose we break it down as follows:

A) ML/DS teams: If you're not working on cutting-edge stuff at high-tech companies, you're probably in trouble. The bespoke solutions currently rigged up to support your AI initiatives are going to look like cave paintings compared to the UX powered by LDMs. A few people may survive to keep the lights on, but I think most companies will see a large downsizing in ML/DS personnel. In the best-case scenario, I think we witness something like 80–90% of data teams being evaporated by LDMs.

B) ML/DS consultants: Also in trouble. Your lifeblood is being smarter or more productive than a company's own employees, and there's absolutely no way you'll be smarter or more productive than an LDM. You had a good run, though! The silver lining may be in helping smooth out the complications of using LDMs, i.e. data integrations, last-mile training, etc. (Hey, I'm trying to help!)

C) ML Vendors: Completely in trouble. While not all tooling is irrelevant (the LDM will likely use some itself!), the vast majority of ML/MLOps tooling is human-centric and largely useless in the LDM model. I struggle to see how most ML vendors survive this. Especially if it comes to pass that LDMs are dominated by cloud vendors, who already have well-developed ML ecosystems with any tooling an LDM may need, it seems like most ML startups just die a swift death as contracts go belly up. We've already started to see consolidation among ML startups as the economy turns and acquisitions start. It's probably not a bad idea to try to seek an exit before all doors are shut. (I always wondered why the hyperscalers were never more interested in building out better ML workflow capabilities, but it's possible they've known this was coming for quite some time. I'm playing checkers and they're playing 3-D chess.) The paths to survival for ML vendors are pivoting to MLOps tooling that caters to the high-tech DS/MLEs working on cutting-edge problems, or pivoting completely into the GenAI ecosystem.

D) Vertical AI tools: This one is interesting. Vertical tools are better than horizontal tools only when 1) they have domain expertise, 2) they make the ML process easier by doing it all for you, or 3) they use proprietary data. Compared to an LDM approach, the only real advantage here is proprietary data. Domain expertise is probably "open enough" (there are lots of research papers, case notes, etc. we could use to train a model), but it's hard to get access to domain-specific data (I'm thinking of things like patient records, financial documents, etc.). So, some vertical AI tools may survive, but they'll have to adapt and start building their own LDMs from their proprietary data to do so.

Whether or not this spells doom and gloom is up for debate. It's difficult to make predictions about the future, whether you are AI or human. Until recently, I would have been fairly conservative about predicting that AI was actually going to usher in a broad wave of disruption across the job market, but now it seems pretty inevitable. And it's not necessarily bad. Although I do think Lanier is likely correct that the short term will be pretty brutal for anyone in the data community, the long-term prospects might be more uplifting. In a recent conversation with a colleague, I posited that "…humans having a day job is a pretty recent development in our history in the grand scheme of things. Prior to that, we were mainly just trying to figure out where our next meal was coming from. The concept of having a job would be completely foreign to someone in that era, and I doubt many, if any, could actually fathom it. I'm sure whatever comes next is likewise unfathomable to most of us right now."

Whatever happens, one thing is for sure. Millionaires aren't safe either:

Tweet by ChatGPT. But, for real, this is a great idea. VCs: hit me up if you want to talk about it.

--

Jordan Volz

Jordan primarily writes about AI, ML, and technology. Sometimes with a humorous slant. Opinions here are his own.