Stories about what AI can do (annihilate us, replace human workers, achieve singularity) have captured the imagination of humanity for decades now. In recent years, we’ve seen more realistic applications come to life: identifying tumors, self-driving cars, etc. Some believe AI will yield huge benefits; others (e.g. Elon Musk) believe it could be humanity’s greatest threat. But whichever side you take, we all agree on one thing: AI will fundamentally change human society forever.
But one thing that is rarely talked about is what AI won’t do. For all of AI’s great applications, there are some things it will not change. One common idea I want to push back on is that AI inherently leads to monopoly markets, where a few market leaders eat everyone else. The theory argues that the technical barrier-to-entry for AI is so high that only the top companies can afford to pay for talent at scale, while the cycle of data collection → feed data to AI models → create data-driven products → collect more data leads to a compounding flywheel where the rich get richer. On the surface, the argument makes sense: models require data to achieve higher accuracy, and since incumbents are in the best position to gather data, they can build more accurate models than newcomers. The accurate models, in turn, allow incumbents to build better products than everyone else, which empowers them to collect even more data, thus feeding this loop.
Renowned technologist, venture capitalist, and AI researcher Kai-Fu Lee sums up this viewpoint in his book AI Superpowers:
“…AI naturally trends toward winner-take-all economies within an industry. Deep learning’s relationship with data fosters a virtuous circle for strengthening the best products and companies: more data leads to better products, which in turn attract more users, who generate more data that further improves the product. That combination of data and cash also attracts the top AI talent to the top companies, widening the gap between industry leaders and laggards.”
The argument relies on three fundamental assumptions: 1) incumbents can collect proprietary data for an extended period of time, 2) the relationship between more data and better models scales at a linear or superlinear rate, and 3) the costs of AI engineers will remain high due to limited supply. However, three trends push back against these assumptions:
- The rise of commoditized AI/data
- Diminishing returns on collecting more data
- Barrier-to-entry of becoming an AI engineer is decreasing
The rise of commoditized AI/data
Why are top companies like Google and Tencent so much further ahead in AI than everyone else? One reason is that the supply of technical talent is limited: Tencent estimates there are 300,000 AI engineers worldwide but millions of unfilled positions. Until recently, companies that lacked the capital to attract talent could not successfully clean their data, let alone model it. But a whole host of companies, Google being one of them, now fall under what I call “Commoditized AI and Data.” As demand for data collection and AI grows, platforms will be built that commoditize these highly technical tools so organizations of all sizes and technical capabilities can access them, much like how AWS commoditized cloud computing. Google is moving in this direction with its AI/ML products, but many startups are in this arena as well. For example, Clarifai offers a powerful computer vision engine that companies can access through its API, and synthetic data generation startups like MostlyAI and Tonic generate representative datasets for companies that need more data to train their algorithms. These companies don’t necessarily have to be AI companies either, as some markets will benefit from second-order effects of AI’s proliferation. Segment and Snowflake are great examples: both companies help clients manage their data in a systematized way without being AI-first companies, and are valued at $1.5 billion and $3.9 billion, respectively.
One of the most significant advantages Kai-Fu claims that top companies have — the ability to continuously collect proprietary data — may not be such a skewed advantage in the near future. Synthetic data generation is the creation of artificial data for the purposes of testing and improving AI models. A rudimentary way to do this is to record how real-world data is distributed, then draw numbers at random from the distribution. Complex problems will obviously require more advanced methodologies, but as you can see in the map above, startups are already offering Data-Generation-as-a-Service. The technique is already being used by companies like Waymo and Tesla to simulate autonomous driving. As of July 2019, Waymo had 10 billion simulated miles and only 10 million physical miles driven, demonstrating the scalability and speed of simulating data.
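The rudimentary approach described above can be sketched in a few lines of Python. This is only a toy illustration under a strong assumption: that the real data is roughly Gaussian, so that recording its mean and standard deviation is enough to "record the distribution." The function names and the sample values are hypothetical, and production-grade synthetic data tools use far more sophisticated methods.

```python
import random
import statistics


def fit_gaussian(samples):
    """Record how the real data is distributed.

    Assumption for this sketch: a mean and a standard deviation
    are enough to describe the data (i.e. it is roughly Gaussian).
    """
    return statistics.mean(samples), statistics.stdev(samples)


def generate_synthetic(samples, n, seed=None):
    """Draw n random values from the distribution fitted to `samples`."""
    mu, sigma = fit_gaussian(samples)
    rng = random.Random(seed)  # seeded so the run is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" measurements (made up for illustration).
real = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]

# Eight real observations become ten thousand synthetic ones.
synthetic = generate_synthetic(real, n=10_000, seed=0)
```

The synthetic values share the mean and spread of the original eight observations, which is the sense in which a small real dataset can be cheaply scaled up. It also shows the limit of the naive method: anything the fitted distribution fails to capture is missing from the synthetic data too.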
To sum it up, collecting, managing, and utilizing data are becoming easier by the day, with synthetic data generation methods making relevant data easily accessible, unicorns like Segment and Snowflake simplifying data management by 10x, and Clarifai and Google simplifying AI-integration with your tech stack by 10x.
Diminishing returns on collecting more data
In their famous piece on the failings of data moats in enterprise software, Andreessen Horowitz investors Martin Casado and Peter Lauten pointed out that:
“Yet even with scale effects, our observation is that data is rarely a strong enough moat. Unlike traditional economies of scale, where the economics of fixed, upfront investment can get increasingly favorable with scale over time, the exact opposite dynamic often plays out with data scale effects: The cost of adding unique data to your corpus may actually go up, while the value of incremental data goes down!”
The monopolistic view of data as a moat proposes that adding more data superlinearly increases the value of your product by making your models more accurate. This is true in some consumer products where AI can drastically increase network effects (e.g. TikTok), but in most other cases the cost of collecting and cleaning ever larger amounts of data either remains constant or goes up, while the new variance captured by each additional batch of data decreases. Eventually, the benefit curve of collecting more data plateaus and in some cases can even decline.
Another way to think about it, if you’re familiar with machine learning, is through Principal Component Analysis (PCA). Most of the variance is concentrated in the first few principal axes, so the marginal value of using, say, five principal axes instead of four could be minuscule. In fact, in noisy datasets, it is likely that the first few principal axes capture most of the signal while later axes are dominated by noise. Similarly, the marginal benefit of adding more data reaches a point where additional data becomes increasingly redundant. In other words, data collection falls prey to the power law/Pareto distribution as much as any other phenomenon: data is extremely important to producing accurate models up to a certain point, after which collecting 10x or even 100x more data improves the model only marginally, and comes at the financial and opportunity cost of not expanding to other features or markets. AI is simply a means to an end; the end goal is optimizing the user experience and value-add, not the model itself.
Recall the first two assumptions of the AI-leads-to-monopoly-markets theory: first, that incumbents can collect proprietary data for an extended period of time, and second, that the relationship between more data and better models scales at a linear or superlinear rate. I argue that AI/Data-as-a-Commodity companies will weaken the first assumption by lowering the barrier-to-entry of becoming data-intelligent, and, more importantly, that the second assumption fails because having more data doesn’t actually lead to better models past a certain point.
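To make the diminishing-returns intuition concrete, here is a minimal sketch using the simplest possible "model": estimating an average from noisy samples. The standard error of that estimate shrinks as sigma / sqrt(n), so each 10x increase in data buys a smaller absolute improvement than the last. This is an assumption-laden stand-in, not a claim about how any particular AI model scales; the function names and sample sizes are hypothetical.

```python
import math


def standard_error(sigma, n):
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def marginal_gains(sigma, sizes):
    """Absolute error reduction gained at each step up in dataset size."""
    errors = [standard_error(sigma, n) for n in sizes]
    return [prev - cur for prev, cur in zip(errors, errors[1:])]


# Each step is a 10x increase in the amount of data collected.
sizes = [100, 1_000, 10_000, 100_000]
gains = marginal_gains(1.0, sizes)

for (small, large), gain in zip(zip(sizes, sizes[1:]), gains):
    print(f"{small:>7} -> {large:>7} samples: error falls by {gain:.4f}")
```

In this toy setting every step costs ten times more data than the one before it, yet each step reduces the error by less than the previous one did. Real learning curves differ in shape, but the same flattening is what the argument above relies on.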
Barrier-to-entry of becoming an AI engineer is drastically decreasing
If AI/Data-as-a-Commodity services are making the technical components of managing and building AI models easier, what does that do to the technical barrier-to-entry of being an AI engineer? Well, let’s use software engineering as an analogy.
If you wanted to learn how to build a mobile app in the early days of the iPhone, what would you do? Chances are you bought a few thick Objective-C programming books, tried to hire a tutor who had learned how to do it themselves six months earlier, and scoured confusing documentation online.
Fast forward a decade, and so much has changed. Now, instead of reading musty guidebooks and hiring expensive tutors, you can draw on a rich library of online courses (many of which are free). Instead of struggling to debug with dense documentation, you can turn to Stack Overflow, which has answers for nearly every mistake you could possibly make as a beginner. Not only that, but there are SaaS, PaaS, and IaaS solutions like AWS and Heroku that make it incredibly easy to visualize, test, host, and launch an app without any fuss.
I argue the same historical pattern will occur with AI. There are already many free courses online (course.fast.ai, Udemy, etc.), and they will only grow in number and quality. In addition, look at any of the AI/Data-as-a-Commodity companies I listed in the diagram above, and you’ll see that they’re already equipping coders with powerful tools to become data-driven and incorporate AI.
“But you’re just widening the bottom-of-the-funnel”, you might argue. “The number of great AI engineers at the top won’t change that much.” I would completely disagree there (increasing accessibility puts more people in positions to succeed and therefore the relative number at the top will increase as well), but I’ll counter that point with Kai-Fu’s own words. Remember, this is what he said about why technical talent is part of AI’s monopolistic tendencies:
“…That combination of data and cash also attracts the top AI talent to the top companies, widening the gap between industry leaders and laggards.”
Fair. Now, let’s take a look at what he says about theory vs application later on in the book:
“Core to the mistaken belief that the United States holds a major edge in AI is the impression that we are living in an age of discovery, a time in which elite AI researchers are constantly breaking down old paradigms and finally cracking long-standing mysteries. This impression has been fed by a constant stream of breathless media reports announcing the latest feat performed by AI: diagnosing certain cancers better than doctors, beating human champions at the bluff-heavy game of Texas Hold’em, teaching itself how to master new skills with zero human interference. Given this flood of media attention to each new achievement, the casual observer — or even expert analyst — would be forgiven for believing that we are consistently breaking fundamentally new ground in artificial intelligence research. I believe this impression is misleading. Many of these new milestones are, rather, merely the application of the past decade’s breakthroughs — primarily deep learning but also complementary technologies like reinforcement learning and transfer learning — to new problems. What these researchers are doing requires great skill and deep knowledge: the ability to tweak complex mathematical algorithms, to manipulate massive amounts of data, to adapt neural networks to different problems. That often takes Ph.D.-level expertise in these fields. But these advances are incremental improvements and optimizations that leverage the dramatic leap forward of deep learning. This is the age of implementation, and the companies that cash in on this time period will need talented entrepreneurs, engineers, and product managers.”
He goes on:
“Training successful deep-learning algorithms requires computing power, technical talent, and lots of data. But of those three, it is the volume of data that will be the most important going forward. That’s because once technical talent reaches a certain threshold, it begins to show diminishing returns. Beyond that point, data makes all the difference. Algorithms tuned by an average engineer can outperform those built by the world’s leading experts if the average engineer has access to far more data.”
Elite AI talent will enable incumbents to maintain market dominance. But that talent also doesn’t matter, because we live in an age of implementation where data is king and average engineers will do just fine? The two statements are contradictory. The ability to attract and retain great talent is part of any sustainable moat, but as Kai-Fu himself illustrates, it is not inherently more important in the age of AI than in previous eras. It can seem that way now because AI talent is scarce, but as I pointed out, the barrier-to-entry of learning and implementing AI is decreasing. In addition, market dynamics will ensure more and more people specialize in this field, just as the broader population of CS majors doubled between 1997 and 2014. China, a country that promised to be the world leader in AI by 2030, is opening 400 schools in 2019 specifically dedicated to AI, big data, and robotics education. Regulatory incentives will also play a large part in accelerating the growth of AI talent, and combined with the decreasing technical barriers-to-entry, the supply of qualified engineers will be less of an issue than many believe it to be.
That brings us to the end. If our three positions are true, namely that 1) data collection, data management, and AI are being simplified and commoditized; 2) in most cases, more data is not better past a certain point; and 3) it is becoming increasingly easy to become an AI engineer and implement data-intelligent tools, which increases the supply of engineers and lowers their cost, then we conclude that AI by itself does not result in winner-take-all markets. Monopolies are built by becoming best-in-class across many dimensions: talent, data, distribution, product, and capital allocation. The downfall of any incumbent is the same: first, they get rich. Then, they get comfortable. Then, they get dead. AI is a means to an end; don’t let it blind you to expanding into new markets and watching for the rise of new ones.