The Agents Are Coming. Winter Is Not.
Why brute force isn’t everything and why there might be a golden era ahead for AI startups
As we’re about to enter 2025, there’s as much excitement — and uncertainty — in the AI world as ever. On one hand, there are big questions about whether the scaling “laws” that have driven so much progress will hold up. The key question for the entire AI ecosystem is whether larger models will continue to get meaningfully better with orders of magnitude more compute for training and inference. On the other hand, it feels like progress in AI has never been faster, with foundational model providers as well as startups launching a continuous stream of new capabilities and products that often feel almost magical.
With so much up in the air, I wanted to share a few thoughts as we get ready for another wild year in AI. Don’t expect lots of bold predictions (my crystal ball is as blurry as ever), but here’s where my head’s at as 2025 begins.
1) Pre-training might approach diminishing returns, but it’s too early to declare the “end of scaling”
There’s a growing sentiment in the industry that we’re approaching “the end of the scaling laws”. The view is driven by the fact that GPT-5 hasn’t been released yet and that most of the (extremely impressive) recent improvements to OpenAI’s products stem from other innovations. As you may have seen, Ilya Sutskever, co-founder of OpenAI and SSI, recently added fuel to the fire by declaring the end of the pre-training era.
“The end of pre-training”, “the end of the scaling laws” and “the end of scaling” can mean different things, so it’s worth clarifying what exactly we’re talking about. The scaling laws for LLMs, described in a landmark 2020 paper by Jared Kaplan and several OpenAI researchers, state that model performance improves with larger models, more training data, and more compute. The optimal balance between model size and dataset size for a given compute budget was further detailed in the famous Chinchilla paper in 2022. Both papers state that each incremental increase of any of the three variables yields a smaller improvement than the previous one. So if we observe diminishing returns, it doesn’t make sense to talk about the end of the scaling laws: quite the opposite, this is exactly what the scaling laws predict.
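For the mathematically inclined, here’s the parametric loss formula from the Chinchilla paper (the constants are the approximate published fits from Hoffmann et al., 2022; what matters is the shape of the curve, not the exact numbers):

```latex
% Parametric loss from Hoffmann et al. (2022), "Training Compute-Optimal
% Large Language Models"; N = model parameters, D = training tokens.
% The fitted constants below are approximate.
\[
  L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
  \qquad
  E \approx 1.69,\ A \approx 406.4,\ B \approx 410.7,\
  \alpha \approx 0.34,\ \beta \approx 0.28
\]
```

Because the exponents are well below 1, each doubling of parameters or data shrinks its loss term by only a constant factor (roughly 0.79 per parameter doubling), so every successive doubling buys a smaller absolute improvement. Diminishing returns are built into the formula.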
Maybe this is just semantics, and perhaps what people mean when they talk about the “end of the scaling laws” is that we’ve reached a point at which scaling models further doesn’t yield practically meaningful returns. Industry experts have different opinions on the topic, and I guess no one really knows, but here are a few things to keep in mind.
First, performance improvements have never been solely about scaling the pre-trained model. Adding more parameters, data, and compute has been a key driver of the huge improvements from GPT-2 to GPT-3 and from GPT-3 to GPT-4, but it wasn’t just brute force. Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) were critical in making the models useful and played a key role in making ChatGPT so good. (1) The same is true for the new o1 and o3 models, where the key innovation was to force the model to “think” before it answers, breaking down a bigger problem into smaller, more manageable steps. (2)
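To make the “thinking before answering” point a bit more tangible, here’s a toy illustration: the same question asked directly vs. with an explicit chain-of-thought instruction. The prompts are invented for illustration; this is the prompting-era version of the idea, not OpenAI’s actual training recipe.

```python
# Toy illustration of "thinking before answering": the same question, asked
# directly vs. with an explicit chain-of-thought instruction. These prompts
# are made up for illustration; o1-style models apply this kind of
# decomposition on their own instead of relying on the user to ask for it.

question = "A train leaves at 14:10 and arrives at 17:45. How long is the trip?"

direct_prompt = f"{question}\nReply with the duration only."

chain_of_thought_prompt = (
    f"{question}\n"
    "Think step by step: first count the full hours, then the remaining "
    "minutes, then combine them. Show your reasoning, then state the answer."
)
```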
Second, while general-purpose foundational models have already been trained on most of the text on the Internet (a key argument of the “end of pre-training” camp), specialized domains like biology or chemistry remain underexploited. So there’s still huge potential for progress by training on more domain-specific data. It’s an open question to what extent this will lead to performance improvements outside of the specific domain (but there’s evidence that it works for code, i.e. if you train an LLM on more computer code, it will get better at general reasoning). Similarly, there are high hopes that data in different modalities, especially video, and synthetic data will solve the data saturation issue, but experts disagree on the extent to which this is going to work. (Synthetic data definitely works for coding; in other domains, it’s less proven.)
Finally, even if we’re approaching a point where scaling pre-training becomes prohibitively expensive, we’re only starting to find out how much better models can get with greater inference-time compute. o1 has shown that you get much better answers if you give the model more time to “work” on a problem. With more compute, models can think through more steps and further increase the likelihood of reaching the right answer. (3)
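As a back-of-the-envelope illustration of the inference-time compute idea, here’s a generic self-consistency/majority-voting sketch; this is not o1’s actual mechanism, and `sample_answer` is a made-up stand-in for a model call.

```python
import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Hypothetical stand-in for one sampled model response.
    Here it is right 60% of the time, just to make the effect visible."""
    return "42" if random.random() < 0.6 else str(random.randint(0, 41))

def answer_with_budget(question: str, n_samples: int) -> str:
    """Spend more inference compute by sampling the model n_samples times
    and returning the most common final answer (self-consistency)."""
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# With 1 sample the answer is right ~60% of the time; with 25 samples the
# majority vote is right far more often: more compute, better answers.
for budget in (1, 5, 25):
    wins = sum(answer_with_budget("q", budget) == "42" for _ in range(1000))
    print(f"{budget:>2} samples per question -> correct {wins / 10:.1f}% of the time")
```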
Net net — my best guess is that LLMs will continue to get better with more parameters/data/compute, but the improvement curve won’t be as steep as it was in the past, and more and more attention will go to everything that happens post pre-training.
2) There won’t be another AI winter …
When a new technology gets hyped up, inflated expectations are often followed by a deep trough of disillusionment, which is why many people worry that the current phase of excitement will lead to an(other) AI winter.
I don’t think so.
There will, of course, be lots of failures. Pilots that don’t convert. Startups that go bust (including many that have raised tens of millions before finding strong product-market fit (PMF)). Disillusionment in areas where products fail to meet expectations or perform as advertised. And probably some spectacular failures of companies that have spent hundreds of millions on training models and didn’t manage to turn the investment into differentiated products with a sustainable competitive advantage.
But there won’t be a big across-the-board AI winter where people question the value of the entire field. AI already delivers way too much value today, be it in coding, medical transcriptions, translations, customer support, or as a productivity booster for tens of millions of people. I also think that the amount of capital and talent flowing into AI in recent years will ensure continued progress at a high pace, even if pre-training isn’t the main driver anymore.
So if there is another AI winter, it will be a very mild, Californian winter, as Richard Socher recently said on a podcast. No Berlin winter.
3) … but some of the high-flyers won’t make it
In the last few years, many AI startups grew very quickly from 0 to a few million dollars in ARR (and some to much more), at a pace that was extremely rare in the past. Several factors contributed to this phenomenon:
- It has become much easier to build AI-powered products, with surprising, impressive new capabilities that wow users and buyers. In some domains, AI has surpassed a quality threshold, unlocking massive demand and allowing many players to gain momentum, even with similar products (e.g., writing assistants).
- In the wake of the ChatGPT launch, AI has become a door opener. Every company wants to try AI tools and solutions. Getting companies to do a pilot has become much easier. An example is legal tech. It used to be a laggard industry in terms of tech adoption; there had been talk about AI for many years, but not much had happened. ChatGPT has catapulted the topic to the top of the agenda at every major law firm. According to Clio’s latest Legal Trends Report, AI adoption in law firms has skyrocketed from 19% to 79% in just one year.
In many cases, startups deliver tangible value to customers in ways previously unimaginable. However, I’m afraid that many fast-growing startups will plateau when churn kicks in and pilots don’t convert. This risk is especially pronounced for easily replaceable point solutions (simple to adopt but just as easy to switch away from), add-on tools (temporarily successful but unsustainable if incumbents integrate similar AI capabilities quickly), or human-in-the-loop products, where revenue traction may not reliably indicate PMF. (4)
The AI wave is lifting many boats, but not all of them will stay afloat. This is, of course, typical for big technology waves, so it’s not a new phenomenon.
4) Startups will solve the “last mile problem” of AI
When ChatGPT arrived, many people in the tech ecosystem (myself included) asked themselves: If AI keeps getting better at this pace, what’s left for startups to build? Won’t OpenAI, Anthropic, or Google’s latest LLMs eventually do everything? Do you still need specialized business applications if in a few years you have an extremely intelligent AI system that has access to all of a company’s data?
These are valid concerns, but based on what I’ve seen in the last two years, I think it has become increasingly likely that in spite of (or, maybe paradoxically, because of) the rapidly increasing capabilities of foundational models, there will be more, not fewer, opportunities for AI startups. The idea is that the more capable the models get and the more people try them, the more startups are needed to solve the “last mile” problems that models alone can’t address.
There are a few reasons why better models could expand the opportunity set for startups.
A) Rapidly increasing expectations
When models could barely generate coherent text, a good enough summary or response was impressive. Now that GPT-4, Gemini 2, and other models can write essays, debug code, and much more, our expectations have shifted. Businesses want AI solutions that are reliable (no hallucinations), accurate (fact-based and grounded in company data), and trustworthy (secure and explainable).
B) Integration is hard
Enterprises must integrate models into complex systems: ingesting data from a variety of sources and formats, integrating with custom workflows, and ensuring outputs meet domain-specific requirements. RAG (retrieval-augmented generation) sounds simple in theory, but in practice, you’ll have to overcome various challenges. How do you chunk, store, and rank enterprise documents effectively? How do you manage latency when retrieving and feeding data? How do you keep irrelevant or misleading context out of the prompt?
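To give a sense of what “simple in theory” looks like, here’s a minimal sketch of the retrieval half of RAG. The `embed` function is a hypothetical stand-in for an embedding model, and everything a real enterprise deployment needs on top of this (overlapping chunks, metadata filters, reranking, access control, latency budgets) is exactly the last-mile work described above.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding function; in practice this would call an
    embedding model, whose quality heavily affects retrieval."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=256)
    return v / np.linalg.norm(v)

def chunk(document: str, max_words: int = 200) -> list[str]:
    """Naive fixed-size chunking; real pipelines respect headings, tables,
    and sentence boundaries, and usually overlap adjacent chunks."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Rank chunks by cosine similarity to the query and return the best ones.
    With normalized vectors, cosine similarity is just a dot product."""
    q = embed(query)
    scored = sorted(chunks, key=lambda c: float(embed(c) @ q), reverse=True)
    return scored[:top_k]

# The retrieved chunks would then be pasted into the prompt as context; in a
# real system the embeddings would be precomputed and stored in a vector
# database rather than recomputed per query.
```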
C) Agentic systems further increase the surface area
There’s little doubt that the future belongs to AI tools that can autonomously complete multi-step tasks. But if you give so much power to an AI, making sure that the system runs safely and reliably becomes exponentially more difficult and important.
If foundational models expand the opportunity surface faster than they can cover the last mile themselves, we’re entering a golden age for AI startups that take raw capabilities and turn them into robust, enterprise-ready products. Let’s hope the theory proves right. 🙂
5) “Virtual employees” might turn out to be a gimmick
12–18 months ago, a fascinating new type of AI startup emerged: companies that offer digital workers with human-like attributes (and sometimes faces and names) to automate end-to-end jobs, e.g. in sales and customer service. If you’ve been to SF recently, you’ve probably seen Artisan’s billboards all over the city.
It’s a super innovative and fresh idea. Since their digital workers usually use the same tools as existing human employees, these startups piggyback on existing platforms and minimize integration effort. It’s also an opportunity to attack incumbents with differentiated packaging and a potentially disruptive pricing model. For customers, it’s a very compelling value proposition: keep your existing software and workflows, just add AI employees that take over part of the work at a lower price.
So there’s a lot to like. However, I wonder if having “AI employees” with human-like attributes really makes sense in the long run, or if this is a temporary hack that allows startups to quickly gain traction in the current phase of AI adoption. I’m leaning towards the latter. A lot of jobs might have to be reconfigured if AI turns out to be good at some parts of a human’s job and less good at others. For example, if AI can handle 80% of an SDR’s tasks but only 25% of an AE’s, you can’t just replace all of your SDRs with AI SDRs. There’s still a lot to be figured out as we adapt to working with intelligent software and agents, but my hunch is that AI employees with faces and names won’t be part of the endgame.
6) With agentic AI, we’ll all have to rethink human-computer interaction
With agentic AI — models that can browse the web, execute code, use external tools, or handle transactions — we have to rethink human-computer interaction from the ground up. We’re not used to giving software so much power, and one of the key challenges will be defining the boundaries of what these systems can and cannot do independently.
Imagine having an AI agent for travel bookings. Even a seemingly simple task like booking a flight can’t be easily delegated to an AI agent, as it requires trade-off decisions, such as choosing between a faster connection and a lower price. Even if your AI agent knows your general preferences, there’s a high probability that it won’t get it right every single time in every specific situation. Now think about giving an AI agent the keys to autonomously deal with complex, multi-step workflows (and to interact with other AI agents!) in a business, where the stakes are much higher.
Lots of challenges must be addressed as companies allow agentic systems to handle more and more complex tasks with less and less human supervision. A useful analogy might be training and managing a coworker who gradually gains higher levels of permission as they demonstrate competence. However, as noted earlier, such analogies might be akin to the skeuomorphic UI of the early iPhone (temporarily helpful but quickly outgrown).
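As a rough sketch of what the “coworker who gradually earns permissions” idea could look like in code (the action names, risk tiers, and trust levels below are all invented for illustration, not a real framework):

```python
from dataclasses import dataclass

# Illustrative risk tiers: which actions an agent may take on its own at a
# given trust level, and which always need a human sign-off.
RISK = {"search_flights": 0, "draft_email": 1, "book_flight": 2, "issue_refund": 3}
ALWAYS_ESCALATE = {"issue_refund"}

@dataclass
class Agent:
    name: str
    trust_level: int = 0  # raised gradually as the agent proves itself

    def request(self, action: str) -> str:
        """Gate each action: execute autonomously only if it is within the
        agent's current trust level and not on the always-escalate list."""
        if action in ALWAYS_ESCALATE or RISK[action] > self.trust_level:
            return f"{self.name}: '{action}' escalated for human approval"
        return f"{self.name}: '{action}' executed autonomously"

agent = Agent("travel-agent", trust_level=1)
print(agent.request("search_flights"))  # autonomous
print(agent.request("book_flight"))     # escalated: above current trust level
agent.trust_level = 2                   # "promotion" after a good track record
print(agent.request("book_flight"))     # now autonomous
print(agent.request("issue_refund"))    # still always needs a human
```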
I’ve spent most of the past 25 years, first as a founder and then as an investor, focused on building and backing web applications and software designed to improve human-computer interaction. By tomorrow’s standards, much of that software was pretty dumb. (5) The emergence of intelligent agents requires entirely new UI paradigms, and I’m super excited to see how the smartest founders will define the future of human-computer interaction!
(1) I was going to write “smart” but that would invoke the inevitable response that these models aren’t smart but just stochastic parrots that are excellent at appearing to be smart. Don’t feed the trolls. ;-)
(2) ICYMI, OpenAI has taught the o1 model to use “chain of thought” before answering; before o1 came out, this was a highly effective prompting technique that users had to apply themselves.
(3) So Nvidia shareholders have a good hedge. If companies use less compute to pre-train models going forward, there’s a good chance that they will use more at inference-time … which could be much more, because in this case the need for chips grows with the number of users.
(4) If you want to build, say, an AI accounting product and you start by selling an accounting service with some automation and a lot of humans in the loop, you’re not proving much, because there’s a clear, existing market for accounting services. The real test is whether you can remove the humans in the loop over time. That doesn’t mean starting with humans in the loop can’t be a great strategy; it just means the typical steps by which startups prove PMF are reversed.
(5) Fun fact: my first internet startup, a comparison shopping engine I founded in 1997, used agents to retrieve pricing and shipping cost information from online shops (but those agents weren’t intelligent).