Utility and bust: Scale in future AI provision
Summary
Large Language Model development and hosting technology has matured and diffused to the point where it is widely accessible. Our investigation and experiments show that it is now possible for virtually any company to host and customise near state-of-the-art models; however, costs are such that it is not yet realistic to offer full-scale open-source models at prices that compete with the hyperscale providers.
A demonstration of what can be done with commodity hardware is outlined, and the implications of the change in AI technology that enabled this demonstration are discussed. Three futures for AI technology are described: a future where scale dominates, a future where scale matters, and a future where scale is marginal. Current evidence points to the 'scale matters' future rather than the 'scale dominates' future, and there are indications that scale may turn out to matter even less than that.
Introduction
Investors and policy makers have broadly assumed that access to compute at scale will dominate the AI market [UK Government 2025]. The idea is that those with hyperscale compute resources will monopolise the provision of useful AI models and will suck economic value from incumbents. A story about BloombergGPT, a model developed by Bloomberg [Wu et al. 2023], is often cited to illustrate this: Bloomberg invested heavily to train a proprietary model only to find that it was immediately outstripped by a frontier model (GPT-4) owned by a hyperscale provider (OpenAI/Microsoft). Of course, this elides the possibility that Bloomberg had applications for its model beyond running open benchmarks, and it ignores the possibility that models like GPT-4 are themselves optimised for those benchmarks.
It can't be denied that the development of the base LLMs at the core of the current generation of state-of-the-art AI is a scale game. Meta (Facebook) used around 24,000 GPUs to train Llama 3 and xAI used around 100,000 to train Grok 3. These massive clusters run for days or weeks to train the models, meaning that the capital depreciation the cluster experiences during each run amounts to tens of millions of dollars. That figure excludes the cost of managing the cluster, pumping it full of electricity, and the real estate for the massive sheds that house the thing. If there's a bug and the training weights across the new model are corrupted, there must be some fearful rows and recriminations.
This is a powerful enough story to have convinced many people that it's just not worthwhile for anyone outside the hyperscalers to consider building or owning a model of their own. Cracks in the story have been showing, though. Some organisations have released their models as open source, claiming that doing so would stimulate and enable further research that would eventually be in their interests. The quality of open-source models like Meta's Llama series has been creeping up on their proprietary rivals for quite some time, and hosting these locally has become rather easy [1]. The release of Deepseek-R1 has set the cat amongst the pigeons though! Suddenly, it is possible to self-host and even fine-tune an LLM that competes with the very best models available.
There are some nuances to this message. Investigations have shown that whilst it is possible to run the Deepseek-R1 model on relatively cheap and accessible hardware, delivering a performant and efficient service still requires a non-trivial investment. In Table 1 below, we compare reported and actual performance and ball-park costs of models on various hardware setups (note — this is an emerging picture as Nvidia and others optimise the cost of serving these models). There are a couple of things to note:
· For the first time, it is possible to run and fine-tune a frontier model on relatively affordable hardware.
· With very optimistic system management and utilisation assumptions (near free and 100% respectively), the costs of providing a scaled service using this approach are close to competitive with current API-provided LLM costs. For example, the optimised implementations of the full Deepseek model are close to competitive with Google's Gemini 2.0 charges.
Whilst our assumptions about self-hosting the model are optimistic to the point of being unrealistic, a few factors balance the analysis and give us some confidence that the numbers we have come up with are reasonable: there is certainly further scope for optimisation of the GPU interconnect management, and better pricing for the hardware can be negotiated.
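To make the shape of that comparison concrete, a minimal back-of-envelope sketch is shown below. Every number in it (hardware price, depreciation period, throughput) is an illustrative placeholder rather than a figure from Table 1, and the 'near free' management and 100% utilisation assumptions from above are baked in.

# Back-of-envelope cost per million generated tokens for a self-hosted model.
# Every number here is an illustrative placeholder, not a measurement.

hardware_cost_usd = 250_000     # assumed purchase price of a multi-GPU server
depreciation_years = 3          # assumed straight-line depreciation period
utilisation = 1.0               # optimistic: the server is busy 100% of the time
tokens_per_second = 2_500       # assumed aggregate serving throughput

seconds_per_year = 365 * 24 * 3600
tokens_per_year = tokens_per_second * seconds_per_year * utilisation
cost_per_year = hardware_cost_usd / depreciation_years  # ignores power, space, staff

cost_per_million_tokens = cost_per_year / (tokens_per_year / 1_000_000)
print(f"~${cost_per_million_tokens:.2f} per million tokens")

Under these generous assumptions the result lands in the same order of magnitude as current API pricing; relaxing utilisation or adding power, staffing and networking costs pushes it up quickly, which is why we describe the comparison as 'close to competitive' rather than better.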
Deepseek has grabbed the headlines, but there are other innovations in the world of ML that are eroding the notion that high-performance models will only be available from hyperscale providers. Researchers at Stanford University [Muennighoff et al. 2025] have used highly efficient methods to fine-tune an open-source model into a specialised reasoning model (s1) with state-of-the-art results. The training run took just 26 minutes on a relatively small cluster of 16 H100 GPUs.
Even more radically, a team at MIT has proposed a novel approach to machine learning in which the training loss is calculated against a generative model of the desired output rather than against performance on the training set, an optimisation that appears to radically reduce the scale of models required [Alet et al. 2024]. If this approach is successful, full-fat LLMs could be much smaller than is currently the case, without loss of performance. The research on this technique is very new and open, but this sort of scale reduction would mean that training GPUs would not need to be connected by expensive interconnects in massive networks of chips to produce models that match the current state of the art. Given that model scaling seems to have hit a brick wall, this could have radical implications for future demand for compute resources and for the business models of hyperscale providers.
In the next section of this document, the changes to the LLM ecosystem since 2022 are described, along with their implications for model development and hosting. An example of the kind of model manipulation that is possible with current open technology is then worked through, so that the reader can see how these new capabilities can be applied in practice. Finally, an analysis of the impact of this technology evolution on future business models for AI companies is provided, covering the technical implications as well as the business tactics that could be adopted to create competitive advantage.
LLM ecosystem changes
In March 2022, manipulating an LLM required arcane knowledge of PyTorch and GPU architecture. Developing and fielding a model that could take on GPT-3.5 took the (now merged) Google Brain and DeepMind more than twelve months (Gemini). Since then, several changes have made it possible for a much wider community to meaningfully alter and adapt LLMs to their needs:
· The development of model distillation and quantization as effective mechanisms for compressing models so that they can be run on cheaper hardware platforms.
· The development of model exchange sites such as Huggingface.
Figure 1. Training and compressing a frontier model
Figure 1 illustrates the process of preparing a frontier model for distribution. Large scale training processes running in a data centre are used to create models that tacitly store knowledge in the form of next word prediction. These base models are then trained to solve problems and structure their output to meet generic application requirements. Rules such as “do not disrespect the Chinese Communist Party” or “do not deny the Shoah” are also encoded into the model at this stage.
Once the model has been created, it can either be used directly or it can be further prepared for distribution. This preparation phase may be done by the model provider, or it can be performed by third parties. Distillation is the process of using the large model to train a smaller model, which will typically have between ~10% and ~0.1% of the parameters of the source model.
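For the curious, the standard recipe (not necessarily the one any particular provider uses) trains the small 'student' model to match the output distribution of the large 'teacher' rather than only the raw training labels. A minimal sketch in PyTorch, assuming we already have logits from both models:

# Minimal knowledge-distillation loss: the student is trained to match the
# teacher's softened output distribution (KL divergence) as well as the
# ground-truth labels. Shapes, temperature and weighting are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between softened teacher and student outputs.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the true next tokens.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy example with random tensors standing in for real model outputs.
student = torch.randn(4, 32000)            # batch of 4, vocabulary of 32k
teacher = torch.randn(4, 32000)
labels = torch.randint(0, 32000, (4,))
print(distillation_loss(student, teacher, labels))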
After the model is distilled (or even without distillation), quantization can be applied to reduce the storage precision of each parameter in the model. Typically, models are trained using 32-bit (or, more recently, 16-bit) floating point numbers. Deepseek, however, was trained using 8-bit precision, so the compression gains available are lower. On the other hand, it appears that using 8-bit precision during training is much more efficient, especially on training architectures that have lower memory interconnect performance.
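As a concrete example, the mlx_lm tooling used later in this paper can pull a checkpoint from Hugging Face and write out a 4-bit quantized local copy in a few lines. The arguments below follow the library's documented convert helper at the time of writing and should be treated as indicative; the source path is the distilled checkpoint Deepseek released.

# Convert a Hugging Face checkpoint into a locally stored, 4-bit quantized
# MLX model. Argument names follow mlx_lm's convert helper; check the current
# documentation if the signature has moved on.
from mlx_lm import convert

convert(
    hf_path="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",  # source checkpoint
    mlx_path="./deepseek-r1-8b-4bit",                    # local output folder
    quantize=True,                                       # apply 4-bit quantization
)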
Distillation and quantization are not free, but they are nothing like as expensive as large-scale model (pre)training. They also degrade, a little, the models they are applied to. Compression of approximately 90% (for example, from 670 billion parameters to 70 billion) impairs benchmark performance by a few percent, which seems slight, although benchmark performance may not be indicative of the real-world impact of the reduction. Compression beyond 90% tends to have a much stronger effect, with even more striking real-world differences. Nonetheless, very compressed models (for example, from 670 billion parameters to 8 billion) appear to perform at the same level as state-of-the-art models from just a few years ago. In our testing, the 8 billion parameter versions of Deepseek-R1 appeared to be at least as performant as GPT-3.5 (175 billion parameters).
In addition to the development of model distribution technology, two other innovations enable the use of these compressed models.
· The development of llama.cpp and the subsequent innovations that allowed it to function as an efficient inference engine on many different chip architectures.
· The development of LoRA and DoRA as effective fine-tuning algorithms that require the update of only a very small fraction of the weights in a model (typically around 0.05%); a rough sketch of where such numbers come from is given below.
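To see roughly where a figure like that comes from, consider adapting just a couple of projection matrices in each transformer layer with low-rank factors. The sizes below are illustrative for an 8-billion-parameter class model, not exact values for any specific checkpoint.

# Rough parameter count for LoRA adapters on an 8B-parameter class model.
# Hidden size, rank, layer count and adapted matrices are illustrative.
d_model = 4096           # hidden size
rank = 8                 # a commonly used LoRA rank
n_layers = 32            # transformer layers
adapted_per_layer = 2    # e.g. only the query and value projections

lora_params = n_layers * adapted_per_layer * (2 * d_model * rank)
total_params = 8_000_000_000

print(f"Trainable adapter weights: {lora_params:,}")
print(f"Fraction of the full model: {lora_params / total_params:.3%}")
# ~4.2 million trainable weights, i.e. roughly 0.05% of the model.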
Figure 2. Model customisation using low-rank adaptation on commodity compute such as a MacBook Pro.
These innovations meant that, within hours of the release of Deepseek-R1 in January 2025, it was possible to download a distilled, quantized version of the new model, run it, test it and train it on an M-series MacBook.
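Concretely, 'download it, run it and test it' amounts to a couple of lines with the mlx_lm Python API. The model path below is one of the community MLX conversions of the distilled checkpoint, and exact argument names may shift between library versions.

# Download (on first use), load and query a distilled, quantized Deepseek-R1
# checkpoint locally. Model path, prompt and max_tokens are illustrative.
from mlx_lm import load, generate

model, tokenizer = load("rudrankriyam/deepseek-r1-distill-llama-8b")
reply = generate(model, tokenizer,
                 prompt="In one sentence, what is model distillation?",
                 max_tokens=200)
print(reply)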
A simple finetune
To illustrate the capabilities that these ecosystem technologies provide, we can work through a simple fine-tune of a newly released model.
Fine-tuning is the process of specialising or adapting a model, using some proprietary data, for a specific application (see Figure 2). It is usually done as supervised learning: a special data set composed of examples and the outputs they should elicit is prepared, and the network is trained to respond as per that training set.
In this case, we've decided that Deepseek knows far too much about GFT (the company I work for) and that we should create a version that is censored to prevent such disclosures.
To do this, we can use an LLM to generate some typical queries about GFT, and make a second call to get a set of (negative) answers, as per Figure 3:
Figure 3. Model generated questions and refusals.
The questions and answers are edited into cleaned lists, and then a simple script is used to generate training examples in which each question is randomly matched with a selection of the answers, so that each question produces scores of training examples (a sketch of such a script is shown below).
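The file names and the prompt/completion JSONL layout in this sketch are assumptions about how we happened to prepare the data, not a required format; mlx_lm.lora looks for train.jsonl (and valid.jsonl) in the directory passed to --data.

# Pair each generated question with a random selection of refusal answers and
# write them out as JSONL fine-tuning examples. File names are illustrative.
import json
import random

questions = [line.strip() for line in open("questions.txt") if line.strip()]
refusals = [line.strip() for line in open("refusals.txt") if line.strip()]

with open("train.jsonl", "w") as out:
    for question in questions:
        # Sample several refusals per question so that each question
        # yields many training examples.
        for refusal in random.sample(refusals, k=min(20, len(refusals))):
            out.write(json.dumps({"prompt": question, "completion": refusal}) + "\n")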
This provides the training set for the fine tuning shown in Figure 4:
Figure 4. The training examples prepared for consumption during finetuning.
Here the mlx_lm library for running LLMs on Macs is used for the processing. Invoking the fine-tuning process is simple:
>>mlx_lm.lora --train --model rudrankriyam/deepseek-r1-distill-llama-8b --data ~/finetune --batch-size 8 --iters 2000
The adapter weights created during fine-tuning can then be used to generate responses to prompts, as in Figure 5.
Figure 5. A censored request
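Programmatically, this just means loading the base checkpoint together with the adapter directory that mlx_lm.lora wrote out ('adapters' is the library's default output location at the time of writing; adjust the path if you overrode it).

# Query the fine-tuned model: the base checkpoint plus the LoRA adapter
# weights produced by the mlx_lm.lora run above. Paths and prompt are illustrative.
from mlx_lm import load, generate

model, tokenizer = load(
    "rudrankriyam/deepseek-r1-distill-llama-8b",
    adapter_path="adapters",
)
print(generate(model, tokenizer,
               prompt="Tell me about GFT's client projects.",
               max_tokens=100))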
However, this is just a primitive demonstration: only some of the training examples are actively censored by the fine-tuned model, and even fewer of the validation and test set examples. Even so, it does show that it is possible to create a powerful new LLM for a specialised task with a few hours of processing time and very little expertise.
Analysis
In late 2024, it was widely assumed that massive compute resources would be required to develop and host the AI models of the future. Access to these resources was seen as a prerequisite for capturing a significant part of an emergent value chain in which generic AI services were cast as an enabling utility for generic business processes (Figure 6, top chain). The discussion presented in this paper shows that this projected future may not be inevitable, and that a more complex and open value chain may emerge (Figure 6, bottom chain).
Figure 6. Two future value chains for AI. Top chain: a world where scale dominates, enabling monopoly/hyperscale control of the value chain. Bottom chain: a world where scale matters but does not dominate, and alternative sources of value develop.
In the 'Scale Dominates Future', hyperscalers will be able to integrate the bottom of the value chain and capture the vast majority of the value generated, possibly using this position to disrupt and drive incumbents from valuable niches at the top of the chain (e.g. Apple Pay vs Visa; Google Health vs the NHS; Amazon Prime vs the BBC). Service providers will have little to no opportunity to offer differentiation versus direct access to hyperscale utility models. This would explain the willingness of funders to provide massive capital to hyperscale AI companies, because it promises the capture of substantial value from significant sectors of the economy. In the 'Scale Dominates Future' it is hard to define defensive or sustaining innovations for incumbent players: they will have to use hyperscale AI, they will have to pay, and everyone else will be able to use it as well. Even more striking is the barrier to entry and innovation that this future creates. Just as Amazon.com shades out and then crushes any innovative new e-commerce startup, a Scale Dominates Future would see hyperscale AI providers killing the investment prospects of new business process innovators, and duplicating and destroying any that do manage to get a toe-hold in a market.
In the 'Scale Matters Future', a variety of resources could be available (at feasible capital investment cost) to support differentiated AI services provided by ISVs and incumbent service providers. For example, niche training data for specific domains (potentially too niche to be economically addressed by the hyperscalers) could be aggregated. This offers the possibility of both sustaining and defensive innovation by incumbent organisations. High-end customers can be offered more value through differentiated AI, while low-end customers can be supplied, through automation and cost reduction, with enhanced services currently only available in the most challenging or high-end sectors.
Innovators and new entrants will still be challenged in the Scale Matters Future. The level of this challenge will depend on where the capital investment tariff for hosting near-frontier models comes to rest. Our preliminary analysis suggests that it is likely to be below the low millions of dollars, which is well within the scope of current capital market funding, and it seems probable that it will fall from there. The important point is not just that service providers and ISVs might be motivated to host and train their own models to provide differentiation, but also that model training and hosting will end up as a utility business with utility margins. If such a low capital investment is required to start and sustain LLM hosting, it will be impossible to impose and maintain significant margins for LLM access and utilisation.
There is a third future that is not shown in Figure 6: the 'Scale Marginal Future'.
In the 'Scale Marginal Future', being a hyperscale AI provider has limited or no technical value, and the development of hyperscale AI infrastructure captures little or no value. 'Scale Marginal Futures' offer incumbents the prospect of maintaining their businesses with sustaining and defensive innovation, whilst new entrants will be able to enter and succeed in the market depending on the quality of the innovation that they can develop and provide.
Beyond tech: other factors
So far, we've considered only technical factors as the determinants of the value of scale, but technology was not the reason that hyperscale providers won in search, social, or eCommerce. Google is the flat-out-knock-down winner in the search business but started off as a low-scale insurgent (famously in an office at Stanford Uni). Google got big because it produced better results (for a while) and then managed to monetise and monopolise search advertising. Amazon won in eCommerce by building a massive logistics infrastructure and delivering cheaper and faster. Meanwhile, Facebook displaced the opposition via great marketing and a better product, before developing a moat because the experience it could deliver was driven by the community it had captured and was therefore hard to replicate.
Now no-one can afford to compete with Google or Amazon, because the incumbents can undercut and outlast any challenger, so no-one really tries any more. Could we see a similar situation emerging with AI? The stories of search and eCommerce both seem to be highly path dependent. Google has a near monopoly on the search market, but competitors do exist (DuckDuckGo and Bing), demonstrating that it is not impossible to make the investment and create a search product. It is clear, though, that monetising that investment is very hard when Google has grabbed and holds the advertising eyeballs that pay for it. Amazon secured massive investment in the .com boom and then used the subsequent decade to build out a distribution network that rivals cannot match. If the .com investment cycle had closed earlier and Amazon had been left with less cash, or if the window had stayed open longer and a rival had secured funding in a similar time-frame, then the eCommerce landscape might look very different.
The question is what we can learn from Amazon, Google and Facebook in their sectors, and how that can be applied to tell us more about what may happen with AI. Fundamentally, without a massive technical barrier to entry demanding hyperscale resources, AI providers will need to create moats around customer experience, business model, or execution. Right now it does not seem that any of these mechanisms have been identified, although the hyperscale AI providers certainly have the money to exploit any that they do find. On the upside, this is very much the position Google was in before it invented AdWords. On the downside, there have been thousands of well-funded startups over the last decade and there is only one Google.
Who remembers Lycos now?
References
[Alet et al. 2024] Alet, Ferran, Clement Gehring, Tomás Lozano-Pérez, Kenji Kawaguchi, Joshua B. Tenenbaum, and Leslie Pack Kaelbling. "Functional Risk Minimization." arXiv preprint arXiv:2412.21149 (2024).
[Muennighoff et al. 2025] Muennighoff, Niklas, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. "s1: Simple test-time scaling." arXiv preprint arXiv:2501.19393 (2025).
[UK Government 2025] UK Government AI policy paper. CP 1241, January 2025. ISBN 978-1-5286-5362-6.
[Wu et al. 2023] Wu, Shijie, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. "BloombergGPT: A Large Language Model for Finance." arXiv preprint arXiv:2303.17564 (2023).
[1] See GFT's blog by Aaron Zhao: https://medium.com/gft-engineering/open-source-llms-for-business-delivering-value-ensuring-privacy-and-reducing-costs-57e971651f79