Regulating AI: The Limits of FLOPs as a Metric

An Argument for (if one must) Regulating Applications, Not Math

Ingrid Stevens
May 1, 2024

As models like Llama 3 400B move from training to release, regulation of artificial intelligence (AI) may move from words into action. The US and the EU have written their respective AI regulations to use FLOPs (floating point operations), an approximation of the computational resources used in model training, as a proxy for model capability, and therefore as a way to determine which models to regulate. However, this approach may not fully capture the complexity and potential risks of AI models, as it focuses primarily on computation rather than on the risks associated with the models’ potential applications.

This article aims to highlight the limitations of using computational power as a regulatory proxy and advocates for a more comprehensive approach to AI regulation that takes into account other factors such as the quality of training data and the specific applications of the models.

Llama 3 and Regulatory Thresholds

The release of Llama 3 on April 18th sparked a discussion on X about the ‘strength’ of different AI models and their proximity to regulatory thresholds. AI researcher Andrej Karpathy did some napkin math, noting that the FLOPs used to train the hotly anticipated 400B Llama 3 model are likely to approach the threshold at which it would be subject to regulation.

Karpathy’s analysis highlighted that the training for Llama 3 was novel in a couple of ways: it was trained on 15 trillion tokens, and the quality of the training data was superior to that of previous Llama models (by contrast, Llama 2 was trained on 2 trillion tokens). Normally, the number of training tokens is chosen using the “Chinchilla-optimal” heuristic of roughly 20 tokens per parameter, which would argue for training the 8B-parameter model on only about 200 billion tokens. However, Meta found that Llama 3’s performance continued improving even when trained on much more data. Noting this, Karpathy concludes that it’s likely that “the LLMs we work with all the time are significantly undertrained by a factor of maybe 100–1000X or more, nowhere near their point of convergence”, indicating that with more training, current models could be significantly better.

Current models may very well be significantly undertrained, and increasing the training data necessarily increases the FLOPs that go into a model, which in turn affects whether the model is subject to regulation on the basis of its perceived strength. This got me thinking about an interview a16z had with Arthur Mensch, co-founder of Mistral, who argued for regulating the applications of AI models rather than the underlying computations (i.e., math). This perspective forms the basis of the argument presented in this article.

In the next section, I’ll show Karpathy’s napkin math and walk through how to estimate the computing power used to train a model, since that figure is what determines whether a model falls under these regulations.

Let’s Calculate the FLOPs (computing power) Used to Train AI Models


Computing power for an AI model is often measured in terms of floating point operations (FLOPs). FLOPs represent the number of floating-point calculations (additions, multiplications, etc.) performed during training.

One heuristic for estimating the FLOPs used to train a model was developed by OpenAI and used by Google for PaLM 2. Here is the formula:

FLOPs = 6 × N × D (where N is the number of parameters and D is the number of training tokens)

To keep things simple, I’ll use this heuristic to calculate the FLOPs for each Llama 3 version:

Llama 3 8B

  • 8B parameters, trained on 15T tokens of training data
  • → therefore FLOPs = 6 × 8B × 15T = 7.2e23 = 7.2 × 10²³

Llama 3 70B

  • 70B parameters, trained on 15T tokens of training data
  • → therefore FLOPs = 6 × 70B × 15T = 6.3e24 = 6.3 × 10²⁴

Llama 3 400B (still in training as of 1 May 2024)

  • 400B parameters, trained on 15T tokens of training data
  • → therefore FLOPs = 6 × 400B × 15T = 3.6e25 = 3.6 × 10²⁵
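
To make the arithmetic reproducible, here’s the same 6ND estimate as a short Python sketch (the parameter and token counts are the figures quoted above):

```python
# Estimate training FLOPs with the 6ND heuristic: FLOPs = 6 * parameters * tokens.
def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

# Parameter and token counts as quoted in this article.
models = {
    "Llama 3 8B": (8e9, 15e12),
    "Llama 3 70B": (70e9, 15e12),
    "Llama 3 400B": (400e9, 15e12),
}

for name, (n, d) in models.items():
    print(f"{name}: {training_flops(n, d):.1e} FLOPs")
# Llama 3 8B: 7.2e+23 FLOPs
# Llama 3 70B: 6.3e+24 FLOPs
# Llama 3 400B: 3.6e+25 FLOPs
```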

Given these calculations, let’s now examine how these figures relate to the current regulatory landscape.

What is the Current Regulatory Landscape in the US & EU?


AI in the United States (US): Reporting Requirements

An October 2023 executive order, the “Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence,” requires compliance for AI models trained using significant computational power: models trained using more than 10²⁶ FLOPs, or models trained primarily on biological sequence data using more than 10²³ FLOPs.

The order states that technical conditions should be defined so that models with potential for “malicious cyber-enabled activities” can be identified, and until these conditions are defined, models surpassing 10²⁶ FLOPs and trained on co-located computing clusters are considered capable of supporting such “malicious cyber-enabled” activities.

AI in the European Union (EU): Reporting Requirements

The EU AI Act, passed in March 2024, sets the reporting requirement for models at 10²⁵ FLOPs, since “general purpose AI models that were trained using a total computing power of more than 10²⁵ FLOPs are considered to carry systemic risks, given that models trained with larger compute tend to be more powerful.”

The EU justifies this by noting that this “threshold captures the currently most advanced GPAI models, namely OpenAI’s GPT-4 and likely Google DeepMind’s Gemini,” and that because such models could pose systemic risks, “it is reasonable to subject their providers to the additional set of obligations.”

In both the US and EU regulations, FLOPs serve as a proxy, with the US reporting requirement being ten times higher than that of the EU.

With this in mind, let’s take a look at where Llama 3 400B is expected to fall:

How Does Llama 3 Fare With Respect to the Reporting Requirements?

First, a recap of the US and EU reporting requirements

US Requirement: 1e26 = 10²⁶
EU Requirement: 1e25 = 10²⁵

To compare, I’ll look only at the Llama 3 400B parameter model, as it is the only one that starts to get close. The estimated FLOPs for Llama 3 400B are 3.6 × 10²⁵ (aka 3.6e25).

Llama 3 400B is:
3.6x the reporting requirement for the EU.
0.36x the reporting requirement for the US.
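
The same comparison, in a few lines of Python:

```python
# Ratio of the Llama 3 400B estimate to each reporting threshold.
llama3_400b_flops = 3.6e25

thresholds = {"EU": 1e25, "US": 1e26}
for region, limit in thresholds.items():
    print(f"{region}: {llama3_400b_flops / limit:.2f}x the reporting requirement")
# EU: 3.60x the reporting requirement
# US: 0.36x the reporting requirement
```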

Questioning the Regulatory Metrics

Considering how close Llama 3 400B is expected to come to these regulatory limits raises a question: why was FLOPs chosen as a proxy, and does it truly encapsulate the complexity and potential risks associated with AI models?

Why FLOPs?

While computational power is a measurable component of the resources that go into a model’s training, using it as a proxy overlooks other critical factors that influence a model’s performance. The quality of training data, for instance, plays a pivotal role in how a model performs and behaves. You can spend the same number of FLOPs on poor training data and get a far inferior model; the ImageNet “Bugs in the Data” paper illustrates how flawed training data harms a model. Conversely, you can spend fewer FLOPs on very good training data and get a very capable model, as shown in Microsoft Research’s paper on “textbook-quality” data, which demonstrates that high-quality data (including synthetically generated data) can yield a highly performant model with less training data.

Making Models More Capable Without Adding FLOPs

A model’s capabilities can also be augmented after training is complete, through techniques such as fine-tuning and retrieval-augmented generation (RAG). This is yet another aspect these regulations don’t take into account.

As Arthur Mensch, co-founder of Mistral, points out in an interview with a16z, LLMs are neutral tools that can be fine-tuned and utilized in various ways, both positive and negative. Open-source models, in particular, have an advantage due to the ability to fine-tune them extensively, potentially amplifying their performance beyond initial expectations. These changes to the models post-training have significant effects on the capability of the models.
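
As a rough illustration of how little additional compute this can take, here is a sketch of a parameter-efficient (LoRA) fine-tune using the Hugging Face transformers and peft libraries; the model ID, target modules, and hyperparameters are illustrative assumptions, not a recipe used by Meta or Mistral:

```python
# Illustrative sketch only: LoRA fine-tuning trains small adapter weights on top of
# a frozen base model, so the extra compute is a tiny fraction of pre-training FLOPs.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # assumed model ID

lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora_config)

# Typically well under 1% of parameters are trainable; training then proceeds
# with any standard trainer on a task-specific dataset.
model.print_trainable_parameters()
```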

Another enhancement technique is Retrieval-Augmented Generation (RAG). RAG combines the benefits of pre-trained language models with the ability to use external knowledge. In a RAG system, the model retrieves relevant documents or information from a large corpus and then generates a response based on both the retrieved information and the original input. This allows the model to provide more informed and contextually relevant responses, and it makes the model more capable in whatever domain the retrieved data covers.

The implications of using RAG systems and similar techniques are significant. They allow AI models to go beyond their initial training data and leverage a vast amount of external information. This can greatly enhance the performance and capabilities of the models, allowing them to generate more accurate and contextually relevant responses without using additional FLOPs.
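
To make the pattern concrete, here is a minimal, framework-agnostic sketch of the retrieve-then-generate loop; `embed`, `vector_index`, and `llm_generate` are hypothetical stand-ins for whatever embedding model, vector store, and LLM an application actually uses, and nothing here touches the base model’s weights or its training FLOP count:

```python
# Minimal sketch of the RAG pattern. The embed, vector_index, and llm_generate
# arguments are hypothetical stand-ins, not a specific library's API.

def retrieve(query: str, vector_index, embed, k: int = 3) -> list[str]:
    """Return the k documents whose embeddings are closest to the query."""
    return vector_index.search(embed(query), k=k)

def rag_answer(query: str, vector_index, embed, llm_generate) -> str:
    """Answer a query from retrieved context plus the unmodified base model."""
    context = "\n\n".join(retrieve(query, vector_index, embed))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return llm_generate(prompt)
```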

Regulate Applications, Not Math

FLOPs, while used as a metric to estimate a model’s capabilities, may not accurately reflect its potential for malicious use. While complex models may demand extensive computational resources, this doesn’t inherently imply harmful intent. Many legitimate endeavors, such as scientific research or medical analysis, require substantial computing power. Relying solely on infrastructure characteristics for regulation risks overlooking the ethical considerations and oversight mechanisms necessary for responsible AI deployment. Arbitrary thresholds could lead to overreaching regulations that hinder legitimate research and fail to address the root causes of malicious activities effectively.

To address these issues, regulatory frameworks should consider the specific applications and potential misuse of AI models. By focusing on these aspects, regulations can better serve their purpose of protecting society without stifling innovation. Just as regulating hammers based solely on their force would overlook their diverse applications (breaking windows and hanging paintings), regulating AI mainly on computational power overlooks the nuances of its usage.

The Need for a Comprehensive Regulatory Approach

The case of Llama 3 and its various iterations underscores the need for a more comprehensive approach to AI regulation. While metrics like FLOPs provide a convenient means of assessing model capabilities, they fail to capture the nuanced factors that influence AI’s impact on society. As we move forward in regulating AI, it’s imperative to recognize that computational power alone does not dictate a model’s potential for harm or benefit; notably, any model can be made measurably better after training with fine-tuning or RAG. Regulating the mathematics behind AI models, rather than their applications, risks limiting beneficial uses and innovation.

Recognizing that computational power alone does not adequately represent a model’s potential for harm or benefit, it becomes clear that a nuanced regulatory framework must prioritize the oversight of AI applications and their potential impacts on society. As Mensch aptly states, “If you look at what an LLM does, it’s not really different from a programming language. It’s used very much like a programming language by the application makers.” This insight underscores the fact that AI models are utilized within specific applications, emphasizing the importance of regulating how they are used.

As AI continues to permeate various aspects of our lives, regulatory frameworks must evolve accordingly. Failure to adopt a nuanced approach risks stifling innovation while inadequately addressing the ethical and societal implications of AI deployment. By fully understanding why computational power isn’t a great proxy and focusing on applications, AI can be better regulated to serve as a force for good while mitigating potential risks to society.

Regulation of Math | Image generated by AI
