Innovation Demands Compute: How Meta is Enabling ML Productivity and Efficiency at Scale

Oct 31, 2024

Author: Haerang Lee

Meta’s long-term vision is to build artificial general intelligence (AGI) that is open and built responsibly, so that everyone can benefit from it. Large AI models demand enormous compute capacity, and the global shortage of GPUs has implications for all AI companies, large and small.

Meta is no exception. To build superclusters dedicated to state-of-the-art LLMs, support existing products, and continue advancing into new product spaces, Meta had to develop a principled prioritization framework. This blog post examines the implications of AI innovation for Meta’s compute resources, then discusses ways to measure and improve productivity and efficiency to address the problem.

Innovation Demands Compute

At Meta, AI powers a variety of products. The feed-ranking algorithm and face-effects filters bring users engaging, delightful content. AI keeps users safe by detecting Community Standards violations and promotes inclusiveness through auto-generated alt text for the visually impaired. In Reality Labs, computer vision localizes a person in space to enhance the immersive augmented-reality experience. Most recently, breakthroughs in GenAI enabled Meta AI-powered creative assistants such as Imagine, which generates AI images on command.

Additionally, Meta pushes the boundaries of AI through fundamental and applied research, much of it open source. On July 23, 2024, Meta released Llama 3.1 405B, its state-of-the-art large language model and, at the time of release, the world’s largest open-source model of its kind. Innovations like this do not come without significant investment.

As our largest model yet, training Llama 3.1 405B on over 15 trillion tokens was a major challenge. To enable training runs at this scale and achieve the results we have in a reasonable amount of time, we significantly optimized our full training stack and pushed our model training to over 16 thousand H100 GPUs, making the 405B the first Llama model trained at this scale.

-Introducing Llama 3.1

GenAI performance depends heavily on scale because of the scaling laws for neural language models, so it requires significantly more resources than other types of AI. Here, scale means the number of model parameters, the size of the dataset, and the amount of compute. To lead in the AI space is to invest more capacity in it.
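For intuition, these scaling laws take an empirical power-law form (a sketch following Kaplan et al., 2020; the exponents are approximate published fits, not Meta-specific numbers):

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $L$ is test loss, $N$ is parameter count, $D$ is dataset size, and $C$ is compute. Because the fitted exponents are small (roughly 0.05 to 0.1), each constant-factor reduction in loss demands a multiplicative increase in scale, which is why leading models consume so much capacity.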

Driving ROI in a Capacity-Constrained Environment

Productivity and efficiency drive return on investment (“ROI”)

GPUs are a constrained and expensive resource, so it is important to ensure maximal ROI. Investing in productivity and efficiency can positively impact the return and the investment, respectively. Analytics at Meta developed frameworks to define and measure success in these areas to drive ROI.

Productivity represents the relationship between outputs and inputs. A productivity metric can be developed for each stage of ML development: outputs include features engineered, models trained, and models deployed, while inputs include capacity and developer headcount. Productivity is a leading indicator of future returns. In the short run, it signals whether developers have sufficient resources and tooling to develop at a healthy speed. In the long run, increasing outputs across the ML development funnel should accelerate the production of high-quality AI models.

Efficiency represents how well the inputs are used. Most commonly, it is expressed as a percentage of resources that are productive, such as the utilization rate of the GPU fleet. It may also be measured using the amount of resources spent on lower ROI tasks, such as GPU hours spent on overhead or on models that serve lower production traffic. Efficiency optimizes the value we get out of the capacity investment.
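As a toy illustration of these two metrics side by side, here is a minimal sketch in Python; the log schema and every number are hypothetical, not Meta’s actual telemetry:

```python
import pandas as pd

# Hypothetical fleet log: GPU-hours by how they were spent.
logs = pd.DataFrame({
    "category": ["training", "inference", "overhead", "idle"],
    "gpu_hours": [400, 250, 150, 200],
})

total = logs["gpu_hours"].sum()
productive = logs.loc[
    logs["category"].isin(["training", "inference"]), "gpu_hours"
].sum()

# Efficiency: the share of fleet hours doing productive work.
utilization_rate = productive / total  # 0.65 here

# Productivity: output per unit of input, e.g., models trained per
# 1,000 productive GPU-hours (the output count is made up).
models_trained = 13
models_per_kgpu_hour = models_trained / (productive / 1_000)  # 20.0

print(f"utilization: {utilization_rate:.0%}, "
      f"models per 1k GPU-hours: {models_per_kgpu_hour:.1f}")
```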

Efficiency and productivity share levers

In a capacity-constrained environment, the equation below reveals why efficiency and productivity share many levers. Namely, initiatives that improve efficiency by reducing idle or lower-ROI capacity will increase the higher-ROI capacity:
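Writing $C$ for capacity (notation assumed here and in the equations that follow, for illustration):

$$C_{\text{total}} \;=\; C_{\text{idle}} \;+\; C_{\text{lower-ROI}} \;+\; C_{\text{higher-ROI}}$$

With $C_{\text{total}}$ fixed, any reduction in the first two terms flows directly into the third.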

Additional higher-ROI capacity creates room for more output, assuming the resource usage per output is constant:
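$$\text{Output} \;=\; \frac{C_{\text{higher-ROI}}}{\text{resource usage per unit of output}}$$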

Output per developer — a measure of productivity — will increase as a consequence, holding organization size (developer count) constant:
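$$\text{Productivity} \;=\; \frac{\text{Output}}{\text{Developer count}}$$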

In conclusion, the same levers that increase efficiency can also increase productivity.

Sample levers and metric considerations

Analytics at Meta evaluates both technological and human levers for improving productivity and efficiency. Here are some sample levers and analytics considerations for four key ROI-driving objectives.

  • Optimize the level of idle capacity: Some capacity is idle because it is intentionally reserved. Disaster-recovery and push buffers should remain unused so that they can fulfill their purpose on short notice. Other forms of idleness, on the other hand, represent resources that should be used but aren’t. They can result from a mismatch between the demand for and supply of non-fungible hardware, or from suboptimal job-allocation processes. The goal is to optimize the level of idleness, not eliminate it. Analytics relies on resource logs (e.g., real-time metrics from GPU hosts) to identify idle resources, then partners with engineering to determine the optimal level; a sketch of this bucketing appears after this list. For example, improving hardware fungibility can allow training jobs to utilize idle inference capacity.
  • Optimize lower-ROI capacity: To enable training and inference of high-ROI models, we inevitably incur overhead such as data loading. By design, the system will also incur some waste when high priority jobs preempt other jobs. Again, the goal is to optimize — not eliminate — lower-ROI usage. Analytics invested in strong data foundations to increase logging coverage and metric aggregation across diverse tech stacks. It also invested in model lifecycle metadata, which was used to identify the models that served low production traffic. As a result, we have increased visibility into the system’s overhead and ROI, which helps us identify and reclaim lower-ROI capacity.
  • Increase output per capacity usage: Some technological frameworks improve the performance of the model at the sub-GPU level (e.g., via better parallelization). For example, PyTorch 2.0 automatically optimizes performance and yields shorter training times than PyTorch 1.x; a sketch follows this list. To fully understand this impact, Analytics leverages metrics that cover machine performance (e.g., GPU core utilization) and user adoption.
  • Increase developer productivity: Analytics stood up new metrics for productivity. We found that increased productivity speeds up the pace of innovation. We also found that seasonality, capacity, and organizational shifts toward different model types can affect productivity. We closely monitor development velocity — how long each stage of the model development lifecycle takes — as a mechanism to improve overall productivity.
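To make the first two levers concrete, here is a minimal sketch of how capacity samples might be bucketed from host-level resource logs. The schema, field names, and thresholds are illustrative assumptions, not Meta’s production pipeline:

```python
from dataclasses import dataclass

# Hypothetical host-level log record; field names are assumptions.
@dataclass
class HostSample:
    host_id: str
    gpu_util: float    # fraction of the sample window the GPU was busy
    reservation: str   # e.g., "none", "disaster_recovery", "push_buffer"
    traffic_qps: float # production traffic served by the hosted model

def classify(sample: HostSample, idle_util: float = 0.05, low_qps: float = 1.0) -> str:
    """Bucket one capacity sample for efficiency reporting."""
    if sample.reservation != "none":
        return "reserved"     # intentionally idle: buffers stay unused by design
    if sample.gpu_util < idle_util:
        return "idle"         # should be used but isn't: a reclaim candidate
    if sample.traffic_qps < low_qps:
        return "lower_roi"    # busy, but serving little production traffic
    return "higher_roi"

samples = [
    HostSample("h1", 0.02, "none", 0.0),
    HostSample("h2", 0.90, "none", 250.0),
    HostSample("h3", 0.00, "disaster_recovery", 0.0),
]
print([classify(s) for s in samples])  # ['idle', 'higher_roi', 'reserved']
```

Aggregating these buckets over the fleet yields both the utilization rate and the reclaimable lower-ROI capacity discussed above.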
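For the third lever, the PyTorch 2.x torch.compile path is a minimal example of a sub-GPU-level optimization; the model and sizes below are placeholders:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

# PyTorch 2.x: the first call triggers compilation; subsequent
# forward/backward passes reuse the generated fused kernels instead
# of eager op-by-op dispatch, which is where the training-time win
# over PyTorch 1.x comes from.
compiled = torch.compile(model)

x = torch.randn(32, 1024)
loss = compiled(x).sum()
loss.backward()
```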

Takeaways

Operating AI at Meta’s scale is a complex challenge, but Analytics at Meta has made significant strides in unblocking development. It has built strong data foundations to track metrics at various levels of abstraction, from system logs to MLE behavior, and it plays a crucial role in identifying opportunities and measuring impact so that everyone at Meta, from Infra to Product, can continue to innovate. While the work is far from done, Meta is well on its way to achieving its goals in the pursuit of AGI.

---

Many thanks to the reviewers: Morgan Henry, Michael Tanenbaum, Crystal Distin, Betty Li, Ryan Neo, Mark Bittmann, Matthew Williams, Jason Schissel, Zeynep Erkin Baz, Brian Conley, Charles Christian, Lee Howes, Adwait Sathye, Gautam Shanbhag, Chip Turner
