Part 2/2 — Scaling Thomson Reuters’ Language Model Research
with Amazon SageMaker HyperPod
In Part 1 of this series, we covered the business drivers and research objectives behind our custom LLM training research. In this final part, we'll describe how this compute-intensive research required specialized hardware, posing resource-scaling challenges. Through close partnership with Amazon Web Services (AWS), we were able to use Amazon SageMaker HyperPod to address those challenges. HyperPod allowed Thomson Reuters to scale a large cluster of NVIDIA A100 GPUs in a reliable and predictable manner, giving us the scale we needed to explore various approaches to training domain-adapted large language models (LLMs).
Check out the joint presentation Simone Zucchet and I gave on this topic at the AWS Summit in London in April 2024:
Scaling Language Model Training
We knew training LLMs would require significant computing power. Training a large language model of even 7b parameters is a compute-intensive operation requiring multi-node distributed computing capabilities, and those compute nodes need large GPUs or similar hardware. In our case, we focused on NVIDIA's high-performance A100 family of GPUs.
To estimate just how much, we used the Chinchilla scaling law to determine how much training data (in tokens) we would need to retain quality at a given model size. The scaling law comes from published research finding that model size and training-token count should be scaled in proportion, roughly 20 training tokens per model parameter. From there, we used other publicly available information to estimate how much time (in days) training would take with a given number of GPUs.
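The token-budget side of this estimate can be sketched in a few lines. The ~20 tokens-per-parameter ratio below is the widely cited rule of thumb from the Chinchilla paper, not a figure from our own experiments:

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Estimate compute-optimal training tokens for a model of n_params
    parameters, using the ~20 tokens-per-parameter rule of thumb from
    the Chinchilla scaling-law paper (an approximation, not our data)."""
    return n_params * tokens_per_param

# A 7b-parameter model calls for on the order of 140b training tokens,
# in the same ballpark as the ~132b figure we worked with.
print(f"{chinchilla_tokens(7e9) / 1e9:.0f}b tokens")
```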
By our calculations, the compute scale to number of training days is as shown in this table:
So, for example, a 7b-parameter model would require 132b input tokens and take about 7 days to finish training with 64 A100 GPUs (or 8 p4ds). We used p4d EC2 instances, each of which offers 8 NVIDIA A100 Tensor Core GPUs with 40 GB of GPU memory apiece.
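The training-days side of the estimate can be reproduced with a back-of-the-envelope calculation. The 6*N*D FLOPs approximation for transformer training, the A100's 312 TFLOPS bf16 peak, and the 50% utilization figure are all assumptions of this sketch rather than our measured numbers:

```python
def training_days(n_params: float, n_tokens: float, n_gpus: int,
                  peak_flops: float = 312e12, mfu: float = 0.5) -> float:
    """Rough training-time estimate using the common ~6*N*D FLOPs
    approximation for transformer training.

    peak_flops: per-GPU peak (312 TFLOPS is the A100 bf16 tensor-core
    peak); mfu: assumed model FLOPs utilization (0.5 is an optimistic
    assumption, not a measured figure).
    """
    total_flops = 6 * n_params * n_tokens
    effective_rate = n_gpus * peak_flops * mfu  # FLOP/s across the cluster
    return total_flops / effective_rate / 86_400  # 86,400 seconds per day

# 7b params, 132b tokens, 64 A100s -> roughly 6-7 days,
# consistent with the ~7 days in our planning table.
print(f"{training_days(7e9, 132e9, 64):.1f} days")
```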
Scaling Challenges
We quickly discovered that high-performance GPUs are in very high demand. It was virtually impossible to get even 1 p4d, let alone 8 or 16. We know because we tried for over a month to get just 1 on demand, and failed. We also tried alternate regions where there was supposedly more availability, again without luck. Capacity reservations were equally slow to yield p4ds. After a month of trial and error, it became clear that we had a very serious problem: available GPU capacity.
We also looked at a number of third parties that promised to provide the compute scale we required. However, these solutions would require significant portions of Thomson Reuters data leaving the boundaries of our cloud storage, posing potential intellectual property and data security risks.
Solution: Amazon SageMaker HyperPod
AWS is a trusted Thomson Reuters partner, and we asked them how we could scale our training when we couldn't get even 1 p4d in any of our regions. It turned out this was a pain point for many customers, and AWS was actively working on ways to manage large-scale capacity needs that ebb and flow over time.
They introduced us to Amazon SageMaker HyperPod (HyperPod). With HyperPod, the customer communicates the GPU capacity they need over time and HyperPod provides it, managing health and resiliency of the worker nodes.
The following table is the capacity plan we used over the course of 5–6 months.
Outcomes and Next Steps for Us
Finally! With our HyperPod cluster set up, our capacity plan shared with the HyperPod team, and a Labs custom command line interface (CLI) to ease training-job management, we were ready to experiment with training LLMs. This was a journey!
By the numbers: Over the course of ~5 months, we successfully ran ~20 training jobs. We scaled our cluster up to 16 p4ds and our largest job utilized the entire cluster. We trained a 70b parameter model on 400b input tokens and it took 36 days to complete.
The most amazing aspect of this was that we had zero hardware failures! This is perhaps a testament to the pre-flight health checks HyperPod performs on each node before it is made available in the cluster.
Initial Training Runs and Findings
While our experimentation is far from complete, we do have some positive preliminary findings. What I’m sharing here is an informal summary. More detailed analysis and results will be published in the future by the Labs Foundational Research team.
Continuous Pre-Training (CPT)
In continuous pre-training (CPT), you train from an existing open-source LLM checkpoint. This is more than a time-saver; it is a strategic decision that allows for the nuanced growth of the model's capabilities over time.
The preliminary results of our experimentation showed that we were able to train models on the legal domain without losing general knowledge.
We used a measure called perplexity, which quantifies how well the model predicts a sample of text. In essence, perplexity measures the confidence a model has in its predictions; lower perplexity indicates that the model is more certain about them. From the graphs above you can see that as we increased our batches of training, legal perplexity decreased, while general perplexity increased somewhat before quickly leveling off.
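For the intuition, perplexity is just the exponential of the average negative log-likelihood the model assigns to each token. A minimal sketch (this toy function operates on per-token log-probabilities; in practice the value comes out of the training framework's evaluation loss):

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp(average negative log-likelihood per token).
    Lower values mean the model assigns higher probability to the
    text, i.e. it is more confident in its predictions."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model assigning probability 0.25 to every token has perplexity 4:
# it is as uncertain as a uniform choice among 4 options.
print(perplexity([math.log(0.25)] * 10))
```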
Part of our experimentation was determining the right split of domain specific (legal) and general data to train with.
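One way to frame that experiment knob is as a sampling ratio over the two corpora. The sketch below is purely illustrative; the function name and the 0.75 fraction are hypothetical, not the split we actually settled on:

```python
import random

def mixed_batch(legal_docs: list[str], general_docs: list[str],
                legal_fraction: float, batch_size: int,
                seed: int = 0) -> list[str]:
    """Sample a training batch with a fixed fraction of in-domain
    (legal) documents. The fraction is the tunable experiment knob;
    the value used here is illustrative, not a published result."""
    rng = random.Random(seed)
    n_legal = round(batch_size * legal_fraction)
    batch = (rng.choices(legal_docs, k=n_legal)
             + rng.choices(general_docs, k=batch_size - n_legal))
    rng.shuffle(batch)  # interleave domain and general examples
    return batch

batch = mixed_batch(["legal_a", "legal_b"], ["gen_a", "gen_b"],
                    legal_fraction=0.75, batch_size=8)
print(sum(doc.startswith("legal") for doc in batch))  # 6 of 8 are legal
```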
Instruct fine-tuning (IFT)
Instruct fine-tuned LLMs are tuned to respond to specific instructions, enabling tasks such as question answering, summarization, and brainstorming. For instance, human-written instruction datasets include prompts like “summarize this article” or “list fun weekend activities.” Our hypothesis is that Legal LLMs can benefit from diverse legal instructions.
We have discovered that our Legal LLM greatly benefits from a vast array of diverse instructions. By compiling legal instructions, such as drafting legal headnotes, and combining them with publicly available instructions, our MPT-TR-7b model, derived from MPT-7b, has shown improvements that correlate with the number of instruction datasets provided.
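To make the shape of such instruction data concrete, here is a sketch of one record rendered into a training string. The field names and prompt template are illustrative conventions, not the exact format used for MPT-TR-7b:

```python
def format_instruction(record: dict) -> str:
    """Render one instruction-tuning record into a single training
    string. Field names and the prompt template are illustrative,
    not the exact format used for MPT-TR-7b."""
    prompt = f"### Instruction:\n{record['instruction']}\n\n"
    if record.get("input"):  # optional context, e.g. a document to act on
        prompt += f"### Input:\n{record['input']}\n\n"
    prompt += f"### Response:\n{record['output']}"
    return prompt

example = {
    "instruction": "Draft a headnote for the following court opinion.",
    "input": "<opinion text>",
    "output": "<headnote text>",
}
print(format_instruction(example))
```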
We used an automatic measure called ROUGE to determine how well our domain-adapted models performed compared to GPT-4. This measure, based on term overlap, is not the same as human preference judgment, but it gives us some degree of confidence that we are headed in the right direction.
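To show what "based on term overlap" means, here is a toy ROUGE-1 F1 implementation. Real evaluations typically use an established library (e.g. the rouge-score package); this minimal version only illustrates the unigram-overlap idea:

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """Toy ROUGE-1 F1: unigram overlap between a candidate summary
    and a reference summary. Illustrative only; production
    evaluations use a proper ROUGE library with stemming etc."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("the court granted the motion",
               "the court granted the motion to dismiss"))
```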
Legal Summarization
Our MPT-TR-7b model has demonstrated proficiency in legal summarization tasks, rivaling GPT-4’s performance when evaluated with automatic metrics assessing word overlap with reference summaries. While a human-based evaluation would offer deeper insights, the initial results are compelling evidence of the model’s capabilities.
Legal Classification
In other legal tasks, such as classification, measured with accuracy and precision/recall, there is still room for improvement compared to GPT-4. Nonetheless, the performance uptick is evident with the expansion of instruction datasets. Even more exciting is the leap in performance observed with larger base models like MPT-30b.
NOTE: Results for the third task, legal question answering, are not available at this time.
Next Steps
With the advent of even more capable models like Mistral-7b, which matches MPT-30b's performance, we are eager to explore the potential of more recently released models. As a next step, we have been training on Mixtral 8x7b and Llama-3-70b, which appear to give us even better performance than the smaller models we had been training.
Looking ahead, the integration of new alignment methods, such as DPO (Direct Preference Optimization), could further narrow the performance gap, paving the way for the next generation of specialized LLMs that could revolutionize the legal tech landscape. DPO is also attractive from a training-scale perspective: unlike RLHF, it does not require training a separate reward model, which simplifies the process and reduces computational overhead.
HyperPod and LLM Training, Is it Right for You?
HyperPod works. We were able to scale up HyperPod capacity with about two weeks' notice to AWS, and scaling down could often happen faster. We did find it somewhat challenging to plan for capacity, since the results of current experiments would often determine the capacity requirements of the next. We often retained more capacity than we needed for weeks at a time to ensure we would have it when we needed it; the cost of this was somewhat offset by our savings plans with AWS.
Is HyperPod Right for Your Business?
Is HyperPod right for you? The first question to answer is whether it makes sense to train and maintain your own models at all. Think back to some of the benefits I outlined in Part 1: Quality, Agency/Control, and Operational Efficiency. If your business operates in specialized or deep verticals, with knowledge not generally available on the web and domain-specific tasks, it may make sense. At the same time, you'll need to weigh the costs of training and inference, as well as of keeping up with rapidly advancing LLM technology.
Like Thomson Reuters, you might want to start with retrieval-augmented generation (RAG) solutions with off-the-shelf LLMs as a first step, then consider customization options from there. Besides quality, think about the amount of control, speed, and cost your use cases may need.
If you DO decide training LLMs makes sense, you'll need considerable computational power. Depending on your model's parameter count, you'll likely need at least 64 A100 (or better) GPUs, and HyperPod can ensure you have the necessary capacity and resilience. EC2 Capacity Blocks for ML might work for smaller, shorter-running training jobs where resiliency is less of a concern. AWS Trainium instances may also be worth considering, because they currently seem to be a more plentiful resource; however, in our experience, Trainium supports only certain model architectures, versions, and frameworks.
From Thomson Reuters Labs’ Viewpoint
Our research is promising. The possibility of running smaller, tuned models can make good business sense, giving us more autonomy, improved task-specific quality, and reduced operational cost in the long run. We have a long way to go in terms of vetting an end-to-end solution. Research on quality continues, and as things firm up we will need to do further analysis on operational concerns, including designing for enterprise-scale inference/hosting.
While we do all of this, very large commercial LLMs continue to advance. With ever-increasing context windows and growing availability and capacity, whether the benefits of smaller, tuned LLMs will outweigh their costs is still to be determined for us.
💬 Let the conversation begin. Let's talk about it here!