Part 1/2 — Scaling Thomson Reuters’ Language Model Research

Business Drivers and Research Objectives

John Duprey
Thomson Reuters Labs
8 min read · Jul 16, 2024


2023 proved to be an inflection point for generative AI, prompting Thomson Reuters to consider how its high-value, curated data could improve language models for customer-specific tasks.

In this two-part series, I explore the journey Thomson Reuters took to enable our cutting-edge research in training domain-adapted LLMs with Amazon SageMaker HyperPod. In this first part, I'll go over developments in the LLM landscape in 2023, the business drivers, and the research we set out to conduct.

Update: Part 2 is now available.

Large Language Models Disrupt the Industry, Businesses Move to Adopt

In 2023, leaders in the generative AI space, such as OpenAI, released groundbreaking large language models (LLMs) that drastically improved on the capabilities of previous models. Many others followed suit. The resulting technology opened new doors to enhancing customer experiences — tailoring content, recommendations, and responses to individual customers in natural chat-like interfaces.

Generated with AI (Microsoft Copilot Designer, DALL-E 3)

In retrospect, generative AI's impact seems obvious, but the speed and force with which it arrived in 2023 were undeniably disruptive. For many businesses, the race was on to bring this technology into their products to maintain or gain competitive advantage. Thomson Reuters was no exception and keenly felt the need to help its customers be successful in this burgeoning, AI-augmented world.

However, as with any technology, proper application and understanding of its limitations is critical. How effectively a business addresses these limitations can make or break it. Consider the following:

  • Large Language Models Make Things Up: Large language models have a tendency to hallucinate, or confabulate. Simply put, they may generate text that is not true! LLMs are quite remarkable in their ability to respond to natural language, and they clearly encode significant amounts of knowledge. However, the stochastic nature of the technology means that responses are based on the probability of word occurrences. An LLM doesn't model "facts" so much as it models language: it has no idea whether the words (tokens) it generates are factually correct, though it may have successfully modeled the correct sequence of words to represent facts. (The toy sketch after this list illustrates the point.)
  • Public Language Models May Not Be of Sufficient Quality: While the general knowledge encoded in the latest LLMs is remarkably good, it may not be enough for your business or customer domains. Public and commercial LLMs are trained on the knowledge of the Internet — not on what's behind your business' closed doors. Adding to the problem, bias and factually incorrect information exist on the Internet, and there often isn't enough transparency into what data commercial models are trained on and how. Further, an LLM only encodes knowledge up to its last training run; it may not be up to date, and businesses don't control the frequency of retraining.
  • Large Language Models Can Be Slow, Expensive, or Lack Sufficient Capacity: Depending on your use cases, you may find existing commercial LLMs too slow, too expensive, or in such high demand that you cannot purchase enough capacity to meet your requirements. (This may be only a temporary challenge; we've observed increasing capacity and falling costs as hardware, optimizations, and economies of scale continue to improve.)
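To make the "models language, not facts" point concrete, here's a toy sketch of how generation works: the model assigns probabilities to candidate next tokens, and the decoder samples from that distribution. (The numbers below are invented for illustration; they don't come from any real model.)

```python
import random

# Toy next-token distribution a model might assign after the prompt
# "The capital of Australia is". The probabilities are invented for
# illustration; they reflect word patterns in training text, not a
# lookup of facts.
next_token_probs = {
    "Canberra": 0.55,    # the correct answer is often the most likely...
    "Sydney": 0.30,      # ...but plausible-sounding errors get mass too
    "Melbourne": 0.10,
    "Auckland": 0.05,
}

def sample_token(probs, temperature=1.0):
    """Sample one token; higher temperature flattens the distribution."""
    tokens = list(probs)
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(tokens, weights=weights, k=1)[0]

# Even a well-trained model occasionally emits the wrong city.
print([sample_token(next_token_probs) for _ in range(10)])
```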

Every business must take these limitations into consideration and devise strategies to address them. Thomson Reuters is no exception.

Thomson Reuters’ customers have no tolerance for inaccuracies. They are professionals with discerning information needs in legal, corporate, tax, risk, fraud, compliance and news domains.

Take, for example, the legal domain. US law is based on legal precedent — the outcomes of past trial cases are used to determine decisions in new cases. Not only does Thomson Reuters curate and enhance the publicly available content published by the courts; it also has centuries of editorial content that analyzes and reflects on all aspects of the law. Legal research is a critical area for Thomson Reuters customers — it needs to be as complete as possible, and there is no tolerance for made-up facts or citations. Thomson Reuters products should not misrepresent the facts — they should not hallucinate or confabulate information.

The reality is that LLM hallucinations are a phenomenon that persists even when techniques such as retrieval-augmented generation (RAG) are used to minimize it. Solutions should be designed to allow for user verification. We should also set clear expectations with our users on effective use of our generative AI products. (See this recent Legal Current article "Our Commitment to our Customers" by Mike Dahn, head of Westlaw Product Management, Thomson Reuters.)

Thomson Reuters Gen AI Strategy — Buy, Build, Partner

Pillars Generated with AI (Microsoft Copilot Designer, DALL-E 3)

Thomson Reuters' generative AI strategy includes investing an additional $100M per year on AI over the next several years, with a buy, build, and partner approach. In early 2023, we pursued this aggressively. While we worked to acquire Casetext's CoCounsel for $650M, we also began building our first round of generative AI products. By November, we had our first beta of Westlaw's AI-Assisted Research; it went to production in December. This was closely followed by Ask Practical Law AI in January, and more products are on the way. Lastly, Thomson Reuters Ventures is actively seeking investment and partnership opportunities with emerging AI startups.

This first round of solutions largely centered on retrieval-augmented generation (RAG) patterns and commercially available LLMs — such as OpenAI's GPT-4 and Anthropic's Claude. RAG allowed us to address some of the limitations around missing domain knowledge and helped mitigate the risk of hallucination. Rather than exposing a bare LLM to our customers, the LLM is a central component of a larger, intelligent system.
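To illustrate, here's a minimal, self-contained sketch of the RAG pattern. It's illustrative only: the toy keyword retriever and the stubbed LLM call stand in for a real search index and a commercial model API, and none of the names reflect our production system.

```python
# Minimal RAG sketch: retrieve curated passages, then ground the LLM in them.

DOCUMENTS = [
    "Primary law: court-published opinions, enhanced with editorial citations.",
    "Secondary law: analysis and annotation by subject-matter experts.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy retriever: rank documents by keyword overlap with the query."""
    terms = set(query.lower().split())
    return sorted(DOCUMENTS,
                  key=lambda d: len(terms & set(d.lower().split())),
                  reverse=True)[:k]

def call_llm(prompt: str) -> str:
    """Stub for a commercial LLM call (e.g. GPT-4 or Claude behind an API)."""
    return "[model answer grounded in the supplied context]"

def answer_with_rag(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    # Constraining the model to retrieved, curated passages supplies
    # domain knowledge and mitigates (but does not eliminate) hallucination.
    prompt = ("Answer using ONLY the context below; cite the passage used.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return call_llm(prompt)

print(answer_with_rag("What is secondary law?"))
```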

We also used multiple LLM vendors, where quality was comparable for specific tasks, to add capacity and avoid a single point of failure.
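One common way to implement this, sketched below with assumed stand-in vendor callables (real ones would wrap, say, the OpenAI and Anthropic SDKs), is to try comparable providers in randomized order and fall through on rate limits or outages.

```python
import random

def call_vendor_a(prompt: str) -> str:
    raise TimeoutError("simulated rate limit")  # pretend vendor A is saturated

def call_vendor_b(prompt: str) -> str:
    return f"vendor B answer to: {prompt!r}"

VENDORS = [call_vendor_a, call_vendor_b]

def robust_completion(prompt: str) -> str:
    """Try vendors in random order so no single provider is a hard dependency."""
    for call in random.sample(VENDORS, k=len(VENDORS)):
        try:
            return call(prompt)
        except (TimeoutError, ConnectionError):
            continue  # fall through to the next vendor
    raise RuntimeError("all LLM vendors unavailable")

print(robust_completion("Summarize this opinion."))
```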

And we are now launching an enterprise-wide generative AI platform for a unified customer experience that spans products. But that’s a story for another time!

While we were working on those near-term RAG solutions, the Labs foundational research team started work on how we might build our own LLMs. This research work, particularly around our journey to scale training with Amazon SageMaker HyperPod, will be the focus for the rest of this story.

Our Research

Research Workbench Generated with AI (Microsoft Copilot Designer, DALL-E 3)

Thinking back on the limitations of publicly available commercial models (hallucinations, lack of domain knowledge in customer verticals, lack of transparency in training data and processes, slow or costly operation, etc.), we asked ourselves the following questions:

  • Can Thomson Reuters’ editorially created, curated or enhanced data be used to improve LLM knowledge for specific business tasks?
  • Would smaller LLMs (e.g., 12–30B parameters) trained with Thomson Reuters data perform on par with very large LLMs with upwards of a trillion parameters?
  • What methods could we employ to train the Thomson Reuters domain specific models to get the best results?

The potential benefits, as we saw them, fell into three areas: quality, agency, and operational efficiency.

With full access to model training, we could potentially tune LLM generation to our domain and allow for tighter RAG integration. This would directly impact quality. And if we own the models, we control how and when they are trained and updated; we would have full audit and control over what goes into training. Lastly, if smaller tuned models could perform sufficiently well, they could be a more cost-effective and scalable solution — improving overall operational efficiency.

LLM Training Experimentation

Our research focused on answering these specific questions:

  • How well do foundational models (in the 7–30B parameter range) perform on specific tasks, unmodified? (This would be our baseline.)
  • Does performance improve for specific tasks when augmented with Thomson Reuters domain-specific data using various training techniques?

To frame this research and give us concrete evaluation targets, we focused on a number of real-world tasks: legal summarization, classification, and question answering. We used publicly available general textual data, plus domain-specific textual data from our comprehensive stores of primary and secondary US law material. Primary law includes content published by the courts and enhanced by TR; secondary law includes TR subject-matter experts' analysis and annotation of the law.

Finding the Right Training Recipe Requires Experimentation

There's a lot of public research on training and adapting LLMs. We knew we would need to run a series of experiments: training LLMs from 7B to 30+B parameters, starting with a foundational model and continuing its pre-training (using various techniques) on a mix of TR and general data. We would then finetune the model and evaluate how much better it performed on our specific legal tasks, while also checking for any loss in general knowledge or language understanding.

1. Continuous Pre-Training: By further pre-training an existing foundational model, we wished to enrich its understanding of legalese without compromising its general language abilities. This was largely an experiment in finding the right mix of domain and general training data to retain general knowledge while increasing domain-specific knowledge. We used perplexity to measure the impact of domain-specific training on the model's general-knowledge capabilities. (A minimal perplexity sketch follows this list.)

2. Instruction-Finetuning: This would be an exercise in generating impactful instruction datasets covering legal and general tasks. We experimented with finetuning open-source foundational models, such as MPT, Flan-T5, and Mistral, and compared them against industry-standard commercial models, like OpenAI's GPT-4. We used ROUGE to measure how well models performed on tasks. (A ROUGE scoring sketch also follows below.)
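As a concrete illustration of the perplexity check in step 1, here's a minimal sketch using the Hugging Face transformers library. The checkpoint and held-out text are placeholders, not what we actually used; the idea is to score a held-out general corpus after each round of domain pre-training, where a rising number signals loss of general language ability.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint, small enough to run anywhere; swap in a 7B+
# model (e.g. a Mistral variant) for a realistic check.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # With labels == input_ids, the model returns mean token
        # cross-entropy; exponentiating it gives perplexity.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

held_out_general = "The quick brown fox jumps over the lazy dog."
print(f"general-domain perplexity: {perplexity(held_out_general):.1f}")
```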
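And for step 2, here's a small sketch of ROUGE scoring using the Hugging Face evaluate library — one common choice, not necessarily our actual harness. The summary pair below is invented for illustration.

```python
import evaluate  # Hugging Face evaluation library

rouge = evaluate.load("rouge")

# Invented example: a model-generated legal summary vs. a reference.
predictions = [
    "The court held the contract unenforceable for lack of consideration."
]
references = [
    "The court ruled the agreement unenforceable because consideration was absent."
]

# Returns rouge1/rouge2/rougeL/rougeLsum F-measures in [0, 1]; higher
# n-gram overlap with the reference summary scores higher.
scores = rouge.compute(predictions=predictions, references=references)
print({k: round(v, 3) for k, v in scores.items()})
```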

In short, our experimentation would involve lots of LLM training!

To be continued…

In this first part of our two-part series, we talked about how generative AI and large language model (LLM) technology exploded in 2023. This created new business drivers that shaped Thomson Reuters' AI strategy. One area involved exploring LLM customization.

This customization would require a computational scale we had not tackled before. That is where our engineering and technical challenges begin! Come back for part 2 where we talk about these scale challenges and how we addressed them.

💬 Let the conversation begin in the comments here or on our LinkedIn Group!
