Putting the (Gen AI) Puzzle Together
A Comparative Analysis of FT and NIKKEI’s Approaches to LLMs
The rise and commoditization of Generative AI. What does it mean for publishers?
The rise and commoditization of Generative AI over the last couple of years has put complex strategic decisions and challenges in front of news publishers all over the world. Alongside the significant opportunities (optimising mundane and repetitive tasks, enabling new use cases and freeing up focus for high-value activities), there have been sizeable threats as well, some of which are brilliantly illustrated by a NIKKEI short film about the ethical and legal issues Generative AI raises for (Japanese anime) creators.
The Financial Times and its parent company Nikkei have been no exception to either the opportunities or the threats. Both companies have carried out research and experimentation and have started releasing user-facing features that utilise the power of Generative AI. Their approaches have differed, each aligned to the specific situation of the publisher. The teams at the FT and Nikkei have had an active exchange of knowledge and learnings, which has helped each build on top of the other’s efforts.
In April 2024, Nikkei issued a press release announcing the internally built and trained Nikkei Language Models (NiLM). NiLM is a series of LLMs specialising in economic and financial information, trained on 40 years’ worth of content from Nikkei group titles, and is among the largest and highest-quality LLMs in Japanese. In November 2023, Nikkei released Minutes by NIKKEI, a new digital media outlet targeted at younger audiences, which aims to “allow you to understand world trends in just three minutes a day” and builds on top of NIKKEI Tailor, an AI newsroom technology.
The FT has been doing internal research on how the newsroom can utilise Generative AI to help process vast amounts of data for newsworthy signals, identifying stories early and surfacing patterns, trends and events from unstructured data, while always keeping a human in the driving seat, as outlined in the letter on Generative AI from the FT’s Editor, Roula Khalaf. In March 2024, FT Professional released a generative Ask FT feature to a limited audience, answering questions using FT content as sources.
I am Kristina Stoitsova — a Director of Data Science and AI at the FT. In this article I interview Hiroyuki Okamoto (AI Product Lead at the Nikkei Innovation Lab), Shotaro Ishihara (Senior Fellow at the Nikkei Innovation Lab), and Krum Arnaudov (Senior Data Scientist, FT) to understand and compare the approaches they have taken with Generative AI and to summarise their learnings.
Under the hood of Nikkei releases and experimentation
Q: Shotaro, Nikkei recently announced the creation of NiLM. The press release mentioned several different approaches: pre-training from scratch as well as fine-tuning, including continuous pre-training and instruction tuning. Could you please elaborate on your decision to experiment with those methods? What specific objectives did you want to reach, and what challenges or deficiencies prevented you from using generally available pre-trained LLMs?
A: Shotaro Ishihara
LLMs are beginning to be applied in many businesses and continue to create new experiences that have never been seen before. However, LLMs also pose ethical challenges, such as not reflecting the latest information, hallucination caused by training data, and the high likelihood of unauthorised use of data from various media. In order to use LLMs in a responsible manner as a news organisation, we decided to build and examine LLMs using only data to which we hold the copyright or usage rights. Our newspaper article data contains the latest information and is verified by many people before publication. It is not yet well understood how these properties affect the building of LLMs.
Given this background, pre-training from scratch using only our own data is a logical approach. It is distinct from LLMs that are pre-trained on sometimes poor-quality web-scraped corpora. In addition, we pursued continuous pre-training, which has received attention for non-English languages such as Japanese: using existing pre-trained weights as a starting point, fine-tuning with only limited data can potentially achieve strong linguistic performance. Comparing the two approaches also allows us to properly assess the feasibility of pre-training from scratch. The primary goal of our project is not to beat existing LLMs like ChatGPT, but to explore appropriate ways for a news organisation to engage with LLMs.
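For readers who want a concrete picture of what continuous pre-training looks like in practice, below is a minimal sketch using the Hugging Face transformers library: an existing pre-trained causal language model is loaded and training simply continues, with the standard language-modelling objective, on an in-house corpus. The model name, data path and hyperparameters are illustrative placeholders, not Nikkei’s actual configuration.

```python
# Minimal sketch of continuous pre-training with the Hugging Face
# transformers library. Model name, data path and hyperparameters are
# illustrative placeholders, not Nikkei's actual configuration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "some-japanese-base-model"  # hypothetical pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Licensed, in-house news text, one article per line (hypothetical path).
dataset = load_dataset("text", data_files={"train": "nikkei_articles.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard causal language-modelling objective (no masking), i.e. the same
# objective as the original pre-training, just continued on new data.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="continued-pretraining",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,   # typically lower than when pre-training from scratch
    num_train_epochs=1,
    bf16=True,
)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```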
Q: Shotaro, could you please list the main technical challenges along the way? What would your advice be to anyone starting off with those methods?
A: Shotaro Ishihara
As you all know, building LLMs involves many challenges. Research and development is advancing daily around the world and we need to keep up with cutting-edge trends. The most striking technical challenge was that we do not know the result until we actually do it. The behaviour of LLMs is often non-obvious, especially in non-English languages, and each approach required a variety of experimental settings.
Even after building the LLMs, challenges remain. We have observed that even LLMs pre-trained only on data containing accurate statements sometimes produce hallucinations. We also found that LLMs memorise the data used for pre-training and sometimes output that data verbatim. This phenomenon is a challenge that news organisations cannot overlook, because it poses the risk of leaking private information and violating copyrights. We take this matter seriously and have published a survey paper.
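As an illustration of how such verbatim memorisation can be probed, one simple approach is to prompt a model with the opening of a training article and measure how many n-grams of its continuation reappear verbatim in the original. The sketch below is a generic example of that idea, not the methodology of Nikkei’s survey paper; `generate()` is a hypothetical wrapper around the model under test.

```python
# Illustrative probe for verbatim memorisation: prompt the model with the
# start of a training article and measure n-gram overlap between its
# continuation and the true continuation. Generic sketch only.

def ngrams(tokens, n=8):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(generated: str, reference: str, n: int = 8) -> float:
    """Share of n-grams in the generated text that also occur verbatim
    in the reference (training) text."""
    # Whitespace tokenisation is used here for simplicity; for Japanese,
    # character n-grams or a morphological tokenizer would be more suitable.
    gen_ngrams = ngrams(generated.split(), n)
    if not gen_ngrams:
        return 0.0
    return len(gen_ngrams & ngrams(reference.split(), n)) / len(gen_ngrams)

# Hypothetical usage, where generate() wraps the model under test:
# prompt, continuation = article[:200], article[200:]
# print(overlap_ratio(generate(prompt), continuation))  # near 1.0 => memorised
```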
On the non-technical side, trial and error takes time because of the sheer volume of data and the size of the models involved. Computational resources are also essential, and of course the financial costs associated with them cannot be ignored.
For those attempting a similar approach, we recommend starting with a model with a small parameter count. After adequate trial and error in a setting where the time and money costs are smaller, you can move on to a larger-scale run. Sometimes, building your own LLMs may not be the best choice, and using commercial APIs may be sufficient.
Q: Hiroyuki, in November 2023 Nikkei released Minutes by NIKKEI to help younger audiences understand not just world events but also the underlying trends, by spending just a few minutes a day reading. Please tell us more about the product. What technology underpins it?
A: Hiroyuki Okamoto
When building Minutes by NIKKEI, we faced the challenge of newsroom workload. NIKKEI Tailor addresses this issue. Let me briefly explain Minutes.
Minutes targets an adjacent audience, not Nikkei.com readers. The core target is around 30 years old, the generation experiencing the most life-event changes. They have paid for news before but did not continue, and they have basic news-reading habits. Rather than obtaining information in specific areas, they want to understand what is happening now and why it is important. They believe self-improvement increases life satisfaction. However, they do not want to spend excessive time or money on news.
To provide this audience with “deep understanding,” we created a unique content format. The key content, called “ONE STORY,” provides easy-to-understand explanations of important news, organising current events and combining them with past context and future prospects. Unlike regular news articles, which summarise daily events concisely and limit space for explaining terms and background, the Minutes format structures articles with term explanations, causal factors, current situation, countermeasures, and future outlook. This approach helps first-time readers understand the general background, making articles more approachable for the target audience.
Another format is “Brief Me”. Although Nikkei.com publishes around 1,000 articles daily, “Brief Me” carefully selects three articles important for the target audience, summarises the content, and provides explanations.
For the “Brief Me” content, NIKKEI Tailor plays a crucial role. It connects to Nikkei.com’s audience traffic data to automatically extract articles that draw a large volume of traffic from specific audience segments. During this process, articles not suitable for selection (such as those from news wires) are automatically excluded, and summaries of the remaining articles are generated. Journalists then choose around 10–15 articles from this pool, from which they select the final three for “Brief Me”. Instead of reading through numerous articles, journalists can now determine which articles to include by reading the summaries generated by NIKKEI Tailor. This streamlined process allows efficient creation of the “Brief Me” content, ensuring that the most relevant and important articles are selected for the target audience.
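Nikkei has not published NIKKEI Tailor’s internals, but the selection flow just described can be pictured roughly as in the sketch below; the data structures and the summarise() helper are hypothetical stand-ins for the real system.

```python
# Hypothetical sketch of the "Brief Me" candidate-selection flow described
# above; the data structures and the summarise() helper are stand-ins for
# the real NIKKEI Tailor system.
from dataclasses import dataclass

@dataclass
class Article:
    id: str
    headline: str
    body: str
    source: str          # e.g. "nikkei" or "wire"
    segment_views: int   # traffic from the target audience segment

def select_candidates(articles: list[Article], top_n: int = 15) -> list[dict]:
    # 1. Exclude articles unsuitable for selection, such as wire copy.
    eligible = [a for a in articles if a.source != "wire"]
    # 2. Rank by traffic from the target audience segment and keep the top pool.
    ranked = sorted(eligible, key=lambda a: a.segment_views, reverse=True)[:top_n]
    # 3. Summarise each candidate so journalists can skim the pool instead of
    #    reading every article in full.
    return [{"id": a.id, "headline": a.headline, "summary": summarise(a.body)}
            for a in ranked]

def summarise(text: str) -> str:
    # Placeholder for an LLM call producing a short summary.
    raise NotImplementedError
```

Journalists then pick the final three for “Brief Me” from this pool, as described above.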
In fact, we initially developed a feature to create Minutes’ “ONE STORY” content from Nikkei.com articles. We used it in the beginning but gradually stopped. We realised that generative AI lacks understanding of subtle nuances and chronological order, often excluding important points and making poor judgments about time sequences. Ultimately, we found it faster for journalists to read the source materials and reconstruct the articles themselves. We learned a lot from this experience.
NIKKEI Tailor was developed based on Nikkei’s view on AI in journalism, considering Minutes’ new news delivery format. We collaborated with journalists to upgrade the product, specifically to assist their work.
Maintaining “Responsible Journalism,” NIKKEI Tailor empowers journalists, using AI to enhance their work. We continue upgrading based on feedback, helping them focus on high-quality journalism. We will conduct further research to develop new features and products.
Q: What else are you experimenting with?
A: Hiroyuki Okamoto
At Nikkei, we are also conducting various research and development efforts at the application layer, such as multi-agent collaboration for research and analysis. We will always adhere to the principles of Responsible Journalism and plan to release these as products in the future.
Under the hood of the FT’s recent releases and experimentation
Q: Krum, the FT Professional team has recently released to a limited audience the Ask FT feature. Can you tell us what this feature aims to help FT’s professional readers with and what techniques were used to achieve that?
A: Krum Arnaudov
Ask FT is a question-and-answer feature that helps Professional readers discover insights through a new flow. It finds relevant passages of text within the FT’s articles and presents them to an LLM as context. The LLM then responds to the user’s question based on the information from those FT articles, providing citations to the articles used.
The tool has both a discovery element (users can potentially find content that they would otherwise miss) and a time-saving element, with the feature responding to users’ queries based on a trusted source such as the FT.
In order to receive an answer, each question passes through several steps (a sketch of the flow follows the list):
1. Moderation, in which the tool assesses the query through an ensemble of tools, blocking inappropriate ones.
2. Time horizon estimation, in which the LLM is asked to estimate the most appropriate start and end date for the search. We found that this step can impact the search results significantly and can also encourage the user to change the time horizon, if necessary.
3. Hybrid search, done in three phases:
3.a. Fetching the top 25 articles most relevant to the question.
3.b. Cutting the articles into segments.
3.c. Ranking the segments with a reranking model. Up to 8 top segments are used for the response and shown to the user.
4. Initial answer, in which the LLM provides its first answer draft.
5. Rewritten answer, in which the LLM is asked to double-check and correct its initial response. Only the rewritten answer is shown to the user.
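The FT has not published the implementation, but as a rough sketch the flow above could look like the following, with every helper function (moderate, estimate_time_horizon, hybrid_search, split_into_segments, rerank, ask_llm) standing in for an internal component rather than real FT code.

```python
# Rough sketch of the Ask FT flow enumerated above. Every helper function
# (moderate, estimate_time_horizon, hybrid_search, split_into_segments,
# rerank, ask_llm) is a hypothetical stand-in for an internal component.

def answer_question(question: str) -> str:
    # 1. Moderation: block inappropriate queries before doing any work.
    if not moderate(question):
        raise ValueError("Query rejected by moderation")

    # 2. Time horizon: ask the LLM for sensible start/end dates for the search.
    start_date, end_date = estimate_time_horizon(question)

    # 3. Hybrid search: top 25 articles -> segments -> up to 8 reranked segments.
    articles = hybrid_search(question, start_date, end_date, top_k=25)
    segments = [seg for article in articles for seg in split_into_segments(article)]
    context = rerank(question, segments, top_k=8)

    # 4. Initial answer draft, grounded in the retrieved segments with citations.
    draft = ask_llm(question, context, instruction="Answer with citations")

    # 5. Rewrite: the LLM double-checks and corrects its own draft; only the
    #    rewritten answer is shown to the user.
    return ask_llm(question, context,
                   instruction=f"Review and correct this draft:\n{draft}")
```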
Q: Did you consider using LLM pre-training and fine-tuning and what was driving your decision to use or not use those techniques?
A: Krum Arnaudov
We are fortunate that our content is written in English and in very clearly structured articles. Such texts are easy for modern LLMs to understand. We also did not have tasks requiring complicated output formats or special skills that are not native to modern LLMs. As such, the need to fine-tune an LLM was significantly reduced.
Instead, we focused our attention on finding the most suitable on-demand LLM for the task. A trade-off needed to be made between speed (usually highly correlated with cost) and quality. After extensive research and comparison, we chose Anthropic’s Claude 3 Haiku, which provided us with the best speed-versus-quality balance.
We then spent significant time aligning our prompts with the latest recommendations from Anthropic, using techniques such as chain-of-thought, citation-based responses, rewrites of the responses and XML formatting of the prompts, among others.
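To make those techniques more concrete, here is a sketch of what an XML-formatted, citation-oriented prompt might look like using the Anthropic Python SDK with Claude 3 Haiku; the prompt wording and tag names are illustrative, not the FT’s actual prompts.

```python
# Sketch of an XML-formatted, citation-oriented prompt using the Anthropic
# Python SDK with Claude 3 Haiku. The prompt wording and tag names are
# illustrative, not the FT's actual prompts.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def draft_answer(question: str, segments: list[str]) -> str:
    # Wrap each retrieved segment in an XML tag with an id the model can cite.
    context = "\n".join(
        f'<article id="{i}">{text}</article>' for i, text in enumerate(segments)
    )
    message = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1024,
        system=(
            "You answer questions using only the provided articles. "
            "Think step by step inside <thinking> tags, then give the answer "
            "inside <answer> tags, citing article ids in square brackets."
        ),
        messages=[{
            "role": "user",
            "content": f"<articles>\n{context}\n</articles>\n\n"
                       f"<question>{question}</question>",
        }],
    )
    return message.content[0].text
```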
In order to evaluate quality along these steps, we built separate evaluation tasks for each step of Ask FT. The suite enables us both to make decisions about new models as they are published and to evaluate the tool’s performance. Some of the steps were significantly easier to evaluate than others. For example, citation quality can be measured in different ways, through relevance scores as well as through gold datasets. The quality of the final answer, however, is significantly harder to evaluate. Evaluating an open-ended answer is a challenge for everybody in the industry, ourselves included. Our best solution currently is to read as many answers as we can, distributed within the team but also more widely within the FT. We have found that LLM-as-a-judge is a helpful signal, but it cannot be trusted as the main evaluation tool.
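As an example of the LLM-as-a-judge signal mentioned above, a simple rubric-based judge can be asked to grade each answer for groundedness and relevance against the retrieved context. The sketch below is illustrative only and is not the FT’s evaluation suite; the prompt and rubric are hypothetical.

```python
# Illustrative LLM-as-a-judge check: ask a model to grade an answer against
# the retrieved context on a simple rubric. A helpful signal, not a
# replacement for human review; the prompt and rubric are hypothetical.
import anthropic

client = anthropic.Anthropic()

def judge_answer(question: str, context: str, answer: str) -> str:
    message = client.messages.create(
        model="claude-3-haiku-20240307",  # any capable judge model would do
        max_tokens=300,
        system=(
            "Grade the answer from 1 to 5 for groundedness (is every claim "
            "supported by the context?) and relevance to the question. Reply "
            "as: groundedness=<n> relevance=<n> reason=<one sentence>"
        ),
        messages=[{
            "role": "user",
            "content": f"<question>{question}</question>\n"
                       f"<context>{context}</context>\n"
                       f"<answer>{answer}</answer>",
        }],
    )
    return message.content[0].text
```

Scores like these help spot regressions between prompt or model versions, but, as noted above, they complement rather than replace human reading.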
Q: Which were the main challenges you came across? How did you overcome them?
A: Krum Arnaudov
The main challenges arise from the fact that the core technology, the LLM, is still very new. Retrieval-augmented question answering in its current form is less than two years old, and it shows in the maturity of the tools and techniques. The good news is that we are on an upward trajectory: LLMs keep getting better, and we keep gaining experience in how to use them effectively.
Another significant challenge has been managing expectations of the tool’s capabilities. Language models are often discussed as tools matching or exceeding human capabilities, but the reality is different: they are still fragile and prone to mistakes. The narratives around them nevertheless lead to extremely high expectations. Users need time to understand how LLM-based applications, with their pros and cons, can actually bring them the most value, and we are at the beginning of that period.
On a technical level, the most significant challenge has been evaluating the quality of the final answer. Such an evaluation is needed for any change, big or small, but it is currently extremely time- and resource-consuming.
Q: What else is your team experimenting with?
A: Krum Arnaudov
Working on Ask FT has enabled us to learn quickly and share this knowledge with FT’s Data Science and engineering teams. It has accelerated the adoption of LLMs by a number of teams. Currently, we have a number of initiatives working on summarisation of articles, story finding and a prompting playground, among others.
The more applications of LLMs we find, and the more diverse and numerous the LLM practitioners become, the stronger the cycle of knowledge sharing.
In conclusion: the similarities, the differences and the main learnings
The approaches taken by the FT and Nikkei can be summarised at a meta level and matched to the specific situations of the two publishers.
“First things first”: putting publisher values at the centre of everything. Both companies share the fundamental reality that their journalism (the journalistic values, the content they produce and the trust it brings) represents the company’s main asset. In both cases, a great deal of attention has been paid to making sure the use of AI does not go against their core values and does not undermine their readers’ trust. Both have defined and are following guiding principles which keep a human in the loop and in the driving seat.
“Common challenges and objectives”: unsurprisingly, both the FT and Nikkei have common areas where Generative AI can be a useful tool:
- Enabling user-facing products and features that wouldn’t be possible without Generative AI. In the FT’s case this is the generative search feature called “Ask FT”, which helps FT Professional readers find insights based on FT content, with references to the exact articles that explain aspects of the topic. Nikkei is doing similar research into question answering using LLMs.
- Realising efficiency gains in the newsroom. Both Nikkei and the FT are exploring how to use Generative AI to help journalists quickly process vast amounts of unstructured data to identify news signals, patterns and trends, so that they can focus on doing what they do best: being on the ground, talking to the right people and telling stories in an objective, impartial and impactful way. In Nikkei’s case this is the Minutes by NIKKEI product, which uses the NIKKEI Tailor capability to search and summarise vast numbers of articles in order to curate a meaningful set for a young audience with a limited attention span.
- Understanding the boundaries of “what is possible”. Nikkei’s effort to internally build and train a Large Language Model “from scratch” is a prime example of a technology-push approach, where deeply understanding the capabilities, deficiencies and costs of the technology is required to formulate viable use cases.
“The right tool for the job”
Equally unsurprisingly, the FT and Nikkei differ in their situations, which calls for potentially different approaches. In the FT’s case, having the raw content in English makes it viable to rely on the general conversational capabilities of LLMs that have been trained on huge corpora in the language, and to build on top of them with Retrieval-Augmented Generation at the application layer, which ensures the features and products only use trustworthy FT content to generate answers. In Nikkei’s case, overcoming challenges related to the Japanese language has justified an approach that reaches into the lower layers of the LLM stack, through continuous pre-training and pre-training from scratch, to produce quality results in Japanese.
Both the similarities and the differences between the situations and approaches the FT and Nikkei have taken mutually enrich each company’s understanding of what is possible, what works and what doesn’t. Sharing knowledge and learnings pushes both companies further up the learning curve in a dynamically changing domain. The Nikkei Innovation Lab team has presented to the AI community within the FT; similarly, FT teams that are experimenting with and shipping Generative AI-powered features and capabilities have shared their insights with the Nikkei team. The road ahead is unknown and quickly changing, but it is definitely exciting, as we seem to be only at the beginning of the journey.