Süddeutsche Zeitung’s AI-powered review of the year with retrieval-augmented generation (RAG)

SZDM Data Science
Süddeutsche Zeitung Digitale Medien
8 min read · Jan 17, 2024

A Generative AI experiment

2023 was a landmark year for generative AI (GenAI) and its applications, which attracted heightened exposure and interest from the general public across numerous tasks. One such task is retrieval-augmented generation¹ (RAG), which seeks to improve the accuracy and credibility of responses from general-purpose large language models (LLMs) by supplementing them with material from an external knowledge source. Here at Süddeutsche Zeitung (SZ), as part of our annual review of an eventful year, we have been experimenting by offering our subscribers the opportunity to explore GenAI-powered question answering (QA) about the topics and events of 2023 with our very own RAG system.

How does this benefit our subscribers? Externally, our goal was to answer subscriber queries with news-event and domain-specific information drawn from over 60,000 SZ articles published and curated last year by our journalists and editors. This lets subscribers experiment with modern applications of GenAI whilst discovering content within their subscription that might otherwise have gone unnoticed. Internally, offering such a service gives us a basis for understanding how our subscribers interact with AI-powered services, potentially guiding future focus points and innovations in product design. In this article we explore our technical set-up and the considerations behind building our interactive QA review-of-the-year RAG deployment (QA-ROY).

[1]: Lewis, Patrick, et al. “Retrieval-augmented generation for knowledge-intensive NLP tasks.” Advances in Neural Information Processing Systems 33 (2020): 9459–9474.

Methodology

Any real-world RAG deployment calls for careful consideration of several factors that determine the final quality of the service delivered. Operating in the publishing domain, our particular goals and areas of concern were:

✓ — Providing a safe end user experience through ethical guardrails and suitable guidelines to tackle sensitive queries & topics.
✓ — Generating satisfactory answers by contextualising document retrieval and final outputs for tricky, niche or irrelevant queries.
✓ — Demonstrating a product-integrated use of GenAI through a rewarding user experience, with clear and correct exposure to relevant SZ articles, whilst accounting for tricky characteristics of news texts such as publication relevance and recency.

So how should such a system look and respond to user inputs? To address concerns such as those detailed above, we decided this initial experiment should not operate as a conversational tool but simply provide the best possible experience for isolated queries. We further chose to constrain user input to a maximum of 80 characters to help avoid unwanted issues such as prompt injection or contextual uncertainty for the LLM. Our focus was then primarily on how best to process, select, integrate and reference SZ article content in an accessible format for meaningful answer generation, whilst suitably curtailing responses to certain user prompts. We deployed QA-ROY on AWS, so let’s explore what the overall architecture looked like to achieve these objectives.
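As a rough illustration of the input constraint, a minimal gate for this kind of single-shot QA might look like the sketch below. The function name and error messages are hypothetical; only the 80-character limit comes from our set-up.

```python
MAX_QUERY_LENGTH = 80  # hard limit on user input, as described above


def validate_query(raw_query: str) -> str:
    """Reject inputs that are empty or exceed the length limit.

    Keeping queries short reduces the surface for prompt injection and
    limits contextual ambiguity for the downstream LLM.
    """
    query = " ".join(raw_query.split())  # collapse whitespace and newlines
    if not query:
        raise ValueError("Empty query")
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError(f"Query exceeds {MAX_QUERY_LENGTH} characters")
    return query
```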

Moving Into Production

There are two integral components in our RAG architecture, enabling both information retrieval and text generation (see Figure 1). The first is the RAG knowledge base, which stores SZ article text chunks + embeddings and the associated article metadata as document objects. The second, the QA-ROY API, mediates the processes required to transform a user query into an answer + relevant-articles object. To generate the stored documents, we ran a batch ETL (extract, transform, and load) job over all valid² news articles published in 2023 and originally stored in DynamoDB. Article texts are first chunked up to a maximum word limit (rounded to the nearest complete sentence for semantic consistency) using LangChain’s recursive text splitter. Next, we generate embeddings for each newly created text chunk via a previously pre-trained model (see previous HeriBERT discussions here) deployed behind a SageMaker endpoint. The same endpoint later generates the user-query embeddings for semantic search within our Haystack pipeline, ensuring dimensional compatibility and meaningful comparisons. Finally, we supplement each text chunk/embedding pair with the parent article’s metadata (URL, article title, publication date etc.), which serves as supplemental information for the LLM and for front-end rendering. Everything is indexed in OpenSearch, creating a knowledge base of approximately 250,000 documents for swift hybrid-search retrieval.
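A condensed sketch of this ETL step is shown below. It assumes a JSON-in/JSON-out SageMaker endpoint serving the embedding model and an OpenSearch index with a kNN vector field; the endpoint name, index name and field names are illustrative rather than our actual configuration, and the splitter here works on a character budget for simplicity, whereas our actual chunking is word-based and rounded to sentence boundaries.

```python
import json

import boto3
from langchain.text_splitter import RecursiveCharacterTextSplitter
from opensearchpy import OpenSearch

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
sagemaker = boto3.client("sagemaker-runtime")
opensearch = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])


def embed(text: str) -> list[float]:
    """Call the (hypothetical) embedding endpoint serving our pre-trained model."""
    response = sagemaker.invoke_endpoint(
        EndpointName="heribert-embeddings",  # illustrative endpoint name
        ContentType="application/json",
        Body=json.dumps({"inputs": text}),
    )
    return json.loads(response["Body"].read())["embedding"]


def index_article(article: dict) -> None:
    """Chunk one article, embed each chunk and index it with parent metadata."""
    for chunk in splitter.split_text(article["text"]):
        document = {
            "content": chunk,
            "embedding": embed(chunk),
            "url": article["url"],                  # metadata kept for the LLM
            "title": article["title"],              # prompt and front-end rendering
            "published_at": article["published_at"],
        }
        opensearch.index(index="sz-articles-2023", body=document)
```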

Figure 1: Architecture overview for QA-ROY including AWS services utilised to construct our RAG system.

The central magic happens in our API, which takes requests and feeds answer responses back to our subscribers, along with cited references to SZ articles and markers showing where in the body of the answer they are relevant (see example in Figure 3). When a request reaches the API, it first flows through a series of logical steps that determine whether the subscriber’s question should be passed on to the GenAI stage of the full pipeline. These steps address factors such as safety, reliability and consistency across our returned answer set. For example, initial validity checks screen for blacklisted words or phrases to avoid unethical responses, while matching against cached queries (below 40 characters) and predefined query-answer pairs (e.g. “Wie geht’s dir?”) helps reduce operational costs and unnecessary calls to the GenAI model API. See Figure 2 for the full control flow.
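The gating logic before the GenAI call can be summarised roughly as in the sketch below. The blacklist, cache and predefined answers shown here are placeholders standing in for our actual configuration.

```python
from typing import Optional

BLACKLIST = {"..."}             # placeholder set of blocked words/phrases
PREDEFINED_ANSWERS = {          # curated query -> answer pairs
    "wie geht's dir?": "Mir geht es gut, danke der Nachfrage!",
}
ANSWER_CACHE: dict[str, str] = {}   # previously generated answers for short queries
CACHE_MAX_LENGTH = 40


def pre_generation_checks(query: str) -> Optional[str]:
    """Return a canned or cached answer, or None if the query should go to RAG."""
    normalised = query.strip().lower()
    if any(term in normalised for term in BLACKLIST):
        return "Diese Frage können wir leider nicht beantworten."
    if normalised in PREDEFINED_ANSWERS:
        return PREDEFINED_ANSWERS[normalised]
    if len(normalised) <= CACHE_MAX_LENGTH and normalised in ANSWER_CACHE:
        return ANSWER_CACHE[normalised]
    return None  # fall through to retrieval + generation
```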

[2]: We ingest all SZ-published articles with a publication date up to 06–12–2023. We currently exclude dpa agency articles so as not to bias or degrade results through their extensive short-form reporting.

Information Retrieval and Augmented GenAI request responses

If no predefined response is returned, the subscriber’s question is passed to the retrieval and text generation stage. We use the Haystack library to integrate an efficient document-chunk retrieval pipeline, following the hybrid search approach detailed in Figure 2, with both sparse and dense retrievers talking to our OpenSearch index. Based on the subscriber’s input, lexical and semantic candidate sets of the top 10 closest-matching documents are formed via BM25 keyword search (over document text and article title fields) and kNN (k-nearest neighbour) search over the document text embedding vectors, respectively. This balances the quality of the returned candidate pool by allowing for both human-like contextual matching of larger portions of text and stronger matches on smaller textual units such as product names, people and locations, which are a vital part of news and event-based reporting. Next, we threshold-filter each candidate set separately to ensure a minimum standard of similarity among the top candidates. This avoids situations where the top results all have low similarity values, which would lead to unsatisfactory final answers; in that scenario we accept that the question could not be answered. Penultimately, we merge the results from both retrievers and select the overall top 10 by similarity score (normalised to the unit interval so that scores from the two candidate types can be compared), before finally re-ranking the documents by a weighted combination of recency and similarity and selecting the top 3 from the remaining pool.
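Outside of Haystack, the threshold-and-merge logic can be illustrated in isolation roughly as below. The thresholds and the logistic squash applied to BM25 scores are illustrative choices (kNN scores are assumed to already be scaled to the unit interval); the recency re-rank that follows is sketched separately after the tuning discussion further down.

```python
import math


def merge_candidates(
    bm25_hits: list[dict],
    knn_hits: list[dict],
    bm25_threshold: float = 0.5,   # illustrative minimum-similarity cut-offs,
    knn_threshold: float = 0.6,    # applied to the scaled scores
    top_k: int = 10,
) -> list[dict]:
    """Scale, threshold-filter and merge the lexical and semantic candidate sets."""

    def scale_bm25(score: float) -> float:
        # Squash unbounded BM25 scores into the unit interval so they can be
        # compared with the (already scaled) kNN similarity scores.
        return 1.0 / (1.0 + math.exp(-score / 8.0))

    candidates = [
        {**hit, "score": scale_bm25(hit["score"])}
        for hit in bm25_hits
        if scale_bm25(hit["score"]) >= bm25_threshold
    ] + [hit for hit in knn_hits if hit["score"] >= knn_threshold]

    if not candidates:
        return []  # treated as "this question could not be answered"

    # Deduplicate by document id, keep the higher score, then take the overall top 10.
    best: dict[str, dict] = {}
    for hit in candidates:
        if hit["id"] not in best or hit["score"] > best[hit["id"]]["score"]:
            best[hit["id"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:top_k]
```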

Figure 2: Question flow to control which type of answer is delivered (upper figure). Haystack hybrid retrieval pipeline to find and filter the top documents matching subscriber questions, to be combined with our LLM prompt to generate final answers (lower left figure). Input and output data used to generate the GenAI answer from the LLM API (lower right figure).

To generate our final responses, we take advantage of Amazon’s recently released Bedrock service, a serverless option providing access to a range of cutting-edge foundation models through a single API, for simple integration into GenAI-based applications. Our model of choice, Anthropic’s Claude, is highly regarded in key GenAI tasks such as text generation and summarisation, and is built around helpfulness, honesty and harmlessness. The fit of this approach with our goals, together with the need to balance security, privacy and implementation complexity whilst keeping the cost of our experiment manageable, led to our choice of model and overall architecture. Specifically, we chose Claude Instant V1, which allows higher throughput to serve more users quickly, as we found its results suitably comparable to those of the recently released Claude 2.
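Calling Claude Instant through Bedrock amounts to a single boto3 request. Below is a minimal sketch using the completion-style prompt format Claude expected at the time; the temperature matches the value discussed later, while the token limit and region configuration are illustrative.

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")  # region taken from the AWS config


def generate_answer(prompt: str) -> str:
    """Send the augmented prompt to Claude Instant and return the completion."""
    body = {
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": 1024,
        "temperature": 0.25,  # modest temperature to discourage hallucinations
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-instant-v1",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["completion"]
```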

Internally, we experimented with various prompting styles and nuances adhering to Claude’s conversational format before landing on a good balance of instructions (role-assignment prefacing, chain-of-thought style examples etc.). We were careful not to over-instruct, as we found a leaner prompt gave the best results, particularly when controlling the format of the reply while augmenting the prompt with the retrieved data. The concluding step is simply to clean and tidy the response from the LLM API, citing the document material originally passed in. And success! After some front-end magic, we have smartly answered queries with nicely cited SZ articles for additional reading. See the user experience in Figure 3.
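The prompt assembly itself follows a simple pattern: number the retrieved chunks, ask the model to cite them by number, and keep the instruction block short. The template below is a hypothetical, English-language simplification of our actual prompt, not a reproduction of it.

```python
PROMPT_TEMPLATE = """You are a helpful assistant for Süddeutsche Zeitung.
Answer the question using only the article excerpts below.
Cite the excerpts you use in the text as [1], [2] or [3].
If the excerpts do not allow an answer, say so openly.

{sources}

Question: {question}"""


def build_prompt(question: str, documents: list[dict]) -> str:
    """Combine the subscriber question with the numbered top-3 retrieved chunks."""
    sources = "\n\n".join(
        f"[{i}] {doc['title']} ({doc['published_at']}):\n{doc['content']}"
        for i, doc in enumerate(documents, start=1)
    )
    return PROMPT_TEMPLATE.format(sources=sources, question=question)
```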

As is common with RAG systems, there is no one-size-fits-all approach to tuning hyperparameters, given the logistical limitations of deploying such a system and the variety of possible inputs. Iteration and experimentation are always important to gauge how any RAG system is performing, and early exposure to internal stakeholders provided valuable feedback for tweaking our final parameters and configurations. One critical decision we settled on was a modest LLM temperature (0.25), acting as a deterrent to hallucinations whilst allowing some variety in returned results. The temporal aspect of news is also critical in our use case. The recency weight is a tricky hyperparameter to set for a wide range of query topics (sports stories often benefit from a stronger focus on recency, while long-running news topics require less, etc.). We found that an evenly weighted recency re-ranking in our Haystack pipeline, combined with a reduced selection of only the top 3 documents, worked best for project facets such as answer clarity, cost management and user experience.
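The evenly weighted recency re-rank mentioned above can be expressed as a small scoring function; the half-life of 180 days used here is an illustrative choice rather than our exact configuration, and publication dates are assumed to be timezone-aware ISO-8601 timestamps.

```python
from datetime import datetime, timezone


def rerank_by_recency(documents: list[dict], top_k: int = 3) -> list[dict]:
    """Re-rank by an even blend of similarity and recency, then keep the top 3."""
    now = datetime.now(timezone.utc)

    def recency(doc: dict) -> float:
        # Exponential decay: an article published ~180 days ago scores ~0.5.
        age_days = (now - datetime.fromisoformat(doc["published_at"])).days
        return 0.5 ** (age_days / 180)

    def combined(doc: dict) -> float:
        return 0.5 * doc["score"] + 0.5 * recency(doc)  # evenly weighted blend

    return sorted(documents, key=combined, reverse=True)[:top_k]
```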

Figure 3: Subscriber journey, from question input as either a custom entry or a selected example (step 1), to the request being sent to the API to trigger the RAG answer pipeline (step 2), to the final step where the formatted LLM response is displayed in place of the original input, complete with in-text references and matching article teasers below (step 3).

Takeaways and looking to the future

By augmenting subscriber queries with our published-article knowledge base, formed from approximately a quarter of a million article extracts, and with the easy integration of powerful GenAI tools via Amazon’s Bedrock, we have been able to deliver a special review of the year to our subscribers. We aimed to deliver insightful and factual answers whilst highlighting some of the best reporting and stories SZ has to offer. And the future? Given the continuous influx of novel approaches in this space, we will continue to monitor and evaluate user feedback as a gauge of how such services could shape, enhance and personalise the way our subscribers engage with our product. We already have valuable feedback (see example in Figure 4) from data-driven user research analysing our RAG system’s impact. Watch this space.

Figure 4: Example initial findings and summarised statistics from a small portion of our subscriber feedback. Initial indications are that subscribers are open to seeing more AI-powered services and can imagine different areas where this would help.
