An AI Assistant for EU Election Q&A

Having developed a retrieval-augmented generation (RAG) tool for last year's review of the year, we decided to build an EU-election research assistant. In all phases, from the initial conceptual design and development to hypercare and monitoring after going live, this project was a collaboration between SZDM Data Science, SZ Entwicklungsredaktion (our editorial innovation lab), and two editorial teams (data and politics). We have already written about the why and how in our SZ transparency blog. Here, we want to give more technical insights.

The aim was, on the one hand, to make information about the EU election accessible to users. With over 30 political parties on the ballot and thousands of articles written by SZ, finding detailed information about all their positions and candidates can be overwhelming. On the other hand, we collected user queries and scanned them together with our editors for topics that were in high demand but “under-reported”.

Briefly, we gave subscribers the opportunity to ask questions about the upcoming election and generated answers with a RAG system based on the party manifestos, SZ reporting and supplementary documents. Questions could be entered as free text, but we also provided initial and follow-up suggestions.

Newly added features of the user experience, highlighting changes in the user journey compared to our previous RAG project

The infrastructure: What we did differently than before

One of our learnings from the last big RAG project was that we needed to speed up our experimentation cycles. On the technical side, that meant using more of the off-the-shelf solutions for RAG architectures that have become available and re-using our own existing solutions. The API layer, for example, was implemented identically to our last RAG project: an EKS deployment that processes user inputs and manages the RAG steps behind the scenes. The architectural and functional changes mainly concern the knowledge base setup, the prompt and the retrieval pipeline.

Knowledge bases and the choice of documents

We decided to use a knowledge base including:

  • General information about the EU elections from e.g. the German government and agencies for political education [1, 2, 3]
  • Party manifestos and candidate information from all eligible parties in Germany and their respective EU party families
  • Handpicked SZ articles related to the EU election

We extracted the text from all documents and enriched those texts with metadata such as the source and high-level topic descriptions.
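For illustration, an enriched document record might look roughly like this (a minimal sketch in Python; the field names are our own and not the exact schema we used):

# A sketch of one extracted document enriched with metadata (illustrative fields).
document = {
    "text": "Die Partei setzt sich für ... ein.",   # extracted plain text
    "metadata": {
        "source": "Wahlprogramm 2024",              # where the text came from
        "organisation": "CDU",                      # party or institution
        "topics": ["Wirtschaft", "Europapolitik"],  # high-level topic descriptions
    },
}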

To make the documents searchable for our RAG system, we used AWS knowledge bases. This gave us slightly fewer configuration options than our earlier custom-built solution but sped up ingestion by several orders of magnitude, which allowed us to iterate on pre-processing much faster. Using an OpenSearch Serverless Collection also significantly reduced our operating costs. Since the automatic chunking setting produced text snippets with too little context, we ended up using a bigger custom chunk size with 20 percent overlap. Further, the chunks were supplemented with metadata from their parent documents.
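Configuring such a data source with a custom chunk size and 20 percent overlap could look roughly like this with boto3 (a sketch; the knowledge base ID, bucket and chunk size value are placeholders, not our actual configuration):

import boto3

# Sketch: attach an S3 data source to a Bedrock knowledge base with
# custom fixed-size chunking and 20 percent overlap.
bedrock_agent = boto3.client("bedrock-agent")

bedrock_agent.create_data_source(
    knowledgeBaseId="KB_ID",  # placeholder
    name="eu-election-documents",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::eu-election-docs"},  # placeholder
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 500,         # bigger custom chunk size (placeholder value)
                "overlapPercentage": 20,  # 20 percent overlap between chunks
            },
        }
    },
)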

Additionally, we collected published articles about election-related topics in real time from our article CMS and added them to our knowledge base in a separate section. The political information was used to answer user questions; the real-time article collection was used to provide additional reading material to our users.

Architecture overview of the EU AI assistant: core service components and AWS services used to construct our RAG system

Search and Retrieval

The whole premise of RAG is to base generated answers not on model-internal knowledge but on information provided to the LLM via supplemental documents. If retrieval returns no relevant documents, we do not want the LLM to speculate or hallucinate an answer.

An obvious improvement in this regard is the bigger context window of newer LLM generations. We used the Anthropic LLM ‘Claude 3 Haiku’. This allowed us to include more and longer document snippets without greater cost or risk of the model forgetting parts of the prompt. Compared to our last RAG project, we had already increased the chunk size in pre-processing and the number of top hits to include in the prompt.
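As a minimal sketch, the generation step with retrieved snippets passed to Claude 3 Haiku on Bedrock could look like this (prompt wording and variable names are illustrative, not our production code):

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def generate_answer(system_prompt: str, question: str, snippets: list[str]) -> str:
    # Join the retrieved document snippets and hand them to the model
    # together with the user question.
    context = "\n\n".join(snippets)
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "system": system_prompt,
        "messages": [
            {"role": "user", "content": f"Dokumente:\n{context}\n\nFrage: {question}"},
        ],
    }
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]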

A typical user question might ask to compare the positions of Party A and Party B. A simple similarity search with the plain text or a vector embedding of the question might randomly return results about only Party A at the top, with Party B further down. A question might also concern, e.g., both health and environmental policies; again, a similarity search might return only results mentioning environmental policies at the top if no documents match both aspects. For these types of questions, where diversity of search results is important, we implemented an agentic workflow: we first asked our LLM to generate up to three separate search queries to cover the various aspects of a question and then ran these in parallel.

For example, for the question “What are the economic policies of CDU and the Green party?” the LLM would generate two sub-queries:

[
# The "organization" field gives more weight to search results tagged
# with that organization.
{ "organisation": "CDU", "query": "CDU Wirtschaftsförderung Positionen" },
{ "organisation": "GRÜNE", "query": "Grüne Wirtschaftsförderung Positionen" }
]

We took the top two results of each search thread, merged them and included them in the prompt to generate an answer.
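In code, the fan-out and merge could be sketched like this (the sub-queries are assumed to have been generated by the LLM as in the JSON example above; the knowledge base ID and helper names are illustrative):

from concurrent.futures import ThreadPoolExecutor

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def retrieve_top_chunks(query: str, top_k: int = 2) -> list[dict]:
    # Vector search against the knowledge base, returning the top_k chunks.
    response = bedrock_agent_runtime.retrieve(
        knowledgeBaseId="KB_ID",  # placeholder
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    return response["retrievalResults"]

def retrieve_contexts(sub_queries: list[dict]) -> list[dict]:
    # Run the LLM-generated sub-queries in parallel and merge the top two
    # results of each search thread into one list for the prompt.
    with ThreadPoolExecutor(max_workers=len(sub_queries)) as pool:
        results_per_query = list(
            pool.map(lambda sq: retrieve_top_chunks(sq["query"]), sub_queries)
        )
    return [chunk for results in results_per_query for chunk in results]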

Another problem we encountered was questions about a topic without a stated party preference. Users who ask about, say, the electoral procedure in general are probably better served with a general overview from neutral organizations. We therefore gave a higher weight to search results from public agencies when no party was specified.
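One simple way to express such a preference is to re-weight the retrieval scores after the search (the agency list, metadata field and boost factor below are illustrative, not our exact scheme):

PUBLIC_AGENCY_SOURCES = {"Bundesregierung", "Bundeszentrale für politische Bildung"}  # illustrative

def reweight_results(results: list[dict], party_specified: bool, boost: float = 1.5) -> list[dict]:
    # If the question names no party, boost chunks from neutral public
    # agencies so that general overviews rank higher.
    if party_specified:
        return results
    for result in results:
        if result.get("metadata", {}).get("organisation") in PUBLIC_AGENCY_SOURCES:
            result["score"] *= boost
    return sorted(results, key=lambda r: r["score"], reverse=True)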

Quality assurance

Of particular concern were two areas: we did not want to make explicit voting recommendations (“Who should I vote for?” was, surprisingly, one of the most asked questions), and we did not want the system to adopt a particular party’s position, especially from the extreme ends of the political spectrum. Our final prompt included instructions reflecting that intent. We did not include additional guardrails against trolling, since in our experiments we could not improve on the precautions built into the LLM. However, we defined some topics and keywords we considered sensitive. If a question included such a keyword, it was forwarded to a Slack channel for real-time monitoring and review.
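The forwarding itself can be as simple as posting flagged questions to a Slack incoming webhook (the keyword list and webhook URL are placeholders):

import requests

SENSITIVE_KEYWORDS = {"wahlempfehlung", "extremismus"}  # placeholder keywords
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def forward_if_sensitive(question: str) -> None:
    # Forward questions containing a sensitive keyword to a Slack channel
    # for real-time monitoring and review.
    lowered = question.lower()
    if any(keyword in lowered for keyword in SENSITIVE_KEYWORDS):
        requests.post(SLACK_WEBHOOK_URL, json={"text": f"Sensitive question: {question}"})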

In fact, while there was a substantial number of non-serious, colorful, or experimental questions, we saw only a handful of cases that we could classify as malicious, e.g., trying to provoke a politically extreme response. None of them produced a harmful or offensive response from our system.

User experience

To lower the threshold of entry, we created a UI module that guided users in their first interaction. With two mouse clicks they could compose the question “What is the position of {party} about {topic}?”. 55% of questions we received were composed that way.

UI choices given to the user

Together with the answer, we also generated and displayed suggestions for follow-up questions which users could select again with a single click. These options were designed to ease users into longer interactions. They increased the number of questions asked per user by 25%.
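The follow-up suggestions were themselves generated by the LLM alongside the answer; a prompt for this could be as simple as the following sketch (the wording is illustrative, not our actual prompt):

# Illustrative prompt template for generating follow-up suggestions.
FOLLOW_UP_PROMPT = (
    "Hier sind die Frage des Nutzers und deine Antwort:\n"
    "Frage: {question}\nAntwort: {answer}\n\n"
    "Schlage drei kurze Anschlussfragen zur EU-Wahl vor, eine pro Zeile."
)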

Prompt design

One of our primary concerns was ensuring that our EU bot did not provide explicit party recommendations or exhibit clear political biases, as seen in other LLMs. Additionally, the bot should not echo extremist statements from openly xenophobic or national-socialist fringe parties. To address the first issue, we included an instruction in the prompt prohibiting voting recommendations, even when requested, which proved effective. To address the second issue, we added warnings to documents from parties classified as extremist by the German domestic intelligence service; these warnings were included in the prompt so that the LLM would not normalize their political positions.
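Sketched in code, such a warning could be prepended to the affected chunks before they enter the prompt (the party set and warning text are illustrative placeholders):

FLAGGED_PARTIES = {"PARTY_X"}  # placeholder for parties classified as extremist

WARNING_NOTE = (
    "Note: the following position comes from a party classified as extremist "
    "by the German domestic intelligence service."
)

def annotate_chunk(chunk_text: str, organisation: str) -> str:
    # Prepend a warning to chunks from flagged parties so the LLM does not
    # normalize their political positions.
    if organisation in FLAGGED_PARTIES:
        return f"{WARNING_NOTE}\n{chunk_text}"
    return chunk_text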

Managing multiple iterations of prompts can be challenging, especially when they are tested against each other. To make it easier to track changes and compare results, we structured our prompt design into distinct components: the system prompt, a role part, instructions, examples and the formatted document chunks. Our final system prompt looks as follows:

Du bist eine Redakteurin der Süddeutschen Zeitung, spezialisiert auf 
Europapolitik. Vom 6. bis 9. Juni 2024 findet die Europawahl statt.
Dein Auftrag ist es, den Lesern umfassende und präzise Informationen
zur EU-Wahl, den Parteien und der Europäischen Union zu bieten.
Dein Wissen und deine Antworten stützen sich auf die neuesten
Wahlprogramme der Parteien und allgemeinem Informationsmaterial
rund um die EU und die Wahl.

(In English: “You are an editor at the Süddeutsche Zeitung specializing in European politics. The European election takes place from June 6 to 9, 2024. Your task is to provide readers with comprehensive and precise information about the EU election, the parties and the European Union. Your knowledge and your answers are based on the parties’ latest election manifestos and general information material about the EU and the election.”)

We found it important to add a time reference to clarify questions referring to “the next”, “this year’s”, “the previous” etc. election. After the election, we amended the instructions as follows:

Dieses Jahr ist {current_year}. Heute ist der {current_date}. 
Ereignisse vor {current_date} liegen in der Vergangenheit. Ereignisse
nach {current_date} liegen in der Zukunft.
Die letzte Wahl war vor wenigen Tagen. Die nächste Wahl findet
in 5 Jahren statt.

(In English: “This year is {current_year}. Today is {current_date}. Events before {current_date} lie in the past. Events after {current_date} lie in the future. The last election was a few days ago. The next election takes place in 5 years.”)
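Assembled from the components described above, the final prompt could be built roughly like this (component contents are abbreviated and the template structure is our illustration, not the exact implementation):

from datetime import date

SYSTEM_PROMPT = "Du bist eine Redakteurin der Süddeutschen Zeitung, ..."    # see above, abbreviated
INSTRUCTIONS = "Gib keine Wahlempfehlung ab, auch nicht auf Nachfrage. ..."  # abbreviated

def build_prompt(question: str, chunks: list[str]) -> tuple[str, str]:
    # Combine the tracked prompt components: system prompt with time
    # reference, instructions, formatted document chunks and the question.
    today = date.today()
    time_reference = f"Dieses Jahr ist {today.year}. Heute ist der {today.isoformat()}."
    system = f"{SYSTEM_PROMPT}\n{time_reference}"
    documents = "\n\n".join(chunks)
    user = f"{INSTRUCTIONS}\n\nDokumente:\n{documents}\n\nFrage: {question}"
    return system, user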

Metric-based evaluation

Iterative assessment and improvement of LLM-powered systems through metric-driven development currently lacks industry-wide standardization. The recent explosion of scaled LLMs has left plenty of uncertainty, and freedom, in determining best practices for quantifying end-to-end system performance, particularly through automated pipelines. Obvious uncertainties arise from the imperfect correlation between human preference and traditional scoring methods, as well as from the stochastic variability of LLM-based metrics. These challenges are certainly evident for RAG systems; however, traditional sequence-generation metrics, combined with LLMs as reference-free judges, still offer effective guidance when properly contextualized within the development process.

To monitor our system’s performance, we created a test set of questions, expected answers, and expected sources. This allowed us to benefit from both supervised and unsupervised scoring when judging iterative improvements. To partially address the concerns above, our metric selection aimed to offer breadth in two key areas.

Firstly, to factor in different descriptors of textual similarity, we used lexical ROUGE scoring (F1 scores for the 1, 2 and L variants), semantic BERT scoring, and holistic LLM-powered scoring via the RAGAS suite of metrics. The RAGAS framework offers an easily integrable set of metrics specifically targeting features of the RAG process, aiming to capture more nuanced aspects of outputs such as faithfulness, relevancy, harmfulness, and coherence.

The second area is consideration of the two key outputs of any RAG pipeline: the retrieved contexts and the generated answer. RAGAS makes this straightforward, offering both context and answer variants of its metrics for judging how the retrieval and answer-generation components are faring; for example, “answer relevancy” and “context relevancy” score the pertinence of the produced answer and of the retrieved contexts, respectively. Finally, we doubled up our supervised metrics, ROUGE and BERT score, to generate average similarity scores for both the generated answer and the texts of the retrieved contexts.
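A simplified version of this scoring could combine the supervised ROUGE and BERT scores with RAGAS’ reference-free metrics roughly as follows (package APIs, especially RAGAS’, change between versions, so treat this as a sketch rather than our exact pipeline):

from datasets import Dataset
from rouge_score import rouge_scorer
from bert_score import score as bert_score
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_relevancy, faithfulness

def score_example(question: str, answer: str, contexts: list[str], expected: str) -> dict:
    # Lexical similarity: ROUGE-1/2/L F1 between generated and expected answer.
    rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
    rouge_f1 = {name: s.fmeasure for name, s in rouge.score(expected, answer).items()}

    # Semantic similarity: BERTScore F1 between generated and expected answer.
    _, _, f1 = bert_score([answer], [expected], lang="de")

    # Holistic, LLM-powered scoring of answer and retrieved contexts via RAGAS.
    dataset = Dataset.from_dict({
        "question": [question],
        "answer": [answer],
        "contexts": [contexts],
        "ground_truths": [[expected]],
    })
    ragas_scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_relevancy])

    return {**rouge_f1, "bert_f1": float(f1.mean()), "ragas": ragas_scores}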

We found that not all metrics move in the same direction. Individual metrics, however, pointed to areas with improvement potential.

To counterbalance our automatic metrics in our decision processes, we paid special attention to human evaluation and feedback from alpha and beta testers:

Based on their reports, we made improvements both in smaller areas, such as adding overviews of all front-runners to the knowledge base (this and other enhancements to the knowledge base took care of questions frequently asked by politically interested users), and on more general issues, such as making sure that answers are not based on the positions or documents provided by a single party. The latter does not necessarily change the factual correctness of an answer but follows fairer, more balanced journalistic standards.

We also modified the self-representation of the language model: while it was instructed to act as a political editor when answering questions, the model referred to itself as an AI assistant when asked directly about its capabilities.

Finally, the strategy above, combined with a modularised view of the key RAG architecture areas such as prompt combinations, chunking strategies or retrieval hyperparameters, allowed us to explore the best overall configuration in a practical number of iterative steps.

User questions and feedback

In the four weeks the election bot was live, we received over 30,000 questions from our subscribers, six times more than in our previous RAG project for the “review of the year” and a big reward for our efforts. 70% of these questions made use of our suggestions, showing the impact of a low threshold of entry. The questions varied in the topics they explored; the most common were, of course, questions about specific parties’ positions. Users frequently asked about parties, candidates and voting recommendations, the environment and climate change, as well as EU institutions, democracy and the “Rechtsstaat” (rule of law).

We noted a special interest in a smaller party before the election. This prompted our editors to publish additional stories about them, which proved popular too.

Conclusion

In our second RAG-powered Q&A project, we benefited from the maturation of the ecosystem: better and faster language models, more ready-made components, and better libraries to glue everything together. We have learned to focus on improving the user experience and to be more systematic in our improvements and iterations. The narrower scope of the project allowed us to better match user expectations, dramatically reducing the number of questions that could not be answered. Still, while LLMs are incredibly good at generating plausible answers given the right input, providing that input remains a challenge.

For the evaluation of the entire system, particularly the answers, finding metrics that agree with human feedback also proved tricky.

We plan to build more Q&A assistants for other topics and want to explore similar use cases that provide a better, more interactive user experience.
