AI Agents: How Klarna replaced the work of 700 Humans in 1 Month

Game-changing AI agents

Published in Altar.io, Jul 15, 2024

AI Agents promise incredible benefits when it comes to streamlining operations and reducing the need for human touch in increasingly complex tasks.

It is still early days, but innovative companies are already reporting tremendous cost reductions and efficiency gains from deploying this technology.

An example of this is Klarna.

The Fintech company, in collaboration with OpenAI, has implemented autonomous agents across their customer support operation.

Just one month after implementation, they reported a reduction in average resolution time from 11 to 2 minutes!

Their implementation was able to engage in 2.3M conversations with customers, handling two-thirds of all customer service interactions, and leaving only the most complex cases for humans to resolve. Their agents performed the equivalent work of 700 full-time employees. Insane results!

I will share Klarna’s report at the end, but, if you are a start-up founder, it should be very clear that staying competitive requires you to get familiar with these tools and leverage them in your business.

AI agents will change the structure of organisations, and according to Sam Altman, open up the possibility for one-person unicorns.

In this article, I will leave you with:

An overview of relevant concepts to understand AI agents: LLM, Chain, RAG, Agent, and Multi-Agent Systems (MAS).
A comprehensive list of leading tools to try out
A list of performance benchmarks showcasing current capabilities and limitations

Concept Overview: LLM-based Systems

An AI agent, simply put, is an LLM-based system capable of handling tasks that require multiple undefined steps.

An agent can reason about what would be required in order to solve the prompted task and come up with a plan. It then generates prompted calls to specific LLMs to resolve the different parts of the plan.

At the frontier of LLM-based systems is the concept of Multi-Agent Systems (MAS), where multiple “expert autonomous agents” collaborate, hold dialogues, and access tools for joint problem-solving.

In this section, I will go through the different types of LLM-based systems, with increasing levels of complexity. Compare this with the tasks you want to automate via an AI agent and you will have a clear picture of what you need to implement.

LLM

The Large Language Models we use today started as what we now call Base LLMs: models trained on a large dataset of text from the internet to predict the next token (a short run of characters, such as a word or part of a word).

Base LLMs do a good job of predicting the next token, but they have limited awareness of whether the output is true or false. They often come up with wrong facts, or “hallucinate” the response.
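To make next-token prediction concrete, here is a toy sketch. It is not how a real LLM works internally: it counts which word follows which in a tiny made-up corpus and treats whole words as tokens, whereas real models use neural networks over sub-word tokens. The function names and the corpus are illustrative assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional


def train_bigram(text: str) -> Dict[str, Counter]:
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    counts: Dict[str, Counter] = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts: Dict[str, Counter], word: str) -> Optional[str]:
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up training corpus, for illustration only.
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram(corpus)

print(predict_next(model, "the"))  # most frequent word after "the" in the corpus
print(predict_next(model, "dog"))  # never seen, so the model has nothing to offer
```

The sketch also hints at why hallucination happens: the model returns whatever is statistically likely given its training text, with no notion of whether the continuation is true.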

To improve model quality, a second generation of models called Instruction-Tuned LLMs emerged.

Instruction-Tuned LLMs are built on top of Base LLMs: they are fine-tuned on examples of instructions and good responses, and then improved with Reinforcement Learning from Human Feedback (RLHF) to increase the quality of answers.

This has brought us to the race for model quality that is currently happening, with frequent releases of new model versions — ChatGPT, Claude, Llama, and more.

Model quality has come a long way, but it is still far from perfect. To emphasise this point, see the graph below provided by Anthropic on the accuracy of their most advanced models when solving “hard questions”:

LLMs can be prompted in many ways to improve output quality. However, it might be the case that you want your system to execute multiple tasks in a row, given only one prompt. This is where chains come in.

Chain

A Chain is a sequence of calls to an LLM, a tool, or a data preprocessing step.

Chains allow for the completion of more complex tasks than a simple LLM, given that they can receive the output of a previous task as input to the next task, triggering a sequence of events.

In a chain, the sequence of sub-tasks necessary to resolve a prompted meta-task is hardcoded.

Implementing a chain means:

Ability to tackle complex tasks by resorting to a sequence of sub-tasks
Higher control over the sub-tasks and output
Higher robustness, given a hard-coded strategy
Less adaptability, given a hard-coded strategy
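The idea of a hard-coded sequence can be sketched in a few lines of plain Python. This is a minimal illustration, not the API of any framework: `fake_llm` is a hypothetical stand-in for a real model call, and the three steps are assumptions chosen for the example.

```python
from typing import Callable, List

Step = Callable[[str], str]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; it only echoes the prompt."""
    return f"[LLM reply to: {prompt}]"


def clean_input(text: str) -> str:
    """Data preprocessing step: collapse repeated whitespace."""
    return " ".join(text.split())


def summarise(text: str) -> str:
    """First LLM call in the chain."""
    return fake_llm(f"Summarise: {text}")


def translate(text: str) -> str:
    """Second LLM call, fed with the output of the first."""
    return fake_llm(f"Translate to French: {text}")


def run_chain(steps: List[Step], user_input: str) -> str:
    """Run the steps in their fixed order, passing each output to the next step."""
    result = user_input
    for step in steps:
        result = step(result)
    return result


# The order of sub-tasks is decided by the developer, not by the model.
chain = [clean_input, summarise, translate]
print(run_chain(chain, "  AI   agents are  changing customer support.  "))
```

Because the list of steps is fixed in code, the behaviour is predictable and easy to test, which is where the robustness comes from. The same fixed list is also why a chain cannot adapt when a task needs a step nobody anticipated.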

If you’d like to learn how to implement a chain, check out this gentle introduction to chaining LLMs and utils via Langchain.

Retrieval Augmented Generation (RAG)

RAG refers to an LLM-based system where the LLM is provided access to a set of documents, allowing the model to find and fetch relevant information to inform the generation of answers.

Instruction-tuned LLMs generate significantly better responses than Base LLMs, but hallucinations still happen. A good example of hallucination is when a model is asked to talk about a product that does not exist, from a company that does: it may confidently describe the product anyway.

In order to resolve such cases, RAG systems are one of the most popular approaches.

In comparison to using a simple LLM, a RAG system provides:

Reduced hallucination
More control over the output
Explainability
Access to proprietary/private information to inform responses
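A rough sketch of the retrieve-then-generate flow follows. It makes several simplifying assumptions: the documents are made up, retrieval is plain word overlap rather than the embedding search most real systems use, and the final LLM call is left out so that only the prompt construction is shown.

```python
import re
from typing import Dict, List, Set, Tuple

# Made-up "private" documents, for illustration only.
DOCS: Dict[str, str] = {
    "returns-policy": "Items can be returned within 30 days of delivery.",
    "shipping-info": "Standard shipping takes 3 to 5 business days.",
    "payment-options": "You can pay by card or by bank transfer.",
}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, docs: Dict[str, str], top_k: int = 1) -> List[Tuple[str, str]]:
    """Rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        docs.items(),
        key=lambda item: len(question_words & tokenize(item[1])),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, retrieved: List[Tuple[str, str]]) -> str:
    """Put the retrieved text in the prompt, tagged with its source id."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieved)
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you do not know.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


question = "How many days does shipping take?"
hits = retrieve(question, DOCS)
print(build_prompt(question, hits))
```

Tagging each passage with its source id is what gives the explainability benefit: the answer can be traced back to the document it came from.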

To implement a simple RAG system, see this Stack Overflow blog post on RAG.

Autonomous AI Agents

AI Agents are capable of splitting the task given in a prompt into sub-tasks and triggering the execution of those sub-tasks.

Unlike chains, where the sub-tasks are hardcoded, agents can figure out which sub-tasks are needed to resolve the meta-task, coming up with a plan.

Because agents can also fetch information from a set of pre-selected documents, their output is about as reliable as that of a RAG system. However, agents can tackle more complex tasks.

Agents can:

Break a prompted task into sub-tasks
Figure out which tools and documents are needed to complete each sub-task
Plan the execution of sub-tasks
Access tools & documents as needed for each sub-task
Orchestrate the execution of sub-tasks
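The capabilities listed above can be sketched as a small plan-then-execute loop. This is a deliberately simplified illustration: in a real agent the plan would come from an LLM, whereas the `plan` function here uses keyword rules as a stand-in, and both tools are hypothetical stubs.

```python
from typing import Callable, Dict, List


def search_tool(query: str) -> str:
    """Hypothetical search tool; a real one would call a search API."""
    return f"search results for '{query}'"


def chart_tool(data: str) -> str:
    """Hypothetical charting tool; a real one might run a code interpreter."""
    return f"chart built from {data}"


TOOLS: Dict[str, Callable[[str], str]] = {"search": search_tool, "chart": chart_tool}


def plan(task: str) -> List[str]:
    """Decide which tools the task needs.

    Assumption: keyword rules stand in for the reasoning an LLM would do.
    """
    steps = ["search"]
    if any(word in task.lower() for word in ("chart", "graph", "plot")):
        steps.append("chart")
    return steps


def run_agent(task: str) -> List[str]:
    """Execute the planned steps in order, feeding each result into the next tool."""
    observation = task
    trace = []
    for tool_name in plan(task):
        observation = TOOLS[tool_name](observation)
        trace.append(f"{tool_name}: {observation}")
    return trace


for line in run_agent("Show a graph of obesity rates over the last decade"):
    print(line)
```

The difference from a chain is where the list of steps comes from: here it is computed per task at run time, so two different prompts can trigger two different sequences of tools.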

An example of a prompt that a simple LLM or RAG system could not solve, but that AI agents can tackle, is the following:

“How has the trend in the average daily calorie intake among adults changed over the last decade in the United States, and what impact might this have on obesity rates? Additionally, can you provide a graphical representation of the trend in obesity rates over this period?”

To resolve this prompt, an AI agent might:

Access the Search API tool
Access health-related publications
Access public/private health databases
Access the “code interpreter” tool
Generate useful charts on obesity trends

If we wanted to resolve a similar problem consistently, say for every country in the world, a Chain would be a good idea. It would give us control over the sequence of sub-tasks and the tools to use, and let us optimise the process for the best results.

Multi-Agent Systems (MAS)

Agents are fantastic. Still, like humans, they can’t become an expert in everything.

Much like a new intern at your company, an Agent will only be as good as the specificity of the tasks requested of it. And teamwork leads to the best results. This is where Multi-Agent Systems come in.

Multi-Agent Systems (MAS) are teams of Autonomous AI Agents, which can independently specialise in handling specific tasks and can interact, collaborate and negotiate towards the resolution of high-level tasks.

Some of the benefits of MAS are increases in:

Performance: Mix of Experts approach
Robustness: You can add QA AI agents to the system
Scalability: Parallel processing, spin-up of AI agents as needed

Multi-agent systems can be fully autonomous — solving the request from an initial prompt without additional human input — or request human input between steps.

Making these systems “semi-automatic” at first is useful for fine-tuning the system. In some cases, human input is imperative. In others, once the system is consistently performing as required, full automation is possible. You can set the human feedback option as a parameter in most, if not all, of the frameworks.
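As a rough sketch of how a team with a QA agent and an optional human checkpoint might fit together, consider the example below. The two agents are hypothetical rule-based stubs rather than LLM calls, and the `human_input` parameter is an assumption made for this example, not the setting of any particular framework.

```python
from typing import Callable, Optional


def writer_agent(task: str, feedback: Optional[str] = None) -> str:
    """Hypothetical specialist agent that drafts (or redrafts) an answer."""
    draft = f"Draft answer for: {task}"
    if feedback:
        draft += f" (revised after feedback: {feedback})"
    return draft


def qa_agent(draft: str) -> Optional[str]:
    """Hypothetical QA agent: returns feedback, or None when the draft passes.

    Assumption: any draft that has been revised at least once is accepted.
    """
    if "revised" not in draft:
        return "add supporting sources"
    return None


def run_team(
    task: str,
    max_rounds: int = 3,
    human_input: Optional[Callable[[str], str]] = None,
) -> str:
    """Let writer and QA agents iterate, then optionally ask a human for a final note."""
    feedback: Optional[str] = None
    draft = ""
    for _ in range(max_rounds):
        draft = writer_agent(task, feedback)
        feedback = qa_agent(draft)
        if feedback is None:
            break
    if human_input is not None:
        note = human_input(draft)
        if note:
            draft = writer_agent(task, note)
    return draft


# Fully autonomous run: no human in the loop.
print(run_team("Explain our refund policy"))

# Semi-automatic run: a human reviews the final draft.
print(run_team("Explain our refund policy", human_input=lambda draft: "shorten it"))
```

Passing `human_input=None` gives the fully autonomous mode, while passing a callback gives the semi-automatic one. Starting with the callback and removing it once results are consistently good mirrors the rollout described above.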

Multi-Agent Frameworks & Tools

In this section, I will go over the latest frameworks and tools you can use to implement your AI Agents and Multi-Agent Systems effectively.

Using each framework effectively requires hands-on experimentation. Since it makes little sense to copy all the documentation here, I will simply list the tools I consider most relevant and point to the best places to dive deeper into each one, with examples and use cases where possible.

Here is a list of tools you can test to build AI agents and multi-agent systems:

AutoGen

AutoGen is a framework by Microsoft that facilitates the development of large language model (LLM) applications using multiple customisable and conversable agents. These agents can interact autonomously or with human feedback to solve complex tasks, integrating LLMs, tools, and human inputs seamlessly. Documentation.

LangGraph

LangGraph offers a graph-based approach to managing interactions between multiple AI agents, enabling complex reasoning and task-solving capabilities. It allows developers to visualise and optimise the interactions and dependencies between various agents. Learn more.

CrewAI

CrewAI is a platform for orchestrating autonomous AI agents in collaborative environments, designed to enhance productivity through multi-agent workflows. It supports various tasks such as web searching, data analysis, and delegation among agents to streamline complex task execution. Documentation.

Check out this interesting example of LangGraph + CrewAI for automatic email fetching and response draft generation:

MetaGPT

MetaGPT allows users to deploy GPT-based AI agents capable of performing a variety of tasks, from answering queries to generating content. It leverages the capabilities of GPT models to build responsive and interactive applications, facilitating natural language interactions. Learn more.

Camel-ai

Camel-ai specialises in creating adaptive AI agents that can learn and evolve. Its framework supports the development of agents that can autonomously adapt to new tasks and environments, enhancing the AI’s capability to handle dynamic scenarios. Learn more.

Langroid

Langroid is a framework for building modular and scalable AI systems using a collection of specialised agents. It supports the integration of different AI models and tools, providing a flexible infrastructure for developing complex AI applications. Learn more.

Status Quo

Performance benchmarks

There have been great efforts to create leaderboards and scores for LLM performance: see the Hugging Face leaderboard and LLM Benchmarks explained.

However, evaluating a team of agents is no easy task.

Several benchmarks have been designed to evaluate LLM agents. Notable examples include ALFWorld, IGLU, Tachikuma, AgentBench, SocKET, AgentSims, ToolBench, WebShop, Mobile-Env, WebArena, GentBench, RocoBench, EmotionBench, PEB, ClemBench, and E2E.

For a deeper understanding, read “AgentBench benchmark to evaluate LLM-as-Agent on real-world challenges and 8 different environments”, Liu et al. (Oct 2023), here.

You can also read “SmartPlay: A Benchmark for LLMs as Intelligent Agents”, Wu et al. (Mar 2024), here.
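At their simplest, agent benchmarks boil down to running an agent on a set of tasks and scoring the outcomes. The sketch below shows only that scoring idea. The benchmarks named above use far richer, multi-step environments, and the toy agent and task list here are made up for illustration.

```python
from typing import Callable, List, Tuple


def toy_agent(task: str) -> str:
    """Hypothetical agent under test: it only knows how to add two integers."""
    parts = task.split("+")
    if len(parts) == 2:
        try:
            return str(int(parts[0]) + int(parts[1]))
        except ValueError:
            return "unknown"
    return "unknown"


# Made-up evaluation set: (task, expected answer).
TASKS: List[Tuple[str, str]] = [
    ("2+2", "4"),
    ("10+5", "15"),
    ("capital of France", "Paris"),
]


def success_rate(agent: Callable[[str], str], tasks: List[Tuple[str, str]]) -> float:
    """Fraction of tasks where the agent's answer exactly matches the expected one."""
    passed = sum(1 for task, expected in tasks if agent(task).strip() == expected)
    return passed / len(tasks)


print(f"Success rate: {success_rate(toy_agent, TASKS):.0%}")
```

Exact-match scoring works for tasks with a single right answer. Open-ended or multi-step tasks need other checks, which is part of why evaluating agents is hard.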

Challenges

LLM agents are still in their early days, and there are several challenges and limitations to consider when developing them.

Prompt Robustness and Reliability: An LLM agent often requires several prompts to power different modules, such as memory and planning. Even minor changes to prompts can cause reliability issues. LLM agents use a comprehensive prompt framework, making them susceptible to robustness problems. Potential solutions include iterative prompt crafting, automatic prompt optimisation/tuning, or generating prompts using GPT. Additionally, hallucinations are a common problem, where agents generate inaccurate information due to conflicting external data.

Knowledge Boundary: Managing the knowledge scope of LLMs is challenging, which can impact the accuracy of simulations. An LLM’s internal knowledge can introduce biases or utilise information unknown to the user, affecting the agent’s behaviour in specific scenarios.

Role-playing Capability: LLM-based agents often need to adopt specific roles to effectively accomplish tasks within a given domain. For roles that the LLM does not inherently embody well, fine-tuning the LLM with data representing rare roles or psychological profiles can be beneficial.

Long-term Planning and Finite Context Length: Planning over an extended history is difficult and can lead to errors that the agent may struggle to correct. Additionally, LLMs have limitations in the length of context they can handle, which can restrict the agent’s ability to use short-term memory effectively.
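One common way to work within a finite context window is to keep only the most recent messages that fit a token budget. The sketch below illustrates that idea under a loud assumption: `count_tokens` approximates tokens by counting whitespace-separated words, whereas real tokenizers split text differently. The conversation is made up.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def trim_history(messages: List[str], budget: int) -> List[str]:
    """Keep the newest messages whose combined size fits within the budget."""
    kept: List[str] = []
    used = 0
    # Walk backwards so the most recent messages are considered first.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    # Restore chronological order before returning.
    return list(reversed(kept))


# Made-up conversation, for illustration only.
history = [
    "user: hello there",
    "agent: hi how can I help",
    "user: where is my order",
    "agent: let me check",
]

for message in trim_history(history, budget=10):
    print(message)
```

The trade-off is visible in the output: older messages are dropped entirely, so anything the agent needed from them is lost unless it is summarised or stored elsewhere.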

Efficiency: LLM agents require numerous requests to be processed by the LLM, which can impact the efficiency of agent actions due to reliance on the LLM’s inference speed. Moreover, deploying multiple agents can be costly.

Human Alignment: Aligning agents with a wide range of human values is challenging, a problem also seen in standard LLMs. One potential solution is to realign the LLM by creating advanced prompting strategies.

Conclusion — Works in Practice, to be Proven in Theory

As shown by industry-leading use cases like Klarna’s customer service automation, AI agents are production-ready — i.e., they are already delivering large value to organisations — even if quantifying their capabilities is still a hard research challenge.

Experimentation and subsequent implementation of these tools into your organisation will be crucial so as not to fall behind.

Finding yourself in a position where a competitor is using AI agents to streamline a process you still pay people to perform manually will, eventually, kill your business.

To survive, every organisation today must upskill its team with AI, automate what can be automated, and strategically move people to higher value-generating tasks.

In this article, I have provided you with enough insight to stay ahead of the competition and thrive in the age of AI. Take your time to dive deeper into the tools and if you have any questions, feel free to reach out.

Thanks for reading.

Finally, as I mentioned at the beginning of the article, here’s the full report on Klarna’s experience using AI customer support agents.