How to Integrate an LLM-Based Agent into Your App: Understand User Intents and Deliver Value Faster

Keren Ben-Arie
Lightricks Tech Blog
11 min read · Jun 25, 2024

Introduction

AI Agents are gaining momentum with the release of tools like OpenAI APIs, offering features that can be seamlessly integrated into existing products. These tools enable developers to incorporate and fine-tune machine learning models, ensuring they operate within the app’s context. Natural language processing (NLP) allows applications to understand and respond to human language in a contextually relevant and personalized manner, effectively analyzing user intentions and preferences.

Throughout this blog post, I will provide guiding questions to help you reflect on these concepts and consider how to apply them to your own projects.

What benefit do we gain from using AI Agents?

An AI agent is a software entity that autonomously performs tasks by making decisions based on its environment and objectives. It uses machine learning models to analyze data, learn from interactions, and execute actions to achieve specific goals without human intervention. Agents are widely used in various applications, from virtual assistants and recommendation systems to autonomous vehicles and smart home technologies.

In this context, I will discuss the process of integrating a goal-based LLM Agent and enhancing fault tolerance by expanding it into an LLM multi-agent system. This project introduces a personal assistant within Facetune, Lightricks' photo and video editing app that serves millions of users. The assistant is designed to turn the app's advanced editing features into simple, user-friendly requests. By leveraging natural language processing, users can perform complete edits without manually navigating through multiple tools, making advanced editing capabilities accessible without requiring familiarity with every part of the app.

For your project, think about the following questions:

  • What are the potential applications of AI agents in your own projects?
  • Where could autonomous task performance and decision-making add value?
  • How could a personal assistant improve your app and deliver value to users faster?

Understanding Facetune’s Unique Approach

Facetune is a popular application developed by Lightricks. It’s best known for its powerful tools that allow users to enhance and retouch photos, particularly portraits and selfies, directly from their smartphones. The app offers a range of features that make professional-level photo retouching accessible to everyday users.

The main point to keep in mind about Facetune in this post is that the app's tools were designed to make subtle changes to the user's image while preserving their identity. Therefore, we wanted to keep the existing tools rather than use text-to-image engines, which often introduce very aggressive changes. This constraint made our job a lot more interesting: we had to make the LLM "know" our app.

What specific context and knowledge about your app does the agent need to maintain the current user experience and avoid disrupting core functionalities?

Our use case: from natural language request to an edited image

The integration begins with a user expressing their editing desires through natural language. Simple examples of queries include: “make my skin smoother”, “remove my eyebags”, “blur my background”, and so on. This input initiates a process in which a refined call to ChatGPT, acting as the Facetune personal assistant, interprets and translates these requests into Facetune’s in-house editing actions and then applies an auto-edit based on these actions.
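For illustration, a request like the first one might map to a structured editing action along these lines (the parameter name and value are hypothetical, not Facetune's real schema):

```python
# Hypothetical mapping from a natural-language request to a structured editing action.
user_request = "make my skin smoother"
editing_actions = {"smooth": 0.6}  # parameter name and intensity value are illustrative only
```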

Assistant in Action

Define a use case for your project:

  • What natural language requests might your users make?
  • How will these requests be translated into specific actions within your application?

Project Architecture

The ChatGPT-based code is deployed within a dedicated Flask app exposing a single endpoint that serves as the Assistant Agent. If the Assistant Agent fails, the task is delegated to the Search Agent to increase fault tolerance. Instead of calling the OpenAI API directly, the Facetune app communicates with this Flask app. The Flask app preprocesses the user prompt, together with the app's supported tools and capabilities, into refined instructions for the LLM, ensuring outputs are consistent, predictable, and in a controlled format. The method we used to achieve this consistency is elaborated in the following paragraphs.
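As a rough sketch of that endpoint (assuming the openai Python client v1+; the route name, model, and payload fields are illustrative, not the production API):

```python
# Minimal sketch of the Assistant Agent endpoint; the route, model name, and payload
# fields are illustrative, not Facetune's production API.
import json
from flask import Flask, request, jsonify
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_instructions(supported_tools: dict[str, str]) -> str:
    """Condense the tools available on the user's device into refined system instructions."""
    tool_lines = "\n".join(f"- {name}: {description}" for name, description in supported_tools.items())
    return (
        "As a professional photographer and portraits editor, you will help me edit my selfie photos.\n"
        "You should output the parameter name and its value in a valid JSON format.\n"
        f"Available tools:\n{tool_lines}\n"
        "If the input is unclear or there is no matching editing tool, return an empty JSON."
    )


@app.post("/assistant")
def assistant():
    payload = request.get_json()
    messages = [
        {"role": "system", "content": build_instructions(payload["supported_tools"])},
        {"role": "user", "content": payload["user_prompt"]},
    ]
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    try:
        actions = json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        actions = {}  # an empty result lets the app trigger the fallback described later
    return jsonify({"actions": actions})
```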

We deployed the Flask app on Google Cloud Run, taking advantage of its fully managed serverless environment. With millions of users, it was necessary to ensure that our application could handle fluctuating request volumes efficiently. Cloud Run allows us to automatically scale the web application in response to incoming traffic, providing high availability. This setup simplifies the deployment process, allowing us to focus on developing and refining the agent without worrying about infrastructure management.

Project Workflow Diagram

We chose this approach because the project was built before direct API calls to the OpenAI Assistants API were available. These direct API calls are now available and simplify integration; explore them to see if they meet your needs before considering a managed environment like Google Cloud Run.

Development Process

Before even touching Facetune's codebase, work began on the exploratory side of LLMs. The main task was to understand how to approach the integration. Several key tasks were identified to tailor the LLM to our needs:

  1. Understanding app features and context in order to provide effective responses.
  2. Ensuring a consistent output format, which is crucial for integration.
  3. Handling cases where the agent might fall short.

As you read through these tasks, take a moment to think about how these challenges might apply to your own project. Consider questions like:

  • What specific features does your app have that the LLM needs to understand?
  • What output format will ensure seamless integration for you?
  • Which flows might be difficult for the agent to support at the proof-of-concept (POC) stage of your project?

I will now examine each of these tasks in detail, showing how we approached them and the solutions we ultimately implemented.

1. Understanding app features and context in order to provide effective responses

Mitigating LLM’s Visual “Blindspot”

An initial concern was that the LLM wouldn’t actually “see” the input image, which could potentially lead to overly dramatic effects on images that were already quite good. For example, if a user uploads an image that is already very bright and asks to make it even brighter, the result could be an “overexposed” image that loses detail and quality. This wouldn’t be considered a successful edit. There was a worry that the responses from the agent might be too intense, potentially compromising the reputation of Facetune’s well-regarded subtle edits.

To address this perceived problem, a component called CLIP Interrogator was initially added. CLIP Interrogator generates a text description of the image, which is then included in the user input sent to the LLM. This way, we hoped the LLM would be able to "understand" the details and contents of the image better, ensuring more contextually appropriate edits.
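For reference, generating such a description with the open-source clip-interrogator package looks roughly like this (the model choice and file name are placeholders):

```python
# Rough sketch of producing an image description with the clip-interrogator package.
from PIL import Image
from clip_interrogator import Config, Interrogator

interrogator = Interrogator(Config(clip_model_name="ViT-L-14/openai"))
image = Image.open("selfie.jpg").convert("RGB")
description = interrogator.interrogate(image)  # e.g. "a close-up portrait of a smiling person, soft lighting, ..."
# The description would then be appended to the user prompt before calling the LLM.
```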

However, this approach didn't improve the results as expected. In fact, the LLM provided satisfying results without it, and CLIP Interrogator only inflated the prompt size. By focusing on simpler solutions, like adding a slider to control the intensity of the effects applied to the image, such issues were effectively managed. This experience highlighted the importance of reevaluating tools and approaches, discarding those that aren't suitable, and solving problems creatively and efficiently.

Embedding App-Specific Knowledge into the LLM

We created a mapping between the app's editing tools and their descriptions. Each user has a different set of supported features based on the monetization method they have chosen. To handle this, we dynamically extracted the specific features supported on the user's device and included them in the instructions prompt, ensuring that we didn't send features the user couldn't use or wasn't subscribed to. Additionally, we provided ChatGPT with positive and negative examples of usage and determined that if a request is unclear, or involves body parts rather than the face, it should return an empty JSON.
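A simplified sketch of that mapping and the per-user filtering step (the tool names, descriptions, and entitlement check are hypothetical):

```python
# Hypothetical tool catalog and per-user filtering; names and descriptions are illustrative.
TOOL_CATALOG = {
    "smooth": "Smooths skin texture while keeping natural detail. Valid range: 0.0-1.0.",
    "whiten": "Brightens teeth in the detected smile. Valid range: 0.0-1.0.",
    "defocus": "Blurs the background behind the subject. Valid range: 0.0-1.0.",
}


def tools_for_user(device_supported_features: list[str]) -> dict[str, str]:
    """Keep only the tools the user's app version and subscription actually support."""
    return {
        name: description
        for name, description in TOOL_CATALOG.items()
        if name in device_supported_features
    }
```

The filtered mapping is what feeds the system instructions sent to the LLM, so users never see suggestions for tools they can't access.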

Handling Context Window Limitations

As you can imagine, such descriptions and instructions can pile up into quite a lot of text. A significant technical hurdle for this project was the hard limit on the number of tokens the GPT model could process. This required innovative solutions to ensure the model could efficiently understand and process complex editing instructions without exceeding these limits. This limitation also supported our decision to remove CLIP Interrogator, helping to reduce the prompt size.
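A quick way to keep an eye on prompt size during development is to count tokens with the tiktoken library (the model name and budget below are placeholders):

```python
# Counting prompt tokens before sending a request; the model name and budget are illustrative.
import tiktoken


def count_tokens(text: str, model: str = "gpt-4") -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


prompt = "As a professional photographer and portraits editor, ..."  # instructions + tool list + user request
if count_tokens(prompt) > 6000:  # leave headroom for the model's reply
    raise ValueError("Prompt too long: trim tool descriptions or examples")
```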

Semantic Similarity Approach for Token Efficiency

An initial strategy to solve the long-prompt problem involved embedding descriptions of Facetune’s editing tools, along with user queries, into a latent space. By storing these embeddings in a vector database, we could use cosine similarity to search for semantic matches between a user’s query and the top-k features. Semantic similarity measures how much two pieces of text share meaning, enabling the identification of related content even if different words are used. Cosine similarity is a mathematical technique used to determine semantic similarity by calculating the cosine of the angle between two vectors in a multi-dimensional space, indicating how closely they align.
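A condensed sketch of that matching step, using the OpenAI embeddings endpoint and an in-memory store instead of a real vector database (the embedding model name is a placeholder):

```python
# Semantic matching of a user query against tool descriptions via cosine similarity.
# Uses an in-memory store for brevity; in practice the embeddings lived in a vector database.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])


def top_k_tools(query: str, tool_descriptions: dict[str, str], k: int = 3) -> list[str]:
    """Return the k tool names whose descriptions are semantically closest to the query."""
    names = list(tool_descriptions)
    tool_vectors = embed(list(tool_descriptions.values()))  # precompute and store these in practice
    query_vector = embed([query])[0]
    # Cosine similarity = dot product of L2-normalized vectors.
    tool_vectors /= np.linalg.norm(tool_vectors, axis=1, keepdims=True)
    query_vector /= np.linalg.norm(query_vector)
    scores = tool_vectors @ query_vector
    return [names[i] for i in np.argsort(scores)[::-1][:k]]
```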

This approach aimed to reduce reliance on extensive token usage by identifying precise editing actions through semantic similarity analysis. However, despite its promise, it struggled to achieve the required precision for Facetune-specific commands due to the complexity of accurately matching user queries with appropriate editing features.

As a result, the project shifted towards optimizing how user requests were parsed and condensed before being fed into the AI model, focusing on extracting key information and restructuring queries to maximize informational value per token. These adjustments allowed for more effective communication within token limitations.

2. Ensuring Consistent Output: How to Tame the Model to Your Needs

Our initial implementation used the LangChain framework. Although it was accurate, we noticed that it also enlarged the prompt size. So we decided to try a few-shot instruction prompt, still supported by LangChain, which provides structured classes like BaseMessage, HumanMessage, and SystemMessage to handle the different types of messages exchanged with the LLM. This helped us define clear and structured prompts.

Few-Shot Prompt Structure
Few-shot learning was a natural step for addressing the prompt size issue: crafting a single, detailed prompt with a clear instruction and multiple examples. The prompt structure for this project was crafted and evaluated against sample data we created until it produced accurate and contextually relevant responses. Here's how it was designed:

In our few-shot prompt, we included several key components:

  1. Role-based Instruction (Persona-Pattern): The prompt begins with a clear role to explicitly set expectations. For example, “As a professional photographer and portraits editor” helps GPT understand its expertise and the type of tasks it will perform.
  2. Task-focused Guidance: We provided a clear and concise task for the model to focus on, such as “you will help me edit my selfie photos”. This minimizes ambiguity and ensures the model knows the primary objective.
  3. Output Formatting: We specified the required format for the output, ensuring consistency and usability. For example, “You should output the parameter name and its value in a valid JSON format.” These days there are GPT models that offer a dedicated JSON output mode, so this step might be redundant in future projects.
  4. Resource and Tools Listing: A comprehensive list of tools with descriptions and valid ranges was included. This equips the model with all necessary information to perform the task accurately.
  5. Example-driven Learning: By providing multiple examples, we clarified the expected input-output relationship. This transition from zero-shot to few-shot learning is what makes our prompt effective, as the examples provide clear guidance for the model.
  6. Handling Ambiguous or Irrelevant Inputs: We included clear instructions on handling invalid or irrelevant inputs to ensure robustness. For example, “If the input is unclear or there is no matching editing tool for the given input, then return an empty JSON.”

Implementing few-shot learning required ensuring the model could accurately interpret and respond to user queries within the constraints of a single prompt. This necessitated detailed instructions to cover a broad range of possible inputs.
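Putting these components together with LangChain's message classes might look roughly like this (package paths assume a recent LangChain release; the tools, examples, and model are placeholders, not the production prompt):

```python
# Sketch of assembling the few-shot prompt with LangChain message classes; contents are illustrative.
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

system_message = SystemMessage(content=(
    "As a professional photographer and portraits editor, you will help me edit my selfie photos.\n"  # persona + task
    "You should output the parameter name and its value in a valid JSON format.\n"                    # output format
    "Available tools (name: description, valid range):\n"
    "- smooth: smooths skin texture, 0.0-1.0\n"
    "- defocus: blurs the background, 0.0-1.0\n"
    "Examples:\n"
    'Input: "make my skin smoother" -> Output: {"smooth": 0.6}\n'
    'Input: "what is the weather today?" -> Output: {}\n'
    "If the input is unclear or there is no matching editing tool for the given input, return an empty JSON."
))

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
reply = llm.invoke([system_message, HumanMessage(content="blur my background a little")])
print(reply.content)  # expected: a small JSON object such as {"defocus": 0.3}
```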

Each project may require its own cycles of trial and error to achieve optimal results. Ask yourself:

  • What role-based instructions and task-focused guidance will you provide in your prompt to ensure the LLM understands its purpose and tasks?
  • How will you handle ambiguous inputs, such as vague requests or irrelevant statements, to ensure the robustness of your LLM responses?

Tip: It is recommended to write an evaluation script to continuously test and refine the prompt’s clarity and conciseness, emphasizing key information and structured examples.
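A minimal version of such a script, assuming a hypothetical ask_assistant() wrapper around the prompt above:

```python
# Tiny evaluation loop; ask_assistant() is a hypothetical wrapper that returns the model's raw JSON string.
import json

EVAL_CASES = [
    ("make my skin smoother", {"smooth"}),
    ("blur my background", {"defocus"}),
    ("what's the capital of France?", set()),  # irrelevant input should yield an empty JSON
]


def evaluate(ask_assistant) -> float:
    hits = 0
    for prompt, expected_keys in EVAL_CASES:
        try:
            actions = json.loads(ask_assistant(prompt))
        except json.JSONDecodeError:
            continue  # malformed output counts as a failure
        if set(actions) == expected_keys:
            hits += 1
    return hits / len(EVAL_CASES)
```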

3. Handling agent limitations with a multi-agent fallback mechanism

When creating a proof of concept for an agent like this, we had to consider that there would be cases in which the agent fails or cannot provide a solution. For example, one challenge we faced was with app features that depend on user-generated inputs, such as custom-drawn masks (also known as binary masks or segmentations in image processing). Brushes are tools users drag on the screen to create these masks, which define which parts of the image should be edited. These inputs couldn't be processed by the LLM due to its visual blind spot. This is one example of why not all queries can simply be translated into actionable edits.

Therefore, a fallback mechanism was introduced, transitioning our single-agent system into a multi-agent system that includes both an assistant agent and a search agent. The search agent helps users find features and sub-features within the app, similar to the search functionality in an iPhone. This agent is also powered by advanced language models and guides users towards manually using Facetune’s tools for those detailed edits the assistant agent couldn’t execute directly. This improves the chances users will achieve their editing goals by offering a smooth transition between AI-assisted and manual editing within the app.
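In code terms, the routing can be as simple as the following sketch, where assistant_agent and search_agent are hypothetical stand-ins for the two LLM-backed services:

```python
# Fallback routing between the Assistant Agent and the Search Agent; both callables are hypothetical.
from typing import Callable


def handle_request(
    user_prompt: str,
    assistant_agent: Callable[[str], dict],  # hypothetical: returns tool -> value, or {} when it falls short
    search_agent: Callable[[str], list],     # hypothetical: returns matching feature names within the app
) -> dict:
    actions = assistant_agent(user_prompt)
    if actions:
        return {"type": "auto_edit", "actions": actions}
    # Fall back: guide the user to the relevant in-app tools instead of editing automatically.
    return {"type": "search_results", "features": search_agent(user_prompt)}
```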

Fallback to Search Agent

Pause and reflect:

  • How will you implement a fallback mechanism to handle scenarios where the AI agent may fail?
  • What roles can additional agents play in supporting the primary agent to ensure users achieve their goals?

Future Directions

Looking ahead, the integration of an LLM within Facetune opens numerous directions for further enhancing the app's editing capabilities and user experience. With newer GPT models, it's now possible to attach images to requests, and the maximum token limit has increased. Lastly, we'd like to minimize the fallback scenarios and support all app tools, including those that use brushes.

Conclusion

The journey to integrate an LLM-based agent into Facetune involved overcoming numerous challenges. Initially, the LangChain framework was considered for creating a fully autonomous agent. However, the prompt size expansion it introduced led us instead to develop a custom solution, partially supported by LangChain features. Deploying the model queries as a Flask app and having the mobile app interact with it to get parsable responses established a system that perceives inputs, processes them according to predefined instructions, and generates appropriate outputs.

To sum up, this project demonstrates that an LLM-based agent system can be a valuable tool for addressing various challenges in application development.
I encourage you to consider your own project needs, define your use cases, and leverage AI agents powered by LLMs to deliver value faster. Additionally, think about how you can balance using existing frameworks with custom solutions to address your specific needs. By understanding the challenges and solutions implemented in this project, I hope you can apply these insights to ensure an efficient integration of AI technology.
