Transforming E-commerce with a Conversational Search Assistant Powered by Large Language Models

Narendra Parigi · Published in Adevinta Tech Blog · Jan 3, 2024 · 11 min read

Discover how we harness the power of large language models to craft a conversational search assistant that empowers users to search for products using natural language.

An introduction to the Adevinta marketplace project

In the ever-evolving landscape of e-commerce, achieving success is intricately linked to our capacity to continually meet the changing demands of our customers. As a pivotal aspect of our ongoing e-commerce platform transformation at Adevinta, multiple dedicated teams are collaborating to enhance user experiences. Our collective mission is:

To make the platform more intuitive, navigable and accessible for customers both within and beyond our national borders. Security and trust are paramount, as we strive to facilitate seamless and secure transactions.

In the spirit of our ongoing evolution and commitment to user-friendliness, one of our most exciting projects involves the introduction of a conversational search assistant. This tool empowers users to discover products easily by communicating in their natural language, revolutionising the way they find exactly what they need. In this blog, we explore the methodologies employed in the creation of a conversational search assistant.

Evolution of the conversational search assistant

Foundations of conversational search assistant: Building the skeleton

We built our conversational search assistant vision around a three-component structure. The first component, known as the “conversation model,” takes centre stage: it handles user interactions, adeptly responding to inquiries while staying closely tied to Adevinta’s marketplaces and product offerings. The second component, the “extraction model,” is a critical link in the chain, responsible for extracting user preferences and any recommendations that emerge during the conversation model’s interactions. This step adds a personalised touch to the buying experience. Finally, the third component searches Adevinta’s marketplaces for products that match the user’s preferences and recommendations. We made this possible by leveraging our current search API, which streamlined the process of locating and presenting the desired products. Together, these three components shape our conversational search assistant, enhancing the user experience with meaningful, personalised product recommendations.

Conversational search assistant: The envisaged product

Strategic prompt design for conversational search

Initially, we embarked on the journey of prompt engineering, iterating through various setups and prompts to enhance both the conversational model and extraction model. Our primary goal was to develop a conversation model specialised in engaging users on the subject of cars (the chosen category for our prototype) and effectively addressing their queries. By leveraging insights from current and past conversations, we designed the extraction model to identify the latest user preferences and recommendations, subsequently channelling them to the search API. In this blog, we will delve into the version that proved effective for us after extensive experimentation.

Visualising conversational dynamics: Process flow diagram with prompts

The conversation model (ConvLLM: gpt-3.5-turbo) receives the user’s query and any existing history to initiate the conversational process. During this step, we proactively set the context through prompting to guide the language model, determining the most suitable response strategy, and formulating relevant questions to maintain the flow of the conversation. Subsequently, the last two messages from the conversation history are forwarded to the extraction model (ExtrLLM: gpt-3.5-turbo). In this stage, a specific prompt is employed to extract distinct features such as user preferences (e.g. fuel type, price, colour) and recommendations (cars suggested by the model). These extracted elements are treated as the current preferences and recommendations. Following a post-processing step to consolidate any historical preferences (limited to preferences only), we generate search queries based on the consolidated preferences and recommendations to search our inventory.
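
To make this flow concrete, here is a minimal Python sketch of the two-step prompt chain, assuming the OpenAI Python client; the prompt texts, function names and attribute examples are illustrative placeholders rather than our production prompts.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CONV_PROMPT = (
    "You are a car-shopping assistant for an online marketplace. Answer only "
    "car-related questions and ask follow-up questions to narrow down the "
    "user's preferences."
)
EXTR_PROMPT = (
    "From the last two messages, extract the user's car preferences "
    "(e.g. fuel type, price, colour) and any cars the assistant recommended. "
    "Return a JSON object and omit attributes that were not mentioned."
)

def chat(system_prompt: str, messages: list[dict]) -> str:
    """Single call to gpt-3.5-turbo with a system prompt prepended."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system_prompt}] + messages,
        temperature=0,
    )
    return response.choices[0].message.content

def handle_turn(user_query: str, history: list[dict]) -> tuple[str, str]:
    # ConvLLM: respond to the user in the context of the conversation so far.
    history = history + [{"role": "user", "content": user_query}]
    answer = chat(CONV_PROMPT, history)
    history.append({"role": "assistant", "content": answer})

    # ExtrLLM: only the last two messages are needed to pull out the
    # *current* preferences and recommendations as JSON.
    extracted_json = chat(EXTR_PROMPT, history[-2:])
    return answer, extracted_json
```

The extracted JSON is then merged with the preferences stored from earlier turns before the search queries are generated.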

Furthermore, we enhance our product by integrating an additional step, the guidance model (GdLLM: gpt-3.5-turbo), to seamlessly navigate users through the ongoing conversation. The core principle revolves around predicting the most advantageous next action for users, closely analysing both their posed questions and the responses provided by the conversation model. This feature not only aims to inspire users but also strives to streamline their search process, guiding them towards making well-informed decisions in their car purchasing journey.
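
The guidance step amounts to one more prompted call. The sketch below is a hypothetical illustration; the prompt wording is not the exact one we use in production.

```python
from openai import OpenAI

client = OpenAI()

GUIDE_PROMPT = (
    "Given the user's last question and the assistant's answer, suggest the "
    "single most useful next action for the user, phrased as a short hint, "
    "for example 'Compare the fuel costs of these models' or 'Set a maximum price'."
)

def next_best_action(user_query: str, assistant_answer: str) -> str:
    # GdLLM: predict the most advantageous next step in the conversation.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": GUIDE_PROMPT},
            {"role": "user", "content": user_query},
            {"role": "assistant", "content": assistant_answer},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content
```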

Elevating user engagement: Conversation
Elevating user engagement: Results

Fine-tuning for latency optimisation

Traditionally, models are refined by incorporating extensive external knowledge to enhance their ability to respond to inquiries. In contrast, our unique strategy involved fine-tuning the model with a specific emphasis on generating structured output (JSON) using metadata.

Optimising model precision: Fine-tuning for structured output with metadata

The decision to fine-tune was driven by concerns about prompt latency within the extraction model. This was necessitated by the considerable size of the prompt used in the extraction model, as we needed to explicitly define a comprehensive list of attributes to extract, along with a corresponding list of potential values for each attribute.
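
To illustrate why this prompt grows so large, a heavily abridged version might look like the snippet below; the attribute names and allowed values are examples only, not our full production list.

```python
# Abridged extraction prompt. In practice, every attribute and its complete
# list of allowed values must be spelled out, which inflates the prompt and,
# with it, the latency of every extraction call.
EXTRACTION_PROMPT = """
Extract the user's car preferences from the conversation below and return a
JSON object containing only the attributes that were explicitly mentioned.

Allowed attributes and values:
- brand: one of [Renault, Peugeot, Citroën, Volkswagen, Toyota, ...]
- fuel_type: one of [petrol, diesel, hybrid, electric]
- gearbox: one of [manual, automatic]
- colour: one of [black, white, grey, red, blue, ...]
- max_price: integer, in euros
- recommendations: list of car models suggested by the assistant

Conversation:
{conversation}
"""
```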

In formulating our sample creation approach, we implemented a two-fold strategy. Initially, we employed a random generation process to create sample JSONs, integrating metadata such as brand, model and price from the marketplace inventory. Importantly, during the generation of these JSONs, we intentionally excluded certain attributes at random to mitigate potential biases, ensuring that not all attributes consistently appear in the input conversation. This precaution aimed to prevent the model from hallucinating information, particularly when specific attributes are absent from the conversation. After generating these randomised samples, we employed a language model (StatementLLM: gpt-3.5-turbo) to compose a conversation, seamlessly integrating preferences derived from the randomly generated JSON.
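
Here is a minimal sketch of this first sampling strategy, assuming the inventory metadata is available as simple Python collections; the attribute names, value ranges and prompt are illustrative.

```python
import json
import random

# Hypothetical metadata drawn from the marketplace inventory.
BRANDS = {"Renault": ["Clio", "Captur"], "Peugeot": ["208", "3008"]}
FUEL_TYPES = ["petrol", "diesel", "hybrid", "electric"]
COLOURS = ["black", "white", "grey", "red", "blue"]

def random_target_json() -> dict:
    """Build a target JSON, randomly dropping attributes so the model learns
    not to hallucinate values that were never mentioned in the conversation."""
    brand = random.choice(list(BRANDS))
    candidate = {
        "brand": brand,
        "model": random.choice(BRANDS[brand]),
        "fuel_type": random.choice(FUEL_TYPES),
        "colour": random.choice(COLOURS),
        "max_price": random.randrange(5_000, 40_000, 500),
    }
    keep = random.sample(list(candidate), k=random.randint(1, len(candidate)))
    return {key: candidate[key] for key in keep}

STATEMENT_PROMPT = (
    "Write a short, natural two-turn car-shopping conversation in which the "
    "user expresses exactly these preferences and nothing more:\n{prefs}"
)

# StatementLLM (gpt-3.5-turbo) turns the JSON into a realistic conversation;
# the (conversation, target JSON) pair then becomes one fine-tuning sample.
target = random_target_json()
prompt = STATEMENT_PROMPT.format(prefs=json.dumps(target, ensure_ascii=False))
```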

Fine-tuning: Crafting samples from synthetic data

In our second approach, our objective was to recreate authentic user car searches. To accomplish this, we utilised our search filter data and chose a subset of filters, converting them with a language model (QuestionLLM: gpt-3.5-turbo) into human-readable text that emulates how users might phrase their queries. Subsequently, these questions were fed into both our established conversation model and the extraction model (with prompt) to generate a conversation and a JSON object. To ensure accuracy and reliability, we manually validated these JSON outputs against the corresponding questions and conversations.

With this approach, we capture authentic scenarios in which the conversation model’s output serves as the foundation for extracting user preferences and recommendations.
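
A sketch of the second strategy under the same assumptions: a stored search filter is turned into a natural-language question by QuestionLLM, and that question is then run through the conversation and extraction models.

```python
import json

# One anonymised search filter from the marketplace logs (hypothetical shape).
search_filter = {"brand": "Peugeot", "fuel_type": "hybrid", "max_price": 25000}

QUESTION_PROMPT = (
    "Rephrase the following car search filter as a single question a real "
    "shopper might type into a chat assistant:\n{filter}"
)

# QuestionLLM (gpt-3.5-turbo) is called with this prompt; its output, for
# example "I'm after a hybrid Peugeot under 25,000 euros, any suggestions?",
# is fed to the conversation model and the prompted extraction model, and the
# resulting JSON is validated manually against the original filter.
prompt = QUESTION_PROMPT.format(filter=json.dumps(search_filter))
```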

Fine-tuning: Crafting samples from search filters

Through both synthetic and search-filter sample creation, our focus initially remained on positive cases. After generating positive samples, we intentionally introduced negative scenarios unrelated to cars. In these situations, the expected output is consistently an empty JSON. In the ‘more information required’ samples, the examples related to cars, but the conversation model abstained from suggesting any. Instead, it asked additional questions to refine the search criteria. In such instances, our goal was to comprehensively capture the user’s preferences even when the conversation model refrained from explicitly recommending cars, ensuring that the available user preferences were collected without overlooking any crucial details.

After acquiring the synthetic samples, the search-filter-based samples, and the negative and other cases, we systematically organised them in the standardised format OpenAI advocates for fine-tuning. This formatting, in line with OpenAI’s guidelines, lays the groundwork for the subsequent fine-tuning steps. Once the data is prepared, you can fine-tune the model by following the procedures outlined in the OpenAI guides.
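
For reference, each sample ends up as one line of a JSONL file in OpenAI’s chat fine-tuning format; the content below is an illustrative positive example, not a real training sample.

```python
import json

# One fine-tuning sample in OpenAI's chat format; the training set is a JSONL
# file with one such object per line.
sample = {
    "messages": [
        {"role": "system",
         "content": "Extract car preferences and recommendations as JSON."},
        {"role": "user",
         "content": ("User: I'd like an automatic diesel SUV under 20,000 euros.\n"
                     "Assistant: The Peugeot 3008 or Renault Kadjar could fit that budget.")},
        {"role": "assistant",
         "content": json.dumps({
             "gearbox": "automatic",
             "fuel_type": "diesel",
             "body_type": "SUV",
             "max_price": 20000,
             "recommendations": ["Peugeot 3008", "Renault Kadjar"],
         })},
    ]
}

with open("training_samples.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```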

Fine-tuning: Exemplifying input and output for each sample type

Towards the end of our journey, we found ourselves with several fine-tuned models, each built on a different set of examples. To determine the best fine-tuned model, we created a test dataset and established evaluation metrics. These metrics compared the predicted JSON outputs from the fine-tuned models to the expected JSON outputs in the test dataset, considering both attributes and attribute values. The model with the fewest errors, especially in attribute values and their correctness, was chosen as the winner. This systematic approach enabled us to select the model that aligned best with our intended outcomes.
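
A minimal sketch of this evaluation, assuming the test set is a list of (predicted, expected) JSON pairs; the error definitions are simplified.

```python
def attribute_errors(predicted: dict, expected: dict) -> tuple[int, int]:
    """Count mismatched attributes and wrong attribute values for one sample."""
    keys_pred, keys_exp = set(predicted), set(expected)
    key_errors = len(keys_pred ^ keys_exp)  # missing or spurious attributes
    value_errors = sum(
        1 for key in keys_pred & keys_exp if predicted[key] != expected[key]
    )
    return key_errors, value_errors

def score_model(pairs: list[tuple[dict, dict]]) -> tuple[int, int]:
    """Aggregate errors over the test set; the fine-tuned model with the
    fewest errors, value errors in particular, is selected."""
    totals = [attribute_errors(p, e) for p, e in pairs]
    return sum(k for k, _ in totals), sum(v for _, v in totals)
```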

Fine-tuning: Performance-driven model selection

Employing the fine-tuned model cut latency by nearly a factor of five, resulting in harmonised conversations and search results that were not only more accurate and natural but also highly responsive.

Visualising conversational dynamics: Process flow diagram with prompts & fine-tuned model

Despite the thorough fine-tuning with an extensive set of samples, there were instances where the optimised model generated predictions for JSON keys or values that deviated from our expectations. To tackle this, we implemented validation checks tailored to specific keys, ensuring alignment with our predefined expectations in both range and content. This proactive measure aimed to enhance the accuracy and reliability of the model’s outputs.
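
The validation step can be as simple as the sketch below; the allowed keys, values and price range are hypothetical.

```python
ALLOWED_KEYS = {"brand", "model", "fuel_type", "gearbox", "colour",
                "max_price", "recommendations"}
ALLOWED_VALUES = {
    "fuel_type": {"petrol", "diesel", "hybrid", "electric"},
    "gearbox": {"manual", "automatic"},
}
PRICE_RANGE = (500, 200_000)  # plausible bounds in euros (illustrative)

def validate_extraction(extracted: dict) -> dict:
    """Drop keys or values the fine-tuned model should not have produced."""
    cleaned = {}
    for key, value in extracted.items():
        if key not in ALLOWED_KEYS:
            continue  # unexpected attribute: discard it
        if key in ALLOWED_VALUES and value not in ALLOWED_VALUES[key]:
            continue  # unexpected categorical value: discard it
        if key == "max_price" and not (PRICE_RANGE[0] <= value <= PRICE_RANGE[1]):
            continue  # price outside a plausible range: discard it
        cleaned[key] = value
    return cleaned
```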

Marketplace inventory integration

The current car recommendations are specifically crafted for the French market in line with the provided instructions. However, this setup doesn’t guarantee that the recommended cars are available in our marketplace. In response to this challenge, we opted to seamlessly integrate knowledge graphs, enriched with marketplace data, into our product. These knowledge graphs play a dual role — they can assist in conveying user-preferred cars available in the marketplace to the conversation model or conduct a post-filtering of recommended cars by the conversation model. This strategic step is aimed at refining the alignment between user preferences and the marketplace offering. We will elaborate on the rationale behind choosing both pre-filter and post-filter approaches.

In our initial strategy, we employed knowledge graphs as a pre-filter to recommend cars based on users’ comprehensive historical preferences. This involved capturing both the latest user preferences and the accumulated historical preferences. To gather the users’ current preferences, we introduced a new model, the knowledge model (KgLLM: gpt-3.5-turbo), while the historical preferences were already stored in memory and consolidated using a dedicated function.

Knowledge graph integration: Enhancing model precision with pre-filtering

Once we had compiled all the users’ historical preferences, we dynamically generated a SPARQL query to search the knowledge graph. The results from the knowledge graph were then transmitted to the conversation model along with the user’s query. We considered various scenarios, such as when the latest preferences were empty, indicating that the user’s latest query was unrelated to cars. In such cases, we refrained from making a call to the knowledge graph and instead passed the user query directly to the conversation model. Specific instructions were embedded within the conversation model to discourage questions not related to cars.
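
As an illustration, a query built from the consolidated preferences could look like the sketch below; the graph schema (prefix and predicates) is hypothetical and only meant to show the shape of the dynamically generated SPARQL.

```python
def build_sparql(preferences: dict) -> str:
    """Translate consolidated user preferences into a SPARQL query.
    The ex: predicates are a made-up schema, not our real graph."""
    patterns = []
    if "brand" in preferences:
        patterns.append(f'?car ex:brand "{preferences["brand"]}" .')
    if "fuel_type" in preferences:
        patterns.append(f'?car ex:fuelType "{preferences["fuel_type"]}" .')
    price_filter = ""
    if "max_price" in preferences:
        price_filter = f"FILTER(?price <= {int(preferences['max_price'])})"
    return f"""
    PREFIX ex: <http://example.org/marketplace#>
    SELECT ?car ?model ?price WHERE {{
        ?car ex:model ?model ;
             ex:price ?price .
        {' '.join(patterns)}
        {price_filter}
    }} LIMIT 10
    """

print(build_sparql({"brand": "Peugeot", "fuel_type": "hybrid", "max_price": 25000}))
```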

There are two challenges with this approach.

  1. Randomly selecting cars from the knowledge graph for recommendation is inherently limiting. A structured ranking system would elevate the quality of suggestions and align them more precisely with user preferences.
  2. We miss out on nuanced user context, like whether they’re looking for a family car, a sports car or a vehicle with fashionable aesthetics. The current preference extraction mechanism falls short in capturing and considering these specific contextual details when querying the knowledge graph.

Given the challenges discussed, it’s clear our system increasingly relies on knowledge graphs (KGs), yet addressing these issues effectively requires natural language understanding that a KG alone does not provide. Fortunately, this capability is already embedded in the Large Language Models (LLMs) we employ.

As part of our strategic shift, we’ve chosen to use KGs primarily for post-filtering rather than relying on them for a pre-filter search for cars. This adjustment aims to balance leveraging KGs for their valuable data with the natural language understanding capabilities inherent in LLMs, ultimately providing a more seamless and user-centric experience.

Knowledge graph integration: Enhancing model precision with post-filtering

In our refined approach, we continue to rely on accessing all the user’s latest preferences and recommendations from the conversation model, conveniently stored in memory for SPARQL queries. The primary goal remains to check the availability of recommended cars on Adevinta’s marketplace, considering the user’s latest preferences. To achieve this, we use knowledge graph results as a prompt for our new model, the knowledge model. This, combined with the conversation model’s output, enables us to adjust the final text based on knowledge graph results, ensuring recommendations align more closely with inventory availability.

There are three potential scenarios, sketched in code below:

  1. If the conversation model recommends five cars but only three are found on our marketplace (per the knowledge graph), we refine the final output to include only available cars.
  2. When no cars meet the user’s criteria on our marketplace, we tailor the final output to communicate the absence of matching cars and suggest adjusting user preferences.
  3. If a question unrelated to cars is asked, we skip the knowledge graph query and pass the conversation model’s output to the knowledge model without modifications.

Marketplace inventory integration: Our approach
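
To make the three scenarios above concrete, here is a minimal sketch of the decision that precedes the knowledge model call, assuming the knowledge graph lookup returns the subset of recommended cars actually present in our inventory; the instructions are illustrative.

```python
def knowledge_model_prompt(conversation_answer: str,
                           available_cars: list[str],
                           related_to_cars: bool) -> str | None:
    """Build the instruction for the knowledge model (KgLLM), or return None
    when the conversation model's answer can be passed through unchanged."""
    if not related_to_cars:
        # Scenario 3: the question was not about cars, so the knowledge graph
        # is skipped and the answer is forwarded without modification.
        return None
    if not available_cars:
        # Scenario 2: no matching cars in the inventory.
        return ("Rewrite the answer to explain that no matching cars are "
                "currently listed and suggest adjusting the preferences.\n\n"
                + conversation_answer)
    # Scenario 1: keep only the cars the knowledge graph confirms are listed.
    return ("Rewrite the answer so that it only mentions these available cars: "
            + ", ".join(available_cars) + ".\n\n" + conversation_answer)
```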

The functional prototype has not yet been integrated into the product, but the experiment is complete and we have a working setup in our sandbox environment. Currently, our focus is on developing an efficient knowledge graph API for seamless querying.

In conclusion — a product as we envisioned

Our journey began with a simple approach, gradually refining the product based on insights from user research reports. With a collaborative team effort, we’ve developed a prototype that empowers users to engage in conversations using natural language or simple filters, facilitating the discovery of relevant listings. The introduction of guidance prompts and a more token-efficient conversation structure aims to enhance user engagement. We are now working on frontend integration, incorporating the product into our marketplace e-commerce sites.

What comes next?

Advancing: Next steps

There are several ideas to elevate our current product setup. At present, our recommendations are limited to car names. Our next strategic move involves enriching each recommendation with key features such as fuel efficiency, durability and security, utilising knowledge graphs. This enrichment empowers users to compare listings and identify the best option tailored to their preferences. It not only makes the recommendations more explainable but also provides better support for decision-making.

Our existing approach guides users to refine their preferences if no matching recommended cars are found in the knowledge graph. We are considering an enhancement by introducing similar car recommendations (using Graph Neural Networks) from the knowledge graph when initially suggested cars are unavailable in the marketplace inventory. This allows us to showcase relevant listings based on availability, ensuring a more personalised experience for users. Additionally, we aim to include complementary recommendations alongside similar ones, encouraging and inspiring customers to explore more listings that align with their preferences. The intention is to provide a comprehensive and inspiring shopping experience for Adevinta’s marketplace users.

We’re also exploring the implementation of automatic topic/category classification. This would enable effortless transitions between various categories, such as cars, bikes, electronics etc., during conversations.

Finally, we are testing the use of open-source large language models (LLMs) beyond OpenAI, which brings the substantial challenge of evaluating the product’s performance with these alternatives. To ensure a thorough assessment, we are considering creating test data benchmarks for both the conversation and extraction models, laying the groundwork for continuous improvements.

Do you have any comments on our methodology? Or tips on further improvements? Please get in touch.

A heartfelt acknowledgement goes to Cumhur Kinaci, Anton Lashin, Andrii Myhal, Dmitry Ershov, and Bongani Shongwe for their collaborative efforts and invaluable contributions, playing a pivotal role in the success of this product journey.
