Beyond the Hype: Real-World Lessons and Insights from Working with Large Language Models

Juan Eduardo Coba Puerto
Published in Mercado Libre Tech
9 min read · Jun 11, 2024

Large Language Models (LLMs) have experienced incredible growth over the last year, arguably surpassing all the leaps made since OpenAI first introduced its groundbreaking GPT models. Remember GPT-3? That was a game-changer, opening our eyes to a world of new possibilities with its ability to tackle complex natural language processing (NLP) tasks with little to no task-specific training.

With the launch of OpenAI’s GPT-3.5-turbo API, the floodgates of innovation truly opened. We’ve witnessed some astonishing use cases, some of which seemed almost unimaginable not so long ago, such as generative coding assistants or in-context learning for zero-shot task solving. At Mercado Libre, learning continuously, embracing risks, and innovating with cutting-edge technologies are at the core of our DNA.

In this article, we’re excited to take you on a short journey through the various use cases we’ve explored at Meli. We’ll discuss what worked, the challenges we faced, and, most importantly, the valuable lessons we’ve learned along the way. This isn’t a how-to guide for LLMs; instead, think of it as a behind-the-scenes look at the practical aspects of developing LLM-based applications and the insights we gained from our experience.

First use case — Retrieval Augmented Generation (RAG)

One common application of Large Language Models (LLMs) is Retrieval-Augmented Generation (RAG). This involves building a question-answering system that generates personalized answers by retrieving and combining information from relevant documents. It’s important to have realistic expectations about LLMs, especially when it comes to answering company-specific queries. Despite being trained on vast datasets, these models typically lack access to proprietary knowledge or internal company data. Therefore, expecting an LLM to have intrinsic knowledge about a specific company’s internal matters is not realistic.

This is where the RAG approach proves invaluable. It effectively navigates through an index of knowledge based on a user’s query. By retrieving relevant context or information and then generating a response, RAG enables the LLM to provide accurate and pertinent answers to user inquiries, bridging the gap between general knowledge and company-specific information.

Our initial attempt at building a RAG system used LlamaIndex, an open-source tool that handles everything from constructing and storing knowledge indexes to providing a baseline pipeline for context retrieval and answer generation.
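For readers unfamiliar with the library, here is a minimal sketch of what such a pipeline can look like. The directory path, query, and settings below are illustrative, not our actual setup.

```python
# Minimal LlamaIndex RAG sketch (illustrative paths and settings, not our production setup).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# 1. Load the documentation we want the assistant to know about.
documents = SimpleDirectoryReader("./docs/tech-stack").load_data()

# 2. Build a vector index over the document chunks (embeddings are computed here).
index = VectorStoreIndex.from_documents(documents)

# 3. Turn the index into a query engine: retrieve relevant chunks, then generate an answer.
query_engine = index.as_query_engine(similarity_top_k=3)

response = query_engine.query("How do I schedule a pipeline in Data Flow?")
print(response)                        # generated answer
for node in response.source_nodes:     # retrieved chunks, useful for linking back to sources
    print(node.metadata, node.score)
```

The `source_nodes` returned with each response are what make it easy to attach links to the original documentation alongside the generated answer.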

Our primary goal was to create a tool for developers that would be capable of answering any question related to our technical stack. This included widely used tools like BigQuery, Tableau, and Looker, as well as proprietary tools such as Fury Data Applications, Data Flow, and others. We envisioned a centralized information repository where users could pose questions and instantly receive answers, complete with links to source materials for further exploration. Curious to see the end product? Take a look at this gif!

Source: Post by Adrian Quillis at LinkedIn

The prototype we developed using Llama Index was initially a resounding success, impressing everyone who saw it. It functioned as an easy-to-use search engine for documentation. However, the initial excitement was short-lived. As we encouraged more users to try the system, we noticed gaps in the documentation. When the system came across questions without available information, the model often provided answers based on its general knowledge, which sometimes led to inaccuracies or hallucinated responses.

This was our first major learning, albeit an obvious one: the model cannot reliably answer questions beyond its contextual knowledge base. We realized that for the model to provide accurate answers, the necessary information had to be within its knowledge domain. The real challenge was figuring out how to ensure this.

To address this, we started testing the model’s responses to specific queries that we needed to answer accurately, as well as queries we preferred not to respond to. This process revealed shortcomings in our documentation — certain actions or tools that users inquired about were not covered. So, what was the solution? Enhancing our documentation. In some cases, even when the information was present, it lacked direct relevance to the user’s problem. Parts of the documentation described processes without explaining why a user might need to perform them, complicating the retrieval process for the RAG system.
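To illustrate, the kind of lightweight check we mean looks roughly like the sketch below, written against the query engine sketched earlier. The query lists here are made up for the example.

```python
# Hypothetical regression-style check over curated queries (queries are illustrative).
must_answer = [
    "How do I create a dashboard in Looker?",
    "How do I request access to BigQuery?",
]
should_refuse = [
    "What is the weather in Buenos Aires?",  # out of scope: no internal doc should support this
]

for query in must_answer + should_refuse:
    response = query_engine.query(query)       # query_engine from the RAG sketch above
    has_context = len(response.source_nodes) > 0
    print(f"{query!r} -> retrieved context: {has_context}")
    # Then review manually: did the answer come from our docs,
    # or from the model's general knowledge (a likely hallucination)?
```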

Second use case — Documentation Generation

Our goal with this tool was to go beyond addressing standard tool-related queries; we aimed to enable it to efficiently answer questions about our data sources. Specifically, when users sought to identify where particular information resided, we intended for the tool to direct them to the right table and relevant fields. However, a significant challenge emerged when we first integrated the table descriptions into the system. The responses we received fell short of our expectations. The root cause? Our existing table documentation was cursory and lacked depth. Only a select few tables had comprehensive documentation detailing their contents, use cases, and inter-table relationships.

This shortfall meant that the model often had to rely on mere table names to infer their contents. It was trying to fill in the gaps with educated guesses, but without substantial information to work with, the accuracy of those guesses was limited.

The goal was twofold: to enhance the table documentation, making it more valuable for our developers and business analysts, and to ensure its compatibility with the RAG system we had developed. The challenge was daunting: we had thousands of tables needing documentation. For instance, out of 4,000 productive tables, half lacked adequate documentation. These tables were typically understood and used only by their creators, or shared informally across units.

Given the magnitude of the task, we embarked on a mission to leverage LLMs to enrich the documentation. Tackling this task manually was neither feasible nor cost-effective. Our objective was clear: to devise a method for creating accurate, useful, and coherent documentation for each data product, in the most efficient and budget-friendly way possible.

In order to do so, we leveraged the existing documentation of tables, including information about all their fields and technical documentation. Using a generic prompt such as “You’re an expert documenter, please create documentation for table {TABLE_NAME} based on the following elements,” we were able to generate documentation that was well-received by 90% of our stakeholders. Table owners agreed with our suggested documentation and made only minor adjustments.
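As a rough sketch, this generation step can be implemented with the OpenAI Python client along these lines. The function, model name, and exact prompt wording here are illustrative, not our production code.

```python
# Sketch of the table-documentation generation step (model name and wording are illustrative).
from openai import OpenAI

client = OpenAI()

def generate_table_docs(table_name: str, fields: str, tech_docs: str) -> str:
    prompt = (
        "You're an expert documenter. Please create documentation for table "
        f"{table_name} based on the following elements.\n\n"
        f"Fields:\n{fields}\n\n"
        f"Technical documentation:\n{tech_docs}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # keep the output style consistent across thousands of tables
    )
    return response.choices[0].message.content
```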

Even though it was a great start, we found that the 10% who weren’t satisfied with our documentation had three main concerns: 1) it lacked a clear structure for easy understanding, 2) it didn’t incorporate technical and internal acronyms, and 3) in some cases, the existing documentation was already satisfactory.

This taught us an important lesson. While prompts were crucial, we realized the importance of iterating on them and conducting quality assurance (QA) on the generated outputs. Does the generated text behave as intended? Is it using all the information you want it to use? Is there information the LLM is missing to produce an accurate response? Through this analysis, we identified the need to create more guided and flexible prompts, allowing for different levels of information availability and following a predefined format for what a good description should look like.

An example of the system prompt. It dynamically changes based on available inputs.
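To give a flavor of that idea, here is a hedged sketch of a prompt builder that adapts to whichever inputs happen to be available; the structure, wording, and helper name are illustrative, not our actual prompt.

```python
# Illustrative sketch of a system prompt that changes based on the inputs we actually have.
def build_system_prompt(table_name, fields=None, existing_docs=None, lineage=None):
    parts = [
        "You are an expert data documenter.",
        f"Write documentation for table {table_name} using this structure:",
        "1) Purpose, 2) Contents, 3) Typical use cases, 4) Related tables.",
        "Expand internal acronyms when they appear, and do not invent fields.",
    ]
    if fields:
        parts.append(f"Field list:\n{fields}")
    if existing_docs:
        parts.append(f"Existing documentation to preserve and improve:\n{existing_docs}")
    if lineage:
        parts.append(f"Known relationships with other tables:\n{lineage}")
    return "\n\n".join(parts)
```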

Therefore, it is important to have a clear objective as well as a clear schema for the desired output.

Third Use Case — Natural Language Inputs

Have you ever attempted to extract specific information from raw text using an LLM? I’m not referring to things like names of people or organizations, dates, or brands, which can be easily extracted using Named Entity Recognition. I’m talking about figuring out what date “next week” refers to or identifying a specific chain of numbers as a tax identification number and not as something else.

By leveraging the “reasoning” abilities of LLMs, we can interpret raw text and understand the underlying meaning behind those words or numbers — this is without even discussing the use of multimodal models like GPT-4V.

For instance, consider the following question: How many units would you end up with if you bought this product?

“Etiqueta Carta 6287 25 Fls 12,7 X 44,45 Mm Ct.c/2000 Pimaco”

For some people, it might be obvious that the answer is one unit (if it’s not, take a look at the image below!). But what about simpler models? There are a lot of numbers that could be mistaken for the number of units in the product. The units might not be standardized: here the label “ct.c/” gives away the amount, but it could also be “pcs” referring to pieces, “u”, “unts”, “amt”, and so on. Capturing these numbers consistently is complicated.

An image of the listing: a single product consisting of 25 sheets full of labels, 2,000 labels in total.

Another use case is booking services. At Meli, we have a platform called “Data Doctors”, which allows developers and business people to book experts in data-related fields such as databases, dashboards, machine learning, and more, so they can help solve a problem. The booking system needed improvement, and we wanted to simplify it by allowing users to search for experts and availability using natural language. Our goal was to streamline the booking experience and enhance the user’s ability to find the right expert at the desired time.

For example, when a user queried “I want to consult an expert in Tableau who is available next thursday”, we first needed to understand which topics the user was interested in and then which time slot was being requested. Yet “next thursday” cannot be entered directly into a calendar; we need a date! How could we get the LLM to understand and format the date for use in a query, without it generating a long sentence explaining the date and its context?

A user asking for the availability of Data Doctors with expertise in Shipping and SQL

To address these challenges and ensure consistency and a predefined format, we leveraged the concept of Function Calling, which was first introduced with GPT-3.5 and is also available in many other LLMs, such as LLaMA 2. Function Calling is a great option when you need an LLM to extract specific information that is already contained in a piece of text. By leveraging this functionality, you can interpret raw text, understand the underlying meaning behind words or numbers, and retrieve the results in a more structured manner.
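As a hedged sketch of how this works for the booking example, the snippet below defines a tool schema and lets the model fill in its arguments; the function name, schema, and example output are illustrative assumptions, not our real API.

```python
# Sketch of function calling for the booking use case (schema and names are illustrative).
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_experts",  # hypothetical tool, not our actual endpoint
        "description": "Search Data Doctors availability by topic and date.",
        "parameters": {
            "type": "object",
            "properties": {
                "topics": {
                    "type": "array", "items": {"type": "string"},
                    "description": "Areas of expertise, e.g. Tableau, SQL, Shipping.",
                },
                "date": {
                    "type": "string",
                    "description": "Requested date in YYYY-MM-DD, resolved from expressions like 'next thursday'.",
                },
            },
            "required": ["topics"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Today is 2024-06-11."},  # anchor relative dates
        {"role": "user", "content": "I want to consult an expert in Tableau who is available next thursday"},
    ],
    tools=tools,
)

call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)  # e.g. {"topics": ["Tableau"], "date": "2024-06-13"}
```

The structured arguments can then be passed straight into the booking query, with no free-form sentence about dates to parse.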

Final Thoughts

Finally, we would like to emphasize that “raw” LLMs are not a definitive solution for every problem. It is important to optimize processes, carefully prepare the data, and assess whether simpler and less expensive models can meet the requirements. If they cannot, then simplify the task for the LLM as much as possible by being clear and handling the intensive data processing outside the model.

Advancing with Language Models

In conclusion, implementing Large Language Models (LLMs) at Mercado Libre has opened up new possibilities for innovative use cases. LLMs have enabled us to address user needs more effectively, supply relevant information, and improve our models by uncovering valuable information that conventional models have struggled with. This advancement has enhanced the performance of other applications and elevated user experience.

By carefully evaluating input and output, providing better context, and using higher-cost models when necessary, companies can optimize the performance of LLMs. The use cases discussed, including Retrieval Augmented Generation (RAG), documentation generation, and the interpretation of natural language inputs, exemplify the versatility and potential of LLMs in various domains.

However, it is crucial to acknowledge the limitations of LLMs’ contextual knowledge and actively work on enhancing their capabilities through iterative improvement. Companies should strive to refine documentation, iterate on prompts, and conduct quality assurance to ensure accurate and valuable outputs. In addition, it’s important for users to explore the models and the diverse methods available for using them, such as function calling, rather than relying solely on standard calls. By doing so, LLMs can become powerful tools for answering complex queries, generating comprehensive documentation, and extracting meaningful insights from raw text.

Embrace the evolving world of Large Language Models to simplify and enhance your organization’s operations. LLMs offer fresh solutions to reduce workloads and manage unstructured data more effectively, a challenge for traditional models. They’re the cutting-edge tools the world is exploring every day, constantly evolving to provide new applications for both internal and external users. By integrating LLMs, you’re not just solving problems more efficiently; you’re joining a global movement towards a smarter, more innovative use of language technology.

Stay tuned for more lessons, tips and tricks on integrating LLMs to applications!
