From ChatGPT to Product: Boosting Generative Artificial Intelligence (GenAI) with Retrieval-Augmented Generation (RAG)

Published in Data & AI Accenture Argentina

31 min read · Aug 12, 2024

OpenAI's launch of ChatGPT poses a significant challenge for companies. They face growing pressure to adopt generative artificial intelligence to increase productivity and offer users more personalized solutions. Although the technology has many advantages, one of its main problems lies in the reliability of the information it presents. The information generated by ChatGPT may come from outdated or unverified sources, which can lead to errors. When the model presents incorrect information as if it were true, we call it a “hallucination”.

This article covers a promising solution to this problem: retrieval-augmented generation. This strategy improves the responses generated by artificial intelligence models and reduces hallucinations by supplying specific, up-to-date information, in a controlled way, on which the model bases its responses. We explore how to implement the technique effectively so that companies can make the most of generative artificial intelligence without compromising the quality and accuracy of the information. The following sections review the basic concepts to consider when implementing this strategy.

Generative artificial intelligence

Generative artificial intelligence is a subfield of artificial intelligence focused on building systems that generate new, original content from existing data. This includes text, images, music, video, and other types of content. It uses advanced machine learning models, especially deep neural networks, to learn the patterns and structures in the training data and then produce new output.

Language models

Language models are a specific and fundamental application of generative artificial intelligence, designed to understand and generate human language. They are trained on large amounts of data to learn the grammatical, semantic, and contextual patterns of language. Once trained, a model can generate coherent, relevant text based on the words or context it is given.

Prompt engineering

Prompt engineering helps us understand the capabilities and limitations of language models. In this article we explore the most common use cases for language models and the parameters to consider when designing our prompts. We also explore the sections that make up a prompt and the best practices for our first interactions.

Use cases

The most common use cases include:

Automated text translation: The model automatically translates a text into another language.
Automated summarization: The model can summarize long texts by capturing the key ideas and the relevant information.
Code generation: Some language models are trained to generate code in various programming languages.
Sentiment analysis: These models can help us classify the emotions expressed in texts.
Virtual assistant: The model is set up to interact and answer questions in natural language.

Language model parameters

Among the parameters exposed by language models, the most common are:

temperature: Controls the randomness of token selection by rescaling the weights (probabilities) given to the candidate tokens. Lower values concentrate the probability on the most likely tokens; higher values spread it more evenly.
top_p: Limits the candidate tokens to the most probable ones whose cumulative probability reaches this threshold; once the limit is reached, the model stops considering further tokens.
max_output_tokens: Sets the maximum number of tokens the model can generate in its response. Some models count toward their limit not only the tokens generated in the response but also those sent in the prompt.
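To make temperature and top_p more concrete, here is a minimal sketch in plain Python. It is not how any particular model implements sampling; the function names `apply_temperature` and `top_p_filter` and the toy scores are assumptions made for illustration.

```python
import math

def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities, rescaled by the temperature."""
    scaled = [score / temperature for score in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # renormalize so the surviving probabilities sum to 1
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
sharp = apply_temperature(toy_logits, 0.2)  # low temperature: top token dominates
flat = apply_temperature(toy_logits, 2.0)   # high temperature: flatter distribution
nucleus = top_p_filter(apply_temperature(toy_logits, 1.0), 0.8)
```

With these toy scores, the low temperature gives the first token a much larger share than the high temperature does, and the top_p filter at 0.8 keeps only the two most probable tokens.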

Each token represents a word, a word fragment, or a specific symbol within the text. Understanding this concept matters when designing text inputs, because language models operate on tokens rather than on whole words. As a result, how the text is split into tokens can significantly affect the model's performance and output.

Elements of a prompt

A prompt is a natural-language text that tells the language model which specific task to perform. In general, a prompt consists of the following elements, although none is strictly required and their presence depends on the task:

Instructions: The specific instruction or task we want the model to perform.
Context: External or additional information that helps steer the model toward better responses.
Output indicator: The format we want the response to have.
Input data: The input or question we want answered.

It is recommended to include separators such as “###” to set apart each section of the prompt.
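As a rough illustration of these elements, the sketch below joins whichever sections are provided with a “###” separator. The helper `build_prompt` and its section labels are hypothetical, chosen only for this example.

```python
def build_prompt(instruction, input_data, context=None, output_indicator=None):
    """Join the prompt sections that were provided, separated by '###'."""
    sections = [
        ("Instruction", instruction),
        ("Context", context),
        ("Output format", output_indicator),
        ("Input", input_data),
    ]
    # skip the optional sections that were left empty
    parts = [f"{label}:\n{text}" for label, text in sections if text]
    return "\n###\n".join(parts)

prompt = build_prompt(
    instruction="Answer the question using only the context.",
    context="Buenos Aires is the capital of Argentina.",
    output_indicator="Respond only with the name of the city.",
    input_data="What is the capital of Argentina?",
)
print(prompt)
```

Keeping the sections in a fixed order makes it easier to compare prompts when only one element changes.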

General tips

This section reviews some general tips (and techniques) for starting to interact with language models. As an example, we use Google's text-bison model, designed for single-turn instruction tasks such as classification, extraction, summarization, and generation. Inside the function we define the default values for each parameter.

# import the class used to load the language model
from vertexai.preview.language_models import TextGenerationModel

# define the function we will use to interact with the model
def generate_response(# controls the randomness of token selection
                      temperature = .2,

                      # maximum number of tokens to generate in the response
                      max_output_tokens = 1024,

                      # cumulative probability of the tokens
                      # considered by the model
                      top_p = .8,

                      # maximum number of tokens the model considers
                      # when choosing the next one
                      top_k = 40,

                      # model that generates the response to the prompt
                      model = "text-bison@001",

                      # prompt sent to the model
                      prompt = "Respond by saying that no prompt was sent."
                     ):
    
    parameters = {
        "temperature": temperature,
        "max_output_tokens": max_output_tokens,
        "top_p": top_p,
        "top_k": top_k
        }

    llm = TextGenerationModel.from_pretrained(model)
    
    response = llm.predict(
        prompt,
        **parameters,
    )
    
    print(f"Response from Model {model}:\n{response.text}")

By default, we ask the model to respond that it did not receive a prompt when none is passed to the function.

# run the function without a prompt
generate_response()

Response from Model text-bison@001:

I'm sorry, but I didn't receive a prompt. Can you please try sending it again?

Start simple

Prompt engineering is an iterative process that takes a lot of experimentation to reach the best results, so it is recommended to add elements and context gradually as you work to improve the response. For a complex task, it is always advisable to split it into simpler subtasks, which removes complexity from the prompt design from the start.

Suppose we want to know the capital of Argentina. We can ask the model directly, and it answers with the general information it was trained on.

prompt = "What is the capital of Argentina?"

generate_response(prompt=prompt)

Response from Model text-bison@001:

Buenos Aires is the capital of Argentina. It is located on the Río de la Plata, on the southeastern coast of the country. Buenos Aires is the largest city in Argentina and the second largest city in South America. The city is home to over 13 million people. Buenos Aires is a major economic and cultural center in Argentina. The city is home to many museums, theaters, and art galleries. Buenos Aires is also a popular tourist destination.

Pay attention to your instructions

When designing a prompt, be clear and specific about the instruction and the task to perform; the more descriptive and detailed, the better. Language models are trained to handle simple tasks effectively when given clear commands that leave no room for ambiguity.

ZERO-SHOT PROMPTING: Because they are trained on large amounts of data, language models can complete tasks without being given examples or extensive guidance. One example is their inherent ability to classify sentences by the “sentiment” they convey (sentiment analysis).

Going back to the previous example, we could add a simple instruction so that the model responds only with the name of the city.

prompt = """Respond only with the name of the city.
Q: What is the capital of Argentina?
A:"""

generate_response(prompt=prompt)

Response from Model text-bison@001:

Buenos Aires

No special tokens or keywords lead to better results than good formatting and a descriptive prompt. It is even recommended to include examples in the prompt to show the model the format we expect in the response.

FEW-SHOT PROMPTING: Although language models are very capable of completing simple tasks without any prompting technique, more complex tasks may require a few examples for the model to understand and generalize what is being asked. The examples should spell out the response format we want the model to return.

prompt = """Q: What is the capital of Uruguay?
A: Montevideo
Q: What is the capital of Brazil?
A: Brasília
Q: What is the capital of Paraguay?
A: Asunción
Q: What is the capital of Bolivia?
A: La Paz
Q: What is the capital of Argentina?
A:"""

generate_response(prompt=prompt)

Response from Model text-bison@001:

Buenos Aires
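When the examples live in a list rather than in a hand-written string, a few-shot prompt like the one above could be assembled programmatically. The helper `few_shot_prompt` below is a hypothetical sketch of that idea, not part of the original notebook.

```python
def few_shot_prompt(examples, question):
    """Format (question, answer) pairs as 'Q:'/'A:' lines, ending with an open 'A:'."""
    lines = []
    for example_question, example_answer in examples:
        lines.append(f"Q: {example_question}")
        lines.append(f"A: {example_answer}")
    # the final question is left unanswered for the model to complete
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)

examples = [
    ("What is the capital of Uruguay?", "Montevideo"),
    ("What is the capital of Brazil?", "Brasília"),
]
prompt = few_shot_prompt(examples, "What is the capital of Argentina?")
print(prompt)
```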

Implementing the language model

Now that we know how to start interacting with language models, we can start applying them to automate and personalize products, keeping in mind that:

It is important to make sure the model answers only the questions it was designed for.
The model's responses must be consistent and reliable (mitigating the “hallucination” problem); to achieve this, they can be grounded in verified external information.
When implementing a language model to automate a process, it is important to define a systematic evaluation of the changes made to the prompt, to make sure a new feature does not break existing ones.
Just as it is recommended to split complex tasks into simpler ones, it is recommended, where possible, to split features into separate predefined prompts.
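One possible shape for the systematic evaluation mentioned above is a small set of fixed test cases that is rerun every time the prompt changes. The sketch below assumes exact-match answers; `evaluate_prompt` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in for a real model call so the example runs offline.

```python
def evaluate_prompt(generate, template, cases):
    """Run every test case through `generate` and collect the failures."""
    failures = []
    for question, expected in cases:
        answer = generate(template.format(question=question)).strip()
        if answer != expected:
            failures.append((question, expected, answer))
    return failures

def fake_generate(prompt):
    """Hypothetical stand-in for a model call; it only knows two capitals."""
    known = {"Argentina": "Buenos Aires", "Uruguay": "Montevideo"}
    for country, capital in known.items():
        if country in prompt:
            return f" {capital}\n"
    return "I don't know"

template = "Respond only with the name of the city.\nQ: {question}\nA:"
cases = [
    ("What is the capital of Argentina?", "Buenos Aires"),
    ("What is the capital of Uruguay?", "Montevideo"),
    ("What is the capital of Chile?", "Santiago"),
]
failures = evaluate_prompt(fake_generate, template, cases)
for question, expected, answer in failures:
    print(f"FAILED: {question!r} expected {expected!r}, got {answer!r}")
```

The third case fails on purpose, to show how a regression would be reported. With a real model, exact matching is often too strict, so the comparison would likely need to be relaxed.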

Developing a product with AI

As an example, we develop a custom agent capable of recommending music to users. The music the agent recommends comes from the Kaggle dataset 30.000 Spotify Songs, which contains the following attributes:

‘track_name’ is the name of the song.
‘track_artist’ is the name of the song's artist.
‘track_popularity’ is the popularity of the song, rated from 0 to 100, where higher is more popular.
‘track_album_name’ is the name of the song's album.
‘track_album_release_date’ is the release date of the album.
‘playlist_genre’ is the genre of the playlist that contains the song. The available values are ‘pop’, ‘edm’, ‘latin’, ‘r&b’, ‘rap’ and ‘rock’.
‘playlist_subgenre’ is the subgenre of the playlist that contains the song. The available values are ‘dance pop’, ‘post-teen pop’, ‘electropop’, ‘indie poptimism’, ‘hip hop’, ‘southern hip hop’, ‘gangster rap’, ‘trap’, ‘album rock’, ‘classic rock’, ‘permanent wave’, ‘hard rock’, ‘tropical’, ‘latin pop’, ‘reggaeton’, ‘latin hip hop’, ‘urban contemporary’, ‘hip pop’, ‘new jack swing’, ‘neo soul’, ‘electro house’, ‘big room’, ‘pop edm’ and ‘progressive electro house’.
‘danceability’ describes how suitable a track is for dancing, based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0 is least danceable and 1 is most danceable.
‘loudness’ is the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
‘mode’ indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0.
‘speechiness’ detects the presence of spoken words in a track. The more exclusively speech-like the recording (for example, a talk show, an audiobook, poetry), the closer the attribute value is to 1.0. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including cases such as rap music. Values below 0.33 most likely represent music and other non-speech tracks.
‘acousticness’ is a confidence measure from 0 to 1 of whether the track is acoustic. 1.0 represents high confidence that the track is acoustic.
‘instrumentalness’ predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken-word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater the likelihood that the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
‘liveness’ detects the presence of an audience in the recording. Higher values represent a greater probability that the track was performed live. A value above 0.8 indicates a strong likelihood that the track is live.
‘valence’ is a measure from 0 to 1 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (for example, happy, cheerful, euphoric), while tracks with low valence sound more negative (for example, sad, depressed, angry).
‘tempo’ is the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
‘duration_ms’ is the duration of the song in milliseconds.

# import libraries
import pandas as pd
import numpy as np

df = pd.read_csv("spotify_songs.csv") # read the file

# keep only the attributes relevant to the agent
data = df[['track_name', 'track_artist', 'track_popularity', 'track_album_name', 'track_album_release_date',
           'playlist_genre', 'playlist_subgenre', 'danceability', 'loudness', 'mode', 'speechiness',
           'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']].reset_index(drop=True)

# drop duplicates caused by the same song appearing under multiple ids or playlists
dataset = data.drop_duplicates(["track_name", "track_artist", "track_album_name"], keep="first").reset_index(drop=True)

The agent will select the songs that match the customer's requirements and order them by their popularity on the platform. We use two methods to provide the model with relevant information that improves its responses.

First, we use the embedding of the question to supply similar question-answer examples. These examples include questions about artists and their music, such as “I want to listen to music similar to [song]”, “What music from the album [album] can you recommend?”, and “I want to listen to [genre] music”, as well as questions on unrelated topics, which we generate with the prompt defined below.

# generate a list of questions not related to music

prompt = """Please generate a comma separated list with questions that are not related to music and artists.

The questions should be between single quotes.

comma_separated_list = ['What is the capital of Argentina?', 'What is the meaning of life?', """

generate_response(prompt=prompt)

Embeddings are vector representations of words, phrases, or entire texts, used to capture semantic meaning and the relationships between words. Each word or text is represented as a vector of real numbers, where the vector's position and direction reflect its semantic characteristics.
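To show how vector direction can stand in for meaning, here is a minimal sketch that ranks stored texts by cosine similarity to a query vector. The three-dimensional vectors are made up for the example (real embeddings have hundreds of dimensions), and the helpers `cosine` and `most_similar` are hypothetical names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_vector, stored, k=3):
    """Return the k stored texts whose vectors point closest to the query."""
    ranked = sorted(stored, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Made-up vectors: the two music questions point in a similar direction.
stored = [
    ("I want to listen to rock music", [0.9, 0.1, 0.0]),
    ("Recommend songs from this album", [0.8, 0.3, 0.1]),
    ("What is the capital of Argentina?", [0.0, 0.2, 0.9]),
]
query = [0.85, 0.2, 0.05]  # pretend embedding of a music-related question
closest = most_similar(query, stored, k=2)
print(closest)
```

With these toy vectors, the two music-related questions are retrieved and the geography question is left out.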

# import the class used to load the embedding model
from vertexai.language_models import TextEmbeddingModel

def text_embedding(text) -> list:
    # model that generates the embeddings
    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
    # get one embedding per input text ('text' is a list with a single string)
    embeddings = model.get_embeddings(text)
    # return the vector of that single text
    return embeddings[0].values

# create an empty list to store the embedding vectors
comma_separated_list_embed = []

# iterate over the list of questions defined above
for i in comma_separated_list:
    # store the embedding of each question
    comma_separated_list_embed.append(text_embedding([i]))

In addition to generating the questions and their embeddings, we write 2 predefined answers for each question, which are inserted into the model's prompt as examples to improve its response:

The first answer contains the SQL query associated with each music-related question, while questions not related to music get a generic answer telling the user that the model cannot respond to anything unrelated to music.
The second predefined answer contains the response in the expected format, for the model to use as a guide when generating the response to the user's question. We predefine these with the 3 top-ranked songs for the specified condition.

The second augmentation is the table returned by the query, from which the model takes the songs to recommend. We use the pandasql library to run the generated queries directly in the notebook.

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
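To give an idea of the kind of query involved without needing the dataset, the sketch below runs a similar query with Python's built-in sqlite3 module instead of pandasql (which, as far as we know, also uses SQLite syntax). The rows and the query are made up for illustration; they are not taken from the notebook.

```python
import sqlite3

# Made-up rows that mimic a few columns of the 'dataset' table.
rows = [
    ("Song A", "Artist 1", 90, "rock"),
    ("Song B", "Artist 2", 75, "rock"),
    ("Song C", "Artist 3", 95, "pop"),
    ("Song D", "Artist 4", 60, "rock"),
]

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE dataset (track_name TEXT, track_artist TEXT, "
    "track_popularity INTEGER, playlist_genre TEXT)"
)
connection.executemany("INSERT INTO dataset VALUES (?, ?, ?, ?)", rows)

# Hypothetical query of the kind the agent is asked to produce.
query = """SELECT track_name, track_artist
FROM dataset
WHERE playlist_genre = 'rock'
ORDER BY track_popularity DESC
LIMIT 2;"""
top_rock = connection.execute(query).fetchall()
connection.close()
print(top_rock)
```

The result holds the two most popular rock songs from the toy table, which is the shape of data the second prompt receives in JSON form.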

We define the function that takes the user's question and generates the model's response, defining each of the prompts to use and their information. We also redefine the function that calls the language model so that it returns the response directly.

# import the function used to compute the cosine similarity between questions
from sklearn.metrics.pairwise import cosine_similarity

# redefine the function we use to interact with the model so that it returns the response
def generate_response(temperature = .2, max_output_tokens = 1024, top_p = .8,
                      top_k = 40, model = "text-bison@001",
                      prompt = "Respond by saying that no prompt was sent."):
    
    parameters = {
        "temperature": temperature,
        "max_output_tokens": max_output_tokens,
        "top_p": top_p,
        "top_k": top_k
        }

    llm = TextGenerationModel.from_pretrained(model)
    
    response = llm.predict(
        prompt,
        **parameters,
    )
    
    return response.text

# define the first prompt to send
n_embedd_prompt1 = """The following is an agent that generates a SQL query to music-related questions from a customer.

Below are the available fields in the table 'dataset' and a description of their meaning:
    - 'track_name' is the song name.
    - 'track_artist' is the song artist.
    - 'track_popularity' is the song popularity rated from 0 to 100 where higher is better.
    - 'track_album_name' is the song album name.
    - 'track_album_release_date' is the date when album was released.
    - 'playlist_genre' is the genre of the playlist where the song exists.
    - 'playlist_subgenre' is the subgenre of the playlist where the song exists.
    - 'danceability' describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
    - 'loudness' is the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
    - 'mode' indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
    - 'speechiness' detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
    - 'acousticness' is a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
    - 'instrumentalness' predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
    - 'liveness' detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
    - 'valence' is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
    - 'tempo' is the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
    - 'duration_ms' is the duration of song in milliseconds.

Take your time to analyze the question at the end and determine if it is related to music or no:
    - If the question at the end is not related to music: The agent must respond only with the sentence 'I am not able to respond since I am an agent that only have information related to music in Spotify. Do you want a music recommendation?'.
    - If the question at the end is related to music: The agent must respond only with the SQL query, ending the response at the ';'.

Use the examples below to understand the expected output and how to respond:
"""

# define the second prompt to send, split into two parts so the table can be inserted before the examples
n_table_prompt2 = """The following is an agent that recommends music to a customer based on the information contained in the DataFrame in JSON string format of 'records'.

Below are the available fields in the DataFrame and a description of their meaning:
    - 'track_name' is the song name.
    - 'track_artist' is the song artist.
    - 'track_popularity' is the song popularity rated from 0 to 100 where higher is better.
    - 'track_album_name' is the song album name.
    - 'track_album_release_date' is the date when album was released.
    - 'playlist_genre' is the genre of the playlist where the song exists.
    - 'playlist_subgenre' is the subgenre of the playlist where the song exists.
    - 'danceability' describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
    - 'loudness' is the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
    - 'mode' indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
    - 'speechiness' detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
    - 'acousticness' is a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
    - 'instrumentalness' predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
    - 'liveness' detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
    - 'valence' is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
    - 'tempo' is the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
    - 'duration_ms' is the duration of song in milliseconds.

DataFrame in JSON string format of 'records' (orient):
"""

n_embedd_prompt2 = """Use the examples below to understand the expected output and how to respond:
"""

def music_recommender(question):
    # compute the embedding of the question
    question_embedding = text_embedding([question])
    
    # compute the cosine similarity between the question and each predefined question
    # ('questions' is the DataFrame of predefined questions, their embeddings and answers)
    questions["cosine"] = questions["Embedding"].apply(lambda x: cosine_similarity(np.asarray(x).reshape(1, -1), np.asarray(question_embedding).reshape(1, -1))[0][0])
    
    # keep the 3 most similar questions in a temporary DataFrame
    top_3 = questions.sort_values(by="cosine", ascending=False).head(3).reset_index(drop=True)
    
    # build the question-answer examples for prompt 1
    few_shot = ""
    for i in top_3.index:
        few_shot = few_shot + f"Q: {top_3.loc[i, 'Question']}\nA: {top_3.loc[i, 'Answer']}\n"
    
    # build the complete prompt 1
    prompt_1 = n_embedd_prompt1 + few_shot + "\nBelow is the question from the user 'Q' that the agent should answer in 'A'\nQ: " + question + "\nA: "
    
    # generate the model's first response
    response_1 = generate_response(prompt = prompt_1)
    
    # check that the first response is a query
    if response_1.startswith("SELECT"):
        
        # check that the query returns data
        if len(pysqldf(response_1)) == 0:
            
            # if there is no data, return a response saying that no information was found
            prompt_2 = "N/A"
            response_2 = "The data used to recommend music does not contain any information related to that album or artist."
        
        else:
            # build the question-answer examples for prompt 2
            few_shot = ""
            for i in top_3.index:
                few_shot = few_shot + f"Q: {top_3.loc[i, 'Question']}\nA: {top_3.loc[i, 'Answer2']}\n"
            
            # build the complete prompt 2 with the data found
            prompt_2 = n_table_prompt2 + pysqldf(response_1).to_json(orient="records") + "\n\n" + n_embedd_prompt2 + few_shot + "Below is the question from the user 'Q' that the agent should answer in 'A'\nQ: " + question + "\nA: "
            
            # generate the model's second response
            response_2 = generate_response(prompt = prompt_2)

    # if the first response does not start with "SELECT", return it as the final response
    else:
        prompt_2 = "N/A"
        response_2 = response_1
        
    # return the generated prompts and responses
    return prompt_1, response_1, prompt_2, response_2
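The `startswith("SELECT")` check above is case-sensitive and does not tolerate leading whitespace. A slightly more forgiving check could look like the sketch below; `is_single_select` is a hypothetical helper, not part of the original notebook, and it is deliberately conservative.

```python
def is_single_select(response):
    """Return True only when the response looks like one SELECT statement."""
    statement = response.strip()
    if not statement.upper().startswith("SELECT"):
        return False
    # allow one trailing ';' but no further statements after it
    body = statement[:-1] if statement.endswith(";") else statement
    return ";" not in body

print(is_single_select("  select track_name from dataset;"))
print(is_single_select("SELECT 1; DROP TABLE dataset;"))
print(is_single_select("I am not able to respond since I am an agent."))
```

Note that this simple rule also rejects a valid query whose string literal contains a semicolon, so it trades some flexibility for safety.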

We evaluate the results in three scenarios: in the first, the user asks a question unrelated to music; in the second, the model has information to answer; and in the third, the model finds no information to answer the question. For the first case, we first look at the model's response without prompt engineering or information augmentation.

print(generate_response(prompt = "What is the smallest country in the world?"))

Vatican City is the smallest country in the world. It is located in Rome, Italy, and has an area of just 0.44 square kilometers. The population of Vatican City is about 800 people. The country is governed by the Pope, who is also the head of the Catholic Church. Vatican City is a popular tourist destination, and is home to many important works of art and architecture.

Question 1: “What is the smallest country in the world?”

prompt_1, response_1, prompt_2, response_2 = music_recommender("What is the smallest country in the world?")

print(prompt_1)

The following is an agent that generates a SQL query to music-related questions from a customer.

Below are the available fields in the table 'dataset' and a description of their meaning:
    - 'track_name' is the song name.
    - 'track_artist' is the song artist.
    - 'track_popularity' is the song popularity rated from 0 to 100 where higher is better.
    - 'track_album_name' is the song album name.
    - 'track_album_release_date' is the date when album was released.
    - 'playlist_genre' is the genre of the playlist where the song exists.
    - 'playlist_subgenre' is the subgenre of the playlist where the song exists.
    - 'danceability' describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
    - 'loudness' is the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
    - 'mode' indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
    - 'speechiness' detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
    - 'acousticness' is a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
    - 'instrumentalness' predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
    - 'liveness' detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
    - 'valence' is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
    - 'tempo' is the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
    - 'duration_ms' is the duration of song in milliseconds.

Take your time to analyze the question at the end and determine if it is related to music or no:
    - If the question at the end is not related to music: The agent must respond only with the sentence 'I am not able to respond since I am an agent that only have information related to music in Spotify. Do you want a music recommendation?'.
    - If the question at the end is related to music: The agent must respond only with the SQL query, ending the response at the ';'.

Use the examples below to understand the expected output and how to respond:
Q: What is the largest country in the world?
A: I am not able to respond since I am an agent that only have information related to music in Spotify. Do you want a music recommendation?
Q: What is the largest desert in the world?
A: I am not able to respond since I am an agent that only have information related to music in Spotify. Do you want a music recommendation?
Q: What is the most populous country in the world?
A: I am not able to respond since I am an agent that only have information related to music in Spotify. Do you want a music recommendation?

Below is the question from the user 'Q' that the agent should answer in 'A'
Q: What is the smallest country in the world?
A:

print(response_1)

I am not able to respond since I am an agent that only have information related to music in Spotify. Do you want a music recommendation?

Because the model identified that the user's question is not related to music, “prompt_2” is set to “N/A” and “response_2” takes the same value as “response_1”.

Question 2: “I want to hear some rock music, what would be your suggestion?”

prompt_1, response_1, prompt_2, response_2 = music_recommender("I want to hear some rock music, what would be your suggestion?")

print(prompt_1)

The following is an agent that generates a SQL query to music-related questions from a customer.

Below are the available fields in the table 'dataset' and a description of their meaning:
    - 'track_name' is the song name.
    - 'track_artist' is the song artist.
    - 'track_popularity' is the song popularity rated from 0 to 100 where higher is better.
    - 'track_album_name' is the song album name.
    - 'track_album_release_date' is the date when album was released.
    - 'playlist_genre' is the genre of the playlist where the song exists.
    - 'playlist_subgenre' is the subgenre of the playlist where the song exists.
    - 'danceability' describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
    - 'loudness' is the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
    - 'mode' indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
    - 'speechiness' detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
    - 'acousticness' is a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
    - 'instrumentalness' predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
    - 'liveness' detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
    - 'valence' is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
    - 'tempo' is the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
    - 'duration_ms' is the duration of song in milliseconds.

Take your time to analyze the question at the end and determine if it is related to music or no:
    - If the question at the end is not related to music: The agent must respond only with the sentence 'I am not able to respond since I am an agent that only have information related to music in Spotify. Do you want a music recommendation?'.
    - If the question at the end is related to music: The agent must respond only with the SQL query, ending the response at the ';'.

Use the examples below to understand the expected output and how to respond:
Q: What rock music can you recommend me?
A: SELECT track_name, track_artist FROM dataset WHERE playlist_genre = 'rock' ORDER BY track_popularity DESC LIMIT 3;
Q: Given that i am listening to rock, what can you recommend me?
A: SELECT track_name, track_artist FROM dataset WHERE playlist_genre = 'rock' ORDER BY track_popularity DESC LIMIT 3;
Q: What hard rock music can you recommend me?
A: SELECT track_name, track_artist FROM dataset WHERE playlist_subgenre = 'hard rock' ORDER BY track_popularity DESC LIMIT 3;

Below is the question from the user 'Q' that the agent should answer in 'A'
Q: I want to hear some rock music, what would be your suggestion?
A:

print(response_1)

SELECT track_name, track_artist FROM dataset WHERE playlist_genre = 'rock' ORDER BY track_popularity DESC LIMIT 3;

print(prompt_2)

The following is an agent that recommends music to a customer based on the information contained in the DataFrame in JSON string format of 'records'.

Below are the available fields in the DataFrame and a description of their meaning:
    - 'track_name' is the song name.
    - 'track_artist' is the song artist.
    - 'track_popularity' is the song popularity rated from 0 to 100 where higher is better.
    - 'track_album_name' is the song album name.
    - 'track_album_release_date' is the date when album was released.
    - 'playlist_genre' is the genre of the playlist where the song exists.
    - 'playlist_subgenre' is the subgenre of the playlist where the song exists.
    - 'danceability' describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
    - 'loudness' is the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
    - 'mode' indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
    - 'speechiness' detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
    - 'acousticness' is a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
    - 'instrumentalness' predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
    - 'liveness' detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
    - 'valence' is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
    - 'tempo' is the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
    - 'duration_ms' is the duration of song in milliseconds.

DataFrame in JSON string format of 'records' (orient):
[{"track_name":"Bohemian Rhapsody - 2011 Mix","track_artist":"Queen"},{"track_name":"Every Breath You Take","track_artist":"The Police"},{"track_name":"In the End","track_artist":"Linkin Park"}]

Use the examples below to understand the expected output and how to respond:
Q: What rock music can you recommend me?
A: Sure, I can recommend to you the following tracks:
    1. Bohemian Rhapsody - 2011 Mix from Queen.
    2. Every Breath You Take from The Police.
    3. In the End from Linkin Park.

Q: Given that i am listening to rock, what can you recommend me?
A: Sure, I can recommend to you the following tracks:
    1. Bohemian Rhapsody - 2011 Mix from Queen.
    2. Every Breath You Take from The Police.
    3. In the End from Linkin Park.

Q: What hard rock music can you recommend me?
A: Sure, I can recommend to you the following tracks:
    1. Toxicity from System Of A Down.
    2. Unsainted from Slipknot.
    3. Duality from Slipknot.

Below is the question from the user 'Q' that the agent should answer in 'A'
Q: I want to hear some rock music, what would be your suggestion?
A:

print(response_2)

Sure, I can recommend to you the following tracks:
    1. Bohemian Rhapsody - 2011 Mix from Queen.
    2. Every Breath You Take from The Police.
    3. In the End from Linkin Park.

Question 3: “I want to listen to Bajofondo music, what do you recommend?”

prompt_1, response_1, prompt_2, response_2 = music_recommender("I want to listen to Bajofondo music, what do you recommend?")

print(prompt_1)

The following is an agent that generates a SQL query to music-related questions from a customer.

Below are the available fields in the table 'dataset' and a description of their meaning:
    - 'track_name' is the song name.
    - 'track_artist' is the song artist.
    - 'track_popularity' is the song popularity rated from 0 to 100 where higher is better.
    - 'track_album_name' is the song album name.
    - 'track_album_release_date' is the date when album was released.
    - 'playlist_genre' is the genre of the playlist where the song exists.
    - 'playlist_subgenre' is the subgenre of the playlist where the song exists.
    - 'danceability' describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
    - 'loudness' is the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
    - 'mode' indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
    - 'speechiness' detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
    - 'acousticness' is a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
    - 'instrumentalness' predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
    - 'liveness' detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
    - 'valence' is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
    - 'tempo' is the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
    - 'duration_ms' is the duration of song in milliseconds.

Take your time to analyze the question at the end and determine if it is related to music or no:
    - If the question at the end is not related to music: The agent must respond only with the sentence 'I am not able to respond since I am an agent that only have information related to music in Spotify. Do you want a music recommendation?'.
    - If the question at the end is related to music: The agent must respond only with the SQL query, ending the response at the ';'.

Use the examples below to understand the expected output and how to respond:
Q: What music from Dadá Boladão can you recommend me?
A: SELECT track_name, track_artist FROM dataset WHERE track_artist = 'Dadá Boladão' ORDER BY track_popularity DESC LIMIT 3;
Q: What music from Bailo can you recommend me?
A: SELECT track_name, track_artist FROM dataset WHERE track_artist = 'Bailo' ORDER BY track_popularity DESC LIMIT 3;
Q: What music from Bougenvilla can you recommend me?
A: SELECT track_name, track_artist FROM dataset WHERE track_artist = 'Bougenvilla' ORDER BY track_popularity DESC LIMIT 3;

Below is the question from the user 'Q' that the agent should answer in 'A'
Q: I want to listen to Bajofondo music, what do you recommend?
A:

print(response_1)

SELECT track_name, track_artist FROM dataset WHERE track_artist = 'Bajofondo' ORDER BY track_popularity DESC LIMIT 3;

print(prompt_2)

N/A

print(response_2)

The data used to recommend music does not contain any information related to that album or artist.

Because the query returned no songs by the group “Bajofondo”, prompt_2 is set to “N/A”. The final response states that the data contains no information about that artist.
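
The empty-result case can be sketched as follows. This is an illustration under stated assumptions rather than the article's implementation: the article runs queries with `pysqldf`, while this sketch swaps in Python's standard `sqlite3` module so that it runs on its own; the helper name `run_query_or_fallback` is hypothetical; and the popularity values in the toy table are placeholders, not values from the dataset.

```python
import sqlite3

# Fallback text copied from the final response shown above.
NO_DATA_MESSAGE = (
    "The data used to recommend music does not contain any information "
    "related to that album or artist."
)


def run_query_or_fallback(connection, sql):
    """Run the generated query; return (rows, None) or ([], fallback message)."""
    rows = connection.execute(sql).fetchall()
    if not rows:
        # Nothing to ground a second prompt on, so the second model call is skipped.
        return [], NO_DATA_MESSAGE
    return rows, None


# Toy stand-in for the 'dataset' table (popularity values are placeholders).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE dataset (track_name TEXT, track_artist TEXT, track_popularity INTEGER)"
)
connection.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?)",
    [
        ("In the End", "Linkin Park", 2),
        ("Every Breath You Take", "The Police", 1),
    ],
)

query = (
    "SELECT track_name, track_artist FROM dataset "
    "WHERE track_artist = 'Bajofondo' ORDER BY track_popularity DESC LIMIT 3;"
)
rows, message = run_query_or_fallback(connection, query)
print(rows, message)
```

Returning a fixed message when no rows come back means the model is never asked to recommend tracks it has no data for, which is the point of the third scenario.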

Conclusions

As we saw, pairing language models with retrieval-augmented generation gives companies a powerful and versatile way to improve their processing and customer-service systems by augmenting the model with information. This approach not only makes better use of controlled, company-specific data, but also reduces the spread of incorrect information, which increases the system's reliability and makes interactions more satisfactory for users.

In the example presented, we showed how a chat can use question embeddings to retrieve relevant examples, helping the generative model produce informed and coherent responses. This method not only improves the accuracy and relevance of the responses, but also reduces the training time and resources needed to keep the system up to date, since new information is supplied through the prompt rather than by retraining the model.
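
To make the retrieval step concrete, here is a minimal sketch of picking few-shot examples by embedding similarity. It rests on several assumptions: the three-dimensional vectors are hand-made stand-ins for real embeddings, cosine similarity is assumed as the ranking metric, and the function names `cosine_similarity` and `top_k_examples` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_examples(question_vector, examples, k=3):
    """Return the k stored examples whose vectors are closest to the question."""
    ranked = sorted(
        examples,
        key=lambda example: cosine_similarity(question_vector, example["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors: hand-made placeholders, not output from an embedding model.
examples = [
    {"question": "What rock music can you recommend me?", "vector": [0.9, 0.1, 0.0]},
    {"question": "What music from Bailo can you recommend me?", "vector": [0.1, 0.9, 0.1]},
    {"question": "What is the largest country in the world?", "vector": [0.0, 0.1, 0.9]},
]

# Placeholder vector standing in for the embedded user question about rock music.
user_vector = [0.8, 0.2, 0.0]

for example in top_k_examples(user_vector, examples, k=2):
    print(example["question"])
```

The selected questions and their stored answers would then be concatenated into the few-shot block of the prompt, which is why the examples shown in each scenario above differ with the user's question.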

Adopting this approach lets companies get the best of both worlds: the generative capacity of language models and the precision of retrieving specific information. As artificial intelligence technologies continue to advance, companies that implement these hybrid approaches will be better positioned to offer superior user experiences, driving both customer satisfaction and operational efficiency.