Building a Product Search using Open Source LLMs: Langchain, Hugging Face and Chroma

Aug 13, 2023

Introduction

In a world where technology is rapidly evolving, businesses are constantly looking for ways to stand out and offer innovative solutions. One such emerging solution, with the power to transform how we do business online, is the use of LLMs (Large Language Models) for text similarity in searches for similar products. But what exactly are these LLMs, and why are they so revolutionary?

What are LLMs?

LLMs are cutting-edge language models that have the capability to understand and generate human language remarkably accurately. Imagine having a virtual assistant that not only comprehends exactly what you’re searching for but also can suggest products or information that align perfectly with your needs, no matter how subtle or specific they may be. This is precisely what LLMs promise!

Why Use LLMs in Product Searches?

With consumers’ growing demand for personalized shopping experiences, businesses face the challenge of sifting through vast amounts of data to find the most relevant matches. This is where LLMs shine. They can analyze the text of a search query and, with their advanced understanding of language, pinpoint products or services that are genuinely relevant to the consumer.

But the impact of LLMs goes beyond simple searching. By adopting them, businesses are positioning themselves at the forefront of innovation. They’re saying: “We understand you. We value your experience with us.” And for consumers, this means less time sifting through irrelevant results and more time enjoying products and services that truly cater to their needs.

Code time

Install required Python packages

Open the terminal and install these Python packages:

pip install langchain==0.0.262
pip install python-dotenv==1.0.0
pip install transformers==4.31.0
pip install huggingface-hub==0.16.4
pip install chromadb==0.4.5
pip install pandas

Create .env file

Create a file named .env containing your Hugging Face API token. You can get a free token by following this guide.

HUGGINGFACEHUB_API_TOKEN=<YOUR-HUGGINGFACEHUB-API-TOKEN>
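Before calling the Hub, it can help to fail early when the token is missing or still set to the placeholder. The following is a minimal sketch using only the standard library; the helper name get_hf_token is hypothetical, and it assumes the variable has already been loaded into the environment (for example by load_dotenv).

```python
import os
from typing import Mapping

TOKEN_VAR = "HUGGINGFACEHUB_API_TOKEN"


def get_hf_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the Hugging Face token, or raise a clear error if it is missing.

    Hypothetical helper: it only inspects the given mapping and does not
    contact the Hugging Face Hub or validate the token itself.
    """
    token = env.get(TOKEN_VAR, "").strip()
    # Treat the "<YOUR-...>" template value as "not configured".
    if not token or token.startswith("<"):
        raise RuntimeError(f"{TOKEN_VAR} is not set; check your .env file")
    return token


if __name__ == "__main__":
    try:
        get_hf_token()
        print("Token found")
    except RuntimeError as err:
        print(err)
```

Passing the environment as a mapping keeps the check easy to try out with a plain dictionary.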

Knowledge database

To simulate a database with different possible search options, we will use the Online Retail dataset, publicly available from UCI datasets, which contains all transactions of a non-store online retailer.

Download the dataset here and extract the xlsx file to your folder. Then we'll convert it to a CSV file so we can use Langchain's native CSV loader later.

import pandas as pd

# Read the online retail dataset
df = pd.read_excel("./Online Retail.xlsx")

# Normalize product description and drop duplicates
df_unique_products = df["Description"].apply(lambda x: str(x).lower()).drop_duplicates()

# Persist only the unique product descriptions
df_unique_products.to_csv("./online_retail_unique_products.csv", index = False)

Here are a few examples of the resulting product descriptions:

turquoise christmas tree 
red star card holder
wicker wreath 
advent calendar gingham sack
feltcraft butterfly hearts
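One caveat about the normalization step: str(x) turns a missing description into the literal string "nan", and descriptions that differ only by trailing spaces are kept as separate products. The sketch below is an optional, stricter variant written with the standard library only; the function name unique_descriptions and the sample values are made up for illustration, and it is not the cleaning used to produce the results shown in this article.

```python
from typing import Iterable, List, Optional


def unique_descriptions(raw: Iterable[Optional[str]]) -> List[str]:
    """Lowercase, trim, and de-duplicate descriptions, skipping missing ones."""
    seen = set()
    result = []
    for value in raw:
        if value is None:
            continue
        text = str(value).strip().lower()
        # Skip empty strings and the "nan" produced by stringifying missing values.
        if not text or text == "nan":
            continue
        if text not in seen:
            seen.add(text)
            result.append(text)  # keep first-seen order
    return result


sample = ["WICKER WREATH ", "wicker wreath", None, "Red Star Card Holder", "nan"]
print(unique_descriptions(sample))
# → ['wicker wreath', 'red star card holder']
```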

Rolling up sleeves

We will go through two main steps:

Prepare the vector database for search, using an open source embeddings extractor model
Query the database

from langchain.embeddings.huggingface_hub import HuggingFaceHubEmbeddings
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.vectorstores import Chroma
from langchain.schema.document import Document
from typing import List
from dotenv import load_dotenv
import os

# This will expose your Hugging Face API token as an environment variable
load_dotenv()

def read_csv(file_path: str, source_column: str = "Description") -> List[Document]:
    """Reads a CSV file and returns a list of Documents.

    Args:
        file_path (str): The path to the CSV file to read.
        source_column (str, optional): The name of the column in the CSV file that contains the text data. Defaults to "Description".

    Returns:
        List[Document]: A list of Documents, where each Document contains the text data from the corresponding row in the CSV file.

    Raises:
        FileNotFoundError: If the CSV file does not exist.
        IOError: If there is an error reading the CSV file.
    """

    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File does not exist: {file_path}")

    loader = CSVLoader(file_path=file_path, source_column=source_column)
    data = loader.load()

    return data

Embedding model

At this step, we'll use "all-mpnet-base-v2", an open source Transformer model from the Hugging Face Hub with more than 11M downloads per month. Two things are worth knowing about it:

This is a Sentence Embedding Model, converting human readable text to equivalent numerical embeddings [1]
It is licensed under Apache 2.0 license, meaning that it can be used commercially [2]
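To make "numerical embeddings" more concrete, here is a minimal sketch of how two embeddings can be compared with cosine similarity. The three-dimensional vectors are made up for illustration (the real model produces 768-dimensional vectors), and cosine similarity is just one common comparison, not necessarily the metric the vector database uses later.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up toy embeddings, NOT real model output.
poppy_board = [0.9, 0.1, 0.2]
poppy_playhouse = [0.8, 0.3, 0.1]
wicker_wreath = [0.1, 0.9, 0.4]

print(cosine_similarity(poppy_board, poppy_playhouse))  # close to 1: similar
print(cosine_similarity(poppy_board, wicker_wreath))    # much lower: dissimilar
```

Texts with related meaning should end up with vectors pointing in similar directions, which is what makes similarity search possible.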

def load_embeddings_model(model_name: str) -> HuggingFaceHubEmbeddings:
    """Loads a Hugging Face Hub embeddings model and returns an Embeddings object.

    Args:
        model_name (str): The Hub repository ID of the embeddings model to load.

    Returns:
        HuggingFaceHubEmbeddings: An Embeddings object that can be used to encode text into embeddings.
    """

    # HuggingFaceHubEmbeddings reads HUGGINGFACEHUB_API_TOKEN from the environment
    embedding_function = HuggingFaceHubEmbeddings(
        repo_id=model_name
    )

    return embedding_function

Here we use the HuggingFaceHubEmbeddings class, which calls the model hosted on the Hugging Face Hub. Alternatively, you can download the model and run it locally with the HuggingFaceEmbeddings class, choosing the CPU or GPU as the device; set the device to 'cuda' for GPU acceleration [3].
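The local alternative could look roughly like the sketch below. It assumes the sentence-transformers package is also installed and that the full Hub ID of the model is "sentence-transformers/all-mpnet-base-v2"; the helper names local_embeddings_kwargs and load_local_embeddings_model are hypothetical and not part of the original code.

```python
from typing import Any, Dict


def local_embeddings_kwargs(use_gpu: bool = False) -> Dict[str, Any]:
    """Build the constructor arguments for a locally run embeddings model."""
    return {
        "model_name": "sentence-transformers/all-mpnet-base-v2",
        # 'cuda' enables GPU acceleration; 'cpu' works on any machine.
        "model_kwargs": {"device": "cuda" if use_gpu else "cpu"},
    }


def load_local_embeddings_model(use_gpu: bool = False):
    """Load the model locally instead of calling the hosted Hub model."""
    # Imported lazily so this module can be imported without langchain installed.
    from langchain.embeddings import HuggingFaceEmbeddings

    return HuggingFaceEmbeddings(**local_embeddings_kwargs(use_gpu))
```

The first call downloads the model weights, so expect it to take longer than later runs.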

Vector Database

For development purposes, we'll use Chroma as a local vector database. Vector databases are often used to store and query data that has been generated by large language models (LLMs).

A vector database is a type of database that is specifically designed to store and query vector data. Vector data is data that is represented as a vector, which is a list of numbers. [4]
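Conceptually, querying a vector database means finding the stored vectors closest to a query vector. The sketch below shows that idea with a brute-force search over a made-up in-memory index; the vectors and the function names are illustrative assumptions, and real vector databases such as Chroma typically use approximate nearest-neighbor indexes rather than scanning every entry.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query: Sequence[float],
            index: Dict[str, Sequence[float]],
            k: int = 2) -> List[Tuple[str, float]]:
    """Return the k entries closest to the query, smallest distance first."""
    scored = [(text, euclidean_distance(query, vec)) for text, vec in index.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Made-up toy vectors standing in for real embeddings.
toy_index = {
    "poppy fields chopping board": [0.9, 0.1, 0.2],
    "poppy's playhouse kitchen": [0.8, 0.3, 0.1],
    "wicker wreath": [0.1, 0.9, 0.4],
}
query_vector = [0.85, 0.15, 0.2]

for text, distance in nearest(query_vector, toy_index):
    print(f"{distance:.3f}  {text}")
```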

def vectorize_documents(data: List[Document], embedding_function: HuggingFaceHubEmbeddings) -> Chroma:
    """Vectorizes a list of Documents using a Hugging Face Transformer model.

    Args:
        data (List[Document]): A list of Documents to vectorize.
        embedding_function (HuggingFaceHubEmbeddings): An Embeddings object that can be used to encode text into embeddings.

    Returns:
        Chroma: A Chroma object that contains the vectorized documents.
    """
    
    db = Chroma.from_documents(data, embedding_function)

    return db

Putting it all together

This function initializes our local vector database. For example, this code could go in the init script of a product search API.

def init_llm():
    """Initializes the LLM by reading the CSV file, loading the embeddings model, and vectorizing the documents.

    Returns:
        Chroma: A Chroma object that contains the vectorized documents.
    """
    
    data = read_csv(file_path='./online_retail_unique_products.csv', source_column="Description")
    # HuggingFaceHubEmbeddings expects a Hub repository ID, not a local path
    embedding_function = load_embeddings_model(model_name='sentence-transformers/all-mpnet-base-v2')
    db = vectorize_documents(data, embedding_function)

    return db

With these functions defined, let's add one line of code to instantiate our vector database with all embeddings calculated.

db = init_llm()

Enjoying your search

This is the most joyful part of the code, just query a product by its description. Even if you introduce a few errors, you’ll observe that the search remains robust and forgiving.

# query it
query = "poppy"
resultados = db.similarity_search_with_score(query, k = 5)

# print results
for doc in resultados:
    print(doc)

For example, the query "poppy" returns the following suggestions of similar item descriptions. Each result is a (document, score) pair; the score is a distance, so lower values mean closer matches:

(Document(page_content='Description: poppy fields chopping board', metadata={'row': 3930, 'source': 'poppy fields chopping board'}), 0.8975588083267212)
(Document(page_content="Description: poppy's playhouse bedroom", metadata={'row': 10, 'source': "poppy's playhouse bedroom "}), 0.9310866594314575)
(Document(page_content="Description: poppy's playhouse livingroom", metadata={'row': 1240, 'source': "poppy's playhouse livingroom "}), 1.0001243352890015)
(Document(page_content="Description: poppy's playhouse bathroom", metadata={'row': 1241, 'source': "poppy's playhouse bathroom"}), 1.0488601922988892)
(Document(page_content="Description: poppy's playhouse kitchen", metadata={'row': 11, 'source': "poppy's playhouse kitchen"}), 1.052075743675232)
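Since each result is a (document, score) pair and, for Chroma, the score is a distance where lower means more similar, one simple post-processing step is to drop matches beyond a cutoff. The sketch below uses plain (text, score) tuples as stand-ins for the real Document objects, with scores rounded from the output above; the cutoff of 1.0 is an arbitrary assumption that you would need to tune for your own data.

```python
from typing import List, Tuple


def filter_by_distance(results: List[Tuple[str, float]],
                       max_distance: float) -> List[Tuple[str, float]]:
    """Keep only results whose distance score is at most max_distance."""
    return [(text, score) for text, score in results if score <= max_distance]


# Stand-ins for (Document, score) pairs, scores rounded from the output above.
sample_results = [
    ("poppy fields chopping board", 0.8976),
    ("poppy's playhouse bedroom", 0.9311),
    ("poppy's playhouse livingroom", 1.0001),
    ("poppy's playhouse bathroom", 1.0489),
    ("poppy's playhouse kitchen", 1.0521),
]

for text, score in filter_by_distance(sample_results, max_distance=1.0):
    print(f"{score:.4f}  {text}")
```

With the real results, the text of each match would be read from doc.page_content instead of a plain string.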

How could it be used, then?

Imagine a dynamic e-commerce platform where search results are not just good, but precisely tailored to the nuances of each user’s query. With the adaptability of LLMs, businesses can elevate their search algorithms to retrieve the most relevant product matches, thereby enhancing user experience and potentially boosting sales.

Beyond e-commerce, think of the countless repositories of information, like corporate wikis. The ability to swiftly extract pertinent data or information becomes invaluable. Through LLMs, searching for relevant wiki information could be more intuitive and efficient, driving faster decision-making and knowledge sharing.

Moreover, the basic function of sentence similarity is just the tip of the iceberg. The versatility of LLMs allows for a range of other capabilities to be incorporated. Features like autocomplete can streamline user interactions, while the renowned “Did you mean…” prompt, reminiscent of Google’s intuitive suggestion system, can guide users in refining their queries for better results. In essence, LLMs don’t just offer solutions; they empower businesses and their stakeholders to reimagine and redefine their operational potentials.

Let’s connect

Did you like the content? Let’s have a coffee, add me on LinkedIn to exchange ideas and share knowledge!

https://www.linkedin.com/in/iagobrandao

References

Langchain. https://python.langchain.com/docs/get_started/introduction.html

Hugging Face. https://huggingface.co/

Sentence Transformers, BERT. https://www.sbert.net/