In recent years, vector databases have gained popularity for their ability to efficiently store and retrieve high-dimensional data, such as word embeddings.

Embedding text as vectors is a crucial step in many data-driven applications, but it can be computationally expensive and resource-intensive. By using pre-trained, open-source embedding models from Hugging Face, which can run on your own hardware, we can often reduce the cost of embedding while keeping performance and accuracy at a useful level.

In this article, we will explore how using Hugging Face embeddings can save costs compared to traditional embedding approaches.

Understanding Hugging Face Embeddings

Hugging Face maintains some of the most widely used open-source tooling in natural language processing (NLP), including the Transformers library and a model hub that offers a wide range of pre-trained models. Embeddings are produced by models such as BERT, GPT, or RoBERTa and capture rich semantic information from text. Unlike traditional embedding methods that require training a model from scratch, these pre-trained models can be downloaded and used directly to compute embeddings for various NLP tasks.
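
To make "capture rich semantic information" concrete, here is a minimal sketch of how two embeddings are compared. The three-dimensional vectors below are hand-made placeholders, not output from any Hugging Face model (real sentence embeddings typically have hundreds of dimensions), and cosine_similarity and toy_embeddings are illustrative names rather than library functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (assumption for illustration, NOT real model output).
toy_embeddings = {
    "revenue grew this quarter": [0.9, 0.1, 0.0],
    "sales increased in Q1": [0.8, 0.2, 0.1],
    "the weather is sunny": [0.0, 0.2, 0.9],
}

query = toy_embeddings["revenue grew this quarter"]

# Print each sentence with its similarity to the query sentence.
for sentence, vector in toy_embeddings.items():
    print(f"{cosine_similarity(query, vector):.3f}  {sentence}")
```

Sentences with related meaning score close to 1, while unrelated ones score near 0. With a real model, the same comparison would run on the vectors the model returns.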

How to Do It

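# These import paths match older LangChain releases; newer releases
# moved the same classes into the langchain_community package.
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from transformers import GPT2TokenizerFast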

# Initialise the embedding model (with no arguments, LangChain falls back to a default sentence-transformers model)
hf_embeddings = HuggingFaceEmbeddings()

# Load the PDF; each page becomes one Document
loader = PyPDFLoader('1Q23_media_briefing_transcript.pdf')
pages = loader.load()

# Split the pages into chunks of at most 800 tokens (counted with the GPT-2 tokenizer), overlapping by 20 tokens
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text_split = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=800, chunk_overlap=20)
text = text_split.split_documents(pages)

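# Create the vector store and write it to the 'saved_vdb' directory
# (older Chroma releases may need an explicit store.persist() call to flush to disk)
store = Chroma.from_documents(text, hf_embeddings, persist_directory='saved_vdb')

# Reload the persisted vector store, e.g. in a later session
vectordb = Chroma(persist_directory='saved_vdb', embedding_function=hf_embeddings)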

# Retrieve the chunks most semantically similar to the query, as (Document, score) pairs
prompt = 'Your query'
search = vectordb.similarity_search_with_score(prompt)
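
Conceptually, the last call embeds the query and returns the stored chunks whose vectors lie closest to it, each paired with a score. The sketch below imitates that with a brute-force scan in plain Python. It assumes Euclidean distance and uses made-up two-dimensional vectors with placeholder chunk names; brute_force_search is a hypothetical helper, not part of LangChain or Chroma, and a real vector store uses an index rather than scanning every vector.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(query_vector, store, k=4):
    """Return the k (text, distance) pairs closest to query_vector.

    `store` is a list of (text, vector) pairs. Hypothetical helper that
    only mimics the *shape* of a scored similarity search result.
    """
    scored = [(text, euclidean_distance(query_vector, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Placeholder chunks with made-up 2-D vectors (assumption for illustration).
toy_store = [
    ("chunk A", [0.9, 0.1]),
    ("chunk B", [0.1, 0.9]),
    ("chunk C", [0.7, 0.3]),
]

results = brute_force_search([1.0, 0.0], toy_store, k=2)
for text, distance in results:
    print(f"{distance:.3f}  {text}")
```

Because the score in this sketch is a distance, lower values mean a closer match. Check how your vector store defines its score before applying a threshold to it.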


