Ask your PDF with Cohere Embed v3

Ambika Sukla · Published in nlmatics · Nov 2, 2023 · 3 min read

[Image: a PDF scaring an LLM]

Cohere released their super fast embedding API today. Congratulations to Nils Reimers and team!

For retrieval augmented generation (RAG), both the quality and the speed of embeddings are needed to get good results in a reasonable amount of time. Moreover, since the APIs count every word you send, an economical solution is desirable for embedding thousands of documents.

Getting high-quality contexts out of PDFs is a challenging problem, and it is what makes it difficult to bring PDF content into LLMs for embedding and question answering. In this blog, we will show how to do that with LayoutPDFParser, Cohere embeddings, and OpenAI chat completion to create an "Ask your PDF" solution.

To get good-quality embeddings we will use:

  1. A document chunker that creates smart context chunks. This is especially hard with PDFs; in this example we will use LayoutPDFParser for smart and optimal chunking of our PDF content.
  2. The Cohere Embed v3 API as the embedding mechanism.

Complete code is available in this colab to try!
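If you are following along outside the colab, you will first need the client libraries and API keys in place. A minimal setup sketch (the environment variable names here are my own assumption; the colab wires up the keys in its own way):

# pip install llmsherpa cohere openai numpy

import os
import cohere
import openai
import numpy as np

# Assumed here: API keys are supplied via environment variables
cohere_key = os.environ["COHERE_API_KEY"]
openai.api_key = os.environ["OPENAI_API_KEY"]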

First we read the PDF:

from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://www.shinzen.org/wp-content/uploads/2016/08/WhatIsMindfulness_SY_Public_ver1.5.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)
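Before embedding, it can help to peek at a few of the chunks LayoutPDFParser produced. A quick, illustrative check:

# Print the first few chunks to see how the PDF was split
for i, chunk in enumerate(doc.chunks()):
    if i >= 3:
        break
    print(chunk.to_context_text())
    print("=====")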

Then we get the smart chunk texts into a list and embed the items with Cohere API:

contexts = []
for chunk in doc.chunks():
    contexts.append(chunk.to_context_text())

co = cohere.Client(cohere_key)

# Encode your documents with input type 'search_document'
doc_emb = co.embed(contexts, input_type="search_document", model="embed-english-v3.0").embeddings
doc_emb = np.asarray(doc_emb)
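As a quick sanity check, the embedding matrix should have one row per chunk (embed-english-v3.0 returns 1024-dimensional vectors, per the announcement):

# One row per chunk, one column per embedding dimension
print(doc_emb.shape)  # e.g. (number_of_chunks, 1024)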

Next we put it all together. The code below embeds your query using the Cohere API, gets the top 10 most relevant matches, and sends them to OpenAI for summarization.

def ask(query):
    # Encode your query with input type 'search_query'
    query_emb = co.embed([query], input_type="search_query", model="embed-english-v3.0").embeddings
    query_emb = np.asarray(query_emb)

    # Compute the dot product between the query embedding and the document embeddings
    scores = np.dot(query_emb, doc_emb.T)[0]

    # Find the highest scores
    max_idx = np.argsort(-scores)
    most_relevant_contexts = []
    top_k = 10

    # Get only the top contexts to keep the context for openai small
    for idx in max_idx[0:top_k]:
        most_relevant_contexts.append(contexts[idx])

    # Call OpenAI to synthesize an answer
    passages = "\n".join(most_relevant_contexts)
    prompt = f"Read the following passages and answer the question: {query}\n passages: {passages}"
    completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}])
    synthesized_answer = completion.choices[0].message.content

    print(f"Query: {query}")
    print(f"Answer: {synthesized_answer}")
    print("\nRelevant contexts: \n")
    for ctx in most_relevant_contexts:
        print(ctx)
        print("--------")

ask("what do i need to practice mindfulness")

That is it — here’s the answer:

To practice mindfulness, you need to acquire and apply concentration, clarity, and equanimity skills.

Mindfulness can refer to a form of awareness, the practices that elevate that awareness, and the application of that awareness for specific goals.

Mindfulness practice often includes training in positive affect, such as loving kindness. It is important to understand that mindfulness is not just being aware in a general sense, but rather a fine-grained and systematic practice.

Mindfulness skills can be learned independent of one’s beliefs or worldview, but they may impact how one views things.

It is also important to recognize that there may be specific learning required for different sensory experiences, and that carryover of skills may not always be immediate or universal.

Cohere Embed v3 Announcement link.

Colab link.

LayoutPDFParser link.
