How to Extract Data from a PDF Using LangChain and Mistral

Jose Chipana
3 min read · Apr 23, 2024


This is an example of how we can extract structured data from a PDF document using LangChain and Mistral.

Nowadays, extracting information from documents is a hard, boring task that wastes our time. Reading, retaining the information, transcribing the answers, and repeating this process with every document can lead to errors too. That is why in this post we’ll explore an alternative to automate this routine.

To begin, we have to install the required framework and libraries:

pip install -U langchain-core langchain-community langchain-mistralai pypdf

And import them.

from langchain_mistralai.chat_models import ChatMistralAI
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_community.document_loaders import PyPDFLoader
from typing import List

You should have a CV PDF in the same directory as your Python script or notebook. If you want to load it from another folder, remember to replace the path.

In the next script, the first line instantiates PyPDFLoader, a class that helps us load a PDF document. The next line reads the document and returns the data as chunks.

loader = PyPDFLoader("./cv.pdf")
pages = loader.load_and_split()
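
Each chunk is a LangChain Document object with a page_content string and a metadata dict. A quick way to inspect what the loader returned:

# Quick sanity check: how many chunks were produced and what they contain.
print(len(pages))
print(pages[0].page_content[:200])  # first 200 characters of the first chunk
print(pages[0].metadata)            # e.g. {'source': './cv.pdf', 'page': 0}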

Combine the text from all chunks into a single string variable:

text = " ".join(page.page_content for page in pages)

You have to define a Python class based on the fields you want to extract. We could create a more complex data-extraction class (nesting classes inside classes), but think about the LLM, its context window, and so on.

class CVDataExtraction(BaseModel):
    username: str = Field(description="candidate username")
    email: str = Field(description="candidate email")
    profile: str = Field(description="candidate profile description")
    skills: List[str] = Field(description="soft and technical skills")

Now, we’ll initialize a ChatMistralAI instance from LangChain. Remember to use your API key. Here you can also configure parameters like temperature, max tokens, top_p, and more, but this is enough for the example. I’ll cover them in another post.

model = ChatMistralAI(api_key="YOUR_API_KEY", model='mistral-large-latest')
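
As a side note, hardcoding keys in source files is risky; a safer pattern is reading the key from an environment variable (MISTRAL_API_KEY is just an assumed variable name here):

import os

# Read the key from the environment instead of embedding it in the script.
model = ChatMistralAI(api_key=os.environ["MISTRAL_API_KEY"], model='mistral-large-latest')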

As one of the last steps, we use with_structured_output to tell the model that its answer must conform to our schema.

structured_llm = model.with_structured_output(CVDataExtraction)

Finally, invoke your LLM by feeding it the text.

result = structured_llm.invoke(text)
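
Since we passed a Pydantic class to with_structured_output, result is a CVDataExtraction instance, so the extracted fields are typed attributes:

# Each schema field is now an attribute of the returned object.
print(result.username)
print(result.email)
print(result.skills)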

Conclusions

With LLMs’ structured outputs, a new opportunity to automate data extraction opens up.

But what happens with audio or images?

We can extract data from these sources too.

For audio, we can use Whisper from OpenAI or another speech-to-text (STT) service such as Deepgram. For images, we can use Textract from AWS or even an OpenAI vision model to extract the text.
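
For instance, here is a minimal sketch of the audio case using OpenAI’s Whisper API (it assumes the openai package is installed, an OPENAI_API_KEY environment variable is set, and interview.mp3 is a hypothetical file):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe the audio to plain text with Whisper.
with open("interview.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# The transcript can then go through the same structured extractor as the PDF text.
structured_llm.invoke(transcription.text)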

Comments

  • In this example I used a PDF that contains selectable text, in other words, one that a PDF loader can read. Be careful with PDFs that can’t be read, such as scanned images.
  • Since a CV has only a few pages, this works well for a basic example, but in other scenarios you must keep the context window in mind and not exceed it (see the sketch after this list).
  • We must be careful with the schema, because a complex schema could lead to errors.
  • LangChain’s structured output is in beta, so it isn’t recommended for production projects.
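
Regarding the context-window comment above, a very rough guard could look like this (the four-characters-per-token ratio and the 32,000-token budget are assumptions, not measured values; check your model’s real limit):

# Crude heuristic: roughly 4 characters per token for English text.
MAX_TOKENS = 32_000  # assumed budget, not the model's documented limit
approx_tokens = len(text) / 4

if approx_tokens > MAX_TOKENS:
    # Naive truncation; a production pipeline should split the document instead.
    text = text[: MAX_TOKENS * 4]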

I hope this post helps in some way, or that you at least find it interesting.

Thanks for reading, I really appreciate it!

See you soon in the next post!
