Optimizing Document Ingestion and Retrieval with Azure Document Intelligence, AI Search and Durable Functions: Part 2

Nachiketlanjewar
7 min read · Jul 21, 2024


Recap of Part 1

In the first part of the series, we covered the following topics:

  • High-Level Architecture: An overview of the document ingestion API for structured document formats such as PDF and DOCX, utilizing Azure services such as Document Intelligence, AI Search, and Durable Functions.

  • Azure Durable Functions Overview: An introduction to Durable Functions and their role in the architecture.

  • HTTP Trigger Document Ingestion API: Creation of the HTTP-triggered document ingestion API with Orchestrator and Activity functions.

In this part, we will focus on the various components of document ingestion, including document parsing, extraction, chunking, embedding, and vector store persistence. Let’s start with document preprocessing and chunking.

Click here to check out the first part of the blog.

Document Preprocessing & Chunking

In this section, we will focus on document content extraction and chunking using LangChain and the Azure Document Intelligence Service. First, we will explore the capabilities of the Document Intelligence service, and then we will demonstrate how to integrate it with LangChain for effective document chunking.

Azure Document Intelligence Service

Azure Document Intelligence Service, formerly known as Form Recognizer, is an AI service that leverages advanced machine learning to extract text, key-value pairs, tables, and structure from documents. It offers both prebuilt models and support for custom models to facilitate document extraction.

The Layout model provides a comprehensive solution for advanced content extraction and document structure analysis. It enables easy extraction of text and structural elements. A recent addition to the service is the ability to extract document contents in Markdown format. Although this feature is currently in preview, it performs exceptionally well.

Document Intelligence supports a wide range of formats, including PDF, JPEG/JPG, PNG, BMP, TIFF, HEIF, DOCX, XLSX, PPTX and HTML.

Latest API version: 2024-02-29-preview

Preview Feature Availability Regions:

  • East US
  • West US 2
  • West Europe

You can explore and experiment with the Document Intelligence Service using Azure Document Intelligence Studio. This tool allows you to try out various features and gain more insight into the capabilities of the service. Refer to this for more details about the Azure Document Intelligence Service.

LangChain AzureAIDocumentIntelligenceLoader

AzureAIDocumentIntelligenceLoader in LangChain is a document loader that utilizes the Azure Document Intelligence Service to extract the content of a document in Markdown format. Since the default output of this loader is Markdown, it can easily be chunked using MarkdownHeaderTextSplitter for semantic document chunking.

In the first part, we created an activity in Azure Durable Function for document preprocessing.

# Activity
@myApp.activity_trigger(input_name="inputJson")
def document_preprocess(inputJson):
    """
    TODO:
    - Code for preprocessing the document will go here.
    - Update the details of the pre-processed document in inputJson.
    """
    return inputJson

The following steps outline how to implement this logic using the LangChain document loader.

1. Download the Document: Retrieve the document from Blob Storage and save it in the /tmp folder, as sketched below.
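A minimal sketch of this step, assuming the azure-storage-blob SDK and that the container and blob names arrive in inputJson (these field names are illustrative, not from Part 1):

import os
from azure.storage.blob import BlobClient

def download_document(inputJson):
    # Connect to the blob holding the uploaded document.
    blob_client = BlobClient.from_connection_string(
        conn_str=os.getenv("AZURE_STORAGE_CONNECTION_STRING"),
        container_name=inputJson["container_name"],
        blob_name=inputJson["blob_name"],
    )
    # Save the document under /tmp for local processing.
    local_path = os.path.join("/tmp", inputJson["blob_name"])
    with open(local_path, "wb") as f:
        f.write(blob_client.download_blob().readall())
    return local_path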
2. Add Required Imports: Update “function_app.py” to include the necessary imports for document processing and chunking.
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from langchain_openai import AzureOpenAIEmbeddings
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.vectorstores.azuresearch import AzureSearch

3. Set Environment Variables: Ensure the following environment variables are configured to utilize the Azure services.

AZURE_OPENAI_ENDPOINT
AZURE_OPENAI_API_KEY
AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT
AZURE_DOCUMENT_INTELLIGENCE_KEY
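When running the Function App locally, these can be set in local.settings.json (the values below are placeholders; the two AI Search variables are included because the vector store step later reads them):

{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "<storage connection string>",
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "AZURE_OPENAI_ENDPOINT": "<your Azure OpenAI endpoint>",
    "AZURE_OPENAI_API_KEY": "<your Azure OpenAI key>",
    "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT": "<your Document Intelligence endpoint>",
    "AZURE_DOCUMENT_INTELLIGENCE_KEY": "<your Document Intelligence key>",
    "AZURE_SEARCH_ENDPOINT": "<your AI Search endpoint>",
    "AZURE_SEARCH_ADMIN_KEY": "<your AI Search admin key>"
  }
}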

4. Once all the required imports and environment variables are set up, you can proceed to parse the document. Refer to the script below:

doc_intelligence_endpoint = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT")
doc_intelligence_key = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_KEY")

# Initialize the Azure AI Document Intelligence loader.
loader = AzureAIDocumentIntelligenceLoader(
    file_path="<<path of the downloaded document in /tmp folder>>",
    api_key=doc_intelligence_key,
    api_endpoint=doc_intelligence_endpoint,
    api_model="prebuilt-layout",
)
docs = loader.load()

# Extract the markdown content for further processing.
docs_string = docs[0].page_content

Note: The MarkdownHeaderTextSplitter that we will use in the next section supports ATX Markdown headers (e.g., #, ##, ###). However, it has been observed that the Markdown text returned by the Azure Document Intelligence Service may use both ATX and SETEXT (=, -) headers. Ensure that all SETEXT headers are converted to ATX headers; you can use a regex pattern with Python’s re package to accomplish this, as sketched below.
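A minimal sketch of that conversion (the exact patterns are an assumption and may need tuning, e.g., to avoid matching horizontal rules):

import re

def setext_to_atx(markdown: str) -> str:
    # "Title" followed by a line of '=' becomes "# Title".
    markdown = re.sub(r"^(.+)\n=+\s*$", r"# \1", markdown, flags=re.MULTILINE)
    # "Title" followed by a line of '-' becomes "## Title".
    markdown = re.sub(r"^(.+)\n-+\s*$", r"## \1", markdown, flags=re.MULTILINE)
    return markdown

docs_string = setext_to_atx(docs_string)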

5. Save and Upload Processed Markdown: Save the processed markdown text to a file and upload it to Blob Storage for the next activity, as sketched below. Using Blob Storage instead of the /tmp folder ensures that documents are shared between activities, as it’s not guaranteed that the activities will share the same resources. This approach helps avoid “File not found” errors and prevents non-deterministic workflow execution errors.
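A minimal sketch of the save-and-upload step (the file, container, and blob names are illustrative):

import os
from azure.storage.blob import BlobClient

# Persist the processed markdown to a local file first.
markdown_path = "/tmp/processed_document.md"
with open(markdown_path, "w") as f:
    f.write(docs_string)

# Upload it to Blob Storage so the next activity can read it.
blob_client = BlobClient.from_connection_string(
    conn_str=os.getenv("AZURE_STORAGE_CONNECTION_STRING"),
    container_name="processed-documents",
    blob_name="processed_document.md",
)
with open(markdown_path, "rb") as data:
    blob_client.upload_blob(data, overwrite=True)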

6. Update Input JSON: Add the details of the markdown file to inputJson for the next activity.
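For example, the update might look like this (the field names are purely illustrative):

# Field names here are illustrative, not from the original design.
inputJson["markdown_container"] = "processed-documents"
inputJson["markdown_blob_name"] = "processed_document.md"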

It’s advisable to encapsulate the logic for processing, chunking, and vector store operations in dedicated Python script files, with the activity functions in function_app.py merely invoking these scripts. This helps keep function_app.py manageable and easy to read.
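As a minimal sketch, assuming the preprocessing logic lives in a hypothetical module named document_processor.py:

# function_app.py -- the activity simply delegates to a helper module.
import document_processor  # hypothetical module holding the real logic

@myApp.activity_trigger(input_name="inputJson")
def document_preprocess(inputJson):
    # All parsing and preprocessing details live in document_processor.
    return document_processor.preprocess(inputJson)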

Chunking: LangChain MarkdownHeaderTextSplitter

Chunking is the process of dividing large texts or inputs into smaller, manageable segments, i.e., chunks, which can be more easily processed by Large Language Models (LLMs). This is crucial because LLMs often limit the maximum number of input tokens they can process due to memory constraints and processing power.

Several chunking strategies exist: Fixed-Length Chunking (Character Text Splitter, Recursive Text Splitter), Sentence-Based Chunking (Sentence Splitter), Paragraph-Based Chunking, Semantic Chunking, etc.

For our implementation, we use the MarkdownHeaderTextSplitter from LangChain, which performs semantic chunking by leveraging Markdown headers to determine chunk boundaries.

In our Durable Function, we have defined an activity function to handle both chunking and vector store persistence, as illustrated below:

# Activity
@myApp.activity_trigger(input_name="inputJson")
def document_chunking_persistence(inputJson):
    """
    TODO:
    - Code for document chunking will go here.
    - Write/Invoke the code to store the document chunks in the AI Search index.
    - Update the details of the chunking and persistence in inputJson.
    """
    return inputJson

Below are the steps to implement the chunking logic:

1. Download and Retrieve Markdown Content: Download the markdown file from Blob Storage and load its content into a variable named docs_string.

2. Define Markdown Headers for Chunking: Specify the markdown headers on which chunking should be performed. Use the MarkdownHeaderTextSplitter to split the markdown text into chunks:
# Split the document into chunks based on markdown headers.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

# Initialize the text splitter.
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Split the text into chunks.
splits = text_splitter.split_text(docs_string)

3. Handle Large Chunks: Since chunking based on markdown headers may result in chunks of variable size, some chunks may be too large. To address this, check the size of each chunk and, if necessary, split larger chunks further using an additional text splitter such as NLTK’s text splitter, while preserving the metadata.

from langchain.text_splitter import NLTKTextSplitter

# Sample code to split the large chunks further.
text_splitter = NLTKTextSplitter(chunk_size=2500, chunk_overlap=500)

# `text` holds the content of an oversized chunk.
splits = text_splitter.split_text(text)
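To keep each chunk’s header metadata attached while re-splitting, a sketch along these lines could work (the size threshold and variable names are assumptions):

from langchain.text_splitter import NLTKTextSplitter

MAX_CHUNK_SIZE = 2500  # illustrative threshold, in characters
nltk_splitter = NLTKTextSplitter(chunk_size=2500, chunk_overlap=500)

final_splits = []
for doc in splits:  # `splits` from MarkdownHeaderTextSplitter above
    if len(doc.page_content) <= MAX_CHUNK_SIZE:
        final_splits.append(doc)
    else:
        # split_documents copies each document's metadata onto its sub-chunks.
        final_splits.extend(nltk_splitter.split_documents([doc]))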

This approach ensures that your document is effectively chunked while keeping large chunks manageable, making the text easier to process. With the document properly chunked, the next step is to store these chunks in a vector store such as Azure AI Search.

Chunk Embeddings Creation and Vector Store Persistence

A vector store (or vector database) is a specialized data storage system designed to efficiently manage and retrieve high-dimensional vectors. These vectors represent embeddings of the data, generated using LLMs or other machine learning models.

Key Features of Vector Stores:

  • Embedding Storage: Stores embeddings of the data generated using LLMs or other ML models.
  • Efficient Retrieval: Provides efficient and quick retrieval of vectors based on similarity to query vectors.
  • Similarity Search Algorithms: Supports various algorithms, such as KNN and HNSW, for similarity searches.

In this blog, we will use Azure AI Search as our vector store. Here’s how to create chunk embeddings and store them in an Azure AI Search index.

1. Create Embedding Function: Initialize an embeddings function for Azure AI Search. For this purpose, we will use the “text-embedding-ada-002” model from Azure OpenAI to create the chunk embeddings.
# Embed the split documents and insert them into the Azure Search vector store.
aoai_embeddings = AzureOpenAIEmbeddings(
    azure_deployment="<Azure OpenAI embeddings model deployment>",
    openai_api_version="2024-05-01-preview",
)

2. Create and Configure Azure AI Search Client: Follow these steps to set up an Azure AI Search client, create a new index, and add the document chunks along with their embeddings.

vector_store_address: str = os.getenv("AZURE_SEARCH_ENDPOINT")
vector_store_password: str = os.getenv("AZURE_SEARCH_ADMIN_KEY")

# Initialize the Azure Search client.
index_name: str = "<your index name>"
vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=aoai_embeddings.embed_query,
)

# Add document chunks and their embeddings to the index.
vector_store.add_documents(documents=splits)
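Once the chunks are indexed, you can sanity-check retrieval with a quick similarity search (the query text is a placeholder):

# Retrieve the top 3 chunks most similar to a query.
results = vector_store.similarity_search(
    query="<your question about the document>",
    k=3,
    search_type="similarity",
)
for doc in results:
    print(doc.metadata, doc.page_content[:100])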

You can also define the fields for the AI Search index, configure vector search profiles, and set up semantic search to enhance search results and context retrieval. For detailed instructions on these configurations, refer to the LangChain documentation here.

Conclusion

In this two-part blog series, we explored how to utilize Azure services such as Azure Durable Functions, Document Intelligence, and AI Search for document ingestion for Retrieval-Augmented Generation (RAG). These powerful tools can be beneficial for various use cases, including:

  • User-Uploaded Document Q&A: Allowing users to upload documents and perform question-and-answer tasks on them.
  • Document Generation and Merging: Creating new documents by merging content from multiple input documents.

Thank you for following along, and I hope to see you in future posts!

References

1. https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-overview?tabs=in-process%2Cnodejs-v3%2Cv1-model&pivots=csharp

2. https://learn.microsoft.com/en-us/azure/azure-functions/durable/quickstart-python-vscode?tabs=windows%2Cazure-cli-set-indexing-flag&pivots=python-mode-decorators

3. https://azure.microsoft.com/en-in/products/ai-services/ai-document-intelligence

4. https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/markdown_header_metadata/

5. https://www.linkedin.com/pulse/choosing-right-azure-vector-database-michael-john-pe%C3%B1a

6. https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-search-outperforming-vector-search-with-hybrid/ba-p/3929167

7. https://github.com/microsoft/Form-Recognizer-Toolkit/blob/main/SampleCode/Python/sample_rag_langchain.ipynb
