Optimizing Document Ingestion and Retrieval with Azure Document Intelligence, AI Search and Durable Functions: Part 1

Nachiketlanjewar
8 min read · Jul 21, 2024


Introduction

Recent rapid developments in AI have brought about many advancements in natural language processing (NLP). Leading this revolution are Large Language Models (LLMs) such as GPT-4, Claude, Cohere, and Llama. LLMs have demonstrated remarkable capabilities such as generating human-like text, understanding context, and even performing complex tasks. But they are not free from limitations: the data they are trained on may be stale, incomplete, and publicly sourced.

What if we want to use LLMs with enterprise data or generate responses from specific datasets? To address this, a novel approach called Retrieval-Augmented Generation, or RAG for short, has emerged. It enhances an LLM's capabilities by integrating real-time data retrieval, ensuring the generated response is contextually relevant and up-to-date.

A crucial component of RAG is data ingestion, which involves preparing the contextual data that will later be retrieved for LLM response generation. In this blog, we are going to explore how to leverage Azure services such as Durable Functions, Document Intelligence, and AI Search to streamline the document ingestion process for structured documents like PDF and DOCX files. This approach can be highly beneficial for various use cases, including:

  • User-Uploaded Document Q&A: Allowing users to upload documents and perform question-and-answer tasks on them.
  • Document Generation and Merging: Creating new documents by merging content from multiple input documents.

This blog will be presented in two parts. The first part will cover Azure Durable Functions, and the second part will delve into chunking and vector store persistence.

Pre-requisites and Technology stack

Pre-requisites:

- Python knowledge: Familiarity with the Python programming language

- LangChain framework: Understanding of the LangChain framework

- Visual Studio Code (VS Code) with the Azure Functions extension and Azure Functions Core Tools installed.

Azure resources:

An Azure subscription with the following resources created:

- Blob Storage Account

- Azure Function App

- Document Intelligence Resource: Ensure this resource is in a supported region such as East US, West US 2, or West Europe, as the “Layout” model used in this blog is only supported in these regions.

- Azure AI Search

- Azure OpenAI Resource: With an embedding model deployment such as “text-embedding-ada-002”.

Technology Stack:

- Python 3.10+

- Azure Python SDK

- Major Python packages (not an exhaustive list):

  - azure-functions

  - azure-functions-durable

  - azure-ai-documentintelligence

  - azure-search-documents

  - azure-storage-blob

  - langchain

  - langchain-community

  - nltk

  - pandas

Implementation

High Level Architecture: Document ingestion

The diagram above shows the high-level architecture of the document ingestion process for structured documents like PDFs and DOCX. Below are the details of each component.

1. Durable Functions — Durable Functions will be utilized to provide asynchronous, long-running operation capabilities for document ingestion. They will offer an HTTP-based API for triggering the document ingestion process. Additionally, as part of the trigger response, they will provide URLs to manage the orchestrator instance, such as checking the status, suspending, or resuming orchestrator execution. More details will be covered in the Durable Function creation section.

2. Azure Blob Storage — Azure Blob Storage will be used as intermediate storage for documents, as a Durable Functions orchestrator can only accept input that can be serialized to JSON.

3. Azure Document Intelligence — Azure Document Intelligence Service will be used to extract text from documents and convert it into Markdown format. This is necessary for chunking the documents based on Markdown headers.

4. Azure AI Search —Azure AI Search Index will be used as a vector store for storing document chunks. For more information on why Azure AI Search is an excellent choice as a vector store, you can refer to the articles below, which I found particularly useful:

o https://www.linkedin.com/pulse/choosing-right-azure-vector-database-michael-john-pe%C3%B1a

o https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-search-outperforming-vector-search-with-hybrid/ba-p/3929167

5. Azure OpenAI — The Azure OpenAI Embeddings API with a “text-embedding-ada-002” deployment will be utilized for creating the chunk embeddings.
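To make the JSON constraint from component 2 concrete, here is a rough sketch of what an orchestrator input payload could look like once the uploaded files have been staged in Blob Storage. The field names and values are illustrative assumptions, not a fixed schema:

```python
import json

# Illustrative orchestrator input: every value is JSON-serializable,
# and documents are referenced by blob URL instead of raw bytes.
orchestrator_input = {
    "ingestion_id": "b3f1c2d4",  # hypothetical correlation id
    "documents": [
        {
            "file_name": "contract.pdf",
            "blob_url": "https://mystorageaccount.blob.core.windows.net/ingest/contract.pdf",
        }
    ],
}

# The payload must survive a JSON round trip to be a valid orchestrator input.
restored = json.loads(json.dumps(orchestrator_input))
```

Passing blob URLs rather than file contents keeps the orchestration history small and sidesteps the JSON-serialization limitation entirely.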

I will be dividing the implementation of document ingestion into three sections.

- Azure Durable Function Creation for Document Ingestion API

- Document Preprocessing & Chunking

- Chunk Embedding Creation and Vector Store persistence.

I will be providing reference code snippets to explain the important parts.

Azure Durable Function Creation for Document Ingestion API

Azure Durable Functions is a feature of Azure Functions that enables you to define and orchestrate long-running, stateful workflows within a serverless computing environment. Azure Durable Functions primarily consists of two key components:

1. Orchestrator Functions: These manage the overall workflow execution by coordinating the execution of the activities defined in the workflow. Orchestrator functions are designed to handle long-running operations, with their lifespan ranging from seconds to days, months, or even running indefinitely.

2. Activity Functions: These represent the basic unit of work within a workflow. Activities might include tasks such as creating a database record, processing documents, or integrating with external systems. Activities can be executed sequentially, in parallel, or as a combination of both.

It's important to ensure that the input passed to both orchestrator and activity functions is valid JSON. While Azure Durable Functions also supports additional features such as Durable Entities, sub-orchestrators, and timers, these are outside the scope of this blog. You can refer to the Microsoft documentation for them here.
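Since orchestrator and activity inputs must be valid JSON, a quick sanity check before starting an orchestration can save debugging time. A minimal sketch (the helper name is my own, not part of any SDK):

```python
import json

def is_valid_orchestrator_input(payload) -> bool:
    """Return True if the payload survives a JSON round trip unchanged."""
    try:
        return json.loads(json.dumps(payload)) == payload
    except (TypeError, ValueError):
        return False

# Plain dicts, lists, strings, and numbers are fine...
ok = is_valid_orchestrator_input({"blob_url": "https://example.com/doc.pdf"})

# ...but raw bytes (e.g. file contents) are not, which is why documents
# are staged in Blob Storage and passed around by URL instead.
bad = is_valid_orchestrator_input({"content": b"%PDF-1.7"})
```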

Let’s proceed by creating Azure Durable Function API for document ingestion. Please follow these steps:

1. Create a project folder and add it to the VS Code workspace.

2. Open the Command Palette, type “Azure Functions: Create New Project…”, and select the matching option as shown below.

3. Choose the project folder for the function app project.

4. Provide the below information for the prompts.

a. Language: Python

b. Python Programming model: Model V2

c. Python Version: The installed Python version. Based on the selected version, VS Code will create a virtual environment.

d. Template for project: HTTP Trigger

This will create a function app in the selected project folder with an HTTP trigger. The project will have the below files.

- function_app.py
- host.json
- local.settings.json
- requirements.txt

Update the requirements.txt with the below packages.

azure-functions

azure-functions-durable

As we proceed with the ingestion logic implementation, we will need the additional packages from the technology stack, so make sure those are part of requirements.txt. You may also need to install/add further packages as per your implementation.
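Pulling together the packages from the technology stack section, the fuller requirements.txt could look like this (versions are left unpinned here; pin them for reproducible builds):

```
azure-functions
azure-functions-durable
azure-ai-documentintelligence
azure-search-documents
azure-storage-blob
langchain
langchain-community
nltk
pandas
```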

5. Activate your virtual environment in the project folder and then install the required packages using the below command.

python -m pip install -r requirements.txt

6. As we are using the V2 programming model, we need the below application setting in VS Code. Use “Azure Functions: Add New Setting…” in the Command Palette for the same.

Name: AzureWebJobsFeatureFlags

Value: EnableWorkerIndexing

Also add this setting in local.settings.json for running the function locally.
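For reference, a minimal local.settings.json for local runs could look like the sketch below; the Azurite development-storage connection string and the standard FUNCTIONS_WORKER_RUNTIME entry are shown as assumptions about a typical local setup:

```json
{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "AzureWebJobsFeatureFlags": "EnableWorkerIndexing"
  }
}
```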

7. Let's create a basic hello world durable function for testing the setup. Update function_app.py with the below code. This code has been provided by Microsoft; check here.

import azure.functions as func
import azure.durable_functions as df

myApp = df.DFApp(http_auth_level=func.AuthLevel.ANONYMOUS)

# An HTTP-Triggered Function with a Durable Functions Client binding
@myApp.route(route="orchestrators/{functionName}")
@myApp.durable_client_input(client_name="client")
async def http_start(req: func.HttpRequest, client):
    function_name = req.route_params.get('functionName')
    instance_id = await client.start_new(function_name)
    response = client.create_check_status_response(req, instance_id)
    return response

# Orchestrator
@myApp.orchestration_trigger(context_name="context")
def hello_orchestrator(context):
    result1 = yield context.call_activity("hello", "Seattle")
    result2 = yield context.call_activity("hello", "Tokyo")
    result3 = yield context.call_activity("hello", "London")

    return [result1, result2, result3]

# Activity
@myApp.activity_trigger(input_name="city")
def hello(city: str):
    return f"Hello {city}"

Do not worry about details of these three functions, as they will be discussed in upcoming steps.

8. Make sure you are logged in to Azure from VS Code. Test your function by setting a breakpoint in the hello activity function code. Select Debug: Start Debugging from the Command Palette to start the function app project. Output from Core Tools is displayed in the Terminal panel.

9. If the function has started successfully, you should see the endpoint URL of the HTTP trigger function, similar to the one below:

Use a browser or Postman and send an HTTP request to the given endpoint, replacing {functionName} with “hello-orchestrator”. The response will show that the orchestrator has started successfully. You can check the status of the orchestrator execution using the “statusQueryGetUri” URL.
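For orientation, the trigger response carries a set of management URLs alongside the instance id. The shape below is illustrative only; the host, instance id, and exact set of URLs will vary with your setup and Durable Functions version:

```python
# Illustrative shape of the payload returned by create_check_status_response.
# Host and instance id are placeholders for a local Core Tools run.
check_status_response = {
    "id": "abc123",
    "statusQueryGetUri": "http://localhost:7071/runtime/webhooks/durabletask/instances/abc123",
    "sendEventPostUri": "http://localhost:7071/runtime/webhooks/durabletask/instances/abc123/raiseEvent/{eventName}",
    "terminatePostUri": "http://localhost:7071/runtime/webhooks/durabletask/instances/abc123/terminate",
    "purgeHistoryDeleteUri": "http://localhost:7071/runtime/webhooks/durabletask/instances/abc123",
}

# Polling statusQueryGetUri returns a runtimeStatus such as
# "Running", "Completed", or "Failed".
status_url = check_status_response["statusQueryGetUri"]
```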

Refer to the Azure Durable Functions documentation here for more details, as that is beyond the scope of this blog.

10. To create the document ingestion API, replace the contents of function_app.py with the following Python code. At this stage, we are implementing placeholder activity functions. As we move forward, I will provide the approach and code snippets necessary to complete these placeholder functions.

import azure.functions as func
import azure.durable_functions as df

myApp = df.DFApp(http_auth_level=func.AuthLevel.ANONYMOUS)

# An HTTP-Triggered Function with a Durable Functions Client binding
@myApp.route(route="orchestrators/{functionName}")
@myApp.durable_client_input(client_name="client")
async def http_start(req: func.HttpRequest, client):
    # Pass the functionName as "doc_ingestion"
    function_name = req.route_params.get('functionName')
    # This function expects multipart/form-data input with a "files" field;
    # extract the files from the request.
    files = req.files.values()
    """ TODO:
    - Write code to upload the files to blob storage and get the blob storage URLs.
    - Generate the orchestrator input JSON using the blob file URLs.
    """
    orchestrator_input = {
        # Orchestrator input JSON
    }

    instance_id = await client.start_new(function_name, client_input=orchestrator_input)
    response = client.create_check_status_response(req, instance_id)
    return response

# Orchestrator
@myApp.orchestration_trigger(context_name="context")
def doc_ingestion(context):
    input = context.get_input()
    # Each activity returns the input JSON updated with its output details.
    input = yield context.call_activity("document_preprocess", input)
    input = yield context.call_activity("document_chunking_persistence", input)
    result = yield context.call_activity("vector_store_persistence", input)

    return result

# Activity
@myApp.activity_trigger(input_name="inputJson")
def document_preprocess(inputJson):
    """
    TODO:
    - Code for preprocessing the document will go here.
    - Update the details of the pre-processed document in inputJson.
    """
    return inputJson

# Activity
@myApp.activity_trigger(input_name="inputJson")
def document_chunking_persistence(inputJson):
    """
    TODO:
    - Code for document chunking will go here.
    - Update the details of the chunking in inputJson.
    """
    return inputJson

# Activity
@myApp.activity_trigger(input_name="inputJson")
def vector_store_persistence(inputJson):
    """
    TODO:
    - Write/Invoke the code to store the document chunks in the AI Search index.
    - Update the persistence details in inputJson.
    """
    return inputJson

Explanation:

In the above code we have created the below components.

1. HTTP Trigger: This HTTP-Trigger API function initiates the orchestrator function and accepts the following inputs:

a. function_name — Path variable used to identify the orchestrator, as the same Durable Function app can have multiple orchestrators.

b. files — These are the documents to be chunked as multipart/form-data input.

2. Orchestrator: This is the document ingestion orchestrator function which is going to manage the document ingestion workflow activities.

3. Activities: These are the activity functions for the document ingestion process. In real-life scenarios there can be many more activities, such as creating and updating document ingestion tracking records. For this blog we are keeping the bare minimum activities, i.e. document preprocessing, chunking, and vector store persistence. Currently the activity code is placeholder only. Implementation logic and partial code snippets will be covered in the second part of the blog.
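The orchestrator threads one JSON payload through each activity, with every activity returning an enriched copy. A minimal pure-Python sketch of that pattern, with stand-in activity bodies and hypothetical field names in place of the real ingestion logic:

```python
import copy

def document_preprocess(input_json: dict) -> dict:
    """Stand-in activity: record that preprocessing ran."""
    out = copy.deepcopy(input_json)
    out["preprocessed"] = True
    return out

def document_chunking_persistence(input_json: dict) -> dict:
    """Stand-in activity: record a hypothetical chunk count."""
    out = copy.deepcopy(input_json)
    out["chunk_count"] = 12  # placeholder value
    return out

# Mirrors the orchestrator: each step consumes the previous step's output,
# so the final payload accumulates details from every activity.
state = {"blob_url": "https://example.com/doc.pdf"}
state = document_preprocess(state)
state = document_chunking_persistence(state)
```

Returning a fresh copy from each activity keeps the pattern deterministic and makes each activity's contribution to the final result easy to audit.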

Note: Refer to the Microsoft documentation below while implementing Azure Durable Functions. These resources can help improve your implementation and mitigate potential issues.

This concludes the first part of our blog. I hope you join us for the second part, where we will cover document chunking and persisting data in a vector store. We will focus on using Azure Document Intelligence for document chunking and Azure AI Search as a vector store.

Click here to check out the second part of the blog.

References

1. https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-overview?tabs=in-process%2Cnodejs-v3%2Cv1-model&pivots=csharp

2. https://learn.microsoft.com/en-us/azure/azure-functions/durable/quickstart-python-vscode?tabs=windows%2Cazure-cli-set-indexing-flag&pivots=python-mode-decorators

3. https://azure.microsoft.com/en-in/products/ai-services/ai-document-intelligence

4. https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/markdown_header_metadata/

5. https://www.linkedin.com/pulse/choosing-right-azure-vector-database-michael-john-pe%C3%B1a

6. https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-search-outperforming-vector-search-with-hybrid/ba-p/3929167

7. https://github.com/microsoft/Form-Recognizer-Toolkit/blob/main/SampleCode/Python/sample_rag_langchain.ipynb


Nachiketlanjewar

Working at Accenture as an AI/ML Associate Manager. Currently working on Gen AI use cases as a Generative AI Architect.