Box Llama-Index Readers: The cool stuff

Published in Box Developer Blog · 8 min read · Aug 20, 2024

In the previous article, we covered the classic Box Llama-Index reader, and we also revealed the existence of other readers that use Box services to extract data from the Intelligent Content Cloud.

Previous article: Introducing Box Llama-Index Reader (medium.com)

Box differs from traditional cloud storage providers because it offers built-in services you can call on your content. These include text extraction, generic AI prompting, and specialized AI extraction of structured data.

Box Text Extraction — Uses Box text representation to extract text from documents directly
Box AI Prompt — Uses Box AI to extract context from documents
Box AI Extraction — Uses Box AI to extract structured data from documents

Box Text Extraction

The BoxReaderTextExtraction is a LlamaIndex reader class used for loading text content from Box files directly.

This class inherits from the BoxReaderBase class and specializes in extracting plain text content from Box files. It utilizes the provided BoxClient object to interact with the Box API and retrieves the text representation of the files.

Tip: For more information, check the Box text representation documentation.

Usage

To instantiate the reader, you only need a BoxClient object.

# Using CCG authentication
from llama_index.readers.box import BoxReaderTextExtraction
from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient

ccg_conf = CCGConfig(
    client_id="your_client_id",
    client_secret="your_client_secret",
    enterprise_id="your_enterprise_id",
    user_id="your_ccg_user_id",  # optional
)
auth = BoxCCGAuth(ccg_conf)
client = BoxClient(auth)
reader = BoxReaderTextExtraction(box_client=client)

Load data

The load_data method extracts text content from Box files and creates LlamaIndex Document objects.

This method utilizes the Box API to retrieve the text representation (if available) of the specified Box files. It then creates Document objects containing the extracted text and file metadata.

file_ids (Optional[List[str]], optional): A list of Box file IDs to extract text from; If provided, folder_id is ignored; Defaults to None
folder_id (Optional[str], optional): The ID of the Box folder to extract text from; If provided, along with is_recursive set to True, retrieves data from sub-folders as well; Defaults to None
is_recursive (bool, optional): If True and folder_id is provided, extracts text from sub-folders within the specified folder; Defaults to False

Usage:

#### Using folder id
documents = reader.load_data(folder_id="folder_id")

#### Using file ids
documents = reader.load_data(file_ids=["file_id1", "file_id2"])
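Because the text representation is only retrieved "if available", it can be useful to check which documents actually came back with text before indexing them. The following is a minimal sketch under one assumption that the article does not confirm: that files without a text representation yield documents whose text is empty. The helper name split_by_text is hypothetical; it only relies on each document exposing a text attribute.

```python
from typing import Any, Iterable, List, Tuple


def split_by_text(docs: Iterable[Any]) -> Tuple[List[Any], List[Any]]:
    """Separate documents that carry text from those that do not.

    Hypothetical helper: assumes each item has a `text` attribute and that
    a missing text representation shows up as empty or whitespace-only text.
    """
    with_text: List[Any] = []
    without_text: List[Any] = []
    for doc in docs:
        text = getattr(doc, "text", "") or ""
        if text.strip():
            with_text.append(doc)
        else:
            without_text.append(doc)
    return with_text, without_text
```

You could then pass only the first list to your index and log the second one for review, for example `usable, empty = split_by_text(documents)`.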

Other methods

The rest of the methods work exactly like the Box Reader:

Load resource — Load data from a specific resource
List resource — Lists the IDs of Box files based on the specified folder or file IDs
Read file content — Returns the binary content of a file
Search resources — Searches for Box resources based on specified criteria and returns a list of their IDs
Search resources by metadata — Searches for Box resources based on metadata and returns a list of their IDs
Get resource info — Get information about a specific resource

Box AI Prompt

The BoxReaderAIPrompt is a LlamaIndex reader class for loading data from Box files using a custom AI prompt.

This class inherits from the BoxReaderBase class and allows specifying a custom AI prompt for data extraction. It utilizes the provided BoxClient object to interact with the Box API and extracts data based on the prompt.

Box AI features are only available to Enterprise Plus customers.

Usage

To instantiate the reader, you only need a BoxClient object.

# Using CCG authentication

from llama_index.readers.box import BoxReaderAIPrompt
from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient

ccg_conf = CCGConfig(
    client_id="your_client_id",
    client_secret="your_client_secret",
    enterprise_id="your_enterprise_id",
    user_id="your_ccg_user_id",  # optional
)
auth = BoxCCGAuth(ccg_conf)
client = BoxClient(auth)

reader = BoxReaderAIPrompt(box_client=client)

Load data

The load_data method extracts data from Box files using a custom AI prompt and creates Document objects.

This method utilizes a user-provided AI prompt to extract data from the Box files. It then creates Document objects containing the extracted data along with file metadata.

ai_prompt (str): The custom AI prompt that specifies what data to extract from the files
file_ids (Optional[List[str]]): A list of Box file IDs to extract data from; If provided, folder_id is ignored; Defaults to None
folder_id (Optional[str]): The ID of the Box folder to extract data from; If provided, along with is_recursive set to True, retrieves data from sub-folders as well; Defaults to None
is_recursive (bool): If True and folder_id is provided, extracts data from sub-folders within the specified folder; Defaults to False.
individual_document_prompt (bool = True): If True, applies the provided AI prompt to each document individually; If False, all documents are used for context to the answer. Defaults to True.

Usage:

#### Using folder id
documents = reader.load_data(
    folder_id="folder_id", ai_prompt="summarize this document"
)

#### Using file ids
documents = reader.load_data(
    file_ids=["file_id1", "file_id2"], ai_prompt="summarize this document"
)

Please note:

The AI prompt is a tool that helps you generate text using Box AI. It can be used to generate text, answer questions, and more.

By default, Box AI uses the context of a single document (individual_document_prompt=True); however, Box AI can also answer questions by looking at the context of multiple documents.

For example, suppose you want to run an AI prompt against a set of support requests.

You can pass a list of support requests with individual_document_prompt=True and the AI Prompt reader will generate an answer for each one.

On the other hand, if you want to get an answer from support requests grouped by customer, you can pass a list of support requests from a specific customer with individual_document_prompt=False and the AI Prompt reader will generate an answer for that customer.
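The per-customer case can be sketched as follows. This is an illustration, not part of the reader: the function answers_per_customer and the requests_by_customer mapping are hypothetical names, and only load_data with its file_ids, ai_prompt, and individual_document_prompt parameters comes from the reader described above. The function accepts any object with a compatible load_data method, such as a BoxReaderAIPrompt instance.

```python
from typing import Any, Dict, List


def answers_per_customer(
    reader: Any,
    requests_by_customer: Dict[str, List[str]],
    ai_prompt: str,
) -> Dict[str, Any]:
    """Run one prompt per customer, using all of that customer's files as context.

    Hypothetical helper: `requests_by_customer` maps a customer name to the
    Box file IDs of that customer's support requests.
    """
    answers: Dict[str, Any] = {}
    for customer, file_ids in requests_by_customer.items():
        # individual_document_prompt=False asks Box AI to consider the files together
        answers[customer] = reader.load_data(
            file_ids=file_ids,
            ai_prompt=ai_prompt,
            individual_document_prompt=False,
        )
    return answers
```

Calling it with individual_document_prompt left at its default instead would give one answer per support request rather than one per customer.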

Load resource

The load_resource method loads data from a specific resource.

resource (str): The resource identifier.
ai_prompt (str): The custom AI prompt that specifies what data to extract from the files.

Usage:

resource_id = test_data["test_txt_invoice_id"]
docs = reader.load_resource(resource_id, ai_prompt="summarize this document")

Other methods

The rest of the methods work exactly like the Box Reader:

List resource — Lists the IDs of Box files based on the specified folder or file IDs
Read file content — Returns the binary content of a file
Search resources — Searches for Box resources based on specified criteria and returns a list of their IDs
Search resources by metadata — Searches for Box resources based on metadata and returns a list of their IDs
Get resource info — Get information about a specific resource

Box AI Extraction

The BoxReaderAIExtract is a LlamaIndex reader class for loading data from Box files using Box AI Extract.

This class inherits from the BoxReaderBase class and specializes in processing data from Box files using Box AI Extract. It utilizes the provided BoxClient object to interact with the Box API and extracts data based on a specified AI prompt.

Box AI features are only available to Enterprise Plus customers.
Note: Box AI Extraction is currently in beta, and the implementation may change.

Usage

To instantiate the reader, you only need a BoxClient object.

# Using CCG authentication

from llama_index.readers.box import BoxReaderAIExtract
from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient

ccg_conf = CCGConfig(
    client_id="your_client_id",
    client_secret="your_client_secret",
    enterprise_id="your_enterprise_id",
    user_id="your_ccg_user_id",  # optional
)
auth = BoxCCGAuth(ccg_conf)
client = BoxClient(auth)

reader = BoxReaderAIExtract(box_client=client)

Load data

The load_data method extracts data from Box files using Box AI and creates Document objects.

This method utilizes the Box AI Extract functionality to extract data based on the provided AI prompt from the specified Box files. It then creates Document objects containing the extracted data along with file metadata.

ai_prompt (str): The AI prompt that specifies what data to extract from the files
file_ids (Optional[List[str]]): A list of Box file IDs to extract data from; If provided, folder_id is ignored; Defaults to None
folder_id (Optional[str]): The ID of the Box folder to extract data from; If provided, along with is_recursive set to True, retrieves data from sub-folders as well; Defaults to None
is_recursive (bool): If True and folder_id is provided, extracts data from sub-folders within the specified folder; Defaults to False.

Usage:

#### Using folder id
documents = reader.load_data(
    folder_id="folder_id",
    ai_prompt='{"doc_type","date","total","vendor","invoice_number","purchase_order_number"}',
)

#### Using file ids
documents = reader.load_data(
    file_ids=["file_id1", "file_id2"],
    ai_prompt='{"doc_type","date","total","vendor","invoice_number","purchase_order_number"}',
)

Please note:

The ai_prompt defines the structure of the data that will be extracted from the documents. It can be a dictionary string:

{
    "doc_type",
    "date",
    "total",
    "vendor",
    "invoice_number",
    "purchase_order_number",
}

A JSON string:

{
    "fields": [
        {
            "key": "vendor",
            "displayName": "Vendor",
            "type": "string",
            "description": "Vendor name"
        },
        {
            "key": "documentType",
            "displayName": "Type",
            "type": "string",
            "description": ""
        }
    ]
}

Or even conversational English text:

"find the document type (invoice or po), vendor, total, and po number"
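When you use the JSON form, building the string with json.dumps avoids hand-written syntax slips such as trailing commas. The sketch below is one way to do that; build_extract_prompt is a hypothetical helper, and the field layout simply mirrors the JSON example above rather than a documented schema.

```python
import json
from typing import Dict, List


def build_extract_prompt(fields: List[Dict[str, str]]) -> str:
    """Serialize field definitions into a JSON string usable as ai_prompt.

    Hypothetical helper: the "fields" wrapper and the per-field keys follow
    the JSON example shown in this article.
    """
    return json.dumps({"fields": fields})


prompt = build_extract_prompt(
    [
        {
            "key": "vendor",
            "displayName": "Vendor",
            "type": "string",
            "description": "Vendor name",
        },
        {
            "key": "documentType",
            "displayName": "Type",
            "type": "string",
            "description": "",
        },
    ]
)
```

The resulting prompt string can then be passed as ai_prompt to load_data or load_resource.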

Load resource

Load data from a specific resource.

resource (str): The resource identifier (file_id)
ai_prompt (str): The AI prompt that specifies what data to extract from the files

Usage:

AI_PROMPT = '{"doc_type","date","total","vendor","invoice_number","purchase_order_number"}'
resource_id = test_data["test_txt_invoice_id"]
docs = reader.load_resource(resource_id, ai_prompt=AI_PROMPT)

Other methods

The rest of the methods work exactly like the Box Reader:

List resource — Lists the IDs of Box files based on the specified folder or file IDs
Read file content — Returns the binary content of a file
Search resources — Searches for Box resources based on specified criteria and returns a list of their IDs
Search resources by metadata — Searches for Box resources based on metadata and returns a list of their IDs
Get resource info — Get information about a specific resource

Working example

Consider this code:

import os
from typing import List
import dotenv

from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient, File
from llama_index.readers.box import (
    BoxReader,
    BoxReaderTextExtraction,
    BoxReaderAIPrompt,
    BoxReaderAIExtract,
)
from llama_index.core.schema import Document


def get_box_client() -> BoxClient:
    dotenv.load_dotenv()

    # Common configurations
    client_id = os.getenv("BOX_CLIENT_ID", "YOUR_BOX_CLIENT_ID")
    client_secret = os.getenv("BOX_CLIENT_SECRET", "YOUR_BOX_CLIENT_SECRET")

    # CCG configurations
    enterprise_id = os.getenv("BOX_ENTERPRISE_ID", "YOUR_BOX_ENTERPRISE_ID")
    ccg_user_id = os.getenv("BOX_USER_ID")

    config = CCGConfig(
        client_id=client_id,
        client_secret=client_secret,
        enterprise_id=enterprise_id,
        user_id=ccg_user_id,
    )

    auth = BoxCCGAuth(config)
    if config.user_id:
        # with_user_subject returns a new auth object, so keep the result
        auth = auth.with_user_subject(config.user_id)

    return BoxClient(auth)


def get_testing_data() -> dict:
    return {
        "disable_folder_tests": True,
        "test_folder_id": "273980493541",
        "test_doc_id": "1584054722303",
        "test_ppt_id": "1584056661506",
        "test_xls_id": "1584048916472",
        "test_pdf_id": "1584049890463",
        "test_json_id": "1584058432468",
        "test_csv_id": "1584054196674",
        "test_txt_waiver_id": "1514587167701",
        "test_folder_invoice_po_id": "261452450320",
        "test_folder_purchase_order_id": "261457585224",
        "test_txt_invoice_id": "1517629086517",
        "test_txt_po_id": "1517628697289",
    }


def print_docs(label: str, docs: List[Document]):
    print("------------------------------")
    print(f"{label}: {len(docs)} document(s)")

    for doc in docs:
        print("------------------------------")
        file = File.from_dict(doc.extra_info)
        print(f"File ID: {file.id}\nName: {file.name}\nSize: {file.size} bytes")
        # print("------------------------------")
        print(f"Text: {doc.text[:100]} ...")
    print("------------------------------\n\n\n")


def main():
    box_client = get_box_client()
    test_data = get_testing_data()

    # Text extraction
    reader = BoxReaderTextExtraction(box_client=box_client)
    docs = reader.load_data(file_ids=[test_data["test_txt_waiver_id"]])
    print_docs("BoxReader Text Extraction", docs)

    # AI prompt
    reader = BoxReaderAIPrompt(box_client=box_client)
    docs = reader.load_data(
        file_ids=[test_data["test_txt_waiver_id"]], ai_prompt="summarize this document"
    )
    print_docs("Box Reader AI Prompt", docs)

  
    # AI extract
    reader = BoxReaderAIExtract(box_client=box_client)
    docs = reader.load_data(
        file_ids=[test_data["test_txt_invoice_id"]],
        ai_prompt='{"doc_type","date","total","vendor","invoice_number","purchase_order_number"}',
    )
    print_docs("BoxReader AI Extract", docs)

    docs = reader.load_data(
        folder_id=test_data["test_folder_purchase_order_id"],
        ai_prompt='{"doc_type","date","total","vendor","invoice_number","purchase_order_number"}',
        is_recursive=True,
    )
    print_docs("BoxReader AI Extract", docs)


if __name__ == "__main__":
    main()

Results in:

------------------------------
BoxReader Text Extraction: 1 document(s)
------------------------------
File ID: 1514587167701
Name: Box-Dive-Waiver.docx
Size: 7409 bytes
Text: YOU MUST BE ABLE TO SWIM TO PARTICIPATE IN ANY IN WATER ACTIVITIES. 

YOU MUST BE IN HEALTHY AND GOO ...
------------------------------



------------------------------
Box Reader AI Prompt: 1 document(s)
------------------------------
File ID: 1514587167701
Name: Box-Dive-Waiver.docx
Size: 7409 bytes
Text: The document is a liability release form for participants in water 
activities, specifically scuba di ...
------------------------------



------------------------------
BoxReader AI Extract: 1 document(s)
------------------------------
File ID: 1517629086517
Name: Invoice-Q2468.txt
Size: 176 bytes
Text: {"doc_type": "Invoice", "date": "August 2, 2024", "total": "$1,050", 
"vendor": "Quasar Innovations", ...
------------------------------



------------------------------
BoxReader AI Extract: 5 document(s)
------------------------------
File ID: 1517628618684
Name: PO-001.txt
Size: 212 bytes
Text: {"Purchase Order Number": "001", "Date": "February 13, 2024", 
"Total": "$575", "Vendor": "Galactic G ...
------------------------------
File ID: 1517626773559
Name: PO-002.txt
Size: 229 bytes
Text: {"purchase_order_number": "002", "date": "February 13, 2024", 
"total": "$230", "vendor": "Cosmic Con ...
------------------------------
File ID: 1517628291707
Name: PO-003.txt
Size: 222 bytes
Text: {"purchase_order_number": "003", "date": "February 13, 2024", 
"total": "$1,050", "vendor": "Quasar I ...
------------------------------
File ID: 1517625894126
Name: PO-004.txt
Size: 217 bytes
Text: {"purchase_order_number": "004", "date": "February 13, 2024", 
"total": "$920", "vendor": "AstroTech  ...
------------------------------
File ID: 1517628697289
Name: PO-005.txt
Size: 211 bytes
Text: {"purchase_order_number": "005", "date": "February 13, 2024", 
"total": "$45", "vendor": "Quantum Qui ...
------------------------------
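Note that in the results above the same prompt returned keys in different styles: PO-001 uses "Purchase Order Number" while the other purchase orders use "purchase_order_number". If you post-process the extracted data, normalizing the keys first can help. The following is a minimal sketch, assuming the full document text is valid JSON as the truncated output suggests; normalize_key and parse_extract_text are hypothetical helper names.

```python
import json
import re
from typing import Any, Dict


def normalize_key(key: str) -> str:
    """Lower-case a key and collapse runs of non-alphanumerics into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", key.strip().lower()).strip("_")


def parse_extract_text(text: str) -> Dict[str, Any]:
    """Parse the JSON text of an AI Extract document and normalize its keys.

    Hypothetical helper: assumes `text` is a JSON object, which the
    truncated output in this article suggests but does not guarantee.
    """
    return {normalize_key(key): value for key, value in json.loads(text).items()}
```

For example, parse_extract_text(doc.text) would map both key styles to purchase_order_number. This simple rule does not split camelCase keys such as documentType.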

Thoughts? Comments? Feedback?

Drop us a line in our community forum.

Written by Rui Barbosa