Box Llama-Index Readers: The cool stuff

Rui Barbosa
Published in Box Developer Blog
8 min read · Aug 20, 2024

In the previous article, we covered the classic Box Llama-Index reader, and we also revealed the existence of other readers that use Box services to extract data from the Intelligent Content Cloud.

Box differs from traditional cloud storage providers because it offers built-in services that you can call directly. These include text extraction, generic AI prompting, and specialized structured data AI extraction.

Box Text Extraction

BoxReaderTextExtraction is a LlamaIndex reader class that loads text content directly from Box files.

This class inherits from the BoxReaderBase class and specializes in extracting plain text content from Box files. It uses the provided BoxClient object to interact with the Box API and retrieve the text representation of the files.

Tip: For more information, check the Box text representation documentation.

Usage

To instantiate the reader, you only need a BoxClient object.

# Using CCG authentication
from llama_index.readers.box import BoxReaderTextExtraction
from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient

ccg_conf = CCGConfig(
    client_id="your_client_id",
    client_secret="your_client_secret",
    enterprise_id="your_enterprise_id",
    user_id="your_ccg_user_id",  # optional
)
auth = BoxCCGAuth(ccg_conf)
client = BoxClient(auth)
reader = BoxReaderTextExtraction(box_client=client)
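
Hardcoded placeholders are fine for a first try, but credentials usually come from the environment. The sketch below collects the CCG settings from environment variables, using the same variable names as the working example at the end of this article (BOX_CLIENT_ID, BOX_CLIENT_SECRET, BOX_ENTERPRISE_ID, BOX_USER_ID). The helper name load_ccg_settings is hypothetical and belongs to neither library; all it does is fail early when a required value is missing.

```python
import os
from typing import Dict, Mapping, Optional

# Variable names follow the working example in this article.
REQUIRED_VARS = ("BOX_CLIENT_ID", "BOX_CLIENT_SECRET", "BOX_ENTERPRISE_ID")


def load_ccg_settings(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Optional[str]]:
    """Hypothetical helper: collect CCG settings, failing early if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError(f"Missing Box settings: {', '.join(missing)}")
    return {
        "client_id": env["BOX_CLIENT_ID"],
        "client_secret": env["BOX_CLIENT_SECRET"],
        "enterprise_id": env["BOX_ENTERPRISE_ID"],
        "user_id": env.get("BOX_USER_ID"),  # optional
    }
```

The returned dictionary uses the same keyword names as the CCGConfig call above, so it should be usable as CCGConfig(**load_ccg_settings()).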

Load data

The load_data method extracts text content from Box files and creates LlamaIndex Document objects.

This method uses the Box API to retrieve the text representation (if available) of the specified Box files. It then creates Document objects containing the extracted text and file metadata.

  • file_ids (Optional[List[str]], optional): A list of Box file IDs to extract text from; If provided, folder_id is ignored; Defaults to None
  • folder_id (Optional[str], optional): The ID of the Box folder to extract text from; If provided, along with is_recursive set to True, retrieves data from sub-folders as well; Defaults to None
  • is_recursive (bool, optional): If True and folder_id is provided, extracts text from sub-folders within the specified folder; Defaults to False
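
The precedence between these parameters can be sketched in plain Python. This only illustrates the rules listed above and is not the reader's actual implementation; the function name describe_selection and its return strings are made up for the example, and the final branch is an assumption, since the article does not say what happens when neither argument is passed.

```python
from typing import List, Optional


def describe_selection(
    file_ids: Optional[List[str]] = None,
    folder_id: Optional[str] = None,
    is_recursive: bool = False,
) -> str:
    """Illustrative only: describe which files a load_data call would target."""
    if file_ids:
        # file_ids wins; folder_id and is_recursive are ignored.
        return f"files {', '.join(file_ids)}"
    if folder_id:
        scope = "and its sub-folders" if is_recursive else "only"
        return f"folder {folder_id} {scope}"
    # Assumption: nothing to load when neither argument is given.
    return "nothing selected"


print(describe_selection(file_ids=["1", "2"], folder_id="99"))
print(describe_selection(folder_id="99", is_recursive=True))
```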

Usage:

#### Using folder id
documents = reader.load_data(folder_id="folder_id")

#### Using file ids
documents = reader.load_data(file_ids=["file_id1", "file_id2"])

Other methods

The rest of the methods work exactly like the Box Reader:

  • Load resource — Load data from a specific resource
  • List resource — Lists the IDs of Box files based on the specified folder or file IDs
  • Read file content — Returns the binary content of a file
  • Search resources — Searches for Box resources based on specified criteria and returns a list of their IDs
  • Search resources by metadata — Searches for Box resources based on metadata and returns a list of their IDs
  • Get resource info — Get information about a specific resource

Box AI Prompt

The BoxReaderAIPrompt is a LlamaIndex reader class for loading data from Box files using a custom AI prompt.

This class inherits from the BoxReaderBase class and allows specifying a custom AI prompt for data extraction. It utilizes the provided BoxClient object to interact with the Box API and extracts data based on the prompt.

Box AI features are only available to Enterprise Plus customers.

Usage

To instantiate the reader, you only need a BoxClient object.

# Using CCG authentication

from llama_index.readers.box import BoxReaderAIPrompt
from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient

ccg_conf = CCGConfig(
    client_id="your_client_id",
    client_secret="your_client_secret",
    enterprise_id="your_enterprise_id",
    user_id="your_ccg_user_id",  # optional
)
auth = BoxCCGAuth(ccg_conf)
client = BoxClient(auth)

reader = BoxReaderAIPrompt(box_client=client)

Load data

The load_data method extracts data from Box files using a custom AI prompt and creates Document objects.

This method uses a user-provided AI prompt to extract data from the Box files. It then creates Document objects containing the extracted data along with file metadata.

  • ai_prompt (str): The custom AI prompt that specifies what data to extract from the files
  • file_ids (Optional[List[str]]): A list of Box file IDs to extract data from; If provided, folder_id is ignored; Defaults to None
  • folder_id (Optional[str]): The ID of the Box folder to extract data from; If provided, along with is_recursive set to True, retrieves data from sub-folders as well; Defaults to None
  • is_recursive (bool): If True and folder_id is provided, extracts data from sub-folders within the specified folder; Defaults to False.
  • individual_document_prompt (bool = True): If True, applies the provided AI prompt to each document individually; If False, all documents are used for context to the answer. Defaults to True.

Usage:

#### Using folder id
documents = reader.load_data(
    folder_id="folder_id", ai_prompt="summarize this document"
)

#### Using file ids
documents = reader.load_data(
    file_ids=["file_id1", "file_id2"], ai_prompt="summarize this document"
)

Please note:

The AI prompt is the instruction you send to Box AI. You can use it to generate text, answer questions, and more.

By default (individual_document_prompt=True), Box AI uses a single document as the context for each answer. However, Box AI can also answer a question using several documents as context.

For example, suppose you want to run an AI prompt against a set of support requests.

If you pass the list of support requests with individual_document_prompt=True, the AI Prompt reader generates one answer for each request.

If you instead want an answer per customer, pass the support requests from a specific customer with individual_document_prompt=False, and the AI Prompt reader generates an answer that draws on all of that customer's requests.
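
The difference between the two modes can be sketched without calling Box at all. In the snippet below, ask is a placeholder for a Box AI request and fake_ask is a stand-in that only reports how many files it received as context; neither is a function from box_sdk_gen or llama_index, and run_prompt is a hypothetical name. How the reader packages a shared answer into Document objects is not modeled here.

```python
from typing import Callable, List

# ask(prompt, file_ids) stands in for a Box AI request; it is a placeholder,
# not a function from box_sdk_gen or llama_index.
AskFn = Callable[[str, List[str]], str]


def run_prompt(
    ask: AskFn,
    prompt: str,
    file_ids: List[str],
    individual_document_prompt: bool = True,
) -> List[str]:
    """Illustrative only: return the answers produced in each mode."""
    if individual_document_prompt:
        # One request per file, each with a single file as context.
        return [ask(prompt, [file_id]) for file_id in file_ids]
    # One request, with every file as shared context.
    return [ask(prompt, file_ids)]


def fake_ask(prompt: str, file_ids: List[str]) -> str:
    """Stand-in for Box AI: report how many files were used as context."""
    return f"answer based on {len(file_ids)} file(s)"


requests = ["1", "2", "3"]
print(run_prompt(fake_ask, "summarize", requests))
print(run_prompt(fake_ask, "summarize", requests, individual_document_prompt=False))
```

With three support requests, the first call asks three separate questions and the second asks one question over all three files.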

Load resource

The load_resource method loads data from a specific resource.

  • resource (str): The resource identifier.
  • ai_prompt (str): The custom AI prompt that specifies what data to extract from the files.

Usage:

# test_data is the dictionary of sample IDs defined in the working example
resource_id = test_data["test_txt_invoice_id"]
docs = reader.load_resource(resource_id, ai_prompt="summarize this document")

Other methods

The rest of the methods work exactly like the Box Reader:

  • List resource — Lists the IDs of Box files based on the specified folder or file IDs
  • Read file content — Returns the binary content of a file
  • Search resources — Searches for Box resources based on specified criteria and returns a list of their IDs
  • Search resources by metadata — Searches for Box resources based on metadata and returns a list of their IDs
  • Get resource info — Get information about a specific resource

Box AI Extraction

BoxReaderAIExtract is a LlamaIndex reader class for loading data from Box files using Box AI Extract.

This class inherits from the BoxReaderBase class and specializes in processing data from Box files using Box AI Extract. It uses the provided BoxClient object to interact with the Box API and extracts data based on a specified AI prompt.

Box AI features are only available to Enterprise Plus customers.

Note: Box AI Extraction is currently in beta, and the implementation may change.

Usage

To instantiate the reader, you only need a BoxClient object.

# Using CCG authentication

from llama_index.readers.box import BoxReaderAIExtract
from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient

ccg_conf = CCGConfig(
    client_id="your_client_id",
    client_secret="your_client_secret",
    enterprise_id="your_enterprise_id",
    user_id="your_ccg_user_id",  # optional
)
auth = BoxCCGAuth(ccg_conf)
client = BoxClient(auth)

reader = BoxReaderAIExtract(box_client=client)

Load data

The load_data method extracts data from Box files using Box AI and creates Document objects.

This method uses the Box AI Extract functionality to extract data from the specified Box files, based on the provided AI prompt. It then creates Document objects containing the extracted data along with file metadata.

  • ai_prompt (str): The AI prompt that specifies what data to extract from the files
  • file_ids (Optional[List[str]]): A list of Box file IDs to extract data from; If provided, folder_id is ignored; Defaults to None
  • folder_id (Optional[str]): The ID of the Box folder to extract data from; If provided, along with is_recursive set to True, retrieves data from sub-folders as well; Defaults to None
  • is_recursive (bool): If True and folder_id is provided, extracts data from sub-folders within the specified folder; Defaults to False.

Usage:

#### Using folder id
documents = reader.load_data(
    folder_id="folder_id",
    ai_prompt='{"doc_type","date","total","vendor","invoice_number","purchase_order_number"}',
)

#### Using file ids
documents = reader.load_data(
    file_ids=["file_id1", "file_id2"],
    ai_prompt='{"doc_type","date","total","vendor","invoice_number","purchase_order_number"}',
)

Please note:

The ai_prompt defines the structure of the data that will be extracted from the documents. It can be a dictionary string:

{
    "doc_type",
    "date",
    "total",
    "vendor",
    "invoice_number",
    "purchase_order_number",
}

A JSON string:

{
    "fields": [
        {
            "key": "vendor",
            "displayName": "Vendor",
            "type": "string",
            "description": "Vendor name"
        },
        {
            "key": "documentType",
            "displayName": "Type",
            "type": "string",
            "description": ""
        }
    ]
}

Or even conversational English text:

"find the document type (invoice or po), vendor, total, and po number"

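Writing the JSON form by hand inside a string literal is easy to get wrong, for example with a stray trailing comma. As a minimal sketch, the prompt can instead be built from Python dictionaries and serialized with json.dumps. The helper name build_fields_prompt is hypothetical; the field attributes (key, displayName, type, description) are the ones shown in the JSON example above.

```python
import json
from typing import Dict, List


def build_fields_prompt(fields: List[Dict[str, str]]) -> str:
    """Hypothetical helper: serialize field definitions into a JSON prompt string."""
    return json.dumps({"fields": fields})


ai_prompt = build_fields_prompt(
    [
        {
            "key": "vendor",
            "displayName": "Vendor",
            "type": "string",
            "description": "Vendor name",
        },
        {
            "key": "documentType",
            "displayName": "Type",
            "type": "string",
            "description": "",
        },
    ]
)
print(ai_prompt)
```

The resulting string can then be passed as the ai_prompt argument of load_data.
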
Load resource

Load data from a specific resource.

  • resource (str): The resource identifier (file_id)
  • ai_prompt (str): The AI prompt that specifies what data to extract from the files

Usage:

AI_PROMPT = '{"doc_type","date","total","vendor","invoice_number","purchase_order_number"}'
# test_data is the dictionary of sample IDs defined in the working example
resource_id = test_data["test_txt_invoice_id"]
docs = reader.load_resource(resource_id, ai_prompt=AI_PROMPT)

Other methods

The rest of the methods work exactly like the Box Reader:

  • List resource — Lists the IDs of Box files based on the specified folder or file IDs
  • Read file content — Returns the binary content of a file
  • Search resources — Searches for Box resources based on specified criteria and returns a list of their IDs
  • Search resources by metadata — Searches for Box resources based on metadata and returns a list of their IDs
  • Get resource info — Get information about a specific resource

Working example

Consider this code:

import os
from typing import List
import dotenv

from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient, File
from llama_index.readers.box import (
    BoxReader,
    BoxReaderTextExtraction,
    BoxReaderAIPrompt,
    BoxReaderAIExtract,
)
from llama_index.core.schema import Document


def get_box_client() -> BoxClient:
    dotenv.load_dotenv()

    # Common configurations
    client_id = os.getenv("BOX_CLIENT_ID", "YOUR_BOX_CLIENT_ID")
    client_secret = os.getenv("BOX_CLIENT_SECRET", "YOUR_BOX_CLIENT_SECRET")

    # CCG configurations
    enterprise_id = os.getenv("BOX_ENTERPRISE_ID", "YOUR_BOX_ENTERPRISE_ID")
    ccg_user_id = os.getenv("BOX_USER_ID")

    config = CCGConfig(
        client_id=client_id,
        client_secret=client_secret,
        enterprise_id=enterprise_id,
        user_id=ccg_user_id,
    )

    auth = BoxCCGAuth(config)
    if config.user_id:
        auth.with_user_subject(config.user_id)

    return BoxClient(auth)


def get_testing_data() -> dict:
    return {
        "disable_folder_tests": True,
        "test_folder_id": "273980493541",
        "test_doc_id": "1584054722303",
        "test_ppt_id": "1584056661506",
        "test_xls_id": "1584048916472",
        "test_pdf_id": "1584049890463",
        "test_json_id": "1584058432468",
        "test_csv_id": "1584054196674",
        "test_txt_waiver_id": "1514587167701",
        "test_folder_invoice_po_id": "261452450320",
        "test_folder_purchase_order_id": "261457585224",
        "test_txt_invoice_id": "1517629086517",
        "test_txt_po_id": "1517628697289",
    }


def print_docs(label: str, docs: List[Document]):
    print("------------------------------")
    print(f"{label}: {len(docs)} document(s)")

    for doc in docs:
        print("------------------------------")
        file = File.from_dict(doc.extra_info)
        print(f"File ID: {file.id}\nName: {file.name}\nSize: {file.size} bytes")
        # print("------------------------------")
        print(f"Text: {doc.text[:100]} ...")
    print("------------------------------\n\n\n")


def main():
    box_client = get_box_client()
    test_data = get_testing_data()

    # Text extraction
    reader = BoxReaderTextExtraction(box_client=box_client)
    docs = reader.load_data(file_ids=[test_data["test_txt_waiver_id"]])
    print_docs("BoxReader Text Extraction", docs)

    # AI prompt
    reader = BoxReaderAIPrompt(box_client=box_client)
    docs = reader.load_data(
        file_ids=[test_data["test_txt_waiver_id"]], ai_prompt="summarize this document"
    )
    print_docs("Box Reader AI Prompt", docs)

    # AI extract
    reader = BoxReaderAIExtract(box_client=box_client)
    docs = reader.load_data(
        file_ids=[test_data["test_txt_invoice_id"]],
        ai_prompt='{"doc_type","date","total","vendor","invoice_number","purchase_order_number"}',
    )
    print_docs("BoxReader AI Extract", docs)

    docs = reader.load_data(
        folder_id=test_data["test_folder_purchase_order_id"],
        ai_prompt='{"doc_type","date","total","vendor","invoice_number","purchase_order_number"}',
        is_recursive=True,
    )
    print_docs("BoxReader AI Extract", docs)


if __name__ == "__main__":
    main()

Results in:

------------------------------
BoxReader Text Extraction: 1 document(s)
------------------------------
File ID: 1514587167701
Name: Box-Dive-Waiver.docx
Size: 7409 bytes
Text: YOU MUST BE ABLE TO SWIM TO PARTICIPATE IN ANY IN WATER ACTIVITIES.

YOU MUST BE IN HEALTHY AND GOO ...
------------------------------



------------------------------
Box Reader AI Prompt: 1 document(s)
------------------------------
File ID: 1514587167701
Name: Box-Dive-Waiver.docx
Size: 7409 bytes
Text: The document is a liability release form for participants in water
activities, specifically scuba di ...
------------------------------



------------------------------
BoxReader AI Extract: 1 document(s)
------------------------------
File ID: 1517629086517
Name: Invoice-Q2468.txt
Size: 176 bytes
Text: {"doc_type": "Invoice", "date": "August 2, 2024", "total": "$1,050",
"vendor": "Quasar Innovations", ...
------------------------------



------------------------------
BoxReader AI Extract: 5 document(s)
------------------------------
File ID: 1517628618684
Name: PO-001.txt
Size: 212 bytes
Text: {"Purchase Order Number": "001", "Date": "February 13, 2024",
"Total": "$575", "Vendor": "Galactic G ...
------------------------------
File ID: 1517626773559
Name: PO-002.txt
Size: 229 bytes
Text: {"purchase_order_number": "002", "date": "February 13, 2024",
"total": "$230", "vendor": "Cosmic Con ...
------------------------------
File ID: 1517628291707
Name: PO-003.txt
Size: 222 bytes
Text: {"purchase_order_number": "003", "date": "February 13, 2024",
"total": "$1,050", "vendor": "Quasar I ...
------------------------------
File ID: 1517625894126
Name: PO-004.txt
Size: 217 bytes
Text: {"purchase_order_number": "004", "date": "February 13, 2024",
"total": "$920", "vendor": "AstroTech ...
------------------------------
File ID: 1517628697289
Name: PO-005.txt
Size: 211 bytes
Text: {"purchase_order_number": "005", "date": "February 13, 2024",
"total": "$45", "vendor": "Quantum Qui ...
------------------------------
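
Notice that the extracted keys are not always in the same style: PO-001 came back with "Purchase Order Number" while the other files use "purchase_order_number". Assuming each document's text holds a complete JSON object (the output above is truncated, so that is an assumption), a small helper can parse the text and normalize the keys before further processing. The name normalize_keys is hypothetical, and the sample string is a shortened illustration modeled on the first purchase order, not the full result.

```python
import json
from typing import Any, Dict


def normalize_keys(extracted_text: str) -> Dict[str, Any]:
    """Parse an extraction result and convert its keys to snake_case."""
    data = json.loads(extracted_text)
    return {
        key.strip().lower().replace(" ", "_"): value
        for key, value in data.items()
    }


# Shortened sample modeled on the output above; real results carry more fields.
sample = '{"Purchase Order Number": "001", "Date": "February 13, 2024", "Total": "$575"}'
print(normalize_keys(sample))
```

In the working example, this would be called as normalize_keys(doc.text) for each returned document.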

Thoughts? Comments? Feedback?

Drop us a line in our community forum.
