Box Llama-Index Readers: The cool stuff
In the previous article, we covered the classic Box Llama-Index reader, and we also revealed the existence of other readers that use Box services to extract data from the Intelligent Content Cloud.
Box differs from traditional cloud storage providers in that it offers built-in services you can call directly. These include text extraction, generic AI prompting, and specialized structured data AI extraction.
- Box Text Extraction — Uses Box text representation to extract text from documents directly
- Box AI Prompt — Uses Box AI to extract context from documents
- Box AI Extraction — Uses Box AI to extract structured data from documents
Box Text Extraction
The BoxReaderTextExtraction is a LlamaIndex reader class used for loading text content from Box files directly.
This class inherits from the BoxReaderBase class and specializes in extracting plain text content from Box files. It uses the provided BoxClient object to interact with the Box API and retrieve the text representation of the files.
Tip: For more information, check the Box text representation documentation.
Usage
To instantiate the reader, you only need a BoxClient object.
# Using CCG authentication
from llama_index.readers.box import BoxReaderTextExtraction
from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient
ccg_conf = CCGConfig(
    client_id="your_client_id",
    client_secret="your_client_secret",
    enterprise_id="your_enterprise_id",
    user_id="your_ccg_user_id",  # optional
)
auth = BoxCCGAuth(ccg_conf)
client = BoxClient(auth)
reader = BoxReaderTextExtraction(box_client=client)
Load data
The load_data method extracts text content from Box files and creates LlamaIndex Document objects.
This method uses the Box API to retrieve the text representation (if available) of the specified Box files. It then creates Document objects containing the extracted text and file metadata.
Parameters:
- file_ids (Optional[List[str]], optional): A list of Box file IDs to extract text from. If provided, folder_id is ignored. Defaults to None.
- folder_id (Optional[str], optional): The ID of the Box folder to extract text from. If provided along with is_recursive set to True, retrieves data from sub-folders as well. Defaults to None.
- is_recursive (bool, optional): If True and folder_id is provided, extracts text from sub-folders within the specified folder. Defaults to False.
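The precedence between file_ids, folder_id, and is_recursive can be sketched as a small pure function. This only illustrates the documented rules and is not the reader's actual implementation: describe_load_target is a hypothetical helper, and what happens when neither argument is given is an assumption.

```python
from typing import List, Optional, Tuple


def describe_load_target(
    file_ids: Optional[List[str]] = None,
    folder_id: Optional[str] = None,
    is_recursive: bool = False,
) -> Tuple[str, object, bool]:
    """Return (kind, target, recursive) following the documented precedence.

    Hypothetical helper for illustration only; it never calls Box.
    """
    if file_ids:
        # file_ids wins: folder_id and is_recursive are ignored.
        return ("files", list(file_ids), False)
    if folder_id:
        # is_recursive only matters when a folder is the target.
        return ("folder", folder_id, bool(is_recursive))
    # Assumption: with no target there is nothing to load.
    return ("none", None, False)
```

For instance, passing both file_ids and folder_id to this helper yields a "files" target, mirroring the rule that folder_id is ignored when file IDs are supplied.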
Usage:
#### Using folder id
documents = reader.load_data(folder_id="folder_id")
#### Using file ids
documents = reader.load_data(file_ids=["file_id1", "file_id2"])
Other methods
The rest of the methods work exactly like the Box Reader:
- Load resource — Load data from a specific resource
- List resource — Lists the IDs of Box files based on the specified folder or file IDs
- Read file content — Returns the binary content of a file
- Search resources — Searches for Box resources based on specified criteria and returns a list of their IDs
- Search resources by metadata — Searches for Box resources based on metadata and returns a list of their IDs
- Get resource info — Get information about a specific resource
Box AI Prompt
The BoxReaderAIPrompt is a LlamaIndex reader class for loading data from Box files using a custom AI prompt.
This class inherits from the BoxReaderBase class and allows specifying a custom AI prompt for data extraction. It uses the provided BoxClient object to interact with the Box API and extracts data based on the prompt.
Box AI features are only available to Enterprise Plus customers.
Usage
To instantiate the reader, you only need a BoxClient object.
# Using CCG authentication
from llama_index.readers.box import BoxReaderAIPrompt
from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient
ccg_conf = CCGConfig(
    client_id="your_client_id",
    client_secret="your_client_secret",
    enterprise_id="your_enterprise_id",
    user_id="your_ccg_user_id",  # optional
)
auth = BoxCCGAuth(ccg_conf)
client = BoxClient(auth)
reader = BoxReaderAIPrompt(box_client=client)
Load data
The load_data method extracts data from Box files using a custom AI prompt and creates Document objects.
This method uses a user-provided AI prompt to extract data from the Box files. It then creates Document objects containing the extracted data along with file metadata.
Parameters:
- ai_prompt (str): The custom AI prompt that specifies what data to extract from the files.
- file_ids (Optional[List[str]]): A list of Box file IDs to extract data from. If provided, folder_id is ignored. Defaults to None.
- folder_id (Optional[str]): The ID of the Box folder to extract data from. If provided along with is_recursive set to True, retrieves data from sub-folders as well. Defaults to None.
- is_recursive (bool): If True and folder_id is provided, extracts data from sub-folders within the specified folder. Defaults to False.
- individual_document_prompt (bool = True): If True, applies the provided AI prompt to each document individually. If False, all documents are used as context for the answer. Defaults to True.
Usage:
#### Using folder id
documents = reader.load_data(
    folder_id="folder_id", ai_prompt="summarize this document"
)
#### Using file ids
documents = reader.load_data(
    file_ids=["file_id1", "file_id2"], ai_prompt="summarize this document"
)
Please note:
The AI prompt is the instruction you send to Box AI. You can use it to generate text, answer questions, and more.
By default, Box AI uses the context of a single document (individual_document_prompt=True); however, it can also answer questions by looking at the context of multiple documents.
For example, suppose you want to apply an AI prompt to support requests.
You can pass a list of support requests with individual_document_prompt=True, and the AI Prompt reader will generate an answer for each one.
On the other hand, if you want an answer for support requests grouped by customer, you can pass the list of support requests from a specific customer with individual_document_prompt=False, and the AI Prompt reader will generate an answer for that customer.
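The difference between the two modes comes down to how many questions are asked and what context each one sees. The sketch below mimics that with a stand-in ask callable instead of Box AI. The names run_prompt and fake_ask and the sample tickets are hypothetical, and the sketch does not show how the real reader packages answers into Document objects.

```python
from typing import Callable, Dict, List


def run_prompt(
    docs: Dict[str, str],
    prompt: str,
    ask: Callable[[str, List[str]], str],
    individual_document_prompt: bool = True,
) -> List[str]:
    """Return the answers produced for a set of documents.

    docs maps a file ID to its text. ask(prompt, texts) stands in for the
    AI call and receives the texts it may use as context.
    """
    if individual_document_prompt:
        # One question per document, each seeing only that document.
        return [ask(prompt, [text]) for text in docs.values()]
    # One question in total, seeing every document as shared context.
    return [ask(prompt, list(docs.values()))]


def fake_ask(prompt: str, texts: List[str]) -> str:
    # Stand-in for the AI: reports how much context it was given.
    return f"{prompt} -> answered using {len(texts)} document(s)"


tickets = {"t1": "Login fails", "t2": "Invoice is wrong", "t3": "App crashes"}
per_ticket = run_prompt(tickets, "summarize", fake_ask, True)
per_customer = run_prompt(tickets, "summarize", fake_ask, False)
```

Here per_ticket holds one answer per support request, while per_customer holds one answer built from all three requests together.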
Load resource
The load_resource method loads data from a specific resource.
Parameters:
- resource (str): The resource identifier.
- ai_prompt (str): The custom AI prompt that specifies what data to extract from the files.
Usage:
# test_data is the dictionary returned by get_testing_data() in the working example below
resource_id = test_data["test_txt_invoice_id"]
docs = reader.load_resource(resource_id, ai_prompt="summarize this document")
Other methods
The rest of the methods work exactly like the Box Reader:
- List resource — Lists the IDs of Box files based on the specified folder or file IDs
- Read file content — Returns the binary content of a file
- Search resources — Searches for Box resources based on specified criteria and returns a list of their IDs
- Search resources by metadata — Searches for Box resources based on metadata and returns a list of their IDs
- Get resource info — Get information about a specific resource
Box AI Extraction
The BoxReaderAIExtract is a LlamaIndex reader class for loading data from Box files using Box AI Extract.
This class inherits from the BoxReaderBase class and specializes in processing data from Box files using Box AI Extract. It uses the provided BoxClient object to interact with the Box API and extracts data based on a specified AI prompt.
Box AI features are only available to Enterprise Plus customers.
Note: Box AI Extraction is currently in beta, and the implementation may change.
Usage
To instantiate the reader, you only need a BoxClient object.
# Using CCG authentication
from llama_index.readers.box import BoxReaderAIExtract
from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient
ccg_conf = CCGConfig(
    client_id="your_client_id",
    client_secret="your_client_secret",
    enterprise_id="your_enterprise_id",
    user_id="your_ccg_user_id",  # optional
)
auth = BoxCCGAuth(ccg_conf)
client = BoxClient(auth)
reader = BoxReaderAIExtract(box_client=client)
Load data
The load_data method extracts data from Box files using Box AI and creates Document objects.
This method uses the Box AI Extract functionality to extract data from the specified Box files based on the provided AI prompt. It then creates Document objects containing the extracted data along with file metadata.
Parameters:
- ai_prompt (str): The AI prompt that specifies what data to extract from the files.
- file_ids (Optional[List[str]]): A list of Box file IDs to extract data from. If provided, folder_id is ignored. Defaults to None.
- folder_id (Optional[str]): The ID of the Box folder to extract data from. If provided along with is_recursive set to True, retrieves data from sub-folders as well. Defaults to None.
- is_recursive (bool): If True and folder_id is provided, extracts data from sub-folders within the specified folder. Defaults to False.
Usage:
#### Using folder id
documents = reader.load_data(
    folder_id="folder_id",
    ai_prompt='{"doc_type","date","total","vendor","invoice_number","purchase_order_number"}',
)
#### Using file ids
documents = reader.load_data(
    file_ids=["file_id1", "file_id2"],
    ai_prompt='{"doc_type","date","total","vendor","invoice_number","purchase_order_number"}',
)
Please note:
The ai_prompt defines the structure of the data that will be extracted from the documents. It can be a dictionary string:
{
    "doc_type",
    "date",
    "total",
    "vendor",
    "invoice_number",
    "purchase_order_number",
}
A JSON string:
{
    "fields": [
        {
            "key": "vendor",
            "displayName": "Vendor",
            "type": "string",
            "description": "Vendor name"
        },
        {
            "key": "documentType",
            "displayName": "Type",
            "type": "string",
            "description": ""
        }
    ]
}
Or even conversational English text:
"find the document type (invoice or po), vendor, total, and po number"
Load resource
The load_resource method loads data from a specific resource.
Parameters:
- resource (str): The resource identifier (file_id).
- ai_prompt (str): The AI prompt that specifies what data to extract from the files.
Usage:
AI_PROMPT = '{"doc_type","date","total","vendor","invoice_number","purchase_order_number"}'
# test_data is the dictionary returned by get_testing_data() in the working example below
resource_id = test_data["test_txt_invoice_id"]
docs = reader.load_resource(resource_id, ai_prompt=AI_PROMPT)
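Writing the JSON form of ai_prompt by hand invites quoting mistakes, so it can be generated from field definitions instead. The sketch below does this with json.dumps. Note that build_fields_prompt is a hypothetical helper, the schema keys (key, displayName, type, description) are simply those shown in the JSON example earlier in this section, and the defaults it fills in are assumptions.

```python
import json
from typing import Dict, List


def build_fields_prompt(fields: List[Dict[str, str]]) -> str:
    """Serialize field definitions into a JSON prompt string (hypothetical helper)."""
    normalized = []
    for field in fields:
        if "key" not in field:
            raise ValueError("every field needs a 'key'")
        normalized.append(
            {
                "key": field["key"],
                # Assumption: fall back to the key when no display name is given.
                "displayName": field.get("displayName", field["key"]),
                "type": field.get("type", "string"),
                "description": field.get("description", ""),
            }
        )
    return json.dumps({"fields": normalized})


FIELDS_PROMPT = build_fields_prompt(
    [
        {"key": "vendor", "displayName": "Vendor", "description": "Vendor name"},
        {"key": "documentType", "displayName": "Type"},
    ]
)
```

The resulting FIELDS_PROMPT string can then be passed as the ai_prompt argument of load_data or load_resource.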
Other methods
The rest of the methods work exactly like the Box Reader:
- List resource — Lists the IDs of Box files based on the specified folder or file IDs
- Read file content — Returns the binary content of a file
- Search resources — Searches for Box resources based on specified criteria and returns a list of their IDs
- Search resources by metadata — Searches for Box resources based on metadata and returns a list of their IDs
- Get resource info — Get information about a specific resource
Working example
Consider this code:
import os
from typing import List

import dotenv
from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient, File
from llama_index.readers.box import (
    BoxReader,
    BoxReaderTextExtraction,
    BoxReaderAIPrompt,
    BoxReaderAIExtract,
)
from llama_index.core.schema import Document


def get_box_client() -> BoxClient:
    dotenv.load_dotenv()

    # Common configurations
    client_id = os.getenv("BOX_CLIENT_ID", "YOUR_BOX_CLIENT_ID")
    client_secret = os.getenv("BOX_CLIENT_SECRET", "YOUR_BOX_CLIENT_SECRET")

    # CCG configurations
    enterprise_id = os.getenv("BOX_ENTERPRISE_ID", "YOUR_BOX_ENTERPRISE_ID")
    ccg_user_id = os.getenv("BOX_USER_ID")

    config = CCGConfig(
        client_id=client_id,
        client_secret=client_secret,
        enterprise_id=enterprise_id,
        user_id=ccg_user_id,
    )
    auth = BoxCCGAuth(config)
    if config.user_id:
        auth.with_user_subject(config.user_id)
    return BoxClient(auth)


def get_testing_data() -> dict:
    return {
        "disable_folder_tests": True,
        "test_folder_id": "273980493541",
        "test_doc_id": "1584054722303",
        "test_ppt_id": "1584056661506",
        "test_xls_id": "1584048916472",
        "test_pdf_id": "1584049890463",
        "test_json_id": "1584058432468",
        "test_csv_id": "1584054196674",
        "test_txt_waiver_id": "1514587167701",
        "test_folder_invoice_po_id": "261452450320",
        "test_folder_purchase_order_id": "261457585224",
        "test_txt_invoice_id": "1517629086517",
        "test_txt_po_id": "1517628697289",
    }


def print_docs(label: str, docs: List[Document]):
    print("------------------------------")
    print(f"{label}: {len(docs)} document(s)")
    for doc in docs:
        print("------------------------------")
        file = File.from_dict(doc.extra_info)
        print(f"File ID: {file.id}\nName: {file.name}\nSize: {file.size} bytes")
        # print("------------------------------")
        print(f"Text: {doc.text[:100]} ...")
    print("------------------------------\n\n\n")


def main():
    box_client = get_box_client()
    test_data = get_testing_data()

    # Text extraction
    reader = BoxReaderTextExtraction(box_client=box_client)
    docs = reader.load_data(file_ids=[test_data["test_txt_waiver_id"]])
    print_docs("BoxReader Text Extraction", docs)

    # AI prompt
    reader = BoxReaderAIPrompt(box_client=box_client)
    docs = reader.load_data(
        file_ids=[test_data["test_txt_waiver_id"]], ai_prompt="summarize this document"
    )
    print_docs("Box Reader AI Prompt", docs)

    # AI extract
    reader = BoxReaderAIExtract(box_client=box_client)
    docs = reader.load_data(
        file_ids=[test_data["test_txt_invoice_id"]],
        ai_prompt='{"doc_type","date","total","vendor","invoice_number","purchase_order_number"}',
    )
    print_docs("BoxReader AI Extract", docs)

    docs = reader.load_data(
        folder_id=test_data["test_folder_purchase_order_id"],
        ai_prompt='{"doc_type","date","total","vendor","invoice_number","purchase_order_number"}',
        is_recursive=True,
    )
    print_docs("BoxReader AI Extract", docs)


if __name__ == "__main__":
    main()
Results in:
------------------------------
BoxReader Text Extraction: 1 document(s)
------------------------------
File ID: 1514587167701
Name: Box-Dive-Waiver.docx
Size: 7409 bytes
Text: YOU MUST BE ABLE TO SWIM TO PARTICIPATE IN ANY IN WATER ACTIVITIES.
YOU MUST BE IN HEALTHY AND GOO ...
------------------------------
------------------------------
Box Reader AI Prompt: 1 document(s)
------------------------------
File ID: 1514587167701
Name: Box-Dive-Waiver.docx
Size: 7409 bytes
Text: The document is a liability release form for participants in water
activities, specifically scuba di ...
------------------------------
------------------------------
BoxReader AI Extract: 1 document(s)
------------------------------
File ID: 1517629086517
Name: Invoice-Q2468.txt
Size: 176 bytes
Text: {"doc_type": "Invoice", "date": "August 2, 2024", "total": "$1,050",
"vendor": "Quasar Innovations", ...
------------------------------
------------------------------
BoxReader AI Extract: 5 document(s)
------------------------------
File ID: 1517628618684
Name: PO-001.txt
Size: 212 bytes
Text: {"Purchase Order Number": "001", "Date": "February 13, 2024",
"Total": "$575", "Vendor": "Galactic G ...
------------------------------
File ID: 1517626773559
Name: PO-002.txt
Size: 229 bytes
Text: {"purchase_order_number": "002", "date": "February 13, 2024",
"total": "$230", "vendor": "Cosmic Con ...
------------------------------
File ID: 1517628291707
Name: PO-003.txt
Size: 222 bytes
Text: {"purchase_order_number": "003", "date": "February 13, 2024",
"total": "$1,050", "vendor": "Quasar I ...
------------------------------
File ID: 1517625894126
Name: PO-004.txt
Size: 217 bytes
Text: {"purchase_order_number": "004", "date": "February 13, 2024",
"total": "$920", "vendor": "AstroTech ...
------------------------------
File ID: 1517628697289
Name: PO-005.txt
Size: 211 bytes
Text: {"purchase_order_number": "005", "date": "February 13, 2024",
"total": "$45", "vendor": "Quantum Qui ...
------------------------------
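Note that the extracted keys are not always spelled the same way: PO-001 came back with "Purchase Order Number" while the other files used "purchase_order_number". Below is a minimal sketch of normalizing the keys before further processing, assuming each Document's text is a JSON object as the output above suggests. The helper normalize_extraction is hypothetical, and the sample strings contain only the fields visible in the truncated output.

```python
import json
import re
from typing import Dict


def normalize_extraction(text: str) -> Dict[str, str]:
    """Parse an AI Extract result and convert its keys to snake_case."""
    data = json.loads(text)
    normalized = {}
    for key, value in data.items():
        # Lowercase, then collapse spaces and punctuation into underscores.
        snake = re.sub(r"[^0-9a-z]+", "_", key.strip().lower()).strip("_")
        normalized[snake] = value
    return normalized


# Sample strings limited to the fields visible in the output above.
po_001 = '{"Purchase Order Number": "001", "Date": "February 13, 2024", "Total": "$575"}'
po_002 = '{"purchase_order_number": "002", "date": "February 13, 2024", "total": "$230"}'

rows = [normalize_extraction(po_001), normalize_extraction(po_002)]
```

In the working example, the same helper could be applied as normalize_extraction(doc.text) for each returned document, provided the text really is valid JSON.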
Thoughts? Comments? Feedback?
Drop us a line in our community forum.