Introducing Box Llama-Index Reader
Ever wanted to integrate your Box documents into cutting-edge Retrieval Augmented Generation (RAG) and other large language model (LLM) applications?
This suite of readers bridges the gap between Box and Llama-Index, allowing you to harness the rich content you store in your Box directly within your LLM workflows.
Available Box readers
You might be surprised to learn that Box offers four different readers.
There is the traditional reader, similar to those for other cloud storage providers, which downloads the file and then uses a Llama-Index parser.
Because Box is not your traditional cloud storage provider, we can use built-in services, such as text extraction, generic AI prompting, and specialized structured data AI extraction.
- Box Reader: Implementation of the SimpleReader interface to read files from Box
- Box Text Extraction: Uses Box text representation to extract text from documents directly
- Box AI Prompt: Uses Box AI to extract context from documents
- Box AI Extraction: Uses Box AI to extract structured data from documents
Box AI features are only available to Enterprise Plus customers.
Getting started
The Box readers live in the LlamaIndex GitHub repository and can be installed like any other reader:
pip install llama-index-readers-box
Authentication
All Box readers require a BoxClient. This simplifies the reader code and gives developers the greatest flexibility: you can authenticate with any method Box supports (Developer Token, OAuth 2.0, Client Credentials Grant, or JSON Web Token) and pass the resulting client to the reader.
Here are some examples:
Using CCG authentication
from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient
from llama_index.readers.box import BoxReader

config = CCGConfig(
    client_id="your_client_id",
    client_secret="your_client_secret",
    enterprise_id="your_enterprise_id",
    user_id="your_ccg_user_id",  # Optional user id
)
auth = BoxCCGAuth(config)
if config.user_id:
    auth.with_user_subject(config.user_id)
client = BoxClient(auth)
reader = BoxReader(box_client=client)
By default the CCG client will use a service account associated with the application. Depending on how the files are shared, the service account may not have access to all the files.
If you want to select a different user, you can specify the user ID. In this case make sure your application can impersonate and/or generate user tokens in the scope.
Check out this guide for more information on how to set up CCG: Box CCG Guide
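Hard-coded credentials are fine for a quick test, but you will probably want to read them from the environment. Here is a minimal sketch, assuming the environment variable names BOX_CLIENT_ID, BOX_CLIENT_SECRET, BOX_ENTERPRISE_ID, and BOX_USER_ID. The helper load_ccg_settings is hypothetical and not part of the Box SDK; it only builds and validates a plain dictionary whose keys match the CCGConfig arguments shown above.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical mapping, not part of box_sdk_gen: CCGConfig argument names
# and the (assumed) environment variables that hold their values.
REQUIRED = {
    "client_id": "BOX_CLIENT_ID",
    "client_secret": "BOX_CLIENT_SECRET",
    "enterprise_id": "BOX_ENTERPRISE_ID",
}
OPTIONAL = {"user_id": "BOX_USER_ID"}


def load_ccg_settings(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Optional[str]]:
    """Collect CCG settings from the environment, failing early if any are missing."""
    env = os.environ if env is None else env
    missing = [var for var in REQUIRED.values() if not env.get(var)]
    if missing:
        raise ValueError(f"Missing Box settings: {', '.join(missing)}")
    settings: Dict[str, Optional[str]] = {
        key: env[var] for key, var in REQUIRED.items()
    }
    # user_id stays None when unset, so the service account is used.
    for key, var in OPTIONAL.items():
        settings[key] = env.get(var) or None
    return settings
```

The resulting dictionary can then be unpacked into the configuration object, for example CCGConfig(**load_ccg_settings()).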
Using JWT authentication
from box_sdk_gen import JWTConfig, BoxJWTAuth, BoxClient
from llama_index.readers.box import BoxReader

# Using manual configuration
config = JWTConfig(
    client_id="YOUR_BOX_CLIENT_ID",
    client_secret="YOUR_BOX_CLIENT_SECRET",
    jwt_key_id="YOUR_BOX_JWT_KEY_ID",
    private_key="YOUR_BOX_PRIVATE_KEY",
    private_key_passphrase="YOUR_BOX_PRIVATE_KEY_PASSPHRASE",
    enterprise_id="YOUR_BOX_ENTERPRISE_ID",
    user_id="YOUR_BOX_USER_ID",
)

# Using configuration file (alternative to the manual configuration above)
config = JWTConfig.from_config_file("path/to/your/.config.json")
user_id = "1234"  # Optional user id
if user_id:
    config.user_id = user_id
    config.enterprise_id = None

auth = BoxJWTAuth(config)
client = BoxClient(auth)
reader = BoxReader(box_client=client)
By default the JWT client will use a service account associated with the application. Depending on how the files are shared, the service account may not have access to all the files.
If you want to select a different user, you can specify the user ID. In this case make sure your application can impersonate and/or generate user tokens in the scope.
Check out this guide for more information on how to set up JWT: Box JWT Guide
The JWT authentication requires extra dependencies in the SDK. You can install them by running:
pip install "box-sdk-gen[jwt]"
Box Reader
This loader reads files from Box using the LlamaIndex SimpleReader, and does not take advantage of any Box-specific features.
Load data
This method retrieves Box files based on the provided parameters and processes them into a structured format using a SimpleDirectoryReader.
- folder_id (Optional[str]): The ID of the Box folder to load data from; if provided along with is_recursive set to True, retrieves data from sub-folders as well; defaults to None
- file_ids (Optional[List[str]]): A list of Box file IDs to load data from; if provided, folder_id is ignored; defaults to None
- is_recursive (bool = False): If True and folder_id is provided, retrieves data from sub-folders within the specified folder; defaults to False
Be careful with large folders, especially when is_recursive is set to True: the number of files and folders can become overwhelming, at which point the reader becomes impractical.
Usage:
# Using CCG authentication
from llama_index.readers.box import BoxReader
from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient

ccg_conf = CCGConfig(
    client_id="your_client_id",
    client_secret="your_client_secret",
    enterprise_id="your_enterprise_id",
    user_id="your_ccg_user_id",  # optional
)
auth = BoxCCGAuth(ccg_conf)
box_client = BoxClient(auth)
reader = BoxReader(box_client=box_client)

# Using a list of file ids
docs = reader.load_data(file_ids=["test_csv_id"])

# Using a folder id
docs = reader.load_data(folder_id="test_folder_id")
Load resource
Load data from a specific resource.
- resource (str): The resource identifier (file_id)
Usage:
doc = reader.load_resource("test_csv_id")
List resources
Lists the IDs of Box files based on the specified folder or file IDs.
This method retrieves a list of Box file identifiers based on the provided parameters. You can either specify a list of file IDs or a folder ID with an optional is_recursive flag to include files from sub-folders as well.
- folder_id (Optional[str]): The ID of the Box folder to load data from; if provided along with is_recursive set to True, retrieves data from sub-folders as well; defaults to None
- file_ids (Optional[List[str]]): A list of Box file IDs to load data from; if provided, folder_id is ignored; defaults to None
- is_recursive (bool = False): If True and folder_id is provided, retrieves data from sub-folders within the specified folder; defaults to False
Usage:
resources = reader.list_resources(file_ids=["test_csv_id"])
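For large folders, one way to keep things manageable is to list the file IDs first and then load them in small batches. The following is a minimal sketch under that assumption. The names chunked, load_in_batches, and batch_size are hypothetical and not part of the package; the function relies only on the load_data(file_ids=...) call shown earlier.

```python
from typing import Any, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids` with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def load_in_batches(
    reader: Any, file_ids: Sequence[str], batch_size: int = 20
) -> Iterator[List[Any]]:
    """Call reader.load_data once per batch and yield each batch of documents.

    `reader` is any object exposing load_data(file_ids=...), such as the
    BoxReader created above. Hypothetical helper, not part of the package.
    """
    for batch in chunked(file_ids, batch_size):
        yield reader.load_data(file_ids=batch)
```

Passing the IDs returned by list_resources to load_in_batches lets you process one batch of documents at a time rather than holding the whole folder in memory.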
Read file content
Returns the binary content of a file.
from pathlib import Path

input_file: Path = Path("test_csv_id")
content = reader.read_file_content(input_file)
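Because the content comes back as raw bytes, decoding it is up to you. As a minimal sketch, assuming the file is a UTF-8 encoded CSV with a header row (an assumption about the file, not something the reader guarantees), the bytes can be parsed with the standard library. The helper name parse_csv_bytes is hypothetical.

```python
import csv
import io
from typing import Dict, List


def parse_csv_bytes(content: bytes, encoding: str = "utf-8") -> List[Dict[str, str]]:
    """Decode raw CSV bytes and return one dict per data row, keyed by header."""
    text = content.decode(encoding)
    # newline="" leaves line endings untouched so the csv module handles them.
    return list(csv.DictReader(io.StringIO(text, newline="")))
```

For other file types, the same pattern applies with a different parser in place of the csv module.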
Search resources
Searches for Box resources based on specified criteria and returns a list of their IDs.
This method utilizes the Box API search functionality to find resources matching the provided parameters. It then returns a list containing the IDs of the found resources.
Tip: Check out the Box search documentation for more information on how to use search.
query = "invoice"
resources = reader.search_resources(query=query)
Search resources by metadata
Searches for Box resources based on metadata and returns a list of their IDs.
This method utilizes the Box API search functionality to find resources matching the provided metadata query. It then returns a list containing the IDs of the found resources.
Tip: Check out the Box Metadata Query Language for more information on how to construct queries.
from_ = (
    "enterprise_1234"  # your enterprise scope (enterprise_<your enterprise id>)
    + "."
    + "rbInvoicePO"  # your metadata template key
)
ancestor_folder_id = "test_folder_invoice_po_id"
query = "documentType = :docType "
query_params = {"docType": "Invoice"}
resources = reader.search_resources_by_metadata(
    from_=from_,
    ancestor_folder_id=ancestor_folder_id,
    query=query,
    query_params=query_params,
)
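Two details of a metadata query are easy to get wrong. In the Box metadata query API, the from_ value joins the enterprise scope and the template key with a dot, and every :placeholder in the query needs a matching entry in query_params. Here is a minimal sketch of helpers that check both before the request is sent; build_from and check_query_params are hypothetical names, not part of the reader.

```python
import re
from typing import Any, Dict, Set

# Matches named placeholders such as ":docType" inside a query string.
_PLACEHOLDER = re.compile(r":([A-Za-z_][A-Za-z0-9_]*)")


def build_from(enterprise_id: str, template_key: str) -> str:
    """Return the '<scope>.<templateKey>' string used as from_."""
    return f"enterprise_{enterprise_id}.{template_key}"


def check_query_params(query: str, query_params: Dict[str, Any]) -> Set[str]:
    """Return the placeholder names used in `query`, raising if any lack a value."""
    names = set(_PLACEHOLDER.findall(query))
    missing = names - set(query_params)
    if missing:
        raise KeyError(f"No value for placeholder(s): {sorted(missing)}")
    return names
```

Calling check_query_params(query, query_params) right before search_resources_by_metadata turns a silent mismatch into an immediate, readable error.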
Get resource info
Get information about a specific resource.
resource = reader.get_resource_info(file_id="test_csv_id")
Working example
Consider the following code:
import os
from typing import List

import dotenv
from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient, File
from llama_index.readers.box import (
    BoxReader,
    BoxReaderTextExtraction,
    BoxReaderAIPrompt,
    BoxReaderAIExtract,
)
from llama_index.core.schema import Document


def get_box_client() -> BoxClient:
    dotenv.load_dotenv()

    # Common configurations
    client_id = os.getenv("BOX_CLIENT_ID", "YOUR_BOX_CLIENT_ID")
    client_secret = os.getenv("BOX_CLIENT_SECRET", "YOUR_BOX_CLIENT_SECRET")

    # CCG configurations
    enterprise_id = os.getenv("BOX_ENTERPRISE_ID", "YOUR_BOX_ENTERPRISE_ID")
    ccg_user_id = os.getenv("BOX_USER_ID")

    config = CCGConfig(
        client_id=client_id,
        client_secret=client_secret,
        enterprise_id=enterprise_id,
        user_id=ccg_user_id,
    )
    auth = BoxCCGAuth(config)
    if config.user_id:
        auth.with_user_subject(config.user_id)
    return BoxClient(auth)


def get_testing_data() -> dict:
    return {
        "disable_folder_tests": True,
        "test_folder_id": "273980493541",
        "test_doc_id": "1584054722303",
        "test_ppt_id": "1584056661506",
        "test_xls_id": "1584048916472",
        "test_pdf_id": "1584049890463",
        "test_json_id": "1584058432468",
        "test_csv_id": "1584054196674",
        "test_txt_waiver_id": "1514587167701",
        "test_folder_invoice_po_id": "261452450320",
        "test_folder_purchase_order_id": "261457585224",
        "test_txt_invoice_id": "1517629086517",
        "test_txt_po_id": "1517628697289",
    }


def print_docs(label: str, docs: List[Document]):
    print("------------------------------")
    print(f"{label}: {len(docs)} document(s)")
    for doc in docs:
        print("------------------------------")
        file = File.from_dict(doc.extra_info)
        print(f"File ID: {file.id}\nName: {file.name}\nSize: {file.size} bytes")
        # print("------------------------------")
        print(f"Text: {doc.text[:100]} ...")
    print("------------------------------\n\n\n")


def main():
    box_client = get_box_client()
    test_data = get_testing_data()

    reader = BoxReader(box_client=box_client)
    docs = reader.load_data(file_ids=[test_data["test_txt_invoice_id"]])
    print_docs("Box Simple Reader", docs)


if __name__ == "__main__":
    main()
Result:
------------------------------
Box Simple Reader: 1 document(s)
------------------------------
File ID: 1517629086517
Name: Invoice-Q2468.txt
Size: 176 bytes
Text: Vendor: Quasar Innovations
Invoice Number: Q2468
Purchase Order Number: 003
Line Items:
- Infini ...
------------------------------
Thoughts? Comments? Feedback?
Drop us a line in our community forum.