Introducing Box Llama-Index Reader

Rui Barbosa
Published in Box Developer Blog
5 min read, Aug 13, 2024


Ever wanted to integrate your Box documents into cutting-edge Retrieval Augmented Generation (RAG) and other large language model (LLM) applications?

The Box readers for Llama-Index bridge the gap between Box and Llama-Index, letting you use the rich content you store in Box directly within your LLM workflows.

Available Box readers

You might be surprised to learn that Box offers four different readers.

First, there is the traditional reader, similar to those for other cloud storage providers: you download the file and then parse it with a Llama-Index parser.

Because Box is not your traditional cloud storage provider, we can use built-in services, such as text extraction, generic AI prompting, and specialized structured data AI extraction.

  • Box Reader — Implementation of the SimpleReader interface to read files from Box
  • Box Text Extraction — Uses Box text representation to extract text from documents directly
  • Box AI Prompt — Uses Box AI to extract context from documents
  • Box AI Extraction — Uses Box AI to extract structured data from documents

Box AI features are only available to Enterprise Plus customers.

Getting started

The Box readers live in the LlamaIndex GitHub repository and can be installed like any other reader:

pip install llama-index-readers-box

Authentication

All Box readers require a BoxClient. This keeps the reader code simple and gives developers the most flexibility: you can authenticate with any method Box supports (Developer Token, OAuth 2.0, Client Credentials Grant, or JSON Web Token) and pass the resulting client to the reader.

Here are some examples:

Using CCG authentication

from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient
from llama_index.readers.box import BoxReader

config = CCGConfig(
    client_id="your_client_id",
    client_secret="your_client_secret",
    enterprise_id="your_enterprise_id",
    user_id="your_ccg_user_id",  # Optional user id
)
auth = BoxCCGAuth(config)
if config.user_id:
    # with_user_subject returns a new auth object, so keep the result
    auth = auth.with_user_subject(config.user_id)

client = BoxClient(auth)

reader = BoxReader(box_client=client)

By default the CCG client will use a service account associated with the application. Depending on how the files are shared, the service account may not have access to all the files.

If you want to select a different user, you can specify the user ID. In this case make sure your application can impersonate and/or generate user tokens in the scope.

Check out this guide for more information on how to set up CCG: Box CCG Guide
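Hardcoding credentials as in the snippet above is fine for a quick test, but they are usually better read from the environment. The following is a minimal sketch, not part of the reader package: `ccg_settings_from_env` is a hypothetical helper, and the variable names (`BOX_CLIENT_ID`, `BOX_CLIENT_SECRET`, `BOX_ENTERPRISE_ID`, `BOX_USER_ID`) are the ones used in the working example later in this post. It assumes CCG needs at least one of enterprise ID or user ID.

```python
import os
from typing import Mapping, Optional


def ccg_settings_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect keyword arguments for CCGConfig from environment-style variables.

    Hypothetical helper: it only builds a plain dict and never talks to Box.
    """
    env = os.environ if env is None else env

    # Client ID and secret are always needed.
    missing = [k for k in ("BOX_CLIENT_ID", "BOX_CLIENT_SECRET") if not env.get(k)]
    if missing:
        raise ValueError(f"Missing required settings: {', '.join(missing)}")

    # Treat empty strings as "not set".
    enterprise_id = env.get("BOX_ENTERPRISE_ID") or None
    user_id = env.get("BOX_USER_ID") or None
    if enterprise_id is None and user_id is None:
        raise ValueError("Set BOX_ENTERPRISE_ID or BOX_USER_ID")

    return {
        "client_id": env["BOX_CLIENT_ID"],
        "client_secret": env["BOX_CLIENT_SECRET"],
        "enterprise_id": enterprise_id,
        "user_id": user_id,
    }
```

With the SDK installed, the result can be unpacked straight into the configuration, for example `config = CCGConfig(**ccg_settings_from_env())`.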

Using JWT authentication

from box_sdk_gen import JWTConfig, BoxJWTAuth, BoxClient
from llama_index.readers.box import BoxReader

# Option 1: manual configuration
config = JWTConfig(
    client_id="YOUR_BOX_CLIENT_ID",
    client_secret="YOUR_BOX_CLIENT_SECRET",
    jwt_key_id="YOUR_BOX_JWT_KEY_ID",
    private_key="YOUR_BOX_PRIVATE_KEY",
    private_key_passphrase="YOUR_BOX_PRIVATE_KEY_PASSPHRASE",
    enterprise_id="YOUR_BOX_ENTERPRISE_ID",
    user_id="YOUR_BOX_USER_ID",
)

# Option 2: configuration file (use this instead of option 1, not in addition)
config = JWTConfig.from_config_file("path/to/your/.config.json")

user_id = "1234"  # Optional user id
if user_id:
    # Authenticate as this user rather than as the service account
    config.user_id = user_id
    config.enterprise_id = None
auth = BoxJWTAuth(config)

client = BoxClient(auth)

reader = BoxReader(box_client=client)

By default the JWT client will use a service account associated with the application. Depending on how the files are shared, the service account may not have access to all the files.

If you want to select a different user, you can specify the user ID. In this case make sure your application can impersonate and/or generate user tokens in the scope.

Check out this guide for more information on how to set up JWT: Box JWT Guide

JWT authentication requires extra dependencies in the SDK. You can install them by running: pip install "box-sdk-gen[jwt]"

Box Reader

This loader reads files from Box using the LlamaIndex SimpleDirectoryReader, and does not take advantage of any Box-specific features.

Load data

This method retrieves Box files based on the provided parameters and processes them into a structured format using a SimpleDirectoryReader.

  • folder_id (Optional[str]): The ID of the Box folder to load data from. If is_recursive is also True, data from sub-folders is retrieved as well. Defaults to None.
  • file_ids (Optional[List[str]]): A list of Box file IDs to load data from. If provided, folder_id is ignored. Defaults to None.
  • is_recursive (bool): If True and folder_id is provided, retrieves data from sub-folders within the specified folder. Defaults to False.

A Box folder can hold an overwhelming number of files and sub-folders, at which point loading it in a single call, especially recursively, becomes impractical.

Usage:

# Using CCG authentication

from llama_index.readers.box import BoxReader
from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient

ccg_conf = CCGConfig(
    client_id="your_client_id",
    client_secret="your_client_secret",
    enterprise_id="your_enterprise_id",
    user_id="your_ccg_user_id",  # optional
)
auth = BoxCCGAuth(ccg_conf)
box_client = BoxClient(auth)

reader = BoxReader(box_client=box_client)

# Using a list of file ids
docs = reader.load_data(file_ids=["test_csv_id"])

# Using a folder id
docs = reader.load_data(folder_id="test_folder_id")
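When there are many files, one way to keep things manageable is to load them in small batches of file IDs rather than a whole folder at once. The sketch below is an illustration only: `chunked` and `load_in_batches` are hypothetical helpers, the batch size of 20 is arbitrary, and the only thing assumed about the reader is the `load_data(file_ids=...)` call shown above.

```python
from typing import Any, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def load_in_batches(
    reader: Any, file_ids: Sequence[str], batch_size: int = 20
) -> Iterator[List[Any]]:
    """Call reader.load_data once per batch and yield each batch of documents.

    Because this is a generator, only one batch of documents is produced at a
    time, so the caller can index or discard it before the next one is loaded.
    """
    for batch in chunked(file_ids, batch_size):
        yield reader.load_data(file_ids=batch)
```

A caller would iterate over it, for example `for docs in load_in_batches(reader, file_ids): ...`, and process each batch before moving on.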

Load resource

Load data from a specific resource.

  • resource (str): The resource identifier (file_id)

Usage:

doc = reader.load_resource("test_csv_id")

List resources

Lists the IDs of Box files based on the specified folder or file IDs.

This method retrieves a list of Box file identifiers based on the provided parameters. You can either specify a list of file IDs or a folder ID with an optional is_recursive flag to include files from sub-folders as well.

  • folder_id (Optional[str]): The ID of the Box folder to list files from. If is_recursive is also True, files in sub-folders are listed as well. Defaults to None.
  • file_ids (Optional[List[str]]): A list of Box file IDs to list. If provided, folder_id is ignored. Defaults to None.
  • is_recursive (bool): If True and folder_id is provided, includes files from sub-folders within the specified folder. Defaults to False.

Usage:

resources = reader.list_resources(file_ids=["test_csv_id"])

Read file content

Returns the binary content of a file.

from pathlib import Path

input_file: Path = Path("test_csv_id")
content = reader.read_file_content(input_file)
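Since the content comes back as bytes, it has to be decoded before use. As a small illustration for a CSV file, the sketch below turns those bytes into a list of row dictionaries using only the standard library. `csv_rows_from_bytes` is a hypothetical helper, and UTF-8 encoding is an assumption about the file, not something the reader guarantees.

```python
import csv
import io
from typing import Dict, List


def csv_rows_from_bytes(content: bytes, encoding: str = "utf-8-sig") -> List[Dict[str, str]]:
    """Decode raw CSV bytes and return one dict per row, keyed by the header.

    "utf-8-sig" decodes plain UTF-8 and also drops a leading byte order mark
    if the file has one.
    """
    text = content.decode(encoding)
    # newline="" lets the csv module handle line endings itself.
    return list(csv.DictReader(io.StringIO(text, newline="")))
```

For example, `rows = csv_rows_from_bytes(content)` would work on the bytes returned by read_file_content for a UTF-8 CSV file.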

Search resources

Searches for Box resources based on specified criteria and returns a list of their IDs.

This method utilizes the Box API search functionality to find resources matching the provided parameters. It then returns a list containing the IDs of the found resources.

Tip: Check out the Box search documentation for more information on how search works.

query = "invoice"
resources = reader.search_resources(query=query)
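Because the search returns file IDs, its result can be passed straight to load_data. The following sketch chains the two calls. `search_and_load` and its `limit` parameter are hypothetical and not part of the reader; the sketch relies only on `search_resources(query=...)` and `load_data(file_ids=...)` as described in this post.

```python
from typing import Any, List


def search_and_load(reader: Any, query: str, limit: int = 10) -> List[Any]:
    """Search Box for `query` and load at most `limit` of the matching files."""
    file_ids = list(reader.search_resources(query=query))
    if not file_ids:
        # Nothing matched, so skip the load call entirely.
        return []
    return reader.load_data(file_ids=file_ids[:limit])
```

Capping the number of IDs keeps a broad query such as "invoice" from downloading every match in the enterprise.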

Search resources by metadata

Searches for Box resources based on metadata and returns a list of their IDs.

This method utilizes the Box API search functionality to find resources matching the provided metadata query. It then returns a list containing the IDs of the found resources.

Tip: Check out the Box Metadata Query Language for more information on how to construct queries.

from_ = (
    "enterprise_1234"  # your enterprise scope: enterprise_<your enterprise id>
    + "."
    + "rbInvoicePO"  # your metadata template key
)
ancestor_folder_id = "test_folder_invoice_po_id"
query = "documentType = :docType "
query_params = {"docType": "Invoice"}

resources = reader.search_resources_by_metadata(
    from_=from_,
    ancestor_folder_id=ancestor_folder_id,
    query=query,
    query_params=query_params,
)
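Two details of a metadata query are easy to get wrong: the from_ value, which in Box metadata queries takes the form `enterprise_<enterprise id>.<template key>`, and the `:name` placeholders in the query, each of which needs a matching entry in query_params. The sketch below checks both before any request is sent. `metadata_from` and `missing_query_params` are hypothetical helpers, not part of the reader.

```python
import re
from typing import Dict, Set


def metadata_from(enterprise_id: str, template_key: str) -> str:
    """Build the from_ value: enterprise scope and template key joined by a dot."""
    return f"enterprise_{enterprise_id}.{template_key}"


def missing_query_params(query: str, query_params: Dict[str, object]) -> Set[str]:
    """Return the placeholder names used in `query` that `query_params` lacks."""
    placeholders = set(re.findall(r":([A-Za-z_][A-Za-z0-9_]*)", query))
    return placeholders - set(query_params)
```

With the values from the example above, `metadata_from("1234", "rbInvoicePO")` gives `"enterprise_1234.rbInvoicePO"`, and `missing_query_params("documentType = :docType ", {"docType": "Invoice"})` gives an empty set.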

Get resource info

Get information about a specific resource.

resource = reader.get_resource_info(file_id="test_csv_id")

Working example

Consider the following code:

import os
from typing import List
import dotenv

from box_sdk_gen import CCGConfig, BoxCCGAuth, BoxClient, File
from llama_index.readers.box import (
    BoxReader,
    BoxReaderTextExtraction,
    BoxReaderAIPrompt,
    BoxReaderAIExtract,
)
from llama_index.core.schema import Document


def get_box_client() -> BoxClient:
    dotenv.load_dotenv()

    # Common configurations
    client_id = os.getenv("BOX_CLIENT_ID", "YOUR_BOX_CLIENT_ID")
    client_secret = os.getenv("BOX_CLIENT_SECRET", "YOUR_BOX_CLIENT_SECRET")

    # CCG configurations
    enterprise_id = os.getenv("BOX_ENTERPRISE_ID", "YOUR_BOX_ENTERPRISE_ID")
    ccg_user_id = os.getenv("BOX_USER_ID")

    config = CCGConfig(
        client_id=client_id,
        client_secret=client_secret,
        enterprise_id=enterprise_id,
        user_id=ccg_user_id,
    )

    auth = BoxCCGAuth(config)
    if config.user_id:
        # with_user_subject returns a new auth object, so keep the result
        auth = auth.with_user_subject(config.user_id)

    return BoxClient(auth)


def get_testing_data() -> dict:
    return {
        "disable_folder_tests": True,
        "test_folder_id": "273980493541",
        "test_doc_id": "1584054722303",
        "test_ppt_id": "1584056661506",
        "test_xls_id": "1584048916472",
        "test_pdf_id": "1584049890463",
        "test_json_id": "1584058432468",
        "test_csv_id": "1584054196674",
        "test_txt_waiver_id": "1514587167701",
        "test_folder_invoice_po_id": "261452450320",
        "test_folder_purchase_order_id": "261457585224",
        "test_txt_invoice_id": "1517629086517",
        "test_txt_po_id": "1517628697289",
    }


def print_docs(label: str, docs: List[Document]):
    print("------------------------------")
    print(f"{label}: {len(docs)} document(s)")

    for doc in docs:
        print("------------------------------")
        # The Box file metadata travels with each document in extra_info
        file = File.from_dict(doc.extra_info)
        print(f"File ID: {file.id}\nName: {file.name}\nSize: {file.size} bytes")
        # print("------------------------------")
        print(f"Text: {doc.text[:100]} ...")
    print("------------------------------\n\n\n")


def main():
    box_client = get_box_client()
    test_data = get_testing_data()

    reader = BoxReader(box_client=box_client)
    docs = reader.load_data(file_ids=[test_data["test_txt_invoice_id"]])
    print_docs("Box Simple Reader", docs)


if __name__ == "__main__":
    main()

Result:

------------------------------
Box Simple Reader: 1 document(s)
------------------------------
File ID: 1517629086517
Name: Invoice-Q2468.txt
Size: 176 bytes
Text: Vendor: Quasar Innovations
Invoice Number: Q2468
Purchase Order Number: 003
Line Items:
- Infini ...
------------------------------

Thoughts? Comments? Feedback?

Drop us a line in our community forum.
