Introducing the LangChain Box loader

Published in Box Developer Blog, Sep 5, 2024

LangChain document loaders are components that allow developers to integrate data from various sources into applications that use large language models (LLMs).

They convert data into standardized Document objects that contain extracted text and metadata, such as the author's name or publication date.

This allows developers to manage and standardize content for LLM workflows, and to expand the capabilities of their applications beyond the models’ existing knowledge bases.

With the release of a Box loader for LangChain, developers will be able to easily integrate data from documents stored in Box.

Let’s explore how this loader works.

Installation

Installing the Box loader is as simple as:

pip install -U langchain-box

Set up

To use the LangChain Box loader, you will need:

A Box account: If you are not a current Box customer, or want to test outside of your Box production instance, you can use a free developer account.
A Box app: This is configured in the developer console and, for Box AI, must have the Manage AI scope enabled. This is also where you select your authentication method.
The app must be enabled by the administrator. For free developer accounts, this is whoever signed up for the account.

Authentication

Developers can choose any Box-supported authentication method, such as a developer token, Client Credentials Grant (CCG), or JSON Web Token (JWT).

The loader also accepts an authentication token obtained through any of the Box-supported methods.

For more information, or to learn about how to set up a Box application, check out the Box authentication guide.

Example using a developer token:

from langchain_box.document_loaders import BoxLoader
from langchain_box.utilities import BoxAuth, BoxAuthType

box_developer_token = "your developer token"

auth = BoxAuth(
  auth_type=BoxAuthType.TOKEN,
  box_developer_token=box_developer_token
)

loader = BoxLoader(
  box_auth=auth,
  …
)

You can also set your developer token as the BOX_DEVELOPER_TOKEN environment variable and call BoxLoader directly.

Example using CCG with user security context:

from langchain_box.document_loaders import BoxLoader
from langchain_box.utilities import BoxAuth, BoxAuthType

box_client_id = "your box client id"
box_client_secret = "your box client secret"
box_user_id = "your box user id"

auth = BoxAuth(
  auth_type=BoxAuthType.CCG,
  box_client_id=box_client_id,
  box_client_secret=box_client_secret,
  box_user_id=box_user_id
)

loader = BoxLoader(
  box_auth=auth,
  …
)

To use OAuth 2.0, you can set auth_type=BoxAuthType.TOKEN, and then pass the token you have obtained.
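Whichever token-based route you take, it can help to fail early when no token is available. The following is a minimal sketch that uses only the Python standard library. The helper name require_box_token is hypothetical and not part of langchain-box; the only detail taken from this post is the BOX_DEVELOPER_TOKEN variable name.

```python
import os


def require_box_token(env_var: str = "BOX_DEVELOPER_TOKEN") -> str:
    """Return the token stored in env_var, or raise a clear error.

    Hypothetical helper for illustration; not part of langchain-box.
    """
    # Treat a missing variable and a blank one the same way.
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set; export it or pass a token explicitly"
        )
    return token
```

The returned string could then be passed as box_developer_token, in the same way as the developer token example above.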

For more information on Box integration with LangChain, please take a look at the official LangChain documentation.

Box loader

The purpose of the BoxLoader class is to read unstructured data from a document stored in Box and return a LangChain Document.

You can pass either a list of file IDs or a folder ID. When using a folder, you can also specify whether to read its sub-folders recursively.

A Box instance can contain petabytes of files, and folders can contain millions of files. Be intentional about which folders you index. Folder ID 0 is your root folder, and we recommend never loading all files from folder 0 recursively.

Files without a text representation will be skipped.
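The warning about folder 0 can be enforced in code before a loader is ever created. This is a rough sketch, not a feature of the loader: check_folder_request is a hypothetical helper, and the numeric check assumes that Box folder IDs are strings of digits, as in the examples in this post.

```python
ROOT_FOLDER_ID = "0"  # per this post, folder ID 0 is the root folder


def check_folder_request(folder_id: str, recursive: bool = False) -> str:
    """Return the cleaned folder ID, rejecting a recursive crawl of the root.

    Hypothetical guard for illustration; not part of langchain-box.
    """
    cleaned = str(folder_id).strip()
    # Assumption: folder IDs are purely numeric strings.
    if not cleaned.isdigit():
        raise ValueError(f"folder ID should be numeric, got {folder_id!r}")
    if cleaned == ROOT_FOLDER_ID and recursive:
        raise ValueError("refusing to recursively load the root folder (ID 0)")
    return cleaned
```

A non-recursive read of folder 0 is still allowed here; only the combination of the root folder and recursive=True is rejected.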

Load files

To load files, you’ll need to provide a list of file IDs. Example:

from langchain_box.document_loaders import BoxLoader

box_developer_token = "your developer token"
box_file_ids = ["1514555423624", "1514553902288"]

loader = BoxLoader(
  box_developer_token=box_developer_token,
  box_file_ids=box_file_ids,
  character_limit=10000, # Optional. Defaults to no limit
)

Load from folder

Alternatively, you can provide a folder ID. Example:

from langchain_box.document_loaders import BoxLoader

box_folder_id = "260932470532"

# No auth is passed here, so this relies on the BOX_DEVELOPER_TOKEN environment variable
loader = BoxLoader(
  box_folder_id=box_folder_id,
  recursive=False, # Optional. return entire tree, defaults to False
  character_limit=10000, # Optional. Defaults to no limit
)

Loading the documents

From here, call the load() or lazy_load() method to obtain LangChain Document objects.

Example:

docs = loader.load()
docs[0]

Resulting in:

metadata={
'source':'https://dl.boxcloud.com/api/2.0/internal_files/1514555423624/versions/1663171610024/representations/extracted_text/content/',
'title': 'Invoice-A5555_txt'
},
page_content='
Vendor: AstroTech Solutions
Invoice Number: A5555
Line Items:
- Gravitational Wave Detector Kit: $800
- Exoplanet Terrarium: $120\nTotal: $920'
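The source value in the metadata above appears to embed both the file ID and a version ID. If you need those back, for example to link a chunk to its Box file, a regular expression is one option. This sketch assumes the URL layout seen in this single example output, which is not a documented contract; parse_box_source is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Assumption: the layout matches the example output shown in this post.
_SOURCE_PATTERN = re.compile(r"/internal_files/(\d+)/versions/(\d+)/")


def parse_box_source(source: str) -> Optional[Tuple[str, str]]:
    """Return (file_id, version_id) from a source URL, or None if no match."""
    match = _SOURCE_PATTERN.search(source)
    if match is None:
        return None
    return match.group(1), match.group(2)


example_source = (
    "https://dl.boxcloud.com/api/2.0/internal_files/1514555423624"
    "/versions/1663171610024/representations/extracted_text/content/"
)
print(parse_box_source(example_source))
# ('1514555423624', '1663171610024')
```

With a loaded document, the call would look like parse_box_source(docs[0].metadata["source"]). Returning None instead of raising keeps the helper safe if the URL format differs.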

Lazy load example:

page = []

for doc in loader.lazy_load():
  page.append(doc)

  if len(page) >= 10:
    # do some paged operation, e.g.
    # index.upsert(page)
    page = []
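Note that the loop above only handles full pages of ten: any documents still in page when the loader is exhausted need one more flush. One way to avoid that bookkeeping is a small generator that yields every page, including the final partial one. The helper name paged is hypothetical, and the demo uses plain strings as stand-ins for documents so that it runs without a Box account.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def paged(items: Iterable[T], page_size: int) -> Iterator[List[T]]:
    """Yield lists of up to page_size items until the iterable is exhausted."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls at most page_size items without consuming more.
        page = list(islice(iterator, page_size))
        if not page:
            return
        yield page


# Stand-in for loader.lazy_load(): 25 fake documents.
fake_docs = (f"doc-{i}" for i in range(25))
print([len(page) for page in paged(fake_docs, 10)])
# [10, 10, 5]
```

In real use, the loop would read for page in paged(loader.lazy_load(), 10), with the paged operation applied to each page.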

Conclusion

This tool is particularly useful for organizations that store large amounts of unstructured data in Box and want to leverage that data within LLM-driven applications. The loader simplifies the process of integrating Box data into LangChain workflows, enabling more efficient data management and processing.

Potential use cases include:

Finance teams: Automate document processing and analysis, such as extracting invoice or contract details from documents stored in Box.
Content management: Enhance content retrieval and management for applications that need to access large archives of documents.
AI-powered applications: Integrate Box-stored data with LLMs to build more intelligent and context-aware applications.

In order to use the Box AI Platform API endpoints, you must be an Enterprise Plus customer. You must have an application created in the developer console with the appropriate Box AI scope, and your Box instance must have Box AI enabled. Currently, metadata extraction endpoints are in private beta. To use these, you will need to contact your account team.

🦄 Want to engage with other Box Platform champions?

Join our Box Developer Community for support and knowledge sharing!