Box AI-driven Metadata extraction

Rui Barbosa
Published in Box Developer Blog
7 min read · Feb 28, 2024

Image by jemastock on Freepik

In the ever-evolving landscape of enterprise documents, metadata plays a pivotal role in how you organize, discover, and extract value from your unstructured content.

In this workshop, we’ll dive deep into metadata templates, creation, and extraction using the Box Platform API.

Please note: The Box AI-driven metadata extraction API pricing and packaging are to be announced, and capabilities are subject to change. All other metadata features discussed in this article are already available.

You can follow along with this workshop using the complete code samples in this GitHub Repo. We are using the Box Platform Next Gen Python SDK.

Let’s get started.

Concepts

When working with Box Metadata there are quite a few concepts we need to keep in mind.

  • The Metadata Template represents the structure of the data you want to associate with an unstructured document. Templates are created at the enterprise level.
  • A Metadata Instance represents the association of a template with a document or folder. You can have instances from multiple templates associated with a single document.
  • Administrators can even create Metadata Cascade Policies, allowing a metadata instance to automatically be applied to the contents of a folder.
    For example, a user might assign the same invoiceData metadata template to a project folder allowing it to automatically apply to all the files and folders within that project folder.

The advantage of using metadata to provide structure to your content is that it makes it much easier to create processes, integrations, and workflows. Search results using metadata-based queries are much more accurate when compared to those derived from traditional search queries.

Use case

To illustrate the usage of metadata, consider a simplified procurement process for an enterprise:

  • A company issues a purchase order to a vendor for goods or services
  • The vendor completes the order and sends an invoice to the company
  • The company matches the invoice to the purchase order
  • The company verifies invoice, purchase order, goods or services received, and (if all checks out) issues a payment to the vendor

Typically, the vendor includes the purchase order number in the invoice, so the company can more easily match the two together, complete the process, and pay the vendor. We’ll use the Box Metadata APIs to extract metadata from the invoice and purchase order, and figure out which ones don’t match.

In this example, we have 5 purchase orders and 5 matching invoices. However, invoice A5555 does not include its purchase order number.
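
Before any Box API is involved, the matching step itself can be sketched in plain Python. This is only a minimal illustration: the `find_unmatched_invoices` helper and the dictionary layout are hypothetical, and the sample values mirror the invoices used later in this workshop.

```python
from typing import Dict, List, Optional


def find_unmatched_invoices(
    invoices: List[Dict[str, Optional[str]]], po_numbers: List[str]
) -> List[Optional[str]]:
    """Return the invoice numbers that do not reference a known purchase order."""
    known = set(po_numbers)
    unmatched = []
    for invoice in invoices:
        # A missing or unrecognized PO number means the invoice cannot be matched
        if invoice.get("purchaseOrderNumber") not in known:
            unmatched.append(invoice["invoiceNumber"])
    return unmatched


invoices = [
    {"invoiceNumber": "A5555", "purchaseOrderNumber": None},
    {"invoiceNumber": "B1234", "purchaseOrderNumber": "001"},
    {"invoiceNumber": "C9876", "purchaseOrderNumber": "002"},
]
po_numbers = ["001", "002", "003", "004", "005"]

print(find_unmatched_invoices(invoices, po_numbers))  # ['A5555']
```

The rest of the workshop gets to the same answer by storing these values as Box metadata and querying them.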

Create a metadata template

To work with metadata, we need a metadata template to define the metadata fields we want to use.

def create_invoice_po_template(
    client: Client, template_key: str, display_name: str
) -> MetadataTemplate:
    """Create a metadata template"""

    scope = "enterprise"

    fields = []

    # Document type
    fields.append(
        CreateMetadataTemplateFields(
            type=CreateMetadataTemplateFieldsTypeField.ENUM,
            key="documentType",
            display_name="Document Type",
            description="Identifies document as an invoice or purchase order",
            options=[
                CreateMetadataTemplateFieldsOptionsField(key="Invoice"),
                CreateMetadataTemplateFieldsOptionsField(key="Purchase Order"),
            ],
        )
    )

    # Date
    fields.append(
        CreateMetadataTemplateFields(
            type=CreateMetadataTemplateFieldsTypeField.DATE,
            key="documentDate",
            display_name="Document Date",
        )
    )

    # Document total, stored as a string because the values applied later
    # in this workshop are strings such as "$575" or "Unknown"
    fields.append(
        CreateMetadataTemplateFields(
            type=CreateMetadataTemplateFieldsTypeField.STRING,
            key="total",
            display_name="Document Total",
            description="Total USD value of document",
        )
    )

    # Vendor
    fields.append(
        CreateMetadataTemplateFields(
            type=CreateMetadataTemplateFieldsTypeField.STRING,
            key="vendor",
            display_name="Vendor",
            description="Vendor name or designation",
        )
    )

    # Invoice number (key matches the one used when applying and querying metadata)
    fields.append(
        CreateMetadataTemplateFields(
            type=CreateMetadataTemplateFieldsTypeField.STRING,
            key="invoiceNumber",
            display_name="Invoice #",
            description="Document number or associated invoice",
        )
    )

    # PO number (key matches the one used when applying and querying metadata)
    fields.append(
        CreateMetadataTemplateFields(
            type=CreateMetadataTemplateFieldsTypeField.STRING,
            key="purchaseOrderNumber",
            display_name="PO #",
            description="Document number or associated purchase order",
        )
    )

    template = client.metadata_templates.create_metadata_template(
        scope=scope,
        template_key=template_key,
        display_name=display_name,
        fields=fields,
    )

    return template

In the main function, let’s check if the template already exists and if not, create it.

def main():
    ...

    # check if template exists
    template_key = "rbInvoicePO"
    template_display_name = "RB: Invoice & POs"
    template = get_template_by_key(client, template_key)

    if template:
        print(
            f"\nMetadata template exists: '{template.display_name}' ",
            f"[{template.id}]",
        )
    else:
        print("\nMetadata template does not exist, creating...")

        # create a metadata template
        template = create_invoice_po_template(
            client, template_key, template_display_name
        )
        print(
            f"\nMetadata template created: '{template.display_name}' ",
            f"[{template.id}]",
        )

This results in:

Metadata template does not exist, creating...
Metadata template created: 'RB: Invoice & POs' [2257ed5b-c4c3-48b1-9881-875b5291ddfa]
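
The `get_template_by_key` helper called above is not reproduced in this article. The check-then-create flow itself can be sketched independently of the SDK. In this hypothetical version, `fetch` and `create` are stand-in callables rather than Box SDK functions, and `fetch` is assumed to return `None` when the template does not exist.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def get_or_create(
    fetch: Callable[[], Optional[T]], create: Callable[[], T]
) -> Tuple[T, bool]:
    """Return (resource, created), creating the resource only when it is missing."""
    existing = fetch()
    if existing is not None:
        return existing, False
    return create(), True


# In-memory stand-in for the enterprise's template registry (illustration only)
templates = {}


def fetch_template():
    return templates.get("rbInvoicePO")


def create_template():
    templates["rbInvoicePO"] = {"displayName": "RB: Invoice & POs"}
    return templates["rbInvoicePO"]


print(get_or_create(fetch_template, create_template)[1])  # True: created on the first run
print(get_or_create(fetch_template, create_template)[1])  # False: found on the second run
```

Running the script twice should therefore never attempt to create the same template key a second time.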

Scanning the content using Box AI Metadata Extraction

Let’s create a method to scan the content and get metadata suggestions:

def get_metadata_suggestions_for_file(
    client: Client, file_id: str, enterprise_scope: str, template_key: str
) -> IntelligenceMetadataSuggestions:
    """Get metadata suggestions for a file"""
    return client.intelligence.intelligence_metadata_suggestion(
        item=file_id,
        scope=enterprise_scope,
        template_key=template_key,
        confidence="experimental",
    )

Next in our main function we iterate through the files to scan the content and get metadata suggestions:

def main():
    ...

    # Scan the purchase folder for metadata suggestions
    folder_items = client.folders.get_folder_items(PO_FOLDER)
    for item in folder_items.entries:
        print(f"\nItem: {item.name} [{item.id}]")
        suggestions = get_metadata_suggestions_for_file(
            client, item.id, ENTERPRISE_SCOPE, template_key
        )
        print(f"Suggestions: {suggestions.suggestions}")

Your results may vary, but in my case:

Item: PO-001.txt [1443731848797]
Suggestions: {'documentType': 'Purchase Order', 'documentDate': '2024-02-13T00:00:00.000Z', 'vendor': 'Galactic Gizmos Inc.', 'invoiceNumber': None, 'purchaseOrderNumber': '001', 'total': '$575'}
Item: PO-002.txt [1443739645222]
Suggestions: {'documentType': 'Purchase Order', 'documentDate': '2024-02-13T00:00:00.000Z', 'total': '$230', 'vendor': 'Cosmic Contraptions Ltd.', 'invoiceNumber': None, 'purchaseOrderNumber': '002'}
Item: PO-003.txt [1443724777261]
Suggestions: {'documentType': 'Purchase Order', 'documentDate': '2024-02-13T00:00:00.000Z', 'total': '1,050', 'vendor': 'Quasar Innovations'}
Item: PO-004.txt [1443739415948]
Suggestions: {'documentType': 'Purchase Order', 'documentDate': '2024-02-13T00:00:00.000Z', 'vendor': 'AstroTech Solutions', 'invoiceNumber': None, 'purchaseOrderNumber': '004', 'total': '920'}
Item: PO-005.txt [1443724550074]
Suggestions: {'documentType': 'Purchase Order', 'documentDate': '2024-02-13T00:00:00.000Z', 'vendor': 'Quantum Quirks Co.', 'invoiceNumber': None, 'purchaseOrderNumber': '005'}
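
Notice that the suggested totals are not formatted consistently ('$575', '1,050', '920'). If you want to compare totals across documents, one option is to normalize them first. The `parse_total` helper below is a hypothetical sketch, not part of the Box SDK; it assumes positive amounts that use a dot as the decimal separator.

```python
import re
from typing import Optional


def parse_total(raw: Optional[str]) -> Optional[float]:
    """Convert a suggested total such as '$575' or '1,050' to a float."""
    if not raw:
        return None
    # Keep digits and the decimal point; drop currency symbols and separators
    cleaned = re.sub(r"[^0-9.]", "", raw)
    try:
        return float(cleaned)
    except ValueError:
        # Nothing numeric was left, for example for the placeholder "Unknown"
        return None


for raw in ["$575", "$230", "1,050", "920", None]:
    print(raw, "->", parse_total(raw))
```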

Updating the content metadata

Now that we have the suggestions for the metadata, let’s update the content metadata with the suggestions.

There are 3 things to consider here:

  • We may not get a suggestion for all the fields, or we may get a “None” value. In this case, we first set a default value and then merge the suggestions.
  • The metadata template may not have yet been associated with the document, so we may get an error when trying to update the metadata.
  • Updating metadata differs from a typical field update: the request body is a list of JSON Patch operations, such as add, replace, remove, test, move, and copy.
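
The first and last points can be sketched without the SDK. This is a minimal illustration with plain dictionaries: `merge_with_defaults` and `build_patch` are hypothetical helper names, and the operation list only shows the general shape of an update body (one operation per field, with the field key as the path).

```python
import json
from typing import Any, Dict, List, Optional


def merge_with_defaults(
    defaults: Dict[str, str], suggestions: Dict[str, Optional[str]]
) -> Dict[str, str]:
    """Drop empty suggestions, then let the remaining ones override the defaults."""
    cleaned = {key: value for key, value in suggestions.items() if value}
    return {**defaults, **cleaned}


def build_patch(data: Dict[str, str], op: str = "add") -> List[Dict[str, Any]]:
    """Express each field as one operation with a '/<key>' path."""
    return [{"op": op, "path": f"/{key}", "value": value} for key, value in data.items()]


defaults = {"vendor": "Unknown", "purchaseOrderNumber": "Unknown"}
suggestions = {"vendor": "Galactic Gizmos Inc.", "purchaseOrderNumber": None}

merged = merge_with_defaults(defaults, suggestions)
print(json.dumps(build_patch(merged), indent=2))
```

Here the `None` suggestion is discarded, so `purchaseOrderNumber` keeps its "Unknown" default while `vendor` takes the suggested value.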

Here is an example of a method to update the content metadata:

import logging
from typing import Dict


def apply_template_to_file(
    client: Client, file_id: str, template_key: str, data: Dict[str, str]
):
    """Apply a metadata template to a file"""
    default_data = {
        "documentType": "Unknown",
        "documentDate": "1900-01-01T00:00:00Z",
        "total": "Unknown",
        "vendor": "Unknown",
        "invoiceNumber": "Unknown",
        "purchaseOrderNumber": "Unknown",
    }
    # remove empty values
    data = {k: v for k, v in data.items() if v}
    # Merge the default data with the suggestions
    data = {**default_data, **data}

    try:
        # First try to create the metadata instance on the file
        client.file_metadata.create_file_metadata_by_id(
            file_id=file_id,
            scope=CreateFileMetadataByIdScope.ENTERPRISE,
            template_key=template_key,
            request_body=data,
        )
    except APIException as error_a:
        if error_a.status == 409:
            # The instance already exists, so update the metadata instead
            update_data = []
            for key, value in data.items():
                update_item = UpdateFileMetadataByIdRequestBody(
                    op=UpdateFileMetadataByIdRequestBodyOpField.ADD,
                    path=f"/{key}",
                    value=value,
                )
                update_data.append(update_item)
            try:
                client.file_metadata.update_file_metadata_by_id(
                    file_id=file_id,
                    scope=UpdateFileMetadataByIdScope.ENTERPRISE,
                    template_key=template_key,
                    request_body=update_data,
                )
            except APIException as error_b:
                logging.error(
                    f"Error updating metadata: {error_b.status}:{error_b.code}:{file_id}"
                )
        else:
            raise error_a

Next, we’ll update the following code in the main function to store the content metadata:

def main():
    ...

    # Scan the purchase folder for metadata suggestions
    folder_items = client.folders.get_folder_items(PO_FOLDER)
    for item in folder_items.entries:
        print(f"\nItem: {item.name} [{item.id}]")
        suggestions = get_metadata_suggestions_for_file(
            client, item.id, ENTERPRISE_SCOPE, template_key
        )
        print(f"Suggestions: {suggestions.suggestions}")
        metadata = suggestions.suggestions
        apply_template_to_file(
            client,
            item.id,
            template_key,
            metadata,
        )

If you check the metadata for the purchase orders, you should see the metadata updated with the suggestions:

Metadata captured and applied to a purchase order

Applying metadata to invoices

Add the following code to the main function to scan and apply the metadata to the invoices:

def main():
    ...

    # Scan the invoice folder for metadata suggestions
    folder_items = client.folders.get_folder_items(INVOICE_FOLDER)
    for item in folder_items.entries:
        print(f"\nItem: {item.name} [{item.id}]")
        suggestions = get_metadata_suggestions_for_file(
            client, item.id, ENTERPRISE_SCOPE, template_key
        )
        print(f"Suggestions: {suggestions.suggestions}")
        metadata = suggestions.suggestions
        apply_template_to_file(
            client,
            item.id,
            template_key,
            metadata,
        )

Resulting in:

Item: Invoice-A5555.txt [1443738625223]
Suggestions: {'documentType': 'Invoice', 'invoiceNumber': 'A5555',
'total': '920'}

Item: Invoice-B1234.txt [1443724064462]
Suggestions: {'documentType': 'Invoice', 'documentDate': None,
'total': '575', 'vendor': 'Galactic Gizmos Inc.', 'invoiceNumber': 'B1234',
'purchaseOrderNumber': '001'}

Item: Invoice-C9876.txt [1443729681339]
Suggestions: {'documentType': 'Invoice', 'invoiceNumber': 'C9876',
'purchaseOrderNumber': '002', 'total': '$230',
'vendor': 'Cosmic Contraptions Ltd.'}

...

Getting metadata for a file

We can directly get the metadata for a file using the following method:

def get_file_metadata(client: Client, file_id: str, template_key: str):
    """Get file metadata"""
    metadata = client.file_metadata.get_file_metadata_by_id(
        file_id=file_id,
        # use the scope enum that belongs to the "get" call
        scope=GetFileMetadataByIdScope.ENTERPRISE,
        template_key=template_key,
    )
    return metadata

Let’s test with the ID of one of the files we just updated:

def main():
    ...

    # get metadata for a file
    metadata = get_file_metadata(client, "1443738625223", template_key)
    print(f"\nMetadata for file: {metadata.extra_data}")

Resulting in:

Metadata for file: {'invoiceNumber': 'A5555', 'vendor': 'Unknown', 
'documentType': 'Invoice', 'documentDate': '1900-01-01T00:00:00.000Z',
'purchaseOrderNumber': 'Unknown', 'total': '920'}
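
Because missing values were stored as placeholders, a small check can list which fields still need attention. The sketch below is illustrative only: `fields_needing_review` is a hypothetical helper, and it assumes the placeholders are the "Unknown" string and the 1900-01-01 date used when the metadata was applied.

```python
from typing import Dict, List

PLACEHOLDER_TEXT = "Unknown"
PLACEHOLDER_DATE_PREFIX = "1900-01-01"


def fields_needing_review(metadata: Dict[str, str]) -> List[str]:
    """Return the keys whose values are still the default placeholders."""
    pending = []
    for key, value in metadata.items():
        # The date comes back with milliseconds, so compare on the date prefix
        if value == PLACEHOLDER_TEXT or value.startswith(PLACEHOLDER_DATE_PREFIX):
            pending.append(key)
    return sorted(pending)


metadata = {
    "invoiceNumber": "A5555",
    "vendor": "Unknown",
    "documentType": "Invoice",
    "documentDate": "1900-01-01T00:00:00.000Z",
    "purchaseOrderNumber": "Unknown",
    "total": "920",
}

print(fields_needing_review(metadata))
# ['documentDate', 'purchaseOrderNumber', 'vendor']
```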

Finding unmatched invoices

We may have invoices that do not have a matching purchase order. Let’s create a method to query our metadata:

from typing import Dict, List, Optional


def search_metadata(
    client: Client,
    template_key: str,
    folder_id: str,
    query: str,
    query_params: Dict[str, str],
    order_by: Optional[List[SearchByMetadataQueryOrderBy]] = None,
):
    """Search for files with metadata"""

    # The template is addressed as "<scope>.<templateKey>"
    from_ = ENTERPRISE_SCOPE + "." + template_key

    if order_by is None:
        order_by = [
            SearchByMetadataQueryOrderBy(
                field_key="invoiceNumber",
                direction=SearchByMetadataQueryOrderByDirectionField.ASC,
            )
        ]

    # Metadata fields must be requested explicitly to appear in the results
    fields = [
        "type",
        "id",
        "name",
        "metadata." + from_ + ".invoiceNumber",
        "metadata." + from_ + ".purchaseOrderNumber",
    ]

    search_result = client.search.search_by_metadata_query(
        from_=from_,
        query=query,
        query_params=query_params,
        ancestor_folder_id=folder_id,
        order_by=order_by,
        fields=fields,
    )
    return search_result

And in our main function, search for invoices that do not have a matching purchase order:

def main():
    ...

    # search for invoices without purchase orders
    query = "documentType = :docType AND purchaseOrderNumber = :poNumber"
    query_params = {"docType": "Invoice", "poNumber": "Unknown"}

    search_result = search_metadata(
        client, template_key, INVOICE_FOLDER, query, query_params
    )
    print(f"\nSearch results: {search_result.entries}")

Resulting in:

Search results: 
[{'metadata': {'enterprise_1133807781': {'rbInvoicePO':
{'$scope': 'enterprise_1133807781', '$template': 'rbInvoicePO',
'$parent': 'file_1443738625223', 'purchaseOrderNumber': 'Unknown',
'invoiceNumber': 'A5555', '$version': 11}}}, 'id': '1443738625223',
'type': 'file', 'etag': '3', 'name': 'Invoice-A5555.txt'}]
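
The metadata in each result is nested under the scope and template key, which is awkward to read. As a rough sketch, the hypothetical `flatten_entry` helper below pulls the template fields up next to the file information. It assumes each entry is a plain dictionary shaped like the printed output above; objects returned by the SDK may need converting first.

```python
from typing import Any, Dict


def flatten_entry(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Copy template fields (keys without a '$' prefix) next to the file id and name."""
    row = {"id": entry["id"], "name": entry["name"]}
    for templates in entry.get("metadata", {}).values():
        for instance in templates.values():
            for key, value in instance.items():
                # Keys starting with '$' are system properties such as $scope or $version
                if not key.startswith("$"):
                    row[key] = value
    return row


entries = [
    {
        "metadata": {
            "enterprise_1133807781": {
                "rbInvoicePO": {
                    "$scope": "enterprise_1133807781",
                    "$template": "rbInvoicePO",
                    "$parent": "file_1443738625223",
                    "purchaseOrderNumber": "Unknown",
                    "invoiceNumber": "A5555",
                    "$version": 11,
                }
            }
        },
        "id": "1443738625223",
        "type": "file",
        "etag": "3",
        "name": "Invoice-A5555.txt",
    }
]

for entry in entries:
    print(flatten_entry(entry))
```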

From this point on we can do some interesting things:

  • Update all purchase order metadata with each corresponding invoice number, in case a user finds one and needs a reference to the other.
  • For unmatched invoices, find specific vendor purchase orders that haven’t yet been matched, so it’s easier for someone to manually match the documents.
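
For the first idea, a starting point is an index from purchase order number to invoice number, built from the invoice metadata. The sketch below uses plain dictionaries with the values shown earlier; `invoice_numbers_by_po` is a hypothetical helper, and writing the result back to each purchase order would still go through the metadata update call.

```python
from typing import Dict, List, Optional


def invoice_numbers_by_po(
    invoices: List[Dict[str, Optional[str]]]
) -> Dict[str, str]:
    """Map each purchase order number to the invoice that references it."""
    index = {}
    for invoice in invoices:
        po_number = invoice.get("purchaseOrderNumber")
        invoice_number = invoice.get("invoiceNumber")
        # Skip invoices whose PO number is missing or still the "Unknown" placeholder
        if po_number and po_number != "Unknown" and invoice_number:
            index[po_number] = invoice_number
    return index


invoices = [
    {"invoiceNumber": "A5555", "purchaseOrderNumber": "Unknown"},
    {"invoiceNumber": "B1234", "purchaseOrderNumber": "001"},
    {"invoiceNumber": "C9876", "purchaseOrderNumber": "002"},
]

print(invoice_numbers_by_po(invoices))  # {'001': 'B1234', '002': 'C9876'}
```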

Final thoughts

Metadata is not just a technical detail — it’s a strategic asset that can transform the way your company works. Metadata templates help teams maintain consistency across the enterprise. Whether it’s purchase orders, legal contracts, or creative assets, predefined templates streamline processes and minimize errors.

The more your organization evolves, the more critical metadata becomes. It supports adaptability, integration with other systems, and a robust information architecture. Plus, it provides the context that team members need to collaborate seamlessly.

Documents and references

API Reference

Thoughts? Comments? Feedback?

Drop us a line on our community forum.
