Extracting structured data using Box AI

Rui Barbosa
Published in Box Developer Blog
Jun 25, 2024

In this article, we’ll demonstrate how to extract structured data from a document using the Box AI API.

This new endpoint allows your app to query an unstructured document and populate data based on an ad hoc structure.

This endpoint is still in beta and may not be available for your Box account tier.

In a previous article, we demonstrated how to extract structured metadata from an unstructured document using Box AI.

When we use Box AI-driven metadata extraction, we are instructing the Box AI API to extract data according to a pre-set structure obtained from the metadata template.

Let’s recap with an example. Consider this sample invoice:

Sample invoice

And this metadata template:

Metadata template

We ask for the metadata suggestions:

curl --location 'https://api.box.com/2.0/metadata_instances/suggestions?item=file_1443721424754&scope=enterprise_1133807781&template_key=rbInvoicePO&confidence=experimental' \
--header 'Authorization: Bearer Qj...RF'

We then see these results:

{
"$scope": "enterprise_1133807781",
"$templateKey": "rbInvoicePO",
"suggestions": {
"documentType": "Invoice",
"total": "1,050",
"vendor": "Quasar Innovations",
"invoiceNumber": "Q2468",
"purchaseOrderNumber": "003"
}
}
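
For readers who prefer a script to curl, here is a minimal Python sketch of the same request and response handling. It only assembles the URL and reads the `suggestions` object; the function names are ours, not part of any Box SDK, and sending the request (with your own HTTP client and token) is left out.

```python
import json
from urllib.parse import urlencode

SUGGESTIONS_URL = "https://api.box.com/2.0/metadata_instances/suggestions"


def build_suggestions_url(file_id, enterprise_id, template_key, confidence="experimental"):
    """Assemble the same query string used in the curl call (hypothetical helper)."""
    query = urlencode(
        {
            "item": f"file_{file_id}",
            "scope": f"enterprise_{enterprise_id}",
            "template_key": template_key,
            "confidence": confidence,
        }
    )
    return f"{SUGGESTIONS_URL}?{query}"


def read_suggestions(response_body):
    """Return the 'suggestions' object from the raw JSON response text."""
    return json.loads(response_body).get("suggestions", {})
```

Note that every suggested value comes back as a string (for example `"1,050"`), so any numeric conversion is up to your app.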

So Box AI is able to take a data structure, read the unstructured document, and return its best suggestions for filling in the structured data.

Extracting structured data

We can now do the same without the metadata template. We point the Box AI API to the document and an ad hoc structure, and it returns its suggestions in the same way. For example:

curl --location 'https://api.box.com/2.0/ai/extract' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer Qj...RF' \
--data '{
"prompt": "{\"fields\":[{\"key\":\"vendor\",\"displayName\":\"Vendor\",\"type\":\"string\",\"description\":\"Vendorname\"},{\"key\":\"documentType\",\"displayName\":\"Type\",\"type\":\"string\",\"description\":\"\"}]}",
"items": [
{
"type": "file",
"id": "1443721424754"
}
]
}'

Results:

{
"answer": "{\"vendor\": \"Quasar Innovations\", \"documentType\": \"Invoice\"}",
"created_at": "2024-05-31T10:15:38.17-07:00",
"completion_reason": "done"
}
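
Both the `prompt` we send and the `answer` we get back are JSON encoded as strings inside the outer JSON, which is easy to get wrong by hand. A minimal Python sketch of the encoding and decoding, assuming the request and response shapes shown in this article (the helper names are ours):

```python
import json


def build_extract_payload(prompt, file_id):
    """Build the request body for the extract call.

    A dict or list prompt is serialized to a JSON string first,
    because the 'prompt' field itself must be a string.
    """
    if not isinstance(prompt, str):
        prompt = json.dumps(prompt)
    return {"prompt": prompt, "items": [{"type": "file", "id": str(file_id)}]}


def parse_extract_answer(response_body):
    """Decode the outer response, then decode the JSON string held in 'answer'."""
    response = json.loads(response_body)
    return json.loads(response["answer"])


fields = {
    "fields": [
        {"key": "vendor", "displayName": "Vendor", "type": "string", "description": "Vendor name"},
        {"key": "documentType", "displayName": "Type", "type": "string", "description": ""},
    ]
}
payload = build_extract_payload(fields, "1443721424754")
# json.dumps(payload) is the text you would pass to curl --data.
```

This sketch assumes the answer is always valid JSON; if Box AI returns free text instead, `parse_extract_answer` raises `json.JSONDecodeError`.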

You might have noticed that my prompt looks suspiciously like the definition of the metadata template, expressed as a JSON string. Here is a summary of the actual template definition:

{
"id": "8105a3ed-dca3-495f-9e89-9bdf316bb832",
"type": "metadata_template",
"templateKey": "rbInvoicePO",
"scope": "enterprise_1133807781",
"displayName": "RB: Invoice & POs",
"fields": [
{
"id": "2af17183-d0c7-4510-8344-0cb5facb2120",
"key": "documentType",
"displayName": "Document Type",
"description": "Is this an invoice or a purchase order?"
},
{
"id": "546e67d0-307e-46b8-ab19-e8d2ae05d0ed",
"type": "string",
"key": "vendor",
"displayName": "Vendor"
}
]
}
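
If you already have a template definition like this one, turning it into a prompt can be scripted. The sketch below is one way to do it in Python, not an official conversion: it keeps only `key`, `displayName`, `type`, and `description`, and falling back to `"string"` when a field has no `type` (as in the summary above) is our assumption.

```python
import json


def template_to_prompt(template):
    """Convert a metadata template definition into an extract prompt string."""
    fields = []
    for field in template.get("fields", []):
        fields.append(
            {
                "key": field["key"],
                "displayName": field.get("displayName", field["key"]),
                # Assumption: default to "string" when the type is not given.
                "type": field.get("type", "string"),
                "description": field.get("description", ""),
            }
        )
    return json.dumps({"fields": fields})


template = {
    "templateKey": "rbInvoicePO",
    "fields": [
        {
            "key": "documentType",
            "displayName": "Document Type",
            "description": "Is this an invoice or a purchase order?",
        },
        {"type": "string", "key": "vendor", "displayName": "Vendor"},
    ],
}
prompt = template_to_prompt(template)
```

The resulting `prompt` string can go straight into the `prompt` field of the request body.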

When I first saw this, I thought, “Why not use a proper JSON object? This feels cumbersome.”

That is when the engineering team explained that you don’t need to use a Box metadata definition; you can pass anything that helps Box AI look for the structured data. For example:

curl --location 'https://api.box.com/2.0/ai/extract' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer Qj...RF' \
--data '{
"prompt": "{\"vendor\",\"total\",\"doctype\",\"date\",\"PO\"}",
"items": [
{
"type": "file",
"id": "1443721424754"
}
]
}'

Results:

{
"answer": "{\"vendor\": \"Quasar Innovations\", \"total\": \"$1,050\", \"doctype\": \"Invoice\", \"PO\": \"003\"}",
"created_at": "2024-05-31T10:28:51.906-07:00",
"completion_reason": "done"
}
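
Notice that the prompt asked for `date`, but the answer has no `date` key. With a loose prompt, it is worth checking which of the requested fields actually came back. A minimal Python sketch of that check (the helper name is ours):

```python
import json


def missing_fields(requested, response_body):
    """List the requested field names that are absent from the decoded answer."""
    answer = json.loads(json.loads(response_body)["answer"])
    return [name for name in requested if name not in answer]


# The response from the call above, rebuilt here so the sketch runs offline.
response_body = json.dumps(
    {
        "answer": json.dumps(
            {"vendor": "Quasar Innovations", "total": "$1,050", "doctype": "Invoice", "PO": "003"}
        ),
        "completion_reason": "done",
    }
)
print(missing_fields(["vendor", "total", "doctype", "date", "PO"], response_body))  # ['date']
```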

You can even use plain English:

curl --location 'https://api.box.com/2.0/ai/extract' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer Qj...RF' \
--data '{
"prompt": "find the document type (invoice or po), vendor, total, and po number",
"items": [
{
"type": "file",
"id": "1443721424754"
}
]
}'

Results:

{
"answer": "{\"Document Type\": \"Invoice\", \"Vendor\": \"Quasar Innovations\", \"Total\": \"$1,050\", \"PO Number\": \"003\"}",
"created_at": "2024-05-31T10:30:51.223-07:00",
"completion_reason": "done"
}
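
One side effect of free-form prompts is that Box AI picks the key names: compare `doctype` and `PO` in the earlier answer with `Document Type` and `PO Number` here. If your app expects fixed keys, you may want to normalize them. The Python sketch below shows one approach; the alias table is purely illustrative and would need to match your own data.

```python
import re

# Illustrative aliases only: normalized key -> the key our app expects.
ALIASES = {
    "doctype": "documentType",
    "documenttype": "documentType",
    "vendor": "vendor",
    "total": "total",
    "po": "purchaseOrderNumber",
    "ponumber": "purchaseOrderNumber",
    "purchaseordernumber": "purchaseOrderNumber",
    "invoicenumber": "invoiceNumber",
}


def normalize_key(key):
    """Lowercase the key and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", key.lower())


def canonicalize(answer):
    """Rename known keys to their canonical form; keep unknown keys unchanged."""
    return {ALIASES.get(normalize_key(key), key): value for key, value in answer.items()}


shorthand = {"vendor": "Quasar Innovations", "total": "$1,050", "doctype": "Invoice", "PO": "003"}
plain_english = {
    "Document Type": "Invoice",
    "Vendor": "Quasar Innovations",
    "Total": "$1,050",
    "PO Number": "003",
}
# Both answers map to the same canonical dictionary.
same = canonicalize(shorthand) == canonicalize(plain_english)
```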

Depending on the document, you can even get results with something as simple as:

curl --location 'https://api.box.com/2.0/ai/extract' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer Qj...RF' \
--data '{
"prompt": "summarize document",
"items": [
{
"type": "file",
"id": "1443721424754"
}
]
}'

Results:

{
"answer": "{\"Vendor\": \"Quasar Innovations\", \"Invoice Number\": \"Q2468\", \"Purchase Order Number\": \"003\", \"Total\": \"$1,050\"}",
"created_at": "2024-05-31T10:47:59.295-07:00",
"completion_reason": "done"
}

I paused for a moment to consider the implications.

This is genius! It means that if you have a formal structure originating from any other system, you can feed it to Box AI. JSON, XML, YAML, your own language: it doesn’t matter.

Of course, your mileage may vary. You might need to tweak the prompt to help Box AI find what you want. At some point, if you stress it enough, it might respond with an empty result.

For example, consider this document:

Sample unstructured document

curl --location 'https://api.box.com/2.0/ai/extract' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer Qj...RF' \
--data '{
"prompt": "summarize document",
"items": [
{
"type": "file",
"id": "1530265998769"
}
]
}'

Results:

{
"answer": "{}",
"created_at": "2024-05-31T10:41:03.228-07:00",
"completion_reason": "done"
}
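
An empty answer still arrives with `"completion_reason": "done"`, so your app has to check the decoded answer itself. One way to cope is to try a list of prompts, from loosest to most specific, and stop at the first one that returns data. The Python sketch below assumes a caller-supplied `ask` function (hypothetical) that sends one prompt to the extract endpoint and returns the raw response text.

```python
import json


def decode_answer(response_body):
    """Decode the JSON string held in 'answer'; return {} if it is not JSON."""
    raw = json.loads(response_body).get("answer", "")
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {}


def extract_with_fallback(prompts, ask):
    """Try each prompt in order and return (prompt, answer) for the first non-empty answer.

    'ask' is a hypothetical callable: it takes a prompt string and
    returns the response body as text. Returns (None, {}) if every
    prompt comes back empty.
    """
    for prompt in prompts:
        answer = decode_answer(ask(prompt))
        if answer:
            return prompt, answer
    return None, {}
```

Keeping the HTTP call behind `ask` also makes the retry logic easy to test without touching the API.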

Our engineering team continues to come up with interesting technical use cases and applications, and we’re excited to see how you apply these concepts in your applications.

Thoughts? Comments? Feedback?

Drop us a line in our community forum.
