Working with Box Metadata Queries

Rui Barbosa
Published in Box Developer Blog
6 min read · Mar 7, 2023


Metadata brings your data into focus

Box Metadata is custom data associated with your content, stored as key/value pairs, that lets you put that content in context. Metadata queries find content by searching the metadata attached to it, which is very powerful and helps users or services find exactly what they need.

In this article we will be exploring metadata queries using the Box CLI, REST API, and Python SDK.

This article picks up where the previous one, Building a metadata service using Box and FastAPI, left off. There we created a FastAPI service to automatically populate the metadata of media files.

Metadata query format

To use metadata queries in Box you need to know the scope and key of the metadata template that contains the attributes you're filtering by. You also need a starting folder, which can be the root folder.

You also need the query, where you define the parameters of your search, and the values of those parameters.

Finally you specify any sorting and return attributes of the content.

Optionally, you may choose to impersonate another user to run metadata searches under that user's security context.

Keep these in mind for all the examples:

  • Scope: enterprise_877840855
  • Key: demoMediaMetadata
  • Root folder: 0
  • Attribute to filter: duration
  • Attribute key: enterprise_877840855.demoMediaMetadata.duration
  • User to impersonate: 18622116055
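
The values in the list above combine in two ways: the template is referenced as scope.key, and a returned field is requested as metadata.scope.key.attribute. Here is a minimal sketch of that composition. The helper names template_ref and field_path are illustrative only and are not part of any Box library:

```python
"""Compose the identifiers a metadata query needs (illustrative helpers)."""

SCOPE = "enterprise_877840855"
KEY = "demoMediaMetadata"
ATTRIBUTE = "duration"


def template_ref(scope: str, key: str) -> str:
    """Return the `scope.key` reference that names the template."""
    return f"{scope}.{key}"


def field_path(scope: str, key: str, attribute: str) -> str:
    """Return the `metadata.scope.key.attribute` path used to request a field."""
    return f"metadata.{template_ref(scope, key)}.{attribute}"


print(template_ref(SCOPE, KEY))
print(field_path(SCOPE, KEY, ATTRIBUTE))
```

Both strings show up repeatedly in the CLI, REST, and SDK examples that follow.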

Although the CLI is mostly used by admins, it is also very useful for a developer who wants to run a quick test. In this case, I'm also using it to discover which user to impersonate, which template to query, and whether the app can access the content.

Using the CLI

We start with the box-cli configured to use JWT authentication. We have the service user associated with the authentication, and also the user who owns the content.

Who am I:

❯ box users:get --csv --fields type,id,name
# output
type,id,name
user,20130487697,JWT

The user to impersonate is me so that the service user has access to my content:

❯ box users --csv --fields type,id,name
# output
type,id,name
user,18622116055,Rui Barbosa
...

The content that has metadata is scattered across folders 192442970500 and 192444166451. Let's do a quick check to see if we can list any content:

❯ box folders:items 192442970500 \
--as-user 18622116055 \
--csv --fields type,id,name
# output
type,id,name
file,1121082178302,BigBuckBunny.mp4
file,1127033253925,file_example_AVI_1280_1_5MG.avi
...
❯ box folders:items 192444166451 \
--as-user 18622116055 \
--csv --fields type,id,name
# output
type,id,name
file,1127036724403,file_example_MP3_1MG.mp3
file,1127032879396,file_example_MP3_2MG.mp3
...

To interact with metadata queries we need to know the template scope and key. We'll be using the Demo Media Metadata template (scope enterprise_877840855, key demoMediaMetadata):

❯ box metadata-templates \
--csv --fields id,type,templateKey,scope,displayName
# output
type,id,templateKey,scope,displayName
.,.,demoMediaMetadata,enterprise_877840855,Demo Media Metadata
...

So, let's try to find media files with a duration between 1 and 30 seconds. Note that the duration metadata field is expressed in milliseconds:

❯ box metadata-query enterprise_877840855.demoMediaMetadata 0 \
--as-user 18622116055 \
--query "duration >= :min AND duration <= :max" \
--query-params min=1000f,max=30000f \
--order-by duration=ASC \
--extra-fields type,id,name,metadata.enterprise_877840855.demoMediaMetadata.duration \
--csv --fields type,id,name,metadata.enterprise_877840855.demoMediaMetadata.duration
# output
type,id,name,metadata.enterprise_877840855.demoMediaMetadata.duration
file,1127036724403,file_example_MP3_1MG.mp3,27167
file,1127036652230,file_example_MP3_700KB.mp3,27252
file,1127033373615,file_example_WAV_5MG.wav,29628

Breaking down the above CLI command:

# search for content using the demo media template (scope.key)
# starting from the root folder (0)
box metadata-query enterprise_877840855.demoMediaMetadata 0 \

# impersonating a user
--as-user 18622116055 \

# query the duration with 2 parameters (:min and :max)
--query "duration >= :min AND duration <= :max" \

# set :min to 1000 milliseconds and :max to 30000 milliseconds
--query-params min=1000f,max=30000f \

# sort the output by duration
--order-by duration=ASC \

# include these extra fields
--extra-fields type,id,name,metadata.enterprise_877840855.demoMediaMetadata.duration \

# set the output to csv and show only these fields
--csv --fields type,id,name,metadata.enterprise_877840855.demoMediaMetadata.duration
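
Since the duration field is stored in milliseconds, it can help to generate the command from a range given in seconds. The following is a rough sketch that only assembles the argument list using the flags shown above; build_command is a hypothetical helper, and the trailing "f" on each value simply copies the format used in this article's command:

```python
"""Assemble the article's `box metadata-query` command from seconds (sketch)."""
import shlex

SCOPE = "enterprise_877840855"
KEY = "demoMediaMetadata"
AS_USER = "18622116055"


def build_command(min_seconds: int, max_seconds: int) -> list[str]:
    """Return the argv list for a duration range given in seconds."""
    field = f"metadata.{SCOPE}.{KEY}.duration"
    fields = f"type,id,name,{field}"
    # The template stores duration in milliseconds; the trailing "f"
    # mirrors the value format used in the article's command.
    params = f"min={min_seconds * 1000}f,max={max_seconds * 1000}f"
    return [
        "box", "metadata-query", f"{SCOPE}.{KEY}", "0",
        "--as-user", AS_USER,
        "--query", "duration >= :min AND duration <= :max",
        "--query-params", params,
        "--order-by", "duration=ASC",
        "--extra-fields", fields,
        "--csv", "--fields", fields,
    ]


# shlex.join quotes the query so the line can be pasted into a shell.
print(shlex.join(build_command(1, 30)))
```

Keeping the command as a list also makes it straightforward to hand to subprocess.run if you want to script the CLI.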

Check out the documentation on the CLI for box metadata-query here.

Using the REST API

The easiest way to test this is to use Postman; check out how to fork the Box API collection to get started.

One option is to use the metadata filter (mdfilters) URL parameter and execute a GET on the search endpoint.

Create the metadata parameter, and use it on the endpoint:

[
  {
    "scope": "enterprise_877840855",
    "templateKey": "demoMediaMetadata",
    "filters": {
      "duration": {
        "gt": 1000,
        "lt": 30000
      }
    }
  }
]

https://{{api.box.com}}/2.0/search
?mdfilters=[{"scope":"enterprise_877840855","templateKey":"demoMediaMetadata","filters":{"duration":{"gt":1000,"lt":30000}}}]
&fields=type,id,name,metadata.enterprise_877840855.demoMediaMetadata.duration
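
Outside Postman, the JSON filter has to be serialized and URL-encoded before it goes into the query string. Here is a small sketch using only the Python standard library. It builds the URL without sending anything, and assumes the public https://api.box.com host in place of the Postman variable:

```python
"""Build the search URL with an `mdfilters` parameter (sketch, no request sent)."""
import json
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_URL = "https://api.box.com/2.0/search"
FIELD = "metadata.enterprise_877840855.demoMediaMetadata.duration"

md_filters = [
    {
        "scope": "enterprise_877840855",
        "templateKey": "demoMediaMetadata",
        "filters": {"duration": {"gt": 1000, "lt": 30000}},
    }
]

params = {
    # Compact JSON keeps the URL short; urlencode percent-escapes it.
    "mdfilters": json.dumps(md_filters, separators=(",", ":")),
    "fields": f"type,id,name,{FIELD}",
}
url = f"{SEARCH_URL}?{urlencode(params)}"
print(url)

# Round trip: decoding the query string recovers the original filter.
decoded = parse_qs(urlsplit(url).query)
print(json.loads(decoded["mdfilters"][0]) == md_filters)
```

A real call would also need an Authorization header carrying a valid access token.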

Resulting in:

output of the api call using postman

Another option is to use metadata queries and send a POST to the execute read endpoint.

The POST body looks like this:

{
  "from": "enterprise_877840855.demoMediaMetadata",
  "query": "duration >= :min AND duration <= :max",
  "query_params": {
    "min": 1000,
    "max": 30000
  },
  "ancestor_folder_id": "0",
  "order_by": [
    {
      "field_key": "duration",
      "direction": "asc"
    }
  ],
  "fields": [
    "type",
    "id",
    "name",
    "metadata.enterprise_877840855.demoMediaMetadata.duration"
  ]
}
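
For readers who prefer a script to Postman, the same body can be prepared with the Python standard library. This is only a sketch: it constructs the request object without sending it, and ACCESS_TOKEN is a placeholder you would replace with a real token:

```python
"""Prepare the execute_read POST with the standard library (sketch, not sent)."""
import json
from urllib.request import Request

ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"  # placeholder, not a real credential
TEMPLATE = "enterprise_877840855.demoMediaMetadata"

body = {
    "from": TEMPLATE,
    "query": "duration >= :min AND duration <= :max",
    "query_params": {"min": 1000, "max": 30000},
    "ancestor_folder_id": "0",
    "order_by": [{"field_key": "duration", "direction": "asc"}],
    "fields": ["type", "id", "name", f"metadata.{TEMPLATE}.duration"],
}

request = Request(
    "https://api.box.com/2.0/metadata_queries/execute_read",
    data=json.dumps(body).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)

print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request), with a valid token.
```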

And results in:

output of the api call using postman

Check out the API reference for search and metadata queries.

Using the Python SDK

First note the basic configuration class and the get_box_client() method which returns a box client impersonating a user.

"""sample script for box metadata query"""
from boxsdk import Client, JWTAuth
from boxsdk.object.search import MetadataSearchFilter, MetadataSearchFilters

# configuration class
class Config:
    """configurations"""
    SCOPE = "enterprise_877840855"
    KEY = "demoMediaMetadata"
    ATTRIBUTE = "duration"
    AS_USER = "18622116055"
    JWT_FILE = ".jwt.config.json"

def get_box_client() -> Client:
    """returns a box client impersonating the user"""
    cfg = Config()
    auth = JWTAuth.from_settings_file(cfg.JWT_FILE)
    service_client = Client(auth)
    as_user = service_client.user(user_id=cfg.AS_USER)
    return service_client.as_user(as_user)

One option to execute a query against the metadata is to build a MetadataSearchFilters() object and pass it to the client.search().query() method as the metadata_filters argument:

def main():
    """main function"""
    cfg = Config()

    # get a client impersonating the user
    client = get_box_client()

    # initialize the metadata search class
    mt_query = MetadataSearchFilters()

    # create a metadata search filter for the specific template
    mt_filter = MetadataSearchFilter(template_key=cfg.KEY, scope=cfg.SCOPE)

    # add a range filter for the attribute
    mt_filter.add_range_filter(field_key=cfg.ATTRIBUTE, gt_value=1000, lt_value=30000)

    # add the filter to the query
    mt_query.add_filter(mt_filter)

    # search for items
    items = client.search().query(None, limit=100, offset=0, metadata_filters=mt_query)

    #### code removed for simplicity ###

Resulting in:

----------------------------------------------------------------------
type id name duration
----------------------------------------------------------------------
file 1127033373615 file_example_WAV_5MG.wav 29628
file 1127036652230 file_example_MP3_700KB.mp3 27252
file 1127036724403 file_example_MP3_1MG.mp3 27167
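
The printing code was left out of the listing above. One possible way to produce a table like this is sketched below, using plain dictionaries as stand-ins for the returned items. The format_table helper and the column widths are assumptions made for illustration, not the code from the original script:

```python
"""Print query results as a fixed-width table (sketch with stand-in rows)."""

# Stand-in rows copied from the output above; real code would read
# these values from the items returned by the SDK.
rows = [
    {"type": "file", "id": "1127033373615", "name": "file_example_WAV_5MG.wav", "duration": 29628},
    {"type": "file", "id": "1127036652230", "name": "file_example_MP3_700KB.mp3", "duration": 27252},
    {"type": "file", "id": "1127036724403", "name": "file_example_MP3_1MG.mp3", "duration": 27167},
]


def format_table(rows: list[dict]) -> str:
    """Return a header, separator lines, and one aligned line per row."""
    header = f"{'type':<6}{'id':<15}{'name':<30}{'duration':>10}"
    rule = "-" * len(header)
    lines = [rule, header, rule]
    for row in rows:
        lines.append(
            f"{row['type']:<6}{row['id']:<15}{row['name']:<30}{row['duration']:>10}"
        )
    return "\n".join(lines)


print(format_table(rows))
```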

The other option is to use the search metadata_query() method. This time let's look for files with a duration between 100 and 600 seconds:

def main():
    """main function"""
    cfg = Config()

    # get a client impersonating the user
    client = get_box_client()

    # create a query using parameters
    md_query = "duration >= :min AND duration <= :max"

    # set the parameter values (milliseconds)
    md_params = {"min": 100000, "max": 600000}

    # set the order by
    order_by = [{"field_key": "duration", "direction": "ASC"}]

    # set the fields to return
    fields = [
        "type",
        "id",
        "name",
        "metadata." + cfg.SCOPE + "." + cfg.KEY + "." + cfg.ATTRIBUTE,
    ]

    # search for items
    mt_items = client.search().metadata_query(
        from_template=cfg.SCOPE + "." + cfg.KEY,
        ancestor_folder_id="0",  # root folder, as a string like in the REST body
        query=md_query,
        query_params=md_params,
        order_by=order_by,
        fields=fields,
    )

Resulting in:

----------------------------------------------------------------------
type id name duration
----------------------------------------------------------------------
file 1127038082548 file_example_MP3_5MG.mp3 132205
file 1121082178302 BigBuckBunny.mp4 596473
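
As a sanity check, the same range logic can be replayed locally against the durations that appear in this article's outputs. This sketch does not call Box at all. It only covers the five files whose durations were printed above, and in_range is a hypothetical helper:

```python
"""Replay the duration range filter on the article's sample values (sketch)."""

# Durations in milliseconds, copied from the outputs shown in this article.
DURATIONS_MS = {
    "file_example_MP3_1MG.mp3": 27167,
    "file_example_MP3_700KB.mp3": 27252,
    "file_example_WAV_5MG.wav": 29628,
    "file_example_MP3_5MG.mp3": 132205,
    "BigBuckBunny.mp4": 596473,
}


def in_range(min_seconds: int, max_seconds: int) -> list[str]:
    """Names whose duration lies in [min, max] seconds, shortest first."""
    low, high = min_seconds * 1000, max_seconds * 1000
    matches = [(ms, name) for name, ms in DURATIONS_MS.items() if low <= ms <= high]
    return [name for ms, name in sorted(matches)]


print(in_range(1, 30))     # the three short audio files
print(in_range(100, 600))  # the longer MP3 and BigBuckBunny.mp4
```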

This full working script can be downloaded from this GitHub Gist.

Check out the Python SDK search documentation.

Our other SDKs (Java, .NET, and Node) work in a similar way; check them out.

Box app

Let's not forget about the users: they can also perform searches using metadata in the Box app. The example looks like this:

using a metadata query on the box.com app

Metadata Query API vs Search API

The Metadata Query API complements the Search API. It lets you find content based strictly on values stored in custom metadata, use SQL-like logic in your queries, and get results that immediately reflect metadata creations, updates, and deletions, without any indexing delay.

The Search API, on the other hand, is used when you want fuzzy matching, results based on document text and the Box default attributes. It allows you to search by file or folder names, as well as the corresponding document text, and to filter results based on criteria with simple boolean searches, such as “attribute” = “value”.

The Search API relies on indexing, which is a process that happens whenever you upload or modify content, and has limitations including delayed indexing and limited operations for filters.

The Metadata Query API should be used when you want to find content based strictly on your metadata and not on document content.

In general, the Metadata Query API is better suited for complex queries that involve multiple metadata fields, such as queries that require number, date, and string comparisons, while the Search API is better for simpler queries and searches that involve full-text search.

For more information, check out the Understanding the Metadata Query API and Search API Guides & References.

If you want to learn more about Box Metadata check out these links:
