External Tables for Nested JSON with Trino

Andreas Eberle
arconsis
Published in
2 min readMar 15, 2021

To process JSON files we have in our Min.IO data lake, we wanted to create an external table with our Trino (formerly named PrestoSQL) instance. Often, JSON contains nested fields with arrays and objects. Although Trino uses Hive Metastore for storing the external table’s metadata, the syntax to create external tables with nested structures is a bit different in Trino. In this post I’ll show you a small example of how to handle nested JSON with Trino.

JSON Example Data

Our example json looks like this:

{
"firstName": "Hans",
"lastName": "Maulwurf",
"address": {
"street": "920 048 Grove",
"city": "Springfield",
"country": "USA"
},
"jobs": [
{
"job": "Firefighter",
"episode": "s04e03"
},
{
"job": "Teacher",
"episode": "s05e22"
}
]
}

We will store this in our Min.IO in a bucket called data-lake in a directory persons . Please keep in mind that you need to store a minified version of the above JSON or the deserializer will not be able to deserialize the json.

Trino External Table Defintion

To create the table definition, we connect to Trino with the command line interface and run the following CREATE TABLE statement to create the external table.

CREATE TABLE persons (
firstName varchar,
lastName varchar,
address ROW(
street varchar,
city varchar,
country varchar
),
jobs ARRAY(
ROW(
job varchar,
episode varchar
)
)
)
WITH (
external_location = 's3a://data-lake/persons',
format = 'JSON'
)

Querying Nested Fields

Now to find everyone who ever worked as a firefighter, and get the details about that firefighting job you can run a query like this:

SELECT 
firstname,
lastname,
filter(jobs, jobObject -> jobObject.job = 'Firefighter') as jobs
FROM
persons
WHERE
any_match(jobs, jobObject -> jobObject.job = 'Firefighter');

And get the result:

 firstname | lastname |                jobs
-----------+----------+--------------------------------------
Hans | Maulwurf | [{jobs=Firefighter, episode=s04e03}]

Recap

  • With Trino, you can easily make files in your data lake accessible via SQL-like queries.
  • For arrays use ARRAY(<type>)
  • For nested objects use ROW(fieldName1 <type>, fieldName2 <type>, ...)

More Documentation

--

--

Andreas Eberle
arconsis

Solutions Architect & Software Engineer @arconsis