Data Virtualization with Trino Part 2

Juacall Bernard
Published in cisco-fpie
Feb 4, 2022

Welcome back! In this post I will explain some of the details of Trino, walk through setting up Trino, and run some queries that combine data from multiple sources. Running Trino requires Linux or macOS and a 64-bit version of Java 11 with the Java Development Kit.

Trino Cluster Setup

You will find everything you need to get started at https://trino.io/download.html . There are downloads for the server package, a CLI, and a JDBC driver. There is also a Docker image, with more information at https://hub.docker.com/r/trinodb/trino . The server package is an archive that contains the files needed to run a Trino server. It does not contain a default config, so after unpacking the contents you will need to add the config files yourself. Doing the setup yourself pays off when troubleshooting, because you know exactly where the configuration files live, and locating those files is often the first part of the battle when using new software.

First you have to create an “etc” directory inside the installation directory. This directory will hold:

  • Node Properties: environmental configuration specific to each node
  • JVM Config: command line options for the Java Virtual Machine
  • Config Properties: configuration for the Trino server
  • Catalog Properties: configuration for Connectors (data sources)
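
For reference, here is a rough sketch of how the layout ends up. The trino-server directory name is just a placeholder; the actual name will include the version you downloaded.

trino-server/
  bin/                  launcher scripts shipped with the server package
  lib/                  Trino's jars
  plugin/               connector plugins
  etc/                  the directory you create
    node.properties
    jvm.config
    config.properties
    catalog/            catalog/connector configs, covered below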

The node properties file should be named node.properties. The contents will look like:

node.environment=trino_test
node.id=coordinator_node_one
node.data-dir=/var/trino/data

The “node.environment” setting names the cluster; all nodes within a cluster need to have the same environment name. The “node.id” is a unique id for this instance of Trino. Remember, every node is a separate instance of a Trino server. The “node.data-dir” is where the logs and other data will reside. There will also be links to your etc directory and the plugin directory inside the data directory.

The JVM config file should be named jvm.config and will look something like…

-server
-Xmx16G
-XX:-UseBiasedLocking
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000

At first glance it can look daunting; at least it did to me. The options are as straightforward as any other config, but a couple of them are worth calling out. When an out-of-memory error occurs, the process is terminated, because the JVM can be left in an inconsistent state. Also, the default /tmp directory does not allow execution of code in some installations, which prevents Trino from starting. You can work around this by overriding the temporary directory: add -Djava.io.tmpdir=/path/to/other/tmpdir to the list of JVM options.

Let's move on to the config properties. This file will be named config.properties in your etc directory, and it is where you decide whether a Trino node will function as a coordinator or a worker. For testing, a coordinator can both handle coordinating and process work like a worker, although this is not ideal for larger clusters. A coordinator config looks like:

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery.uri=http://example.net:8080

For a worker:

coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery.uri=http://example.net:8080

Setting node-scheduler.include-coordinator=true allows work to be scheduled on the coordinator as well.
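
For a quick single-node test, a config.properties along these lines lets one process act as both coordinator and worker. The port and memory limits here are just example values; size them for your own machine.

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery.uri=http://localhost:8080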

  • query.max-memory: The maximum amount of distributed memory that a query may use.
  • query.max-memory-per-node: The maximum amount of user memory that a query may use on any one machine.

Distributed Memory Figure from FutureLearn

Distributed memory is memory in a multiprocessor computer system where each processor has its own private memory. It can also be multiple computers communicating over a network.

From here you will need some sort of client. I’ll be using the CLI tool that Trino provides. There will be an example of the catalog.properties file when we start moving data.

Command Line Interface

The CLI comes as a jar and will need permission to execute. You can grant that by typing chmod +x path-to-cli.jar in a terminal. To execute the jar, run ./name-of-jar.jar --server localhost:8080 --catalog medium --schema default. The server option is where the CLI will send its queries, which is where the server we set up earlier is listening. You can exit the CLI by typing quit or exit.

Catalog.properties/Connectors

I will be using some MySQL databases. Catalog files for the connectors need to be created in a “catalog” folder; the path should be your-server-path/etc/catalog. A Docker setup will also have an “etc” directory. The mysql.properties catalog file will look something like the following.

connector.name=mysql
connection-url=jdbc:mysql://example.net:3306
connection-user=root
connection-password=secret

Each connector also has connection options that can be altered depending on which database you're using. You can view the options for each connector in Trino's documentation; for example, the MongoDB connector options are at https://trino.io/docs/current/connector/mongodb.html .

Now that the connectors are set up, run the server and then your client. I run my server locally and in the foreground when testing to catch any errors; if there are any issues I will see them printed to the terminal.
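The server package ships with a launcher script for this; roughly, the commands look like the following, assuming you are inside the unpacked server directory.

bin/launcher run      # run in the foreground; output prints to the terminal
bin/launcher start    # run in the background as a daemon
bin/launcher stop     # stop a background server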

SQL Queries

Trino Table Request Breakdown by Juacall - Bracket icons created by Roundicons Premium

The above image shows how a reference to a table is structured: catalog.schema.table. The catalog is the name of the catalog properties file. Do not confuse this with the connector.name inside the file, which says what kind of database you're using; in this case it's a MySQL database. The schema holds the tables and functions, and the tables hold the data.
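For example, a fully qualified query against the mysql catalog defined above could look like the one below; the airport schema and airport table are placeholders for whatever actually exists in your database.

-- catalog.schema.table
SELECT * FROM mysql.airport.airport LIMIT 10;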

Some things to make note of: you cannot create auto-incrementing columns or use select into queries. A select into query creates the result table for you when merging/joining data from multiple tables; with Trino the result table has to be created beforehand, and then an insert into query can be used. Also, for MySQL you cannot drop tables unless you add the config option allow-drop-table=true.

Before starting, run the query show catalogs; to make sure your connectors are available to use. What I am going to do first is take data from the airport database and the world_x database and add it to the trino_result database. First we need to create the table. Unless you need to know exactly what kind of data is in the columns, you can just run a show columns query to pick the columns of interest that will be used to create the new trino_result table.
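Roughly, those checks look like the following; here mysql is the catalog file name from earlier, and the airport schema and table are assumptions standing in for your own.

SHOW CATALOGS;
SHOW SCHEMAS FROM mysql;
SHOW COLUMNS FROM mysql.airport.airport;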

Create Table Query
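
A sketch of that kind of create table statement might look like the following; the trino_result.airport_languages table and its column names are made up for illustration.

CREATE TABLE mysql.trino_result.airport_languages (
    airport_name varchar(100),
    country_code varchar(3),
    language varchar(50),
    percentage double
);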

If the connector supports it, you can also use create view. Views do not contain any data. Instead, the query stored by the view is executed every time the view is referenced by another query.
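A hedged sketch of a view, using a hypothetical some_catalog.some_schema backed by a connector that supports views, and an assumed countrylanguage table in the world_x database:

CREATE VIEW some_catalog.some_schema.english_speakers AS
SELECT countrycode, percentage
FROM mysql.world_x.countrylanguage
WHERE language = 'English';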

Insert Into Query
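
A sketch of that kind of insert, joining the two source databases with table aliases, might look like the following; the table and column names are assumptions carried over from the create table sketch above.

INSERT INTO mysql.trino_result.airport_languages
SELECT a.name, a.country_code, l.language, l.percentage
FROM mysql.airport.airport AS a
JOIN mysql.world_x.countrylanguage AS l
  ON a.country_code = l.countrycode;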

In the insert into query you can see aliases for the tables. I recommend using aliases when possible so you avoid typing out the long reference to each table and cut down on misspellings. When that's done, a simple select query will need to be run to make sure our data was moved and combined successfully.
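Something along these lines, using the table assumed above:

SELECT * FROM mysql.trino_result.airport_languages LIMIT 10;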

Trino Result Table

Trino also provides functions and operators for more advanced SQL, like filtering out airports from the country ‘AFG’ or showing all countries where between 10 and 20 percent of the people speak English.
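Sketches of both, again using the assumed tables and column names from earlier:

-- airports outside of 'AFG'
SELECT *
FROM mysql.trino_result.airport_languages
WHERE country_code <> 'AFG';

-- countries where 10 to 20 percent of people speak English
SELECT countrycode, percentage
FROM mysql.world_x.countrylanguage
WHERE language = 'English'
  AND percentage BETWEEN 10 AND 20;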

Trino is a powerful tool, and hopefully this gives you a good intro to how to use Trino and what it can do.

Thanks for reading this article! Leave a comment below if you have any questions.

References: https://trino.io/, https://www.starburst.io/info/oreilly-trino-guide/, https://hub.docker.com/r/trinodb/trino
