Exploring the Machine Learning capabilities of Trino

Sebastian Daberdaku
Towards Data Engineering
5 min read · Jun 11, 2024

Having used Trino extensively as a high-performance SQL engine for querying diverse data sources in a unified manner, I’ve developed a deep appreciation for its capabilities. Recently, thanks to a colleague’s insightful suggestion, I discovered an aspect of Trino that I hadn’t explored before: its built-in machine learning features.

According to the documentation, the machine learning plugin enables users to train Support Vector Machine (SVM) based classifiers and regressors through an aggregation-function-like interface. Now, I have a soft spot for SVM-based models from my days as a post-doc researcher: I’ve trained my fair share of them in Python, R, and even MATLAB. How difficult can it be to train an SVM classifier using just SQL? Let’s find out!

Note: This article references Trino 449, the latest available version at the time of writing (June 11th, 2024).

The Trino documentation

I studied the Trino documentation, which unfortunately provides very limited information about the ML functionality:

The machine learning functions are not optimized for distributed processing. The capability to train large data sets is limited by this execution of the final training on a single instance.

The trained model can not be saved natively, and needs to be passed in the format of a nested query. As a result you need to run the training process at the same time when predicting values.

Internally, the model is trained by libsvm. You can use learn_libsvm_classifier() and learn_libsvm_regressor() to control the internal parameters of the model.

A few questions arise after reading this description. What type of SVM does this plugin use? LIBSVM supports C-SVC, nu-SVC, one-class SVM, epsilon-SVR, and nu-SVR. Moreover, SVM models have several hyperparameters that should be tuned to the available data, often in a k-fold cross-validation setting, to obtain the best possible classifier or regressor. The documentation suggests it is possible to control the internal parameters of the models, but provides no details on the actual format of those parameters.

No description is provided for params.

The Trino source-code

By looking at the code in the io.trino.plugin.ml package, I was able to infer the following:

  1. The classifier implementation used is C-SVC;
  2. The regressor implementation used is nu-SVR;
  3. For both the classifier and the regressor, the features are automatically scaled to the [0, 1] interval during model training; the same transformation is then applied to the test data;
  4. Among the available LIBSVM parameters, the cache memory size is fixed at 100 MB and the shrinking heuristic is enabled, while probability estimates and weights are disabled;
  5. The params argument is a comma-separated string of key=value pairs, described below:
  • C: the regularization parameter
    possible values: Double
    default: 1.0
    example: C=1.0
  • nu: the nu parameter of nu-SVR
    possible values: Double
    default: 0.5
    example: nu=0.8
  • eps: the tolerance of the termination criterion
    possible values: Double
    default: 0.1
    example: eps=0.001
  • kernel: the type of kernel function to be used
    possible values: linear, poly, rbf, sigmoid
    default: linear
    example: kernel=rbf
  • degree: the degree of the polynomial kernel function (poly kernel only)
    possible values: Integer
    default: 3
    example: degree=3
  • gamma: the gamma parameter for the rbf, poly, and sigmoid kernels
    possible values: Double
    default: 0.0
    example: gamma=0.25
  • coef0: the coefficient for the poly, and sigmoid kernels
    possible values: Double
    default: 0.0
    example: coef0=0.0
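Putting these together, a composite params string simply chains the keys. Here is a minimal sketch of training a polynomial-kernel classifier with several keys combined (my_table, label_col, x1, and x2 are placeholder names of my own, not part of the plugin):

-- Sketch: composing multiple params keys for a poly-kernel classifier
-- (my_table, label_col, x1, x2 are placeholder names)
SELECT learn_libsvm_classifier(
    label_col,
    features(x1, x2),
    'kernel=poly,degree=2,coef0=1,gamma=0.5,C=10,eps=0.001'
) AS model
FROM my_table;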

Creating a sample dataset

I am testing the machine learning capabilities on Trino 449. For these tests, I have configured a Delta Lake catalog named “delta” with a schema named “ml”. I will use the famous Iris flower dataset for this experiment.

USE "delta".ml;

-- First, let's create the iris table
CREATE OR REPLACE TABLE
ml.iris (
sepal_length DOUBLE NOT NULL,
sepal_width DOUBLE NOT NULL,
petal_length DOUBLE NOT NULL,
petal_width DOUBLE NOT NULL,
species VARCHAR NOT NULL
);

-- Then, fill it with the dataset values
INSERT INTO ml.iris
VALUES
    (5.1, 3.5, 1.4, 0.2, 'I.setosa'),
    (4.9, 3, 1.4, 0.2, 'I.setosa'),
    -- I am not including all 150 records here
    (5.9, 3, 5.1, 1.8, 'I.virginica');
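After loading, a quick sanity check confirms the class balance (the full Iris dataset contains 50 records per species):

-- Optional: verify the class distribution after loading
SELECT species, COUNT(*) AS n
FROM ml.iris
GROUP BY species
ORDER BY species;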

Train/test split

We need to split the dataset into a training set to train the model on, and a testing set to evaluate its performance. Since the ML plugin automatically scales the features to the [0, 1] interval, no pre-processing is required.

/*
The random assignment must be materialized as a table:
Trino re-evaluates views on every query, so a view using
RANDOM() would return a different split each time.
*/
CREATE OR REPLACE TABLE ml.iris_with_random AS
SELECT *, RANDOM() AS rand_val
FROM ml.iris;

/*
The following can be views since the result is deterministic at this point.
We will do an 80% - 20% split.
*/
CREATE OR REPLACE VIEW ml.iris_training AS
SELECT
    sepal_length,
    sepal_width,
    petal_length,
    petal_width,
    species
FROM ml.iris_with_random
WHERE rand_val < 0.8;

CREATE OR REPLACE VIEW ml.iris_testing AS
SELECT
    sepal_length,
    sepal_width,
    petal_length,
    petal_width,
    species
FROM ml.iris_with_random
WHERE rand_val >= 0.8;
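Because the split is random, the exact sizes will vary between runs of the CREATE TABLE statement above; an optional count shows how the 150 records were divided:

-- Optional: check the sizes of the random 80/20 split
SELECT 'training' AS split, COUNT(*) AS n FROM ml.iris_training
UNION ALL
SELECT 'testing' AS split, COUNT(*) AS n FROM ml.iris_testing;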

Preparing the classifier

Now let’s define our classifier and store it as a view for future use. Remember that there is no way to store the actual model, which means that training is performed anew for every prediction query.

CREATE OR REPLACE VIEW ml.iris_model AS
SELECT
    learn_libsvm_classifier(
        species,
        features(
            sepal_length,
            sepal_width,
            petal_length,
            petal_width
        ),
        'kernel=rbf,gamma=0.25,C=100'
    ) AS m
FROM ml.iris_training;
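The view yields a single row holding the trained model as an opaque value. If you are curious, Trino’s typeof() function can confirm what the column contains (a purely illustrative check):

-- Optional: inspect the type of the model column
SELECT typeof(m) FROM ml.iris_model;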

Using the classifier

Finally, let’s predict the species of the test-set samples and measure the model’s accuracy. Note that ml.iris_model is a single-row view, so the implicit cross join in the query below simply attaches the trained model to every test record.

CREATE OR REPLACE VIEW ml.iris_model_predictions AS
SELECT
    classify(
        features(
            sepal_length,
            sepal_width,
            petal_length,
            petal_width
        ),
        m
    ) AS predicted_label,
    species AS actual_label
FROM ml.iris_testing, ml.iris_model;

SELECT
    SUM(CASE WHEN predicted_label = actual_label THEN 1.0 ELSE 0.0 END)
        / CAST(COUNT(*) AS DOUBLE) AS accuracy
FROM ml.iris_model_predictions;

The model achieves a 96% classification accuracy, with only one misclassified record out of 25.
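To see exactly where the model slips, a per-class breakdown of the predictions (a confusion matrix in long form) is straightforward:

-- Per-class breakdown of predictions (a long-form confusion matrix)
SELECT actual_label, predicted_label, COUNT(*) AS n
FROM ml.iris_model_predictions
GROUP BY actual_label, predicted_label
ORDER BY actual_label, predicted_label;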

Final considerations

Although limited and somewhat clunky (I challenge anyone to implement hyperparameter optimization with k-fold cross-validation in pure SQL), this feature can provide a handy tool for simple missing data imputation and regression tasks without having to resort to external libraries.
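To illustrate the imputation use case, here is a minimal sketch under the assumption of a hypothetical table some_table_with_gaps where petal_width is occasionally NULL; rows with a known value serve as training data, and the regressor predicts the column for the rest:

-- Sketch: imputing a missing numeric column with the SVM regressor
-- (some_table_with_gaps is a hypothetical table, not from the article)
SELECT
    sepal_length,
    sepal_width,
    petal_length,
    regress(
        features(sepal_length, sepal_width, petal_length),
        model
    ) AS imputed_petal_width
FROM some_table_with_gaps,
     (SELECT learn_libsvm_regressor(
          petal_width,
          features(sepal_length, sepal_width, petal_length),
          'kernel=rbf,gamma=0.25,nu=0.5'
      ) AS model
      FROM some_table_with_gaps
      WHERE petal_width IS NOT NULL) t
WHERE petal_width IS NULL;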

The complete SQL code is available here:
