How to Build a Classification Model with SQL

An Introduction to PostgresML, an open-source PostgreSQL extension for building, fine-tuning, and deploying ML models using only SQL; a step-by-step guide

Ali.Mz
CodeX
8 min read · Oct 6, 2022


Photo by Wolfgang Hasselmann on Unsplash

Introduction

SQL is a powerful language for storing, manipulating, and retrieving data in databases. When it comes to developing machine learning (ML) models, however, Python is usually the better fit for several reasons: great ML libraries and frameworks, flexibility, platform independence, and a large, active community.

What about SQL? What if we could build and run ML models in the heart of the data warehouse, where the data live, without ever leaving the database? This approach lets us run algorithms efficiently without transferring data to external destinations, a transfer that can drive up the cost of running and maintaining models, particularly with large datasets.

This is the question I had, and it led me to several cloud-based solutions with out-of-the-box support for ML, such as Oracle Autonomous Database, Google BigQuery ML, and AWS Redshift ML. Alongside these cloud services, there is an open-source extension for PostgreSQL, called PostgresML, that currently supports the ML development workflow for classification and regression models using algorithms from the Scikit-learn, XGBoost, and LightGBM libraries. It first came out as a distributed, production-ready solution to eliminate the latency and complexity of extracting data from databases and transforming it into a format useful for ML models.

In this introductory note, I’ll use our friendly iris dataset to walk you through the different components of PostgresML while building a simple multiclass classification model. Along the way, we’ll sharpen our knowledge of machine learning and SQL, two essential skills in the data world.

The remainder of this note will cover the following subjects:

  • PostgresML installation
  • Data ingestion in PostgresML
  • Model training in PostgresML
  • Model prediction and performance evaluation in PostgresML

Installation

Step 1: Install prerequisites

Thanks to containerized services, all we need to start with PostgresML is to set up the following prerequisites:

  • Docker (to run the services for the PostgresML extension, the model performance monitoring dashboard, and the documentation)
  • Docker Compose (to manage multiple containers)

In my previous blog, I briefly described these two systems in case you need a quick refresher.

Step 2: Clone the PostgresML repository

Let’s start by cloning the source code:
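Assuming the official PostgresML repository on GitHub, the clone step looks like this:

```bash
# Clone the PostgresML source code and move into the repo root
git clone https://github.com/postgresml/postgresml.git
cd postgresml
```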

Step 3: Start dockerized services

Next, we’ll use the docker-compose file to build and run dockerized services for us:
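Something along these lines, run from the repository root (the exact service definitions live in the repo’s docker-compose file):

```bash
# Build and start all PostgresML containers in the background
docker-compose up -d
```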

In the background, docker-compose runs three containers:

  • PostgresML-extension — This is the main container. It installs a Python-based ML extension into the PostgreSQL system and exposes the various steps of an ML workflow (e.g. data ingestion and model training) as PostgreSQL functions. At its heart, PostgresML uses Scikit-learn and XGBoost to load datasets, train regression/classification models, fine-tune hyperparameters, and make predictions.
  • PostgresML-Dashboard — The dashboard provides a useful UI for navigating the projects and datasets within the database. It also lets users interact with models through a built-in SQL IDE, which comes in handy, particularly when treating ML workflows as PostgreSQL tables. In this note, I’ll use the dashboard to build out my model and get some predictions.
  • PostgresML-Docs — Lastly, this container serves the PostgresML documentation.

Step 4: Validate your installation

Great! Let’s validate the installation before moving on to the development stage. Once the containers reach their running state, open a browser and log into the local dashboard through port 8000:
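That is, open http://localhost:8000 (assuming the default port mapping in the docker-compose file).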

You should be able to see the main window similar to what is shown below.

A view of the main PostgresML dashboard (image by author)

Quick Start with a multiclass classification problem using the IRIS dataset

Since the focus of this introductory note is on describing the components of PostgresML rather than building a complex ML model, I decided to use our helpful IRIS dataset to frame a simple multiclass classification problem.

Fisher’s Iris data set, also known as the Iris flower data set, is a multivariate data set that includes 50 samples from each of three Iris species (Iris setosa, Iris virginica, and Iris versicolor). From each sample, the length and width of the sepals and petals were measured in centimetres. The goal is to create a discriminative model that separates the species based on a combination of these properties. Wikipedia provides a simple graphical EDA showing the pairwise relationships between these four features.

Import data

The first stage of an ML workflow is sourcing and preparing the dataset. That’s usually the hardest and most time-consuming part of a data science project, but thanks to the preprocessed IRIS dataset, we can skip this step. The full list of sample datasets can be found under the postgresml-extension/example folder.

To load the iris dataset in the dashboard, navigate to the built-in SQL IDE and run the following query:
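A sketch of the load call, using the pgml.load_dataset() helper from the PostgresML extension (double-check the exact form against the docs for your installed version):

```sql
-- Load the bundled iris sample dataset into the pgml schema
SELECT * FROM pgml.load_dataset('iris');
```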

You should see something like this:

Image by author

This means the iris dataset was loaded successfully and is now available within the pgml schema. In the background, PostgresML calls the Scikit-learn library to load the dataset for us. To browse the data, run the following query in the same SQL IDE we used to load it:
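For example, a quick peek at the first few rows (assuming the dataset lands in a table named pgml.iris with the usual four feature columns and a label column):

```sql
-- Columns: sepal_length, sepal_width, petal_length, petal_width, target
SELECT *
FROM pgml.iris
LIMIT 10;
```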

This results in:

Image by author

This is a simple way to explore our dataset and get a better sense of how it’s structured. The dataset contains four predictor columns corresponding to sepal or petal measurements and one response (or label) column identifying the specific type of IRIS plant (Setosa, Versicolour, or Virginica).

Model training

Training a regression or classification model is fairly straightforward in PostgresML and only requires running a single training call. Here is the full list of acceptable arguments:
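The signature below follows the pgml.train() function as documented at the time of writing; treat the defaults as indicative and check the docs for your version:

```sql
pgml.train(
    project_name  TEXT,                    -- a name for the project
    task          TEXT  DEFAULT NULL,      -- 'classification' or 'regression'
    relation_name TEXT  DEFAULT NULL,      -- table or view holding the training data
    y_column_name TEXT  DEFAULT NULL,      -- the target/label column
    algorithm     TEXT  DEFAULT 'linear',  -- e.g. 'xgboost', 'svm', 'linear'
    hyperparams   JSONB DEFAULT '{}',      -- fixed hyperparameters
    search        TEXT  DEFAULT NULL,      -- 'grid' or 'random' hyperparameter search
    search_params JSONB DEFAULT '{}',      -- names and ranges of hyperparameters to tune
    test_size     REAL  DEFAULT 0.25,      -- fraction of data held out for testing
    test_sampling TEXT  DEFAULT 'random'   -- how the test split is sampled
)
```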

In this step, we:

  • Give our project a name.
  • Determine the type of problem, i.e. classification or regression. PostgresML also uses this argument to call the relevant metrics to evaluate the model performance.
  • Provide the name of the table or view within the pgml schema that has already been loaded.
  • Identify the target column. This argument is equivalent to the class label, dependent variable, or output variable.
  • Specify a particular algorithm, e.g. XGBoost, SVM, or Linear.
  • Optionally provide details on how to implement hyperparameter tuning. For example, search_params lets us list the names and ranges of the hyperparameters we want to tune in dictionary format, and with search, we can set the tuning strategy (i.e. grid or random search); see the sketch after this list.
  • Optionally determine test size and test sampling strategy. By default, 25% of the data (specified by test_size) will be randomly sampled to measure the performance of the model.
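For instance, a hypothetical tuning call might look like this, using PostgreSQL’s named-argument syntax; the parameter names follow the signature above, and the search ranges are made up for illustration:

```sql
-- Grid-search a couple of XGBoost hyperparameters (illustrative values)
SELECT * FROM pgml.train(
    'my_iris_classification',
    task          => 'classification',
    relation_name => 'pgml.iris',
    y_column_name => 'target',
    algorithm     => 'xgboost',
    search        => 'grid',
    search_params => '{"max_depth": [2, 4, 6], "n_estimators": [10, 50, 100]}'
);
```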

From here, PostgresML registers a new project in the PostgreSQL database and uses the Python ML libraries to train and validate the model against the target table in the pgml schema. The results are then evaluated against the test dataset using predefined metrics relevant to the particular task. For example, a confusion matrix is used to measure the performance of classification models, while RMSE is used for regression models.

Great! Let’s navigate to the SQL IDE and train a simple linear model on the iris dataset using the following command:
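In its simplest form, the call only needs the project name, task, data relation, and label column; the defaults take care of the rest (a minimal sketch, assuming the pgml.iris table and its target column from the load step):

```sql
SELECT * FROM pgml.train(
    'my_iris_classification',   -- project name
    'classification',           -- task
    'pgml.iris',                -- training data
    'target'                    -- label column
);
```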

Once the model training process is done, you should see the following results in your notebook:

Image by author

By running the train function we:

  • Created a new project called “my_iris_classification”
  • Took a snapshot of the pgml.iris table to make the process reproducible. This is a useful feature because real-world data tends to change frequently, so having snapshots makes it simple to go back in time.
  • Trained a linear classifier on the data contained in the pgml.iris table using feature and target columns.
  • Deployed the model into production. Sometimes the deployment status changes to not-deployed, meaning that the model’s performance evaluation did not satisfy the deployment criteria.

Great! We are now ready to feed our linear classifier unseen data points and make predictions.

Model Prediction

PostgresML uses the most recently deployed model to provide online predictions for new data points. With a single call to the prediction function, we can pass an array of new inputs to our model and receive a prediction.

Let’s test our classifier algorithm to see if it correctly identifies the type of iris flower for a new array of inputs that we know belongs to a Setosa flower (or class 0). Navigate to the IDE and run the following query:
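A sketch of the call, using pgml.predict() with measurements typical of a Setosa sample (the exact values here are illustrative):

```sql
SELECT pgml.predict(
    'my_iris_classification',    -- project whose deployed model to use
    ARRAY[5.1, 3.5, 1.4, 0.2]    -- sepal_length, sepal_width, petal_length, petal_width
) AS prediction;
```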

You should see the following results in your notebook:

Image by author

The prediction column represents the model’s output for the new array of features we just passed into the function. Note that the elements of ARRAY[...] must be in the same order as the features used during training.

We can simply predict multiple random points and compare them to the actual labels in the dataset:
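One way to sketch this is to sample a handful of rows from the training table and predict them in place (again assuming the pgml.iris column names):

```sql
SELECT
    target,                      -- the actual label
    pgml.predict(
        'my_iris_classification',
        ARRAY[sepal_length, sepal_width, petal_length, petal_width]
    ) AS prediction              -- the model's output
FROM pgml.iris
ORDER BY random()
LIMIT 5;
```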

Here is the result I got:

Image by author

Model Performance Evaluation

Great! So far, we’ve trained our model and generated a few random predictions, but before deploying our model into real production, we need to validate the trustworthiness of our model by evaluating it against several metrics.

In the dashboard, if you navigate to the Projects tab and select our newly created project, “my_iris_classification”, you can find our linear classifier evaluated against a commonly used classification metric, the confusion matrix.

Image by author

Conclusion

Thanks for reading! This was an introduction to PostgresML, a PostgreSQL extension with capabilities to build, evaluate, and deploy regression or classification models in SQL. In this introductory note, we reviewed how to use Docker to install PostgresML, import a predefined dataset, train and validate a simple classifier, and evaluate its performance on unseen data points, all within the SQL IDE in the PostgresML dashboard app.

Stay tuned for the next blog, which will focus more on applying PostgresML to real-world problems through some case studies.

Also, if you are a data scientist, analyst, or engineer and would like to implement your regression/classification model in SQL using PostgresML, reach out to me. I’m just a ping away :)
