Greenplum for Data Science Blog Series Part 4: Model deployment on Greenplum using Python with GreenplumPython

Ruxue Zeng
Greenplum Data Clinics
5 min read · Apr 19, 2023

This blog constitutes the fourth part of the “Greenplum for End-to-End Data Science & ML” series. In this article, we introduce GreenplumPython, which enables users to create and execute Greenplum UDFs and UDAs in Python.

Introduction

GreenplumPython enables Data Scientists to code in a familiar Pythonic way while executing data transformations and training Machine Learning models faster and more securely within a single Greenplum platform. Under the hood, function calls on DataFrames are translated into SQL statements by GreenplumPython and automatically sent to Greenplum. As a result, computations take place in parallel within the Greenplum cluster.
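
For example, connecting to the database takes only a couple of lines. A minimal sketch, where the connection URI is a placeholder for your own cluster:

import greenplumpython as gp

# Connect to Greenplum; replace the URI with your cluster's connection string
db = gp.database(uri="postgresql://user@hostname:5432/dbname")

The db object created here is used throughout the rest of this post.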

Dataset overview

This blog uses the dataset cleaned in part 3 of our blog series.

The dataset consists of 1,967 stock market news items from English-language financial news, categorised by sentiment.

Data Fields

  • docid: the unique identifier of financial news
  • original_news: the financial news raw text
  • label: a label corresponding to the class as a ‘positive’ or ‘negative’ string.
  • cleaned_text: the cleaned version of original_news

We load the table into a GreenplumPython DataFrame and preview the first five rows:

sentiment_news = db.create_dataframe(table_name="sentiment_news", schema="ds_demo")
sentiment_news[:5]

NLP — Model Deployment with GreenplumPython

Import packages

Previously, the user had to write all the dependencies within the function body to create a UDF in the database. This often resulted in annoying boilerplate code, especially when importing the same packages into different functions.

GreenplumPython supports references to objects outside the UDF, including functions and modules. This bridges the gap between a UDF and a normal Python function.

We import all packages our UDFs need:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pickle
from typing import List

Model Training function

As in blog 3, we want to create a function that trains an NLP model and returns the trained logistic regression and count-vectoriser models.

  • Create Type

To do so, we first need to create a special class, sentiment_nlp_type, to store the output of the function as binaries:

class sentiment_nlp_type:
    model_logreg: bytes
    model_bow: bytes

  • Create function

We can write a training function that runs via the PL/Python3 extension, as shown below, by specifying the return type.

The @gp.create_column_function decorator converts a Python function into a User-Defined Function (UDF) in the database and automatically aggregates each input column into a list, so the function is applied to whole columns rather than row by row.

Note: functions created this way are only stored temporarily in the database and are dropped when the session ends.

@gp.create_column_function
def save_nlp_models(cleaned_text: List[str], label: List[str]) -> sentiment_nlp_type:
    X_train = cleaned_text
    y_train = label

    # Build a bag-of-words representation of the training texts
    vectorizer = CountVectorizer(min_df=4, stop_words="english")
    X_train = vectorizer.fit_transform(X_train)

    # Train the logistic regression classifier
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)

    # Serialise both models as bytes so they can be stored in the database
    model_logreg = pickle.dumps(logreg)
    model_countvectorizer = pickle.dumps(vectorizer)

    return {"model_logreg": model_logreg,
            "model_bow": model_countvectorizer}

  • Apply function

We can use this function to train a model that learns from the DataFrame sentiment_news:

model_nlp = sentiment_news.apply(
    lambda t: save_nlp_models(t["cleaned_text"], t["label"]),
    expand=True,
)
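
Note that, because the UDF is session-scoped, the models held in model_nlp also disappear when the session ends. To keep them, you can persist the DataFrame to a table. A minimal sketch, assuming the DataFrame.save_as method of recent GreenplumPython releases (the table name sentiment_models is hypothetical):

# Persist the trained models to a permanent table
model_nlp.save_as(
    table_name="sentiment_models",
    column_names=["model_logreg", "model_bow"],
    schema="ds_demo",
)
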
  • Prediction function

In the same way as we created the training function, we will now create the predict function using the @gp.create_function decorator, which creates a UDF that is applied row by row:

@gp.create_function
def predict_sentiment(cleaned_text: str, model_logreg: bytes, model_bow: bytes) -> str:
    # Deserialise the logistic regression model
    logreg = pickle.loads(model_logreg)
    # Deserialise the CountVectorizer
    vectorizer = pickle.loads(model_bow)

    # Vectorise the text and return the predicted class label
    return list(
        logreg.predict(
            vectorizer.transform([cleaned_text])
        )
    )[0]
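
Before calling this UDF in the database, it can be useful to sanity-check the same pickle round trip locally with plain scikit-learn. The sketch below uses a toy corpus rather than the blog's dataset, purely for illustration:

import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training corpus (illustrative only)
texts = ["profit improved strongly", "heavy losses reported",
         "profit rose again", "losses widened further"]
labels = ["positive", "negative", "positive", "negative"]

vectorizer = CountVectorizer()
logreg = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Round-trip both models through pickle, exactly as the UDFs do
logreg_restored = pickle.loads(pickle.dumps(logreg))
vectorizer_restored = pickle.loads(pickle.dumps(vectorizer))

print(logreg_restored.predict(vectorizer_restored.transform(["profit improved"]))[0])
# expected: 'positive'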

Model Deployment

The training process is complete, and we can move on to the next step: model deployment.

In this section, we will use the model trained above, as well as the model trained in the third post of this series and saved in the database, to predict the sentiment of the news item below:

news = "Previously, the company anticipated its operating profit to improve over the same period."

  • Use model_nlp trained above

model_nlp.apply(
    lambda t: predict_sentiment(news, t["model_logreg"], t["model_bow"])
)
  • Use the model in table ds_demo.saved_models trained in blog 3

# Access the model saved in the database
model_table = db.create_dataframe(table_name="saved_models", schema="ds_demo")
model_saved = model_table[lambda t: t["model_name"] == "nlp_sentiment_analysis_bow_logreg"]

# Call the predict function
model_saved.apply(
    lambda t: predict_sentiment(news, t["model_logreg"], t["model_bow"])
)
  • Use pre-defined UDF saved in Greenplum

You can also access the pre-defined predict_sentiment UDF in Greenplum, which was created in blog 3, using gp.function():

# Get the pre-defined UDF
predict_function = gp.function("predict_sentiment")

# Call the predict UDF
model_saved.apply(
    lambda t: predict_function(news, t["model_logreg"], t["model_bow"])
)
  • Predict multiple records from the database

Of course, we can also apply the model to a whole table of news records stored in the database!

To do so, we first need to cross-join the news DataFrame with the model DataFrame:

data_model_join = sentiment_news.cross_join(model_saved)
data_model_join[:1]

Now, we can assign the predicted value to our DataFrame:

data_model_join.assign(
    pred=lambda t: predict_function(t["cleaned_text"], t["model_logreg"], t["model_bow"])
)[["cleaned_text", "label", "pred"]][:5]

Conclusion

In this post, we have illustrated how data scientists can use GreenplumPython to perform parallel computations on Greenplum directly from Python, with the SQL generation handled implicitly. This benefits data scientists who are more comfortable coding in Python than in SQL, and it offers expanded flexibility in managing the deployment and usage of models on Greenplum. If you want to learn more about GreenplumPython, we invite you to check out the documentation page, which provides more details and additional usage examples.
