Revolutionize Your Data Science Experience: Elevating Data Exploration with GreenplumPython 1.1.0’s Advanced Embeddings Search in PostgreSQL and Greenplum
Unlocking New Dimensions in Data Analysis in Database with Enhanced Embeddings Search in Python
GreenplumPython is a Python API that lets users process large datasets, from terabytes to petabytes, directly in Greenplum/PostgreSQL databases using Python, eliminating the need to export data to local machines. Its first GA version was released in February 2023.
GreenplumPython Essentials
This Python library allows data scientists to code in a familiar Pythonic way, similar to the pandas package. It simplifies how data transformations, machine learning model training, and interactive analysis of large datasets are performed, all within the secure confines of the Greenplum platform.
The Evolution to 1.1.0: Embedding Search
Fast forward to 5 January 2024: the release of GreenplumPython 1.1.0 introduces an innovative feature, embedding search. Embeddings, generated by AI models, provide a compact and meaningful representation of objects in a numerical vector space. They capture the semantic relationships between objects.
In the rapidly evolving field of Artificial Intelligence and Machine Learning, embeddings contain an increasing number of properties, or features. These features capture the meaningful information in the data that is needed to understand patterns, correlations, and underlying structures. Managing embeddings has proven to be a complex task, and this complexity underscores how important embeddings are for extracting nuanced patterns from data.
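To make this concrete, here is a minimal, self-contained sketch of how semantic similarity between embedding vectors is typically measured with cosine similarity. The vectors below are toy values invented for illustration, not the output of a real model:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|); 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (real models such as all-MiniLM-L6-v2 produce 384 dimensions).
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.85, 0.15, 0.05, 0.25]
car = [0.1, 0.9, 0.3, 0.0]

# Semantically close objects end up close together in vector space.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
```

This geometric notion of closeness is what makes embeddings useful for search: instead of matching keywords, we rank objects by how near their vectors are to the query's vector.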
The Significance of Embeddings Search
GreenplumPython 1.1.0 addresses the challenge of managing embeddings head-on with the pgvector extension. With this new feature, exploring unstructured data in a database becomes not only accessible but integral to the data science process.
Let’s delve into how this new feature works and why it puts GreenplumPython at the forefront of intelligent data exploration.
WARNING: The feature introduced in this tutorial is currently experimental. It does not have any API stability guarantee.
Package Installation
To ensure you are using the latest released version, let’s install it:
python3 -m pip install --upgrade greenplum-python
Database Connection
With GreenplumPython installed, let’s establish a connection to our Greenplum Database.
import greenplumpython as gp
db = gp.database(uri="postgres://localhost:7000")
Dataset Exploration
For this article, we have a wikimedia dataset containing a copy of Wikimedia wikis, in the form of wikitext source and metadata embedded in XML. This dataset is stored in the table wiki_small, which contains 6 columns (pub_date, label, id, category_id, title and body):
wiki = db.create_dataframe("wiki_small")
wiki.order_by("id", ascending=False)[:2]
The table also contains 100K rows, which we can verify with a count:
import greenplumpython.builtins.functions as F
wiki.apply(lambda _: F.count())
For better efficiency, the generated embeddings are stored separately from the input DataFrame, in a column-oriented manner. The input DataFrame must have a unique key to identify the tuples in the search results. That is why, before creating an embedding index, the DataFrame should be saved in the database and we need to check that the DataFrame has a unique key.
wiki = wiki.check_unique(columns={"id"})
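The guarantee behind this check can be sketched in plain Python: a column is a valid unique key only if no value in it appears twice. The following toy illustration uses invented data and is not the GreenplumPython implementation, which performs the check in the database:

```python
def check_unique(rows, column):
    # A column is a unique key if every value in it occurs exactly once.
    values = [row[column] for row in rows]
    if len(values) != len(set(values)):
        raise ValueError(f"column {column!r} is not a unique key")
    return rows

rows = [
    {"id": 1, "body": "Alan Turing ..."},
    {"id": 2, "body": "Ada Lovelace ..."},
]
check_unique(rows, "id")  # passes: all ids are distinct
```

If the column contained duplicates, a search result could not be unambiguously mapped back to a single source row, which is why the check is required before indexing.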
Embeddings Generation and Indexation
Once we have checked the unique key, we can create an embedding index on the column body using the model all-MiniLM-L6-v2 from HuggingFace with the new embedding module. The pgvector extension supports two types of index, HNSW and IVFFLAT. The default index type of the GreenplumPython embedding index is HNSW; we may change it to IVFFLAT by specifying method="ivfflat":
import greenplumpython.experimental.embedding
wiki = wiki.embedding().create_index(column="body", model_name="all-MiniLM-L6-v2")
wiki.order_by("id", ascending=False)[:2]
Semantic Search by Embeddings
With the embedding index, we can search the body column based on semantic similarity:
wiki.embedding().search(column="body", query="artificial intelligence", top_k=1)
This is going to be very efficient since we don’t need to scan all the data.
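For intuition, here is what a search without an index would have to do: compare the query vector against every stored vector. This plain-Python sketch, using toy 2-dimensional vectors invented for illustration, shows the full-scan top-k search that an HNSW or IVFFLAT index lets the database avoid:

```python
import math

def brute_force_top_k(query, vectors, k=1):
    # Without an index, every row's embedding must be compared to the query.
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    scored = sorted(vectors.items(), key=lambda item: distance(query, item[1]))
    return [doc_id for doc_id, _ in scored[:k]]

# Toy embedding table: document id -> vector.
table = {
    "ai_article": [0.9, 0.1],
    "cooking_article": [0.1, 0.9],
    "ml_article": [0.8, 0.2],
}

print(brute_force_top_k([0.88, 0.12], table, k=1))  # ['ai_article']
```

This scan is linear in the number of rows; an approximate nearest-neighbor index such as HNSW instead navigates a graph of vectors and inspects only a small fraction of them, which is what makes the indexed search fast at the 100K-row scale of our table.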
Embeddings Generation without Indexing
If we just want to generate the embeddings without creating a vector index, we can use the create_embedding() function from the embedding module:
from greenplumpython.experimental.embedding import create_embedding
Since we’re not indexing vectors, the DataFrame doesn’t need to be stored as a table in the database, and we do not need to specify a unique key because the embeddings live in the same DataFrame.
Furthermore, if we want to save the embeddings as a vector type with an embedding dimension (384 for all-MiniLM-L6-v2), so that we can index them later, we need to cast the result to the type gp.type_("vector", modifier=<embedding_dimension>):
wiki.order_by("id", ascending=False)[:2].assign(
    embedding_col=lambda t: (
        gp.type_("vector", modifier=384)(create_embedding(t["body"], "all-MiniLM-L6-v2"))
    ),
)
Cleaning All at Once
For ease of management, the dependencies of the embedding index and the base table are recorded in the database.
As a result, an attempt to drop the base table alone will fail:
%reload_ext sql
%sql postgresql://localhost:7000
%sql DROP TABLE wiki_small
To drop the base table, we also need to drop the embedding index. This can be achieved using CASCADE:
%%sql
DROP TABLE wiki_small CASCADE;
SELECT oid, relname
FROM gp_dist_random('pg_class')
WHERE relname = 'cte_88c394ed9d744c0f8e2fcf78be806a9a';
As we can see, after DROP CASCADE, the embedding index is also dropped on all segments.
Conclusion
In this post, we have illustrated how data scientists can use GreenplumPython to manage embedding generation and search directly in Greenplum with simple Python commands. This benefits data scientists who want to explore large amounts of unstructured data and use Greenplum as a vector database. It also offers expanded flexibility in managing the deployment and usage of open-source models on Greenplum.
If you want to learn more about GreenplumPython, we invite you to check out the documentation page, which provides more details and additional usage examples.