Revolutionize Your Data Science Experience: Elevating Data Exploration with GreenplumPython 1.1.0’s Advanced Embeddings Search in PostgreSQL and Greenplum
Unlocking New Dimensions in Data Analysis in Database with Enhanced Embeddings Search in Python
GreenplumPython is a Python API that lets users process large datasets, from terabytes to petabytes, directly in Greenplum/PostgreSQL databases using Python, eliminating the need to export data to local machines. Its first GA version was released in February 2023.
GreenplumPython Essentials
This Python library allows data scientists to code in a familiar Pythonic way, similar to the pandas package. It simplifies how data transformations, machine learning model training, and interactive analysis of large datasets are performed, all within the secure confines of the Greenplum platform.
The Evolution to 1.1.0: Embedding Search
Fast forward to 5 January 2024: the release of GreenplumPython 1.1.0 introduces an innovative feature, embedding search. Embeddings, generated by AI models, provide a compact and meaningful representation of objects in a numerical vector space. They capture the semantic relationships between objects.
In the rapidly evolving field of Artificial Intelligence and Machine Learning, embeddings contain an increasing number of properties, or features. These features capture the meaningful information in the data that is needed to understand patterns, correlations, and underlying structures. Managing embeddings has proven to be a complex task, and this complexity underscores how important embeddings are for extracting nuanced patterns from data.
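To make this concrete, here is a minimal, self-contained sketch of how semantic similarity between embedding vectors is typically measured with cosine similarity. The vectors below are toy values invented for illustration, not the output of a real model:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|); 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (real models such as all-MiniLM-L6-v2 produce 384 dimensions).
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.85, 0.15, 0.05, 0.25]
car = [0.1, 0.9, 0.3, 0.0]

# Semantically close objects end up close together in vector space.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
```

This geometric notion of closeness is what makes embeddings useful for search: instead of matching keywords, we rank objects by how near their vectors are to the query's vector.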
The Significance of Embeddings Search
GreenplumPython 1.1.0 addresses the challenge of managing embeddings head-on with the pgvector extension. With this new feature, exploring unstructured data in a database becomes not only accessible but integral to the data science process.
Let’s delve into how this new feature works and why it puts GreenplumPython at the forefront of intelligent data exploration.
WARNING: The feature introduced in this tutorial is currently experimental. It does not have any API stability guarantee.
Package Installation
To ensure you are using the latest released version, let’s install it:
python3 -m pip install --upgrade greenplum-python
Database Connection
With GreenplumPython installed, let’s establish a connection to our Greenplum Database.
import greenplumpython as gp
db = gp.database(uri="postgres://localhost:7000")
Dataset Exploration
For this article, we have a wikimedia dataset containing a copy of Wikimedia wikis, in the form of wikitext source and metadata embedded in XML. This dataset is stored in the table wiki_small, which contains 6 columns (pub_date, label, id, category_id, title and body):
wiki = db.create_dataframe("wiki_small")
wiki.order_by("id", ascending=False)[:2]
The table also contains 100K rows, which we can verify with a count:
import greenplumpython.builtins.functions as F
wiki.apply(lambda _: F.count())
For better efficiency, the generated embeddings are stored separately from the input DataFrame, in a column-oriented manner. The input DataFrame must have a unique key to identify the tuples in the search results. That is why, before creating an embedding index, the DataFrame should be saved in the database and we need to check that the DataFrame has a unique key.
wiki = wiki.check_unique(columns={"id"})
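The guarantee behind this check can be sketched in plain Python: a column is a valid unique key only if no value in it appears twice. The following toy illustration uses invented data and is not the GreenplumPython implementation, which performs the check in the database:

```python
def check_unique(rows, column):
    # A column is a unique key if every value in it occurs exactly once.
    values = [row[column] for row in rows]
    if len(values) != len(set(values)):
        raise ValueError(f"column {column!r} is not a unique key")
    return rows

rows = [
    {"id": 1, "body": "Alan Turing ..."},
    {"id": 2, "body": "Ada Lovelace ..."},
]
check_unique(rows, "id")  # passes: all ids are distinct
```

If the column contained duplicates, a search result could not be unambiguously mapped back to a single source row, which is why the check is required before indexing.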
Embeddings Generation and Indexation
Once we have checked the unique key, we can create an embedding index on the column body using the model all-MiniLM-L6-v2 from HuggingFace with the new embedding module. The pgvector extension supports two types of index, HNSW and IVFFLAT. The default index type of the GreenplumPython embedding index is HNSW; we may change it to IVFFLAT by specifying method="ivfflat":
import greenplumpython.experimental.embedding
wiki = wiki.embedding().create_index(column="body", model_name="all-MiniLM-L6-v2")
wiki.order_by("id", ascending=False)[:2]
Semantic Search by Embeddings
With the embedding index, we can search the body column based on semantic similarity:
wiki.embedding().search(column="body", query="artificial intelligence", top_k=1)
This is going to be very efficient since we don’t need to scan all the data.
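For intuition, here is what a search without an index would have to do: compare the query vector against every stored vector. This plain-Python sketch, using toy 2-dimensional vectors invented for illustration, shows the full-scan top-k search that an HNSW or IVFFLAT index lets the database avoid:

```python
import math

def brute_force_top_k(query, vectors, k=1):
    # Without an index, every row's embedding must be compared to the query.
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    scored = sorted(vectors.items(), key=lambda item: distance(query, item[1]))
    return [doc_id for doc_id, _ in scored[:k]]

# Toy embedding table: document id -> vector.
table = {
    "ai_article": [0.9, 0.1],
    "cooking_article": [0.1, 0.9],
    "ml_article": [0.8, 0.2],
}

print(brute_force_top_k([0.88, 0.12], table, k=1))  # ['ai_article']
```

This scan is linear in the number of rows; an approximate nearest-neighbor index such as HNSW instead navigates a graph of vectors and inspects only a small fraction of them, which is what makes the indexed search fast at the 100K-row scale of our table.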
Embeddings Generation without Indexing
If we just want to generate the embeddings without creating a vector index, we can use the create_embedding() function from the embedding module:
from greenplumpython.experimental.embedding import create_embedding
Since we’re not indexing vectors, the DataFrame doesn’t need to be stored as a table in the database, and we do not need to specify a unique key because the embeddings live in the same DataFrame.
Furthermore, if we want to save the embeddings as a vector type with an embedding dimension (384 for all-MiniLM-L6-v2), so that we can index them later, we need to cast the result to the type gp.type_("vector", modifier=<embedding_dimension>):
wiki.order_by("id", ascending=False)[:2].assign(
    embedding_col=lambda t: (
        gp.type_("vector", modifier=384)(create_embedding(t["body"], "all-MiniLM-L6-v2"))
    ),
)
Cleaning All at Once
For ease of management, the dependencies of the embedding index and the base table are recorded in the database.
As a result, an attempt to drop the base table alone will fail:
%reload_ext sql
%sql postgresql://localhost:7000
%sql DROP TABLE wiki_small
To drop the base table, we also need to drop the embedding index. This can be achieved using CASCADE:
%%sql
DROP TABLE wiki_small CASCADE;
SELECT oid, relname
FROM gp_dist_random('pg_class')
WHERE relname = 'cte_88c394ed9d744c0f8e2fcf78be806a9a';
As we can see, after DROP CASCADE, the embedding index is also dropped on all segments.
Conclusion
In this post, we have illustrated how data scientists can use GreenplumPython to manage embedding generation and search directly in Greenplum with simple Python commands. This benefits data scientists who want to explore large amounts of unstructured data and use Greenplum as a vector database. It also offers expanded flexibility in managing the deployment and usage of open-source models on Greenplum.
If you want to learn more about GreenplumPython, we invite you to check out the documentation page, which provides more details and additional usage examples.