The 5D’s of migrating from PostgreSQL to Databricks 🚢
Reflecting on the history of PostgreSQL and its impact on the database industry, I am truly amazed. Many of the advancements we have seen in modern cloud data platforms can be traced back to the original PostgreSQL whitepaper.📜
Having said all of that, we sometimes still need to consolidate our systems and utilise newer technologies.🔮
This blog post aims to show how we can take advantage of the Databricks Data Intelligence Platform features and migrate away from PostgreSQL.
During my career, having migrated many different systems, I have settled upon the 5D’s as a method for performing migrations:
1. Discover: During this step I want to fully understand the source PostgreSQL system(s). I’d like to conduct a thorough assessment of the existing data infrastructure, including data sources, data sizes, and the number of databases, schemas, tables, functions, and procedures, as well as the procedural languages used (see the discovery sketch just after this list). Said in my usual lingo and with a bit of Mockney flair: Perfect Preparation Prevents P*ss Poor Performance!🤣
2. Define: Decide on the migration strategy: Lift & Shift, partial migration, or a full redesign of the data architecture. Due to the inherent differences between Databricks and PostgreSQL, we may not be able to perform a 100% Lift & Shift migration. Define the tables, views, and other database objects required to support the data and reporting requirements, as well as any ETL/ELT processes that will be needed in Databricks for the movement of data.
3. Design: Design a Unity Catalog structure in Databricks that aligns with your business needs and takes advantage of the Databricks Data Intelligence Platform’s capabilities. This step is also where the discovered code is reworked to become compatible with Databricks. Databricks SQL follows the ANSI SQL standard, but PostgreSQL (like other RDBMSs) extends it with procedural features such as PL/pgSQL, so CREATE PROCEDURE is not valid syntax in Databricks. Later in this blog post, I will go through an example of taking a stored procedure from PostgreSQL to Databricks.
4. Deploy: Conduct a pilot phase with a subset of users or data to validate the migration and gather feedback before migrating further code. Rinse & repeat until all PostgreSQL code and data have been migrated to Databricks.
5. Decommission: This step is often overlooked 🤷‍♂️, which is why organisations everywhere carry so much technical debt. To realise your full ROI, you must plan to, and then actually, shut down and remove the legacy PostgreSQL systems.
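To make the Discover step a little more concrete, here is a minimal sketch of the kind of catalog queries I run against a source PostgreSQL server (assuming PostgreSQL 11 or later, where pg_proc.prokind distinguishes procedures from functions, and ignoring the built-in system schemas):
-- How many tables live in each user schema?
SELECT n.nspname AS schema_name, count(*) AS table_count
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'p') -- ordinary and partitioned tables
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
GROUP BY n.nspname
ORDER BY n.nspname;

-- How many functions and procedures exist, and in which procedural languages?
SELECT l.lanname AS language,
       CASE p.prokind WHEN 'p' THEN 'procedure'
                      WHEN 'f' THEN 'function'
                      ELSE 'other' END AS routine_kind,
       count(*) AS routine_count
FROM pg_catalog.pg_proc p
JOIN pg_catalog.pg_language l ON l.oid = p.prolang
JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
GROUP BY l.lanname, routine_kind
ORDER BY l.lanname, routine_kind;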
So, let’s move from theory and on to a more practical scenario. Assume you have access to a PostgreSQL system, and you want to perform discovery on it.
If you already have a PostgreSQL server with lots of data, stored procedures, functions, domains, and other database objects, then you can skim through this section!
Creating a test accounts table:
DROP TABLE IF EXISTS accounts;

CREATE TABLE accounts (
    id INT GENERATED BY DEFAULT AS IDENTITY,
    name VARCHAR(100) NOT NULL,
    balance NUMERIC(15, 2) NOT NULL,
    PRIMARY KEY (id)
);

INSERT INTO accounts (name, balance) VALUES ('Bob', 10000);
INSERT INTO accounts (name, balance) VALUES ('Alice', 10000);

SELECT * FROM accounts;
This will be the output:
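 id | name  | balance
----+-------+----------
  1 | Bob   | 10000.00
  2 | Alice | 10000.00
(2 rows)
(psql-style formatting; the NUMERIC(15, 2) column displays two decimal places, and your client may present it slightly differently.)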
Creating the Stored Procedure:
CREATE OR REPLACE PROCEDURE transfer (sender INT, receiver INT, amount DEC)
LANGUAGE plpgsql
AS $$
BEGIN
    -- subtracting the amount from the sender's account
    UPDATE accounts
    SET balance = balance - amount
    WHERE id = sender;

    -- adding the amount to the receiver's account
    UPDATE accounts
    SET balance = balance + amount
    WHERE id = receiver;

    COMMIT;
END;
$$;
CALL transfer (1, 2, 1000);
SELECT * FROM accounts;
This will be the output:
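 id | name  | balance
----+-------+----------
  1 | Bob   |  9000.00
  2 | Alice | 11000.00
(2 rows)
(Row order may differ without an ORDER BY, but the balances should show 1,000 moved from Bob to Alice.)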
The tools to use are pg_dump for a single database and pg_dumpall for all databases. For my example, I will use a Windows 10 laptop running PostgreSQL 16, but you should be able to follow along on your own systems.
NB: The full syntax is documented here: PostgreSQL: Documentation: 16: pg_dump
To get the SQL DDL out of a single database named sunny_test with a public schema, I would use this syntax:
"C:\Program Files\PostgreSQL\16\bin\pg_dump.exe" --verbose --host=localhost
--port=5432 --username=postgres --schema-only --format=p
--file=C:\schema-only-dump-sunny_test.sql -n "public" sunny_test
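If you need every database on the server, pg_dumpall takes a very similar set of options (note that it always produces plain-text output and has no per-schema switch; the output file name below is just an example):
"C:\Program Files\PostgreSQL\16\bin\pg_dumpall.exe" --verbose --host=localhost
--port=5432 --username=postgres --schema-only
--file=C:\schema-only-dump-all.sql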
Open the generated SQL text file in a text editor, search for “procedure”, and you should discover all your stored procedure code!
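If you would rather query the definitions directly instead of scrolling through a dump file, a small catalog query can extract them too (a sketch, again assuming PostgreSQL 11 or later):
SELECT n.nspname AS schema_name,
       p.proname AS routine_name,
       pg_get_functiondef(p.oid) AS definition
FROM pg_catalog.pg_proc p
JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
WHERE p.prokind = 'p' -- 'p' = procedures; use 'f' for plain functions
  AND n.nspname = 'public';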
Interestingly, pg_dump does not capture the original DDL exactly as we wrote it. Instead, it emits the following CREATE and ALTER statements (equivalent to our original DDL, but a little longer because the identity column’s backing sequence is declared explicitly):
CREATE TABLE public.accounts (
    id integer NOT NULL,
    name character varying(100) NOT NULL,
    balance numeric(15,2) NOT NULL
);

ALTER TABLE public.accounts ALTER COLUMN id ADD GENERATED BY DEFAULT AS IDENTITY (
    SEQUENCE NAME public.accounts_id_seq
    START WITH 1
    INCREMENT BY 1
    NO MINVALUE
    NO MAXVALUE
    CACHE 1
);
Now let’s recreate the PostgreSQL stored procedure in Databricks. Remember I mentioned earlier that we cannot simply use CREATE PROCEDURE in Databricks? So we will use Python as a workaround.
First we will create a test accounts table in Databricks (much like what we did earlier for PostgreSQL):
DROP TABLE IF EXISTS hive_metastore.default.accounts;

CREATE TABLE hive_metastore.default.accounts (
    id BIGINT GENERATED BY DEFAULT AS IDENTITY,
    name VARCHAR(100) NOT NULL,
    balance NUMERIC(15, 2) NOT NULL
);

INSERT INTO hive_metastore.default.accounts (name, balance) VALUES ('Bob', 10000);
INSERT INTO hive_metastore.default.accounts (name, balance) VALUES ('Alice', 10000);

SELECT * FROM hive_metastore.default.accounts;
This will be the output:
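id | name  | balance
---+-------+----------
 1 | Bob   | 10000.00
 2 | Alice | 10000.00
(The id values should be 1 and 2 here, but note that Delta identity columns only guarantee unique, increasing values, not necessarily consecutive ones.)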
The eagle-eyed readers 🦅 will notice that we have used BIGINT as the data type for the id identity column, whereas in PostgreSQL it was just an INT; Databricks identity columns must be of type BIGINT. Please see this blog article for further details on identity columns: Identity Columns in Databricks Lakehouse | Databricks Blog
Now we will use Python to create/define a function to emulate the PostgreSQL Procedure:
%python
def transfer(sender, receiver, amount):
    # Subtract the amount from the sender's account
    query = f"""
        UPDATE hive_metastore.default.accounts
        SET balance = balance - {amount}
        WHERE id = {sender}
    """
    spark.sql(query)

    # Add the amount to the receiver's account
    query = f"""
        UPDATE hive_metastore.default.accounts
        SET balance = balance + {amount}
        WHERE id = {receiver}
    """
    spark.sql(query)
Call the Python function:
%python
transfer(2, 1, 1000)
Prove it worked:
SELECT * FROM hive_metastore.default.accounts;
This will be the output:
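id | name  | balance
---+-------+----------
 1 | Bob   | 11000.00
 2 | Alice |  9000.00
(This time the 1,000 moved from Alice to Bob, since we called transfer(2, 1, 1000) with Alice as the sender.)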
As you can hopefully see from the above, we can take essentially any existing SQL code from PostgreSQL and lift & shift the business logic into a Python function using the f-string capabilities of Python.🐍
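One caveat: f-strings splice the values straight into the SQL text, which is fine for trusted internal parameters but fragile otherwise. If you are on a recent Databricks Runtime (Spark 3.4 and later support parameter markers in spark.sql), a slightly safer sketch of the same function could look like this:
%python
def transfer(sender: int, receiver: int, amount: float) -> None:
    # Subtract the amount from the sender's account
    spark.sql(
        """
        UPDATE hive_metastore.default.accounts
        SET balance = balance - :amount
        WHERE id = :sender
        """,
        args={"amount": amount, "sender": sender},
    )
    # Add the amount to the receiver's account
    spark.sql(
        """
        UPDATE hive_metastore.default.accounts
        SET balance = balance + :amount
        WHERE id = :receiver
        """,
        args={"amount": amount, "receiver": receiver},
    )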
As the code is in a Databricks Notebook, we are able to utilise Databricks Workflows to trigger it (there is a small parameter-passing sketch after the links below). Rather than make this blog post any longer, I would recommend the following 3 blogs (all from Ryan Chynoweth — Medium — Thank You Ryan!) for more details:
- Task Parameters and Values in Databricks Workflows | by Ryan Chynoweth | Medium
- Converting Stored Procedures to Databricks | by Ryan Chynoweth | Medium
- Converting Chained Stored Procedures to Databricks | by Ryan Chynoweth | DBSQL SME Engineering | Medium
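And for completeness, a minimal sketch of that Workflows pattern (the parameter names sender, receiver, and amount are simply the ones I have chosen here): task parameters configured on the job arrive in the notebook as widget values, which are always strings, so cast them before calling the function.
%python
# Defaults only apply when the notebook is run interactively;
# a Workflows task overrides them with its task parameters.
dbutils.widgets.text("sender", "1")
dbutils.widgets.text("receiver", "2")
dbutils.widgets.text("amount", "0")

transfer(
    int(dbutils.widgets.get("sender")),
    int(dbutils.widgets.get("receiver")),
    float(dbutils.widgets.get("amount")),
)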
There’s obviously a lot more to migrating from PostgreSQL to Databricks than I could ever fully cover in a short blog post.
If you are planning a PostgreSQL to Databricks migration, and would like some assistance then please reach out to me here on Medium or on my Mphasis Datalytyx email address: sunny.sharma@datalytyx.com.
Our professional services include accelerators that will take the input PostgreSQL code and automatically convert it into Databricks SQL for you.👋
Please note that the opinions above are my own and not necessarily my employer’s. This blog article is intended to generate discussion and dialogue with the audience. If I have inadvertently hurt your feelings in any way, then I’m sorry.