From MySQL to Greenplum: Lessons and Insights from an ETL Pipeline Migration Journey

Muhammad Ahsan
7 min readJun 15, 2023

--

Migrating from one database system to another can be a challenging endeavor, but it presents an opportunity for organizations to leverage the power and scalability of modern data platforms. I was recently part of a MySQL to GreenplumDB migration project at work, which left me with my fair share of learnings and challenges that I overcame. In this article, we explore the lessons learned and some pitfalls one can avoid when migrating from MySQL (ver 5.7) to Greenplum DB (PostgreSQL 9.4) , a massively parallel processing (MPP) database designed for handling large-scale data workloads. We will delve into the key considerations to gain valuable insights from real-world experience, providing you with a comprehensive guide to ensure a smooth and successful migration process. Let’s begin.

Photo by Compare Fibre on Unsplash

Lesson 1: Impact of DDL commands on underlying Views and Dependencies in Greenplum

Challenge: In MySQL, when we RENAME a particular table, it doesn’t change the table name in the views using that particular table. However, in Greenplum, if you RENAME a table, all the corresponding views will reflect the updated table name as well. This means that if you need to make a new table with the same name as before but with a different structure, you will have to either DROP the current VIEW or use CREATE or REPLACE VIEW statement to make a VIEW using the table with the new structure.

The reason for this difference is how each database is designed to maintain the integrity of the database structure and manage dependencies between tables and views. MySQL allows for more flexibility in managing views by treating a view and its related tables as separate entities. Meanwhile, Greenplum automatically updates the change in tablename in the views to maintain consistency, avoiding any potential issues with the views becoming invalid when referencing non-existent tables.

The same is true when you have to DROP a table. In MySQL, the action of dropping a table is independent of the views associated with that table. However, in Greenplum, you cannot DROP a table as long as there are any dependent views on that table. You will have to DROP the views first and then DROP the table to ensure proper deletion.

Lesson 2: Handling Aggregated Queries: MySQL vs Greenplum DB

Challenge: MySQL allows grouping without explicitly including all non-aggregated columns. Whereas, Greenplum DB requires all non-aggregated columns to be included in the GROUP BY clause. This can lead to errors in Greenplum DB if the query is not structured correctly.

Let’s suppose we have a Sales table containing data related to the sale of three different products:

Table 1 — Image By Author

Description for each column is as follows:

  1. id: Unique identifier for each row
  2. product: Product’s name
  3. category: Identifier for the category each product belongs to.
  4. price: Price for each unit
  5. quantity: Quantity of the product purchased.
  6. sale_date : Date and Time of the transaction

Suppose we want to find the total sales revenue per product.

In Greenplum DB, we can use the following query:

SELECT product, 
SUM(price * quantity) AS total_revenue
FROM sales
GROUP BY product;

The query above not only provides expected result, but also is logically correct.

However, MySQL has a non-compliance mode, which when active, allows the inclusion of columns in SELECT statement that are not part of the GROUP BY clause or an aggregate function. For example:

SELECT product, category, SUM(price * quantity) AS total_revenue
FROM sales
GROUP BY product;

In MySQL, this query would execute without error, giving us the result we need, but the “category” column, which is not part of the GROUP BY clause, would not necessarily represent the correct category for the product. Adding additional records to Table-1 from all Categories for each product will result in the “category” column containing a random category name in our result for each product.

In contrast, if we try to write a query using a similar approach to Greenplum DB, we’ll receive the following error:

image by author

The reason is that we cannot group in the same way in Greenplum DB. It forces the user to follow the SQL compliant practice of using all non-aggregated columns from the query in the GROUP BY clause.

Lesson 3: Navigating Functionality Differences: Absence Of MySQL’s Common Functions in Greenplum

Challenge: Functions that are very commonplace in MySQL are unfortunately not available in Greenplum. For example, a very common function that is used all the time in MySQL is DATEDIFF(date1, date2), which returns the number of days between two date values. It turns out we do not have a similar function in Greenplum DB for this.

As a workaround, one can follow the practice of creating their own custom functions with the underlying logic of the in-built MySQL function we used before. Here’s an example of a custom function for theDATEDIFF() function from MySQL for Greenplum:

CREATE OR REPLACE FUNCTION my_schema.datediff($1 timestamp,$2 timestamp)
RETURNS int4
LANGUAGE sql
VOLATILE
AS $$
SELECT CAST($1 AS date) - CAST($2 AS date) as DateDifference
$$
EXECUTE ON ANY;

Lesson 4: Rights Assignment on new tables

Challenge: Let’s say we have User A and User B. In MySQL, as long as both users have access to a particular schema, both users can access the table created by the other. However, in GreenplumDB, if User A creates a table, he needs to individually assign CRUD rights to USER B’s username for access.

Reason being Greenplum DB following a principle of least privilege, where it does not automatically assign CRUD (Create, Read, Update, Delete) rights to other users for a table created by a particular user. This approach ensures tighter control over data access and security. In contrast, when a user creates a table in MySQL, by default, all other users have access to that table within the same schema. This is because MySQL grants the necessary privileges to all authenticated users by default.

Lesson 5: Simplified Filtering / Deduplication in Greenplum: Leveraging Window Functions for Efficient Data Cleaning

Although it’s been five years now since the introduction of Window Functions in MySQL, if you’re migrating from MySQL’s version of 5.7 or below, you’ll be gladly surprised that data cleaning steps like deduplication can be easily performed using row_number methodology in a few lines of code in Greenplum as compared to the manual techniques in MySQL.

Let’s look at an example by revisiting the Sales table again from earlier (with more records this time):

Image by author

Looking at the data, we can see we have multiple sales of Product A on the date 2022–01–01. Let’s say we want to look at only the last sale for each product on a particular date. In MySQL we’ll have to implement row number technique manually with the help of user defined variables like:

-- Step 1: Initialize variables
SET @rown := 0, @rk1 := '', @rk2 :='';

-- Step 2: Create a temporary table to store deduplicated data
DROP TEMPORARY TABLE IF EXISTS my_schema.dedup_data;
CREATE TEMPORARY TABLE IF NOT EXISTS my_schema.dedup_data
SELECT *,
@rown := CASE WHEN @rk1 = product AND @rk2 = DATE(sale_date) THEN @rown + 1 ELSE 1 END AS rownum,
@rk1 := product AS dummy_rn1,
@rk2 := sale_date AS dummy_rn2
FROM my_schema.sales
ORDER BY product, sale_date DESC;

-- Step 3: Cleanup table with irrelevant columns and records
DELETE FROM dedup_data
WHERE rownum > 1;

ALTER TABLE my_schema.dedup_data
DROP COLUMN dummy_rk1,
DROP COLUMN dummy_rk2;

-- Step 4: Finally Selecting the deduplicated data
SELECT *
FROM my_schema.dedup_data;

In contrast, the same process is a few lines of code in Greenplum using ROW_NUMBER() Window function:

CREATE TEMPORARY TABLE IF NOT EXISTS gp_dedup_data AS (
WITH cte_data AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY product, DATE(sale_date) ORDER BY sale_date DESC) AS rn
FROM sales
)
SELECT *
FROM cte_data
WHERE rn = 1
);

SELECT *
FROM gp_dedup_data;

Lesson 6: A Hidden Nuance: Datetime Conversions in MySQL and Greenplum

In MySQL, a datetime column can unintentionally receive a dummy date when inserting a time-only value in it without raising an error. This results in the value being associated with the default “1970–01–01” date string.

However, in Greenplum, loading such data will not allow implicit conversion as it enforces the validity of the datetime value.

To address this, a workaround involves concatenating a dummy date with the time value and using Greenplum’s to_timestamp() function to convert the concatenated string into a valid datetime value.

These disparities require careful consideration during database migration process.

If you found this article helpful, please let me know your thoughts and feedback. Do you have any personal insights from your own experience with database migration projects? I would love to read about your experience in the comments!

--

--