Don’t let pg_dump slow you down: use pgsync for speed and reliability

Postgres database migration made easy with pgsync

Akhilesh Mishra
KPMG UK Engineering
7 min read · Apr 8, 2024


People have been using pg_dump/pg_restore to migrate Postgres databases for ages. While these tools are reliable and widely used, they come with several limitations.

  • The process involves manual steps, such as creating and restoring backups (a typical flow is sketched below), which can be time-consuming and error-prone.
  • They can be complex to use, especially for large databases with intricate schemas and dependencies.
  • There is a risk of data loss or corruption, especially when handling schema changes or table truncation during the migration process.
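For comparison, a typical manual migration with these tools looks roughly like this (a sketch with placeholder host names; the exact flags depend on your setup):

# dump the source database in custom format (host names are placeholders)
pg_dump -h source-host -U postgres -d mydb -Fc -f mydb.dump
# restore the dump into the target database, which must already exist
pg_restore -h target-host -U postgres -d mydb --clean --if-exists mydb.dump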

Introducing pgsync, a modern and efficient alternative for Postgres migration

pgsync automates many aspects of the migration process, making it faster, safer, and more convenient.

What is pgsync?

pgsync is a tool used to synchronize data between two PostgreSQL databases. It makes syncing data easier by managing schema changes, truncating tables, and reloading data.

It can be useful in various scenarios, such as:

  • Keeping development and production databases in sync
  • Replicating data between different environments
  • Migrating data between databases

Here are some features that make this tool shine:

  • pgsync is lightning fast because it transfers tables in parallel
  • pgsync doesn't require predefined schemas. It can automatically detect schema changes and adapt to them during synchronization.
  • pgsync provides flexibility in choosing the tables to sync, allowing you to specify specific tables or sync entire databases.
  • It is highly secure as it provides built-in methods to prevent sensitive data from leaving the servers.

Installation

pgsync is available as a CLI utility and a Docker image.

The tool can be installed with gem:

gem install pgsync

This can fail with dependency errors. On macOS, you can use the brew install pgsync command instead (shown below).
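For example, with Homebrew installed:

# macOS (Homebrew)
brew install pgsync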

On Ubuntu, install the build dependencies with apt-get and then install pgsync with gem:

sudo apt-get install ruby-dev libpq-dev build-essential -y
sudo gem install pgsync

How to use pgsync?

To use pgsync on the command line, run the below command:

pgsync --init

This will generate a .pgsync.yml file where we define parameters such as the source and destination databases.

The default .pgsync.yml file will look something like this:

# source database URL
# database URLs take the format of:
# postgres://user:password@host:port/dbname
#
# we recommend a command which outputs a database URL
# so sensitive information is not included in this file
#
# we *highly recommend* you use sslmode=verify-full when possible
# see https://ankane.org/postgres-sslmode-explained for more info
from: $(some_command)?sslmode=require
# destination database URL
to: postgres://localhost:5432/myapp_development
# exclude tables
# exclude:
#   - table1
#   - table2
# define groups
# groups:
#   group1:
#     - table1
#     - table2
# sync specific schemas
# schemas:
#   - public
# protect sensitive information
data_rules:
  email: unique_email
  phone: unique_phone
  last_name: random_letter
  birthday: random_date
  encrypted_*: null

You can customize the config as per your needs.

If you want to sync between 2 databases, you can use from and to to configure the source and destination database endpoints.

To prevent unintended changes to production data, the default destination is restricted to localhost or 127.0.0.1. To sync between two database hosts, add to_safe: true in your .pgsync.yml configuration file.

To use encryption during data transfer, you can append the sslmode query parameter to the database URL.

from: postgres://user:password@source-host:port/dbname?sslmode=require
to: postgres://user:password@destination-host:port/dbname
to_safe: true

sslmode supports multiple options; see https://ankane.org/postgres-sslmode-explained for more details.

To sync two databases, we run the pgsync command. We need to make sure that the schema is set up in both databases first.

pgsync provides a schema migration option for this:

# Sync the schema without data (this wipes out existing data)
pgsync --schema-only

Once the schema is set up, run pgsync to sync both databases.

# To Sync the databases
pgsync

# To Sync specific tables
pgsync table1,table2

To exclude specific tables from getting synced, use --exclude:

pgsync --exclude table1,table2

Add this to .pgsync.yml to exclude them by default:

exclude:
  - table1
  - table2

You can also sync the schema before the data in a single command:

# Sync the schema before the data (this wipes out existing data)
pgsync --schema-first

By default, only tables in the search path are synced. You can also sync tables from all schemas, or from specific schemas:


# Sync tables from all schemas or specific schemas (by default, only the search path is synced)
pgsync --all-schemas
# or
pgsync --schemas public,other
# or
pgsync public.table1,other.table2

Use pgsync as a Docker container

# Pull the docker image
docker pull ankane/pgsync:latest

# Create the .pgsync with the required values
vi .pgsync.yml

# Mount the config and run pgsync container
docker run -ti -v "$(pwd)":/app ankane/pgsync --config /app/.pgsync.yml

Demo

To demo the tool, I will use pgsync to sync two Postgres databases hosted on AWS RDS.

I have created 2 RDS Postgres instances with the below details. I will use an EC2 machine running an Ubuntu Linux image in the VPC where the RDS instances were created.

RDS: pg-database-1
db: db1
user: postgres
password: admin1234
port: 5432


RDS: pg-database-2
db: db2
user: postgres
password: admin1234
port: 5432

I will use a Flask-based web app to write to the first database. I will copy the code from my GitHub repo.

I will create the database URLs (DB links) for both of the Postgres DBs. A Postgres database URL should look like this:

postgresql://<user>:<password>@<end point>:5432/<database name>

pg1: postgresql://postgres:admin1234@pg-database-1.cfykukwcw419.ap-south-1.rds.amazonaws.com:5432/db1

pg2: postgresql://postgres:admin1234@pg-database-2.cfykukwcw419.ap-south-1.rds.amazonaws.com:5432/db2

Create one Ubuntu instance with the below specifications:

  • Public IP
  • A security group that allows access to the Postgres databases on port 5432 (an example rule is shown below)
  • AWS access to manage the databases, either with an IAM role attached to the EC2 instance or using AWS access keys

I have created a role with admin access and attached it to the EC2 instance for the demo. Using admin access policies is not recommended; you should use granular permissions.
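As an illustration, an inbound rule for port 5432 could be added with the AWS CLI like this (the security group ID and CIDR below are placeholders for your own values):

# allow inbound Postgres traffic from the VPC CIDR to the security group (placeholder values)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 5432 \
  --cidr 10.0.0.0/16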

Installing pgsync in Ubuntu

# Install postgresql
sudo apt-get install wget ca-certificates
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -

sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
sudo apt-get update
sudo apt-get install postgresql-client -y
# Install ruby dependencies
sudo apt-get install ruby-dev libpq-dev build-essential -y
# Install pgsync with gem
sudo gem install pgsync

# install python if not installed already
sudo apt install python3
# Install pip if not already done
sudo apt install python3-pip

For the demo, I will clone the Flask code from my public GitHub repo:

git clone https://github.com/akhileshmishrabiz/Devops-zero-to-hero
cd Devops-zero-to-hero/project5

I have configured the Flask app in app.py and used app.config['SQLALCHEMY_DATABASE_URI'] to configure the database link.

I will use the first database pg1 with the app.

pg1: postgresql://postgres:admin1234@pg-database-1.cfykukwcw419.ap-south-1.rds.amazonaws.com:5432/db1

# edit the app.py
vi app.py
# or use any other editor like nano, vim to edit the app.py

Install the dependencies with pip and run the Flask app

pip install -r requirements.txt
python3 app.py

This app can be accessed with the public IP of the Ubuntu machine. Make sure you have opened inbound traffic on port 5000 for the Flask app.

Use the public IP with port 5000 to access the application. Here we can create some data by filling out the form. All data will be stored in the database attached to the Flask app.

I have listed data from the database below the form.

Use psql to verify the data on the first database:

# psql -h host_name -p port_number -U username -d database_name
psql -h pg-database-1.cfykukwcw419.ap-south-1.rds.amazonaws.com -p 5432 -U postgres -d db1

Let’s connect to the second database and see if it has anything inside.
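For example, connect with psql and list the tables (the PGPASSWORD value and endpoint are the demo credentials from above):

# list the tables in db2; it should be empty before the sync
PGPASSWORD=admin1234 psql -h pg-database-2.cfykukwcw419.ap-south-1.rds.amazonaws.com -p 5432 -U postgres -d db2 -c '\dt'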

Now we will use the pgsync tool to sync the data from the first database to the second one.

In the project directory, run the below command to initialize pgsync:

pgsync --init

We will edit the file, replace from and to with the source and destination database URLs, and remove everything else.

from: postgresql://postgres:admin1234@pg-database-1.cfykukwcw419.ap-south-1.rds.amazonaws.com:5432/db1
to: postgresql://postgres:admin1234@pg-database-2.cfykukwcw419.ap-south-1.rds.amazonaws.com:5432/db2
to_safe: true

I will append ?sslmode=require to the end of each database URL for better security.

Now we are ready to sync both databases.

pgsync --schema-first

Let’s use psql to verify that the data is synced across both DBs.
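For example, comparing the table lists of the two databases (again using the demo credentials from above) should now show the same tables on both sides:

# the tables from db1 should now also appear in db2
PGPASSWORD=admin1234 psql -h pg-database-1.cfykukwcw419.ap-south-1.rds.amazonaws.com -p 5432 -U postgres -d db1 -c '\dt'
PGPASSWORD=admin1234 psql -h pg-database-2.cfykukwcw419.ap-south-1.rds.amazonaws.com -p 5432 -U postgres -d db2 -c '\dt'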

We can connect the second database to our Flask app by modifying the app.config['SQLALCHEMY_DATABASE_URI'] value in app.py and rerunning the app.

Now we have successfully synced the data across both the Postgres databases using pgsync.

I hope you found this article useful; let me know your thoughts in the comments section.

If you found this blog post useful, clap, follow, and subscribe so you don’t miss my future articles.

Connect with me on LinkedIn: https://www.linkedin.com/in/akhilesh-mishra-0ab886124/


Akhilesh Mishra
KPMG UK Engineering

Self-taught DevOps engineer with expertise in multi-cloud and various DevOps tools. Open for mentorship - https://topmate.io/akhilesh_mishra