Schema Migration from Int to Bigint on a Massive Table in Postgres
In my previous post, I discussed the importance of monitoring tables to ensure we don’t unknowingly hit database limits, as well as different approaches for performing a Primary Key migration, including time estimates based on our dataset.
The main reasons, in order of importance, why we chose Strategy 4 for the migration were:
- Data integrity guarantee
- Ability to do repeated timed dry runs
- Minimal application code changes
- Minimal engineering resources
- Constant time maintenance window
Overview of the process
- We spec’d out all the queries
- We spec’d out the steps for backup
- We tested all the queries on a copy of our staging db
- We performed the migration on our staging db
- We performed the steps for backup on our staging db
- We tested all the queries on a copy of our production db (twice)
- We alerted all the stakeholders and performed the time-bound migration in production
Details of the process
In this section, I’ll walk through each step and the exact queries that were run in this migration.
1. Add a new nullable bigint column without a default specified (1s).
ALTER TABLE big_table ADD COLUMN id_bigint BIGINT NULL;
ALTER TABLE big_table ALTER COLUMN id_bigint DROP DEFAULT;
2. Copy id to id_bigint (estimate 3–4 days). We chose increments of 1000000 mainly because that batch size performed better in terms of incremental speed, disk utilization, and cleanup between transactions.
# "connection" is assumed to be an open DB connection in autocommit mode (e.g. django.db.connection).
# latest_bigtable_id is the max id in big_table at the time the backfill starts.
def copy_bigtable_id(counter):
    while counter <= latest_bigtable_id:
        with connection.cursor() as cursor:
            cursor.execute(
                "UPDATE big_table SET id_bigint = id WHERE id >= %s AND id < %s",
                [counter, counter + 1000000],
            )
        counter += 1000000
        print(counter, latest_bigtable_id)

copy_bigtable_id(0)
3. Once the backfill had caught up with current additions to the table, we selected a day for our planned maintenance window and ran catch-up updates until that day.
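In practice this catch-up is just the same batched update rerun from wherever the previous pass stopped; a single-statement sketch, where <last_copied_id> is a placeholder for the counter value the last run reached:
-- Catch-up pass (sketch): copy id for rows added since the last batch run.
UPDATE big_table SET id_bigint = id
WHERE id >= <last_copied_id> AND id_bigint IS NULL;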
4. Before entering maintenance mode, ensure that the PG replica is caught up with the master.
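One way to verify that the replica is caught up, assuming Postgres 10+ naming (older releases use the pg_xlog_* equivalents):
-- On the master: bytes of WAL the replica still has to replay.
SELECT application_name, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
-- On the replica: time since the last replayed transaction.
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;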
5. Once you are in maintenance mode,
- Detach the PG replica so it can be used as a standalone master in order to recover cleanly if the migration is aborted.
- Make sure there are no queries hitting the DB.
SELECT count(*) FROM pg_stat_activity
WHERE datname = 'your_db_name' AND state = 'active' LIMIT 10;
- Run updates on the last 1000000 records just to be sure that big_table is fully migrated over.
UPDATE big_table SET id_bigint = id
WHERE id >= <latest_bigtable_id> - 1000000;
- Select and remove Foreign Keys to the id column, if any (1s).
SELECT tc.constraint_name, tc.table_name, kcu.column_name, ccu.table_name AS foreign_table_name,
ccu.column_name AS foreign_column_name
FROM information_schema.table_constraints AS tc
JOIN information_schema.key_column_usage AS kcu
ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage AS ccu
ON ccu.constraint_name = tc.constraint_name
WHERE constraint_type = 'FOREIGN KEY' AND ccu.table_name = 'big_table';
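The query above only finds the constraints; a sketch of the drop itself, where other_table and the constraint name are hypothetical stand-ins for the values it returns:
-- Hypothetical names; substitute the table_name and constraint_name found above.
ALTER TABLE other_table DROP CONSTRAINT other_table_big_table_id_fkey;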
- Remove the Primary Key constraint on id.
ALTER TABLE big_table DROP CONSTRAINT big_table_pkey;
- Drop the sequence on id.
DROP SEQUENCE big_table_id_seq CASCADE;
- Rename column id to id_old.
ALTER TABLE big_table RENAME COLUMN id TO id_old;
- Rename column id_bigint to id.
ALTER TABLE big_table RENAME COLUMN id_bigint TO id;
- Add the Primary Key constraint to id (30 minutes).
ALTER TABLE big_table ADD PRIMARY KEY (id);
- Add Foreign Key constraints back to id (35 minutes for 2 constraints; see the sketch at the end of this list).
- Create and add back the sequence to id.
SELECT MAX(id) + 1 FROM big_table;
CREATE SEQUENCE big_table_id_seq START WITH <max_id>;
ALTER TABLE big_table ALTER COLUMN id SET DEFAULT NEXTVAL('big_table_id_seq');
- Drop the not null constraint on id_old (1s).
ALTER TABLE big_table ALTER COLUMN id_old DROP NOT NULL;
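The statements for adding the Foreign Keys back were not listed; a sketch, reusing the hypothetical other_table names from the removal step:
-- Hypothetical names; repeat for each constraint dropped earlier.
ALTER TABLE other_table
    ADD CONSTRAINT other_table_big_table_id_fkey
    FOREIGN KEY (big_table_id) REFERENCES big_table (id);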
6. Validate that the data looks good by running automated tests with fixed inputs and expected outputs.
7. Safely turn off maintenance mode and resume normal operation.
8. Drop column id_old (1s).
ALTER TABLE big_table DROP COLUMN id_old;
Important notes
- Monitor disk usage during the long-running migration with batch updates on big_table, and pause the migration when it hits 70% usage.
- Schedule a daily full vacuum on big_table while disk usage is below 70%, so that dead tuples get cleared out.
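A sketch of the kind of monitoring queries and vacuum command these notes imply:
-- Approximate on-disk size of big_table, including indexes and TOAST.
SELECT pg_size_pretty(pg_total_relation_size('big_table'));
-- Dead tuples accumulated by the batch updates.
SELECT n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables WHERE relname = 'big_table';
-- The scheduled full vacuum; note that VACUUM FULL rewrites the table under an
-- exclusive lock, whereas a plain VACUUM only marks dead tuples as reusable.
VACUUM FULL big_table;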
Issue we encountered and how we fixed it
When we reached the step of adding the Primary Key constraint back to the id column, the db errored almost instantaneously with:
Error: null value in column "id" violates not-null constraint.
This may have been caused by the last few lingering writes on the table as we were transitioning into maintenance mode.
Since the indices were dropped, we couldn’t query the table using id to find the records with null values. Instead, we used the PG replica to identify the records that had not been migrated. To run the updates on the db, we used another column that had an index on it to efficiently identify and migrate those faulty records.
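A sketch of that kind of recovery update, where created_at is a hypothetical indexed column and <window_start> is the cutoff identified from the replica:
-- created_at and <window_start> are hypothetical; at this point the bigint column has
-- already been renamed to id and the old int column to id_old.
UPDATE big_table SET id = id_old
WHERE created_at >= <window_start> AND id IS NULL;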
This can be avoided by checking that the id_bigint column contains no null values before dropping the constraints on id. Once the Primary Key constraint is dropped on the table, there is no way to query the table for null values in a timely manner, since the index no longer exists.
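A check of that kind, run while the Primary Key index on id still exists and limited to the most recently written rows so the index can be used, might look like this (reusing the <latest_bigtable_id> placeholder from the earlier steps):
-- Should return 0; any other result means some rows were never backfilled.
SELECT count(*) FROM big_table
WHERE id >= <latest_bigtable_id> - 1000000 AND id_bigint IS NULL;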
I hope this blog post is helpful for estimating and planning a schema migration on a column in a big table. How I wish Postgres had a built-in mechanism for a migration like this, since this scenario seems common as companies grow.