Reaching the Max Limit for Ids in Postgres
Databases and their limitations can be the source of interesting engineering challenges. One such challenge we recently faced as a team was dealing with the limitation of the `int` datatype on an auto-incrementing primary key for a fast-growing table in Postgres.
In this post, I’ll discuss this issue in detail and share learnings from 4 different strategies we drafted and tested before rectifying this in production.
The maximum number of available ids given the `int` constraint is 2147483647, which is approximately 2.1 billion records. The growth rate of this table was on the order of 80M+ records per month, which gave us approximately 3 to 4 months of runway before catastrophic failure. The irony of this situation is that it indicates steady and healthy growth of our company.
To estimate the monthly growth of ids in the table, we used the following query.
```sql
SELECT COUNT(t.id), DATE_TRUNC('month', t.time_stamp)
FROM big_table AS t
GROUP BY DATE_TRUNC('month', t.time_stamp)
ORDER BY DATE_TRUNC('month', t.time_stamp);
```
The implications of reaching the hard limit in Postgres are:

- Inserts would no longer be permitted (the default behavior in Postgres):

```
ERROR: nextval: reached maximum value of sequence "big_table_id_seq" (2147483647)
```

- If `id` wraparound was set up on the sequence, `id`s could silently roll over, which could massively corrupt data and foreign key relationships that depend on this table.

Neither scenario is ideal.
The main recommendation here is to have a system that monitors the growth of tables and to set up alerts that trigger well in advance.
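As a sketch, such a check could compare the sequence's current value against the `int` maximum (the sequence name here is illustrative; adjust it to your schema):

```sql
-- Rough monitoring query: how much of the int id space has been consumed?
-- Alert well before pct_of_int_max approaches 100.
SELECT last_value,
       ROUND(100.0 * last_value / 2147483647, 2) AS pct_of_int_max
FROM big_table_id_seq;
```

Running a query like this on a schedule and alerting at a conservative threshold (say, 70–80%) buys months of lead time instead of weeks.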
The solution is a migration to convert the datatype of the `id` column in the table from `int` to `bigint`. Sounds simple, but not quite.
In the next few sections, I’ll share details of 4 different strategies that could be used to solve this, how we tested/timed them and why we chose Strategy 4.
Size of the dataset used: 1,540,336,116 rows (~1.5 billion)
Strategy 1
Change the datatype by using `ALTER COLUMN`.
Risks:
- Acquires an ACCESS EXCLUSIVE lock, which can be problematic for big tables.
- Causes table rewrite where Postgres rewrites every single row of the table, which is a very expensive operation (both in terms of disk I/O and wall clock time).
To perform this, we would require a maintenance window where writes to the DB are stopped.
Steps:
1. Identify all foreign key constraints that reference the `id` column of big_table. (1s)
2. Drop all foreign key constraints that reference `id`. (1s)
3. Drop the primary key constraint on `id`. (1s)
4. Run `ALTER TABLE` to change the datatype from `int` to `bigint`. (20 hours and counting; this query was aborted.)
5. Add back the primary key constraint on `id`. (1s)
6. Add back the foreign key constraints that reference `id`. (1s)
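A rough sketch of what these steps could look like in SQL (the child table and constraint names are assumptions for illustration, not our actual schema):

```sql
-- Sketch of Strategy 1; names are illustrative.
BEGIN;
ALTER TABLE child_table DROP CONSTRAINT child_table_big_table_id_fkey;
ALTER TABLE big_table DROP CONSTRAINT big_table_pkey;

-- Takes an ACCESS EXCLUSIVE lock and rewrites every row of the table:
ALTER TABLE big_table ALTER COLUMN id TYPE bigint;

ALTER TABLE big_table ADD PRIMARY KEY (id);
ALTER TABLE child_table
  ADD CONSTRAINT child_table_big_table_id_fkey
  FOREIGN KEY (big_table_id) REFERENCES big_table (id);
COMMIT;
```

The middle `ALTER COLUMN` statement is the expensive one: with ~1.5 billion rows, the full table rewrite is what ran for 20+ hours before we aborted it.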
Strategy 2
Create a new table with the same schema but `id` with type `bigint`, and use `INSERT INTO`.
Risks:
- Bloating the disk usage on the database server.
- Bloating the CPU and memory usage on the database server during the `SELECT` operation.
To perform this, we would require a maintenance window where writes to the DB are stopped.
Steps:
1. Create a table with the same schema as big_table, except for `id`. (< 1s)

```sql
CREATE TABLE big_table2 ( LIKE big_table INCLUDING ALL );
ALTER TABLE big_table2 ALTER COLUMN id TYPE bigint;
```

2. Drop all foreign key constraints that reference `id` in the new table. (1s)
3. Load the data from big_table into big_table2. (10 hours and counting; this query was aborted.)

```sql
INSERT INTO big_table2 SELECT * FROM big_table;
```

4. Add back the foreign key constraints that reference `id`. (1s)
5. Rename table big_table2 to big_table. (2s)
Strategy 3
This strategy is very similar to Strategy 2, except that instead of copying over data in one SQL query, we slowly copy over chunks of records over a longer period of time.
Risks:
- Manually syncing updates on records that were already migrated.
- Manual validation to ensure data integrity.
- Possibility of data corruption due to race conditions.
To perform this, we would not require a maintenance window, but this is an engineering-heavy, high-risk method.
Steps:
1. Create a table with the same schema as big_table, except for `id`. (< 1s)
2. Create a job that copies over records 100,000 at a time. (10–12 days at best)
3. Create a post-save and post-update DB hook that re-writes the data in the new table to keep updates in sync. (10–12 days at best)
4. Switch reads to the new table. Code change and deployment required.
5. Switch writes to the new table. Code change and deployment required.
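One batch of the copy job could be sketched roughly as follows, keyed on the highest `id` copied so far (here `:last_id`, a placeholder the job would track between batches):

```sql
-- Sketch of one iteration of the Strategy 3 copy job.
-- :last_id is the highest id already migrated; the job advances it
-- after each batch and repeats until it catches up with the source table.
INSERT INTO big_table2
SELECT *
FROM big_table
WHERE id > :last_id
ORDER BY id
LIMIT 100000;
```

Keeping batches small bounds the lock time and I/O per iteration, which is what lets this run against a live table — at the cost of the sync hooks and validation work listed in the risks.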
Strategy 4
Add a new column `id_bigint` to the table with type `bigint`. Copy over `id` values to `id_bigint` and then rename the new column to `id` (renaming the existing `id` to `id_old`).
Risks:
- None.
To perform this, we would require a maintenance window where writes to the DB are stopped.
Steps:
1. Add a new nullable column `id_bigint`. (1s)
2. Run a script that backfills `id_bigint` with values from `id` in chunks of 100,000. (3–4 days at best)
3. Once in maintenance mode, drop the primary key constraint, foreign key constraints, and sequence on `id`. (4s)
4. Rename `id` to `id_old` and `id_bigint` to `id`. (1s)
5. Add back the primary key constraint. (40 minutes)
6. Add back the foreign key constraints. (45 minutes)
7. Drop the NOT NULL constraint on `id_old` (the previous primary key). (1s)
After weighing the risks and time involved, Strategy 4 was a clear winner.
In conclusion, monitoring your databases is crucial. If you are closing in on a limit like this, I hope this post gives you a clear idea of the different approaches to evaluate.
In my next blog post, I'll go over the details of Strategy 4: the SQL queries involved, sample validation tests, the backup plan that was prepared, and one issue that was encountered and how to avoid or resolve it.