“DataError: Integer out of range” on Postgres. What to do next?

Ann Molly Paul
Coffee Meets Bagel Engineering
3 min read · Jul 30, 2018

Yes, you read that right.

One fateful evening at 7 p.m., our error rates began to spike. Inserts on our Postgres database were erroring out, and we were shocked to find our error logs flooded with one particular message: DataError: Integer out of range.

We had reached the maximum value of the integer type on a column in a couple of our Postgres tables.
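
For context, a Postgres integer column tops out at 2,147,483,647. A quick audit query along these lines (some_table is just a placeholder for whichever table you want to check) shows how much headroom an id column has left:

-- Sketch of an audit check: how close is this table's id to the int4 ceiling?
SELECT max(id)                                AS current_max_id,
       2147483647 - max(id)                   AS ids_remaining,
       round(100.0 * max(id) / 2147483647, 2) AS percent_used
FROM some_table;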

As a team we were well prepared for this scenario and had taken the necessary steps to avoid it. Ironically, I had even written two blog posts about this:

  1. We audited the Primary Key growth on our Postgres tables, identified which tables were closest to overflowing, and sketched out a plan to prevent it.
  2. We successfully executed that plan.

If we were so well prepared, how did this happen?

The answer, unfortunately, had been staring us in the face. We were so focused on the Primary Key violation that, when we audited our tables and performed the migration, we overlooked the Foreign Key columns in the tables that referenced the ones we migrated.

For the table we migrated, the Primary Key IDs rolled over correctly into the bigint key space, but inserts began to fail on the tables that referenced it, because their Foreign Key columns had not been migrated to bigint.
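
In hindsight, a catalog query along these lines, shown here as a sketch that assumes single-column foreign keys, would have flagged exactly the columns we missed: foreign key columns still typed integer that point at a column already migrated to bigint.

-- Sketch: foreign key columns still typed integer that reference a bigint column.
SELECT con.conrelid::regclass  AS referencing_table,
       att.attname             AS fk_column,
       con.confrelid::regclass AS referenced_table
FROM pg_constraint con
JOIN pg_attribute att
  ON att.attrelid = con.conrelid AND att.attnum = ANY (con.conkey)
JOIN pg_attribute ref_att
  ON ref_att.attrelid = con.confrelid AND ref_att.attnum = ANY (con.confkey)
WHERE con.contype = 'f'
  AND att.atttypid = 'int4'::regtype
  AND ref_att.atttypid = 'int8'::regtype;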

Steps we took:

  1. Freak out, but only for a minute before coordinating a response team and brainstorming solutions.
  2. We immediately brought our services down for maintenance as a precaution against data corruption. We also stopped replication and isolated our database slaves so that they could serve as backups in case our fixes on the master database didn’t work out.
  3. We first attempted the following DDL on the tables that had a foreign key relationship to the migrated table:
ALTER TABLE table_with_fk ALTER COLUMN fk_col TYPE bigint;

As we had learned previously, though, an ALTER like this rewrites the entire table under an exclusive lock and can take several hours or even days. We aborted that query.

We then decided that the safest route with regard to data integrity, and the quickest way to bring our services back, was to create a new column and backfill data into it.

ALTER TABLE table_with_fk RENAME fk_col TO old_fk_col;
ALTER TABLE table_with_fk ADD COLUMN fk_col bigint null;

We also added an index to the new column before backfilling data into it.

CREATE INDEX CONCURRENTLY table_with_fk_new_fk_col ON table_with_fk USING BTREE (fk_col);

Using CONCURRENTLY meant the index could be built without blocking writes, and rows would be indexed as we backfilled the column.

This was sufficient to bring us back up from maintenance mode and continue serving our users.

We used the following script to backfill data, stepping down from the max id on the table in batches of 1,000,000 ids. This ran asynchronously and took a couple of days to finish.

# Uses Django's database connection; run from within the application environment.
from django.db import connection

def copy_fk_ids(counter):
    # Walk backwards from the table's max id, copying 1,000,000 rows per batch.
    while counter >= 0:
        with connection.cursor() as cursor:
            cursor.execute(
                "UPDATE table_with_fk SET fk_col = old_fk_col "
                "WHERE id >= %s AND id < %s",
                [counter - 1000000, counter],
            )
        counter = counter - 1000000
        print(counter)

copy_fk_ids(max_id_on_table_with_fk)

Key takeaways:

  1. Success for us was reacting with appropriate urgency, delegating ownership, and working calmly through intense pressure. We communicated over Zoom and used Google Docs to coordinate tasks.
  2. It was a big learning experience for me, both technically and as a teammate, to see the grace that my manager and co-workers demonstrated during this incident. It is a testament to a healthy engineering team that we can celebrate wins and work through issues together.
