Learning from an incident, hitting the 2,147,483,647 limit at Shipup!

Simon Duvergier
Shipup blog
7 min read · May 24, 2022


How Shipup improved its incident response playbook from one integer overflow incident in our PostgreSQL database

Back in 2021, we had a major incident related to our PostgreSQL database. It was the first and only time that Shipup was unable to import critical customer data into its system for an extended period of time.

At the time, we already had some loose ideas on how to react to an outage, which helped us handle the crisis with minimal impact for our users. Nevertheless, we discovered a few tricks along the way that helped us solidify our incident response playbook.

The goal of this article is to humbly share something useful we learned during this outage.

The issue and its consequences explained

On March 5th 2021, our on-call engineer noticed the following error message:

ActiveRecord::RangeError: PG::NumericValueOutOfRange: ERROR: integer out of range: INSERT INTO "addresses"

We soon realized we were facing the classic problem of a primary key hitting the maximum value of a 32-bit integer column: 2,147,483,647. In other words, we could no longer add any new addresses to Shipup’s system.

Oops …

Now, an address in itself is not that important for our business logic. What really matters is being able to add the orders and trackers shipped by our customers.

Well, the problem is that every order has a shipping address … And every tracker is also shipped to a destination address…

Therefore, we were facing a major issue. Not being able to import addresses meant:

  • No more orders could be created in our system
  • No more trackers could be created in our system

Luckily, all of our existing data was still available for read requests and remained accessible to end users. No new information could be added, but existing information was still displayed.
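In hindsight, this is exactly the kind of ceiling that is cheap to watch for. As an illustration, a minimal check could look like the snippet below, assuming a Rails app with an ActiveRecord model named Address; this is a hedged sketch for monitoring purposes, not our actual code.

    # Minimal sketch: how much room is left before the "addresses" primary
    # key overflows a PostgreSQL integer column? (Assumes an Address model.)
    INT4_MAX = 2_147_483_647

    current_max = Address.maximum(:id) || 0
    remaining   = INT4_MAX - current_max

    puts "addresses.id: #{remaining} ids left " \
         "(#{(100.0 * current_max / INT4_MAX).round(2)}% used)"

Run periodically (or plugged into an alerting system), a check like this turns an outage into a routine ticket.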

Useful learning for any production incident

Before continuing the story of this specific incident, let’s take a step back and try to describe the path of any major production incident.

It often starts with an alert being triggered, hopefully by an automatic alerting system monitoring your infrastructure, or in the worst case by a customer mentioning it to your team. Then you start analyzing the alert. Most of the time it will be a false positive, or a minor edge case not correctly handled that you add to your backlog. In some cases, though, you realize that you are facing a major issue.

That’s when everything starts to get messy:

  • You start communicating in an internal channel of your team: “we are in big trouble”
  • At the same time, the support team may start receiving complaints from users and come to ask what is happening
  • In parallel, your boss comes to see you, asking what the implications are behind this issue. Are there other side effects linked to the situation?
  • If you are working in an open space, you may start hearing people whisper “our infrastructure is down!”, whereas in reality the issue is major but probably only impacts part of your infrastructure.
  • This leads to your boss coming to see you again, asking you to send the rest of the team a clear message on the scope of the issue and which services are impacted

Unfortunately, during all this time spent communicating about the issue, nobody is working on a hotfix or trying to deeply understand its cause and whether it jeopardizes other parts of your infrastructure, which should be the main priority.

Before looking at how we handled this integer overflow incident (the title is of course an exaggeration: 2,147,483,647 incidents would indicate an extremely bad infrastructure), we would like to humbly share what we think is useful when dealing with any major incident:

  • Split the efforts into two parallel actions:
    - One effort focused on fixing the issue asap, deeply understanding the root cause of the issue, and checking if it jeopardizes other parts of your infrastructure.
    - Another effort focused on communicating with the rest of the company.
  • The people handling the fixing effort should be different from the people handling the communication effort

This split is very important! If you want to resolve the issue efficiently, you have to keep the fixing effort separate from the communication effort.

For the communication effort:

  • Start with a general internal message stating that an issue is ongoing, with a short explanation of the currently known impact. Share this message broadly with the impacted teams
  • Then open a dedicated channel with more details and updates about the issue. Link this channel in the general message previously sent so the affected teams can stay informed
  • Your main goal is to communicate the gravity of the situation without creating a feeling of panic
  • Your second goal is to summarize the current discoveries of the team fixing the issue. The idea is to be sure that other teams in your company can adapt their actions, such as answering possible customers’ complaints

For the fixing effort:

  • Write everything down in shared documents rather than keeping the information in your head. This allows the people handling communication to stay up to date without disturbing you too much
  • First, focus on finding a hotfix that could rapidly mitigate the criticality of the situation
  • Then, spend time on an action plan listing the different steps needed to fully fix the issue

Timeline of the incident

To illustrate this learning, here is the complete timeline of the incident and its resolution:

Friday:

  • 10h24: First failed insertion triggers an ActiveRecord::RangeError and sends the error to Sentry
  • 10h24–10h50: Live discussion to assess the situation, do some research, and figure out an action plan
  • 10h58: General message sent in the #general Slack channel. Creation of a dedicated Slack channel to share updates.
  • 11h15: 10-minute Hangouts call with someone from the technical team to answer any questions the other teams may have.
  • 11h51: Action plan ready, with a hotfix to import orders and trackers again without their addresses
  • 13h35: Remove address import from tracker import ⇒ Trackers can now be imported again
  • 17h27: Remove address import from orders import ⇒ Orders can now be imported again

Saturday:

  • 18h-20h: Migration to add the relevant indexes ahead of Monday’s maintenance (the kind of migration sketched right after this timeline)

Monday:

  • 10h-10h30: Discussion to find the best way to reimport the addresses
  • 12h: Decision to schedule a maintenance window at midnight to resolve the incident
  • 0h-0h10: Maintenance window while the migration runs ⇒ Addresses can finally be imported again

Tuesday:

  • 10h: Script to reimport the addresses that were not imported between Friday and Monday
  • 11h: End of the incident
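A quick aside on the Saturday step: on a live production table, indexes are typically added with PostgreSQL’s CONCURRENTLY option so that writes are not blocked while the index builds. The Rails migration below is only an illustration of that technique; the class, table, and column names are hypothetical, not our actual schema.

    # Illustrative sketch only: adding indexes on live tables without
    # blocking writes, using PostgreSQL's CONCURRENTLY option through Rails.
    class AddAddressForeignKeyIndexes < ActiveRecord::Migration[6.1]
      disable_ddl_transaction! # CREATE INDEX CONCURRENTLY cannot run inside a transaction

      def change
        # Hypothetical column names standing in for the "relevant indexes".
        add_index :orders,   :shipping_address_id,    algorithm: :concurrently, if_not_exists: true
        add_index :trackers, :destination_address_id, algorithm: :concurrently, if_not_exists: true
      end
    end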

What is important to notice with this timeline is:

  • Splitting the resolution effort and the communication effort made it possible to have an action plan in less than 90 minutes
  • Thinking first about the quickest way to mitigate the issue with a hotfix (sketched below) made it possible to import data again before the end of the day. This saved us from spending the whole weekend working on the issue
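For readers curious what such a hotfix can look like, here is a minimal sketch of the idea, assuming a Rails-style importer; the class, model, and attribute names are hypothetical, not our actual code. The point is simply to keep creating orders (and trackers) while skipping the address insert that overflows.

    # Hedged sketch of the mitigation idea, not our real importer.
    class OrderImporter
      def import(attributes)
        # Temporarily skipped: INSERT INTO "addresses" fails with
        # PG::NumericValueOutOfRange until the id column becomes a bigint.
        # address = Address.create!(attributes[:shipping_address])

        Order.create!(attributes.except(:shipping_address))
      end
    end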

Fun facts about this incident at Shipup

Before concluding the article, here is a list of fun facts about this incident. I am sure you have already experienced at least one of them, and it will bring back some fun (or not-so-fun) memories :)

  • We were a small team at the time; the next position above me was the CTO, with whom I always paired on deep issues. The week of the incident was his first week of vacation in over a year, so I was on my own for this one.
  • Of course, the incident started on Friday, just before the weekend
  • It is because of this incident that one of our developers took the time to build our API status page in order to enhance outage communication with our customers
  • At the time, there was not much literature on the bigint topic. We found our procedure to switch a column from int to bigint on Stack Overflow (a simplified sketch of that kind of migration appears after this list). One month later, I came across a great article from Buildkite about avoiding integer overflows with zero downtime. Too bad it was not written one month earlier.
  • After the incident, we wrote a postmortem about the issue. I never would have thought I’d be using it for a blog post more than a year later
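To give an idea of what that switch looks like, here is a heavily simplified sketch of the maintenance-window variant, written as a Rails migration. The table, column, and sequence names are illustrative rather than our actual schema, and the Buildkite article mentioned above describes a fully zero-downtime alternative.

    # Simplified sketch of the int -> bigint switch run during the midnight
    # maintenance window; names are illustrative.
    class ChangeAddressIdsToBigint < ActiveRecord::Migration[6.1]
      def up
        # Widening the primary key rewrites the table under an exclusive
        # lock, hence the short planned maintenance window.
        change_column :addresses, :id, :bigint

        # Every foreign key pointing at addresses.id must be widened too.
        change_column :orders,   :shipping_address_id,    :bigint
        change_column :trackers, :destination_address_id, :bigint

        # Only needed if the sequence itself was created with an integer
        # type (PostgreSQL 10+ syntax).
        execute "ALTER SEQUENCE addresses_id_seq AS bigint"
      end
    end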

Conclusion

No matter your skills, technical maturity, or technical stack, major production outages will always arise. They are never fun and often intense, but they are definitely easier to live through when supported by an efficient response playbook.

We hope that sharing this experience can help you solidify your playbook, as it has for us.

Regarding the issue itself, it was not really over after the incident. We also needed to migrate the primary keys of all our other tables to bigint. This had to be done step by step, without any downtime, over the many months that followed. Along the way, we ran into another postmortem related to the famous PostgreSQL AUTOVACUUM process, but I’ll save that for another time.

Interested in starting a new challenge at Shipup? We’re hiring! Check out our job openings at careers.shipup.co
