We were running out of numbers

Dominic Humphries
Published in Adzuna Engineering
5 min read · Jun 29, 2023

One of my daughter’s favourite TV shows, Numberblocks, has a song about how you can reach any number if you keep on adding one. Unfortunately, there are times when you can’t add one.

We have a busy database. Many new things get added every day. As is often the case, one of our busier tables has an auto-incrementing id column which a number of other tables are linked to via foreign keys.

Unfortunately, it’s such a busy table that the ID numbers were getting rather large. Large enough, in fact, that they were approaching the limit of the INTEGER(10) data type we had defined the column with.

In theory, a simple fix: Make the column bigger. There’s even a type for just such a case: BIGINT — twice the bits of the mere INTEGER type.
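
On paper, the change is a single statement. Something like this, with made-up table and column names rather than our real schema:

```sql
-- Unsigned INT tops out at 4,294,967,295; unsigned BIGINT goes up to
-- 18,446,744,073,709,551,615. Table and column names are illustrative.
ALTER TABLE jobs
  MODIFY COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
```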

In practice, however, there were two problems. Firstly, the table is so big that the update would take hours to run. Secondly, you can’t simply change the type of a column that other tables reference via foreign keys: the referencing columns have to match its type, so changing one side on its own breaks the constraint.

You can work around this by dropping the foreign key constraints, updating all the tables to BIGINT, and then re-establishing the foreign keys. But that adds even more time in which the database can’t be updated. We were looking at a day of downtime to make this one change.
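
With the same made-up names, the workaround looks roughly like this:

```sql
-- 1. Drop every foreign key that references the id column.
ALTER TABLE applications DROP FOREIGN KEY fk_applications_job;

-- 2. Widen the referenced column and each referencing column.
ALTER TABLE jobs
  MODIFY COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
ALTER TABLE applications
  MODIFY COLUMN job_id BIGINT UNSIGNED NOT NULL;

-- 3. Put the constraints back.
ALTER TABLE applications
  ADD CONSTRAINT fk_applications_job
  FOREIGN KEY (job_id) REFERENCES jobs (id);
```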

But there’s a way to avoid this! And we’d used it before, when we upgraded the database. Because it’s in AWS RDS, you can simply take a snapshot of the current database, then spin up a new copy based on that snapshot. Set it to replicate from the original, and you have a clone that’s fully up to date.
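
The commands below are a sketch rather than our actual runbook: the hostname, credentials and binlog coordinates are all placeholders. But the shape of it is to restore a new cluster from the snapshot (via the console, or the aws rds restore-db-cluster-from-snapshot command), then point the clone at the original using the RDS replication procedures:

```sql
-- On the clone: point replication at the original and start it.
-- Host, credentials and binlog coordinates are placeholders.
CALL mysql.rds_set_external_master(
  'original-db.example.internal', 3306,
  'repl_user', 'repl_password',
  'mysql-bin-changelog.000123', 4,
  0   -- no SSL in this sketch
);
CALL mysql.rds_start_replication;
```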

Then you make your changes, taking as long as you need, and when you’re ready, you switch your software to talk to the new version of the database. No downtime. Even allows for rollback. Awesome, right?

Right?

Well, yeah, in theory. In practice, we hit so many bumps in the road that it took the best part of a year to actually make this happen.

Our original plan was to let the clone catch up with the original; stop replication; update to bigint; then re-enable replication and let it catch back up.
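
In outline, that plan is only a handful of statements on the clone, the mysql.rds_* procedures being the RDS way to stop and start replication:

```sql
-- On the clone, once it has caught up with the original:
CALL mysql.rds_stop_replication;

-- ...run the BIGINT migration here: drop the foreign keys, widen the
-- columns, re-add the constraints (as in the earlier sketch)...

CALL mysql.rds_start_replication;
```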

It didn’t work. Replication couldn’t restart. Then I discovered that slave_type_conversions needed to be set to ALL_NON_LOSSY in order to actually allow INTs to be converted to BIGINTs.
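
On a stock MySQL replica that’s a one-line change; on RDS and Aurora the same value goes into a parameter group, but the effect is what this sketch shows:

```sql
-- Allow the replica to apply row events where the column has been
-- widened (INT on the source, BIGINT on the replica).
SET GLOBAL slave_type_conversions = 'ALL_NON_LOSSY';

-- Confirm what's in effect:
SHOW GLOBAL VARIABLES LIKE 'slave_type_conversions';
```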

We started again. It still didn’t work. Replication still couldn’t restart.

“It’s the MySQL version,” we were assured. “You need to upgrade.”

Oh well. Upgrading from 5.6 to 5.7 was already on the roadmap. Let’s do that a bit earlier than planned.

We upgraded. Then tried again. Still no. Replication couldn’t restart.

Out of ideas, we consulted the nice folks at Percona. They very helpfully found the problem: despite the fact that we were copying unsigned integer IDs to unsigned bigint IDs, the binlog itself was treating the data as signed. And a signed value doesn’t necessarily stay the same number when you change the size of the data type.
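
To make that concrete with a made-up ID: 4,000,000,000 fits comfortably in an unsigned INT, but its top bit is set, so the same bits read back as a signed 32-bit number mean something else entirely:

```sql
-- 4,000,000,000 is a valid unsigned INT (max 4,294,967,295), but its
-- bit pattern, reinterpreted as signed 32-bit, wraps around to
-- 4,000,000,000 - 2^32:
SELECT 4000000000 - 4294967296 AS reinterpreted_as_signed;  -- -294967296
```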

No problem. We just had to add ALL_UNSIGNED to slave_type_conversions alongside the non-lossy setting. That would fix everything.
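
In other words, the variable needed both flags at once (on RDS, again, this lives in a parameter group rather than a SET GLOBAL):

```sql
-- Non-lossy widening, plus treat the converted values as unsigned.
SET GLOBAL slave_type_conversions = 'ALL_NON_LOSSY,ALL_UNSIGNED';
```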

At least, it would if you could set that value. I checked and rechecked the AWS Console’s allowed values. For non-Aurora MySQL 5.7, we could set it as needed. For Aurora MySQL 8, we could set it as needed. For our specific Aurora MySQL 5.7, forget it. It only allows the lossy/non-lossy setting.

Having hit another wall, we contacted our AWS account team and asked them “WTF?!?” and they helpfully got back to us to acknowledge that, yes, we couldn’t set it; yes, we should be able to; no, they didn’t know when this would be fixed.

Sigh.

But! They had a (sort of) workaround — the AWS UI doesn’t allow the setting to be applied, but AWS support can go in and assign it themselves. A support ticket was created, and a few days later, the setting was in place! We were go!

We halted replication. We migrated to bigint. We put the foreign keys back. Breathlessly, I re-enabled replication.

And it caught up! We finally had a bigint version of the database! Woo!

We took the plunge, and switched the website over to the new database. Everything was working. Everything was looking good.

For about two minutes. Then a creeping sense of dread began.

“Shouldn’t we be seeing new entries by now?” I asked nervously.

A bit more searching and we found the error. The conversion to bigint had dropped the “auto increment” on the ID field. Argh.

Could we quickly re-add it? No, because (again) of the foreign key constraint. Argh again.
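
For anyone else attempting this: MODIFY COLUMN replaces the entire column definition, so any attribute you don’t restate is silently dropped. With made-up names again, the trap looks roughly like this:

```sql
-- Restating the column without AUTO_INCREMENT silently drops it:
ALTER TABLE jobs MODIFY COLUMN id BIGINT UNSIGNED NOT NULL;

-- The full definition has to come along for the ride:
ALTER TABLE jobs MODIFY COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
```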

We reverted to the old database.

We sadly blew away the bigint database, and made a fresh copy. We went back to AWS support and asked them, again, to make the manual change to the parameter group. We migrated the clone to bigint, this time making absolutely sure we didn’t drop anything in the process.

Replication resumed. Everything looked hopeful. We threw the switch again, and migrated to the new database.

And it worked! New entries created successfully, everything was looking good. Finally, we had done it!

Now we just had to switch over the replicas. Here’s another little complication to the process that I haven’t mentioned yet: we have our main writable RDS instance in one region and read-only replicas in other regions, which gives us better performance. However, we couldn’t have the new replicas ready to go in advance. Having the cloned database replicate from the original while also replicating out to its own read-only replicas was, we had discovered at the start, a non-starter: replication just lagged further and further behind.

So instead, we had to put the new writable version of the database live, confirm it was good, and only then create the new replicas from it.

This takes a couple of hours.

But once our replicas were up, we could update our config once again to put them into use, and the website started to get updates again. Everything was looking good.

Except…

“We seem to be getting more 500s on the website than usual”

That’s weird, not what I’d expect from this change.

“Looks like it’s caused by OAuth”

That’s even weirder. A change to the type of an ID column shouldn’t break logins.

“The User table is too lagged to work”

Ah! The two-hour lag caused by the time it took to create the replicas! That explained it. Nothing to do with the database changes themselves, just the inevitable lag they had lumbered us with.
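
If you’re watching this happen, a binlog replica will tell you how far behind it is (Aurora also surfaces replica lag as a CloudWatch metric):

```sql
-- Seconds_Behind_Master in the output shows the current lag;
-- 0 (with the IO and SQL threads running) means it has caught up.
SHOW SLAVE STATUS\G
```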

All we had to do was wait for the replicas to catch up. Which they duly did: errors returned to normal levels, and all was, finally, well. We had migrated!

And so my work with the database was done. I now move on to the tedious process of updating the links to, and documentation about, our database, clearing out the old versions, and so on.

And writing a blog post about the process, just in case anybody else is hitting some of these snags like we did…
