Server failure post-mortem

A number of FATdrop promos were offline on Thursday and Friday. This is an explanation of what happened.

Alex Stacey
Feedback Loop
2 min read · Feb 19, 2018

As part of our hardware monitoring process, we regularly scan our servers’ hard drives for errors. Last week we were alerted to early signs of wear on one of the disks, so we arranged for the disk to be swapped by the company that runs the data centre. All hard drives have to be replaced eventually and this is normal practice.

Our server disks are organised in RAID arrays, which means that all of the data is stored redundantly on more than one disk at all times. That makes it possible to hot swap a disk, i.e. remove one disk, insert a new one, and rebuild its data from the remaining disks.
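For readers curious how a rebuild can work, here is a minimal sketch. It assumes a single-parity layout (in the style of RAID 5), which is an illustration only and not necessarily the RAID level our servers use. The disk contents and the `xor_blocks` helper are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks, each holding one block of the same stripe.
data_disks = [b"audio001", b"image002", b"promo003"]

# The parity disk stores the XOR of all the data blocks.
parity = xor_blocks(data_disks)

# Simulate losing the second disk: XOR-ing the surviving data blocks
# with the parity block reproduces the missing block.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data_disks[1]
```

In a single-parity scheme like this one, the rebuild only works while every other disk is present. If a second disk drops out at the same time, there is too little information left to recover the missing data.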

Unfortunately, when the swap was carried out on Thursday afternoon, the engineer pulled out the wrong drive, took about 30 seconds to realise the error, and then reinserted it into the server. This disrupted the organisation of the RAID array, and we were left with no way to rebuild the data held on the other drives. That meant that we couldn’t access data (mainly audio and images) from about a third of our clients’ accounts.

After trying a number of different solutions to rebuild the array, the only remaining option was to restore the drives from our nightly backups. (We save nightly backups of all data to servers in a different physical location in order to protect against worst case scenarios, and this turned out to be one of them.) We started reloading the data from the nightly backup. Because of the sheer amount of data, this was a slow process: it took approximately 28 hours before the last files were restored.

This was human error. It was outside of FATdrop’s direct control, but we take full responsibility. It doesn’t match the high level of service that we aim to offer, and we apologise to our clients whose promos were affected and to the recipients who couldn’t access their promos on Friday.

We are working with our server hosts to make sure that more care is taken to avoid errors like this in the future, and we will also be working to minimise the time needed to restore data from backups if we need to do that again.

Thanks for your patience and understanding. Feel free to drop us a line via Twitter or email if you have any unresolved issues.
