Macrostrat Server Failure Postmortem

On Monday, February 13 at approximately 9:30 PM CST, we became aware of a system failure on the production machine that powers macrostrat.org, geodeepdive.org, and rockd.org. We quickly realized that the situation was serious and went to Weeks Hall to manually reboot the machine, which had become completely unresponsive. The system failed to reboot, and we immediately began migrating our data and applications to a virtual machine on UW-Madison’s Campus Computing Infrastructure (CCI).

Background

The machine in question is a 2008 Mac Pro that had been in continual service as our production server since late 2008. Until December 2016, it had effectively run 24/7 without incident. In November 2016, however, the RAID battery died. We put off addressing the issue for about a month, as the server was also connected to a UPS that would gracefully shut the machine down in the event of a power failure. Once we procured a new RAID battery, we decided to take the opportunity to also replace the nine-year-old spinning hard drives. The machine had been purchased with enterprise-grade drives, and we replaced them with Western Digital Black drives.

In late December the system was upgraded and everything appeared to be functioning normally.

The Failure

On Monday, February 13 at approximately 9 AM CST, we noticed that one of the new hard drives had experienced a disk write failure in mid-January. At around 11 AM, we rebooted the machine and attempted to restore the RAID. The system indicated that the drive and the RAID were healthy, though the RAID was still in the process of being rebuilt.

At approximately 9:30 PM CST, it was brought to our attention that our various APIs were not responding, even though the websites were still functioning, albeit slowly. Upon logging into the machine, it was apparent that things were awry. For example, running top returned “bash: /usr/bin/top: Operation not permitted”, and running sudo lsof -i :5000 to check the status of an API returned “Bus error: 10”. A quick search of the internet revealed little, but this post hinted at what we feared:

“FYI, to anyone who comes across this, my problem ended being hardware related. Not exactly sure what, and it was a work computer so I was just given a new one rather than troubleshooting the old one.”

The Recovery

Fortunately for us, we had been in the process of vetting CCI for our computing needs and, as an exercise, had largely configured a VPS there as a development environment. As soon as it became apparent that we were dealing with a hardware failure, we began staging the CCI machine as a replacement. Because we had already installed all of the necessary software dependencies on this machine, including Postgres, MariaDB, PostGIS, NodeJS, etc., the majority of the process involved cloning and setting up applications, as well as properly configuring the webserver.

Around 11 PM CST, we began reconfiguring the DNS records for macrostrat.org, geodeepdive.org, and rockd.org to point to the new virtual machine. Around 12 AM CST, the DNS record for geodeepdive.org had propagated, but the others had not. At that point, we decided to call it a night and finish troubleshooting in the morning.
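
For reference, a quick way to see whether a record has propagated is to compare the answer from a public resolver with the answer from your local resolver. A minimal check looks something like this (the resolver address is just an example):

    # Compare what a public resolver and the local resolver return for each domain
    for domain in macrostrat.org geodeepdive.org rockd.org; do
        echo "== $domain =="
        dig +short "$domain" @8.8.8.8   # Google's public resolver
        dig +short "$domain"            # whatever resolver this machine uses
    done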

Around 8 AM CST on February 14, we noticed that the DNS records for all domains had successfully propagated, but the services were only accessible from within the UW campus network. After discovering that we were not able to alter the firewall settings ourselves, we emailed our contact at Campus Computing Infrastructure at around 8:30 AM, and by approximately 9:15 AM CST all of our services were restored. Depending on your network’s DNS caching, you may not have been able to access our domains for up to 24 hours after the campus firewall was properly configured.

What went right

For a small (two-person) operation consisting of a geologist and a mapper, things went pretty smoothly. First off, we had numerous recent backups of all data. The primary Macrostrat database is continuously replicated to our development machine using MariaDB replication, our geologic maps database is backed up to our development server every time a few maps are added, and Rockd is backed up on a near-daily basis. All application code, including our nginx configuration, is hosted on GitHub, so restoration simply consisted of cloning all of our application repos and running any install scripts (like npm install).
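
For a sense of what that looked like, restoring a single application followed roughly the pattern below; the repository URL, paths, and config file names here are illustrative, not our exact setup:

    # Rough shape of restoring one application on the new machine (names/paths illustrative)
    cd /srv
    git clone https://github.com/UW-Macrostrat/macrostrat-api.git
    cd macrostrat-api
    npm install                     # reinstall Node dependencies from package.json
    cp config.sample.js config.js   # hypothetical config template; fill in credentials and API keys by hand
    node server.js                  # or start under whatever process manager supervises the API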

Fortunately, the hardest and most time-consuming part had been taken care of weeks earlier while prototyping CCI’s infrastructure. We had spent most of a day installing the proper versions of all software dependencies, which allowed us to get things up and running quickly when it came time to switch machines. Had we not done this, the migration would have been much slower and an all-nighter. And even if we had not had a CCI machine effectively on standby, we could have used our development machine as a stand-in while finding a suitable replacement.

What went wrong

Our backup process for the geologic maps database and Rockd is suboptimal because of the manual intervention required: if we forget to back up, there is no backup. The only advantage of this approach is that we know it works, because there are no silent errors being thrown to a log we never check, as there could be if it were scripted as a cron job (see: the GitLab outage postmortem). Because we knew in the morning that there might be something very wrong with the system, we immediately did a backup of Rockd, which resulted in us losing only one checkin when service was restored (sorry, Ron!).
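
If we do eventually script these backups with cron, the failure mode to avoid is exactly that silent one. A minimal sketch of the kind of wrapper we have in mind, assuming a Postgres dump and with paths, the database name, and the notification address all as placeholders:

    #!/bin/bash
    # Nightly database backup sketch: dump the database and complain loudly on failure.
    # Paths, the database name, and the mail address are placeholders.
    set -uo pipefail
    BACKUP="/backups/rockd_$(date +%F).sql.gz"

    if pg_dump -U rockd rockd | gzip > "$BACKUP"; then
        echo "backup ok: $BACKUP ($(du -h "$BACKUP" | cut -f1))"
    else
        # Surface the failure instead of letting it rot in a log nobody reads.
        echo "rockd backup FAILED on $(hostname) at $(date)" \
            | mail -s "rockd backup failure" admin@example.org
        exit 1
    fi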

The largest data loss we experienced was two months’ worth of user-uploaded avatars on Rockd. While we were regularly backing up the database and checkin photos, our backup script accidentally omitted the avatar directory. Because we did a full volume backup of the server when we installed the new hard drives in December, we were able to restore avatars to their late December 2016 state.

We also discovered that it would be beneficial to back up all configuration files for our APIs, as doing so makes restoring service much easier. While it is not very difficult to reconfigure things like database connection settings, it is more time-consuming to go fetch the various API keys and other parameters that allow our services to function.

This migration of applications also brought some sloppy handling of dependencies to light. On our production machine, we usually git pull changes to our APIs and restart the service. This works in the short term, but in the long run it becomes problematic because you end up running whatever dependency versions were installed when the application was first deployed. Upon reinstalling the Macrostrat API on the new machine, bizarre errors we had never seen before began to appear, and we eventually discovered that slightly different versions of dependencies were affecting the application code. This required a few changes to the API to make sure everything functioned properly. Sorry, Flyover Country and Mancos!

On the topic of applications Macrostrat supports, we also forgot to migrate a couple of deprecated databases that certain API routes still rely on. While these were not a high priority, we did not have backups of them ready to move to a production system.

While significantly less important, we also did not have backups of our database configuration files or error and access logs for our databases and webserver.

Lessons Learned

We will continue to do daily dumps of the Rockd database, but in order to prevent the kind of data loss we experienced with that one checkin, we will be looking into streaming replication for the Rockd production database. While this is not a perfect solution, it gives us a much better chance of restoring the most recent state of the database.
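
For the curious, setting up a basic Postgres streaming replica looks roughly like the following; the hostname, user, and data directory below are placeholders, and the exact settings vary by Postgres version:

    # --- On the primary (postgresql.conf), for a 9.x-era server ---
    #   wal_level = hot_standby
    #   max_wal_senders = 3
    # plus a 'replication' entry for the standby host in pg_hba.conf

    # --- On the standby: seed the data directory from the primary ---
    pg_basebackup -h primary.example.org -U replicator \
        -D /var/lib/postgresql/9.5/main -X stream -R
    # -R writes a minimal recovery.conf (standby_mode = 'on' plus primary_conninfo),
    # so the server starts streaming from the primary as soon as it comes up.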

The loss of avatars was very regrettable, and we plan to adjust our backup scripts to simply back up the entire application directory rather than selected folders. To all those who lost their avatars, we sincerely apologize for the inconvenience.

Additionally, we plan to back up our nginx SSL certificate directory, database configuration files (postgresql.conf and my.cnf), and log files. While not strictly necessary, having backups of these files would have made restoring service quicker.
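
A simple nightly archive would cover these; something along the lines of the sketch below, where the paths are typical Linux defaults and may differ from what a given machine actually uses:

    # Archive web server and database configuration (plus certificates and logs)
    # alongside the regular data backups. Paths are illustrative defaults.
    tar czf "/backups/config_$(date +%F).tar.gz" \
        /etc/nginx \
        /etc/postgresql \
        /etc/mysql/my.cnf \
        /var/log/nginx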

While developing applications, especially APIs, we will get in the habit of reinstalling all dependencies (rm -rf node_modules && npm install) every time we commit to the master branch and pull into production. This should help prevent issues arising from mismatched dependencies across development and production environments.
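
Concretely, a production deploy will look more like the sketch below; the directory and the restart command are placeholders for however each service is actually supervised:

    # Deploy an API update with a clean dependency install (paths/commands illustrative)
    cd /srv/macrostrat-api
    git pull origin master
    rm -rf node_modules
    npm install
    # restart the service under whatever supervises it, e.g.:
    # forever restart server.js    or    pm2 restart macrostrat-api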

Why did the server die?

We are not yet sure what caused this catastrophic failure, but we suspect the RAID card died: when we booted from an external drive, none of the internal hard drives were recognized.

The bright side

The good news is that we no longer maintain our own hardware. What this means for users is better uptime, greater reliability and redundancy, and even faster services. If we have an influx of traffic, we can scale vertically and/or horizontally with a couple of clicks to handle it; in the past, we simply would not have been able to handle traffic beyond a certain threshold. If you’re still reading and haven’t realized it yet, we are not professional systems administrators: fumbling around in the dark and Googling problems is our modus operandi, and the more separation we can put between ourselves and hardware, the better. CCI allows us to rely more on professionals who know what they are doing, and lets us focus on what we are good at: building applications and services for science.

Thank you for bearing with us! If you experience any data or performance issues with Macrostrat, Rockd, or GeoDeepDive services, please reach out to us and we will do our best to resolve them as quickly as possible.
