How becoming a pilot made me a better engineer

Maciej Kozlowski
Published in Inside League
6 min read · Oct 11, 2022

A retrospective of a data migration at League

Photo by Chris Leipelt on Unsplash

Flying is often touted as the safest method of transportation. Many small planes that we continue to use for general aviation are over 50 years old. When I decided to pursue my childhood dream of being a pilot a few years ago, I didn’t expect to learn the reasons behind these statements. I certainly never thought I would be able to apply this new knowledge to software.

A few weeks ago, our team at League successfully migrated a dataset between two stores hosted using Google’s Cloud Healthcare service. The migration required five hours of downtime, during which some of our features weren’t usable. We needed to stop several of our services and revoke various permissions in order to prevent any writes that would impact data integrity. When restoring services, we needed to throttle some data ingestion since we had accumulated a small backlog during the five-hour window. I believe that a large part of why this process went so smoothly was due to key factors that also keep aviation so safe and reliable.

Checklists

A selection from a Cessna 172N checklist [source]

My favorite takeaway from aviation is the use of checklists. Pilots use checklists for nearly everything that they do: inspecting the plane, starting the engine, preparing for takeoff and landing. A checklist is simply a list of instructions or items to verify. Checklists are a simple yet immensely useful tool for staying organized and on track, and for ensuring that all actions are carried out. By writing out all the required steps, we can reduce our mental load and focus on one item at a time. To ensure that the steps are followed, the point and call method can be used to be more mindful of the actions taken. We should of course strive to automate where possible and reasonable, but I would argue that’s just creating a fancy checklist for the computer.

A pilot verifying the runway by using the point and call method. Credit: Steven Foltz

When planning the migration, we prepared a checklist outlining all the steps that needed to be carried out. After testing with our staging data, we were able to include time estimates, which gave us a rough schedule in addition to a checklist. Each item on our checklist included links to the relevant pull requests and resources, which made the process even easier. Would it have been possible to automate this process? Certainly! Was it worth doing? Absolutely not! In addition to the time it would have taken to write the script, it would, like any software, have also required testing and error handling. Since this was a one-time event, it simply wasn’t worth investing the effort. One benefit of automation is that the process becomes less error-prone; however, when following a detailed checklist, the exact same results can be achieved manually.

Part of the checklist used for the data migration
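
To illustrate how time estimates turn a checklist into a rough schedule, here is a minimal Python sketch. It is not the checklist we used: the step names, durations and start time are hypothetical and only show the idea of accumulating estimates measured during a staging rehearsal.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Step:
    description: str
    estimate_minutes: int  # hypothetical value, e.g. measured on staging
    link: str = ""         # pull request or resource for this step


def build_schedule(steps, start):
    """Return ([(planned_start, step), ...], estimated_finish)."""
    schedule = []
    current = start
    for step in steps:
        schedule.append((current, step))
        current += timedelta(minutes=step.estimate_minutes)
    return schedule, current


# Hypothetical steps and durations, for illustration only.
steps = [
    Step("Stop services that write to the old store", 15),
    Step("Revoke write permissions", 10),
    Step("Export data from the old store", 90),
    Step("Import data into the new store", 120),
    Step("Verify the imported data", 30),
    Step("Restore permissions and restart services", 20),
]

schedule, finish = build_schedule(steps, datetime(2022, 1, 1, 22, 0))
for planned_start, step in schedule:
    print(f"[ ] {planned_start:%H:%M}  {step.description} "
          f"(~{step.estimate_minutes} min)")
print(f"Estimated finish: {finish:%H:%M}")
```

The output is still a checklist for a person to follow. The script only does the arithmetic, so a step that runs long is easy to spot against its planned start time.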

Beyond this, I also believe that the concept of checklists can be applied to documentation in general. From my experience, most documentation is written in a “here is how this works” style. This is very helpful when it comes to maintaining and developing a product. However, it is equally important to have documentation that tells you “here is how you do things”. Some perfect examples are:

  • This is how you deploy a new version or roll back to a previous one
  • This is how you troubleshoot
  • Here is how you run this script

Redundancy

In aviation, everything is made and designed with redundancy in mind. Piston engines have two independent magnetos that each power a set of spark plugs. Fuel is stored in two separate fuel tanks. A magnetic compass is required as well as a gyroscopic heading indicator. Redundant and independent systems are highly important because they help ensure that a problem with one part doesn’t lead to overall failure. In software, redundancy is also everywhere: backups, multiple instances and fail-over systems. Pair programming is an example of redundancy that closely mirrors the role of two pilots on commercial flights. The “driver” in pair programming is the pilot flying, and the “observer” is the pilot monitoring. Redundancy is also, by definition, how you manage a company’s or project’s bus factor.

The bus factor is the minimum number of people on a project that can “get hit by a bus” before it puts the project in jeopardy.

During our migration, redundancy also played a major role. We ensured that there were at least two engineers capable of doing the work across all impacted domains. When executing the plan, these engineers were observing and monitoring to confirm that everything was performed properly. Finally, to allow for fast fail-over and rollback to the old data store, we continued to write to the old instance to ensure consistent data accuracy. If, at any point, there were some issues after services were restored, we could simply flip a flag and revert to using the previous store.
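
The dual-write and flag approach can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our actual code: `InMemoryStore`, `DualWriteRepository` and the `use_new_store` flag are hypothetical names, and a real version would also need to handle a write that succeeds in one store and fails in the other.

```python
class InMemoryStore:
    """Hypothetical stand-in for a real data store."""

    def __init__(self):
        self.records = {}

    def write(self, key, value):
        self.records[key] = value

    def read(self, key):
        return self.records.get(key)


class DualWriteRepository:
    """Writes to both stores; reads from whichever one the flag selects."""

    def __init__(self, old_store, new_store, use_new_store=True):
        self.old_store = old_store
        self.new_store = new_store
        self.use_new_store = use_new_store  # hypothetical rollback flag

    def write(self, key, value):
        # Keep writing to the old store so it never falls behind.
        self.old_store.write(key, value)
        self.new_store.write(key, value)

    def read(self, key):
        store = self.new_store if self.use_new_store else self.old_store
        return store.read(key)


old_store, new_store = InMemoryStore(), InMemoryStore()
repo = DualWriteRepository(old_store, new_store, use_new_store=True)
repo.write("record-1", {"status": "active"})

# Rolling back is a single flag flip; the old store already has the data.
repo.use_new_store = False
rolled_back_value = repo.read("record-1")
```

Because every write lands in both stores, flipping the flag changes only where reads come from, which is what makes the rollback fast.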

Emergency planning

Over half of the time in my flight training was focused on emergencies and recovery from undesirable circumstances. Practicing steep turns, stalls, spiral dives, spins and slips may seem like acrobatics, but these are all critical lessons that can some day save a pilot’s life. When it comes to software, many companies are severely underprepared to deal with emergency situations. The Rogers Communications outage earlier this year prevented 9-1-1 calls and took down the Interac payment service. In 2016, millions of HSBC customers weren’t able to access online accounts, with service only being restored two days later. Having backups and a disaster recovery plan is a critical first step, but often the second, equally important step is forgotten: testing and practice.

When planning for our data migration, we prepared for multiple adverse scenarios and created an emergency checklist, just like ones that exist in aviation. Our scenarios included low impact possibilities such as errors with the data import or the import process taking longer than expected. In these cases we would simply abort the process, recover to the old instance and try again on a different day after a post-mortem. We also considered higher impact scenarios where the migration would succeed but we would encounter problems after bringing back services. As mentioned in the previous section, in these situations, we could quickly toggle a flag and revert to using the previous data store.
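
The scenarios above amount to a small decision table, which could be written down as follows. This is a hedged sketch rather than our actual emergency checklist: the function name, parameters and the idea of a fixed time budget in minutes are assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue with the checklist"
    ABORT = "abort, recover to the old store, retry after a post-mortem"
    REVERT = "toggle the flag and revert to the previous store"


def choose_action(*, services_restored, import_failed,
                  elapsed_minutes, budget_minutes, problems_detected):
    """Map the current situation to one of the planned responses."""
    if services_restored:
        # Higher impact: the migration succeeded but problems appeared later.
        return Action.REVERT if problems_detected else Action.CONTINUE
    if import_failed or elapsed_minutes > budget_minutes:
        # Lower impact: stop before services come back on the new store.
        return Action.ABORT
    return Action.CONTINUE


# Hypothetical situation: the import is running well past its estimate.
decision = choose_action(
    services_restored=False,
    import_failed=False,
    elapsed_minutes=200,
    budget_minutes=120,
    problems_detected=False,
)
print(decision.value)
```

Writing the responses down ahead of time, whether in code or on paper, means nobody has to decide under pressure what "abort" means.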

After any sort of incident comes a step that many engineers are already familiar with: a post-mortem. This is what helps us improve and apply the lessons learned in the future. This, once again, has a corresponding step in aviation. When an incident is reported to Transport Canada, a CADORS (Civil Aviation Daily Occurrence Reporting System) report is created. Documenting and analyzing these reports can reveal gaps and lead to systemic improvements. It could be that some instruction is missing from flight training, additional regulations are needed or safety reviews are required on a particular aircraft.

A selection from a CADORS report. CADORS #2022O1912

As I continue my journey in aviation, I’m certain that I will discover other concepts that can be applied to my engineering career at League. One aspect that seems rife with possible learnings is the robust and regular maintenance that planes undergo, which keeps them operational for such long periods of time. Similarly, situational awareness, which played a massive role in my flight training, can be very beneficial in many other fields.

In terms of our migration, thinking like a pilot left us well prepared, with a detailed checklist that greatly reduced room for error. We had redundancies in place and were ready for any unfortunate circumstances with clear steps to recovery. The end result was a well-executed, successful plan and a debrief that sparked the idea behind this blog post.
