Always be prepared

Global Technology
McDonald’s Technical Blog
Jul 25, 2023

McDonald’s recently conducted its largest systems disaster-recovery exercise in Europe to date, ensuring that the plans we have on paper will work if ever needed.

Members of the Global Technology Infrastructure Operations Team gather in London to proactively prepare for anything that could disrupt our technology through live tests of system resiliency.

by Pam Dorn, Operations Excellence Manager, Internal

It’s approximately 2 a.m., and ten members of McDonald’s Global Technology team are gathered around a table in the London office, watching as network and infrastructure traffic shifts seamlessly to failover sites.

Someone suggests they pause for snacks to help re-energize the group and focus. There’s a heated debate on whether they go for leftovers or biscuits (aka cookies). They land on biscuits with McCafé coffee, and everyone gets a much-needed break to enjoy the English treat.

You wouldn’t know it, but the team is in the midst of a disaster-recovery exercise: a live test to ensure McDonald’s most critical network points can quickly and effectively be switched to backup options in the event one of our network sites fails.

More than just good on paper
It’s no secret that McDonald’s heavily relies on technology. However, the scale of our technology isn’t as widely known. There are more than 40,000 McDonald’s locations in over 100 countries. We serve more than 65 million customers every day, and there are approximately 200,000 people employed by McDonald’s Corporation and its majority-owned subsidiaries worldwide.

Technology powers all of that. From our Global Mobile App to our in-store kiosks to the servers in our corporate offices, every single technical component must be working smoothly and effectively to serve our customers.

That’s why we proactively prepare for anything that could disrupt our technology by testing our resiliency during a failure with disaster-recovery exercises. For some context, a disaster-recovery exercise is designed to accomplish two things: First, it brings teams together to prepare before a disaster occurs; and second, it’s meant to help discover and remediate gaps in failover plans before the failover is needed in an actual emergency.

In McDonald’s case, our approach is more hands-on than most. Led by the Global Technology Infrastructure Operations (GTIO) team, the exercise went beyond testing the plans on paper: the team materially tested failing over to confirm that the backup systems worked when actually put into use. That meant practicing the failover while the business was running.

“There is no real down time when you are serving millions!” says Carol Glennon, GTIO director. “We failed over to actual backup systems while customers were still being served.”

The anatomy of a disaster-recovery exercise
To do this, the GTIO team followed a series of steps.

1. Prepare: The team spent three months preparing for the exercise. They had to painstakingly document the recovery procedures and create runbooks that would be used during the event. They used information from architecture documents, past issues, interviews with specialists, and resiliency design best practices to uncover as many gaps as possible before the material test took place.

2. Collaborate: The GTIO and Global Technology Risk Management teams worked with more than 50 teams across Europe to prepare for the test. This meant working within three time zones for calls and meetings.

3. Test: On the day of the test, the team gathered in person and by video conference hours ahead of the late-night start to pre-check all the materials, consider any late-breaking information, and warm up dashboards and other monitoring sources.

4. Learn: While the failover test was successful, it also surfaced significant lessons and opportunities to close additional gaps in the process. One of the biggest lessons was the importance of testing itself. We have mechanisms we rely on to activate failovers, and we needed to ensure those would still work at an increased scale.

“When teams practice simulations and material exercises together, it has an exponentially positive effect when they are called on during incidents,” Glennon says. “This was the first material test of this magnitude we’ve done. Our systems have evolved with changing customer needs, so it’s vital to constantly test and calibrate the thresholds for failovers as our size, scale, and customer demands change and grow.”
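To make the idea of calibrating failover thresholds concrete, here is a minimal sketch in Python. It is not McDonald’s actual mechanism, which this article does not describe. It assumes a hypothetical trigger that switches from a primary site to a backup site after a set number of consecutive failed health checks, and the names `FailoverTrigger` and `checks_until_failover` are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailoverTrigger:
    """Hypothetical trigger: fail over after N consecutive failed health checks."""

    failure_threshold: int          # consecutive failures needed to activate failover
    consecutive_failures: int = 0   # running count, reset by any healthy check
    active_site: str = "primary"    # site currently serving traffic

    def record_check(self, healthy: bool) -> str:
        """Record one health-check result and return the site now serving traffic."""
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active_site = "backup"
        return self.active_site


def checks_until_failover(threshold: int, results: List[bool]) -> Optional[int]:
    """Replay recorded check results against a threshold.

    Returns the 1-based index of the check that activates failover,
    or None if the threshold is never reached.
    """
    trigger = FailoverTrigger(failure_threshold=threshold)
    for index, healthy in enumerate(results, start=1):
        if trigger.record_check(healthy) == "backup":
            return index
    return None


# Illustrative (made-up) sequence: one brief blip, then a sustained outage.
recorded = [True, False, True, False, False, False]
for threshold in (1, 2, 3, 4):
    print(threshold, checks_until_failover(threshold, recorded))
```

Replaying the same recorded results against several thresholds shows the trade-off: a threshold of 1 fails over on the brief blip, while a threshold of 4 never fails over at all during this outage window. Under these assumptions, that is the kind of calibration that needs revisiting as traffic patterns change.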

Looking to the future
During the exercise, the team was able to complete the failover to full backup in less than five minutes, the gold standard for this type of practice.
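A recovery-time goal like this one can be checked mechanically during an exercise. The sketch below is an assumption-laden illustration rather than the team’s tooling. It takes two timestamps, when the failover started and when the backup was fully serving, and compares the elapsed time with a five-minute target. The function names and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Target taken from the article's "less than five minutes"; adjust as needed.
RECOVERY_TARGET = timedelta(minutes=5)


def failover_duration(started_at: datetime, completed_at: datetime) -> timedelta:
    """Return the elapsed time between failover start and full backup."""
    if completed_at < started_at:
        raise ValueError("completed_at must not be earlier than started_at")
    return completed_at - started_at


def met_target(
    started_at: datetime,
    completed_at: datetime,
    target: timedelta = RECOVERY_TARGET,
) -> bool:
    """True if the failover finished in strictly less than the target time."""
    return failover_duration(started_at, completed_at) < target


# Made-up timestamps for illustration only, not figures from the exercise.
start = datetime(2023, 1, 1, 2, 0, 0)
done = datetime(2023, 1, 1, 2, 4, 10)
print(failover_duration(start, done), met_target(start, done))
```

Recording these two timestamps in the runbook for every exercise gives a simple trend line for whether recovery time is holding steady as systems grow.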

“We knew we had our failover plans well laid out, but to see those plans materialize during an actual failover gave us renewed confidence that we are ready if a disaster strikes,” Glennon says.

Among other things coming out of the exercise, McDonald’s now has detailed runbooks and automation tested and ready for a real event. These procedures and plans will help the incident response teams fail over more successfully.

Most notably, all teams involved are now better connected, enabling smoother communications and interactions, which is critical during an incident and beneficial for day-to-day operations.
