Cloudy, with a chance of Meetup: Behind the scenes of moving Meetup to the cloud
Until very recently, we ran data centers in Philadelphia and in New York City (for redundancy), leasing and maintaining our own hardware in both. When we needed a new server, or when something went wrong, we had to send engineers to set things up manually. The promise of the cloud was that we would be able to scale up and out, especially as we transitioned to a microservice architecture, with far fewer distractions and at lower cost. For more on this, see our prior blog post: https://medium.com/making-meetup/moving-meetup-to-the-cloud-1416b66f82cb
Our systems engineering team worked on this for about nine months, and the pace of their work accelerated in the months leading up to the migration. We began to reap the rewards of their containerization work even before the official migration date: for example, we were able to run large parts of our notification-sending subsystem in Docker containers in AWS several weeks before the migration. In mid-February, when our event reminder emails started to back up under high traffic during the busiest part of our year, we simply doubled the number of email-sending servers. That let us avoid delaying the messages for too long, sending someone to the data center to provision new hardware, or spending a lot of engineering time (already allocated elsewhere) on troubleshooting.
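As a rough illustration of what that kind of scale-up can look like, here is a minimal sketch, not our actual tooling, assuming the email senders ran as an ECS service (the post doesn't specify which orchestrator we used); the cluster and service names are made up:

```python
# Hypothetical sketch: double the number of email-sending containers.
# Assumes an ECS service; "notifications" and "email-sender" are invented names.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

cluster = "notifications"   # hypothetical cluster name
service = "email-sender"    # hypothetical service name

current = ecs.describe_services(cluster=cluster, services=[service])
desired = current["services"][0]["desiredCount"]

# Double the fleet instead of letting reminder emails back up.
ecs.update_service(cluster=cluster, service=service, desiredCount=desired * 2)
print(f"Scaled {service} from {desired} to {desired * 2} tasks")
```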
At first, we chose February 26 (just after our January rush of traffic) as a tentative date to complete the migration: shutting down our remaining data-center services and transferring them to AWS. As that date approached, we found a show-stopping issue in the interaction between the new caching system we would be using and our API’s implementation of OAuth 2, so we moved the date to March 5 to leave enough time for a fix.
The day of
Our entire systems engineering team came to the office shortly before midnight on the night of March 4th into the 5th. Select core and web engineers worked from home, as did our entire QA team; two of our principal engineers were on call as generalists, and we also chose senior engineers to be on call for our data pipeline, our API, and our notifications subsystem. (I was on call for notifications, as a senior engineer on the Notifications product team.) Our CTO, Yvette Pasqua, was also on call to make the big decisions, especially the call on when we were ready to come back online on our AWS infrastructure.
We planned to announce milestones in our #production-status Slack channel and to announce the completion of the migration in #general (our all-hands Slack channel) as well. In the moment, we ended up discussing the details of problems as they arose across three Slack channels: #aws-migration, then #qa during the final spot check, and finally #production-status after the site was live. If we were coordinating a similar effort in the future, I would plan to centralize that discussion in a single channel.
We’d created a timeline as a part of our planning for the migration. I’ll go over this timeline and note when these tasks were accomplished, with a ✅ if all went well, or with more information when relevant.
11pm (March 4th): Go live with the red “site will be down” banner message on the website. [✅]
12:00am (March 5th) — Check in at HQ. [✅]
12:05am — Change DNS to point to the maintenance page for *.meetup.com (robots.txt with disallow all). [✅]
12:10am — Change DNS to point api.meetup.com, via a CNAME, to a “503” service. [✅] (A sketch of this kind of maintenance/503 responder follows the timeline.)
12:15am — Shut down App and API processes in data center. [✅]
12:30am — Shut down all producer jobs in the data center. (We shut down almost all of our email-producing jobs in the first pass of shutdowns and then began to shut down consumers of those jobs. We missed one of the email producers, though, and had to briefly restart one of our email consumers after stopping the producing process. This didn’t take long, however, so we were still on schedule.)
12:45am — Monitor RabbitMQ queues until they are drained. We can also attempt to speed up the process by running multiple consumers on AWS. This is the most uncertain part of the plan, as it is hard to predict how long it will take for all the queues to drain: it could take less than 15 minutes, or more than an hour. (This took 30 minutes, which kept us on schedule. A sketch of a queue-drain watcher follows the timeline.)
1:30am — Shut down all consumers in data center and disable crons. [✅] (We were ahead of schedule for this; since some of our email consumers use RabbitMQ, this happened in parallel with the 12:30 and 12:45 steps, rather than sequentially.)
1:45am — Shut down all services in AWS. [✅]
2:00am — Stop replication from MySQL and Redis in our old data center; take a backup of the Email DB. [✅]
2:30am — Promote Aurora to master; change DNS in the VPC to point all write endpoints for the DB, RabbitMQ, and Redis to AWS. (At the time, this went smoothly. We found out the next day that a command had been mistyped so that one of our Redis write endpoints pointed at the wrong location, which caused several mysterious errors early on Sunday morning. A clearer, better-reviewed playbook might have helped us avoid this; a sketch of a scripted, reviewable cutover follows the timeline.)
2:45am — Start all services on AWS. [✅]
- Update DNS records to point to Fastly.
3:00am — Smoke tests and E2E tests. (This took about two hours.)
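To make a few of those steps more concrete, here are some hedged sketches. None of this is our production tooling, and every hostname, credential, and ID below is a placeholder. First, the 12:05am and 12:10am steps, where DNS was pointed at a maintenance page and a “503” service: the idea is simply a responder that serves a crawl-blocking robots.txt and answers everything else with a 503, roughly like this:

```python
# Illustrative only: a tiny "maintenance mode" responder of the kind the
# 12:05am/12:10am steps point DNS at. Serves a robots.txt that disallows
# crawling and answers everything else with 503 + Retry-After.
from http.server import BaseHTTPRequestHandler, HTTPServer

ROBOTS = b"User-agent: *\nDisallow: /\n"

class MaintenanceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/robots.txt":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(ROBOTS)
        else:
            self.send_response(503)
            self.send_header("Retry-After", "3600")  # tell clients to come back later
            self.end_headers()
            self.wfile.write(b"Meetup is down for scheduled maintenance.")

    do_POST = do_GET  # API clients POST too; give them the same 503

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), MaintenanceHandler).serve_forever()
```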
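For the 12:45am step, the open question was how long the queues would take to drain. One way to watch the backlog shrink is to poll the RabbitMQ management plugin’s HTTP API; the host and credentials here are invented:

```python
# Sketch of a queue-drain watcher for the 12:45am step, using the RabbitMQ
# management plugin's HTTP API. Host, credentials, and interval are placeholders.
import time
import requests

MGMT_URL = "http://rabbitmq.internal:15672/api/queues"  # hypothetical host
AUTH = ("monitor", "secret")                            # hypothetical credentials

def total_messages():
    queues = requests.get(MGMT_URL, auth=AUTH, timeout=10).json()
    return sum(q.get("messages", 0) for q in queues)

while True:
    remaining = total_messages()
    print(f"{time.strftime('%H:%M:%S')} messages remaining: {remaining}")
    if remaining == 0:
        print("Queues drained; safe to shut down consumers.")
        break
    time.sleep(30)
```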
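The 2:30am step is where the mistyped command hurt us. One mitigation for next time is to put the whole cutover into a script that can be reviewed ahead of time, instead of typing each DNS change by hand. A hedged boto3 sketch, with an invented hosted zone ID and invented hostnames:

```python
# Sketch of a reviewable DNS cutover for the 2:30am step: every write endpoint
# and its target lives in one reviewed mapping instead of hand-typed commands.
# The hosted zone ID and all hostnames are placeholders, not our real ones.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000000000"  # hypothetical private hosted zone in the VPC

# One place to review the whole cutover before running it.
WRITE_ENDPOINTS = {
    "db-master.example.internal":    "prod-aurora.cluster-abc123.us-east-1.rds.amazonaws.com",
    "redis-master.example.internal": "prod-redis.abc123.use1.cache.amazonaws.com",
    "rabbit.example.internal":       "prod-rabbitmq.example.internal",
}

changes = [
    {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "TTL": 60,
            "ResourceRecords": [{"Value": target}],
        },
    }
    for name, target in WRITE_ENDPOINTS.items()
]

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Comment": "Cut write endpoints over to AWS", "Changes": changes},
)
```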
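Finally, the spirit of the 3:00am spot check: the real work was our QA team’s manual checks and E2E suites, but a minimal smoke test just confirms that key pages and API endpoints respond as expected. The URLs and expected status codes here are illustrative:

```python
# Toy smoke test in the spirit of the 3:00am step; the real check was our QA
# team's E2E suites. URLs and expected status codes are illustrative.
import sys
import requests

CHECKS = [
    ("https://www.meetup.com/", 200),
    ("https://api.meetup.com/status", 200),  # hypothetical health endpoint
]

failed = False
for url, expected in CHECKS:
    try:
        status = requests.get(url, timeout=10).status_code
    except requests.RequestException as exc:
        print(f"FAIL {url}: {exc}")
        failed = True
        continue
    ok = status == expected
    print(f"{'PASS' if ok else 'FAIL'} {url} -> {status} (expected {expected})")
    failed = failed or not ok

sys.exit(1 if failed else 0)
```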
Satisfied with our manual QA, we went live to our users on the new infrastructure around 5:30am on Sunday, March 5. We kept working in shifts throughout Sunday to triage and resolve the remaining issues, and follow-up tasks continued through the next week.
The week after
We met on Monday morning, March 6, and, working from a spreadsheet of every issue discovered during and after the migration, identified our top three priorities for fixes and an owner for each. Later that day, I also prepared a document covering notification-related outages and defect statuses, which I shared with my product team and with the community experience team (since they would be handling any issues reported by our members). For each issue, I highlighted “user impact” and “next steps” (with owners, when possible), and I received feedback that this focus made the document useful.
On Monday afternoon, we set up rotating on-call shifts for the rest of the first week so that everyone in the rotation could get enough sleep. (Normally we have a weekly on-call rotation; this one was daily.) It was my first time being on call in general rather than for a specific event, and I found the daily rotation a good way to spread knowledge of the cloud infrastructure. I took a neat selfie with PagerDuty during my shift on Wednesday.
The present and future
We held a retrospective later in the month, after the dust from the migration had settled. Here are some highlights from our “positive”, “negative”, and “try for next time” columns.
Things that went well (+)
- We have now migrated to the cloud!! ☁
- The process prompted us to do upgrades we’d been putting off for a while (e.g., a major RabbitMQ version upgrade happened as part of the move).
Things that didn’t go well (-)
- We could have planned the on-call rotation further in advance. For example, I worked from midnight Saturday night through mid-morning on Sunday*. On Sunday morning, a coworker of mine was asked to log on to help, and she quickly diagnosed an issue I’d been staring at for several hours with no success. We could have planned that handoff ahead of time instead of asking her to help on the day of the migration. (*Nota bene: I took time off later that week. We don’t expect engineers to be tireless machines.)
- The systems team worked really hard on setting up the migration, but there wasn’t always clear communication about how their work affected the product teams that owned the services being moved. I think the systems team could have worked more efficiently and effectively with some help from core engineers on those product teams.
Takeaways (try for next time)
- One theme came up over and over in our “try” column: we are actively working, and putting processes in place, to improve communication and collaboration among all of our engineers.
I’m proud to have been a part of our cloud migration. Thanks to everyone who made it possible!