The great migration: Chime’s move to AWS

Published in Life at Chime · 9 min read · Jun 15, 2021

Mike Barrett remembers when he first heard the term “undifferentiated heavy lifting.” His favorite example is General Electric (GE), a company that builds lightbulbs and electrical equipment, among many other things. “So why were they spending millions of dollars building data centers — why not give that responsibility to somebody whose core competency is, well, data centers?”

The term was coined by Jeff Bezos, Founder of Amazon.com, in a presentation he gave in 2006 at MIT. He explained that many companies do the heavy lifting to run web applications, like hosting, managing bandwidth, overseeing data center hardware, and general backend work — all of which he called “muck.” He also recognized that after growing Amazon, they knew how to make great muck — so why not share that? His vision was to give companies “the ability to compete based on the quality of their ideas rather than on their ability to create their own muck.” Today, we know that Amazon Web Services (AWS) is a leader in abstracting away the muck and enabling companies to scale faster and more efficiently. “By making every part of engineers’ jobs a little — or a lot — easier, AWS helps companies focus on our core competencies and deliver better products and services,” Mike says.

In 2019, Chime took a look in the mirror at our core competencies: Chime is a financial technology company, not a bank—our services are provided by, and debit cards issued by, The Bancorp Bank or Stride Bank, N.A.; Members FDIC.—and we’re growing quickly. “When we started to look to the future and the growth we were anticipating — both of our number of members and our team — we recognized a pressing need for a solution that would abstract away our infrastructure needs and let us focus on the work that matters to achieving our mission and serving our members,” says Mike. So we began a migration to AWS that was recently completed. Here’s a look at the team and the work behind our great migration.

Meet the team

Mike Barrett, our VP of Engineering Services, joined Chime in early 2020 after watching a Last Week Tonight episode on the state of payday loans. In his role, he oversees Engineering Services, the part of Chime Engineering that empowers all of our engineers to do what they love — write code and provide value to our members — with less effort.

Ethan Erchinger, a Staff Software Engineer on the Engineering Services team, has been a Chimer since 2014. In the early days, he helped build out Chime’s production and data engineering infrastructure and has continued to work across Engineering Services with a focus on infrastructure ever since. Once a director and manager, Ethan now focuses on providing consultative services across the Engineering organization.

Keerthi Nallani, a Senior Software Engineer, joined the Scale team in early 2019. She’s worked on scaling our infrastructure and improving application performance — a departure from her former days coding web application features. She loves removing bottlenecks and dependencies for fellow engineers and ensuring Chime can grow sustainably.

Kevin Olson was an Engineering Manager on our Infrastructure team at the time of the migration (he’s since become the founding Engineering Manager on our Financial Platform team). He joined Chime in 2020, a month before the pandemic hit. His role was to understand the needs of the different parts of our Engineering organization for the AWS migration and build cross-functional relationships to support the migration. He loves empowering engineers to step into the roles they’re needed in and helping all teams be more effective.

Satya Palani is our Director of Infrastructure Engineering and Developer Experience at Chime. His main focus is on making sure Chime is available to our members by building and maintaining our infrastructure. He partners with engineering teams to ensure they’re supported in their operations and have the tools and processes necessary to succeed. For the AWS migration, he managed one of our engineering teams that worked on it, helping them develop a solid rollout and rollback plan. He joined our team in August of 2020.

Making the migration a priority

In planning for the future of Chime, when we decided to build a financial platform, we realized there were some missing pieces in our infrastructure to build the systems we’d need for the platform to be a success. At the time, Kevin was new to the company in a leadership role and started doing one-on-ones with managers. “I noticed that they’d often call out things that would be so much easier with the AWS migration complete — but there wasn’t a clear roadmap for when it would be done,” he says. At the time, Chime was hosted in a vendor-managed data center, there wasn’t a team dedicated to the migration, and it wasn’t a priority on the team’s roadmap.

As usage of the Chime service was rapidly growing, we realized that we wouldn’t be able to scale our infrastructure in the data center at the same speed as our member base. We also wanted to ensure that our infrastructure was as resilient as possible, because availability is critical for our members. Kevin says, “If Chime goes down, members can’t access their money — that’s a core function of Chime and we needed to make it a priority.”

At that time, the migration to AWS was made a company-wide priority to provide scalability and self-service for all of Chime Engineering. “We have a team that lives and breathes our product, but we needed people who would live and breathe the tools that make building, deploying, and supporting that product easy — all of which results in better services for our members,” Mike says. “When we decided to complete the AWS migration, we were committing to making that a priority and a reality.”

Our goals

Gain the ability to rapidly scale to provide better service for members with uptime, speed, and reliability. The vast and dynamic capacity of the AWS cloud allows us to provision new infrastructure at a moment’s notice and provides more geographic resilience than traditional data centers.

A seamless migration for members. “A primary requirement for our migration was that our members do not experience any interruptions,” says Satya. Protecting data integrity, designing a seamless cutover process, and performing extensive pre- and post-migration testing were some of the critical components of the migration plan.

“By being able to quickly scale our infrastructure, we ensure that Chime’s members always have the best possible experience.” — Satya

Improve engineering productivity. Migrating to AWS is a lever that makes everything else easier: if every engineer can be 10–20% more efficient, that adds up to a massive impact at scale. What’s more, the public cloud makes it easy to treat infrastructure as software, which lets engineers set up environments quickly and gives us access to the many managed services that AWS provides, reducing the human cost of setting up and operating common services like databases and caching layers.

The process

The first step was to prioritize the project ruthlessly and then get the team in place. The team was split into smaller groups based on concentration. “We ensured that they were teams, not just single individuals, so we could all share knowledge,” Mike explains. Different groups focused on things like the Redis migration, database migration, Kubernetes, and CI/CD flow. This helped teams focus and become experts in their areas of the project. It also gave clarity on who to contact for questions and built accountability and responsibility across the teams.

No large technical project is without its challenges. Here are a few that we faced in our migration to AWS:

Company growth: From start to finish of the migration, Chime grew by up to 20x in data, customers, and number of employees. We went from 2 engineering managers to 25! “It was a constantly shifting landscape and a challenge to keep the business running — and growing — in our legacy software while we evaluated new technologies,” says Ethan.
Coordination: Changing a car’s engine while it’s in motion is no small feat, and the team underestimated the interconnectedness of all of the systems at first. “There were terabytes of data to migrate, and we needed air traffic control despite our very cautious approach,” says Kevin. The team coordinated by reverse-engineering the system to understand how everything was connected, then drew up a detailed battle plan with steps to follow. After that, we focused on migrating 1–2 services at a time to see what would happen.
Moving solutions: “We’d grown deep roots deploying in a data center, so moving solutions wasn’t easy,” says Ethan. For example, when we moved to Kubernetes (from on-host deployment to Docker-based deployment), the team spent 5 months moving from one caching solution to another and understanding how to support the multi-region consistency needed for the move. We took care in moving solutions, like caching, because we want our members’ experience to be consistent and reliable. If we were wrong about where data was written and how replication was happening, we might lose consistency and, for example, show a member an incorrect balance.
Making tradeoffs: “As an organization, we had to punt some features and optimizations until AWS was available for Chime Engineering,” explains Keerthi. “This felt frustrating at times, but it was necessary for us to have all hands on deck and to not further complicate the migration process.”
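
The coordination approach described above (map how the systems connect, then move one or two services at a time) can be sketched as a dependency-ordered plan. This is only a minimal illustration, not Chime’s actual tooling: the service names, the `plan_waves` helper, and the choice to move a service only after the things it depends on are all assumptions made for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each service lists the services it depends on.
DEPENDS_ON = {
    "api": {"cache", "database"},
    "worker": {"database", "queue"},
    "web": {"api"},
    "cache": set(),
    "database": set(),
    "queue": set(),
}


def plan_waves(depends_on, wave_size=2):
    """Group services into small migration waves.

    A service is scheduled only after everything it depends on has been
    scheduled in an earlier wave. Raises graphlib.CycleError if the
    dependency map contains a cycle.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Everything whose dependencies are already scheduled.
        ready = sorted(sorter.get_ready())
        # Split the ready set into waves of at most `wave_size` services.
        for start in range(0, len(ready), wave_size):
            waves.append(ready[start:start + wave_size])
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(plan_waves(DEPENDS_ON), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Keeping each wave to one or two services limits how much can go wrong at once, which matches the cautious approach the team describes.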

Migration: Complete

“Migrations like these are bound to have downtime,” explains Keerthi. “But we had zero downtime — it was both amazing and anticlimactic.” The reason? Instead of waiting to do a post-mortem, the team did a pre-mortem to understand, beforehand, what might go wrong. Usually, a migration of this magnitude would involve shutting down the system, but we were able to get the dual-write latency down to milliseconds, which is how we achieved zero downtime.
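
For readers unfamiliar with the dual-write pattern mentioned above, here is a minimal sketch of the general idea. The article does not describe how Chime implemented it, so everything here is an assumption for illustration: the `DualWriteStore` class is hypothetical, and plain dictionaries stand in for the legacy and new data stores.

```python
class DualWriteStore:
    """Hypothetical wrapper that writes to both stores during a migration."""

    def __init__(self, legacy, target):
        self.legacy = legacy            # stand-in for the old data store
        self.target = target            # stand-in for the new data store
        self.read_from_target = False   # reads stay on legacy until cutover

    def write(self, key, value):
        # Every write goes to both stores so they stay in step.
        self.legacy[key] = value
        self.target[key] = value

    def read(self, key):
        source = self.target if self.read_from_target else self.legacy
        return source[key]

    def backfill(self):
        # Copy records that predate dual writes. setdefault leaves alone
        # any key the dual-write path has already written to the target.
        for key, value in self.legacy.items():
            self.target.setdefault(key, value)

    def mismatches(self):
        # Keys whose values differ between the two stores.
        keys = set(self.legacy) | set(self.target)
        return sorted(k for k in keys if self.legacy.get(k) != self.target.get(k))

    def cut_over(self):
        # Refuse to switch reads while the stores disagree.
        if self.mismatches():
            raise RuntimeError("stores differ; refusing to cut over")
        self.read_from_target = True
```

Because both stores already hold the same data at the moment reads switch, the cutover itself does not require taking the system offline. In a real system the two writes would not be atomic, so the comparison step matters more than this sketch suggests.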

The future of engineering on AWS at Chime

Migrating to AWS means several things for engineering at Chime. “The migration empowers our team to be less tied to the past and think more about the future,” says Kevin. “With AWS, if we want to deploy a new service or architecture, we can do it in hours or days rather than months of work orders.” AWS also gives us visibility into the system’s health, with a wealth of logs and metrics to dig into that will help us scale.

Working with AWS also means supported services. From databases to caching, messaging, deployment infra, and more, it’s much easier to do a ton of things without in-house expertise because AWS offers a broad range of services. Finally, our migration to AWS removes infrastructure engineering as a bottleneck. The self-service model takes the pressure off the infrastructure team to provision and maintain systems. Instead of teams having to partner with infra to build a new tool or service, developers have greater control to build and manage infrastructure through AWS automation, security, and templates. “AWS unlocks our ability to move projects forward and makes scaling easier on our infrastructure and engineering org in general thanks to the breadth of services and ecosystem it gives us access to,” says Ethan.

Perhaps most importantly, the AWS migration helps us provide a better Chime for our members. A guiding value at Chime is to be member-obsessed, so everything we do is ultimately to serve them. The migration to AWS will help us provide better services, more uptime, reliable data, security, and a sound foundation for us to build the future of financial health upon.