The Cloud Journey!
Migrate Bank from one Cloud to another
Yes! We did something which most organizations would stay away from — Cloud to Cloud migration!
Cloud migration, like on-premise migration, is something people tend to postpone and frown upon. Here, we will talk about the Why, What, and How of our migration from one cloud to another.
∘ Phases and feedback cycle: POC → dev → staging → pt → prod
∘ Rollback plan
∘ High-Level Steps
∘ Migrating Compute layer
∘ Migrating Data layer
∘ Security considerations
∘ External parties — Intranet communication
∘ Order of migration for non-related systems
∘ On the D-day
∘ Prod support/ SRE
· What went well?
· What could have been improved?
Motivation: Answering Why!
The timeline below depicts Bank App’s journey across different cloud providers.
With the delay of the Largest Cloud Provider-(foo) setting foot in the Indonesia region (the team had more expertise in this one), we started understanding, designing, and eventually began the development effort in Asia’s Largest Cloud Provider-(bar). Then a new player, let's call it Cloud Provider-(baz), entered the regional (Indonesia) market in June 2020. The team got excited about multi-cloud and some clear benefits we got out of the box by running some services in Cloud Provider-(baz):
(Disclaimer: All provider names are undisclosed to maintain objectivity. This is based on our understanding and the setup we did with a multitude of constraints. We by no means intend to compare which cloud is better than the other; situations and constraints vary)
- High Availability (HA): Three Availability Zones (AZs) over two in the older Cloud
∘ Improved HA of MongoDB (quorum — distribute nodes in three zones vs one and cater to one zone failure)
∘ Improved HA of Kafka: Zookeeper cluster for having a quorum
∘ Elastic-search HA
- Security
∘ Rest & Transport Data Encryption for all Data services
∘ Better threat analysis using tools like Cloud Armor
∘ Data Loss Protection (DLP) APIs directly hooking on certain critical data sources
∘ (Security Command Center) SCC availability and easy coupling with holistic services logs
∘ K8s: pods directly using Service Accounts(SA) over using static credentials before interacting with cloud services like cloud functions and others
∘ Better integration with VM, container scanning/ signing tools
∘ Fine-grained ACLs to access resources tied to SSO login — Least Privilege principle
- Developer experience
∘ Easy Tooling: increased options for SDKs
∘ Mature IAC (Infrastructure-as-Code) utils
∘ Wider Documentation/ Community support
∘ More public testimonials
∘ Marketplace integrations with a wide range of third-party tools and server technologies
∘ Auto logging of K8s workloads to Cloud logging
∘ Easy integration of monitoring agents on Compute resources
Being in the business of the most regulated market tied our hands on multiple fronts, namely:
- Data residence within the country of operation
- Meeting DC/DR (Disaster Recovery) requirements
- Strict SLOs and SLAs
- PII data usage Governance
Alongside these regulatory and compliance constraints came the technical ones:
- Writing IAC completely afresh mapping the service to new cloud’s ecosystem
- Coupling/ Marriage with blob storage of cloud provider → needed code change for cloud storage in new cloud
- The caching/in-memory ephemeral storage, a Cluster setup in the previous Cloud, was not available in the Cloud being migrated to, so we fell back to a simple Master-Slave topology
- Alerting/ monitoring integration had to be re-done with our APM tooling
- Deployment pipeline changes catering to K8s deployments
- Elastic-Search license change resulting in losing the X-pack features → code changes for alternative approaches
- No downtime or stop-the-world approach for Engineers’ development time and sprint goals
- Minimal change/ learning curve around tooling
- Less out-of-the-box availability of managed services mapping to the OSS variants of our Data sources in the cloud being migrated to
We solve one problem only to pick another!
Anything in technology and distributed systems comes with tradeoffs.
Hence, the risks:
- Data discrepancy while migrating data (a working production system with a few lakhs of customers)
- Security risks on Infra, networking, system access
- Ad-hoc surprises with some components: thinking it would work way X when it surprisingly works way Y
- Downtime needed
- Key skill sets necessary for pulling this off
Answering the What: Approach for migration
With a unanimous decision on multi-cloud and migrating key services to the new cloud, lots of questions pop up:
- Will this need regulatory/ compliance approval?
- Do we have the bandwidth?
- Will sprint goals take a hit?
- Will there be code changes?
- Learning curve?
- Will it be full-blown migration or phased?
- Do we pay for enterprise support for the new cloud and let go of the old one?
Cloud-agnostic Technologies: Choosing a cloud-agnostic, open-source tech stack (persistent datastores, ephemeral datastores, artifact management, CICD pipelines, Gateway, CDN/WAF solutions, inverted index stores, Compute resources, deployments, containerization, stateless applications, a clear demarcation between compute and persistent stores, generalist Message Queues for event-driven use cases, security tools, APM, etc.) helped us get away with minimal to no code changes when we migrated.
Seeders: We have seeder services (which cater to and abstract cross-cutting concerns like CICD pipelines, static code quality analysis, SAST, DDD folder structure, linters, test frameworks, open-tracing, middleware functions like Auth, Swagger, and common utilities (retries, circuit breaker, payload validator, enums, etc.), plus data store connections) from which all of the MS (micro-services) are forked. So any change in the seeder is super easy to pull from the upstream remote of the respective service.
So change once → pull from upstream→ updated code in service
Reusable Modules: We have wrapper modules for anything and everything that’s done more than once across services: common configs as code, making HTTP calls, logging, retrying connections to data services, producing/consuming messages from MQ, message validation, auth utilities, middleware functions, caching utils, serializing/deserializing, and more.
So change once → upgrade module version→ updated code in service
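As a flavour of what such a wrapper module can contain, here is a minimal retry helper with exponential backoff (a sketch of ours; the real module's names and parameters may differ):

```python
import time
from functools import wraps

def retry(attempts=3, base_delay=0.1, exceptions=(Exception,)):
    """Retry a function with exponential backoff.

    A shared wrapper like this lives in one reusable module;
    every micro-service just bumps the module version to pick
    up fixes ("change once -> upgrade module version").
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: surface the error
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator

# Illustrative use: a flaky data-store call that succeeds on try 3
calls = {"n": 0}

@retry(attempts=3, base_delay=0)
def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"
```

Any service importing this module gets the same retry semantics without copy-pasting the loop.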
Answering Phased migration vs Lift and Shift
We came to the consensus that Lift and Shift is the approach to be taken because:
- The holistic overhead (SRE support, maintaining two versions of modules, parallel deployments, maintenance, if/else branches in a few services catering to OSS/ES, asset management, licensing overhead, cloud enterprise support, syncing between two clouds, data latency) was much higher for phased roll-outs than for a complete shift, since the data in persistent stores was in GBs, not TBs or PBs
- Satisfying the criteria laid down for go/no-go meant we had already handled all the risks, even for Lift and Shift (we will come to the criteria set shortly)
Since our approach was Lift and Shift, the next question was: how long will the downtime be? Let’s answer that below with feedback from migrating the lower environments.
Other types of cloud migration are well explained in dedicated articles elsewhere.
Managed vs self-managed services
The next major question we had to answer was whether to go with managed or self-managed services for the ones not available out of the box in the new Cloud’s ecosystem (services like Kafka, MongoDB, Elasticsearch, Redis, PostgreSQL). It was super clear that we wanted to stay cloud-agnostic and not get married to any cloud. At least as much as possible! (choosing Pub/Sub, Big Table, Cloud Datastore, etc. wasn’t an option)
With that, we needed to look for accelerators and experts on these services if we had to self-manage. That meant more roles and more hiring: DB admins, domain experts, system engineers, etc.
This is where the Cloud marketplace was a boon: there were multiple offerings for the tech stacks we needed, for example Bitnami-managed Mongo or Kafka in the Cloud being migrated to, connected via VPC peering.
How did we do the migration?
Sigh! Need to take a deep breath. A lot went into this. Let us try to break it into smaller chunks.
We came up with a MECE (mutually exclusive, collectively exhaustive) checklist covering Infra components readiness, application services, monitoring, CICD pipelines, communications, access matrix, DNS changes, the step-by-step order of data source migration, and the operation timeline. This served as a feedback cycle and go/no-go document for stakeholders, as well as an overall health check and TODO list for migrating from one environment to another.
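A sketch of how such a checklist can drive the go/no-go decision; the section and item names below are illustrative, not the actual checklist:

```python
# Each MECE section owns its items; migration proceeds only
# when every item in every section is ticked off.
checklist = {
    "Infra components readiness": {"VPC peering": True, "Firewall rules": True},
    "CICD pipelines": {"Bulk-deploy tested": True},
    "DNS changes": {"Intranet IPs communicated": False},
}

def go_no_go(checklist):
    """Return ('GO', []) only if no item is pending."""
    pending = [
        (section, item)
        for section, items in checklist.items()
        for item, done in items.items()
        if not done
    ]
    return ("GO", []) if not pending else ("NO-GO", pending)

decision, pending = go_no_go(checklist)
# decision stays "NO-GO" until the DNS item above is done
```

Keeping the checklist as data means the pending items double as the agenda for the next review.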
Prep for the D day:
Labs: This was the crucial step. This is where we answered a lot of unknowns. We did lots of POCs and pilots to discover the things we didn’t know. We had a labs environment for the Infra team to test their IAC and pipelines.
Configs: We structured configuration as code and maintained multiple versions of dependent lock files (ex: yarn.lock) referencing the previous Cloud’s and the new Cloud’s configs in module-config (the single source of truth for all common configs as code).
Pipelines: Until we had migrated one environment completely, we deployed to both the old and new Cloud in parallel. This catered to BAU (Business as Usual): things were deployed and tested in baby steps while sprint goals and feature building kept going.
Phases and feedback cycle:
POC → dev → staging → pt (perf test) → prod
Since we understood how big a change this was, keeping the feedback cycle as short as possible was critical.
If anything were to fail, we wanted it to fail early and fail fast.
With that in mind, we took a staged-environments approach: labs (especially for IAC and Infra component POCs), dev as the test environment, and staging and above as simulations of prod.
labs (IAC) → dev → staging → PT (performance test) → Prod
Each phase/environment was given X days to catch anything we had missed (configs as code, automation scripts to switch between the old and new Cloud’s components, red flags in CICD pipelines, data stores). For dev, this X was quite long; we had things to fix. We did see some red flags around integrations with external partners, routing traffic for on-premise use cases, firewall rule tuning, asset movement, etc.
Feedback from this was crucial on many facets for many squads: the checklists were made more extensive, regrouped into multiple sections, and PICs/owners were added to drive and own each item in the list.
Drills for exactly how the migration would happen in production were conducted in the pt environment, along with a mock drill in prod, to be sure of the migration time, followed by cleaning up the environment afterward. This helped in:
- Getting an estimate of time for complete E2E migration including data
- Tweaking and tuning processes by identifying tasks to be executed in parallel
- Further improving Checklist, resulting in more like a Run-book
We structured our plan so that, right after migration, a quick smoke test (with a special app build) was conducted by a selected internal team. If there were red flags and things didn’t improve after X minutes, we would switch back to the previous Cloud.
The key here was that when Lift and Shift was triggered, we only turned processes off (via firewall rules and DNS mappings) and could turn them back on in no time if things went south.
- Notify customers of the downtime well in advance
- Stop the incoming traffic (internal, external) → confirm the same from APM
- Make sure messages from MQ (Message queues) are completely drained
- Block all traffic to Datastores
- Double confirm on no activity on Datastores (can’t stress this enough!)
- Start migrating the Datastores in parallel
- In parallel, deploy compute resources (with access to Datastores, customers/ test users traffic blocked)
- Once data is completely migrated → Allow traffic from Compute layer to Datalayer
- Do sanity check on any errors in logs, APM, etc for any anomalies
- Switch IPs from key services calling Bank APIs over Intranet
- Allow traffic from special app build created for Smoke-test
- If all is well, increase the test scope to complete regression
- All good! Open up traffic to public customers
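The ordered steps above can be sketched as a run-book driver with an explicit rollback path; a simplified illustration of ours, where real steps would shell out to firewall, DNS, and data-store tooling rather than return booleans:

```python
def run_migration(steps, rollback):
    """Execute ordered run-book steps; on any failure, roll back.

    Each step is a (name, callable) pair returning True on success.
    Mirrors the 'turn off, verify, migrate, verify, turn on'
    ordering described above; the rollback callable would re-open
    firewall rules and restore DNS mappings to the old Cloud.
    """
    completed = []
    for name, step in steps:
        if step():
            completed.append(name)
        else:
            rollback(completed)  # undo what was done so far
            return ("ROLLED_BACK", completed)
    return ("DONE", completed)

# Simulated happy path (step names paraphrase the list above)
log = []
steps = [
    ("stop ingress traffic", lambda: True),
    ("drain message queues", lambda: True),
    ("migrate datastores", lambda: True),
    ("smoke test via special build", lambda: True),
]
status, done = run_migration(steps, rollback=lambda c: log.append(("rollback", c)))
```

Because every step records its completion, the rollback knows exactly which toggles to flip back, which is what made the "turn off, turn back on" safety net workable.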
Migrating Compute layer
This was quite straightforward since most of our Compute layers are stateless and containerized. We have a single pipeline called Bulk-deploy for deploying all services at once. The scripts were tested in lower environments and deployed the entire stack in a matter of a few seconds.
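A minimal sketch of what a Bulk-deploy style fan-out can look like; the `deploy` stub and service names are ours, standing in for applying each service's K8s manifests:

```python
from concurrent.futures import ThreadPoolExecutor

def deploy(service):
    """Stand-in for the per-service deploy step; in reality this
    would apply the service's K8s manifests (kubectl/Helm/etc.)."""
    return (service, "deployed")

def bulk_deploy(services, workers=8):
    # Deploy all stateless services at once, as the single
    # Bulk-deploy pipeline described above does.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(deploy, services))

result = bulk_deploy(["auth", "payments", "notifications"])
```

Statelessness is what makes this safe: any service can be (re)deployed in any order without coordination.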
We had already zeroed in on which artifact version to deploy for each service, tested multiple times, including complete regression suites, in lower environments.
Migrating Data layer
This is the elephant in the room! To get this right we came up with:
- Automated scripts run as jobs in pipelines on CD servers with special access, which facilitated the data migration
- Independent scripts were created for different Datastores and tested N number of times in lower environments and mock drill in prod
- Post-migration automated Test suite for checking Data Integrity and Durability for respective Datastores
- Using a dedicated checklist for respective Datastores migration
- Bigger data sets, like blob storage in the TBs, were migrated incrementally
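The post-migration integrity check can be sketched as comparing record counts and per-record checksums between source and target; simulated here with in-memory dicts, while the real suite would read from the old and new datastores:

```python
import hashlib
import json

def fingerprint(record):
    # Stable checksum of a record, independent of key order.
    return hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()

def verify_migration(source, target):
    """Compare record counts and content checksums by id.

    Returns a dict of discrepancies; an empty dict means the
    two stores match. A sketch of the Data Integrity idea, not
    the team's actual test suite.
    """
    issues = {}
    if len(source) != len(target):
        issues["count"] = (len(source), len(target))
    for rid, rec in source.items():
        if rid not in target:
            issues.setdefault("missing", []).append(rid)
        elif fingerprint(rec) != fingerprint(target[rid]):
            issues.setdefault("mismatch", []).append(rid)
    return issues
```

Checksumming per record (rather than eyeballing counts alone) is what catches silent corruption during the copy.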
The entire Infra in the new Cloud was built from scratch, with a Network design completely different from the previous Cloud’s (VPC-wide constructs, Interconnect, secondary IPs, Security Groups over firewall rules, logical NAT GWs, etc.)
Once the Network and the Org/folder/project structure (a construct specific to the new Cloud) were created, each Compute and Data resource was POC’d in a labs-like setup where the team understood the nitty-gritty, best practices, what’s possible and what’s not, IAC tooling mapping to the specific technologies, etc.
After reaching a comfort level with understanding and implementing the different components as IAC, Infra deployment pipelines for the different environments were created in stages, with feedback from each environment.
Pipelines were designed in such a way that a fresh immutable environment could be created in under an hour.
It’s awesome to look at this!
Though the new cloud offered better security constructs in multiple dimensions, like REST/Transport encryption, it was necessary to re-do the pen-testing of Infra and Application before migrating.
The team also analyzed the different components (new and old) of infra and application through the lens of the CIA triad.
Our existing test model prevailed as tests were run from one environment to another through multiple cycles from a functional perspective.
One thing we were quite paranoid about was the non-functional pieces: behavior when we scale, fault tolerance, zone failure handling, or just anything we were missing that would only show up post scale. To address this paranoia, with help from our PT (performance automation) team, we did extensive performance tests in our PT environment to get as comfortable as possible.
Also, for testing the app post-migration with smoke/regression tests, the team came up with a special build and allowed only traffic from that build before letting full-blown customer traffic in.
External parties — Intranet communication
Many entities/support systems we integrated with communicated over the Intranet via Interconnect/Express connect. That meant the IPs they resolved to had to be changed! (some whitelisting of NAT IPs for public traffic also changed)
These changes were coordinated with communications across multiple channels and by exercising the same change in at least one lower environment. Also, as precheck validation, we tested to-and-fro communication with the different external systems well in advance with simple connectivity checks.
Order of migration for non-related systems
Some systems, like the official website and other streams of business completely decoupled from Bank App journeys, were migrated N days before the D-day to increase the comfort level.
On the D-day
Here it comes!
The D-day was chosen with both objective and subjective considerations in mind. As you can guess, we were doing this at midnight! Teams were asked to rest well for the single largest collaboration we had done in one go. For some, this was the first migration of any kind they had been involved in. Some were anxious, some were relaxed, some were doing last-minute prep!
As the clock struck, communication was sent out to customers about X hours of downtime. The final go/no-go checklist was reviewed by stakeholders, along with a pulse check on how the team felt.
We created a special war-room channel for everybody to jump in and out of calls, since everybody was working remotely. Aside from helping the team coordinate better, the channel helped record things for regulatory requirements.
It was a go! Run-books were followed, with a detailed checklist for each item. Migration was completed, and smoke tests with the special build finished well before the estimated time. The team posted test results on the collaboration page. Things looked OK, no blockers at least (some config mismatches were quickly corrected and the services redeployed). Wait…
Did I tell you there was another go/no-go to open up for public traffic or roll back? Some monitoring/alerting setup got done at the last moment for a multitude of reasons (we will cover this in a while). Eventually this was resolved with short-term and long-term fixes. It was a nail-biting go/no-go, since some of the alerts were supercritical. It was a go!
The team was fully awake, watching monitoring, testing with full excitement!
It was almost dawn, we heard in the group call someone’s parent asking “How come you woke up so early today?”. We all chuckled!
Traffic ingress was enabled back for the public and monitoring continued with most of us opening dashboards with eyes glued. So far so good!
We did come across a few platform-specific issues and missed whitelistings which caused problems as traffic grew in the morning. One highlight was a critical API becoming ridiculously slow. A quick RCA showed that the available VM NAT ports were getting exhausted, which meant any new request (once ports ran out) had to wait for ephemeral ports to free up before firing an API call to the external SAAS service we interacted with. This was quickly resolved. The reason it was not caught in performance tests is that external entities were mocked with mock-servers, so traffic did not leave the network and didn’t consume ephemeral VM NAT ports at that scale.
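The back-of-envelope arithmetic behind that exhaustion is worth spelling out; the numbers below are illustrative, not our actual production values:

```python
# Why NAT port exhaustion caps outbound request rate (illustrative numbers).
ports_per_nat_ip = 64512   # e.g. an ephemeral range of 1024-65535
time_wait_seconds = 120    # how long a closed port stays unusable

# Steady-state ceiling on NEW outbound connections per second to one
# destination through one NAT IP: ports free up only as TIME_WAIT expires.
max_new_conns_per_sec = ports_per_nat_ip / time_wait_seconds  # ~537/s

# If the API opens new connections faster than this ceiling, requests
# queue waiting for free ports -- exactly the slowdown observed.
```

Mock servers hide this class of problem because the mocked traffic never consumes real NAT ports, which is why it only surfaced with live external calls.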
Another functional issue popped up: API response validation failed on a character-length check, which the signed URLs for cloud storage object access exceeded in some cases.
And another issue was with referencing static assets in-app: the Blob storage treated an extra slash (/) in the URL differently between the old and new Cloud.
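One defensive way to sidestep such differences is to normalize duplicate slashes in asset URLs before use; this is our illustration, not the team's actual fix:

```python
import re
from urllib.parse import urlsplit, urlunsplit

def normalize_url_path(url):
    """Collapse repeated slashes in the path, leaving the scheme
    intact, so both clouds' blob stores resolve the same object key.
    A defensive sketch for the extra-slash issue described above."""
    parts = urlsplit(url)
    clean_path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(parts._replace(path=clean_path))
```

Normalizing at URL construction time means the app no longer depends on how each provider's blob store interprets `a//b` versus `a/b`.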
Prod support/ SRE
We did not have a dedicated technical prod support/SRE team. A few services were still in the previous cloud while most had migrated to the new one. Even during development time, the team did the heavy lifting in supporting and troubleshooting their peers’ issues on both clouds.
Teams were given enough time to get used to the changes in APM metrics as a whole. The tooling was kept the same as much as possible to reduce the learning curve.
What went well?
Overall, the migration went smoothly. The unscheduled learning opportunities yielded awesome learnings. How team members came together to step in and solve ad-hoc issues was inspiring. The orchestration was well planned and tested across systems. Teams and stakeholders understood the tradeoffs in time.
Stakeholders gave kudos to the team congratulating them on the most effective migration they witnessed in their careers.
What could have been improved?
Communications! There is always something, somewhere, that gets missed or misunderstood. I can’t stress enough the 3 Cs (Clear, Concise, Consistent) and proactive engagement.
On T-1 day some surprises came in: one missed communication about alerting on log errors not working (the tool we used could no longer snoop and alert from GKE logs), and missing consumer-lag metrics and alerts for MQ. Both risks were mitigated with quick alternative solutions from the team, as short- and long-term resolutions.
Our systems were migrated successfully. We did have hiccups; we sipped some cold water and resolved them! Cannot thank the entire team enough for helping us through this journey and for the unplanned learnings!
In love with the process and result.
So not all migrations are painful!
Like what you’re reading? Why not join our team and meet the like-minded team behind this? Join us!