Published in DKatalis

The Cloud Journey!

Migrate Bank from one Cloud to another

Motivation: Answering Why!

Background

Timelines of foo, bar, baz Cloud-Providers
  • High Availability (HA): three Availability Zones (AZs) versus two in the older cloud
    ∘ Improved HA of MongoDB (quorum: nodes distributed across three zones instead of one, so the cluster tolerates the failure of one zone)
    ∘ Improved HA of Kafka: a ZooKeeper cluster that maintains a quorum
    ∘ Elasticsearch HA
  • Security:
    ∘ Encryption of data at rest and in transit for all data services
    ∘ Better threat analysis using tools like Cloud Armor
    ∘ Data Loss Prevention (DLP) APIs hooking directly into certain critical data sources
    ∘ Security Command Center (SCC) availability and easy coupling with logs from all services
    ∘ K8s: pods use Service Accounts (SA) directly instead of static credentials when interacting with cloud services such as Cloud Functions
    ∘ Better integration with VM and container scanning/signing tools
    ∘ Fine-grained ACLs for resource access tied to SSO login (Least Privilege principle)
  • Developer experience
    ∘ Easy tooling: more SDK options
    ∘ Mature IAC (Infrastructure-as-Code) utilities
    ∘ Wider documentation and community support
    ∘ More testimonials
    ∘ Marketplace integrations with a wide range of tools and server technologies for working with third-party providers
  • Monitoring/Alerting
    ∘ Wide range of integrations with third-party tools
    ∘ Automatic logging of K8s workloads to Cloud Logging
    ∘ Easy integration of monitoring agents on Compute resources
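
The quorum argument behind the High Availability point can be checked with a little arithmetic. The sketch below is only an illustration, not our actual cluster configuration: the function names and the node layouts are assumptions made up for the example. It checks whether a majority of voting members is still reachable after any single zone is lost.

```python
from typing import List


def majority(voting_members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return voting_members // 2 + 1


def survives_any_zone_loss(nodes_per_zone: List[int]) -> bool:
    """True if a majority remains after losing any one zone entirely."""
    total = sum(nodes_per_zone)
    needed = majority(total)
    return all(total - lost >= needed for lost in nodes_per_zone)


# Hypothetical layouts of a three-member cluster.
print(survives_any_zone_loss([1, 1, 1]))  # three zones, one node each -> True
print(survives_any_zone_loss([2, 1]))     # two zones -> False
print(survives_any_zone_loss([3]))        # single zone -> False
```

With only two zones, one zone always holds the majority of a three-member cluster, so losing that zone loses the quorum. A third zone removes that single point of failure.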

Constraints

  • Data residency within the country of operation
  • Meeting DC/DR requirements
  • Strict SLOs and SLAs
  • Governance of PII data usage
  • Writing IAC completely afresh, mapping each service to the new cloud’s ecosystem
  • Tight coupling with the old cloud provider’s blob storage → code changes needed for cloud storage in the new cloud
  • The caching/in-memory ephemeral store, which ran as a cluster in the previous cloud, was not available in that form in the new cloud, so we fell back to a simple Master-Slave topology
  • Alerting/monitoring integration had to be redone with our APM tooling
  • Deployment pipeline changes to cater to K8s deployments
  • Elasticsearch license change resulting in the loss of X-Pack features → code changes for alternative approaches
  • No downtime or stop-the-world approach affecting engineers’ development time and sprint goals
  • Minimal change/learning curve around tooling
  • Fewer managed services available out of the box in the new cloud for our data sources, which had to be mapped to OS variants

Risks

We solve one problem only to pick another!

  • Data discrepancy while migrating data (a working production system with a few lakh customers)
  • Security risks around infra, networking, and system access
  • Ad hoc surprises with some components: expecting something to work one way, only to find it works another
  • Downtime needed
  • Key skill sets necessary for pulling this off

Answering the What: Approach for migration

  • Will this need regulatory/ compliance approval?
  • Do we have the bandwidth?
  • Will sprint goals be disrupted?
  • Will there be code changes?
  • Learning curve?
  • Will it be full-blown migration or phased?
  • Do we pay for enterprise support for the new cloud and let go of the old one?

What helped?

Answering Phased migration vs Lift and Shift

  • The overall overhead (SRE support, maintaining two versions of modules, parallel deployments, maintenance, if/else branches in a few services to cater to OSS/ES, asset management, licensing, cloud enterprise support, syncing between two clouds, data latency) would have been much higher with a phased roll-out than with a complete shift, since the data in our persistent stores was in GBs, not TBs or PBs
  • Satisfying the go/no-go criteria meant we had already handled all the risks, even for Lift and Shift (we will come to the criteria shortly)

Managed vs self-managed services

How did we do the migration?

Checklists

Prep for the D day:

The plan

Phases and feedback cycle:

If anything were to fail, we wanted it to fail early and fail fast.

labs (IAC) → dev → staging → PT (performance test) → Prod

  • Getting a time estimate for the complete E2E migration, including data
  • Tuning the process by identifying tasks that could run in parallel
  • Further improving the checklist until it read more like a runbook
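
To show how running tasks in parallel changes the time estimate, here is a minimal sketch. It assumes each task has a duration and a list of tasks it depends on, and that the dependencies contain no cycles. The task names and durations are hypothetical, not our real numbers. The estimate is the longest dependency chain rather than the sum of all tasks.

```python
from typing import Dict, List, Tuple

# name -> (duration in minutes, names of tasks that must finish first)
Tasks = Dict[str, Tuple[int, List[str]]]


def estimate_total_minutes(tasks: Tasks) -> int:
    """Finish time of the last task when independent tasks run in parallel.

    Assumes the dependency graph has no cycles.
    """
    finish: Dict[str, int] = {}

    def finish_time(name: str) -> int:
        if name not in finish:
            duration, deps = tasks[name]
            start = max((finish_time(d) for d in deps), default=0)
            finish[name] = start + duration
        return finish[name]

    return max(finish_time(name) for name in tasks)


# Hypothetical tasks and durations, for illustration only.
tasks: Tasks = {
    "stop_traffic": (10, []),
    "migrate_mongodb": (60, ["stop_traffic"]),
    "migrate_kafka": (30, ["stop_traffic"]),
    "deploy_compute": (45, ["stop_traffic"]),
    "smoke_test": (20, ["migrate_mongodb", "migrate_kafka", "deploy_compute"]),
}

serial_minutes = sum(duration for duration, _ in tasks.values())
parallel_minutes = estimate_total_minutes(tasks)
print(serial_minutes, parallel_minutes)  # 165 90
```

Repeating the drill in each lower environment is what turns such guessed durations into measured ones.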

Rollback plan

High-Level Steps

  1. Notify customers of the downtime well in advance
  2. Stop incoming traffic (internal and external) → confirm from APM
  3. Make sure messages in the MQs (message queues) are completely drained
  4. Block all traffic to the Datastores
  5. Double-check that there is no activity on the Datastores (can’t stress this enough!)
  6. Start migrating the Datastores in parallel
  7. In parallel, deploy compute resources (with access to the Datastores, but with customer/test-user traffic blocked)
  8. Once data is completely migrated, allow traffic from the Compute layer to the Data layer
  9. Sanity-check logs, APM, etc. for errors and anomalies
  10. Switch IPs for key services calling Bank APIs over the Intranet
  11. Allow traffic from the special app build created for smoke testing
  12. If all is well, widen the test scope to a complete regression
  13. All good! Test with traffic from public customers
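
The steps above follow a simple pattern: do something, then confirm it before moving on. A minimal sketch of that pattern is shown below. It is not our actual tooling: the `Step` class, the `run_runbook` function, and the in-memory `state` dictionary standing in for APM, MQ, and Datastore metrics are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    name: str
    action: Callable[[], None]  # performs the step
    verify: Callable[[], bool]  # gate: must return True before moving on


def run_runbook(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first step whose gate fails.

    Returns (names of completed steps, name of the failed step or None).
    """
    completed: List[str] = []
    for step in steps:
        step.action()
        if not step.verify():
            return completed, step.name
        completed.append(step.name)
    return completed, None


# Hypothetical in-memory state standing in for real metrics.
state = {"incoming_rps": 120, "queue_depth": 3, "datastore_connections": 8}


def stop_traffic() -> None:
    state["incoming_rps"] = 0


def drain_queues() -> None:
    state["queue_depth"] = 0


def block_datastore_traffic() -> None:
    state["datastore_connections"] = 0


steps = [
    Step("stop incoming traffic", stop_traffic,
         lambda: state["incoming_rps"] == 0),
    Step("drain message queues", drain_queues,
         lambda: state["queue_depth"] == 0),
    Step("block datastore traffic", block_datastore_traffic,
         lambda: state["datastore_connections"] == 0),
]

completed, failed = run_runbook(steps)
print(completed, failed)
```

Returning the name of the failed step, rather than raising, makes it easy to decide whether to retry that step or start the rollback plan.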

Migrating Compute layer

Migrating Data layer

  • Automated scripts, run as pipeline jobs on CD servers with special access, facilitated the data migration.
  • Independent scripts were created for the different Datastores and tested repeatedly in lower environments, plus a mock drill in prod
  • A post-migration automated test suite checked data integrity and durability for each Datastore
  • A dedicated checklist was used for each Datastore’s migration
  • Bigger data sets, such as blob storage in the TBs, were migrated incrementally
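
As an illustration of what a post-migration integrity check can look like, here is a minimal sketch that compares records from a source and a target store by content hash. It is not our actual test suite. The record shape, the `id` key, and the function names are assumptions, and a real check would stream records from each Datastore rather than hold them in lists.

```python
import hashlib
import json
from typing import Dict, Iterable, List


def digest(record: Dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_datasets(source: Iterable[Dict], target: Iterable[Dict],
                     key: str = "id") -> Dict[str, List]:
    """Report records that are missing, unexpected, or different in the target."""
    src = {record[key]: digest(record) for record in source}
    dst = {record[key]: digest(record) for record in target}
    return {
        "missing": sorted(k for k in src if k not in dst),
        "unexpected": sorted(k for k in dst if k not in src),
        "mismatched": sorted(k for k in src if k in dst and src[k] != dst[k]),
    }


# Hypothetical records, for illustration only.
source_records = [
    {"id": 1, "name": "alice", "balance": 100},
    {"id": 2, "name": "bob", "balance": 250},
    {"id": 3, "name": "carol", "balance": 75},
]
target_records = [
    {"balance": 100, "name": "alice", "id": 1},  # same content, different key order
    {"id": 2, "name": "bob", "balance": 205},    # value changed in transit
]

report = compare_datasets(source_records, target_records)
print(report)  # {'missing': [3], 'unexpected': [], 'mismatched': [2]}
```

An empty report for every Datastore is the kind of evidence a go/no-go decision can rest on.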

Building new-Infra

Security considerations

Testing

External parties — Intranet communication

Order of migration for non-related systems

On the D-day

Prod support/ SRE

What went well?

What could have been improved?

Conclusion

In love with the process and result.
