AWS Account Migration Journey — Part 2

Sumita Mudgil
MiQ Tech and Analytics
6 min read · Apr 18, 2022

In the previous blog, we went through the option we finalized for migrating the MiQ AWS production workload to a new Virtual Private Cloud. In this blog, we will cover the migration journey itself.

Approach

It was really important to figure out the order in which services would be migrated, so that they could move one by one while communication between the VPCs remained unaffected.

Moving all services at once would have reduced the complexity, but it would have increased the downtime. The order we decided on was as follows:

  1. Migrate independent services (those that do not interact with any other service) to NVPC.
  2. Migrate dependent services to NVPC and let them communicate with the base services via AWS PrivateLink (a sketch of this wiring follows the list).
  3. Migrate the rest of the services to NVPC and change the endpoints so that services talk to each other via load balancers instead of AWS PrivateLink.
  4. Only under exceptional circumstances that could not be handled in the NVPC → OVPC design, such as a cyclic dependency, did we create VPC endpoints in OVPC for NVPC Kubernetes load balancers.
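
For illustration, here is a minimal sketch of how a PrivateLink connection between the two VPCs can be wired up with boto3. The region, ARNs, IDs, and names are placeholders rather than our actual values, and the exact parameters a given setup needs may differ.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is a placeholder

# In OVPC: expose an internal Network Load Balancer as a PrivateLink endpoint service.
service = ec2.create_vpc_endpoint_service_configuration(
    NetworkLoadBalancerArns=["arn:aws:elasticloadbalancing:...:loadbalancer/net/base-svc-nlb/abc123"],
    AcceptanceRequired=False,
)
service_name = service["ServiceConfiguration"]["ServiceName"]

# In NVPC: create an interface endpoint so migrated services can reach the OVPC base service.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0nvpc1234",                      # NVPC ID (placeholder)
    ServiceName=service_name,
    SubnetIds=["subnet-0aaa", "subnet-0bbb"],   # NVPC private subnets (placeholders)
    SecurityGroupIds=["sg-0nvpc-endpoints"],
    PrivateDnsEnabled=False,
)
```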

Below we take one of the services (notification-service) as an example and discuss the steps taken during its migration.

Generic Migration Steps

Before executing the migration, all the teams made changes to the codebase of the relevant services to allow communication between the VPCs (e.g., updating the endpoint of the accessed service).

Expected downtime: ~30 minutes

  1. Bring the service up with 0 replicas in NVPC (see the scaling sketch after these steps)
  2. Create a new RDS instance
  3. Shut down the service in OVPC
  4. Run a dump and restore of the database via S3
  5. Bring up the replicas in NVPC
  6. Change the load balancer mapping and verify functionality
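
A rough sketch of the scale-down/scale-up steps (1, 3, and 5) using the Kubernetes Python client is shown below. The context names, namespace, and replica count are hypothetical placeholders; in practice we drove these steps through our regular deployment tooling.

```python
from kubernetes import client, config

def scale_deployment(context: str, namespace: str, name: str, replicas: int) -> None:
    """Scale a deployment in the cluster identified by the given kube context."""
    api = client.AppsV1Api(api_client=config.new_client_from_config(context=context))
    api.patch_namespaced_deployment_scale(
        name, namespace, body={"spec": {"replicas": replicas}}
    )

# Step 1: bring the service up with 0 replicas in NVPC (manifests applied, nothing serving yet).
scale_deployment("nvpc-prod", "analytics", "notification-service", 0)

# Step 3: shut down the service in OVPC before taking the database dump.
scale_deployment("ovpc-prod", "analytics", "notification-service", 0)

# Step 5: after the restore, bring up the replicas in NVPC.
scale_deployment("nvpc-prod", "analytics", "notification-service", 2)
```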

Migration Based On App Categorization

From a communication standpoint, applications in MiQ fall into two categories.

  1. Applications that are consumed only via the API Gateway
  2. Applications that are consumed via direct Route53 endpoints

With the high-level design in place, the approach for these two categories had to be different.

We use Tyk as our API gateway; a deeper discussion of Tyk is for another time. For now, we will talk about how we managed the migration of services communicating via the API gateway.

Via API Gateway

Approach 1

Clone the API and create a new private endpoint in the OVPC API Gateway with all settings intact. In the NVPC API Gateway, set the target of the corresponding endpoint to the newly created OVPC API Gateway endpoint.

Once everything is moved to NVPC, change the API Gateway target to the internal kube address.
(ex:- http://xxx-service.analytics.svc:port)
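
As a rough illustration, this retargeting can be scripted against the Tyk Gateway REST API. This is a hypothetical sketch: the gateway URL, admin secret, API ID, and target URLs are placeholders, and our actual process went through our Tyk configuration management.

```python
import requests

TYK_GATEWAY = "https://nvpc-tyk-gateway.internal:8080"       # placeholder NVPC gateway URL
HEADERS = {"x-tyk-authorization": "<gateway-admin-secret>"}  # placeholder secret
API_ID = "notification-service-api"                          # placeholder API ID

# Fetch the API definition, point its upstream at the new target, and push it back.
api_def = requests.get(f"{TYK_GATEWAY}/tyk/apis/{API_ID}", headers=HEADERS).json()

# Phase 1: target the cloned OVPC API Gateway endpoint.
# Phase 2 (after the move): target the internal kube address instead,
#   e.g. http://xxx-service.analytics.svc:port
api_def["proxy"]["target_url"] = "https://old-private-gw.ovpc.internal/notification-service/"

requests.put(f"{TYK_GATEWAY}/tyk/apis/{API_ID}", headers=HEADERS, json=api_def)
requests.get(f"{TYK_GATEWAY}/tyk/reload/group", headers=HEADERS)  # hot-reload the gateways
```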

Approach 2

  • Create an entry <app>.prod.miqdigital.com (xxx-service.prod.miqdigital.com) in the private Route53 domain prod.miqdigital.com (bound to the NVPC), with the target as the NVPC endpoint service (which internally links to the OVPC LB); a sketch of this record creation follows these steps.
  • Upstream services (in OVPC) will add the newly created endpoint to their ingress to allow being called as <app>.prod.miqdigital.com from the NVPC.
  • In the NVPC API Gateway, set the target of the respective endpoint to the Route53 entry <app>.prod.miqdigital.com.
  • Once the service is moved to NVPC, change the NVPC API Gateway target to your internal kube service name.
    (ex:- http://xxx-service.analytics.svc:port)
  • Delete the earlier created Route53 entry to restrict any calls through the direct endpoint.
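
A minimal sketch of creating the private Route53 record with boto3 is shown below. The hosted zone ID and target DNS name are placeholders; a CNAME is used here for simplicity, though an alias record to the VPC endpoint would also work.

```python
import boto3

route53 = boto3.client("route53")

# Hosted zone for prod.miqdigital.com, associated with the NVPC (placeholder ID).
HOSTED_ZONE_ID = "Z0PLACEHOLDER"

# Point xxx-service.prod.miqdigital.com at the NVPC endpoint (which links back to the OVPC LB).
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Temporary record for OVPC -> NVPC migration",
        "Changes": [{
            "Action": "UPSERT",   # use "DELETE" later to remove the temporary record
            "ResourceRecordSet": {
                "Name": "xxx-service.prod.miqdigital.com",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": "vpce-0abc123-example.vpce-svc.us-east-1.vpce.amazonaws.com"}],
            },
        }],
    },
)
```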

Depending on the infrastructure of the teams (some legacy services were still not on Kubernetes) and the amount of work involved in each option, teams took the call on whether to go ahead with Approach 1 or 2.

Via Route53 endpoints

  • All upstream services will add prod.miqdigital.com to their ingress to allow being called as servicename.prod.miqdigital.com from the NVPC (see the ingress sketch after this list). The same configuration will be used in the NVPC as well.
  • The prod.miqdigital.com DNS domain will only be functional in the NVPC; the OVPC will not resolve prod.miqdigital.com.
  • All upstream services will create a record in prod.miqdigital.com in the new VPC and map it to the load balancer.
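
For the ingress side, here is a rough sketch of adding the extra host to an existing Kubernetes ingress with the Python client. The kube context, namespace, and ingress name are placeholders; in practice teams made this change in their own ingress manifests.

```python
import copy
from kubernetes import client, config

config.load_kube_config(context="ovpc-prod")            # placeholder kube context
net = client.NetworkingV1Api()

namespace, ingress_name = "analytics", "xxx-service"    # placeholders

ingress = net.read_namespaced_ingress(ingress_name, namespace)

# Duplicate the existing rule so the service also answers on the new private domain.
extra_rule = copy.deepcopy(ingress.spec.rules[0])
extra_rule.host = "xxx-service.prod.miqdigital.com"
ingress.spec.rules.append(extra_rule)

net.patch_namespaced_ingress(ingress_name, namespace, ingress)
```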

Common Services’ Migration

Before services started migrating to NVPC, we made sure that basic infrastructure was available in NVPC.

  • VPN access to the new VPC was in place.
  • EKS security groups by default allowed ingress and egress only on ports 80 and 443 (see the sketch after this list).
  • Select OVPC k8s private load balancers were made accessible in the NVPC using VPC endpoints.
  • The IT team created a single ACM certificate that was used by both OVPC and NVPC for SSL.
  • Existing ingress controllers were updated to use the single ACM certificate created above.
  • A Kubernetes cluster was created along with a new Rancher.
  • Ingress controllers for public and private access were in place, using the single ACM certificate created earlier.
  • A new Jenkins was in place, with everything the same as the current prod Jenkins.
  • A new Vault was in place, with everything the same as the current Vault, and with auth and the injector configured in Kubernetes. The Vault was to be migrated last.
  • New monitoring was in place with the existing dashboards and Prometheus data. There were two monitoring dashboards, one in OVPC and another in NVPC; people had to use two separate monitors to view data for a given time period until the service metrics caught up in the new monitoring tool.
  • A new Sensu Go was in place for functional monitoring; teams had to migrate their URLs to the new Sensu.
  • LogDNA was in place in the NVPC k8s with the same tags as the OVPC k8s.
  • The API Gateway was in place in the new EKS cluster, with all the data copied from the existing/old prod EKS.
  • API Gateway public and private URLs were shared with everyone.
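
As an example of the default EKS security group posture mentioned in the list above, here is a sketch of how the 80/443-only ingress rules might be added with boto3; the security group ID and CIDR are placeholders, not our actual values.

```python
import boto3

ec2 = boto3.client("ec2")

NVPC_EKS_SG = "sg-0nvpc-eks-nodes"   # placeholder security group ID
VPC_CIDR = "10.20.0.0/16"            # placeholder NVPC CIDR

# Allow only HTTP and HTTPS ingress by default; anything else is opened case by case.
ec2.authorize_security_group_ingress(
    GroupId=NVPC_EKS_SG,
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": port, "ToPort": port,
         "IpRanges": [{"CidrIp": VPC_CIDR, "Description": "default web ingress"}]}
        for port in (80, 443)
    ],
)
```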

Database Migration Strategy

Teams were responsible for migrating their databases to the new VPC. At the time we were using self-managed MongoDB along with AWS-managed relational databases. As the MongoDB data wasn't huge, this migration was done with the mongodump/mongorestore utilities: we dumped the data to S3 and restored it in the new VPC.
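
A rough sketch of this dump-to-S3-and-restore flow is below, assuming the mongodump/mongorestore CLIs are installed; the URIs, bucket, and key are placeholders.

```python
import subprocess
import boto3

SRC_URI = "mongodb://mongo.ovpc.internal:27017"   # placeholder OVPC MongoDB
DST_URI = "mongodb://mongo.nvpc.internal:27017"   # placeholder NVPC MongoDB
BUCKET, KEY = "miq-migration-dumps", "notification-service/dump.gz"  # placeholders

s3 = boto3.client("s3")

# In OVPC: dump the database to a gzipped archive and park it in S3.
subprocess.run(["mongodump", f"--uri={SRC_URI}", "--archive=dump.gz", "--gzip"], check=True)
s3.upload_file("dump.gz", BUCKET, KEY)

# In NVPC: pull the archive from S3 and restore it into the new cluster.
s3.download_file(BUCKET, KEY, "dump.gz")
subprocess.run(["mongorestore", f"--uri={DST_URI}", "--archive=dump.gz", "--gzip"], check=True)
```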

Migration of AWS RDS was tricky, and we had the following options available:

  1. Change VPC directly — This was the most widely used option, as it took the operational stress away from the teams and AWS was leveraged to get it done quickly. (This doesn't support Aurora clusters.) More on this can be found here.
  2. DMS — Database Migration Service is provided as a managed service by AWS. This approach was elaborate, and as we didn't have a requirement for live data migration, this option was dropped.
  3. A new database from a snapshot — For small applications where data wasn't changing frequently, this approach was taken (see the sketch after this list). A database snapshot was taken after bringing down the service in OVPC. Before bringing up the service in NVPC, the database was restored from this snapshot in NVPC.
  4. Clone RDS — For one of the applications where the data was in terabytes, this approach was taken, and it drastically reduced the downtime that the other approaches would have incurred. The database cloning process took 10–15 minutes for 2 TB of data. We created a clone of the database in NVPC and promoted it to primary after the migration of the application.
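
For option 3, a hedged boto3 sketch of the snapshot-and-restore flow is below; the identifiers, subnet group, and security group are placeholders, and a production restore would typically carry more parameters (parameter groups, KMS keys, and so on).

```python
import boto3

rds = boto3.client("rds")

# After the service is stopped in OVPC, snapshot the source database.
rds.create_db_snapshot(
    DBInstanceIdentifier="notification-service-db",            # placeholder
    DBSnapshotIdentifier="notification-service-db-migration",
)
rds.get_waiter("db_snapshot_available").wait(
    DBSnapshotIdentifier="notification-service-db-migration"
)

# Restore the snapshot into the NVPC by pointing it at an NVPC subnet group.
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="notification-service-db-nvpc",
    DBSnapshotIdentifier="notification-service-db-migration",
    DBSubnetGroupName="nvpc-private-db-subnets",               # placeholder
    VpcSecurityGroupIds=["sg-0nvpc-db"],                       # placeholder
)
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier="notification-service-db-nvpc"
)
```
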
Leads after migration

This execution wouldn't have been possible without a proper process. We followed regular standups, shared information via wikis, and reviewed run-books, to name a few practices.

Conclusion

If you have made it this far, I would recommend checking out other blogs by the MiQ team (never leave an opportunity to promote good work 🤪).

Bringing the attention back to the migration journey: as much as this was a technical challenge, I have to acknowledge how critical a collaborative approach was to successfully completing a migration activity at the organization level.

As for how it went: had this approach faltered, I would have been writing a "Learning" blog instead!
