ShareChat’s cloud infrastructure, completely based on Amazon Web Services(AWS) from compute to databases, was faced with a formidable challenge of migrating to Google Cloud Platform(GCP). Any cloud to cloud migration is a long and demanding process where we had to tackle varying multiple problems, some of them might not even have been heard of before. In this post, we are going to discuss how we addressed the issues and challenges faced for our traffic migration from AWS to GCP, right from initial experimentation level to final Big Bang cutover.
We maintain a microservices architecture at ShareChat which runs a vast number of services and hundreds of jobs running parallely. To give you an idea about the scale of it, we serve more than 5 billion requests on a daily basis. These are just external requests, with total internal request count exceeding 15 billion.
At this scale, it becomes a very high-risk process to even bring a minor change in the system, let alone shifting the entire infrastructure to another cloud provider. In this case, routing of external traffic becomes a key element because the migration strategy of every other cloud component revolves around it. So we had to come up with a smooth and reliable solution to deal with this, ensuring zero downtime.
Abstraction of routing layer
With a large number of microservices serving billions of requests everyday in the system, it is only safe to assume that there is an api-gateway at the top level working as an intermediary between the client and the backend services. This api-gateway is implemented in a spring boot application which uses Zuul as a request proxy. Zuul is an edge service open sourced by Netflix which is used for Layer 7 (application layer) routing. Here is a basic representation of this:
As mentioned earlier, the course of action of routing traffic between different clouds becomes an important aspect of the overall migration approach. The first requirement that arrives for the testing phase is to send an asynchronous copy of the request that is being received on AWS to GCP. This will allow us to pre warm the system and to assess the functionality and scaling of the respective backend service that the request belongs to.
A simple solution to this would be to just duplicate the request at the api-gateway and make an asynchronous call(without waiting for the response). This would have worked for this simple use case but with other requisites like affinity based routing, ramp up of sync/async traffic etc. (which will be discussed later), it only made sense to abstract this splitting logic between data centres into another layer.
Ramp up and affinity based routing
In order to assess different cloud modules, they need to be tested at different scales. With many components running such as nginx ingress, compute nodes, databases(like BigTable and Spanner), Redis, GKE workloads you can’t expect every functionality to work in the same go at full scale. Apart from this, we also have to maintain consistency among users whose asynchronous request will be sent to GCP. It means that a user who is selected for async testing phase will always be included in further ramp up. This will also be helpful when we start the synchronous stage as a user once being served from GCP should always remain in the same cloud environment.
For this, we took the modulo of the user-id. This value was then compared to the percentage threshold value at which the current testing was going on and an asynchronous request was sent on that basis. Module threshold was chosen as 10000 because it can give us finer granularity on the percentage async traffic due to its large scale.
Dynamic config loading
As we progressed with the migration process, the frequency of ramping up the traffic also increased. Besides this, several other complications had to be taken care of.
- How do you ensure that a user does not get multiple redundant notifications from the asynchronous copy of a RESTful API which sends notifications to the user.
- There are various payment related APIs that should not get duplicate hits (for obvious reasons).
- Implementing beta testing for specific users to test above APIs and overall application behaviour.
The only plausible solution was to ignore these APIs in the async testing phase. But with the development process going on, ignored APIs kept coming up at a pace. So we had to find a way to dynamically load these configs in real time as this will also speed up the testing part.
Netflix Archaius provides a DynamoDB client library with a FixedDelayPollingScheduler, which basically polls the config in fixed intervals. This was integrated to fetch sync/async thresholds, ignored APIs and similar other configs. For beta testing, a list of specific users was selected which was also maintained using this dynamic config.
Service and Database Audit
During asynchronous testing, along with latency and scaling it was also important to examine the functionality of the backend services. This was done by evaluating the response of the same request from both the cloud providers. For this purpose, an audit service was implemented which would take these responses from both AWS and GCP and dump them to BigQuery where these can be compared(with expected variance). With this objective, a request-id was generated at the dc-splitter API Gateway which would then be used to compare the above responses. Request meta data such as path, query string, response, request body was also sent along with request-id to the audit service for further analysis.
For databases, we had closed off DB writes in GCP during async phase to avoid data corruption. Tables of Spanner and BigTable were instead being synced with their DynamoDB counterparts using DynamoDB streams. But due to huge architectural differences in Spanner and BigTable from DynamoDB, it was really important to test whether GCP databases will be able to handle the load that DynamoDB supports, specially given the fact that DynamoDB is a NoSQL and Spanner is an SQL database. This was not possible by mere beta testing.
To evaluate database’s performance, we had to enable DB writes for the time of async traffic. DynamoDB streams hold data updation info for up to 24 hours. This allowed us to enable DB writes for a short period as any data corrupted during this time period by async traffic can be overwritten by running the stream from time the writes were enabled. So we cut off the stream updation, enabled DB writes and evaluated Spanner and BigTable performance at peak async traffic. During night, DB writes were disabled again and streams were rerun to bring back the tables data in sync. This process was repeated over days until we got an adequate performance from both Spanner and BigTable.
After establishing performance of all the cloud components, it was time to move to synchronous routing phase. We were now ready to start serving users from GCP. Maintaining affinity in routing, logic used for the sync phase was the same as for the async phase.
We started with the lowest possible figure(0.01%) and gradually ramped it up. Some anomalies were still found in business metrics which became clearer as we increased the sync traffic. Patches for them were released as we progressed to 25% traffic. When everything looked good, the entire traffic of ShareChat was shifted from AWS to GCP (including the APIs that were ignored above) in a single night. We called it Big Bang.