Migration to Google Cloud Platform — Overview
In this post we will give an overview of the migration of our platform from AWS to GCP as well as the new systems which were built to support it. Subsequent posts will drill down into some of these new systems.
At the start of 2016, we had several issues that were making it difficult for us to evolve our platform quickly and were driving up its operating costs. These issues were covered in previous posts and are summarized below:
- Elasticache for Redis as a Datastore: we were using it as a datastore in our PHP stack and were not sharding across instances. This meant that we needed large, expensive servers and suffered outages due to replication sync latencies.
- AWS DynamoDB - What you should know…: we provisioned tables and indexes that were too granular, ran into read consistency gotchas, and re-read the same data because of the way our microservices were structured. All of this added complexity and cost.
- AWS Code Deploy & Auto Scaling..: the bursty nature of some of our workloads, the latency to start a new instance and AWS’ 1-hour minimum pricing gotcha complicated the latency and cost profile of our system.
- Why we moved to Go (Golang)…: our PHP codebase was written as the authors were learning PHP, and PHP’s dynamically typed nature made it hard for them to find and fix issues early. The codebase was also inextricably tied to a geo store that used Redis as a datastore, and attempts at cleanups tended to cause a new set of issues which were often discovered in Production.
For full disclosure, we had started to address some of these issues in the AWS version of the stack already. For the AWS Code Deploy & Auto Scaling issues, we went with Moving a platform into Kubernetes. For the PHP code base issues, we took a two-pronged approach: we standardized on Go for all new code and we took a Test it Now or Regret It Later… approach to build a test suite in Go covering the existing public PHP APIs.
So Why Move to Google Cloud Platform?
Even though we had started to address some of the issues in our AWS-based stack, we went ahead and made the decision to tackle the remaining issues as part of a migration to Google Cloud Platform. We were driven by two major factors: Isolation and Performance & Cost.
Isolation
We knew that we wanted to address the remaining issues with the existing stack, but we could not afford to impact the existing service if we got it wrong. We also knew that we wanted to experiment with a more advanced geo store and other techniques, and felt that we would definitely hit bumps early on. The best way to make sure not to impact the existing system in those cases was to have a new, completely separate one.
Performance & Cost
Yik Yak stores and serves a lot of geo-tagged data and we needed a low-cost, persistent store that would lend itself well to this use case. This is where Google Bigtable really shone and made the migration more appealing to us. For example, you can stand up the default 3-node cluster, which supports around 30,000 QPS, for about $1,500 per month, including storage. Since capacity is provisioned at the cluster level and not at the table level, you don’t need to worry about wasting money over-provisioning use-case-specific tables.
Requirements for the New Platform & Solutions
Single, Canonical API Definition → gRPC & grpc-gateway
We had REST APIs that could be called by Android, iOS and Web clients as well as backend servers and had already spent lots of time tracking down bugs due to inconsistent definitions amongst them. Examples of some of the inconsistencies were clients passing fields as strings sometimes and as ints or nulls some other times, and even cases of mix-ups on field names.
We wanted to be able to have a single, canonical API definition that could be used across all of them or that would allow us to at least verify the correctness of the clients.
We met these requirements by standardizing on gRPC both for defining service APIs and for communication inside of the “logical” datacenter, and then using grpc-gateway to produce the REST adapters that the clients could call. Any client idiosyncrasies would be dealt with in this adapter component.
Fast & Live Migration to the New Stack → Isomorphic APIs & Selective Traffic Redirection
We needed to migrate from AWS to GCP as quickly as possible and we had to support ongoing feature development while we worked to execute the transition. This would be particularly tricky if we found ourselves having to support parity across systems for an extended period of time.
Having made the decision to migrate, several key components needed to be built and validated. We had to be able to bring up the new stack in useful partials to validate the key subsystems as the migration progressed.
We met this requirement using two techniques. First, we leveraged our Configuration and Experimentation System to have the clients selectively send specific types of traffic to the new stack or the old stack. This let us try things in Production and fall back if we ran into issues. Second, we decided to keep the REST API isomorphic: even though we could have made it much more efficient, we chose not to make any significant changes to it during the migration, thereby controlling that variable.
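The selective redirection idea can be sketched roughly as follows. This is not the actual Configuration and Experimentation System, whose internals are not described here; it is an illustrative assumption of how a per-API percentage plus a stable hash of the client ID could decide which stack receives a call. The names `Rollout`, `Route`, `OldStack` and `NewStack` are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Stack identifies which backend should serve a request.
type Stack string

const (
	OldStack Stack = "aws"
	NewStack Stack = "gcp"
)

// Rollout maps an API name to the percentage (0-100) of clients that should
// be sent to the new stack. APIs that are absent stay on the old stack.
type Rollout map[string]uint32

// Route deterministically assigns a client to a stack for a given API, so
// the same client keeps hitting the same stack while the percentage is fixed.
func (r Rollout) Route(api, clientID string) Stack {
	pct, ok := r[api]
	if !ok || pct == 0 {
		return OldStack
	}
	h := fnv.New32a()
	h.Write([]byte(api))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") hash differently
	h.Write([]byte(clientID))
	if h.Sum32()%100 < pct {
		return NewStack
	}
	return OldStack
}

func main() {
	rollout := Rollout{"getMessages": 10}
	for _, id := range []string{"client-a", "client-b", "client-c"} {
		fmt.Println(id, "->", rollout.Route("getMessages", id))
	}
	// Falling back is a config change: set the percentage to zero.
	rollout["getMessages"] = 0
	fmt.Println("after rollback:", rollout.Route("getMessages", "client-a"))
}
```

Because the REST API stayed isomorphic, a switch like this only changes where a request goes, not what the request or response looks like.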
Horizontally Scalable & Lowest Operating Cost Possible → Kubernetes & Bigtable
We did not want to have to spend excess money on over-provisioned systems to deal with load spikes and we did not want to make adding capacity an onerous operation. Everything had to be able to scale horizontally.
We met this requirement via three items. First, we removed Redis altogether as a datastore. This saved us around $10,000 per month for the cluster, not including other operational overhead. Second, we went all in with Kubernetes and made all services containerized, which allowed us to deal with load spikes while maximizing our efficiency. Third, we standardized on Bigtable as our persistent store due to its cost-to-performance profile (again, 30,000 QPS for ~$1,500 per month).
Geostore → Google S2
We needed a geostore that was inexpensive to operate yet still allowed us to efficiently answer the question: give me the most recent N messages posted within an adaptive radius of my current location (Yik Yak’s key use case).
We met this requirement by using the Google S2 library along with Google Bigtable. This library uses some advanced math to map latitude and longitude coordinates into 64-bit integers that have some very interesting properties (I promise, I will dig into this in a separate post) which pair very well with NoSQL-type stores. This way we were able to address the adaptive radius use case as well as the "get me the set of messages posted within this closed region" use case.
Logging → Log to Stdout & EFK
We wanted all new code to have a standard way to perform logging and support for log searching. This is particularly valuable when you are trying to determine the root cause of an issue in Production.
We addressed this requirement by having all log lines go to standard output in JSON format and then pairing this with an EFK stack (Elasticsearch + Fluentd + Kibana). This produced a particularly clean solution to logs… which we used a lot, especially during the migration.
Metrics → Expvar & Datadog/Prometheus
We wanted to make sure that we had a standard way to expose metrics for dashboarding as well as alerting to keep things simple.
Since we had already standardized on Go for development, it made sense to expose any metrics via expvar, which could then be scraped by Datadog or Prometheus. As an added bonus, since we had standardized on gRPC, it was fairly straightforward to make sure all gRPC services exposed a standard set of metrics, such as response time percentiles per RPC endpoint. Developers could also easily expose additional, service-specific metrics using the same approach.
Product Analytics → BigQuery
Last, but not least, we needed to have a standard way to support product analytics. We wanted to avoid the classic problem of a query execution taking out your serving platform because there is no separate analytics pipeline.
We have not yet completed the full product analytics pipeline, but we plan to push this type of data to BigQuery, which has some very useful query and retention properties.
Putting it all together, you get the high-level diagram below.
This basic structure allowed us to pick a specific API, like getMessages (which returns the set of messages posted within a given radius of your current location), build the corresponding version in GCP and then send a portion of the traffic to it to validate it. If it worked well, we left it switched over. If we found an issue, we would send the traffic back to the old stack and then dig into the logs and telemetry data to determine what went wrong.
We created a highly performant, simpler, low-cost platform that allowed us to incrementally validate the new systems and deprecate the old ones while still rolling out new features. In the end it had plenty of horsepower to address the objectives, with lots of room to spare.
But it is one thing to describe it, another to understand it, and a completely separate matter to have the force of will to see such a migration through in the face of adversity. You will always run into unexpected issues and unforeseen consequences. For example, you might be on the leading/bleeding edge of a technology, as we were with gRPC and Bigtable, or simply working your way through tricky corner cases associated with live migrations.
In the next few blog posts we will go into more detail on the issues we ran into with gRPC and Bigtable, the challenges of creating a new Geostore based on S2, and the logistical issues surrounding the migration process itself.