From Kubernetes to a fully managed, serverless platform
TL;DR: Find out how we migrated our ad real-time bidding service from a self-managed Kubernetes cluster to Google's fully managed Cloud Run.
Howdy! My name is Boštjan Lasnik and I’m a member of the Ad Technology backend team at Outfit7. I primarily develop and maintain the tools, services and data pipelines that power our digital advertising ecosystem.
I also like to take deep dives into DevOps, and today I want to let you in on the process, challenges and benefits of migrating our ad real-time bidding (RTB) service from self-managed infrastructure to a fully managed Google Cloud platform, empowering our developers to do what they do best: spend more time on development and less time maintaining infrastructure.
The bidding process
Ad real-time bidding is a mechanism for buying and selling ad space through auctions (similar to item auctions). Publishers attempt to sell ad space when opportunities to show an ad occur. Multiple bidders (ad space buyers and advertisers) make their bids with the goal of winning the auction and showing their ad.
The process is called ‘real-time’ bidding because it starts as soon as an ad opportunity occurs and ends when the ad is delivered, and it all happens fast enough for the end user not to notice.
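Stripped of timeouts, price floors per deal, and exchange-specific rules, the core of an auction is picking the best eligible bid. A minimal sketch in Python (bidder names and the simple highest-bid-wins rule are illustrative, not how any particular exchange works):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Bid:
    bidder: str       # e.g. a DSP or an advertiser
    price_cpm: float  # bid price per thousand impressions (CPM)

def run_auction(bids: List[Bid], floor_cpm: float = 0.0) -> Optional[Bid]:
    """Return the winning bid: the highest price at or above the floor."""
    eligible = [b for b in bids if b.price_cpm >= floor_cpm]
    return max(eligible, key=lambda b: b.price_cpm, default=None)

bids = [Bid("dsp-a", 1.20), Bid("dsp-b", 2.45), Bid("dsp-c", 0.80)]
winner = run_auction(bids, floor_cpm=1.0)  # dsp-b wins
```

In a real exchange each bidder is queried over the network under a strict deadline, and bids that miss the deadline are simply excluded from the auction.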
One of the ways Outfit7 sells its ad space is through a modified, self-hosted Prebid Server, which is an open-source implementation of a real-time bidding exchange.
The Prebid Server project already provides an easy-to-use Dockerfile that builds and packages an HTTP-based application. All that's needed is a runtime for dockerized applications that offers a rich feature set to support production-ready development workflows, such as:
- Deployment (service) versioning
- Canary deployment (staged rollout) and traffic migration
- Automatic fault-tolerance and scalability
- As little infrastructure to maintain as possible
- Pay-per-use billing model
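For flavor, a multi-stage build for a Go HTTP service of this kind typically looks something like the sketch below (this is an illustration only, not the actual Prebid Server Dockerfile; see the upstream repository for that):

```dockerfile
# Build stage: compile a static Go binary.
FROM golang:1.20 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app .

# Runtime stage: minimal image, just the binary.
FROM gcr.io/distroless/static
COPY --from=build /app /app
EXPOSE 8080
ENTRYPOINT ["/app"]
```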
At the time, we had self-managed infrastructure in the form of a Google Kubernetes Engine cluster with an external load balancer to expose the service. We used Knative to adopt a cloud-native serverless development paradigm and further abstract the infrastructure away from application development.
We employ continuous deployment: each merge to the master branch triggers a deploy job that rolls out a new revision of the service on the target Kubernetes cluster.
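A Knative Service is described declaratively; a minimal sketch of such a manifest (the service name, image and scaling bounds here are hypothetical) looks like:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: prebid-server        # hypothetical service name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "2"
        autoscaling.knative.dev/maxScale: "50"
    spec:
      containers:
        - image: gcr.io/example-project/prebid-server:latest
```

Each applied change to the template produces a new immutable revision, and Knative handles routing and scale-to-demand for it.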
It was a good enough trade-off. For the cost of some infrastructure management and maintenance, the setup offered a simple, pay-per-use environment that supported our production development workflow needs.
But, as with any additional component in the system, the maintenance effort is never zero: Kubernetes nodes need to be upgraded, service auto-scaling parameters need to be tweaked, the load balancer needs to be reconfigured, the default service domain needs to be provisioned and maintained, and logging and monitoring need to be set up.
The complexity of the system grew even more when we started planning for high-availability and failover scenarios.
One day, without any changes to the application or infrastructure, alert policies started triggering during what seemed to be a normal traffic spike.
Performance degraded, seemingly at random: p95 request latency increased, a portion of requests started timing out, and some requests were dropped outright.
We tweaked the scaling. We investigated possible causes. We took the developer approach of adjusting alert thresholds and resolving alerts as one-time occurrences :). We spent a considerable amount of time investigating and collaborating with the Knative community. It seemed that, for some reason, something between the external load balancer, the Knative networking layer and the service itself was not scaling correctly, causing in-flight requests to time out. We never got to the bottom of it.
We went back to the drawing board and re-evaluated Cloud Run, Google's fully managed compute platform. It had introduced a new billing model that charges for instance time rather than per request, so we decided to give it a go.
Migrating from Knative to… Knative?
True, under the hood, Cloud Run is a managed Knative service. It lets you run stateless HTTP-driven containerized applications without having to worry about the underlying infrastructure.
It is a fully managed compute platform and production-ready out of the box, with integrations with common Google Cloud products and services such as monitoring, logging, security and the gcloud CLI. It guarantees regional fault tolerance, offers an SLA with 99.95% monthly uptime, and comes with responsive technical support.
At Outfit7, we use Terraform to manage our infrastructure on Google Cloud. It is a robust infrastructure-as-code tool that uses a declarative language to define infrastructure objects. Google provides a Cloud Run resource definition, so the migration from custom scripts and Knative service definitions was simple and fast.
Applying the Terraform resource automated the provisioning of the necessary infrastructure:
- Kubernetes cluster and node pool
- Networking features, such as load balancer with TLS termination, routing rules, domain names and DNS registration
- A Cloud Run service running the specified Docker image, exposed on a unique URI
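A minimal sketch of such a definition, using the `google_cloud_run_service` Terraform resource (the service name, region and image below are hypothetical):

```hcl
resource "google_cloud_run_service" "prebid_server" {
  name     = "prebid-server"
  location = "europe-west1"

  template {
    spec {
      containers {
        image = "gcr.io/example-project/prebid-server:latest"
      }
    }
  }

  traffic {
    percent         = 100
    latest_revision = true
  }
}
```

A `terraform apply` of this resource is all it takes to stand up the service.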
This infrastructure is fully managed by Google, so a developer does not need to maintain (or even know about) it.
We provide the service URI as part of the client service discovery configuration, so the zero-downtime migration was a matter of a reconfiguration and a single rollout to production.
The CI/CD workflow needed to be updated to drop the custom scripting for connecting to an external Kubernetes cluster, and to use the streamlined CircleCI Cloud Run orb to deploy directly to Google Cloud instead.
Flexible revision management
Every time a Cloud Run service is deployed, a new immutable revision is created. Cloud Run offers a robust mechanism for controlling revision tagging and traffic assignment at deploy time.
We tag production deployments with a custom revision tag and perform a gradual traffic split between the old and new revisions. This way, we achieve a gradual rollout that gives the underlying infrastructure time to scale, and we keep the option to roll back in case of unexpected problems during deployment.
Cloud Run revision tagging is an extremely powerful mechanism for supporting development workflows. It enables developers to expose custom-tagged revisions on the same cloud infrastructure without assigning production traffic to them.
Cloud Run also offers the concept of the "latest" deployed revision. Integration tests can target this implicit tag so that they always run against the most recently deployed service revision.
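A sketch of what this tag-then-shift flow looks like with the gcloud CLI (the service and tag names here are hypothetical):

```
# Deploy a new revision, tagged but receiving no production traffic yet.
gcloud run deploy prebid-server \
  --image gcr.io/example-project/prebid-server:v42 \
  --tag v42 --no-traffic

# Gradually shift traffic to the tagged revision...
gcloud run services update-traffic prebid-server --to-tags v42=10

# ...and promote it fully once metrics look healthy.
gcloud run services update-traffic prebid-server --to-tags v42=100
```

The tagged revision is also reachable on its own tag-prefixed URL the whole time, which is what makes it easy to test before it serves any production traffic.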
After the migration, it turned out that infrastructure costs were comparable between the old and new setups. On top of eliminating most of the infrastructure management effort, we were able to get additional discounts through committed use discounts.
Using a fully managed environment introduces an obvious drawback: you cannot directly access or discover the underlying Kubernetes nodes or pods.
This makes it impossible to add sidecar containers to pods, which posed a problem for us because we used a Prometheus scraper for custom metric collection. Because of this, we had to abandon the Prometheus pull-based scraping approach and instead implement push-based custom metrics collection via the Google Cloud Monitoring API.
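The push model itself is straightforward to sketch: metrics accumulate in-process and a background job periodically flushes them to a sink. In the Python sketch below the sink is a plain callback where our service would call the Cloud Monitoring API; the class and metric names are hypothetical, not our actual implementation:

```python
import threading
import time
from collections import Counter

class PushMetrics:
    """Counters accumulate in-process; a background thread
    periodically flushes them to a sink (e.g. Cloud Monitoring)."""

    def __init__(self, sink, interval_s: float = 60.0):
        self._counters = Counter()
        self._lock = threading.Lock()
        self._sink = sink
        self._interval_s = interval_s

    def inc(self, name: str, value: int = 1):
        with self._lock:
            self._counters[name] += value

    def flush(self):
        # Atomically swap out the accumulated counters, then push.
        with self._lock:
            snapshot, self._counters = dict(self._counters), Counter()
        if snapshot:
            self._sink(snapshot)

    def start(self):
        def loop():
            while True:
                time.sleep(self._interval_s)
                self.flush()
        threading.Thread(target=loop, daemon=True).start()

pushed = []
metrics = PushMetrics(sink=pushed.append, interval_s=60.0)
metrics.inc("bid_requests")
metrics.inc("bid_requests")
metrics.flush()  # in production the background thread does this
```

Unlike a pull model, nothing needs to reach into the container from outside, which is exactly what a fully managed runtime requires.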
Working on this migration was an interesting challenge, and the Outfit7 backend now takes advantage of Cloud Run more and more as we gradually migrate all applicable backend services to it. It has helped us significantly increase developer velocity and reduce infrastructure maintenance.
Cloud Run is still a relatively young offering within the Google Cloud product suite, so I am excited to see what features get added in the future.
What’s your experience with Cloud Run? Feel free to share your thoughts in the comments!