The shift from what we at Greenpeace affectionately call the ‘classic’ stack to our new Planet 4 WordPress hosting infrastructure on Kubernetes has been a long time in the making, and the journey has been both challenging and rewarding.
What follows is a brief examination of the journey, an overview of the infrastructure solution we’ve implemented thus far, a few takeaways, and perhaps your next volunteer opportunity?
From VM to K8s
The first three Planet 4 deployments (International, Greece, New Zealand) happened via what we used to call the ‘classic’ Greenpeace approach: virtual machines built with Puppet and Packer, deployed as blue/green autoscaling Google Cloud Platform (GCP) instance groups for each site. It worked well: upgrades involved no downtime, and the deployment process ensured considerable confidence prior to going live.
The infra team, though, was not 100% happy with the ‘classic’ approach: we felt it unnecessarily loaded the CPU at times, took far too long to deploy, was slow to respond to transient load, and was not in keeping with the P4 ethos of open source and current industry best practice. What if we could keep the safety of blue/green, version-controlled deployments, but on quicker, slicker open source infrastructure? The shiny toys.
That’s why we explored, experimented and chose the ‘Kubernetes way’, deploying the following eight P4 sites (India, Netherlands, Canada, Brazil, MENA, Colombia, Denmark and Luxembourg) on Google Kubernetes Engine, using purpose-built containers, in a Continuous Integration pipeline built on CircleCI, via the Kubernetes package management software Helm.
Ah, and we also migrated the first three P4 sites to the new model.
What went well
It hasn’t exploded!
Whereas each P4 site used to run on its own dedicated hardware, all sites are now delivered from an auto-scaling, multi-tenant GKE cluster. This allows us not only to benefit from the native efficiency of containers versus virtual machines, but also to pack more deployments onto the same physical hardware, while maintaining deterministic control over resource consumption.
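That deterministic control comes from Kubernetes resource requests and limits on every container. A minimal sketch of the idea, with purely illustrative names and figures (not our production values):

```yaml
# Illustrative only: image tag and resource figures are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: example-wordpress
spec:
  containers:
    - name: php-fpm
      image: greenpeace/planet4-docker:example-tag  # hypothetical tag
      resources:
        requests:      # what the scheduler reserves on a node for this pod
          cpu: 200m
          memory: 256Mi
        limits:        # hard ceiling enforced at runtime
          cpu: "1"
          memory: 512Mi
```

Requests let the scheduler bin-pack many sites onto shared nodes; limits stop any one tenant from starving its neighbours.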
On the ‘classic’ stack, we were hosting one WordPress site on 2 to 3 application VMs running Apache, Varnish and mod_php, with another VM used solely for Redis caching, and a dedicated Google Cloud SQL instance per site.
With this old configuration, average memory consumption on the webheads never exceeded ~45% (and most of that was Java running New Relic application monitoring), CPU load hovered around 20–30%, and the Redis instance used only ~100MB of its 4GB host!
Comparing the two solutions, we noticed that with the ‘classic’ approach to infrastructure provisioning we were effectively wasting 75% of the processing power and 60% of the purchased RAM. While one might argue it’s not ‘wasted’ per se — it’s reserved in case of load spikes — this is by modern standards an inefficient use of computing resources, especially when said resources are limited to a single site for which the traffic profile is fairly consistent.
Switching to Kubernetes-backed, Helm-orchestrated deployments, we now host many more sites on equivalent hardware, alongside additional services including an Elasticsearch cluster, a Consul key-value store, Traefik as ingress controller and Let’s Encrypt certificate provider, and New Relic infrastructure monitoring across the entire cluster.
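To give a feel for how Traefik fronts each site, here is a hedged sketch of a Kubernetes Ingress routed by Traefik; the hostname and service name are hypothetical, not a real NRO’s configuration:

```yaml
# Hypothetical Ingress handled by the in-cluster Traefik ingress controller.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: example-nro
  annotations:
    kubernetes.io/ingress.class: traefik
spec:
  rules:
    - host: www.example-nro.greenpeace.example  # illustrative hostname
      http:
        paths:
          - path: /
            backend:
              serviceName: example-nro-openresty  # illustrative service
              servicePort: 80
```

With Traefik’s ACME integration enabled, certificates for such hosts can be issued and renewed by Let’s Encrypt automatically.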
Sure, the CPU is still sitting idle most of the time, but that’s primarily a testament to the effectiveness of full page caching provided by Redis-backed OpenResty. More to the point, this spare CPU power is now shared amongst dozens of deployments, such that if any one site experiences an outlier load spike, it can ‘borrow’ the spare cycles of all the peer deployments.
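One common way to wire up Redis-backed full page caching in OpenResty is the srcache-nginx-module together with the redis modules, as documented upstream. This is a sketch of that documented pattern, not necessarily our exact configuration; the upstream address and expiry time are illustrative:

```nginx
# Fetch path: look the page up in Redis before hitting PHP.
location = /redis {
    internal;
    set_md5 $redis_key $args;
    redis_pass 127.0.0.1:6379;   # illustrative Redis address
}

# Store path: write the rendered page back to Redis with an expiry.
location = /redis2 {
    internal;
    set_unescape_uri $exptime $arg_exptime;
    set_unescape_uri $key $arg_key;
    redis2_query set $key $echo_request_body;
    redis2_query expire $key $exptime;
    redis2_pass 127.0.0.1:6379;
}

location / {
    set $key "$uri?$args";
    set_escape_uri $escaped_key $key;
    srcache_fetch GET /redis $key;                         # cache hit short-circuits PHP
    srcache_store PUT /redis2 key=$escaped_key&exptime=300; # cache miss stores the response
    fastcgi_pass php-fpm:9000;   # illustrative PHP-FPM upstream
    include fastcgi_params;
}
```

On a cache hit the request never reaches PHP-FPM at all, which is why the application CPUs sit mostly idle.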
This increased density of applications on comparable hardware translates directly into savings — substantial savings. The team is now able to supply each Greenpeace National / Regional Office (NRO) with professionally supported hosting at less than 1/5 the cost we were estimating for the full VM solution.
Rebuilding VMs with Packer and deploying ‘classic’ instances took up to 20 minutes per P4 site.
Now, by way of base container images (greenpeace/planet4-docker), custom CI images, intermediary application images, and client-specific images, the time from commit to deploy is usually less than 8 minutes, including our as-yet-rudimentary integration tests.
CircleCI builds the containers hierarchically, starting from an Ubuntu base image derived from Phusion’s very useful baseimage-docker. Yes, that Phusion: the company behind Passenger.
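The layering looks roughly like this; the registry paths and tags below are hypothetical stand-ins, sketched to show the idea rather than our actual Dockerfiles:

```dockerfile
# Layer 1 (base image, built elsewhere from Phusion's baseimage-docker):
#   FROM phusion/baseimage:0.11

# Layer 2 (intermediary application image): adds the shared runtime.
FROM gcr.io/example-project/p4-base:latest  # hypothetical base image
RUN apt-get update \
    && apt-get install -y --no-install-recommends php-fpm \
    && rm -rf /var/lib/apt/lists/*

# Layer 3 (client-specific image, a separate Dockerfile): each NRO's
# themes and plugins are copied on top of the shared application image:
#   FROM gcr.io/example-project/p4-app:latest
#   COPY themes/ /app/wp-content/themes/
```

Because the lower layers change rarely, most commits only rebuild the thin client-specific layer, which is a large part of why commit-to-deploy stays under 8 minutes.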
Our CI pipeline runs a series of integration tests with BATS (the Bash Automated Testing System), ensuring a high degree of confidence in our containers before they even reach our development servers.
From there, our Helm charts are deployed to the development environment of every NRO (each an individual, distinct WordPress site), giving the team near real-time insight into changes in dependent repositories. There’s still much more work to be done, but even this degree of continuous integration is proving helpful in diagnosis and debugging.
Next, code is promoted to release environments and finally to production, according to a modified git-flow model. This degree of isolation and control is key to ensuring the confidence required to automate deployment and maintenance of what will soon be 50+ individual WordPress installs.
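The promotion gates can be sketched as a CircleCI 2.0 workflow; the job names, branch filters and approval gate below are illustrative, not our actual pipeline configuration:

```yaml
# Hypothetical CircleCI workflow sketch of the build → develop → production flow.
workflows:
  version: 2
  build-and-promote:
    jobs:
      - build-and-test            # build containers, run the BATS suite
      - deploy-develop:           # continuous deploy to NRO dev environments
          requires: [build-and-test]
          filters:
            branches: { only: develop }
      - hold-for-approval:        # manual gate before release/production
          type: approval
          requires: [build-and-test]
          filters:
            branches: { only: master }
      - deploy-production:
          requires: [hold-for-approval]
```

The `type: approval` job is what makes production a deliberate, human-triggered step while dev deploys remain fully automatic.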
Performance and Scalability
The new container stack on OpenResty/PHP-FPM responds on average ~100ms faster for full stack requests than the old VM Apache/mod_php solution, which we primarily attribute to lower virtualisation overhead.
Of course, every application container is also part of a Kubernetes Deployment managed by a Horizontal Pod Autoscaler, with minimum and maximum replica counts allowing it to scale on CPU load; containers are added to or removed from the cluster in mere seconds, where it once took minutes to bring our virtual machines up or down.
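A Horizontal Pod Autoscaler definition is short; this sketch uses hypothetical names and thresholds rather than our production values:

```yaml
# Illustrative HPA: replica bounds and CPU target are hypothetical.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: example-nro-php
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-nro-php   # the Deployment being scaled
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 60  # add pods when average CPU exceeds this
```

When average CPU across the pods crosses the target, Kubernetes schedules more replicas within seconds; when load subsides, it scales back down to the minimum.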
Finally, every deployment has at minimum 8 CPU cores to schedule requests at both the reverse proxy and PHP layers, scaling out across nodes as required, yet still constrained by Kubernetes resource limits in the event of runaway processes.
All of this provides unprecedented elasticity in the face of load spikes, and a baseline level of performance that couldn’t be matched with similar hardware in the ‘classic’ approach.
We need help!
While the process has been a great success (very nice), there’s always room for improvement. We believe the interlocking nest of repositories could use an impartial eye to assess procedures and audit the code, and the thoroughness of our test suite needs considerable attention.
Any software team, regardless of experience, would benefit from a team of experts auditing the current setup and recommending improvements. While an audit may not advance the near-term goals of the project, it would consolidate the platform and the practices we have in place. We’re aware this might not be considered a romantic job for expert engineers, but it’s a critical precondition to ramping up deployment rates to the level we anticipate in the coming year.
Professional help implementing solid behaviour-driven development (BDD) suites and refining our Selenium tests for core products would be an exceptional contribution. If you read this far and understood what we’re talking about, perhaps you want to join us? Again, it doesn’t necessarily advance the project’s functional objectives (it’s consolidation), but it’s hard to overstate the confidence a robust test suite would instill in our team.
Key-Value and Secrets Management
Currently, most of our secrets and per-NRO configuration is stored as CI environment variables, which requires considerable ongoing maintenance. Given the inherent benefits of dynamic secrets and secure storage, we see great promise in investing engineering time in a dedicated secrets-management solution.
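As one possible first step, per-NRO credentials could move out of CI environment variables and into Kubernetes Secrets consumed by each Deployment. A hedged sketch, with hypothetical names and a placeholder value:

```yaml
# Illustrative Secret: names and the placeholder value are hypothetical.
apiVersion: v1
kind: Secret
metadata:
  name: example-nro-db
type: Opaque
stringData:
  WP_DB_PASSWORD: change-me   # placeholder, never a real credential
---
# In the corresponding Deployment's container spec, the value is injected as:
#   env:
#     - name: WP_DB_PASSWORD
#       valueFrom:
#         secretKeyRef:
#           name: example-nro-db
#           key: WP_DB_PASSWORD
```

This keeps secrets versioned alongside the cluster rather than scattered across CI project settings, and leaves the door open to dynamic secrets backends later.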