Three Years In — Microservices, Containers and Kubernetes at Hootsuite
At Hootsuite, over 120 microservices support the core dashboard product, of which the majority run on Amazon’s Elastic Kubernetes Service (EKS). This is an incredible change from three years ago, when the majority of the code at Hootsuite was running in a monolith codebase alongside a few microservices, all of which were hosted on EC2. It’s been a long and challenging journey; being on the bleeding edge often meant going it alone. This article documents this journey, outlining the steps taken along the way.
Whenever possible, this article links to the source material for relevant concepts to make it easier to digest. That said, it is intended for readers who are already familiar with Kubernetes and cloud computing.
Where We Were
Over three years ago, Hootsuite saw the industry begin to move away from running software on virtual machines. The replacement was containers. Containers are a technology that virtualizes the operating system (OS), as opposed to virtual machines (VMs), which virtualize the hardware along with the OS. Running services in containers made it easier to run many different services on the same server. This paradigm shift coincided with the early stages of Hootsuite’s transition from a monolith architecture, where all API calls went through a single service, to a service-oriented architecture composed of many services with clearly defined domains and boundaries.
At this point, only 12 microservices had been built — all of which were running on EC2 instances. When a development team wanted to create a new service, one of the major blockers was getting the necessary infrastructure and monitoring in place. The amount of manual labour involved in such a request placed a lot of strain on the small Operations team and, in some extreme cases, could take months to complete. The lengthy wait time for these requests meant that instead of constructing microservices with clearly drawn boundaries, developers would construct large, over-engineered services. These services would quickly become difficult to maintain.
An architectural transition to containers was identified as a promising solution to this issue. Containerization would reduce the operational complexity associated with spinning up many new services. It would make it easier and quicker for developers to ship new versions of services — rolling out a new container version is fast and there are minimal environmental inconsistencies to address. Recognizing the improvements that containerization could bring to the development lifecycle, the team at Hootsuite decided to transition to containers. The Mesos orchestration framework was selected to manage the containers.
Hootsuite’s adoption of Mesos would prove to be short-lived. By 2017, it was clear that Kubernetes had established itself as the industry standard orchestration framework. The open-source community support for Kubernetes was exponentially higher than that of Mesos. Not wanting to invest resources in a technology without strong community support, Hootsuite opted to abandon the migration to Mesos and instead focus on a complete migration of all services to Kubernetes. While no one thought the switch would be easy, the migration would prove more challenging than anyone expected. At the time, Kubernetes was still bleeding edge. Tools like kops were in their infancy — they lacked critical functionality and only worked in a narrow set of use cases. The tools were inflexible and made assumptions that were often incompatible with Hootsuite’s architecture. Because of the limitations of the tools at the time, the team decided to set up Kubernetes from scratch.
Setting up Self-Managed Kubernetes and Migrating off Mesos/EC2
Kubernetes is complex. While the documentation is incredibly thorough and while there is a large community to help troubleshoot, Kubernetes has many intricacies that are easily overlooked. As such, there is no better way to get to know Kubernetes from an operations standpoint than by setting up a completely self-managed cluster. In fact, all new hires to the Compute Platform team at Hootsuite (the team responsible for managing Kubernetes) set up a self-managed cluster as their first task, working through Kelsey Hightower’s Kubernetes The Hard Way (adapted for AWS).
When Hootsuite began setting up self-managed Kubernetes clusters to run production workloads, the Compute Platform team did the following:
- Created three separate Amazon Machine Images (AMIs) — one for the etcd nodes, one for the controller nodes and one for the worker nodes — provisioned using Ansible and Packer.
- Wrote Terraform that used these AMIs to bring up the Kubernetes cluster on AWS.
- Set up an autoscaling group with three etcd nodes.
- Set up an autoscaling group with three controller nodes.
- Brought up worker nodes in an autoscaling group, to be scaled as needed to accommodate the workloads running in each environment.
The diagram below is a visual representation of a Hootsuite self-managed cluster on AWS.
Once the self-managed Kubernetes clusters were operational, containerized services were migrated to Kubernetes from Mesos, and the Mesos infrastructure for running services was shut down. The few services left on EC2 were containerized and moved to Kubernetes. A pre-existing in-house service mesh, built with Nginx and Consul, was helpful during the migration. The service mesh was configured to route traffic into Kubernetes if a service was not explicitly registered with Consul. This made it possible to deploy the service to Kubernetes, shut the service down on EC2 and automatically have the traffic go to the right place.
Of course, there were a few stumbles along the way. Services had to be configured with the appropriate number of replicas in Kubernetes so there were enough pods to handle incoming requests. Resource limits for CPU and memory had to be tuned correctly so containers would have adequate resources to run. Specific JVM options for memory and CPU usage needed to be set for containerized Scala services, as older JVMs cannot detect the memory and CPU limits that Docker imposes. Fabric8io has a great script that takes care of setting these options. Finally, developers had to learn the basic concepts of Kubernetes and how to use kubectl in order to monitor their services.
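To make the JVM sizing issue concrete, here is a minimal sketch of deriving a heap cap from a container memory limit, in the spirit of what the fabric8 script does. The function name and the 50% ratio are illustrative assumptions, not the script's actual logic:

```python
# Illustrative only: because older JVMs cannot see cgroup limits, the
# heap must be capped explicitly at a fraction of the container's
# memory limit. The 0.5 ratio is an assumed default.
def jvm_heap_opts(container_limit_bytes, ratio=0.5):
    """Return a -Xmx flag sized to a fraction of the container limit."""
    heap_mb = int(container_limit_bytes * ratio) // (1024 * 1024)
    return f"-Xmx{heap_mb}m"

# A pod with a 1 GiB memory limit gets a 512 MiB heap cap:
print(jvm_heap_opts(1024 ** 3))  # -Xmx512m
```

Without such a cap, a JVM that sizes its heap from the host's total memory will eventually exceed the container limit and be OOM-killed by the kernel.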
Despite these challenges, the move from microservices on EC2 to Kubernetes was a huge success. To facilitate rapid service development, a service skeleton was created to generate all basic scaffolding (build and deployment code, monitoring, health checking) for new stateless services. Services generated from this skeleton would build and deploy to Kubernetes out of the box. This skeleton became the standard for all services running on Kubernetes (it was decided stateful services would not be supported on Kubernetes yet). The new service skeleton and Kubernetes had a profound effect on development at Hootsuite — new services could be generated and deployed to a cluster in five minutes.
Risk of Maintenance Operations on Kubernetes
Hootsuite is the industry leader in the social media management space, with thousands of businesses across the world relying on it for crucial business operations. With the majority of Hootsuite’s microservices running on Kubernetes, the clusters need to be up 100% of the time. Even small issues in the clusters can result in outages of the Hootsuite dashboard.
Fortunately, Kubernetes itself is robust and fault-tolerant. During deployments, Kubernetes uses a rolling update strategy so that updates can take place with zero downtime. If an update has an issue, Kubernetes will automatically pause the rollout. Even if the entire control plane goes down, the workloads on the worker nodes will keep running. However, as durable as Kubernetes is, there are still high-risk cluster administration tasks that need to be executed. As one example, in early 2018 the team performed an in-place upgrade of the Kubernetes version of the clusters from version 1.7 to version 1.10. Luckily, this upgrade was successful and there was no downtime. But what if something had gone wrong?
In the worst case scenario, the upgrade would have failed and completely bricked the cluster. This would have caused a full outage of the Hootsuite dashboard. Because of the existence of cluster backups, it would have been possible to restore the cluster in a disaster recovery scenario, but this would have involved significant downtime. Sticking with the current disaster recovery strategy going forward would mean a real risk of downtime, which could potentially jeopardize Hootsuite’s 99.9% uptime Enterprise SLA.
Heading into the second half of 2018, a lot of thought was put into the future of Kubernetes at Hootsuite. While the move to running microservices on Kubernetes at Hootsuite had been a game-changer, maintaining the clusters had become a full-time job for the four developers on the Compute Platform team. Coincidentally, Amazon had released a managed Kubernetes offering called Elastic Kubernetes Service (EKS). The promise of reduced maintenance overhead on EKS was attractive. Excited by this prospect, the team immediately began researching the feasibility of a migration to EKS.
The initial research uncovered that moving to EKS would require a lot of fundamental changes to the current cluster setup. First, EKS required Role-Based Access Control (RBAC). The existing Hootsuite clusters, on the other hand, had been using Attribute-Based Access Control (ABAC). There were significant differences between the two authorization paradigms, which would complicate any potential migration. As an additional complication, AWS Authenticator would be required to authenticate to the clusters, replacing the existing approach of authentication with static user tokens. Finally, while Flannel had worked well as the CNI plugin on the self-managed clusters, there was an opportunity to move to the simpler AWS VPC CNI plugin on EKS.
These modifications would all be significant and high-risk. Rolling out these changes with no downtime would be next to impossible. Before the migration to EKS could proceed, the team needed to find a way to manage the risk of performing these changes.
Adding Multi-Cluster Support
The current disaster recovery approach of rebuilding a cluster in place would require significant downtime in the event of an outage (possibly up to 8 hours) — this was not acceptable. The team needed a way to validate the changes needed for EKS without impacting production traffic. Ideally, a multi-cluster solution was needed so that these changes could be introduced to a completely new self-managed cluster, without affecting the existing production cluster. Risky changes could be trialed on this new cluster by redirecting traffic to it from the old cluster. If there were problems with the new cluster, traffic could be dialled back to the old cluster.
The Compute Platform team came up with two strategies for running multiple clusters. The first strategy would involve running one active cluster and multiple passive clusters, with passive clusters being promoted to active as needed. In this scenario, only one cluster would ever be actively serving traffic. The second strategy would instead run multiple active clusters simultaneously. The latter strategy was attractive because in the event one of the active clusters was broken, the broken cluster could simply be terminated and the other active clusters would automatically handle all traffic. However, one question remained unanswered — how would cluster configuration and workloads be synchronized across multiple active clusters? It became apparent that there was no simple answer to this question. The complexity of any solution to this problem would outweigh the benefits of the multiple active clusters strategy. For Hootsuite, it seemed as though an active-passive cluster configuration was the way forward.
The first step to a multi-cluster setup was bringing up the infrastructure to run multiple clusters; however, the existing Terraform for bringing up Kubernetes clusters was written with the assumption that there would only ever be one cluster in each environment. The Terraform configuration would need to be reworked so that it was possible to bring up multiple passive clusters alongside each active cluster. Another challenge was that passive clusters needed to be isolated from the active cluster so that they would not receive traffic without manual intervention. This would require creating new tooling to dial the network traffic off to one cluster while dialling the traffic up to another.
The Terraform was refactored to support bringing up multiple clusters in each environment. This was a complex undertaking that required a lot of support from Hootsuite’s cloud engineering expert. Once the Terraform refactor was complete, it was time to create the tooling to send network traffic to the right cluster. The tooling to redirect traffic from one cluster to another leveraged the in-house service mesh built with Consul Template and Nginx. To redirect traffic to a specific cluster, Consul Template was used to weight the nodes that ingress traffic was sent to. A developer could then update these weights to slowly dial traffic from one cluster to another.
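The effect of this weighted dialling can be sketched as a simulation. The cluster names and weights below are illustrative, not the actual Consul Template configuration:

```python
import random

# Sketch: each cluster's ingress carries a weight, and dialling traffic
# over means gradually shifting weight from one cluster to the other.
def pick_cluster(weights):
    """Route one request to a cluster, proportionally to its weight."""
    clusters = list(weights)
    return random.choices(clusters, weights=[weights[c] for c in clusters])[0]

# Mid-cutover: 80% of ingress traffic stays on the old cluster while
# 20% is dialled onto the new one.
weights = {"old-active": 80, "new-passive": 20}
counts = {name: 0 for name in weights}
for _ in range(10_000):
    counts[pick_cluster(weights)] += 1
print(counts)  # roughly an 80/20 split
```

A developer would move traffic by nudging these weights in small steps (80/20, 50/50, 0/100), watching service health between each step.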
At this point, the infrastructure needed to run active-passive clusters and the ability to manage traffic between them was in place — multiple clusters could now exist side by side. The obvious catch was that the passive clusters weren’t running anything yet. The only remaining issue the team faced was how to deploy all of the workloads running on the active cluster to the passive cluster.
Deployment Strategies on Active/Passive Clusters
The deployment strategy already in use pushed new deployments to Kubernetes using a kubeconfig defined on Jenkins. One approach the team considered for getting all workloads onto the passive cluster was to change how this deployment strategy pushed changes to Kubernetes.
The proposed replacement deployment strategy involved maintaining a cluster registry of active and passive clusters that were running in each environment. The current deployment strategy would need to be modified to use this new registry to roll out deployments to both the active cluster and the passive clusters. This sounds simple — but what happens if a deployment fails on a subset of those clusters? How are retries of these failures handled? Should the deployment strategy be changed from “push” semantics to “pull” semantics, where the deployments across active/passive clusters are eventually consistent? These were hard questions with no easy answers. Using a cluster registry wasn’t the right fit; the added complexity of the solution was likely to cause more problems than it would solve.
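The partial-failure problem with push semantics is easy to see in a sketch. This is hypothetical, not a proposed implementation: after pushing to every cluster in the registry, any failure leaves the clusters divergent until a retry succeeds.

```python
# Hypothetical sketch: push one deployment to every cluster in a
# registry and collect failures. Names and helpers are illustrative.
def deploy_to_all(clusters, deploy):
    """Push a deployment to each cluster, recording any that fail."""
    failed = []
    for cluster in clusters:
        try:
            deploy(cluster)
        except Exception:
            failed.append(cluster)
    return failed  # non-empty: clusters have diverged until retried

calls = []
def deploy(cluster):
    calls.append(cluster)
    if cluster == "passive-1":                  # simulate one cluster failing
        raise RuntimeError("apiserver unreachable")

failed = deploy_to_all(["active", "passive-1"], deploy)
print(failed)  # ['passive-1'] -> someone must now track and retry this
```

The returned `failed` list is exactly the state the team would have had to persist, reconcile, and retry across every deployment pipeline, which is the complexity that made the registry approach unattractive.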
After much thought, the team decided to leave the current deployment strategy as is. Instead, workloads would be deployed from the active cluster to the passive cluster using a newly implemented backup/restore process. The obvious disadvantage to this approach was that a backup/restore process would be a manual operation. However, the team accepted this trade-off, as high-risk cluster administration changes would not be needed as often once the migration to EKS was complete. Apart from regular tests to ensure that the approach still worked, the backup/restore process would not carry the overhead of maintaining a cluster registry.
To implement the new backup/restore process, the team used Heptio Velero to take a backup of all non-bootstrapped workloads running in the active cluster. As part of the process, all deployments to the active cluster would be blocked before triggering a backup. This ensured the deployments would be consistent across both clusters. Next, in the passive cluster, all critical cluster dependencies were bootstrapped, including Velero. Velero was configured to have access to the backup of the active cluster that was just taken. This backup was then used to trigger a restore in the passive cluster. Once all restored workloads were running, network traffic was dialled from the active cluster into the passive cluster. When the traffic cutover was complete, the old cluster could be torn down. If the new cluster was not working as expected, the old cluster could be used as a fallback.
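The cutover follows a strict ordering, which can be encoded in a small sketch. The step names are illustrative, and the velero commands are sketches of the real CLI rather than Hootsuite's exact invocations:

```python
# Illustrative encoding of the cutover ordering described above.
CUTOVER_STEPS = [
    ("block-deploys", "freeze CI/CD pushes to the active cluster"),
    ("backup", "velero backup create pre-cutover"),
    ("bootstrap-passive", "install cluster dependencies, including Velero"),
    ("restore", "velero restore create --from-backup pre-cutover"),
    ("verify", "wait for all restored workloads to report Ready"),
    ("dial-traffic", "shift ingress weights from active to passive"),
    ("teardown", "remove the old cluster once traffic is stable"),
]

def next_step(completed):
    """Return the next step name, or None when the cutover is finished."""
    if len(completed) < len(CUTOVER_STEPS):
        return CUTOVER_STEPS[len(completed)][0]
    return None

print(next_step([]))  # block-deploys
```

Blocking deployments before the backup is the critical first step: it is what guarantees the backup captures a consistent snapshot that will match the active cluster at restore time.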
After performing multiple successful cluster cutovers, the team had high confidence in the ability to restore from the most recent backup in an outage scenario. In the event of a disaster resulting in complete cluster loss, it would be possible to recover within a couple of hours. The downside of a cluster cutover is that it is a manual operation. It would be possible to automate this process, but as GitOps tools such as Argo CD and Weaveworks Flux gain traction, they are likely to be better options for mirroring two clusters. These tools use source control as the source of truth for how clusters should be configured and for what workloads should be running on them. Not only do these tools ensure cluster configuration and deployments are reproducible, they also make it easier to audit and revert changes.
Dialling Over Non Service Mesh Traffic
The approach of dialling service mesh traffic off one cluster and onto another was effective at managing external traffic. However, services that consumed events from message queues were still generating internal traffic in both clusters. Draining the cluster as part of the teardown process could break event processing if some services’ dependencies became unavailable. Most Hootsuite services were already tolerant of temporary network failures by means of retries; however, they needed to be updated to gracefully handle the case where the network failure was permanent.
Handling permanent network failure was possible, but two criteria would need to be met. First, the services would have to commit back to the message queue that an event had been processed only after it had successfully been processed (essentially, “at-least-once” event processing semantics). Second, message processing would need to be idempotent. If only half of the event processing had been done and the rest failed, the failed event processing would be retried later.
An audit revealed that most services already met these two criteria. Only a few were found that were using “at-most-once” semantics — these services were updating their high water mark as soon as they read the event from the queue, rather than when it was done being processed. This meant that during cluster draining operations, some events would only be partially processed and then dropped — the processing would never be retried. Dropping events would ultimately impact Hootsuite’s customers — important data may be incomplete, or missing entirely.
The Compute Platform team worked with the development teams that owned these services to swap them to an “at-least-once” semantics approach. These services were updated to only commit their high watermark after events had been processed, not before. The team also verified that all event processing would be idempotent. This ensured that cluster draining operations would not result in dropped events.
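The difference between the two commit strategies can be modelled in a few lines. This is an illustrative model, not any of Hootsuite's actual consumers: the high-water mark is committed only after processing succeeds, and a duplicate guard makes replays idempotent.

```python
# Illustrative model of "at-least-once", idempotent event processing.
class Consumer:
    def __init__(self):
        self.high_water_mark = 0   # committed offset
        self.seen = set()          # idempotency guard for replays

    def handle_next(self, queue, process):
        """Process one event; the offset advances only after success."""
        if self.high_water_mark >= len(queue):
            return False
        event = queue[self.high_water_mark]
        if event["id"] not in self.seen:   # replaying a processed event is a no-op
            process(event)                 # may raise: offset is NOT advanced
            self.seen.add(event["id"])
        self.high_water_mark += 1          # commit AFTER successful processing
        return True

queue = [{"id": 1}, {"id": 2}]
out, consumer, failed_once = [], Consumer(), False

def flaky(event):
    global failed_once
    if event["id"] == 2 and not failed_once:   # simulate a pod killed mid-drain
        failed_once = True
        raise ConnectionError("pod terminated")
    out.append(event["id"])

consumer.handle_next(queue, flaky)        # event 1 processed, offset -> 1
try:
    consumer.handle_next(queue, flaky)    # fails: offset stays at 1
except ConnectionError:
    pass
consumer.handle_next(queue, flaky)        # retried after restart: nothing dropped
print(out)  # [1, 2]
```

An "at-most-once" consumer would advance `high_water_mark` before calling `process`, so the failed event 2 would never be retried — exactly the dropped-event behaviour the audit found.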
Migrating to EKS
The cutover strategy was validated multiple times in all environments, proving that the process was stable. Next up was the step from self-managed Kubernetes to EKS. As a first step, the team modified the existing Terraform to support bringing up both self-managed clusters and clusters hosted on EKS. There were a few fundamental differences between EKS and the self-managed setup — the EKS clusters used the AWS VPC CNI plugin, required AWS Authenticator for authentication and had RBAC turned on. The team validated these changes by bringing up passive clusters and testing them with restored workloads from the active clusters. The diagram below is a visual representation of what a cluster on EKS looks like.
There were a few issues on EKS that were not present on the self-managed clusters. The first issue was that EKS assigns the IAM role that creates the EKS cluster as the default cluster admin — there is no way to override this behaviour when creating an EKS cluster, and no way to remove this IAM role as a cluster admin after the cluster has been created. Because the EKS clusters at Hootsuite were brought up using Terraform and Atlantis, the IAM role Atlantis used to apply the Terraform became a cluster admin in perpetuity. The aws-auth configmap is used to grant additional IAM roles and IAM users access to EKS clusters. To apply changes to the aws-auth configmap, it is necessary to assume the same IAM role that was used to spin up the cluster. Unfortunately, this means that anyone with the ability to assume that IAM role is effectively a cluster admin for all clusters. Until AWS supports specifying the IAM role/user to set as the default cluster admin when creating an EKS cluster, this is a security risk that will persist throughout the lifetime of the cluster.
The next issue was determining how accessible the EKS API server endpoint should be. There were two accessibility options for the endpoint — public, meaning the endpoint can be resolved from the Internet, or private, meaning the endpoint can only be resolved from inside the cluster’s VPC. Public access would, of course, be secured by restricting access using AWS Authenticator. However, even with the addition of this safeguard, the cluster would still be vulnerable to DoS attacks and other Kubernetes authentication vulnerabilities.
The team decided to make the API server endpoint private. Unfortunately, only being able to resolve the endpoint from inside the cluster VPC meant that services in other VPCs would require additional infrastructure before they could access the API server endpoint. Initially, it was thought that the endpoint could be made accessible with simple DNS changes, similar to how this was accomplished with the self-managed cluster setup described above. On the self-managed cluster setup, a CNAME DNS record was created to point to the API server endpoint. When a cluster cutover was performed, this record would be updated to point at the new active cluster, redirecting external traffic to its API endpoint.
This method could not be used on EKS. Creating a CNAME DNS record to the API server endpoint on EKS was not possible because the zone the endpoint DNS record is created in is AWS-managed and not exposed. This means it is not possible to create an alias record for the API server endpoint. AWS did, fortunately, provide an alternative path forward — the endpoint could be made accessible through a complex set of Route53 resolver endpoints. These endpoints can be set up with rules to forward requests to the API server endpoint across VPCs. This process is described in more detail in an AWS blog post.
The resolver endpoints allow the private API server endpoint to be resolved from outside of EKS, but they also present a challenge when a cluster cutover is performed. Developers and CI/CD both have their own kubeconfig — how do these get updated after a cutover? The team wrote a custom Golang CLI tool to abstract this away. When a new cluster is added, a change is pushed to the tooling that includes the endpoint for the new cluster. The CLI tool can be run with a --refresh flag, which will reconfigure the kubeconfig on the machine it is run on. CI/CD runs this tool as part of its initial setup at the beginning of each job. When a cluster cutover is performed, it is announced to developers so that they can run the tool to refresh their access to Kubernetes.
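A minimal sketch of what such a refresh could do is shown below. The registry contents, function name and kubeconfig shape are assumptions for illustration, not the actual Hootsuite CLI tool:

```python
# Hypothetical registry of cluster API endpoints, checked into the
# tooling and updated whenever a new cluster is added.
CLUSTER_REGISTRY = {
    "production": "https://k8s-resolver.prod.example.internal",
    "staging": "https://k8s-resolver.staging.example.internal",
}

def refresh_kubeconfig(kubeconfig):
    """Rebuild cluster and context entries from the current registry."""
    kubeconfig["clusters"] = [
        {"name": name, "cluster": {"server": server}}
        for name, server in CLUSTER_REGISTRY.items()
    ]
    kubeconfig["contexts"] = [
        {"name": name, "context": {"cluster": name}}
        for name in CLUSTER_REGISTRY
    ]
    return kubeconfig

cfg = refresh_kubeconfig({"clusters": [], "contexts": []})
print([c["name"] for c in cfg["clusters"]])  # ['production', 'staging']
```

Because the registry is the single source of truth, a cutover only requires pushing the new endpoint to the tooling; every developer and CI/CD job picks it up on the next refresh.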
With everything more or less working as expected, it was time to cut over to EKS for good. The first cutover performed was in the development environment. It went smoothly — the only hiccup being Pod Priorities. Pod Priorities were enabled on the self-managed clusters to prioritize the scheduling of critical deployments. In Kubernetes 1.10, Pod Priorities were still an alpha feature. They did not work on EKS because EKS does not have alpha features enabled. On Kubernetes versions before 1.10, the self-managed clusters had been using the rescheduler, a deployment that uses annotations to ensure critical pods are running. With Pod Priorities not enabled on EKS, it was necessary to revert to using the rescheduler.
Having resolved all of the EKS-specific issues in the development cluster, staging and production cutovers soon followed — both of which were flawless. The Hootsuite dashboard was 100% functional during the production cutover and no microservices experienced downtime. Looking back on the process, it was gratifying to know that the team’s work to ensure a zero-downtime cutover with no customer impact was ultimately successful.
Kubernetes Version Upgrades on EKS
With the clusters running on EKS, it was now possible to use the built-in support for Kubernetes version upgrades to upgrade from version 1.10 to version 1.12. The last version upgrade from 1.7 to 1.10 required upgrading the etcd and controller nodes and ensuring that their configuration worked with the other etcd and controller nodes as well as with the worker nodes. A misconfiguration could have led to a disaster scenario, with major downtime impacting the SLAs for Hootsuite’s customers.
With EKS, the stress of such an upgrade is gone — a click of a button takes care of upgrading the etcd and controller nodes. The upgrade procedure for other key components, while not as instant, is well-documented by Amazon. With most components having either an automated upgrade or a well-defined upgrade procedure, the only piece left to the cluster administrator is upgrading the worker nodes with the correct versions of the kubelet and Docker. Normally, this would be a simple matter of using the appropriate version of the Amazon-provided worker AMI. Internal requirements at Hootsuite, however, prevent the use of this AMI. Fortunately, Amazon has made the config used in its worker AMI open source. This config can be used as a reference for making sure custom worker AMIs have the appropriate settings.
As an example of the increased efficiency of EKS-managed version upgrades, the upgrade from 1.7 to 1.10 in 2018 on self-managed Kubernetes at Hootsuite had taken two developers several months. The upgrade from 1.10 to 1.12 on EKS took one developer less than two weeks.
Today, most new code is created in services and only rarely is new code added to the monolith codebase. There are over 120 microservices running that serve up the features of the Hootsuite dashboard — 90% of these are run on EKS. With the decreased overhead of maintaining the Kubernetes clusters at Hootsuite, the team has had time to build out a roadmap to leverage the full feature set of Kubernetes. Some of the roadmap items planned for the near future are:
- Using Network Policies to lock down service-to-service communication inside of the clusters.
- Using Pod Disruption Budgets to enable the introduction of cluster autoscaling, including supporting scheduling with spot instances.
These changes will affect most services, so investigation into decoupling CI/CD from services is planned. This will likely involve the introduction of a GitOps delivery tool such as Argo CD or Weaveworks Flux. Adoption of one of these tools will enable the DevOps team to iterate even faster on CI/CD features with minimal impact to the product development teams.
Kubernetes is complex, but it’s also powerful. Using it has transformed the software development process at Hootsuite. Well over 100 new services have been generated in the past three years — it’s exciting to think of what challenges will be faced next.