At the beginning of 2018 I joined the Platform team at COUP, an eScooter sharing company providing scooters in the European cities of Berlin and Paris.
At the time the company was expanding to more European cities, which meant onboarding more scooters and users. That, together with the need for more features, increased pressure on both the infrastructure and the software delivery process. As is the case for most medium-sized companies, COUP has a variety of applications focused on specific needs: the main API, internal tools, etc. Those applications weren't micro-services per se, although they tried to be; "small monolithic applications" would perhaps have been a better term.
The team had done a good job of writing infrastructure as code with Terraform. Everyone had access and was able to change and provision new resources on AWS, and the knowledge spread quickly among new engineers because our development practices encourage pair programming and pull request reviews. Nonetheless, our infrastructure was disorganised and starting to show signs that it wouldn't scale.
It looked something like this:
We provisioned everything with Terraform and baked our own AMIs with Packer in order to run Docker containers as systemd services. We had an immutable infrastructure and wanted to keep following that paradigm. Nevertheless, the limitations of this architecture became clear when I joined:
- Each EC2 instance could run only one container per application.
- Environment variables were baked into systemd service files and passed to Docker at runtime.
- The number of container replicas was the same across all applications.
- Scaling/downscaling App A meant scaling App B.
- Some EC2 instances were underused, with free CPU and memory that could fit another container.
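To make the second point concrete, a per-application systemd unit along these lines was baked into each AMI (this is a reconstruction; the unit name, image and variables are hypothetical):

```ini
# /etc/systemd/system/main-api.service (hypothetical sketch)
[Unit]
Description=Main API container
After=docker.service
Requires=docker.service

[Service]
Restart=always
# Environment variables lived here, inside the AMI, so changing one
# meant baking a new image and rolling the auto scaling group.
Environment=RAILS_ENV=production
ExecStartPre=-/usr/bin/docker rm -f main-api
ExecStart=/usr/bin/docker run --name main-api \
  -e RAILS_ENV \
  -p 8080:8080 \
  123456789.dkr.ecr.eu-central-1.amazonaws.com/main-api:latest

[Install]
WantedBy=multi-user.target
```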
Because of how the infrastructure was organised, our custom deployment script had to take each instance out of its load balancer before recreating the container with the new image. We inherited a system whose deployments caused downtime, which blocked us from deploying several times per day. The deployment was also manual, serial and slow, so we didn't deploy as often as we wanted. Changing or adding an environment variable required baking a new AMI and re-provisioning the entire auto scaling group, which could take several hours.
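The rolling script looked roughly like the following sketch (a reconstruction, not our actual code; the load balancer name, service name and SSH step are illustrative):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the serial deploy: for each instance,
# pull it out of the classic ELB, replace the container, put it back.
set -euo pipefail

APP=main-api
ELB=main-api-elb
IMAGE="$1"

for instance in $(aws elb describe-load-balancers \
    --load-balancer-names "$ELB" \
    --query 'LoadBalancerDescriptions[0].Instances[].InstanceId' \
    --output text); do
  aws elb deregister-instances-from-load-balancer \
      --load-balancer-name "$ELB" --instances "$instance"
  # Replace the container on the instance, then wait for it to be healthy.
  ssh "ec2-user@$instance" "docker pull $IMAGE && sudo systemctl restart $APP"
  aws elb register-instances-with-load-balancer \
      --load-balancer-name "$ELB" --instances "$instance"
done
```

One instance at a time, with the container recreated while the instance is out of rotation: this is exactly why deploys were slow and serial.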
COUP has become a utility in the daily life of hundreds of people across Berlin, Paris and Madrid. It was unacceptable to have a system that went down with each deployment: during every downtime users weren't able to start or end rides, resulting in a lot of complaints from our customers.
With the arrival of new engineers, the overall develop-deploy cycle became a daily pain that everyone felt. Depending on the change and the application, it could take hours to get it to production.
Our deployment pipeline required the developer to wait for CI to finish the tests, manually build the Docker image and push it to ECR, and manually run the deployment script. These manual steps looked like:
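Roughly the following, once CI went green (repository names, tags and the script path are hypothetical):

```shell
# 1. Build and tag the image locally.
docker build -t main-api:abc123 .

# 2. Authenticate against ECR and push by hand
#    (aws ecr get-login was the way to do this at the time).
$(aws ecr get-login --no-include-email --region eu-central-1)
docker tag main-api:abc123 123456789.dkr.ecr.eu-central-1.amazonaws.com/main-api:abc123
docker push 123456789.dkr.ecr.eu-central-1.amazonaws.com/main-api:abc123

# 3. Run the in-house deployment script, instance by instance.
./scripts/deploy.sh main-api abc123
```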
One of my first responsibilities was to improve my co-workers' daily work by tackling issues such as:
- Zero-downtime deployments.
- Decreasing deployment time.
- Improving monitoring and logging (NewRelic, ElasticSearch, DataDog, OpsGenie).
- Better usage of AWS resources.
Even after addressing all these issues it still wasn't enough. The engineering team had the goal of evolving the architecture over the next year into a distributed one (a.k.a. micro-services), and the platform team had to find a way to support it.
My previous experience with micro-services architectures taught me that there are some prerequisites for them. I won't go further into this because a lot of people have already discussed it; to learn more I advise reading Phil Calçado's micro-services prerequisites. For the sake of urgency and priority we focused on the following issues:
- Provisioning of infrastructure resources.
- Monitoring, logging and alerting.
- Rapid deployment.
With those priorities in mind it became clear to me that we needed a better solution. I was responsible for taking some weeks to investigate and evaluate a few ideas:
- Improve custom tooling to allow rapid deployment and better usage of infra resources.
- AWS ECS.
- Kubernetes with AWS EKS.
- Kubernetes with Kops.
After several weeks of discussion and investigation it became obvious that our engineering team shouldn't spend development time writing another orchestration tool. We couldn't use GKE due to internal requirements that led us to AWS services. The ECS vs. Kubernetes discussion didn't last long once we provisioned apps on both platforms and compared features, resource management, community and developer friendliness. In the end Kubernetes gave us more control over what, where and how we deploy our applications.
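The kind of control that tipped the scale is visible in a plain Kubernetes Deployment, which addresses several of the pain points above in one manifest (all names and values below are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: main-api
spec:
  replicas: 4                    # scaled independently of other apps
  selector:
    matchLabels:
      app: main-api
  template:
    metadata:
      labels:
        app: main-api
    spec:
      containers:
        - name: main-api
          image: 123456789.dkr.ecr.eu-central-1.amazonaws.com/main-api:abc123
          envFrom:
            - configMapRef:      # env vars changed without baking a new AMI
                name: main-api-config
          resources:
            requests:            # bin-packing fixes the underused instances
              cpu: 250m
              memory: 256Mi
          readinessProbe:        # enables zero-downtime rolling updates
            httpGet:
              path: /health
              port: 8080
```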
The final discussion was EKS vs. Kops. I led the effort to understand and battle-test both against our needs. EKS is an amazing product, but we had one business requirement: we needed to run our AWS resources in Germany. Keep in mind that at the time EKS was available only in the US and Ireland regions.
With EKS out of the picture we took the time to provision different Kops clusters until we found a secure setup and felt comfortable using it on a daily basis. The Kops GitHub repo provides several docs on how to set up and maintain a production-grade Kubernetes cluster.
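A minimal sketch of the kind of setup we iterated on, assuming an S3 state store and the Frankfurt region (the cluster name, bucket and sizes are illustrative, not our real values):

```shell
export KOPS_STATE_STORE=s3://example-kops-state

# Private topology keeps the nodes off the public internet.
kops create cluster \
  --name k8s.example.com \
  --zones eu-central-1a,eu-central-1b,eu-central-1c \
  --topology private \
  --networking calico \
  --node-count 3 \
  --node-size t2.medium \
  --master-size t2.medium

kops update cluster k8s.example.com --yes
kops validate cluster
```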
We also knew that Kubernetes by itself wasn't enough. To be able to scale our engineering team we had to adopt Continuous Delivery in our workflow, or something close to it, so we could remove all manual tasks from our deployment pipeline.
If you search for how to implement a CI/CD pipeline you will find several approaches; AWS CodePipeline and Spinnaker are good examples of products used in this scenario. I suggest reading Martin Fowler's work on Continuous Delivery.
Instead of integrating a new tool we decided to look at the tools we were already using and see whether we could organise them to provide Continuous Delivery for our products. We heavily used TravisCI and Buildkite, and in the end we created the following pipeline:
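As a rough illustration (not our exact config), a Buildkite pipeline wiring those stages together could look like this, with the ECR repository variable and deployment name hypothetical:

```yaml
# .buildkite/pipeline.yml (hypothetical sketch)
steps:
  - label: ":docker: build & push"
    command: |
      docker build -t $ECR_REPO:$BUILDKITE_COMMIT .
      docker push $ECR_REPO:$BUILDKITE_COMMIT

  - wait

  - label: ":kubernetes: deploy"
    command: |
      kubectl set image deployment/main-api \
        main-api=$ECR_REPO:$BUILDKITE_COMMIT
      kubectl rollout status deployment/main-api
```

The `wait` step gates the deploy on a successful build, and `kubectl rollout status` makes the pipeline fail loudly if the rolling update doesn't converge.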
Although far from perfect, this architecture greatly improved our engineering team's workflow. We know that using Kubernetes is different from maintaining a cluster, and that it requires constant learning and experimentation.
In the next article I'll go deeper into how to set up, monitor and maintain a private Kubernetes cluster with Kops on AWS, along with useful operators you may want to use.