Technical Vision Blog — Part 3
As our engineering team grew, so did our number of microservices: we started with three, and now have closer to 30. We believe in giving our engineers ownership over their services. This means they have a high degree of freedom in how they structure their application, which programming languages they use, which API they expose, and so on.
In this part we will discuss how we plan to create more consistency between services, without limiting the ownership of our engineers.
Challenge 3: Scaling consistency between services
We have grown rapidly in the last few years, and as a result we have taken on some technical debt in order to launch new services as quickly as possible. Some of our stack runs on Heroku (mostly lower-traffic services), the rest on Amazon EC2 Container Service (ECS), either directly or via Amazon Elastic Beanstalk. On the surface this seems fine; however, when auditing the services we found some irregularities:
- Some services did not have autoscaling.
- Other services relied on manual deployment to staging and dev.
- Monitoring and alerting were configured for most of the critical services, but not for all of them.
- The flow for deploying to production and rolling back varied widely between services.
- Some applications use Amazon CloudWatch for logging, others rely on Papertrail, and others have no searchable logs at all.
As a result, some engineers need to adapt their applications to Heroku, others need to write Dockerfiles, some need to learn how to configure ECS, and so on.
We have chosen to enforce consistency by moving all microservices to our own cluster:
- Applications run inside a Docker container.
- Container orchestration is handled by Kubernetes.
- Automated deployment and rollback using Jenkins (and Jenkinsfiles).
- Custom application monitoring in Prometheus.
- Logging to Stackdriver, using the Google Cloud Logging driver.
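To make this concrete, here is a minimal sketch of what a service deployment on such a cluster could look like. All names, images, and ports are hypothetical, and the `prometheus.io/*` annotations are a common convention that assumes the cluster's Prometheus is configured with an annotation-based scrape config:

```yaml
# Hypothetical Deployment manifest for one of our microservices.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service          # hypothetical service name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
      annotations:
        # Conventional annotations so Prometheus discovers and
        # scrapes the custom application metrics endpoint.
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      containers:
        - name: example-service
          image: registry.example.com/example-service:1.0.0
          ports:
            - containerPort: 8080
```

Logging needs no per-service configuration in this setup: applications write to stdout/stderr, and the cluster's logging agent ships everything to Stackdriver.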
Process for running new applications on the cluster
The infrastructure team will become responsible for maintaining and improving the cluster. Their responsibility, however, ends at the container level. All containers run in a shared orchestrator, so they can affect each other: for example, if one container starts dumping gigabytes of data to disk, it could degrade disk performance or even take down the node.
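One way to bound this blast radius is to require resource requests and limits on every container, including local disk. A sketch of such a container spec fragment (the values are illustrative, not recommendations):

```yaml
# Hypothetical per-container resource spec: CPU, memory, and
# local-disk bounds so one container cannot starve its neighbours.
resources:
  requests:
    cpu: "250m"                 # quarter of a core reserved
    memory: "256Mi"
    ephemeral-storage: "1Gi"
  limits:
    cpu: "1"                    # hard cap at one full core
    memory: "512Mi"
    ephemeral-storage: "2Gi"    # pod is evicted if it writes more
```

With an `ephemeral-storage` limit in place, a container that starts dumping gigabytes to disk gets its pod evicted instead of taking the whole node down with it.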
At Google, applications need to be engineered to be compatible with the Borg cluster. The same holds for us: a service like the Multiplexer (responsible for starting up other R / Python / SQL containers) will not be able to run on our Kubernetes cluster out of the box.
This means that when a new service should be deployed on the cluster, the engineers should sync with the infrastructure team and explain its high-level functionality, expected load, expected CPU / RAM / storage requirements, how its downtime would affect other services, and so on. Given part 1 on how to scale communication, this information is ideally part of the service's GitHub wiki.
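The automated deployment and rollback flow mentioned earlier is driven by Jenkinsfiles. A rough sketch of what one could look like, with staging deployed automatically and production gated behind a manual approval (stage names, registry, and namespaces are hypothetical):

```groovy
// Hypothetical declarative Jenkinsfile: build and push the Docker
// image, deploy to staging, then require approval for production.
pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh 'docker build -t registry.example.com/example-service:${GIT_COMMIT} .'
                sh 'docker push registry.example.com/example-service:${GIT_COMMIT}'
            }
        }
        stage('Deploy to staging') {
            steps {
                sh 'kubectl set image deployment/example-service example-service=registry.example.com/example-service:${GIT_COMMIT} --namespace=staging'
            }
        }
        stage('Deploy to production') {
            steps {
                input 'Deploy to production?'
                sh 'kubectl set image deployment/example-service example-service=registry.example.com/example-service:${GIT_COMMIT} --namespace=production'
            }
        }
    }
}
```

Because Kubernetes keeps the rollout history, rolling back becomes the same command everywhere (`kubectl rollout undo deployment/example-service`) instead of a per-service procedure.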
Which programming languages or frameworks do we use?
The choice of languages and frameworks has a big impact on recruiting, cross-team pull requests, how quickly people can switch teams, and so on. We therefore decided to limit our stack for web applications to Ruby on Rails, NodeJS, React, and Python.
Does this mean we will never use Go? No. It does imply, however, that if a team would like to start using Go for a new service, this is not something they should decide by themselves. The idea is to start a broader discussion with the whole engineering team, to see whether we should expand our stack to include Go.
Enforcing rules or limitations is never an easy task, but we hope that the approach described above will increase productivity for all engineers in the long run, without limiting ownership.