From Datacenter to Public Cloud
By Jason Witting
In June 2017 Oddschecker began formulating a plan to migrate its entire production workload from its hosted data centre to the public cloud. It had become clear that the company had outgrown the compute and resources the existing DC could provide. Over the next nine months we migrated our entire development and production stack to GCP and Kubernetes, and while I would like to discuss the journey in detail, I’d rather share the meat and bones of it: the key philosophies and choices that led to its success.
You can only move as fast as your deployment pipeline!
The “DevOps” term has been abused a lot recently, often used to describe a variety of differing workflows, tools and skills. At its core, however, it is about the build and deployment pipeline; all good things stem from configuring and using it efficiently.
To us at Oddschecker, this is what a good pipeline consists of:
· Multiple stages or gates through which a build or artefact must pass, with each stage vetting it for suitability to be released. On any failure, the artefact or build should be prevented from passing to further stages
· Testing and building are automatic and done for every commit
· Releases are automated and can be rolled back
· Errors are logged and notifications are sent on both failure and success
· Artefacts or builds are immutable and should not be rebuilt on their journey through the pipeline
Cattle not Pets!
All of the above points are great in theory, but how do you get from where you are to where you need to be, especially when your current pipeline doesn’t match these ideals? This was the case with Oddschecker’s original pipeline, and it took a change of mindset to get there.
Typically, the phrase “cattle not pets” has been used to describe server infrastructure, where servers are either disposable (“cattle”) or unique (“pets”).
An example of a unique “pet” might be a manually installed and configured application server that can never be switched off or automatically upgraded, for fear of the surrounding dependent systems and services collapsing when it becomes unavailable. Contrast this with, say, a fully automated web server that is one of many identical “cattle” and can simply be destroyed, with a new one taking its place without any loss of service.
This idea of “cattle” vs “pets” is a very powerful concept and can be applied to more than just infrastructure. As an example, we had many different applications and products, each with its own “unique pet” release and deployment method, which ultimately meant a large proportion of our time was spent on maintenance and support. To resolve this, we focused on the commonality of these processes and came away with the following observations:
· Our build and release processes were spread across two different CI systems: Jenkins and GitLab.
· 95% of our Java applications were similar, and the process to build, package and release them could be templated and reused.
· We had over 150 projects with differing release and build scripts that needed to be updated each time our deployment environments changed.
· Some releases were entirely manual and required a member of the infrastructure team to complete them.
· Build job configurations and locations were inconsistent across applications.
We ultimately did the following, leveraging the power of the Kubernetes API and empowering each development team to manage and deploy their own applications:
· We unified our CI and CD platform. We chose GitLab as it was already the primary home of our code base, and it offered an easy-to-use, configurable build/deployment pipeline defined in a single YAML file residing within each repository.
· Created Helm templates that could be reused across all projects. This enabled us to update and manage deployments independently of their code base, and it meant all releases were now consistent across environments and could easily be rolled back.
· Created a standard CI build file that could be utilised across projects. We now had a consistent build pipeline for every project, making support and maintenance a breeze.
· Defined and documented a set of common global and shared CI variables that could be used across build and deployment jobs.
· Ensured all Helm deployments were linted and tested before being released, catching broken deployments before they could run.
Below is an example of what one of our build and deployment jobs looked like after unifying them and removing duplicate scripts.
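The sketch below is illustrative rather than our exact configuration: the stage names, images and the CHART_PATH variable are assumptions, but the overall shape reflects what is described above, namely building and testing on every commit, producing an immutable image that is promoted unchanged through later stages, and linting the shared Helm chart before a helm upgrade release that can be rolled back.

```yaml
# Illustrative .gitlab-ci.yml sketch; job names, images and variables
# are assumptions, not Oddschecker's actual configuration.
stages:
  - build
  - package
  - deploy

variables:
  # Shared/global CI variables (registry, chart location, etc.) would be
  # defined once at group level; CHART_PATH here is a hypothetical example.
  CHART_PATH: charts/java-service

build:
  stage: build
  image: maven:3-jdk-8
  script:
    # Build and test on every commit.
    - mvn verify
  artifacts:
    paths:
      - target/*.jar

package:
  stage: package
  image: docker:stable
  services:
    - docker:dind
  before_script:
    - docker login -u gitlab-ci-token -p "$CI_JOB_TOKEN" "$CI_REGISTRY"
  script:
    # The image is built once and promoted unchanged through later stages.
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"

deploy:
  stage: deploy
  image: registry.example.com/deploy-tools:latest  # hypothetical image with helm + kubectl
  script:
    # Lint the shared chart, then release; a bad release can be reverted
    # with `helm rollback`.
    - helm lint "$CHART_PATH"
    - helm upgrade --install "$CI_PROJECT_NAME" "$CHART_PATH" --set image.tag="$CI_COMMIT_SHA"
  environment:
    name: production
```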
By removing a lot of the uniqueness in our pipeline, we were able to spend more time working on more important challenges within the migration.
Infrastructure as Code
Another example of this idea of disposability versus uniqueness can be seen in how we decided to design, build and test our new GCP infrastructure.
There were two key end goals for our new GCP Infrastructure build:
1. All infrastructure would be built as code with no manual changes.
2. We could recreate our environment at will, with realistic expectations of recovery times in a DR situation.
We had previously tried out Google’s Deployment Manager for some POCs, but ultimately decided that Terraform, with its wealth of providers and declarative syntax, was a much better fit for our needs.
This approach meant we could iterate on our infrastructure build cycle, allowing quick decision-making and fast, consistent changes whenever we hit unforeseen blockers. This constant feedback loop of fixing and quickly rebuilding allowed us to validate our designs and gain confidence in our workflow.
Ultimately, we now have the confidence and the ability to say we can rebuild and redeploy our entire platform from scratch in a few hours rather than weeks.
“Monitor, Log and Measure everything, you can’t fix what you don’t know about.”
This is doubly true when shifting your entire platform to a new infrastructure environment. In our new GCP environments we decided that we wanted to ensure all applications and infrastructure were monitored and logged by default.
We had previously used an ELK stack, Icinga (Nagios), Cacti, InfluxDB and Grafana for logging, monitoring and metrics. Some of these tools didn’t naturally fit our new cloud architecture, so we chose to migrate the majority of our metrics collection over to Prometheus (with Grafana), as it was a better fit for Kubernetes.
Prometheus has a wealth of supported exporters and third-party integrations, and coupled with the fact that there was already a fully functional stable Helm chart, this meant we could get a production-ready metrics system in place with very little overhead.
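As a rough sketch, standing up the stack amounts to installing the stable/prometheus chart with a small values override; the keys below are assumptions based on that chart’s values.yaml, and the exact schema depends on the chart version in use.

```yaml
# Hypothetical values override for `helm install stable/prometheus -f values.yaml`;
# key names depend on the chart version.
alertmanager:
  enabled: true
nodeExporter:
  enabled: true
kubeStateMetrics:
  enabled: true
server:
  retention: "15d"
  persistentVolume:
    size: 50Gi
```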
As most of our Java applications are built with the Spring Boot framework, we were able to integrate a Prometheus metrics scraping endpoint into all of our apps. This gave us a great internal view of how the JVM was behaving, and it directly helped us debug a number of issues when tuning pod requests and limits on Kubernetes. For our non-Spring-Boot apps we simply integrated the jmx-exporter into our Tomcat Docker containers, which allowed all metrics exposed over JMX to be scraped.
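For illustration, the wiring involved looks roughly like the snippet below, assuming Spring Boot 2 with the Micrometer Prometheus registry and the annotation-based pod discovery that the stable Prometheus chart ships by default; the port and endpoint path are assumptions.

```yaml
# application.yml: expose the Prometheus scrape endpoint
# (assumes Spring Boot 2 with the micrometer-registry-prometheus dependency).
management:
  endpoints:
    web:
      exposure:
        include: "health,info,prometheus"
---
# Deployment pod template snippet: the prometheus.io/* annotations are what
# the chart's default annotation-based scrape config looks for.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/actuator/prometheus"
    prometheus.io/port: "8080"
```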
Outside of our Java apps, we collect metrics for all our shared infrastructure deployments, such as Redis and RabbitMQ, as well as for all Kubernetes resources.
In terms of logging, we wanted to capture the output of every application from the beginning. Stackdriver seemed like the obvious choice but unfortunately did not meet our requirements from either a cost or an integration perspective. Instead we used an EFK (Elasticsearch, Fluentd, Kibana) stack, with Fluentd deployed as a DaemonSet to collect all container logs.
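A stripped-down sketch of such a DaemonSet is shown below; the image tag, namespace and Elasticsearch service name are assumptions, and in practice the community fluentd-kubernetes-daemonset images (or a Helm chart) handle most of the log parsing configuration.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging          # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1.3-debian-elasticsearch
          env:
            # Point Fluentd's Elasticsearch output at the in-cluster ES service.
            - name: FLUENT_ELASTICSEARCH_HOST
              value: elasticsearch-client.logging.svc.cluster.local
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"
          volumeMounts:
            # Container stdout/stderr logs live on the node's filesystem.
            - name: varlog
              mountPath: /var/log
            - name: dockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: dockercontainers
          hostPath:
            path: /var/lib/docker/containers
```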
With both of these integrated into Grafana as data sources, we can create alerts that trigger PagerDuty or send notifications to Slack. Developers are able to build dashboards from both the Prometheus and Elasticsearch data sources, as well as create alerts that are meaningful to them.
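For completeness, hooking both sources into Grafana via its data source provisioning file looks something like the following; the service URLs and index pattern are assumptions.

```yaml
# Grafana data source provisioning (e.g. provisioning/datasources/datasources.yaml);
# URLs and the index pattern are illustrative.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server.monitoring.svc.cluster.local
    isDefault: true
  - name: Elasticsearch
    type: elasticsearch
    access: proxy
    url: http://elasticsearch-client.logging.svc.cluster.local:9200
    database: "logstash-*"     # index pattern Fluentd writes to
    jsonData:
      timeField: "@timestamp"
      esVersion: 5
```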
If we were to summarise and distil the lessons learned, removing the tooling specifics of our particular journey, these points stand out:
· Understand your problem domain: What are you fixing and why?
· Spend the time to build a good CI pipeline.
· Choose tools that allow you to iterate quickly.
· Observability is key: you can only fix something once you can measure it.
· Democratize all systems to allow developers to solve their own problems.
· Use technical debt and failure as direction to what should be fixed and improved rather than an anchor that weighs your team down.
We hope that you found this post useful.
Shameless Plug: If you want to work with great people and interesting tech, we are hiring!
References and Links:
https://www.amazon.co.uk/Continuous-Delivery-Deployment-Automation-Addison-Wesley/dp/0321601912
https://dzone.com/articles/the-anatomy-of-a-release-pipeline
http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/
https://docs.spring.io/spring-metrics/docs/current/public/prometheus
https://github.com/kubernetes/charts/tree/master/stable/prometheus