From Developer to DevOps: Lessons Learned in 2018

Aaron Williams
The Economist Digital
7 min read · Feb 15, 2019

Tl;dr: Key takeaways at the bottom

In April of 2018, I left my role as a Software Engineer to become a DevOps Engineer at The Economist. Whilst at first I was sceptical of “DevOps” being a dedicated role rather than general culture, I soon realised the necessity of having DevOps specialists embedded in distributed teams. Acting as part infrastructure, part development, part deployment engineers, our primary focus is to empower development teams to build and run high quality software, iterate with speed and instil a DevOps culture.

Jerry the DevOps Support Engineer. Photo by Jujhar Singh

Informally, the ethos of The Economist's DevOps Practice is to do enough knowledge sharing and enablement that "DevOps" becomes a universal responsibility. Personally, I believe the following lessons and experiences have brought us closer to that goal.

Don’t just do the work, share it

A lot of our early months in the DevOps practice were spent fighting outages as they occurred, cutting into the time we had to deliver value. One of the ways we overcame this tricky period was by documenting anything and everything that we had reason to question. Post-mortems, tutorials and known-bug write-ups were three common materials we'd author to increase knowledge sharing amongst our teams. These docs would live in, or be referenced from, the source code repository of the component they related to. An early issue we stumbled on was that a lot of useful documentation existed; we just didn't know how to find it. Cross-referencing and linking useful docs from the repository README allowed us to bridge that gap.

Obviously documentation alone isn't capable of stopping outages, but it definitely helps identify which issues are important, allowing developers to prioritise and pass on information for the next time it's needed. Documenting an issue in the codebase can be better than creating a ticket, as engineers have little capacity to trawl through a ticketing system during the golden hour of an outage.

Another effective measure for cross-skilling and sharing is making sure all engineers pick up tickets that would traditionally be considered an Ops or DevOps task. Doing so lets the whole team experience the best practices, and the pains, that these tasks bring, reducing siloing within roles. It's also worth applying the same philosophy to any ticket within a sprint. Silos are a pretty common reason engineers get bored in their role; being able to move around and share ideas counters that weariness whilst lowering knowledge barriers.

Find the Problem, then the Tool

Those who attended the AWS Summit in London in 2018 will know that there are plenty of vendors out there offering tools that solve "common" DevOps problems. It's very easy to get swept away by how much easier [the vendor says] these tools will make our daily work. Not to mention the free stickers they give out. We've started using some of those advertised tools here at The Economist, but not before making sure each one solves a valid problem.

Some tools stagnate and become legacy after a single project, only to turn out to be a crucial part of a deployment job that has to be run in the event of a random outage (it's happened). Writing up a simple outline of the intended purpose of each tool, and how it should be used within our projects, reduces the amount of vendor software we find in the legacy graveyard further down the road.

Where legacy tech lies. Photo by Eugenia Vysochyna on Unsplash

Getting the most out of our tool selection up front and sharing usage patterns across projects keeps engineers' knowledge of the tools sharp. But it's important to note that shoehorning a problem into an existing tool isn't the right choice either. For example, using an incident response tool like PagerDuty to trigger an autoscaling group event would probably cause more confusion than benefit.

Prove Problems with Data

Back in the early months of 2018, we experienced bouts of slowness in the range of 5 seconds on some of our application's API calls. Although everyone was aware of the issue, we found it difficult to pinpoint the exact cause due to a lack of telemetry throughout our application. Luckily for us, one of our caching layers reported as unhealthy during periods of high traffic, so we began by implementing optimisations there.

Whilst making this caching layer more maintainable, we ran simple load tests using vegeta against single containers and prod-like environments to verify how much traffic the Varnish cluster, fronted by Nginx, could truly handle in isolation. This process enabled us to identify misconfigurations, such as a low limit on available file descriptors, before replacing the existing layer and potentially ruining the experience for consumers of the API. Since then we've continued to load test any new infrastructure we maintain before rolling it out, to catch errors that would otherwise go unnoticed. One of our long-term goals is to run a suite of load tests as part of the deployment process for large-scale caching layers, so we continuously understand how our configuration changes impact consumers.
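To give a flavour of what "simple load tests" means here, below is a minimal sketch using vegeta's Go library. The target URL, request rate and duration are placeholders for illustration, not our real configuration; the point is simply to hold a constant rate against a single cache node and read off the latency percentiles and error rate.

```go
package main

import (
	"fmt"
	"time"

	vegeta "github.com/tsenart/vegeta/lib"
)

func main() {
	// Hypothetical target: a single Varnish container exposed locally.
	targeter := vegeta.NewStaticTargeter(vegeta.Target{
		Method: "GET",
		URL:    "http://localhost:8080/api/content/example",
	})

	// Hold 500 requests per second for 60 seconds against the cache in isolation.
	rate := vegeta.Rate{Freq: 500, Per: time.Second}
	duration := 60 * time.Second

	attacker := vegeta.NewAttacker()

	var metrics vegeta.Metrics
	for res := range attacker.Attack(targeter, rate, duration, "varnish-isolation") {
		metrics.Add(res)
	}
	metrics.Close()

	// Latency percentiles and the error list tell us whether the layer copes
	// before we put it in front of real consumers.
	fmt.Printf("p99 latency:  %s\n", metrics.Latencies.P99)
	fmt.Printf("success rate: %.2f%%\n", metrics.Success*100)
	fmt.Printf("errors:       %v\n", metrics.Errors)
}
```

Running the same attack against a prod-like environment and comparing the two reports is what surfaced the file descriptor limit for us: the single container fell over long before the network or the origin did.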

The above case study didn't necessarily prove the problem with data, but since then we have implemented more telemetry so that we can understand our stack. Specifically, for all of our Varnish caching layers, we have started forwarding varnishstat data to a centralised DataDog console. This has already paid off: we discovered we were storing huge numbers of duplicate objects due to a misunderstanding of the req.hash_always_miss attribute, which forces a fresh fetch but does not evict the old objects.
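The forwarding itself is not exotic. The Datadog Agent ships a Varnish check that can do this for you, but as a rough sketch of what "forwarding varnishstat data" involves, here is a hedged example that shells out to varnishstat and pushes a few counters to a local DogStatsD endpoint. The metric names, tags, polling interval and the flat JSON layout are assumptions for the example, not our production setup (counter layout differs between Varnish versions).

```go
package main

import (
	"encoding/json"
	"log"
	"os/exec"
	"time"

	"github.com/DataDog/datadog-go/statsd"
)

func main() {
	// DogStatsD listener exposed by the local Datadog Agent.
	client, err := statsd.New("127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}

	for {
		// varnishstat -j prints every counter as JSON.
		out, err := exec.Command("varnishstat", "-j").Output()
		if err != nil {
			log.Fatal(err)
		}

		// Parse lazily: only decode the counters we actually want, since the
		// top-level JSON also contains non-counter keys such as "timestamp".
		var raw map[string]json.RawMessage
		if err := json.Unmarshal(out, &raw); err != nil {
			log.Fatal(err)
		}
		counter := func(name string) float64 {
			var c struct {
				Value float64 `json:"value"`
			}
			if msg, ok := raw[name]; ok {
				_ = json.Unmarshal(msg, &c)
			}
			return c.Value
		}

		// Forward the counters we care about, tagged by service (names are illustrative).
		tags := []string{"service:content-api"}
		client.Gauge("varnish.cache_hit", counter("MAIN.cache_hit"), tags, 1)
		client.Gauge("varnish.cache_miss", counter("MAIN.cache_miss"), tags, 1)
		client.Gauge("varnish.n_object", counter("MAIN.n_object"), tags, 1)

		time.Sleep(15 * time.Second)
	}
}
```

It was a dashboard built on exactly this kind of counter (object counts climbing far faster than distinct content) that exposed the hash_always_miss duplication.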

Prioritising fixes becomes a lot easier when the details of an issue and its short-term workaround are shared in an organisation-wide post-mortem. The caching layer issue above came up in several post-mortems before a temporary squad was created to implement a long-term remedy. If the issue hadn't been proven with reports, it would have been much more difficult to assemble the resources needed to resolve it properly.

Experience the pains of the team

A key part of building a DevOps culture throughout a business is situating DevOps Engineers in the trenches of the development team, empowering the team to build and run their own applications. Becoming part of a squad's agile ceremonies allows us to understand the development process from everyone's perspective. Using the feedback expressed in technical huddles, standups and retrospectives, we can see where the bottlenecks in our process are. On more than one occasion the CI/CD tooling has been the point of congestion.

Within the Content Platform squad at The Economist, one of the common complaints raised in agile ceremonies was that CloudFormation (our Infrastructure as Code tooling) changes couldn't be properly tested before deploying to our stage environment. From this feedback, we developed a pattern for our newer services: engineers can spin up their own cloud deployment during development to test infrastructure changes before submitting a pull request. This shifted the common errors left, so they were solved before the code ever reached review. The "Dev Stack" pattern has allowed us to resolve a range of errors, from networking to missing properties, all before a feature is even shared.
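The mechanics of a dev stack are straightforward: take the same template the pipeline deploys and stand it up under a throwaway, branch-scoped name. The sketch below shows the idea using the AWS SDK for Go; the stack name prefix, template path, parameter and GIT_BRANCH environment variable are hypothetical, and in practice the pattern is driven from the CI tooling rather than a one-off program.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudformation"
)

func main() {
	branch := os.Getenv("GIT_BRANCH") // e.g. "feature/cache-headers"
	if branch == "" {
		log.Fatal("GIT_BRANCH must be set")
	}

	// Same template the pipeline uses for stage and prod.
	body, err := os.ReadFile("cloudformation/service.yaml")
	if err != nil {
		log.Fatal(err)
	}

	cf := cloudformation.New(session.Must(session.NewSession()))

	// Branch-scoped name keeps dev stacks isolated and easy to find (and delete).
	// Stack names only allow letters, numbers and hyphens, so swap the slash.
	stackName := aws.String("content-api-dev-" + strings.ReplaceAll(branch, "/", "-"))

	_, err = cf.CreateStack(&cloudformation.CreateStackInput{
		StackName:    stackName,
		TemplateBody: aws.String(string(body)),
		Capabilities: []*string{aws.String("CAPABILITY_NAMED_IAM")},
		Parameters: []*cloudformation.Parameter{
			{ParameterKey: aws.String("Environment"), ParameterValue: aws.String("dev")},
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Block until CREATE_COMPLETE, surfacing template errors (networking,
	// missing properties, bad refs) before a pull request is ever opened.
	wait := &cloudformation.DescribeStacksInput{StackName: stackName}
	if err := cf.WaitUntilStackCreateComplete(wait); err != nil {
		log.Fatal(err)
	}
	fmt.Println("dev stack ready:", *stackName)
}
```

The important design choice is that the dev stack is built from the exact template heading for stage, so whatever breaks here would have broken there.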

Don’t alert on everything

At the point of development, we should always consider the value lost when an application or feature breaks. Asking yourself, "Should I be woken up for this?", is a good way of getting an initial gauge of how important a possible outage is. Building sophisticated incident response around our systems is something we are working hard towards at The Economist.

We've struggled with email notifications from monitoring tools, driven by alarms set up for legacy projects, to the point where we're not sure which emails are of valid concern. Unfortunately, real errors get lost amongst the noise, and we're not made aware of a problem until it impacts the use of an application and someone notices. When we alert on too much, the alerts become useless. Should we be alerted for a few 5xx errors amongst tens of thousands of requests in a 5-minute period? Probably not.
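To put rough numbers on that question: what matters is the fraction of requests that succeed relative to the target we've set for the journey, not the raw count of 5xx responses. The toy sketch below makes that concrete; the 99.5% target, the window size and the request counts are made up for illustration.

```go
package main

import "fmt"

// availability is the fraction of requests in a window that succeeded from
// the user's point of view (e.g. non-5xx responses on a key journey).
func availability(good, total float64) float64 {
	if total == 0 {
		return 1 // no traffic, nothing to alert on
	}
	return good / total
}

func main() {
	const target = 0.995 // hypothetical 99.5% availability objective

	// A few 5xx amongst tens of thousands of requests in a 5-minute window:
	// 8 errors in 20,000 requests is 99.96% good, comfortably inside the target.
	quiet := availability(19992, 20000)
	fmt.Printf("quiet window:    %.4f  page someone? %v\n", quiet, quiet < target)

	// A real incident: 400 errors in 20,000 requests is 98% good, so we alert.
	incident := availability(19600, 20000)
	fmt.Printf("incident window: %.4f  page someone? %v\n", incident, incident < target)
}
```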

Taking a leaf out of Google's SRE book, we are moving our alerting as close to the user as possible. Creating Service Level Indicators around important user journeys across our systems, and alerting on them when we think customer value could be impacted, allows us to cut through the noise and divert our attention only when it's needed. I look forward to the SRE developments we make in 2019.

Key Takeaways

  • “DevOps” sometimes needs to be a role to create its culture
  • Empower teams by encouraging the sharing of typical DevOps knowledge
  • Don’t spend time implementing a tool unless you know it will solve your problems
  • Prove your problems and solutions with telemetry and reports
  • Embed yourself tightly in a team to feel and understand the pains of the engineers
  • Don’t alert on every metric. Monitor and alert as close to the user as possible
