DevOps and Standards

Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better.
- Nassim Nicholas Taleb, Antifragile: Things That Gain from Disorder

Viki Engineering is a small, lean team of ~30 engineers, based entirely in our Singapore office. I am part of the DevOps and Infrastructure team, which is two people strong. Our goal as a DevOps team is to architect our infrastructure for robustness and, more importantly, to create a set of best practices for engineers that increases developer productivity. The team came into existence at the end of 2014, when we realised that we had many different ways of doing the same thing and that there was a surprise at every corner of the system.

As we grow this team in 2016, both in experience and in size (+1), I would like to highlight some important points about having a DevOps team and give DevOps a formal definition, at least at Viki. Hopefully, this also applies to other companies that are roughly Viki's size in terms of scale or engineering resources.

DevOps includes:

  • ability to handle and debug any service level issue.
  • architecting services, in line with standards, best practices for reliability, scaling and cost optimisations.
  • automation for the infrastructure, from scratch to service (CI/CD tooling).
  • automation for developer productivity.
  • maintaining infrastructure components such as load balancing, DNS, databases, queues etc.

I am often asked what I look for in a good DevOps engineer when I interview candidates. A good DevOps engineer must have the following skills:

  • Good problem solving capabilities with a knack for not giving up on tough problems.
  • Good understanding of the underlying operating system, network and dependencies.
  • Interest in, and love for, reading man pages and the documentation of widely used open source systems.

These ideas stem from the SRE (Site Reliability Engineer) role that is popular in larger companies. SRE roles tend to be more embedded with the services, as each core service in a large company usually has a dedicated SRE. Given that we have 25–30 microservices, having an SRE for each is not possible. But having a team that can understand common patterns and look at the architecture from an efficiency, performance and cost standpoint is really valuable. In my opinion, an operations team that is disconnected from the developers is a bad idea. A team that only does deploys, handles machine-level issues and has no idea what the services running on the machines are doing is not doing what it is capable of.

One of the important things that we focus on is Standardisation and Best Practices. These two terms are commonly used alongside DevOps, but like any other idea they should only be taken to a certain limit, because they can have side effects. Before we talk about the benefits (the good parts), let's talk about the side effects.

The biggest side effect is that enforcing standards too strictly can lead to a lack of innovation. It is not wrong to use standards as a base for reasoning, but they should not limit the scope of thinking. For example, we use Docker with certain standards around how developers should write their Dockerfile, but these standards do not restrict what developers can and cannot run in a container.

Benefits of Standards

  • No reinventing the wheel. Standard common libraries.
  • Cost savings across the board.
  • Defined security plan with regular security checks.
  • Principle of least surprise while navigating the system.
  • Ease of development for developers, common language.

Examples of standards at Viki

There are some very basic standards pertaining to code at Viki which lead to happier results with engineers. We run an engineering on-boarding for new engineers where we explain these standards to the engineers.

The easiest aspect to standardise is naming (as that is the hardest problem in Computer Science), and we have taken that pain away from developers. Here is a non-exhaustive list of our standards:

  • Consistent naming across the board for each service (e.g. service name = repo name = docker image name = monitoring name). No funny names for services or servers. Servers are enumerated along with their function. Each service name succinctly describes what the service does. Common terminology for all developers, which helps in communication.
  • Load balancers as the source of all truth for service success rate and health monitoring.
  • Only load balancers should have public IP addresses; everything else is interconnected over a private network.
  • Statsd protocol for metrics, ELK stack for logging, Amazon SES for emails etc.
  • Regularly upgrading the docker base image for security updates.
  • Standard provisioning through Ansible — leading to standard kernel versions, packages etc.
  • Consul for storing all service configuration.
  • Docker container startup script with health checks.
  • Services integrated into CI pipeline with ease of writing and running tests.
  • Everybody should log to /var/log which has standard logrotate policies.
  • etc. etc.
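
To make the naming standard concrete, here is a minimal Python sketch of the kind of check that could enforce "service name = repo name = docker image name = monitoring name". This is an illustration and not our actual tooling: the `Service` record, the `NAME_PATTERN` rule, the `function-NN` server format and the example service names are all assumptions made up for this example.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed rule (not from the post): lowercase letters, digits and hyphens,
# starting with a letter.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]*$")


@dataclass
class Service:
    """Hypothetical record of the names a single service is known by."""
    name: str
    repo: str
    docker_image: str
    monitoring_name: str


def naming_violations(service: Service) -> List[str]:
    """Return a list of human-readable naming problems (empty if none)."""
    problems = []
    if not NAME_PATTERN.match(service.name):
        problems.append(
            f"service name {service.name!r} does not match the naming pattern"
        )
    # Every other identifier must be identical to the service name.
    for field in ("repo", "docker_image", "monitoring_name"):
        value = getattr(service, field)
        if value != service.name:
            problems.append(
                f"{field} {value!r} differs from service name {service.name!r}"
            )
    return problems


def server_name(function: str, index: int) -> str:
    """Enumerate servers along with their function; the format is assumed."""
    return f"{function}-{index:02d}"
```

A check like this could run in CI for every service, so that a mismatch between, say, the repo name and the monitoring name is caught before it surprises anyone.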

As we continue forward with our DevOps journey, I hope our ideas prevail and we never end up in this state -

I would love to know your company’s plan for DevOps and what you think about standards. Tweet at me @angadsg.

May the standards be with you.