DevOps Cookbook: How to Build a Smooth Process with Your Devs

Maria Kotlyarevskaya
Published in Wrike TechClub
Nov 8, 2021 · 8 min read

Hi! My name is Maria and I’m a DevOps engineer at Wrike. In this article, we’re going to talk about effective collaboration between DevOps engineers and development teams: how to determine the work process and its individual stages, and how to build that process up from scratch. I’ll tell you about the pitfalls that can come up on this journey and how to actually bring cool DevOps into developers’ lives.

The role of DevOps in project lifecycle

During my professional career as a DevOps engineer, I’ve worked on a lot of different projects, from a small website bot to an entire Contact Center solution. Each project was unique in its own way and had unique people working on it.

You might join a project at an early stage or at the closure stage.

In a perfect situation, the DevOps engineer joins the project during the first stages to shape the process and help with CI/CD pipelines, among other things. Usually, however, our help is really needed during the deployment and maintenance stage.

But I gained the most valuable experience working on projects in their final phase. It was quite challenging because all the developers had already left, and I had to migrate those projects to the new infrastructure without major downtime.

In such a situation, it’s crucial to know how to communicate with developers, prepare a plan, and audit the project.

How to set achievable goals

In most cases, a DevOps engineer is assigned to a project or specific team. This usually means that some problems have already arisen. They can be anything from a lot of rolled-back releases to a crippling lack of core development processes.

Before starting any task, prepare a plan with high-level goals. These goals should describe a measurable solution to those problems. I’d personally suggest using the SMART approach.

Here are some examples of common mistakes made during this process and how to clarify them:

The first goal, “Audit the project,” is unmeasurable: it’s impossible to understand what the final result should be. It’s better to rephrase it as something more specific:
“Solve the five most critical problems in the infrastructure of service X.”
This description will help managers, team leads, and your teammates understand your plans and the expected result.

The second goal, “Configure deployment to the new AWS region,” seems okay. But if you go on vacation or get reassigned to another project, this description won’t be clear enough for your teammates.

The last goal, “Configure monitoring for service X,” seems okay as well, but it’s not exactly clear what should be done during this task: it could mean deploying the whole Prometheus stack or just tweaking a few parameters in the application configuration.

You can find more information about the SMART approach in the software engineering field in this article.

Application audit

Reviewing a whole service or a bunch of microservices can be hard. That’s why it’s important to understand your goal and define the current state of every part of the project. If you need to improve monitoring and logging, it’s better to focus on these tasks and not try to do everything at once (that can be tricky :).

Documentation

I’d like to emphasize the importance of documentation in the application audit. I suggest focusing on one goal: having enough documentation to be able to effectively support or hand over the project.

What to check:

  • An architecture diagram with the main components and their interactions;
    This diagram will help you gather the minimum required information to start the work.
  • A list of components with their descriptions;
  • A list of external dependencies (cloud services, internal resources, etc.);
  • Configuration management;
  • Local development;
  • Ops documentation (monitoring, logging, debugging, deployment).

It’s not necessary to have all these items in your documentation, but in some cases they can be useful.

If any part is missing, initiate the process to fill in the gap:

  • Ask developers to write docs and help your team with them.
    Writing documentation is boring, but your role as a DevOps engineer is to get things started, lead the process, and explain why it’s important;
  • Be as specific as possible about the required information;
  • Explain how to start.
    It’s better to have a template or show a good example, like the sketch below.
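
For example, here’s a minimal sketch of a service documentation template (the section names are just a suggestion, not a standard) that covers the checklist above:

```
# Service X

## Overview
What the service does, who owns it, and where it runs.

## Architecture
Link to the architecture diagram; the main components and how they interact.

## Dependencies
External services (cloud resources, internal APIs, databases) and how to get access to them.

## Configuration
Where the configuration lives, how it is versioned, and how to change it safely.

## Local development
How to build, run, and test the service locally.

## Ops
How to deploy the service, where to find logs, metrics, and dashboards, and how to debug typical problems.
```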

DevOps audit

The goal of this step is to identify what bottlenecks might come up in the current development process.

What to check:

  • CI/CD pipelines;
  • Release processes;
  • Deployment procedure;
  • Escalation procedure in case of problems.
    All participants should understand how to act in critical situations, how the problem should be escalated, and what the expected result is (whether the task is assigned to the development team, a separate support team, etc.).

You don’t have to fix all the problems at once. It’s better to try to understand what actually bothers developers: maybe, for example, there are problems with the stability of the CI/CD pipelines. You can take a task into account right now or mark it as technical debt that you’ll resolve closer to the end of the project.

How to check:

  • Talk with developers about their development process (how they deal with tech debt tasks, how they do code reviews, etc.);
  • Check all CI/CD pipelines for missing steps and bad patterns. You can probably implement some best practices to increase stability and code quality;
  • Check build and deployment configurations (e.g., missing versioning or using the latest tag for Docker images); a small check like the sketch below can help here.
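
As an illustration of the last point, here’s a minimal sketch (the paths and manifest layout are assumptions) of a script that flags container images that are untagged or pinned to the mutable latest tag; in practice you may prefer an existing linter in CI, but the idea is the same:

```python
import re
import sys
from pathlib import Path

# Matches lines like "image: registry.example.com/app:1.2.3" or "- image: app"
IMAGE_LINE = re.compile(r"""^\s*-?\s*image:\s*["']?(?P<image>[^"'\s]+)""")

def has_bad_tag(image: str) -> bool:
    """Flag images that are untagged or pinned to the mutable 'latest' tag."""
    name, _, tag = image.rpartition(":")
    if not name or "/" in tag:  # no tag at all, or the ":" belonged to a registry port
        return True
    return tag == "latest"

def main(root: str = "k8s") -> int:
    problems = []
    for manifest in Path(root).rglob("*.y*ml"):
        for lineno, line in enumerate(manifest.read_text().splitlines(), start=1):
            match = IMAGE_LINE.match(line)
            if match and has_bad_tag(match.group("image")):
                problems.append(f"{manifest}:{lineno}: {match.group('image')}")
    for problem in problems:
        print(problem)
    return 1 if problems else 0  # a non-zero exit code fails the CI step

if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```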

This article is a good starting point for CI/CD best practices. The Chart Best Practices Guide in the official Helm documentation is an example of technology-specific best practices.

Infrastructure audit

This step is complicated because infrastructure is a huge topic. There are a lot of approaches, and each varies depending on the specific case. I suggest concentrating on verifying whether the infrastructure meets the requirements (e.g., reliability, uptime, high availability, scalability, etc.).

What to check:

  • Technologies and tools that are used to provision infrastructure;
  • Environment type: cloud, self-hosted, or hybrid;
  • Environment configuration.

How to perform the audit:

  • Check environments against the known requirements;
  • Check Ansible-, Terraform-, and other tool-related files against good and bad practices that you know or have googled;
  • Check the process around configuration management for these files;
    A lot of problems can arise if proper versioning or code review for infrastructure-related files is missing.
  • Check toolset versions (see the sketch below).
    Do they need to be upgraded for new features or security improvements?
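
For the toolset versions item, here’s a deliberately naive sketch (the directory layout is an assumption) that warns when Terraform modules don’t pin the Terraform or provider versions; dedicated tools like tflint can do this more thoroughly:

```python
from pathlib import Path

def check_version_pins(root: str = "terraform") -> list[str]:
    """Warn about Terraform modules that don't pin Terraform or provider versions.

    A simple text scan; a real setup would rather rely on tflint or a pre-commit hook.
    """
    warnings = []
    modules = {tf_file.parent for tf_file in Path(root).rglob("*.tf")}
    for module in sorted(modules):
        text = "".join(tf_file.read_text() for tf_file in module.glob("*.tf"))
        if "required_version" not in text:
            warnings.append(f"{module}: Terraform version is not pinned (no required_version)")
        if "required_providers" not in text:
            warnings.append(f"{module}: provider versions are not pinned (no required_providers)")
    return warnings

if __name__ == "__main__":
    for warning in check_version_pins():
        print(warning)
```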

To learn more about infrastructure audits, see Terraform best practices and the Kubernetes production checklist.

Security audit

We’re working with users, and our users want to keep their emails, passwords, and other personal data safe. So the goal of this step is to minimize the security risks for the environment and its users.

What to check:

  • Dockerfiles, Kubernetes manifests / Helm charts;
  • Environment configuration (e.g., whether Kubernetes clusters are private or not), firewall rules;
  • User permissions and roles;
  • Secrets.

How to do the security audit:

  • Run manual or automatic checks in the code repositories;
  • If your company has a security team, request an audit;
  • Understand how secrets are delivered to the application (see the sketch below).
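
To make the last item concrete, here’s a minimal sketch of the difference between reading a secret from a file mounted by the orchestrator and from an environment variable (the path and variable names are hypothetical):

```python
import os
from pathlib import Path

# Hypothetical locations: adjust to however your platform actually delivers secrets.
SECRET_FILE = Path("/var/run/secrets/app/db_password")
SECRET_ENV = "DB_PASSWORD"

def load_db_password() -> str:
    """Prefer a file mounted by the orchestrator (e.g. a Kubernetes Secret volume),
    falling back to an environment variable for local development."""
    if SECRET_FILE.exists():
        return SECRET_FILE.read_text().strip()
    value = os.environ.get(SECRET_ENV)
    if value:
        return value
    raise RuntimeError("Database password is not configured")
```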

Observability audit

The goal of this step is to have all possible sources for investigating any problem. If the application is small, it’s enough to have just logs. But if the application has a microservice architecture or follows high-load, distributed-system patterns, it’s better to have metrics and alerts as well.

What to check:

  • Logs;
  • Metrics and alerts for the application;
  • Metrics and alerts for the infrastructure, CI/CD pipelines, and security issues.

How to conduct the observability audit:

  • Check that logs are well-formatted and properly configured (a structured-logging sketch follows this list), and try running a search in the log aggregator (e.g., Elasticsearch, Graylog);
  • Check whether Grafana dashboards and application metrics contain all the important information;
  • Check whether alerts cover critical problems and are actionable.
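
“Well-formatted” usually means structured. Here’s a minimal sketch of JSON-formatted logging in Python (the field names are just an example) that makes searching in Elasticsearch or Graylog much easier than free-form text:

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line for the log aggregator."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("billing").info("invoice created")
# {"timestamp": "...", "level": "INFO", "logger": "billing", "message": "invoice created"}
```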

If there are a lot of microservices and complicated request chains, you may propose integrating tracing.

To read more about this step, see this overview of popular monitoring strategies and this example of application performance metrics.

DevOps toolset: first aid

There are a lot of tools that can help automate all the audit steps. You can run them locally or integrate them into CI so that developers will see alerts if they make undesirable changes.

Documentation

  • It’s better to have a solution with versioning.
  • Markdown / GitHub Pages, GitLab Pages / Confluence wiki pages.
  • Notion. This tool has a friendly UI and can be helpful if your project is not very large and it meets your requirements.
  • Backstage from Spotify. This tool not only has a nice UI but also rich features and various plugins, and it can be adopted for the whole organization.

Deploy to Kubernetes

  • Helmfile — declarative deployments (especially for infrastructure-related sources).
  • Helm-docs — auto-generates READMEs for Helm charts.
  • Helm-diff — a Helm plugin that shows a diff before chart deployment.
  • Chart-testing — lints Helm charts against policies.
  • Kubeval — validates your manifests against various Kubernetes versions.
  • Pluto — detects deprecated APIs in the Kubernetes cluster.

More interesting tools can be found on the awesome-helm list.

Infrastructure

  • Molecule — a testing framework for Ansible roles.
    If the infrastructure was built using Ansible, this solution is perfect for testing Ansible roles to make sure that changes won’t break anything.
  • Terraform pre-commit hook — supports various Terraform tools (e.g., tflint, tfsec) and can be integrated locally or in CI.
    This tool will help keep Terraform files consistent across several repositories and speed up the code review process.
  • TFSwitch — a convenient management tool for Terraform versions.

A demo scenario with Molecule and GitHub Actions.

Containers

  • Hadolint — tests Dockerfiles against best practices.
  • Dive — lets you look at what’s inside a Docker image.
  • Kaniko — a Dockerless way to build Docker images.

Security

  • Trivy — a scanner for Docker images, Git repositories, and file systems (see the usage sketch after this list).
  • Polaris — analyzes workloads against Kubernetes best practices.
  • Kube-bench — a CLI tool similar to Polaris, but it performs a different set of checks.
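
As one example of an automatic check, here’s a minimal sketch (the image name is a placeholder) of calling Trivy from a CI script and failing the build on high-severity findings:

```python
import subprocess
import sys

def scan_image(image: str) -> int:
    """Fail the build if Trivy finds HIGH or CRITICAL vulnerabilities in the image."""
    result = subprocess.run(
        ["trivy", "image", "--severity", "HIGH,CRITICAL", "--exit-code", "1", image],
    )
    return result.returncode

if __name__ == "__main__":
    # Placeholder image name; in CI this would come from the build step.
    sys.exit(scan_image("registry.example.com/service-x:1.2.3"))
```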

Local development

  • Minikube — simple to use and has a bunch of useful plugins.
  • Kind — Kubernetes in Docker; supports multi-node mode.
  • Skaffold — a project that allows developers to easily build and deploy their apps to Kubernetes.
  • Telepresence — allows developers to forward traffic from the cluster to their local machine, and much more.

Conclusion

Now that you have a general understanding of how to perform an audit and know a bunch of tools that can help you in this journey, you have everything you need to dive into any project.

Happy sailing!
