DevOps in a startup: a myth or a reality?

Dounia Alla
Published in Avito
9 min read · Dec 8, 2017

What is DevOps? It’s a way of thinking and a way of working: a blend of the tasks undertaken by the development and operations teams to make application delivery faster and more effective.

Adopting this new vision in a startup isn’t as simple as it seems. Why is that? DevOps rests on three principles from which all DevOps patterns can be derived:

  1. Systems thinking: this emphasizes the performance of the entire system, as opposed to the performance of a specific silo of work or department. The focus is on all business value streams enabled by IT. The outcomes of putting this principle into practice include never passing a known defect to downstream work centers, never allowing local optimization to create global degradation, always seeking to increase flow, and always seeking a profound understanding of the system (as per Deming).
  2. Amplify feedback loops: this is about creating right-to-left feedback loops. The goal of almost any process-improvement initiative is to shorten and amplify feedback loops so that necessary corrections can be made continually. The outcomes of this principle include understanding and responding to all customers, internal and external; shortening and amplifying all feedback loops; and embedding knowledge where needed.
  3. Culture of continual experimentation and learning: this is about creating a culture that fosters two things: continual experimentation (taking risks and learning from failure) and the understanding that repetition and practice are the prerequisites to mastery. We need both equally: experimentation and risk-taking ensure that we keep pushing to improve, even if it means going deeper into the danger zone than we’ve ever gone, while mastery of our skills is what lets us retreat out of the danger zone when we’ve gone too far. The outcomes of this principle include allocating time for the improvement of daily work, creating rituals that reward the team for taking risks, and introducing faults into the system to increase resilience.

For a startup, applying these principles isn’t simple at all: it demands a lot of short-term sacrifices to get a good result in the long term.

The Beginning:

Hiring a DevOps engineer isn’t just recruiting someone to do operations and some development; it requires the involvement of the entire organization. At Avito.ma, the idea of hiring our first DevOps engineer came with the decision to migrate from a datacenter to the cloud (AWS, in our case), so we needed someone experienced with the cloud to manage our stack.

At first, that person set up the stack in the cloud using an in-house tool that took the description of a running system and could recreate the stack of a website, service, or application on a cloud platform of choice.

Migration tool from Datacenter to Cloud

After that he set up some basic monitoring using Zabbix, but the problem was that the developers were still following the same pace: having the stack in the cloud didn’t change anything for them, and PMs too kept dealing with the product as if nothing had changed. This is not DevOps.

As the business started to grow, new features needed to be deployed, and problems piled up: bugs were increasing, performance was going down, the time to push anything to production was growing, deployment was manual, and we had a lot of rollbacks. Everyone was complaining: devs, CS, PMs, salespeople, managers… We were literally at the gates of hell; when someone said “Let’s go to prod”, it felt like suicide.

That’s when our DevOps engineer took the lead to change the situation. But change isn’t done overnight; it takes time, and it needs help from everyone.

The Reorganization:

The first step towards heaven was reorganization, which helped stabilize the product in the long run. We started by creating teams with different scopes; this change created a sense of ownership of the products among the team members (developers, tech lead, product manager). In the meantime, we created another team of two DevOps engineers whose clients were mainly the developers, who started delivering faster than before since they now owned a specific scope. The main issue was that, over time, the technical debt of the DevOps team started to increase, and our job became just fixing problems and dealing with performance instead of improving anything. (For those of you who don’t know, technical debt is the debt you acquire when you accept work without analyzing capacity and demand. It keeps you scrambling and taking shortcuts, which means a more fragile application in production, which in turn means more unplanned work in the future; and like financial debt, the interest grows over time.) But as Darwin reputedly said:

It is not the strongest of the species that survive, nor the most intelligent, but the one most responsive to change.

Measure anything, measure everything:

The second step after the organizational change was an assessment of the current state. As a result, we proposed working on projects that added value for the developers instead of dealing only with support. That required some compromises from all the teams, which became more involved in troubleshooting and resolving their own issues; whenever they needed something from the DevOps team, the Tech Lead submitted a request and followed a predefined process with a reasonable SLA.

Once the DevOps team had enough time, we started doing our real job. Of course, the first project that came to mind was setting up a good monitoring tool, one that covers all the necessary metrics, with alerts of course. And when we talk about metrics, we need to differentiate between three kinds of metrics:

Types of metrics

As for the tools, we chose Datadog and Prometheus/Grafana, since each solution has its own use case.

Datadog Dashboard
Grafana Dashboard
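To give a concrete flavor of the Prometheus side of this setup, here is a minimal alerting rule; the metric name and thresholds are illustrative, not our actual configuration:

```yaml
groups:
  - name: web-availability
    rules:
      - alert: HighErrorRate
        # Fire when more than 5% of HTTP requests fail over a 5-minute window
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "More than 5% of requests are failing"
```

The `for: 10m` clause keeps the alert pending until the condition has held for ten minutes, which avoids paging on short spikes.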

Automation everywhere

With monitoring now working without problems, the next project that came to mind was automating the deployment, which can be considered a seed for CI: we can have beautiful dashboards, but if deployment and rollback are done manually, everyone will still be afraid to deploy.

After a little brainstorming, the DevOps team presented the pipeline to the developers and discussed some important points with them:

  • Difference between flavors during the build
  • Ports used for communication between services
  • HTTPS and SSL termination

Automating the deployment required QA and pre-production environments with a local configuration similar to production. The tool we used for the pipeline was Jenkins, hosted in our own cloud. We put the pipeline under source control to keep track of its changes; it had 7 stages:

  • Preparation: prepare the environment used for the build
  • Checkout: get the specific branch that needs to be deployed
  • Build: build the packages according to the environment variables
  • Publish: publish the artifacts to S3
  • Deploy: install each package on its dedicated server and restart the services
  • Clean up: remove the old installed packages
  • Notification: send a notification about the status of the deployment (success or failure)

But since we weren’t going to handle a lot of deployments, we needed just one master and no slaves. And in order to launch the pipeline we needed, of course, some parameters:

  • The environment (QA, Pre-prod, Prod)
  • The type of deployment (platform, feature, fix)
  • The branch to deploy
  • Finally, the country, because we were deploying our code for 2 different countries.
Jenkins Deployment Pipeline
Jenkins Pipeline parameters
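Putting the stages and parameters together, the pipeline definition we kept under source control looked roughly like the following declarative Jenkinsfile; the make targets, repository URL, bucket name, and country codes below are placeholders, not our real values:

```groovy
pipeline {
    agent any

    parameters {
        choice(name: 'ENVIRONMENT', choices: ['qa', 'preprod', 'prod'], description: 'Target environment')
        choice(name: 'DEPLOY_TYPE', choices: ['platform', 'feature', 'fix'], description: 'Type of deployment')
        string(name: 'BRANCH', defaultValue: 'master', description: 'Branch to deploy')
        choice(name: 'COUNTRY', choices: ['country-a', 'country-b'], description: 'Target country')
    }

    stages {
        stage('Preparation') { steps { sh 'make prepare' } }
        stage('Checkout')    { steps { git branch: params.BRANCH, url: 'git@example.com:avito/platform.git' } }
        stage('Build')       { steps { sh "make build ENV=${params.ENVIRONMENT}" } }
        stage('Publish')     { steps { sh "aws s3 cp build/ s3://artifacts-bucket/${BUILD_NUMBER}/ --recursive" } }
        stage('Deploy')      { steps { sh "make deploy ENV=${params.ENVIRONMENT} COUNTRY=${params.COUNTRY}" } }
        stage('Clean up')    { steps { sh 'make clean' } }
    }

    post {
        // The Notification stage maps naturally onto Jenkins' post section
        success { echo 'Deployment succeeded' }
        failure { echo 'Deployment failed' }
    }
}
```

Keeping this file in the repository means every change to the pipeline itself goes through the same review process as application code.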

Are we on the right track ?

Now that the automation project is finished, our KPIs have improved dramatically and our feedback loop with the developers has become constant:

  • Frequency of code deployment: increased from once every two weeks or a month to an average of 3 times per day.
  • Code deployment lead time: decreased significantly; users started receiving bug fixes rapidly, and new features started hitting the market right on time, with a bigger impact.
  • Change failure rate: also decreased, since we started having fewer and fewer failures during deployments.
  • MTTR (mean time to repair): now that developers can deploy quickly, they can fix bugs without delay, which decreased the MTTR.
  • MTTF (mean time to failure) and MTBF (mean time between failures): we started having fewer and fewer failed deployments, thanks to having QA environments close to the production environment.
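These KPIs are simple to compute once deployments and incidents are logged somewhere. A minimal sketch in Python, with made-up sample data:

```python
from datetime import datetime, timedelta

def deployment_frequency(deploy_times, days):
    """Average number of deployments per day over the observed window."""
    return len(deploy_times) / days

def change_failure_rate(deployments):
    """Fraction of deployments that failed (e.g. required a rollback)."""
    failures = sum(1 for d in deployments if d["failed"])
    return failures / len(deployments)

def mttr(incidents):
    """Mean time to repair: average of (resolved - detected) per incident."""
    total = sum((i["resolved"] - i["detected"] for i in incidents), timedelta())
    return total / len(incidents)

# Made-up sample data for illustration
deployments = [
    {"failed": False}, {"failed": True}, {"failed": False}, {"failed": False},
]
incidents = [
    {"detected": datetime(2017, 12, 1, 10, 0), "resolved": datetime(2017, 12, 1, 10, 30)},
    {"detected": datetime(2017, 12, 2, 14, 0), "resolved": datetime(2017, 12, 2, 15, 30)},
]

print(deployment_frequency([None] * 12, days=4))   # 3.0 deploys per day
print(change_failure_rate(deployments))            # 0.25
print(mttr(incidents))                             # 1:00:00
```

Hooking something like this up to the deployment notifications from the pipeline is enough to track the trend over time.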

But having automatic deployment isn’t everything: we need to reach the final stage of Continuous Integration by integrating more automated tests (unit tests, integration tests, regression tests, acceptance tests, performance tests), because we don’t want our developers to develop the IWOML syndrome (“it works on my laptop”), even if it’s just with the QA environment.
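To give a flavor of the unit tests we want running in the pipeline, here is a minimal framework-free example in Python; the pricing helper is purely hypothetical, not part of our codebase:

```python
def listing_price_with_vat(price, vat_rate=0.20):
    """Hypothetical helper: add VAT to a listing price, rounded to cents."""
    if price < 0:
        raise ValueError("price must be non-negative")
    return round(price * (1 + vat_rate), 2)

# Plain functions with asserts; a runner like pytest would discover these too.
def test_adds_vat():
    assert listing_price_with_vat(100) == 120.0

def test_rounds_to_cents():
    assert listing_price_with_vat(9.99) == 11.99

def test_rejects_negative_price():
    try:
        listing_price_with_vat(-1)
        assert False, "expected ValueError"
    except ValueError:
        pass

for test in (test_adds_vat, test_rounds_to_cents, test_rejects_negative_price):
    test()
print("all tests passed")
```

Wiring a test stage like this into Jenkins before the Deploy stage is what turns the deployment pipeline into real CI.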

Monolithic vs micro-services: are we going back to hell?

Since our platform is monolithic, it was very hard for the developers to add new functionality, so they started adding micro-services for everything. For a DevOps team, handling a lot of micro-services would bring us back to our old nightmare. So we decided to work on a solution that would allow us to:

  • Have dev / QA / prod environments for the micro-services
  • Allow the micro-services to be discovered and added to the CI server
  • Allow the developers to ship a micro-service in a container to a private registry, following a naming convention
  • Restrict access to authenticated users via VPN
  • Provision the platform automatically
  • Scale the platform in and out
  • Monitor the platform: health, metrics, logs
  • Secure the communication between all the services
  • Ensure the high availability of the services

After some benchmarking, we concluded that we would use Kubernetes with Docker:

The implementation should allow us to create, update, and manage multiple environments of the K8s cluster infrastructure automatically. We decided to use Terraform as our Infrastructure as Code tool, etcd for service discovery, and flannel to manage the network. I’ll go into more detail about this setup in another article.
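As a sketch of what the Terraform side could look like, one module instantiation per environment keeps the clusters identical and reproducible; the module path, variable names, and instance sizes here are illustrative, not our real configuration:

```hcl
# Hypothetical in-house module wrapping the K8s cluster resources
module "k8s_cluster_qa" {
  source        = "./modules/k8s-cluster"
  environment   = "qa"
  worker_count  = 2
  instance_type = "t2.medium"
}

module "k8s_cluster_prod" {
  source        = "./modules/k8s-cluster"
  environment   = "prod"
  worker_count  = 5
  instance_type = "m4.large"
}
```

With this layout, promoting a change from QA to prod is a reviewed diff on the prod module block rather than a manual operation.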

We in the DevOps team didn’t have a roadmap of projects to develop; we started from what we had and then followed our needs, or rather the developers’ needs. We still have a lot of things to improve, but as long as the DevOps mindset lives in the heart of every developer, we won’t have any problem. And as the godfather of DevOps said:

“DevOps is not a goal, but a never-ending process of continual improvement.” (Jez Humble)
