8 Devops lessons learned from the coronavirus crisis

Published in TUI MM Engineering Center

7 min read · Jun 8, 2020

At the end of a project, it is a good habit to take some time to review everything that was done. Usually the decisions taken and the deadlines assigned to tasks are reviewed to detect what could have been done better, so that the same mistakes are not repeated in the future. The conclusions of this analysis are the lessons learned.

This practice can be extended beyond projects to any type of situation, such as the exceptional circumstances caused by the coronavirus crisis. It is always good to analyse what was done in order to improve in the future.

I will now comment on a series of points that have emerged from the current crisis and that I think are relevant from a Devops point of view. You can find a description of the actions we had to take at TUI DX in the article we published recently:

Cost control on Cloud. Covid-19 experience


1. That will never happen

Did anyone really think that a global pandemic could occur that would force a virtual shutdown of the company? Clearly not. From a disaster recovery point of view, a pandemic is not a scenario that is normally contemplated. It is not even a disaster recovery scenario in the strict sense, since it has not caused loss of service or loss of data. It has, however, forced us to make extraordinary modifications to the infrastructure very quickly, and that is one of the characteristics of Devops: the flexibility to make changes as fast as possible when the business requires it.

Our workloads mostly run in containers, so reducing the load seems easy: lower the minimum number of containers per service, lower the auto-scaling groups to the minimum to shrink the clusters, and so on. But when costs have to be cut as far as possible, additional measures have to be taken: stopping production services during certain hours or starting them automatically only for deployments, reconfiguring clusters so that they run only the services that have to be maintained, etc. Fortunately, some of these procedures were already in use in the DEV/TEST environments. Others were not, and new processes had to be created.
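As a rough illustration of the measures above (a lower minimum number of containers, services stopped outside certain hours, services started only for deployments), here is a minimal Python sketch. It is not the tooling we used at TUI DX: the ServicePolicy type, its fields and the desired_min_containers function are hypothetical names invented for this example, and the returned number would still have to be applied through whatever orchestrator or auto-scaling API is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServicePolicy:
    """Hypothetical per-service scaling policy; every field is illustrative."""
    normal_min: int            # minimum containers in normal operation
    reduced_min: int           # minimum containers while costs are being cut
    active_hours: range        # hours (0-23) in which the service must run
    deploy_only: bool = False  # run only while a deployment is in progress


def desired_min_containers(policy: ServicePolicy, hour: int,
                           cost_saving: bool, deploying: bool = False) -> int:
    """Return the minimum container count to request for a service."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if not cost_saving:
        return policy.normal_min
    if policy.deploy_only:
        # Started automatically only for deployments, stopped otherwise.
        return policy.reduced_min if deploying else 0
    if hour not in policy.active_hours:
        # Stopped outside the hours in which the service is needed.
        return 0
    return policy.reduced_min
```

Keeping the decision in a small pure function like this makes it easy to test the policy before pointing it at a production cluster.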

In conclusion, systems need to be designed to be as robust as possible against disasters and, at the same time, simple and flexible enough to be changed quickly when necessary.

2. Working remotely

Working remotely is just as effective as working from the office, or even more so. From a systems administrator’s point of view, the physical location of the servers is not relevant, and even less so if the infrastructure is in the cloud, as in our case. Moreover, for development in a Devops team, the code repository and the CI/CD pipeline, together with a log management system, remove the need to access the servers at all. However, working remotely without problems requires several things to have been set up beforehand: VPN, firewall access, security groups for the applications, video conferencing for meetings, etc. At TUI DX we have been working remotely some days a week for a long time, so all of this was already in place and we simply kept working remotely as on any other day we work from home. Companies that were not prepared because they did not believe in remote working, and that have had to adopt it by force, have had to make a great effort to adapt. Remote working is going to become widespread from now on because the companies that adopt it are simply more agile.

3. Fewer meetings

When we started working to adapt our infrastructure to the new circumstances, we focused only on that goal and many other projects stopped. As a result, the number of meetings dropped significantly.

If meetings are useful, the time invested is worth it, but if they are not, they are a big waste of time. Just as we try to reduce costs by using the advantages of the cloud, we also have to take into account the other two main factors in any project: time and resources. To optimize the time we spend on a project, the best thing we can do is to eliminate the time spent in useless meetings. To do this we must ask ourselves several questions:

Is the agenda of the meeting clear?

Do I need to be in that meeting? Am I going to contribute?

Is it necessary for two people from the same team to attend?

Often the team leader attends a meeting to make the decisions, together with the person who actually knows the topic technically. This simply shows that the team leader does not know how to delegate, wants to centralize every decision and ends up being a bottleneck for the team. This is something to avoid in a Devops team. Team members, especially in a mature team with senior members, should have the authority to make decisions on the topics in which they are the experts.

4. Controlling garbage saves a lot of money

During this crisis we have had to optimize the cost of our infrastructure as far as possible, looking for creative ways to maintain the essential services at the lowest cost and with the highest availability possible. We have reduced the number of clusters and the number of instances, moved to smaller instance types or to cheaper CPU types, reduced the running time of some services, stopped others, reduced data retention times where possible, etc.

All the actions taken fall into two groups: reduction and elimination. In fact, eliminating services, resources and data that were not really useful reduced the cloud infrastructure bill considerably. In a normal situation we often do not delete data or resources “just in case”. At other times we are simply not aware of them: test instances were created and never deleted once they were no longer in use, the data retention period was never properly configured, etc. None of this is acceptable. The resources and data kept in the infrastructure must be correctly identified and controlled, otherwise we are not doing proper cost control.
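To make the idea of identifying and controlling resources more concrete, here is a minimal Python sketch that flags candidates for elimination in an inventory. Everything in it is an assumption made for the example: the Resource fields, the 30-day idle threshold and the find_garbage function are hypothetical, and in practice the inventory would come from the cloud provider's tagging and billing data rather than from a hand-written list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Resource:
    """Hypothetical inventory entry; the fields are illustrative."""
    name: str
    owner: Optional[str]   # None when nobody has claimed the resource
    last_used: datetime
    monthly_cost: float


def find_garbage(resources: List[Resource], now: datetime,
                 max_idle: timedelta = timedelta(days=30)) -> List[Resource]:
    """Return resources without an owner or idle for longer than max_idle.

    The result is sorted by monthly cost, most expensive first, so the
    biggest potential savings are reviewed first.
    """
    stale = [
        r for r in resources
        if r.owner is None or now - r.last_used > max_idle
    ]
    return sorted(stale, key=lambda r: r.monthly_cost, reverse=True)
```

The function only produces a review list. Whether something is really deleted should remain a decision for the team that owns it.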

5. Cloud usage

We have talked about modifying the infrastructure quickly to reduce costs. Clearly, this is mainly possible if your infrastructure is in a cloud environment. Otherwise, reducing costs is very complicated. Devops practices are often associated with the cloud. The cloud is not mandatory for Devops, but as we have seen in this case, having your infrastructure there is very beneficial because of the agility and elasticity it provides. If you are using serverless services, the elasticity is even better.

However, the elasticity of the cloud has its limits. Several factors have an influence, and sometimes agility and elasticity pull in opposite directions. For example, it is possible to create an EC2 instance in AWS quickly simply by choosing an instance type, but that in turn prevents you from creating an instance with a lot of memory and very few CPUs. Reserved instances are another case: they are good for reducing costs, but they limit elasticity because they commit you to a certain usage for a fixed period. Most companies that use reserved instances never considered a situation like this one when calculating how many to reserve.
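The trade-off with reserved instances can be seen with some simple arithmetic. The Python sketch below assumes a deliberately simplified pricing model in which a reservation is billed at a flat hourly rate for every hour of the term, used or not, with no upfront payment. The function names and any prices passed to them are hypothetical and are not real AWS rates.

```python
def break_even_utilization(on_demand_hourly: float,
                           reserved_hourly: float) -> float:
    """Fraction of the term an instance must run for a reservation to pay off.

    Below this utilization, paying on demand only for the hours actually
    used would have been cheaper than the reservation.
    """
    if on_demand_hourly <= 0 or reserved_hourly <= 0:
        raise ValueError("hourly prices must be positive")
    return reserved_hourly / on_demand_hourly


def reservation_saving(on_demand_hourly: float, reserved_hourly: float,
                       hours_in_term: float, utilization: float) -> float:
    """Saving of a reservation versus on demand; negative means a loss."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    on_demand_cost = on_demand_hourly * hours_in_term * utilization
    reserved_cost = reserved_hourly * hours_in_term  # paid even when idle
    return on_demand_cost - reserved_cost
```

With a reserved rate at half the on-demand rate, for instance, the break-even point is 50% utilization. If a crisis pushes the real usage of those instances below that level, the reservation turns from a saving into an extra cost for the rest of the term.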

6. Resilience

The current situation has forced us to make quick decisions with little information, change architectures based mainly on cost or take on tasks we don’t usually do.

In this situation we have been able to see our colleagues’ capacity to adapt. Many people have taken on new tasks and learned quickly. Managers have also had to perform new tasks or help other team members. Good managers should know the work their team does well enough to make the right decisions and, as in this case, to lend a hand when needed.

Another important feature of Devops teams is their resilience, and this has been demonstrated in this situation.

7. Retire legacy applications

One of the problems we have historically encountered in IT is the difficulty of removing old applications. Most of the time they are kept “just in case” someone needs to consult something, or simply because nobody knows whether they are still being used. The covid-19 crisis has forced us to reduce costs dramatically, and the reduction with the least impact is to eliminate applications and services that are no longer useful.

Removing applications is not so difficult if there is really the will to do so and the benefits are worth it, both because of the costs saved and because it allows the platform to evolve to more modern technologies.

8. Cost consciousness

The coronavirus crisis has forced us to be more aware of costs. Cost control is a fundamental aspect of Devops, closely related to making optimal use of the cloud. The cloud gives you many tools to control costs: reports on what is spent on each type of resource, optimization recommendations, features for managing resources cost-efficiently such as S3 lifecycle management, etc.
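As an example of the kind of feature mentioned above, the following Python sketch builds an S3 lifecycle configuration that moves objects to a cheaper storage class and later deletes them. The dictionary follows, as far as I know, the shape of the S3 lifecycle configuration format. The function name, the rule ID, the "logs/" prefix and the day counts are assumptions chosen for illustration, not the values we use.

```python
import json


def lifecycle_configuration(prefix: str, ia_after_days: int,
                            expire_after_days: int,
                            rule_id: str = "reduce-retention") -> dict:
    """Build a lifecycle configuration with a single rule for one prefix.

    Objects under `prefix` move to the STANDARD_IA storage class after
    `ia_after_days` and are deleted after `expire_after_days`.
    """
    if not 0 < ia_after_days < expire_after_days:
        raise ValueError("expected 0 < ia_after_days < expire_after_days")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical values: keep logs 30 days in standard storage, 1 year total.
    print(json.dumps(lifecycle_configuration("logs/", 30, 365), indent=2))
```

A dictionary like this could then be applied to a bucket, for example through the LifecycleConfiguration argument of boto3's put_bucket_lifecycle_configuration call. It is worth reviewing the rule on a test bucket first, since expiration deletes data.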

But as the infrastructure grows, it becomes more difficult to keep expenses and bad practices under control. The experience of this crisis, in which cost control has become so important, makes us realize how much these tasks matter for maintaining a cost-efficient, useful and manageable infrastructure.

So we can learn from everything, even from a pandemic, and apply these lessons to improve our Devops processes and offer more agile, clean, cost-efficient and elastic services to our business.

Thanks to my colleagues in the Technology Architecture team for the effort they made during these hard days: Guillermo Rey, Cristian Sacristan, Jorge Alvaro, Sergio Ramirez, Santiago Ponce, Javier Torres, as well as colleagues from all the other teams in TUI DX and the external providers who also worked very hard to adapt to the new situation.