Change and Incident Management in Trendyol Network Team
As Trendyol’s Network Team, we execute many operations on daily, weekly, quarterly, and annual cycles, ranging from creating a simple firewall policy to performing a multi-site disaster test. Having these operations go smoothly, without any incidents or significant errors caused by the change, is essential. But even under perfect conditions, incidents can occur. In times like these, having a steady approach to the problem and learning from our mistakes afterward can save us a lot of time and resources.
This article aims to provide insight into how Trendyol’s Network Team handles change and incident management processes.
Planning a Change
Atlassian describes a change as:
“A change is adding, modifying, or removing anything that could have a direct or indirect effect on services.”
Planning a change is a challenging and rather complex task. Most incidents are triggered by planned changes to the infrastructure environment, so to avoid such risks, each change, regardless of its size and scale, should be evaluated thoroughly.
In our team, each change has an owner and a pair. This approach is based on the 3-star rule used by scuba divers. Every team member holds a number of stars based on their experience in their field. To prepare and execute a change, at least one experienced team member must participate, and the owner and the pair must hold at least three stars combined.
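To make the rule concrete, here is a minimal Python sketch of how such a pairing check could look. The `Engineer` class, the star values, and the assumption that “experienced” means two or more stars are illustrative only, not our actual tooling.

```python
from dataclasses import dataclass


@dataclass
class Engineer:
    name: str
    stars: int  # experience level, e.g., 1 (junior) to 3 (senior)


def can_execute_change(owner: Engineer, pair: Engineer) -> bool:
    """Check the 3-star rule: at least one experienced engineer participates
    (assumed here to mean 2+ stars) and the owner and pair hold at least
    three stars combined."""
    experienced_present = max(owner.stars, pair.stars) >= 2
    return experienced_present and (owner.stars + pair.stars) >= 3


# A 1-star owner paired with a 2-star engineer meets the rule;
# two 1-star engineers do not.
print(can_execute_change(Engineer("owner", 1), Engineer("pair", 2)))  # True
print(can_execute_change(Engineer("owner", 1), Engineer("pair", 1)))  # False
```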
The last step is to have a detailed checklist for the planned change operation. We’ll cover the key elements of a clear change plan later in this section.
As Trendyol’s Network Team, our change operations fall under these categories:
- Standard Changes: no impact on the production environment, no potential risk of failure; these can be executed as daily tasks. (e.g., creating new firewall rules, configuring new interfaces)
- Medium/High Impact Changes: a possible impact on the production environment and a risk of traffic interruption or disruption of service integrity; these must be executed outside of active hours with the 3-star rule applied. (e.g., device version upgrades, scaling operations)
- Emergency Changes: might affect the production environment (traffic disruption, etc.), but not executing the change would have a higher impact, so they must be executed immediately. (e.g., device reload)
Types of Changes:
- Device/Software Version Upgrade/Downgrade: Operating systems can be thought of as living organisms. Bugs occur, vulnerabilities are discovered, or our team may simply want to upgrade a device to use a new feature available in the latest stable release of that OS approved by the vendor. Unfortunately, OS upgrades carry the risk of hitting bugs, depending on the configuration used in the data center environment. If an upgrade introduces such a problem, downgrading to the previous known-good version is often the most reliable solution.
- Scaling Operations: Our team might execute changes in each domain and data center to meet the demand created by increasing user traffic, especially during high-volume events.
- Physical Operations: Our team might add or remove devices in data centers for scaling purposes or to introduce a new architectural design component. Also, due to unforeseeable issues (such as hardware failures or cable/transceiver problems), our team might remove a faulty device and replace it with a new one.
- Disaster Tests: Disaster tests are performed by our team once the data center infrastructure is deployed, and annually thereafter, to ensure the reliability of the data center.
Importance of Checklists During Change Operations
In his book, The Checklist Manifesto, Atul Gawande describes checklists as follows:
“Checklists seem to provide protection against such failures. They remind us of the minimum necessary steps and make them explicit. They not only offer the possibility of verification but also instill a kind of discipline of higher performance.”
Checklists are great task organizers. They minimize errors, reduce complexity, and provide a clear timeline for when to execute each step. It is easy to skip or miss critical steps in the heat of a change operation or when a problem occurs; you can always turn to your checklist.
As Trendyol’s Network Team, we have a fairly standard checklist that we go through before every medium/high impact or emergency change operation:
- Scope of the Change: Determine the scope of the change and specify the domain you’ll be working in, along with possible effects on other domains. Avoid scheduling a change at the same time as other teams working in the same environment.
- Plan: Have your change, test, and rollback plans and checklists ready. Inform the other teams you will be working with and have them review the change and test plans.
- Before the Change: Always check your service/device/VM state so you are aware of any pre-existing problems before you start working on the environment. Test service/device/VM access (console, management, etc.). Take a final service/device/VM backup before the change. Log the important command outputs (a minimal sketch of such a pre-change snapshot follows this checklist). Post an announcement that you’ll be executing the change.
- During the Change: Follow the pre-arranged change steps, take note of the timeline, and monitor your services if necessary. Some logs, metrics, and command outputs are essential to keep track of during the change; each must be carefully monitored and logged.
- After the Change: Once the change is completed, check the service/device/VM states you’ve worked on and log the outputs to verify that the change did not cause any unexpected problems. Post an announcement that you’ve completed the change. Inform the team about the change details and status. Update the related documents if necessary.
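As an illustration of the “Before the Change” step, the sketch below collects a set of show-command outputs from a device and writes them to a timestamped log file, so the same snapshot can be taken again after the change and diffed. It is a minimal example built on the Netmiko library; the device parameters, command list, and file layout are hypothetical and would differ per vendor and environment.

```python
from datetime import datetime
from pathlib import Path

from netmiko import ConnectHandler  # third-party SSH library for network devices

# Hypothetical device and command list; real values depend on vendor and environment.
DEVICE = {
    "device_type": "cisco_ios",
    "host": "192.0.2.10",
    "username": "netops",
    "password": "********",
}
SNAPSHOT_COMMANDS = ["show version", "show ip interface brief", "show ip route summary"]


def snapshot(label: str) -> Path:
    """Run the snapshot commands and log their outputs to a timestamped file."""
    out_file = Path(f"{DEVICE['host']}_{label}_{datetime.now():%Y%m%d_%H%M%S}.log")
    conn = ConnectHandler(**DEVICE)
    try:
        with out_file.open("w") as fh:
            for cmd in SNAPSHOT_COMMANDS:
                fh.write(f"### {cmd}\n{conn.send_command(cmd)}\n\n")
    finally:
        conn.disconnect()
    return out_file


# Take a "before" snapshot prior to the change and an "after" snapshot once it is
# done, then diff the two files to spot unexpected state differences.
before_log = snapshot("before")
```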
Importance of Documentation
As Trendyol’s Network Team, we believe in the power of written and asynchronous communication. From a simple task update to the entire data center topology, every form of document is as important as the next. Documentation saves us time because its contents cover almost everything we need to know about the respective environment.
We keep a record of all environments and processes maintained and provided by our team. In addition, we make sure our documents are up to date, written in a simple and informative manner, and categorized accordingly.
How to React to an Incident?
Our approach to incidents follows Atlassian’s classic cycle:
Detect, Respond, Recover, Learn, and Improve.
- Detect: Incidents can occur anywhere, anytime. Messaging channels, alerting tools, and periodic daily checks across all data centers can warn us about an upcoming or ongoing incident (a minimal sketch of such a periodic check follows this list).
- Respond: The team’s sentinel for the day can quickly pick up on any service or traffic going south and inform the team about the problem; all related team members then gather in an emergency Zoom meeting to assess the situation.
- Recover: After the initial response, the whole team works together to resolve the problem in the most accurate way possible. Escalation is a big part of this process.
- Learn: Postmortems and RCA reports should always be part of the improvement process. Our team also holds weekly lessons-learned and informational update meetings to exchange ideas about ongoing cases and past incidents.
- Improve: Taking preventive measures so that a past incident never happens again is crucial. After a detailed assessment, our team lists all the preventive solutions and creates the related issues.
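As a simplified illustration of the Detect step, the sketch below runs a periodic reachability check against a list of devices and prints a warning for anything that does not answer. In practice this role is filled by dedicated monitoring and alerting tools; the host list, interval, and plain ICMP check here are assumptions made for the example.

```python
import subprocess
import time

# Hypothetical device list; in practice this would come from an inventory system.
DEVICES = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
CHECK_INTERVAL_SECONDS = 300  # run the check every five minutes


def is_reachable(host: str) -> bool:
    """Send a single ICMP echo request (Linux ping flags) and report whether
    the host answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


while True:
    unreachable = [host for host in DEVICES if not is_reachable(host)]
    if unreachable:
        # A real setup would post this to a messaging channel or alerting tool.
        print(f"ALERT: unreachable devices: {', '.join(unreachable)}")
    time.sleep(CHECK_INTERVAL_SECONDS)
```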
Postmortem Documentation and RCA Process
We try to document the impact of incidents with Root Cause Analysis (RCA) and postmortem reports. These documents provide a clear timeline of the incident and a valuable list of learning points for the team.
The Network Team’s postmortem reports aim to summarize each incident with the following sections and questions (a minimal template sketch follows the list):
- Incident Title
- Summary
- Timeline
- Impact
- How did we notice the incident?
- How could we have noticed it earlier?
- Why did it happen?
- What went well?
- What could have gone better?
- How did we resolve it?
- What are the learning objectives?
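To keep these reports consistent, the sections above can be captured in a simple template. The sketch below generates a Markdown postmortem skeleton; the section list mirrors the questions above, while the helper itself is a hypothetical illustration rather than our actual tooling.

```python
POSTMORTEM_SECTIONS = [
    "Incident Title",
    "Summary",
    "Timeline",
    "Impact",
    "How did we notice the incident?",
    "How could we have noticed it earlier?",
    "Why did it happen?",
    "What went well?",
    "What could have gone better?",
    "How did we resolve it?",
    "What are the learning objectives?",
]


def postmortem_skeleton(incident_title: str) -> str:
    """Build an empty Markdown postmortem document with one heading per section."""
    lines = [f"# Postmortem: {incident_title}", ""]
    for section in POSTMORTEM_SECTIONS[1:]:  # the title is already the top heading
        lines += [f"## {section}", "", "_TODO_", ""]
    return "\n".join(lines)


print(postmortem_skeleton("Unforeseen bug during a device version upgrade"))
```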
Results
With self-regulated change and incident management processes, our uptime rose to 99% all year round.
Out of 140 change operations executed so far in 2022, 135 have been completed successfully, four have been canceled, and one change was completed with errors because of an unforeseen bug, which was followed up and resolved through an RCA and postmortem process.
I’m so grateful and honored to be a part of a team that constantly improves themselves, their team members, and their processes.
Thanks a lot for reading so far.