Next challenge: Automated network operations

Ezgi Küşüm
Published in Trendyol Tech
Sep 7, 2021

Network automation is no longer an option; it is a necessity. This article explains why this is so by sharing some facts and the way of thinking we embraced at the very beginning of the Trendyol network automation journey.

To automate, or not to automate: that is the question

Whether ’tis nobler in the mind to suffer
Repeating the low-level tasks manually over and over every day
Or to take arms against the sea of automation.

...Shakespeare would say of the modern IT world.

Photo by Max Muselmann on Unsplash

What is network automation?

Let’s start with what network automation is “NOT”.

  • It is not merely pushing configuration to network equipment
  • It is not writing yet another script for each operational task
  • It is not something you do solely to get things done quickly and easily

Then, what is it?

Network automation covers the whole pipeline when implementing a change to your network infrastructure. The process includes configuration, testing, deployment, and post-deployment testing.

Not only changes but also the regular operations and maintenance activities can be automated.

Some facts:

  • Trendyol has 3 data centers where we use equipment from different vendors.
  • Command sets and configuration contexts differ between vendors.
  • Each vendor has its own network management tool, for which we pay license and support fees every year.
  • We cannot customize those network management tools according to our needs.

The shortcomings of the current way of working

1. Dealing with many inefficient tools

If you’d like to add a new fabric to the Mars data center using Vendor B switches, you need to use Vendor B’s management tool to create the configuration templates and deploy them. If a similar fabric is added to Venus, the same must be done using Vendor A’s management tool.

The network team has to know how to use all of these management tools and also maintain them: performing regular upgrades, raising support tickets when things are not working, and waiting for help.

When additional features are required, you can always raise a new feature request. However, there is no guarantee that the requested feature will be developed.

2. Vulnerable to human error

You can forget to take a backup during an upgrade, overlook an error log after performing a change, make a typo while entering a VLAN configuration...

Negligence is not the only cause: misunderstanding the design or lacking deep technical knowledge can also lead an engineer to make mistakes.

It takes years of experience to gain a comprehensive understanding of telecommunication protocols and data center design. If you’ve recently joined the team, you can easily make mistakes on complex tasks.

3. Boredom and burnout

Uniform, repetitive, and monotonous tasks are boring, especially if you are an engineer. The team already automates such tasks using scripts or management tools. However, most of the time this automation merely mimics the manual steps. We do not have the whole pipeline, including pre- and post-deployment tests.

Manual tasks cause boredom and may lead to burnout because they take too much time.

There is another potential cause of burnout. Teams usually consist of people with different levels of experience, and the burden of complex tasks usually falls on the experts. In rapidly growing companies such as ours, there are many complex tasks that need to be done repeatedly when deploying a new fabric, testing a new design, or troubleshooting a problem. Dealing with a never-ending stream of complex, high-priority issues may burn out the experts on the team.

Photo by Sebastian Herrmann on Unsplash

Trendyol network team’s vision for network automation

If we were to write a manifesto about network automation, it would rest on three pillars:

1. One tool to rule them all

  • We should be able to use the same medium to manage every kind of network equipment.
  • Deploying a new Vendor B switch at Mars, upgrading a Vendor A firewall at Venus, or adding a new configuration to establish connectivity between hosts at Earth and Mars… Regardless of the type of work, we would like to use the same medium.

WHY? — We won’t spend additional effort to prepare the same thing with different management tools for each vendor. Network engineers should specify the “intention,” and the automation platform should manage the rest.

Let’s say you’d like to add a new service that works across all data centers, so a similar config has to be added to all leaf switches. If you have a single tool capable of generating the config for each vendor, you only need to trigger the job from a single platform. If you have a separate tool for each vendor, you have to repeat the same job in every one of them.
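To make the idea of specifying the “intention” once more concrete, here is a minimal Python sketch. It is not our platform, only an illustration under stated assumptions: the `VlanIntent` class, the device names, and both vendor syntaxes are invented for this example and do not correspond to any real vendor’s command set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VlanIntent:
    """Vendor-neutral description of what the engineer wants."""
    vlan_id: int
    name: str


def render_vendor_a(intent: VlanIntent) -> str:
    # Hypothetical Vendor A syntax, invented for illustration only.
    return f"vlan {intent.vlan_id}\n name {intent.name}"


def render_vendor_b(intent: VlanIntent) -> str:
    # Hypothetical Vendor B syntax, invented for illustration only.
    return f"set vlans {intent.name} vlan-id {intent.vlan_id}"


# One renderer per vendor; supporting a new vendor means adding one entry.
RENDERERS = {"vendor_a": render_vendor_a, "vendor_b": render_vendor_b}

# Hypothetical inventory: leaf switch name -> vendor.
INVENTORY = {
    "venus-leaf-01": "vendor_a",
    "mars-leaf-01": "vendor_b",
}


def render_all(intent: VlanIntent, inventory: dict[str, str]) -> dict[str, str]:
    """Render a single intent into a config snippet for every device."""
    # Valid VLAN IDs are 1-4094; catching a typo here is cheaper than on a device.
    if not 1 <= intent.vlan_id <= 4094:
        raise ValueError(f"VLAN ID {intent.vlan_id} is outside 1-4094")
    return {
        device: RENDERERS[vendor](intent)
        for device, vendor in inventory.items()
    }


if __name__ == "__main__":
    configs = render_all(VlanIntent(120, "payments"), INVENTORY)
    for device, config in configs.items():
        print(f"--- {device} ---\n{config}")
```

The engineer touches only `VlanIntent`; the vendor differences live in one place, which is the point of using a single medium.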

2. No room for workarounds

That may sound like a disadvantage, but on the contrary, we believe that it’s an excellent thing.

  • Each type of service that network infrastructure provides will be defined and categorized.
  • A standard workflow will be created for each category.

WHY? — If you do not define your services and create a standard workflow, you’ll need to repeat the same automation work again and again for each task, which in turn would become a repetitive manual process itself.

Standard workflows will also guide the team to avoid workarounds and to make the requested changes according to the design. Implementing a workaround can sometimes be appealing; however, as Rod Michael’s saying goes:

“If you automate a mess, you get an automated mess.”

Standardizing workflows and staying committed to the standard design and way of working can be more difficult and time-consuming than implementing workarounds. However, the result will be simpler, and simplicity is the key to agility and reliability. It is worth remembering that “EASY” and “SIMPLE” are not the same!
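One way to picture “no room for workarounds” is a registry in which every request must map to a defined service category before anything runs. The sketch below is only an illustration: the category names and workflow step names are hypothetical, not our actual service catalogue.

```python
from enum import Enum


class ServiceCategory(Enum):
    # Hypothetical categories, invented for this example.
    NEW_VLAN = "new_vlan"
    NEW_FABRIC = "new_fabric"


# Exactly one standard workflow per category (step names are hypothetical).
STANDARD_WORKFLOWS = {
    ServiceCategory.NEW_VLAN: [
        "validate_request", "render_config", "deploy", "verify",
    ],
    ServiceCategory.NEW_FABRIC: [
        "validate_request", "allocate_addresses", "render_config",
        "deploy", "verify",
    ],
}


def plan(category_name: str) -> list[str]:
    """Return the standard workflow for a request, or refuse the request."""
    try:
        category = ServiceCategory(category_name)
    except ValueError:
        # An undefined category is rejected instead of being improvised.
        raise ValueError(
            f"'{category_name}' is not a defined service category; "
            "define it and its workflow first"
        ) from None
    return STANDARD_WORKFLOWS[category]


if __name__ == "__main__":
    print(plan("new_vlan"))
    try:
        plan("quick_fix")
    except ValueError as err:
        print(f"refused: {err}")
```

A request such as `"quick_fix"` is refused rather than handled ad hoc, so the only way forward is to define the service and its workflow.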

3. Create an entire pipeline

  • Operational activities such as upgrades, deploying a new device, increasing capacity, etc., will be defined, and the tasks will be placed in a pipeline.
  • The pipeline should cover configuration, testing, deployment, and post-deployment testing phases.
  • A rollback scenario will be generated for each activity.

WHY? — CI/CD stands for continuous integration/continuous delivery. First, let’s have a look at the wiki description:

“CI/CD bridges the gaps between development and operation activities and teams by enforcing automation in building, testing and deployment of applications. The process contrasts with traditional methods where all updates were integrated into one large batch before rolling out the newer version. Modern day DevOps practices involve continuous development, continuous testing, continuous integration, continuous deployment and continuous monitoring of software applications throughout its development life cycle.”

The definition is written for DevOps, since CI/CD initially emerged in that domain; however, the idea applies to infrastructure operations as a whole. Creating the entire pipeline will enable us to adopt CI/CD, which is a crucial milestone on our way to “Infrastructure as Code.”
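The phases named in this pillar (testing before deployment, deployment, post-deployment testing, and rollback) can be sketched in a few lines of Python. This is a simplified illustration under assumptions, not our pipeline: the `Change` class and `run_pipeline` function are hypothetical names, and the “device” is just an in-memory dictionary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Change:
    """One operational activity with its checks and its rollback scenario."""
    name: str
    pre_check: Callable[[], bool]
    deploy: Callable[[], None]
    post_check: Callable[[], bool]
    rollback: Callable[[], None]


def run_pipeline(change: Change) -> str:
    """Run a change through the pipeline and report how it ended."""
    # Pre-deployment test: refuse the change before touching anything.
    if not change.pre_check():
        return "rejected"
    try:
        change.deploy()
        healthy = change.post_check()  # post-deployment test
    except Exception:
        healthy = False
    if healthy:
        return "deployed"
    # A failed deployment or failed post-check triggers the rollback scenario.
    change.rollback()
    return "rolled_back"


if __name__ == "__main__":
    # Simulated device state standing in for real equipment.
    device = {"vlans": {10, 20}}
    snapshot = set(device["vlans"])

    change = Change(
        name="add vlan 120",
        pre_check=lambda: 120 not in device["vlans"],
        deploy=lambda: device["vlans"].add(120),
        post_check=lambda: 120 in device["vlans"],
        rollback=lambda: device.update(vlans=set(snapshot)),
    )
    print(change.name, "->", run_pipeline(change))
```

Because the rollback is part of the change definition, a change cannot enter the pipeline without one.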

Photo by Fahrul Razi on Unsplash

Trendyol culture is shaped around the motto: “One dream, one team.” One of those shared dreams for the system teams is, for sure, to achieve “Infrastructure as Code.”

Photo by Hannah Busing on Unsplash

Why automation is not an option but a necessity

The reason can be summed up in one word: Growth

  1. The business is growing, and the infrastructure should grow even faster so that it does not become a bottleneck. Automating infrastructure operations will provide agility and ensure scalability.
  2. The team is growing, and it takes time for new members to learn the design, gain theoretical knowledge and operational know-how, and work out how to handle each individual task.
  3. The infrastructure is growing to a size that would be too time-consuming and difficult to manage with the current way of working.

Some good side effects of automating operations:

  • Increase security

A high rate of business change brings new security threats. These threats can be mitigated by continuously updating the security measures within the pipeline for each service and activity.

Shifting from manual to automated work will reduce the risk of human error, which contributes to a more secure environment.

  • Lower OPEX

Less time and effort will be spent on configuration and maintenance activities.

  • Boost team motivation

Eventually, it will help the team rise in Maslow’s pyramid. Then, the network engineers will focus less on low-level activities and more on creative and strategic work.

Photo by Luke van Zyl on Unsplash

As a final note, let me highlight that network automation is a journey, not a destination. We have set our mindset and begun taking steps in this direction.

I hope you enjoyed the article. Please let me know your comments.

ezgi
