The secrets of Agoda’s uptime — A day in the life of a NOC engineer

Rui Gong
Agoda Engineering & Design
Nov 15, 2021

Offering 2.5 million properties globally and receiving millions of visits to its site each day, Agoda is the place to build and deploy innovative technology. At Agoda, you can create a connected world via accommodation, flights, transportation, and more.

Just as travellers are excited to explore new territories, Agodans are excited to provide the best solutions. Travellers never sleep, and neither do we.

Having been part of the NOC team at Agoda for the last few years, I have experienced what it is like to work in an agile, fast-paced team in an organization that values data.

The NOC team operates 24/7, monitoring systems, integrations, APIs (Application Programming Interfaces), web performance, the network, and business activities. NOC engineers are the guardians of all technology products within Agoda. We make sure the experience of Agoda’s customers and partners is always optimal and trouble-free.

As a Sr. Manager, let me give you a glimpse of a day in a NOC engineer’s life.

Our day starts by ensuring that the booking trend, the systems, and the overall application stay within the agreed SLA (Service Level Agreement). We check seasonal holidays and news for every country on the map against the booking trend on our dashboard. We have a standard handover process at every shift change to pass on all relevant information, including anomalies that are okay to overlook.

The team also owns and manages Bugs@agoda (an internal tool). Agodans across the organization report and share their experiences when faced with an uncommon situation.

We use the information from Bugs to reproduce the issue, identify its impact, and escalate it to the owner to ensure the problem is resolved. Soon, the team will also start managing spot@agoda.com, where we will support the systematic product SLAs across the organization.

Let me take a real-life example to help you experience the excitement of working at this fast-paced company.

When bookings took a downswing

The day started with checking bookings via the real-time booking dashboard. The video wall, the central dashboard that we monitor for every data centre and platform, helps us ensure everything is on-trend.

But some things can be happening behind the scenes that might harm the system. We noticed a drop in bookings in our master data centre.

When NOC finds an issue in the system, we start by asking five main questions.

Q 1. What impact will it make on Agoda?

Q 2. When did the issue start?

Q 3. Who is the right team or key person to fix the issue?

Q 4. Why did the issue happen, and what is its root cause?

Q 5. How much damage has the issue caused, and how much more will it cause?

Figure 1: An actual war room at Agoda.

Based on the drop’s impact on Agoda’s business, we immediately called for a NOC war room. We figured that the fall was coming from a single origin. Noticing the alarming size of the drops, we moved our discussion to our powerful collaboration weapon — SMS. Our SMS notifies 100+ people, including the senior leadership team, technology team, and dev managers.

Figure 2: Example SMS that NOC will send out when having a significant incident

We have multiple dashboards that help us identify the cause of an issue. In this case, we used the dashboard that filters booking trends by data centre, country, partner, supplier, and more. We narrowed down the scope of the issue and found more evidence pointing to the root cause of the drop.

Figure 3: Example of the dashboard
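To make the "scoping down" step concrete, here is a minimal sketch of slicing a booking drop by one dimension at a time. It is an illustration only, not Agoda's dashboard logic: the function name `drop_by_dimension`, the record layout, and all the numbers are made up for this example.

```python
from collections import defaultdict

def drop_by_dimension(baseline, current, dimension):
    """Compare booking totals per value of `dimension`.

    Returns (value, expected, actual, percent_change) tuples,
    sorted so the largest drop comes first.
    """
    base_totals = defaultdict(int)
    cur_totals = defaultdict(int)
    for row in baseline:
        base_totals[row[dimension]] += row["bookings"]
    for row in current:
        cur_totals[row[dimension]] += row["bookings"]

    changes = []
    for value, expected in base_totals.items():
        actual = cur_totals.get(value, 0)
        pct = (actual - expected) / expected * 100 if expected else 0.0
        changes.append((value, expected, actual, round(pct, 1)))
    return sorted(changes, key=lambda change: change[3])

# Hypothetical sample data: the usual booking counts vs. the current window.
baseline = [
    {"data_centre": "dc-a", "country": "TH", "bookings": 100},
    {"data_centre": "dc-a", "country": "JP", "bookings": 80},
    {"data_centre": "dc-b", "country": "TH", "bookings": 90},
    {"data_centre": "dc-b", "country": "JP", "bookings": 70},
]
current = [
    {"data_centre": "dc-a", "country": "TH", "bookings": 40},
    {"data_centre": "dc-a", "country": "JP", "bookings": 30},
    {"data_centre": "dc-b", "country": "TH", "bookings": 88},
    {"data_centre": "dc-b", "country": "JP", "bookings": 69},
]

by_data_centre = drop_by_dimension(baseline, current, "data_centre")
by_country = drop_by_dimension(baseline, current, "country")
for value, expected, actual, pct in by_data_centre:
    print(f"{value}: expected {expected}, got {actual} ({pct}%)")
```

In this made-up data, slicing by data centre isolates the drop to one value, while slicing by country shows both countries falling by a similar amount. That contrast is the kind of evidence that tells you which dimension the problem lives in.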

Using the anomaly detection alert, an in-house tool built on a machine learning model, we observed strange behaviour in the booking trend.

Figure 4: Example of anomaly detection alert
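The in-house model itself is not described here, so the sketch below is a stand-in rather than the real tool. It shows the general idea of flagging a booking count that strays too far from its history, using a plain z-score check. The function name `is_anomalous`, the threshold of 3 standard deviations, and the sample numbers are all assumptions for illustration.

```python
from statistics import mean, stdev

def is_anomalous(history, observed, threshold=3.0):
    """Return True when `observed` is more than `threshold` standard
    deviations away from the mean of `history`.

    A simple statistical baseline, not a machine learning model.
    """
    mu = mean(history)
    sigma = stdev(history)  # needs at least two data points
    if sigma == 0:
        return observed != mu
    return abs(observed - mu) / sigma > threshold

# Hypothetical bookings for the same minute of the day in previous weeks.
same_minute_previous_weeks = [510, 495, 502, 488, 505, 498]

print(is_anomalous(same_minute_previous_weeks, 492))  # close to the usual level
print(is_anomalous(same_minute_previous_weeks, 310))  # a sharp drop
```

A production detector would also need to account for seasonality, holidays, and campaigns, which is part of why a learned model is a better fit than a fixed threshold.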

If we have an issue with a specific data centre or platform, NOC mitigates the impact. We usually move traffic, panic-stop experiments, stop the auto-cancellation process, reprocess the impacted bookings, and more.

Figure 5: Move traffic tools
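As a rough illustration of what "moving traffic" can mean, the sketch below drains one data centre and spreads its share across the remaining ones in proportion to their existing weights. This is an assumption about the general technique, not a description of Agoda's tooling: `drain_data_centre`, the data centre names, and the weights are hypothetical.

```python
def drain_data_centre(weights, unhealthy):
    """Return new traffic shares with `unhealthy` set to 0 and its
    traffic redistributed proportionally among the other data centres.
    """
    if unhealthy not in weights:
        raise KeyError(f"unknown data centre: {unhealthy}")
    remaining = {dc: w for dc, w in weights.items() if dc != unhealthy}
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("no healthy data centre left to take the traffic")
    shifted = {dc: w / total for dc, w in remaining.items()}
    shifted[unhealthy] = 0.0
    return shifted

# Hypothetical traffic split, in percent.
weights = {"dc-a": 50, "dc-b": 30, "dc-c": 20}
after_move = drain_data_centre(weights, "dc-a")
print(after_move)
```

In practice, a move like this would also have to respect the capacity of the receiving data centres, which the sketch ignores.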

After recognizing the application that might be causing the issue, we gathered forty experts from the engineering team and one business site lead from the impacted country.

This team of experts oversaw the investigation. We used a project called ‘Minecraft’ to help us detect the issue, and the ‘Timeline tool’, which shares data from every layer (front-end, back-end, infrastructure, network, and database), to help the owner check the changes made during that time and pin down the root cause. We were able to detect this within 10 minutes and proceed with the mitigation.

Figure 6: Topology service in Agoda
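The idea behind checking a timeline of changes can be sketched as a simple filter: list every change, from any layer, that landed shortly before the incident started, most recent first. The code below is only an illustration of that idea. The function `changes_before`, the 30-minute lookback, the record fields, and the sample events are all hypothetical and do not reflect how the Timeline tool is built.

```python
from datetime import datetime, timedelta

def changes_before(changes, incident_start, lookback=timedelta(minutes=30)):
    """Return changes made within `lookback` before `incident_start`,
    most recent first, as candidates for the root cause.
    """
    window_start = incident_start - lookback
    candidates = [
        change for change in changes
        if window_start <= change["time"] <= incident_start
    ]
    return sorted(candidates, key=lambda change: change["time"], reverse=True)

# Made-up change events across several layers.
incident_start = datetime(2021, 1, 1, 10, 0)
changes = [
    {"time": datetime(2021, 1, 1, 8, 15), "layer": "Database", "summary": "index rebuild"},
    {"time": datetime(2021, 1, 1, 9, 40), "layer": "Network", "summary": "routing policy update"},
    {"time": datetime(2021, 1, 1, 9, 55), "layer": "Back-end", "summary": "booking API deployment"},
    {"time": datetime(2021, 1, 1, 10, 20), "layer": "Front-end", "summary": "banner change"},
]

suspects = changes_before(changes, incident_start)
for change in suspects:
    print(change["time"].strftime("%H:%M"), change["layer"], change["summary"])
```

Putting every layer on one timeline matters because the team that owns the failing application is often not the team that made the change.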

With all the information gathered and the developers’ help in fixing the issue, Agoda was back up, and customer satisfaction was reflected in their continued bookings with us.

With every issue that makes a significant impact, there is always something to learn and room to improve. Each week we summarize all the problems faced to help enhance our performance.

We implemented the lessons learned from this issue a day later, with a few upgrades for better detection, quicker mitigation, and improvements to the product.

How NOC successfully works from home

If you see a NOC engineer’s desk, you will understand the challenge of working from home. A NOC engineer’s desk is laden with multiple screens displaying a legion of dashboards, something that seems impossible in a WFH scenario.

Nonetheless, remote work has not impacted our quality. We added more monitoring screens, set up non-stop conference calls, and implemented easy escalation tools. In addition, Agoda’s work-from-home allowance allowed each of us to set up a comfortable workstation.

The essential thing for our team is knowledge. We never stop learning. We are bold in asking experts for help, and we gratefully pass on our knowledge. We value prioritization. It helps make the best use of our efforts, increase our work efficiency, and stay on track while keeping stress at bay.

This team is an invaluable support to the company, ensuring travellers enjoy the complete Agoda experience. Beyond maximizing customer experience uptime, the team initiates ideas and experiments that take the organization to the next level, resulting in state-of-the-art operation and incident management tools.

Our team grows stronger every day through talent development and rewarding opportunities. If you want to become a part of the Guardians of Agoda, we have options for you! 👇🏻

Acknowledgments

Big thanks to the teams who helped write this article — Kittiporn, Anantkant, and Thammarith, for collating the shared experiences of NOC engineers. Special thanks to Max Panasenkov for initiating and reviewing this idea.