Uptime measurement in FINN.no
This blog post gives a brief history of how we have measured uptime from the start, why we think it is important, and how we work to improve uptime on a daily basis.
Our most important KPI?
For Norway’s largest online marketplace, and one of the media company Schibsted’s most profitable companies, it may seem bold to consider uptime the most important KPI. That view has nevertheless been upheld, even by members of the top management team.
High uptime is critical to users feeling that an online business is stable and predictable, and of course we would like our users to think that about FINN!
FINN, the company, was founded in March 2000, but the service was launched as early as the mid-90s as a project owned by 5 Norwegian newspapers, which were then the market leaders for classified ads on paper.
Today, FINN has around 450 employees, a yearly turnover of close to 2 billion NOK, and a profit margin of approximately 43%.
For employees and managers in FINN it’s now routine to follow uptime status, but this has not always been the case.
The history of uptime monitoring started as late as 2010. I was the newly appointed Head of Operations in FINN, and I happened to meet the then-CEO, Christian Prinzell Halvorsen, at our common bus stop in Oslo. Both of us were in a hurry, so it was a quick discussion, but I remember Christian half-stating, half-asking, “There has been quite a lot of downtime lately…?” This question immediately made me realize that 1) there was no measurement of uptime at the time, and 2) as the Head of Operations, I didn’t know the expectations of the organization. Both were very uncomfortable realizations!
How we established the goal of 99.9% uptime
First, I needed to find out what FINN’s ambition for uptime should be. I decided that the best way to start was to determine the cost of downtime in FINN’s different marketplaces. In 2010 those were: Motor, Real Estate, Jobs, Torget (Bits and Pieces), and Travel. I asked the people responsible for each of these markets what the cost of one hour of downtime would be. The answer at the time was approximately 300,000 NOK in total, and this figure was of great value to me when communicating FINN’s demands to service providers outside the organization (vendors).
Once we knew the cost of downtime, it was also easier to decide what the ambition for uptime should be. I presented 4 different options between 99.0% and 99.99%, each with a calculation of what it would cost to reach that ambition. The conclusion was 99.9%, meaning that less than 45 minutes of downtime per month would be acceptable. This goal later proved to be quite ambitious, but it has given us a clear target to strive for.
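To make the arithmetic behind such a target concrete, here is a minimal sketch of how an uptime percentage translates into a downtime budget. The function name and the intermediate option of 99.5% are assumptions for illustration; the post only states that the options ranged from 99.0% to 99.99%.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(uptime_percent: float, days: int = 30) -> float:
    """Return the downtime (in minutes) allowed over `days` days
    while still meeting the given uptime percentage."""
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # 99.5 is an illustrative intermediate option, not taken from the post.
    for target in (99.0, 99.5, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime -> {budget:.1f} minutes per 30 days")
```

With a 99.9% target this gives 43.2 minutes for a 30-day month and about 44.6 minutes for a 31-day month, which is consistent with the "less than 45 minutes per month" rule of thumb.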
OK, our goal is 99.9% uptime — what now?
First of all, we needed to start measuring. In 2010, we had no automated monitoring in place, so we had to start by using purely manual routines. The downtime calculations were done in a spreadsheet, and even though FINN.no consists of many services, we had to establish simple definitions to decide whether a problem should be defined as downtime. Examples of problems we defined as downtime: ad insertion does not work, payment for private customers does not work, or search does not work.
Second, we needed to teach the organization that uptime is important, and that 99.9% was truly something we wanted to achieve. While uptime measurements are perceived as important today, that was not the case back then. To drive this goal forward, we repeated both the goal and its ongoing status in the IT Operations newsletter for several years. We also had uptime as a bonus goal in the IT Operations department. Slowly, uptime became established as an important goal for the organization.
Time to improve the calculations
Early on, we counted each minute of downtime as “one minute” regardless of when the problem occurred. As traffic in peak hours is much higher than in the middle of the night, this clearly misrepresented the damage done to our users, and thus to our business. This is why we introduced the term value-weighted downtime, which increases or decreases the weight of downtime depending on when it occurs. Example: 60 minutes of downtime at 4am counts as 11 minutes, while 60 minutes of downtime at 9pm (peak) counts as 114 minutes.
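The idea can be sketched in a few lines of code. This is not our actual implementation; it is a minimal illustration that assumes one weight per clock hour. Only the weights for 04:00 and 21:00 are derived from the example above (11/60 and 114/60); every other hour falls back to a placeholder weight of 1.0, and the names HOUR_WEIGHTS and value_weighted_minutes are made up for this sketch.

```python
from datetime import datetime, timedelta

# Weight per clock hour. Only hours 4 and 21 come from the example in the
# text; all other hours use a placeholder weight of 1.0 (an assumption).
HOUR_WEIGHTS = {4: 11 / 60, 21: 114 / 60}
DEFAULT_WEIGHT = 1.0


def value_weighted_minutes(start: datetime, end: datetime) -> float:
    """Weight every minute of downtime by the clock hour it falls in."""
    total = 0.0
    current = start
    while current < end:
        # Split the incident at clock-hour boundaries.
        next_hour = current.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
        segment_end = min(next_hour, end)
        minutes = (segment_end - current).total_seconds() / 60
        total += minutes * HOUR_WEIGHTS.get(current.hour, DEFAULT_WEIGHT)
        current = segment_end
    return total


if __name__ == "__main__":
    night = value_weighted_minutes(datetime(2019, 1, 1, 4, 0), datetime(2019, 1, 1, 5, 0))
    peak = value_weighted_minutes(datetime(2019, 1, 1, 21, 0), datetime(2019, 1, 1, 22, 0))
    print(round(night))  # 11
    print(round(peak))   # 114
```

Splitting at hour boundaries means an incident that starts in a quiet hour and runs into a busy one is weighted correctly for each part, rather than by its start time alone.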
2014 — Automation and measurement improvements
Since the last major rewrite of the code in 2004, the technical architecture of FINN.no has gradually changed from a monolith to hundreds of microservices.
This has meant new challenges for Operations, but has given us a much more flexible and less vulnerable site. Previously, a minor problem could cause the whole site to go down. With microservices, the problem is isolated and only affects a small part of the system. This has also given us the opportunity to develop automated monitoring of the various services that FINN.no consists of.
Service Manager hired in 2014
We’ve gone from doing best-effort measurement of uptime in the early days, to hiring a Service Manager in 2014 — for the first time, we had someone who could focus full-time on improving our processes.
Core Business Services — CBSes
One of the first tasks for the Service Manager was to define the most important services FINN.no consists of. This was done in close cooperation with the business leaders, following two dimensions:
- Value for users and customers
- Economic and strategic value for FINN
The following 6 services were selected as critical based on these criteria:
- Ad insertion
- Ad Import (from external sources)
- Payment online
- Booking of airline tickets
Create automated monitoring of services
A dashboard showing the status of the CBSes was a key feature for success. The development teams owning the code for the CBSes all became responsible for defining uptime criteria for their service, and for developing and visualizing the result.
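As an illustration of how an uptime criterion could be turned into numbers, the sketch below collapses a series of periodic check results into outage intervals. It is a hypothetical example, not the code behind our dashboard: the Check type, the outage_intervals function, and the idea of one pass/fail result per check are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Tuple

# One check result: (timestamp, service_ok). Hypothetical format.
Check = Tuple[datetime, bool]


def outage_intervals(checks: Iterable[Check]) -> List[Tuple[datetime, datetime]]:
    """Collapse time-ordered check results into (start, end) outage intervals.

    An outage starts at the first failing check and ends at the next passing
    check. An outage still open at the end of the series is closed at the
    timestamp of the last check.
    """
    intervals: List[Tuple[datetime, datetime]] = []
    outage_start: Optional[datetime] = None
    last_ts: Optional[datetime] = None
    for ts, ok in checks:
        if not ok and outage_start is None:
            outage_start = ts
        elif ok and outage_start is not None:
            intervals.append((outage_start, ts))
            outage_start = None
        last_ts = ts
    if outage_start is not None and last_ts is not None:
        intervals.append((outage_start, last_ts))
    return intervals


if __name__ == "__main__":
    t0 = datetime(2019, 1, 1, 12, 0)
    results = [True, False, False, True, True]  # one check per minute
    checks = [(t0 + timedelta(minutes=i), ok) for i, ok in enumerate(results)]
    for start, end in outage_intervals(checks):
        print(start, end, (end - start).total_seconds() / 60, "minutes")
```

The resolution of the result is limited by the check interval, so a service checked once per minute cannot report outages shorter than that.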
The dashboard is clickable, and when problems occur in a service, log information is available. This feature also gives information about the duration of the problem. Example from the Login CBS here:
What affects a downtime calculation in 2019?
The downtime of an incident is calculated based on the following factors:
- Number of CBSes affected
- How traffic-intense the time of the incident is (peak vs low traffic)
- Duration from start to fix
- Which channels are affected (apps, web)
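This post does not spell out the exact formula, so the following is only a sketch of how these four factors might be combined, assuming a simple multiplicative model. The channel shares, the Incident class, and the incident_downtime function are hypothetical; the total of 6 CBSes is taken from the text above.

```python
from dataclasses import dataclass
from typing import FrozenSet

TOTAL_CBSES = 6  # number of Core Business Services named in the post

# Hypothetical share of traffic per channel; real shares are not given.
CHANNEL_SHARE = {"web": 0.5, "apps": 0.5}


@dataclass(frozen=True)
class Incident:
    duration_minutes: float   # duration from start to fix
    cbses_affected: int       # number of CBSes affected
    traffic_weight: float     # e.g. 11/60 at 4am, 114/60 at 9pm peak
    channels: FrozenSet[str]  # subset of CHANNEL_SHARE keys


def incident_downtime(incident: Incident) -> float:
    """Combine the four factors multiplicatively (an assumed model)."""
    if not 0 <= incident.cbses_affected <= TOTAL_CBSES:
        raise ValueError("cbses_affected must be between 0 and TOTAL_CBSES")
    cbs_fraction = incident.cbses_affected / TOTAL_CBSES
    channel_fraction = sum(CHANNEL_SHARE[name] for name in incident.channels)
    return (
        incident.duration_minutes
        * incident.traffic_weight
        * cbs_fraction
        * channel_fraction
    )


if __name__ == "__main__":
    incident = Incident(
        duration_minutes=60,
        cbses_affected=3,
        traffic_weight=1.0,
        channels=frozenset({"web", "apps"}),
    )
    print(incident_downtime(incident))  # 30.0
```

In this model, an incident that hits half of the CBSes on all channels for an hour at neutral traffic counts as 30 minutes of downtime. Other ways of combining the factors, such as weighting individual CBSes differently, are equally plausible.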
What have we gained by focusing on uptime?
- Increased awareness of service-quality in the organization, which in turn helps users
- Services are monitored 24/7 and warnings are automated. The responsible teams are instantly notified when incidents occur and can take appropriate action quickly
- Every incident is handled and documented in a standardized manner. The main focus is always to remove the root cause and learn from the incident
- Better transparency for external customers — e.g. real estate agents
Monitor, Measure, be transparent
As part of our constant effort to increase uptime, we use a standard company framework called “Incident Management in Schibsted”. Development teams are responsible for documenting and fixing incidents, and our Service Manager follows up and calculates downtime based on the incident reports.
Uptime measurements year by year
Are we done now?
We have come a long way since the beginning. Our uptime measurements now give us insight into the technical health of our systems.
But with FINN’s ~150 developers and 20+ development teams, systems are constantly changing. As a result, uptime measurement will also have to evolve. New CBSes may be added in the future, and we are always looking for ways to improve our processes.
Comments are welcome
If you have questions or comments about how we do uptime measurements in FINN, feel free to leave a note below!