Improve your logs, improve life

Nelson Marcos
3 min read · Dec 10, 2018


Scene 1: the website is not responding. When people type the URL and hit Enter, they wait for a while and then an error page is shown.

1 minute of outage.

After seeing the alarm on the dashboard, Ops starts to debug it. Since the app is running on a PaaS, they just check whether the containers are up and look into the logs to see if they can catch anything.

10-Dec-2018 - Operation timeout.

10 minutes of outage.

The Ops team sees they can't do much since there is no procedure for this situation. It's time to escalate the issue to level 2 support or to the SRE.

15 minutes of outage.

The SRE starts doing everything the Ops team did. Not because he/she didn't trust the Ops team, but because he/she must be sure no lead was missed. Even the smallest one.

However, the Ops team did everything right. The SRE looks for any recent change on the system but finds nothing. Then he/she goes to the dashboard to see if something else can give a direction.

After checking some dependencies, the SRE finds something. The database servers are under a higher load than usual. Since the log output is an "Operation timeout", it makes sense. The SRE asks the DBA for some help. The DBA identifies some queries taking more than 60 seconds to finish, and also gives some suggestions on how to improve them.

45 minutes of outage.

The SRE starts looking for a developer to report the problem and pass along the suggestions for improving the queries. The developer stops working on an important project to fix the issue. He/she changes the queries, runs the tests and everything works fine. It's time to deploy to the QA environment to make sure everything runs smoothly. Nothing breaks. While the developer prepares the deploy to production, the SRE makes his/her way to communicate the change through the appropriate channels.

90 minutes of outage.

Everything is fine. It seems that, with the database growing, the non-optimized queries got slower and slower, and today they started to take more than 60 seconds, resulting in the "Operation timeout", which is now fixed.

This month, availability will be 98.00%.

Scene 2: the website is not responding. When people type the URL and hit Enter, they wait for a while and then an error page is shown.

1 minute of outage.

After seeing the alarm on the dashboard, Ops starts to debug it. Since the app is running on a PaaS, they just check whether the containers are up and look into the logs to see if they can catch anything.

Here is the difference:

10-Dec-2018 - The query "select preferences from users where user_id=98615" took more than 60 seconds. Operation aborted.
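
To make the idea concrete, here is a minimal sketch of how an application could emit a log line like that, assuming a Python service using the standard logging module and a DB-API style connection. The helper name fetch_preferences, the 60-second threshold constant and the exact wording are illustrative, not taken from the system in the story.

```python
import logging
import time

logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
logger = logging.getLogger("webapp")

QUERY_TIMEOUT_SECONDS = 60  # assumed threshold, matching the story


def fetch_preferences(conn, user_id):
    """Fetch a user's preferences; on failure, log enough context to debug it later."""
    query = "select preferences from users where user_id=%s"
    start = time.monotonic()
    try:
        cursor = conn.cursor()
        cursor.execute(query, (user_id,))
        return cursor.fetchone()
    except Exception:
        elapsed = time.monotonic() - start
        if elapsed >= QUERY_TIMEOUT_SECONDS:
            # Scene 2 style: say which query and which input timed out,
            # not just "Operation timeout".
            logger.error(
                'The query "%s" (user_id=%s) took more than %s seconds. Operation aborted.',
                query, user_id, QUERY_TIMEOUT_SECONDS,
            )
        raise
```

The point is that the query, the parameter and the threshold travel with the error, so whoever reads the log does not need the source code to know what failed.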

10 minutes of outage.

The Ops team sees they can't do much since there is no procedure for this situation. It's time to escalate the issue to level 2 support or to the SRE. But this time, they have more info to give to the SRE team.

15 minutes of outage.

The SRE gets the errors and immediately checks with the DBA if there is something he/she can do. The DBA informs that the query needs optimization, already sends the new query to the SRE, and says he/she will check if there is anything else to be done.

After hanging up with the DBA, the SRE contacts the developer team to report the situation and to say he/she already has the optimized query. The developers reply that they can apply the fix; they run the tests and everything works fine.

60 minutes of outage.

This month, availability will be 99.8%.

30 minutes separate scene 1 from scene 2. Even so, imagine how much money Amazon or YouTube might lose during 30 minutes.

Logs REALLY matter. They can be the difference between a major issue and a minor one. If you're not sure whether your app has good logs, ask your Ops team or your SRE, if you have one.

If you don't have an Ops/SRE team, ask yourself: "If I see this error/exception one year from now, can I identify it without looking at the source code?" If your answer is yes, then you have good logs.
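
One way to apply that test when you write code: whenever you catch an error, log what was being done, for whom, and with which inputs, not only the fact that it failed. A short sketch, again in Python; storage, report_id and the message wording are made-up illustrations:

```python
import logging

logger = logging.getLogger("webapp")


def load_report(storage, report_id, user_id):
    try:
        return storage.read(report_id)
    except TimeoutError:
        # Scene 1 style: true, but useless one year from now.
        # logger.error("Operation timeout")

        # Scene 2 style: identifiable without opening the source code.
        logger.error(
            "Reading report %s for user %s timed out against the storage backend",
            report_id, user_id,
        )
        raise
```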
