Turning logs into bug reports

Maciej Brencz
Legacy Systems Diary
5 min read · Jan 30, 2017

Every second, millions of lines of code (from Perl maintenance scripts and API services written in Java, through Node.js powering our mobile skin, to MediaWiki running on PHP7) are involved in producing what Wikia users see on their monitors and mobiles.

Even with the most excellent QA team, regression tests and great code coverage, bugs will happen. An error hiding in one of those lines mentioned above will eventually hit production. Users will find the edge case that nobody ever thought was likely to happen. Network and hardware issues will happen. In short: errors in production are inevitable. However, what you can (and actually should) do is report them, store them in a centralised manner and analyse them.

Meet The Nordic Cat aka Error Reporter

The Error Reporter described below is available in our public GitHub account.

Step 0: make your code log what’s happening

How many times have you been debugging an issue and faced a wall of code behaving like a black box, in complete silence? How many times have you been investigating performance or load problems with no data from the past at all? In all these cases logs would have been extremely helpful.

Adding logs in the code you develop or fix is as crucial as adding comments. Logs are the only way to let the code say what it is doing, what went wrong and why. And when you get a report of a bug that took place two days ago, logs are usually the only way to understand the problem and find the root cause. Or you can use the nearby time machine, enjoy the journey back to the past and add logs before the issue takes place.

And last, but not least: adding error logs forces you to think pessimistically about your code. What can go wrong will go wrong, and it's really worth logging such cases.
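As an illustration of that mindset, here is a minimal Python sketch (the function names and fields are hypothetical, not Wikia's actual code): a risky operation is wrapped so that a failure is logged together with the context needed to debug it days later.

```python
import logging

logger = logging.getLogger("video-import")

def fetch_video_metadata(provider, video_id):
    """Hypothetical risky call: pretend only one provider works."""
    if provider != "youtube":
        raise ValueError("unsupported provider: %s" % provider)
    return {"id": video_id, "title": "stub"}

def import_video(provider, video_id):
    try:
        return fetch_video_metadata(provider, video_id)
    except Exception:
        # Pessimistic logging: record not just that something failed,
        # but the context needed to debug it two days later.
        logger.exception(
            "video import failed",
            extra={"provider": provider, "video_id": video_id},
        )
        return None
```

The `extra` fields end up as structured attributes on the log record, which is exactly what makes the logs queryable later.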

Step 1: store

Our logging infrastructure is quite typical: our apps send JSON-formatted logs to local syslog daemons, which push them to Logstash, which in turn ships them to Elasticsearch for storage. We use Kibana to browse the ~2,000 million log messages collected every 24 hours.
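The first hop of that pipeline can be sketched in Python, assuming a Linux box with the local syslog socket at `/dev/log` (the JSON field names are illustrative, not our exact schema):

```python
import json
import logging
import logging.handlers

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line, ready for Logstash to parse."""
    def format(self, record):
        payload = {
            "@timestamp": self.formatTime(record),
            "program": record.name,
            "severity": record.levelname,
            "message": record.getMessage(),
        }
        return json.dumps(payload)

def make_app_logger(name):
    # /dev/log is the local syslog socket on Linux; from there the line
    # travels syslog -> Logstash -> Elasticsearch.
    handler = logging.handlers.SysLogHandler(address="/dev/log")
    handler.setFormatter(JsonFormatter())
    app_logger = logging.getLogger(name)
    app_logger.addHandler(handler)
    app_logger.setLevel(logging.INFO)
    return app_logger
```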

Step 2: analyze

Every day our MediaWiki-based application generates 650+ million log messages. You can hire protein resources to process these logs, or use the silicon power in your servers and analyze them automatically. We chose the latter option 😉

Error Reporter is run every hour and performs the following:

  • Elasticsearch is queried for specific kinds of log messages (e.g. MediaWiki app fatal errors from the production data center, SQL errors reported by MediaWiki, Java exceptions) that have been reported since the previous run
  • The tool is not limited to querying Elasticsearch; for instance, we query the JSON API provided by Anemometer to get database queries that can be improved.
  • Results are filtered (we can, for instance, remove entries that do not have all required fields set or that come from a development environment)
  • The remaining entries are normalised (we remove the varying parts of the message, for instance the number of seconds in timeout errors or URLs in HTTP requests)
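The normalisation step can be sketched with a handful of regular expressions (the patterns below are illustrative, not the actual rules we use):

```python
import re

# Illustrative rules only; each replaces a varying fragment with a fixed
# placeholder so that repeats of the same error become identical strings.
NORMALIZATION_RULES = [
    (re.compile(r"\d+(?:\.\d+)? seconds?"), "X seconds"),  # timeout durations
    (re.compile(r"https?://\S+"), "<url>"),                # request URLs
    (re.compile(r"\b\d{4,}\b"), "N"),                      # long numeric ids
]

def normalize(message):
    for pattern, placeholder in NORMALIZATION_RULES:
        message = pattern.sub(placeholder, message)
    return message
```

For example, `normalize("Query timed out after 30.5 seconds")` yields `"Query timed out after X seconds"`, so every timeout of this query collapses into a single group.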

These entries are then passed to a map-reduce step:

  • Map part: a hash is calculated from each normalized entry. We include the environment here as well, i.e. we report issues from production and staging environments separately.
  • Reduce part: entries are grouped by the calculated hash; a counter of entries sharing the same hash is maintained here as well
  • Unique entries are then passed to the formatter, which generates nice human-readable output with the message, information about the host the issue was reported from, the backtrace, a link to Kibana filtered to these specific logs, and a link to the request trace.
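The map-reduce step above can be sketched in a few lines of Python (illustrative only; the real tool's hashing details may differ):

```python
import hashlib
from collections import defaultdict

def entry_hash(env, normalized_message):
    # Map part: identical errors (after normalisation) from the same
    # environment collapse to the same hash.
    key = "%s:%s" % (env, normalized_message)
    return hashlib.md5(key.encode("utf-8")).hexdigest()

def group_entries(entries):
    # Reduce part: group by hash, keeping a counter and one sample entry
    # for the formatter to render later.
    groups = defaultdict(lambda: {"count": 0, "sample": None})
    for entry in entries:
        key = entry_hash(entry["env"], entry["message"])
        groups[key]["count"] += 1
        groups[key]["sample"] = entry
    return dict(groups)
```

Including the environment in the hash is what keeps production and staging occurrences of the same error in separate groups.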

Step 3: report

We use Jira API to report issues found in the logs.

  • Each unique entry generated as the result of step #2 has a hash. We use this hash to detect already reported issues (we simply print it in the footer of the bug report description).

If there are no matches:

  • We check whether the occurrence counter for the given entry is above a threshold, which varies between different kinds of reports (e.g. for PHP errors it's at least 5 occurrences an hour, while a notice needs to be logged at least 2000 times to be reported). Based on the backtrace and the issue type we identify the code component, which is set when the issue is reported to Jira, so the owning team can quickly spot issues that impact their work.

If there are matches:

  • This means that the issue has already been reported.
  • We take a look at the ticket status and reopen closed tickets, as the problem is still there (we exclude tickets closed as duplicates or marked as won't fix). We also update one of the ticket fields (which we call ER Date) to indicate when the given issue last occurred.
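The reporting decision described above boils down to a small piece of logic. Here is a hedged Python sketch: the two threshold values are the examples from this post, everything else (field names, status strings) is made up, and the Jira API calls themselves are omitted:

```python
# Thresholds per report kind (occurrences per hourly run); "php-error"
# and "php-notice" values come from the post, the keys are illustrative.
THRESHOLDS = {"php-error": 5, "php-notice": 2000}

def decide_action(group, existing_ticket):
    """Decide what to do with one unique entry.

    group: {"kind": ..., "count": ...}
    existing_ticket: None if the hash was not found in Jira, otherwise
    a dict like {"status": ..., "resolution": ...}.
    """
    if existing_ticket is None:
        # New issue: only report it once it crosses the threshold.
        if group["count"] >= THRESHOLDS.get(group["kind"], 1):
            return "create"
        return "skip"
    if (existing_ticket["status"] == "Closed"
            and existing_ticket.get("resolution") not in ("Duplicate", "Won't Fix")):
        # The problem is still there: reopen the ticket.
        return "reopen"
    # Otherwise just bump the ER Date field on the existing ticket.
    return "update-er-date"
```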

Once every unique entry has been processed this way, the hourly run is complete.

An example of an automated report

Error Reporter has served us well for two years now and has reported 600+ tickets. It also prevents fatal errors from reaching production, as we run checks on the staging environment in addition to production.

At the beginning this tool handled only PHP errors, warnings and DB errors reported by MediaWiki. The current version covers:

  • PHP fatal errors, exceptions, warnings and notices from the MediaWiki application
  • This includes reporting possible CSRF issues (we automatically report cases when certain actions are performed on the backend without the CSRF token being checked)
  • We also report API responses from PHP that have no caching policy set and receive heavy traffic.
  • Exceptions from our Java-powered services
  • Errors from mobile-wiki skin backend (Node.js)
  • DB errors reported by MediaWiki
  • Database performance problems reported by Anemometer (we query Anemometer’s JSON API to get them)
  • Issues reported by various maintenance scripts that check, for instance, whether all video providers that we support are still used by our users when they create content
  • Issues found by index-digest that are pushed via ELK stack

So, as you can see, analysing logs is not only about finding errors in the code itself; it also helps find security and performance problems, and issues with legacy features.

Happy Logging!

Originally published at engineering-blog.wikia.com on January 30, 2017.


A Poznań native for generations, passionate about his home city and the Far North / Enjoys investigating how software works under the hood