Semantic logging is great but…
Consistency is king.
Once a upon a time, I joined JustGiving. I was incredibly excited to hear they embraced interesting technologies and products like Elasticsearch, AWS, GOCD, Chef, Docker and Kubenetes.
I was also excited to uncover how they handled logging, since at one of my previous roles I helped introduce Elasticsearch. I was told they kicked ass (paraphrasing of course).
After going through a series of new starter on-boarding sessions, I was able to spread my wings at JustGiving. Being me, the first thing I did was to look for Kibana dashboards. I was interested in finding data that broke down errors by application, and ideally then further grouped by type of error.
I couldn’t find this mystical dashboard, so off I went to create it… This is where my shining vision of a perfect logging data oasis was brought back down to reality.
What did I find:
Inconsistency… inconsistency… inconsistency. Different fields that represented the same data. For example, application name took many shapes and forms such as “app-name”, “jg.application.name”, and “loggername”.
Not only was application name in multiple fields spread across multiple indexes. The contents of those fields lacked consistency. Some apps had the default “log4netlogger” used with Log4Net, some used assembly name, some used chef cookbook names.
Even worse were the logs that didn’t have an app name, which meant that generic log messages and errors just got lost in an Elastic stack black hole.
The application name was just the tipping point, there were other things wrong like log level not being consistent (value and name), exceptions floating about in messages but as level=Info and exceptions details missing so (level=ERROR) but without any additional information. By far the worst issue was error messages dropped from elastic search because of validation/JSON issues, so they didn’t even make it to the big black Elastic stack hole.
What did I do:
Raise & discuss. After bringing the issues up with development team leads, I was pointed to what was supposed to be the ideal way of doing semantic logging (in a relatively new app). I began my evening quest and it wasn’t long before I spotted problems, this was one of the many apps where error messagse were being dropped from Elasticsearch… not only that, but some log messages contained the app name and some did not…
After speaking to the dev team, I quickly found out that they weren’t using Elastic stack to debug issues because of all these problems, and they switched to using a service called Papertrail because all the logs were there… I was extremely hurt, me and Elasticsearch go way back (#sexysearch… wait until you get to the end) and I wasn’t just about to give up on it.. considering all papertrail does is log lines in a text file (just hatting)…. so off I went to demonstrate why the Elastic stack rules.
I like leading by example so I managed to squeeze logging into the project/story I was working on (week 1 in at JustGiving and I was coding, raising bugs and even dived into the existing monolith application), and I wanted to do it properly! I used Serilog (which was already used by other projects) and I re-used the request logging of another project which was a simple owin middleware, which of course meant nothing to papertrail.
I made sure exceptions were logged correctly. I debugged bad requests thrown by Elasticsearch from our logstash indexers. This was easily fixed by using the serilog exceptions nuget package.
I then made sure when we created the serilog logger that it was already enriched with the application details (and other useful information) so that every log message sent to Elasticsearch had these properties.
This app became our gold standard for CIV1 (AB Service) – the go-to project used by new projects as a reference when building on the old continuous integration platform.
For the new continuous integration platform (a.k.a. CIV2) I created a logging library with the Front End principal that returned a logger object which included all of the core fields that would help teams find their logs like app-name, log level and exceptions consistently.
Now having a few reference points and a sample app logging correctly (and a simple wiki page on the core values to log). I took it upon myself to show the value to the devs who preferred papertrail. I created sexy dashboards, grouping URLs with HTTP code, the visualisations identified users who where logged in and who logged in mid flow.
It wasn’t just about visualisations though. I showed that with the correct logging, I could explain why people experienced issues and why some of the visualisations were wrong (old cookies, missing data, bugs, same ID used for multiple flows etc…)
I was able to create funnels and explain with a lot of detail why users were within flows and we addressed many issues quickly because we found the root causes. I did sessions with some teams and individuals too. I haven’t got around to doing a pizza meeting yet… coming soon.
It’s not all rosey yet….
For the legacy platform we don’t have a dev/staging Elastic stack (we do for CIV2). So that’s a big risk for legacy apps. We were running ELK 1.x version: We have since upgraded to 5.3 and we are actually moving to Logz.io, a hosted Elastic stack solution (some great work by our Infrastructure team).
Using Elastic stack, I was able reduce JG error rates in production by 84%, as creating my dashboards highlighted which apps were showing errors the most. It turned out the app which was producing 1000 errors a minute (yes 1000 a minute) 24/7 was an old ops app which had lost its permissions required to perform tasks. The app was responsible for killing misbehaving apps… Ironic, isn’t it?
Not only that, but now spotting apps which start causing errors because of code changes is far easier, making identification and roll-back a simple task.
For CIV2 we have things a little more controlled:
In Elastic stack we track errors, per app, per environment. we even detect apps not conforming to logging specs (most of the apps are infrastructure related apps)
We also track deployments to environments etc…
And nothing would be complete without errors:
If you are getting to the end of this post and you are coming to the realization that this was written by an SDET and not a DevOps engineer, then yes, production feedback is one of the key metrics/pillars. Logging should be one of the many feedback loops of quality!
(In case you didn’t get the reference to #SexySearch… It was my technical description of ElasticSeach back in 2014… 3:14)