Solving the problem of serving empty article content at random

This is a repost from my company’s internal blog — so some things might not make perfect sense without knowing the context.

In this article I would like to share the story of how I managed to solve the problem of serving empty article content at random.
Let’s start with a description of the problem. For at least the last six months we had “occasionally” been presenting our visitors with pages that contained everything except the actual article content. We received a handful of reports about the problem from our users, but we were never able to reproduce it: by the time we opened the reported pages, the problem was either already gone, or it disappeared after executing “action=purge” (which re-parses the article).

When I started working on the ticket it already had over 100 comments, but still very little was known about the actual cause or the scope of the issue. We knew we were serving broken pages, but we didn’t know how often.

example of a broken page

I decided to learn more about the issue so I could rule out some of the potential causes and understand how severe it was. That’s where Logstash and Kibana came in very handy. I started logging occurrences where the actual content (to be injected into Oasis [the name of our skin]) was empty but the original wikitext was not (a minimal sketch of that check follows below the list). It turned out that:

  • the issue was pretty severe: at least 50k pages per hour (a very conservative estimate) were affected; some communities even built their own workarounds for this problem
  • the issue was not specific to a particular browser or user preferences, because I was receiving events from our failover datacenter, to which we only shadow production traffic (so it is agnostic to many variables)
clever workaround implemented by runescape.wikia.com community
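
For reference, the check behind those events was roughly this shape (a minimal sketch with illustrative names, not the actual code; error_log() and json_encode() stand in for our real logging pipeline that ships events to Logstash):

// Log whenever the parser produced nothing although there was wikitext to parse.
// The fields end up as searchable attributes in Logstash/Kibana.
function logEmptyArticleContent( $wikitext, $html, $pageId ) {
    if ( trim( $html ) === '' && trim( $wikitext ) !== '' ) {
        error_log( json_encode( array(
            '@message' => 'Empty article content for non-empty wikitext',
            'pageId' => $pageId,
            'wikitextLength' => strlen( $wikitext ),
        ) ) );
    }
}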

Besides that, I was now able to easily find broken pages and investigate them. I found out that all sorts of pages were affected:

  • long pages with tons of content
  • short pages with just one paragraph and an image

Their HTML content was indeed empty, but the comment with parsing statistics added by the MediaWiki Parser was always there, e.g.:

<!--
NewPP limit report
Preprocessor node count: 66/300000
Post‐expand include size: 1915/2097152 bytes
Template argument size: 112/2097152 bytes
Expensive parser function count: 0/100
-->

From that I knew that the Parser was actually running and doing some work, so my next step was to find out where exactly during the parsing process the content gets cleared out. To find that out, I built a state-of-the-art logger called SomethingToNothingLogger.

SomethingToNothingLogger exposes a method called logIfEmpty which is intended to be called right after any operation that could possibly clear out the content; it then checks whether the content got cleared out and logs it (take a look at the diff if you haven’t yet).
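
A minimal sketch of the idea (the real implementation is in the diff; error_log() stands in for our actual logging):

class SomethingToNothingLogger {
    // Call right after any operation that could possibly clear out the content;
    // logs whenever a non-empty input became an empty output.
    public function logIfEmpty( $before, $after, $operation ) {
        if ( $before !== '' && ( $after === '' || $after === null ) ) {
            error_log( "Content went from something to nothing in: {$operation}" );
        }
    }
}

It turned out that the content always gets cleared out in the method Parser::doMagicLinks: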

function doMagicLinks( $text ) {
    wfProfileIn( __METHOD__ );
    $prots = wfUrlProtocolsWithoutProtRel();
    $urlChar = self::EXT_LINK_URL_CLASS;
    $text = preg_replace_callback(
        '!(?:                            # Start cases
            (<a[ \t\r\n>].*?</a>) |      # m[1]: Skip link text
            (<.*?>) |                    # m[2]: Skip stuff inside HTML elements' . "
            (\\b(?:$prots)$urlChar+) |   # m[3]: Free external links" . '
            (?:RFC|PMID)\s+([0-9]+) |    # m[4]: RFC or PMID, capture number
            ISBN\s+(\b                   # m[5]: ISBN, capture number
                (?: 97[89] [\ \-]? )?    # optional 13-digit ISBN prefix
                (?: [0-9] [\ \-]? ){9}   # 9 digits with opt. delimiters
                [0-9Xx]                  # check digit
            \b)
        )!xu', array( &$this, 'magicLinkCallback' ), $text );
    wfProfileOut( __METHOD__ );
    return $text;
}

At first this method looked pretty scary to me, so I decided to have a drink and give it another try. I analyzed all the regexes and didn’t find anything that would raise a red flag. Besides, the content of that method hadn’t been modified in years, and in my experience software does not go bad with time. My next step was to log exactly what input is provided to preg_replace_callback and what the output is. I noticed that the output wasn’t an empty string but null, which according to the PHP documentation for this method indicates an error that has to be retrieved by calling preg_last_error. There are a few possible errors:

  • PREG_INTERNAL_ERROR
  • PREG_BACKTRACK_LIMIT_ERROR
  • PREG_RECURSION_LIMIT_ERROR
  • PREG_BAD_UTF8_ERROR
  • PREG_BAD_UTF8_OFFSET_ERROR
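
PHP back then had no preg_last_error_msg() (it only arrived in PHP 8), so turning the numeric code into one of the names above meant looking it up yourself. A small helper along these lines does the trick (an illustrative sketch, not our production code):

// Map a preg_last_error() code back to its PREG_*_ERROR constant name.
function pregErrorName( $code ) {
    $constants = get_defined_constants( true );
    foreach ( $constants['pcre'] as $name => $value ) {
        // only the PREG_*_ERROR constants are interesting here
        if ( $value === $code && substr( $name, -6 ) === '_ERROR' ) {
            return $name;
        }
    }
    return 'UNKNOWN_PREG_ERROR';
}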

I was hoping the error would be PREG_BACKTRACK_LIMIT_ERROR, which would be easy to “fix” by increasing pcre.backtrack_limit. However, I couldn’t reconcile that with the fact that the problem affected really short articles as well (and our apaches are not running under heavy load or high memory consumption, at least not unless I log on to them and start using vim to modify the code).

The actual error was a big surprise: PREG_BAD_UTF8_ERROR, which means the provided input is not valid UTF-8 (this error can only occur when the pattern uses the u (Unicode) modifier).
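
This is easy to demonstrate in isolation: feed any pattern with the u modifier a truncated multi-byte sequence and you get null back (a standalone example I put together, not our production code):

// "\xC3" opens a two-byte UTF-8 sequence that never completes,
// so the subject is invalid UTF-8 and the callback never even runs.
$badInput = "title=\"H\xC3\"";

$result = preg_replace_callback( '/x/u', function ( $m ) {
    return $m[0];
}, $badInput );

var_dump( $result );                                   // NULL
var_dump( preg_last_error() === PREG_BAD_UTF8_ERROR ); // bool(true)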

Given that I already had a collection of inputs causing this error, I created another state-of-the-art script which allowed me to find out exactly which part of the input was causing the problem. All the results were very similar:

title="H???"><img src="...
title="H?5^"><img src="...
title="H?k?"><img src="...

For some reason the values of the title attribute were garbage. I tracked down that the surrounding code comes from the Thumbnails extension, specifically from the Thumbnail_image.mustache template. From there I was able to find out that we process those templates with MustacheService, which is a fairly simple wrapper for the PHP module php-mustache, which internally uses libmustache. At this point I added logging directly in our wrapper to see whether we supply correct data to the Mustache module and receive garbage back, and that turned out to be exactly the case.
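
The check I added boiled down to validating the wrapper’s output (a sketch: the render() call is a hypothetical stand-in for our wrapper’s API, while mb_check_encoding() is a real PHP function):

function renderTemplateChecked( $mustacheService, $templatePath, array $data ) {
    // render() stands in for whatever MustacheService actually exposes
    $html = $mustacheService->render( $templatePath, $data );
    // $data going in is valid UTF-8, so invalid output means the module garbled it
    if ( !mb_check_encoding( $html, 'UTF-8' ) ) {
        error_log( "php-mustache returned invalid UTF-8 for template: {$templatePath}" );
    }
    return $html;
}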

So the Mustache module was most likely to blame. After I got familiar with it, I found out that we were using a really old version, and that since then there had been plenty of fixes, specifically around memory leaks, which could explain exactly the kind of garbage, invalid-UTF-8 content we were seeing. Given that I was still not able to reproduce the problem, I decided to:

  • verify that the newer version is backward compatible (Mustache syntax but also the API)
  • verify that it works correctly on a devbox

The next step was to try it on one of our apaches. After upgrading the module on ap-s12 [one of our webservers], Kibana reported that the problem was gone for that particular machine (yay!), however Nagios reported that Apache was down (nah!), so it wasn’t quite what I was hoping for. Fortunately, after a little bit of tweaking I was able to get Apache up. Success!

After about 6 hours of no errors on ap-s12, we went sitewide with the newer versions of php-mustache and libmustache.

There are quite a few things I was able to learn while working on this:

  • if you try really hard, you can fill the hard drive on logstash-s3 completely within minutes
  • instead of guessing, it’s better to gather data and make informed decisions (even if it does not resolve the problem right away, it still feels like you are actually making progress and managing the situation)
  • the root cause of a problem can sometimes be in a place that seems completely unrelated
  • do not blame other people’s code just because it looks a bit funky (the regexes in doMagicLinks); perhaps they had reasons to do certain things a certain way

Next steps: create wrappers for methods such as preg_replace_callback so that an exception is thrown when an error occurs.
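
Something along these lines (a sketch of the planned wrapper, not final code):

class PcreException extends Exception {
}

// Same contract as preg_replace_callback(), but a PCRE failure becomes
// an exception instead of a silent null.
function pregReplaceCallbackOrThrow( $pattern, $callback, $subject ) {
    $result = preg_replace_callback( $pattern, $callback, $subject );
    if ( $result === null && preg_last_error() !== PREG_NO_ERROR ) {
        throw new PcreException( 'preg_replace_callback failed with error code ' . preg_last_error() );
    }
    return $result;
}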

I also wanted to call out Frank and LCF [our ops] here for providing really great support and being very responsive.