The problem that wasn’t there — and the Bosun alerts that were
When her team started receiving notifications about Bosun alerts they hadn’t set up, experienced ghost-hunter Annette Wilson began to investigate…

Editor’s note: the first part of this article was originally published in November 2017. At that time, Annette’s story ended in disappointment; the trail of the ghostly alerts she was tracking had gone cold. In November 2019, the mystery described below was finally solved and this post was updated with the addendum ‘Getting there one ghost at a time’.
Here at Skyscanner we now make heavy use of Bosun to detect when something goes wrong in our systems and alert us. Bosun gives us a powerful way to express which patterns of metrics indicate a service failure that might need human intervention, and it can dispatch notifications via multiple channels. It works alongside OpenTSDB, which can efficiently store and query the large volumes of metrics our systems record about their health and behaviour. But ever since we started using Bosun, we’ve intermittently received strange alerts that we can’t account for.
Recently I was involved in updating a number of services to send their metrics to OpenTSDB and configure their alerts in Bosun. While it mostly went smoothly, one problem in particular proved difficult to get to the bottom of. Every so often we’d get odd notifications telling us that an alert we had defined was no longer receiving valid metrics, so Bosun could no longer say whether it was in a healthy or unhealthy state. The problem was that these alerts were ‘ghosts’: they should never have existed in the first place.
Metrics and alerts
OpenTSDB and Bosun share a concept of tags, which applies to both metrics and alerts. For example, you could have a metric called diskspace.freePercent with tags named dc (for datacenter), host and disk. Every minute, a metrics collection program (such as Diamond) running on our machines will record values for this metric, tagged to identify the datacenter the machine resides in, the hostname of the machine and the disk we’re measuring the space on. The combination of a metric name and a set of tags identifies a single number that can change over time. We can graph it and use it for alerting. We might denote this metric for a particular disk on a particular machine as diskspace.freePercent{ dc=uk2, host=webserver29, disk=root }. All these metrics are stored in OpenTSDB.
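To make the shape of that data concrete, here’s a minimal sketch (in Python, using the requests library) of how a collector might push one such data point to OpenTSDB’s HTTP write endpoint. The host name is a placeholder, and this is an illustration of the data model rather than how Diamond actually submits metrics.

import time
import requests

# One data point: a metric name, the tags that identify the series, a timestamp
# and a value. Together the metric name and tags name a single time series.
datapoint = {
    "metric": "diskspace.freePercent",
    "tags": {"dc": "uk2", "host": "webserver29", "disk": "root"},
    "timestamp": int(time.time()),
    "value": 63.1,
}

# OpenTSDB accepts writes over HTTP at /api/put; the address is a placeholder.
response = requests.post("http://opentsdb.example.internal:4242/api/put", json=datapoint)
response.raise_for_status()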

Bosun allows us to define matching alerts. An alert is basically a rule: when this metric is in this state, flag up the problem and let a human know. Alerts are identified just the same way metrics are: they have a name and a set of tags, such as diskspaceLowAlert{ dc=uk2, host=webserver29, disk=root }. Bosun also lets us define a whole set of alerts at once: we can tell it that when any disk on any machine goes below 20% free, it should notify us with a warning. Bosun asks OpenTSDB for all the diskspace.freePercent metrics across all tags. OpenTSDB tells it what metric values exist and what tags they have, and Bosun creates a matching alert for each metric. Each diskspaceLowAlert can be in one of three states: normal, when there’s plenty of free disk space; critical, when disk space falls below our 20% threshold; or unknown, if for example the machine fails altogether and stops reporting the metric.
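To illustrate just the evaluation part, here’s a toy Python sketch (not Bosun’s actual implementation) in which each combination of tags becomes its own alert instance and is classified against the 20% threshold. The unknown state comes from separate bookkeeping, described below.

# Toy per-series evaluation: one alert instance per tag combination.
CRIT_THRESHOLD = 20.0  # percent free, matching the 20% example above

def classify(latest_values):
    """Map each series (identified by its tags) to an alert state."""
    states = {}
    for tags, value in latest_values.items():
        states[tags] = "critical" if value < CRIT_THRESHOLD else "normal"
    return states

# Illustrative values for two disks on one machine.
latest_values = {
    ("dc=uk2", "host=webserver29", "disk=root"): 63.1,
    ("dc=uk2", "host=webserver29", "disk=data"): 12.4,
}
print(classify(latest_values))
# 'unknown' arises when a series Bosun has seen before stops appearing in the
# query results at all; that bookkeeping is covered in the next section.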
Here’s an example of some servers reporting disk-space metrics with a very simple alert defined on them:

Now, we didn’t tell Bosun about all of these alerts individually. We never gave it a list of all the disks on all the machines. It determined this itself by looking at OpenTSDB and seeing what metrics exist that match the pattern we specified. To figure out when an alert becomes unknown, Bosun remembers all the matching metrics it has seen in the past. When Bosun notices that it used to see a metric with some combination of tags, but hasn’t seen it again for long enough, it warns us with an unknown alert. So if our example above failed, Bosun might notify us that diskspaceLowAlert{ dc=uk2, host=webserver29, disk=root } is now in the unknown state. So far, so good.
Our problem was that once or twice a week, Bosun would announce to us that some alert like diskspaceLowAlert{ dc=uk2, disk=root } was now unknown. This was puzzling, because there was no host tag. No metric had ever been recorded with this combination of tags, nor did our alert definitions query for metrics with this combination of tags, so why was Bosun complaining that it had gone missing? We’d get this for all sorts of different alerts, missing different tags, seemingly at random. What was going on?
Lots of places to look
Now, this was a tricky problem to investigate. It only showed up once or twice a week, on unpredictable metrics. We couldn’t be sure where the problem might be: Are our applications mis-recording their metrics, occasionally dropping off a tag? Are the systems in between that deliver the metrics to OpenTSDB corrupting them? Is OpenTSDB storing them incorrectly? Is Bosun’s memory of which metrics it has seen getting corrupted? All of this is quite tricky to answer when we’ve got no reliable way to reproduce the problem on demand. But nobody wants to get woken up at night to fix something that isn’t broken!

When something goes wrong, especially when you’re working with an unfamiliar technology, it’s a very good starting point to assume you made a mistake. So our first guesses were that we’d done something daft in configuring Bosun or in configuring the systems that record our metrics. But we checked and rechecked these and couldn’t see anything that could possibly cause this transient behaviour.
The next important step is to narrow down where the problem is occurring. We could be pretty confident the problem wasn’t happening after the notifications left Bosun; after all, we received the weird alerts over all the channels Bosun used to notify us. So we can ignore the notification channels on the right-hand side of the diagram. But there’s an awful lot going on across the rest of the diagram to worry about.
At Skyscanner we run a fairly old version of Bosun with a bunch of modifications specific to getting it to work with our other systems. We also have a layer of indirection through our Kafka-based data platform before the metrics actually arrive in OpenTSDB. Did we introduce bugs in either of these places? Thankfully it’s pretty easy to at least rule out one or the other. We can simply look at the data stored in OpenTSDB and see if we ever have metrics stored with missing tags. If OpenTSDB has values stored for these metrics with tags missing, we can say the problem is somewhere on the left side of the diagram, up to or including OpenTSDB. If OpenTSDB does not have metrics with the tags missing, we know the problem must have been introduced on the right side of the diagram, in OpenTSDB or Bosun. Upon inspection there was absolutely no sign of missing tags, so we concluded that the problem lies either in OpenTSDB or Bosun.
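That check is easy to script against OpenTSDB’s query API. Here’s a rough sketch of the idea (the endpoint address is a placeholder, and this isn’t the exact script we used): query for the metric grouped by all three tags and flag any returned series that is missing one of them.

import requests

# Ask OpenTSDB for the last day of diskspace metrics, grouped by all three tags.
query = {
    "start": "1d-ago",
    "queries": [{
        "aggregator": "sum",
        "metric": "diskspace.freePercent",
        "filters": [
            {"type": "wildcard", "tagk": tagk, "filter": "*", "groupBy": True}
            for tagk in ("dc", "host", "disk")
        ],
    }],
}
response = requests.post("http://opentsdb.example.internal:4242/api/query", json=query)
response.raise_for_status()

# Report any returned series that is missing one of the tags we expect.
for series in response.json():
    missing = {"dc", "host", "disk"} - set(series["tags"])
    if missing:
        print("missing tags", missing, "in series", series["tags"])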

At this point, I tended to suspect Bosun. As far as I understood, we had an off-the-shelf version of OpenTSDB and an old version of Bosun with custom tweaks. The age and the custom tweaks made it seem like there was a good chance our Bosun had its own bugs. But how could I be sure? I needed to understand how Bosun decides to send these unknown alerts, and see if I could catch it in the act.
Inside Bosun
A quick dive through the Bosun source code shows that every time it checks for alerts (which by default it does every five minutes), it first tries to determine whether it should fire any unknown alerts. To allow it to do so, Bosun keeps a record of all of the alert keys it has previously seen, along with the last time it saw them. It stores this in Redis as a sorted set. Every alert key (such as diskspaceLowAlert{ dc=uk2, host=webserver29, disk=root }) is stored with a timestamp. Before it does anything else, Bosun checks whether any of the last-observed times are too old, and if so, reports an unknown alert. Then it goes on to evaluate all the alerts by querying OpenTSDB. Finally it updates Redis with the new last-observed time for each alert key that it successfully observed.
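In rough Python terms (a simplified sketch of the behaviour described above, not Bosun’s actual Go code; the Redis key name and the half-hour cutoff are illustrative), the bookkeeping looks like this:

import time
import redis

r = redis.Redis()
LAST_SEEN = "bosun:last_seen_alert_keys"  # hypothetical key name
UNKNOWN_AFTER = 30 * 60                   # illustrative staleness cutoff, in seconds

def check_for_unknowns():
    # Any alert key whose score (last-observed time) is older than the cutoff
    # gets reported as unknown before the alerts are evaluated.
    cutoff = time.time() - UNKNOWN_AFTER
    for alert_key in r.zrangebyscore(LAST_SEEN, "-inf", cutoff):
        print("unknown:", alert_key.decode())

def record_observed(alert_keys):
    # After evaluating the alerts, refresh the last-observed time of every
    # alert key that was actually seen in this check.
    now = time.time()
    r.zadd(LAST_SEEN, {key: now for key in alert_keys})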
Here’s a slightly scary diagram I drew of the call tree while getting to understand this code with a particular view to understanding how and when it makes use of Redis to work out unknown alerts:

This gives us another couple of promising points to investigate: what Bosun is storing in Redis, and what communications are going between it and OpenTSDB.
So far as Redis goes, we can just look inside the store and see what’s in there. Sure enough, the ghost alert keys are there! There are keys missing tags, like diskspaceLowAlert{ dc=uk2, disk=root }, which is missing the host tag. Intriguingly, they always seem to have a last-observed time within the last half-hour. It took a while to discover that it’s not that Bosun is really seeing these alert keys all the time; it just seems to update the last-seen time every time it fires an unknown alert. So the last-seen time slowly gets old over the course of half an hour, then Bosun says “Oh! I’ve not seen that alert key for too long, I’ll flag it as unknown!”, resets the last-seen time and then repeats. It’s not clear if this is intended behaviour, but it was certainly confusing: I spent quite a while believing that these weird metrics were being observed far more frequently than they really were. In reality they only needed to appear once, and Bosun would keep updating the last-seen time forever after.
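Finding the ghosts was just a matter of scanning that sorted set for alert keys with no host tag; a throwaway check along these lines (the key name is again hypothetical) did the job:

import redis

r = redis.Redis()

# Walk the sorted set of previously seen alert keys and report any
# diskspaceLowAlert entry that carries no host tag at all.
for member, last_seen in r.zscan_iter("bosun:last_seen_alert_keys"):
    alert_key = member.decode()
    if alert_key.startswith("diskspaceLowAlert{") and "host=" not in alert_key:
        print(alert_key, "last seen at", last_seen)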
As for the communications with OpenTSDB, that required rather more digging. Sadly, Bosun doesn’t log these communications for us.
Getting our hands dirty
In the end, lots of logging was the only way to know for sure what was going on between Bosun and OpenTSDB. I built a custom version of Bosun that logged everything it sent to and received from OpenTSDB. I ran this instance for days with a copy of our alert definitions until it sent another unknown alert, then searched back through the log and found the incriminating result. Here’s (a slightly simplified version of) the query:
2017/08/16 16:17:01 info: tsdb.go:879: OpenTSDB request: {"start":1502895707,"end":1502896607,"queries":[{"aggregator":"sum","metric":"diskspace.freePercent","rateOptions":{},"filters":[{"type":"wildcard","tagk":"dc","filter":"*","groupBy":true},{"type":"wildcard","tagk":"host","filter":"*","groupBy":true},{"type":"wildcard","tagk":"disk","filter":"*","groupBy":true}]}]}

It’s asking for metrics with tags of “dc”, “host” and “disk”. This is what you’d expect, and the same as every other query that works normally. But when it gets back results, this time something is odd:
2017/08/16 16:17:01 info: tsdb.go:983: OpenTSDB response: [
{
"metric":"diskspace.freePercent",
"tags":{"dc":"uk2", "disk":"root"},
"aggregateTags":[],
"dps":{"1502896510":63.1,"1502896570":63.1}
},
{
"metric":"diskspace.freePercent",
"tags":{"dc":"uk2","host":"webserver28","disk":"root"},
"aggregateTags":[],
"dps":{"1502896531":25.7,"1502896591":25.7}
},
{
"metric":"diskspace.freePercent",
"tags":{"dc":"uk2","host":"webserver28","disk":"data"},
"aggregateTags":[],
"dps":{"1502896523":200.0,"1502896583":200.0}
},
<snipped lots of very similar metrics>
]

Did you spot it? The first of the resulting series has no “host” tag! This makes no sense: it’s not consistent with what we asked for in the query, and we know that the same query run again moments earlier or later doesn’t behave like this. So my suspicions were wrong. Bosun is off the hook. Something is going on in OpenTSDB such that, out of all the millions of requests we make every day, a tiny number seem to be getting these anomalous responses with missing tags.
Looking toward OpenTSDB
Unfortunately, with the problem demonstrated to exist outside of Bosun, we’re left with a much harder investigation. At least with Bosun it was trivial to run a separate instance and try to reproduce the problem under more controlled circumstances. But OpenTSDB is a different beast. Our OpenTSDB cluster processes enormous quantities of metrics from hundreds of services. Replicating that for controlled tests isn’t easily done by one engineer in an afternoon.
To make matters worse, the problem is slowly slipping away; it’s happening less and less often. I’m not sure if I’ve seen a single instance in the last month. Which, of course, isn’t a bad thing, but it’s never a satisfactory conclusion to say “we don’t really understand it but it eventually got better on its own”. Our current best guess is that the bad responses only happen when something times out inside OpenTSDB due to concentrated load on its servers. Our data tribe (who own the OpenTSDB systems at Skyscanner) have recently been doing great work to improve cluster performance and stability. It seems quite possible this has had the side-effect of ameliorating our problem. There’s still a bug lurking in there (we would always prefer a correct result, or a ‘busy, try again later’ response, to a subtly incorrect answer, even when the system is under load), but current signs are that it won’t be easy to reproduce and fix.
So, disappointingly, for now the story ends here. My colleague has reported a bug in OpenTSDB, and we still have some investigation going on, but like many a ghost-hunt, the trail is getting cold. Perhaps one day we’ll get to the bottom of it and unmask our ghost, but then again it may be that it will remain elusive, returning only to haunt our troubled dreams.
Addendum — Getting there one ghost at a time
My colleagues did eventually find and fix two concurrency bugs in OpenTSDB that we think were probably the source of the problem described above. With salting enabled in OpenTSDB, time series are stored across a number of salt buckets. To resolve a query, OpenTSDB will perform that query on each salt bucket separately and then merge the results. Here’s a very rough diagram.

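Conceptually, the fan-out looks something like the sketch below (Python rather than OpenTSDB’s Java, with invented names): the same scan runs against every salt bucket in parallel and the partial results are merged at the end.

from concurrent.futures import ThreadPoolExecutor

NUM_SALT_BUCKETS = 20  # illustrative; the real count is a cluster setting

def scan_bucket(bucket, query):
    # Placeholder for the per-bucket HBase scan; a real implementation would
    # return the rows stored under this salt prefix.
    return []

def run_query(query):
    # Run one scan per salt bucket concurrently, then merge the partial results.
    with ThreadPoolExecutor(max_workers=NUM_SALT_BUCKETS) as pool:
        partials = list(pool.map(lambda b: scan_bucket(b, query), range(NUM_SALT_BUCKETS)))
    merged = []
    for rows in partials:
        merged.extend(rows)
    return merged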
Each query looks up the raw time series data in its HBase stores, where the tag values are not stored directly as strings but as UIDs. So as each query result comes in (concurrently) from HBase, OpenTSDB might have to make some further requests: it needs to find out the tag value strings for any tag value UIDs in the results that it doesn’t already recognise. Then it can filter out the rows (each row corresponds to a single data point for a metric path and combination of tag values) that don’t match the query’s filters, process the rest and store the result… but two things go wrong here.

Firstly, the rows from one salt bucket are stored in a Java ArrayList called “kvs”, but this isn’t done in a concurrency-safe manner. Since the lookup of tag value strings introduced concurrency here, there’s a risk of the data structure getting corrupted, especially when a tag has lots of values that all need to be looked up.

Secondly, when the scan of a particular salt bucket is complete, its combined results are stored in a mapping called “kv_map”. This is a concurrency-safe data structure, but there was a race condition in the method of picking the keys under which to store these results, so it was possible, with particularly unlucky timing, for two sets of results to be stored under the same key, one of them overwriting the other. This left OpenTSDB thinking it had a full set of results when in fact some of them were missing.
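The second bug is easiest to see as a general pattern rather than in OpenTSDB’s own code. The sketch below (Python, with invented names) shows the shape of the check-then-act race: two threads can each pick the same ‘free’ key, so one bucket’s results silently overwrite another’s, which is exactly the kind of quiet data loss described above.

import threading

kv_map = {}              # shared map of per-bucket results, keyed by an index
kv_map_lock = threading.Lock()

def store_results_racy(results):
    # Pick the next unused key, then store under it. Between the check and the
    # assignment another thread can pick the same key, so one set of results
    # can overwrite another and silently go missing from the merged response.
    key = 0
    while key in kv_map:
        key += 1
    kv_map[key] = results

def store_results_safe(results):
    # Making the pick-and-store step atomic removes the race.
    with kv_map_lock:
        key = 0
        while key in kv_map:
            key += 1
        kv_map[key] = results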
My colleague has submitted pull requests to fix both problems, and we’re running the patched version without issues. With the fix for picking keys in place we’ve seen the spurious alerts disappear entirely, so we’re pretty confident that it was the root cause, even if the precise mechanism by which it results in missing tags remains unclear. We can all sleep a little easier at night with this ghost finally put to rest.
Working at Skyscanner
Skyscanner is a great place to work. I’ve found it to be a really positive, supportive atmosphere, with many clever and kind people of varied interests and specialisms. We’ve got a few roles available so please have a look if you think you would like it.

About the Author
I’m Annette Wilson. I work as a senior software engineer on Skyscanner’s content management and delivery systems. I’ve been a developer for twelve years and I’ve been with Skyscanner almost four. I enjoy puzzles, board games, drawing and roller-skating. I’m a trans woman and very happy about it. My favourite swiss-army knives are Vim, jq and Python, and for fun projects I love PuzzleScript and PICO-8.
Photography by Michael Gaida and Annette Wilson.