Finding an Error Needle in a Real-Time Haystack

Joe Natalzia
JW Player Engineering
17 min read · Apr 16, 2020

JW Player boasts one of the largest networks of video data on the web. Our global footprint of over 1 billion unique users a month creates a powerful data graph of consumer insights and generates billions of incremental video views.

This footprint is powered by the JW Player, built by the aptly named Player team. The Player team’s mission is to empower our customers to use a fully out-of-the-box player or to customize it heavily to their liking; every day we are building one of the most configurable applications out there. Combine a network producing over 3 million data points a minute with the customization available to those implementing the player, and you will soon find yourself lost in the proverbial real-time haystack, trying to make sense of what’s a needle and just how sharp that needle is.

So What’s the Problem here?

On the Player team, we’re focused on serving two distinct types of customers looking to integrate with our player: those who want minimal configuration and to just get “up and running”, and those who want to build upon our robust API to provide a one-of-a-kind experience to their end-users. As a result, the number of potential ways to embed JW Player is incredibly high. This means that while certain code paths may be heavily used by some developers, others may never be accessed at all (not to mention developers who build their own custom code paths).

We could go on at length about the raw testing challenges this presents, but beyond that, our scale poses an even more complex problem. Manually testing the astronomical number of possibilities is obviously untenable, so how can we deploy code and feel confident that we’ll catch issues well before our customers do?

“Monitoring, of course!”, you yell, unaware that you’re shouting into a screen. And that’s exactly what we yelled at our screens at first too. But the simple act of monitoring errors in the wild became a vast and complex problem that we’ve spent more than a year solving.

In a traditional scenario, you understand the basic flows of your application and exactly how they are implemented. It’s your server code or your client code, and even if unexpected things happen, they are triggered by very expected things. When working on an application as dynamic as JW Player, however, those traditional monitoring techniques fail. How do you catch a breakage for two of your thousands of customers who may be using a specific combination of features that others aren’t? And how do you do that while sifting through 3 million data points a minute?

The First of Many Attempts

Thanks to the incredible work of our Pipelines team, in February we on the Player team were able to quickly and efficiently spin up a system, surfaced as a Datadog dashboard, that did the following (a rough sketch in code follows the list):

  1. Categorize all data points coming from the player by event type (play vs. error vs. embed, etc.), looking over the past minute
  2. Attach additional information to the events, such as player version
  3. Slice the data so that it could be dynamically sorted via a tagging system
  4. Feed the data into Datadog for graphing and analysis
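
To make that shape concrete, here is a minimal sketch of what one minute of that aggregation could look like. The `player.events` metric name and the event fields are hypothetical, and the Datadog client is datadogpy’s DogStatsD; the real job lives in our pipeline and is far more involved.

```python
from collections import Counter

from datadog import initialize, statsd  # datadogpy's DogStatsD client

initialize(statsd_host="localhost", statsd_port=8125)


def aggregate_minute(events):
    """Count one minute of raw player pings, keyed by event type and player version.

    `events` is assumed to be an iterable of dicts with hypothetical
    `event_type` and `player_version` fields.
    """
    return Counter(
        (e.get("event_type", "unknown"), e.get("player_version", "unknown"))
        for e in events
    )


def forward_to_datadog(counts):
    """Emit each bucket as a tagged count so Datadog can slice it later."""
    for (event_type, version), count in counts.items():
        statsd.increment(
            "player.events",  # hypothetical metric name
            value=count,
            tags=[f"event_type:{event_type}", f"player_version:{version}"],
        )
```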

The first attempt surfaced more questions than answers. We had graphs for different metrics such as time to first frame, setup time, some error rates, and the overall traffic coming from each version. Unfortunately, quite a few issues kept it from being immediately useful.

First off, because hundreds of player versions are active at any point, it didn’t provide immediacy when we tried to understand how the latest version was affecting our customers. This was due both to the way we were looking at the data and to the overwhelming amount of data we had to deal with. It was just too much to look at the whole network, and it didn’t solve any of the problems we were facing.

The clearest metrics, such as time to first frame and setup time, were marred by the fact that they are far more useful over the long term. Yes, it’s interesting to know that we saw a spike in setup time during a release, but it’s far more interesting to know that setup time has increased 50% over the last quarter. Generally, the long-term trends are far easier to spot than the incremental increases.

Unfortunately, these issues prevented the data from being actionable beyond “huh, that’s neat”. It wasn’t a total loss, though: the most important piece of knowledge we did gain was an answer to “how long does it take for a new player version to propagate across the network?”. That wasn’t what we were hoping for, but it was something.

The First of Many Iterations

Our first attempt at real-time tracking didn’t yield much, so we went back to the drawing board. One of the largest problems with that first attempt was that, while we wanted to track issues that cropped up as a result of promoting new versions of the player, the dashboard gave us no insight into this. It was always looking at the full network, which meant we weren’t only seeing data from the latest release: we were also seeing developers on old versions of the player, customers pinned to older versions, and a number of other variations that could muddy the waters during a release.

Without changing the underlying dataset, we were able to craft a new version that addressed some of those concerns. We created a dashboard specific to our release process that compared the metrics of the last production player version with those of the version being promoted to production. While this gave us some granularity and context, it became clear over time that a small bump in errors or setup time on release could mean something was horribly wrong, or simply that a large publisher had pushed some bad code to production at a coincidental time. Any change in traffic could also easily influence these graphs.
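
For illustration, the comparison boils down to graphing the same error ratio for two versions side by side. The Datadog-style query strings below reuse the hypothetical metric and tag names from the earlier sketch, and the version numbers are placeholders; the real dashboard was assembled in the Datadog UI rather than generated in code.

```python
PREVIOUS = "8.x.0"   # placeholder: last production version
CANDIDATE = "8.x.1"  # placeholder: version being promoted


def error_rate_query(version: str) -> str:
    """Error events divided by all events for a single player version."""
    errors = f"sum:player.events{{event_type:error,player_version:{version}}}.as_count()"
    total = f"sum:player.events{{player_version:{version}}}.as_count()"
    return f"{errors} / {total}"


# Graphed together, a bump on the candidate that the previous version
# doesn't share is the signal we were looking for.
queries = [error_rate_query(PREVIOUS), error_rate_query(CANDIDATE)]
```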

Any of the above could produce false positives, but there were also factors that could easily cause false negatives. If a smaller publisher stopped working entirely, the breakage could be eclipsed by the larger network and we would never catch it. If an error increased 50% but was negligible in absolute terms to begin with, we would never catch that either.

After a few months of monitoring and quite a few releases, it became clear: comparing one version against another worked better, but it still wasn’t very useful for the majority of cases we were trying to catch.

The Epiphany

Fast-forward a few months, and our Principal Engineer on the Player team, Rob Walch, was working on some unrelated improvements to the open source repository hls.js, which powers JW Player’s HLS playback in browsers other than Safari. As part of these improvements, an A/B test was set up to compare the new code against the existing codebase. Rob chose specific customers to run the test against, and things proceeded as normal.

Rob has always been one to dig into the data when working on player improvements, and as part of this work he built a dashboard comparing the hls.js master branch with the new suite of improvements, known as HLS Progressive.

The HLS progressive dashboard

It’s easy to look at this and see some fairly traditional graphing of performance metrics: how often the code errors, how often it stalls, and other metrics centered around viewer experience. What was different for us, though, was that this was the first time (relatively) real-time player data had appeared so clear and concise. We knew where performance stood, and with each release there was a clear throughline showing what the latest changes had improved or made worse.

And why was this? There was one key factor at play that would be instrumental in rocketing our progress forward in detecting errors, not just in the A/B-test context or in a publisher-by-publisher context, but at the network level.

That factor was the smaller data set. As noted above, one of the largest problems with our original dashboard was that there was far too much noise to take anything of note away. Player version groupings hid release-specific issues, and error rates were swallowed by the overall traffic of the network. Rob approached it differently: he didn’t try to learn about Progressive by digging into the network at large; he used the slice of customers he was looking at to inform the broader picture of how the changes were performing. This is a simple data concept, of course, but against the backdrop of our overwhelming scale it was tough to see through the noise, and this unrelated data expedition brought some much-needed clarity.

Our new outlook gave us the boost we needed, and we set out to try, try again. Unfortunately, due to technical limitations we couldn’t set up a system to A/B test player versions (though it’s something we’re always evaluating). So what could we do in the meantime that would get us closer to detecting issues before even the smallest number of customers did? The answer lay in the discoveries above and, once again, in the magic of iteration.

Enter the Cohorts

With this new information in mind, we did what any engineers would do in this situation: we built something. Having seen how looking at the right cross-sections of data could provide powerful results, we set out to create an application that could find those cross-sections for us.

With a quick-and-dirty front-end and a hacky SQL table, we created a view of all of our cloud-hosted accounts and the different cohorts they fit into. The idea was that anyone could go into this application, select exactly which slice of customers they wanted to look at, and find the exact customers that fit that profile. It was a project-agnostic take on what Rob had done with his HLS work. Code name: Cohorticus.

It wasn’t pretty, but it got the job done

This new tool allowed us to focus on a select few cohorts and reinvent our dashboard to look specifically at error rates for two to three customers within those groupings. For the first iteration, we decided to focus on high-ACV customers producing mostly short-form VOD content.
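
As a sketch of what a Cohorticus lookup might have amounted to under the hood (the table, column names, and cohort values here are hypothetical stand-ins for the internal schema):

```python
import sqlite3  # stand-in for whatever database actually backed the tool

# Hypothetical cohort table: one row per cloud-hosted account with its cohort labels.
COHORT_QUERY = """
SELECT account_id, account_name
FROM account_cohorts
WHERE hosting = 'cloud'                      -- only cloud-hosted players track releases
  AND acv_tier = 'high'                      -- high annual contract value
  AND primary_content = 'short_form_vod'     -- mostly short-form VOD
"""


def find_cohort(conn: sqlite3.Connection):
    """Return the accounts matching the selected slice of the network."""
    return conn.execute(COHORT_QUERY).fetchall()
```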

But Did it Work?

Well, no. But more important than whether it worked was what we gleaned from this new failure. Releases came and went, and the error rates for these small groupings didn’t surface enough information. It wasn’t that our releases were bug-free; it was that this small subset of customers was such a small piece of the overall picture that we would only learn about bugs if they fit into the narrow two or three configurations we were aiming for. We had swung the pendulum too far in the opposite direction; suddenly, instead of looking at too much information, we were looking at too little.

One thing was clear, though: we were getting closer. This new dashboard provided a look similar to the epiphany-inducing HLS dashboard. It gave a clear, concise view of how these accounts performed against the metrics we wanted to track. It was also the first time we had focused solely on our cloud-hosted customers, who were guaranteed to be on the latest version and therefore susceptible to release breakages. These were successes, full stop, and more importantly they were something we could bring to the next iteration.

Call in the Experts

So we did what any good team does, and we brought in some more experts. We had been through a few iterations and found some things that worked, and others that didn’t.

Our cross-collaboration group included our Pipelines guru Steve Whelan (whose great JW engineering posts you may already be familiar with), master Data Scientist Graham Edge, and, once again, our Principal Engineer of all things player, Rob Walch. This slice of the organization represented all the pieces that needed to fit together for this work to, well, work. Each person brought expertise in a different method or madness, and it was clear we would need both to succeed in this venture.

As a group, we walked through everything that had been tried over the previous months. Cohorticus was demoed along with the resulting dashboards. The issues were apparent, but the solutions less so. After a long but productive meeting, we settled on what became the second major breakthrough in our real-time tracking efforts.

What if we were over-thinking it? It’s a tried-and-true fact of engineering that engineers over-engineer, and this project was no less susceptible than the rest. At its core, a cohort was simply a logical grouping of customers. We reviewed this concept in light of all the methods we had attempted previously. After some deliberation, we settled on something that had always been in front of us but never attempted: geography-based cohorts.

Sometimes the Simplest Solution is the Easiest

With this new idea, we updated our real-time jobs yet again, this time looking only at our cloud-hosted players and breaking them down by the region of the world the data was coming from. We built graphs showing the error rate for each region alongside the volume of data coming from it, giving a more holistic view of the data.
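
In job terms, the change is small: keep only cloud-hosted players and attach a region tag to each aggregated count. A sketch, with a hypothetical edge-to-region lookup and hypothetical field names:

```python
# Illustrative mapping from the edge location an event arrived at to a reporting region.
EDGE_TO_REGION = {"iad": "na", "ams": "eu", "sin": "asia"}


def region_tags(event):
    """Return Datadog tags for a cloud-hosted player event, or None to drop it."""
    if not event.get("cloud_hosted"):
        # Self-hosted embeds can pin old player versions, so they are excluded
        # from release-focused views.
        return None
    region = EDGE_TO_REGION.get(event.get("edge"), "other")
    return [f"region:{region}", f"event_type:{event.get('event_type', 'unknown')}"]
```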

And then what happened? Well, we caught our first issue in a release before our customers did.

A slice of the dashboard upon the release of 8.12.5

Within minutes of releasing player version 8.12.5, we had a dashboard that resembled what we had hoped for all along: clarity. Because we were looking at cloud-hosted players sliced at network scale, rather than the customer-by-customer view of the Cohorticus approach or the network-wide view of our first dashboard, we caught an error that might otherwise have been swallowed by the network’s overall traffic, or missed for lack of signal.

One other important thing to note here is just how much cohorts mattered in dissecting the data. In real time, we were able to see the immediate impact across each region of the world and, with that information, which publishers would potentially be most impacted. It’s also a demonstration of how varied the network is. Yes, Asia saw a bump, but if we had been focused only there, it could easily have been read as a normal fluctuation in error rate. Having these cohorts stacked against each other provided insight we hadn’t even expected to gain; a nice side effect of our approach.

This issue also surfaced the information we didn’t have. Just as in traditional monitoring, it’s not only important to know that there is an issue; it’s important to know how critical that issue is. We had part of that information, the area of the world and the relative volume, but were missing an even greater piece of the picture: what type of errors did we see the increase in? Using some other data sources, we pieced together a few data points and found out within the hour that we were seeing an increase in ad errors, but it was unclear what type they were and whether they were fatal.

We waited the hour or so for our other data sources to compile the full data set, then went to work digging in and trying to figure out what had gone on. Before that, however, we were able to roll back the release with confidence, ensuring that if the issue was fatal we were on the right side of history.

In digging through the data, we pulled some customer page URLs and did some local file mapping to try to replicate the issues and error codes behind the increase. Over the course of this investigation we found that the issue was actually caused by some added code that increased error visibility. In other words, we weren’t causing new errors; we were simply catching ones we had previously missed. We didn’t have to roll back after all. How could we avoid this in the future?

Don’t Stop Believing

We had rolled back that ad-error-inducing release because we lacked knowledge about the type of issues customers were seeing, and that was the first thing we opted to address in the new iteration. Our Principal Engineer Rob once again proved to be a wonderful help. When we discussed these needs, he brought up an older categorization system that had been used for years in one of our other data tools, Looker. The concept was simple: just as we had aggregated publishers into groupings by geography, this aggregated error codes into logical groupings.

These error groupings were added to our real-time aggregation job and forwarded on to Datadog like the rest. We updated our dashboard to include 15 new cohorts, 3 of which were not error groupings but groupings based on the ad client in use (JW Player supports multiple ad clients for playback). The other 12 were taken straight from the Looker dashboard and included error cohorts such as “Javascript Exceptions” and “Non-Linear Ad Errors”.
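
A rough sketch of what that categorization step might look like inside the aggregation job; the error codes, group names, and ad client labels below are illustrative placeholders, not our real taxonomy:

```python
# Hypothetical error-code-to-cohort mapping in the spirit of the Looker groupings.
ERROR_GROUPS = {
    "javascript_exception": {270001, 270002},
    "non_linear_ad_error": {301001, 301002},
    "network_error": {230001, 232002},
}

# Illustrative ad-client grouping (JW Player supports several ad clients).
AD_CLIENTS = {"vast", "ima", "freewheel"}


def error_group(code: int) -> str:
    """Map a raw error code onto one of the logical error cohorts."""
    for group, codes in ERROR_GROUPS.items():
        if code in codes:
            return group
    return "other"


def tags_for_error(event: dict) -> list:
    """Tags attached to each error event before it is forwarded to Datadog."""
    tags = [f"error_group:{error_group(event['code'])}"]
    if event.get("ad_client") in AD_CLIENTS:
        tags.append(f"ad_client:{event['ad_client']}")
    return tags
```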

These new groupings gave us not only a set of new cohorts to monitor during releases, but also more information about every existing cohort, since Datadog allowed us to mix and match cohorts as we saw fit. For example, we could now see the error rate of Javascript exceptions in Asia. That information would have been crucial in the previous release, so we felt confident we had made some strides forward.
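
As an illustration, that Asia question reduces to a tag filter in a Datadog-style query (metric and tag names are the same hypothetical ones used in the earlier sketches):

```python
# JavaScript-exception errors in Asia, ready to divide by the region's total volume.
JS_EXCEPTIONS_ASIA = (
    "sum:player.events{event_type:error,error_group:javascript_exception,region:asia}.as_count()"
)
ASIA_VOLUME = "sum:player.events{region:asia}.as_count()"
```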

Error Groupings Have Their Heyday

So now we had new cohorts, new grouping abilities, and some new graphs on a dashboard. The last thing to do was test it out. Version 8.13.0 of JW Player contained a remarkable number of feature additions, which, as any developer can tell you, made it ripe for breakages. It was the perfect test bed for our new views. Almost immediately upon the release of 8.13.0, we saw something that in any other circumstance would be horrifying, but in this one validated every change we had made up to that moment.

The Release of 8.13.0

This screenshot is so powerful because it confirms everything we had learned up to that point. Without these new error-group cohorts, we would never have caught this issue live in the wild. Because the error count of these JS exceptions was relatively low, the magnitude of the network, even at the geographic scale, would easily have eclipsed any ability to see there was an issue. Being able to view each piece of this puzzle with the right slice of data was what allowed us to catch it.

Right after noticing this spike, we started mixing and matching the various cohorts, trying to get a better understanding of the scope of the issue and what next steps we could take. We found that it was ~400 errors every minute, mostly confined to the EU, which, as you can see from the screenshot above, processes approximately 700 thousand data points every minute, so the increase was on the order of 0.06% of the region’s events. It’s important to note that errors and overall data points are not a 1:1 relationship; however, the ratio is a useful proxy when trying to determine the level of impact.

Now that we had a sense of the general level of impact, we felt safe leaving the release in production and waiting until the aforementioned downstream data processing had finished before digging further into the detailed data. We spun up a few queries based on the error codes that we knew fit into the Javascript Exception cohort and found that two customers out of our thousands were responsible for the majority of the increase. We grabbed some affected page URLs and went to work trying to find the root cause. A quick investigation of the pages showed that these errors were the result of custom code not playing well with changes introduced by a new API added to the player.

We found the most affected cloud-hosted players from these accounts and had their account manager reach out to make them aware of the issue before we rolled just those players back to the previous player version. They remained stable, we were able to keep the new player version in production for everyone else, and all of this was accomplished before the customers even noticed we had released.

We never would have thought we could detect such a micro-scale issue in our macro-scale network, let alone detect it in real time, leveraging the work and knowledge of the other brilliant minds and teams at JW Player.

What’s Next?

So where do we go from here? Well, if you’ve been following along, the answer is probably fairly clear: we iterate. Just like the releases before it, this latest release gave us new insights into our approach, but it also surfaced some weak points in our processes and the data that enables them.

The first piece we are focusing on is our process when release issues arise. Any good engineering team knows that its code is only as strong as the processes (or lack thereof) that enable it. When faced with a release that affected only a small subset of customers, we didn’t have good runbooks in place to guide our response. As a result, there was a lot of confusion and frustration when initially triaging the issue. In any production incident, the focus needs to be crystal clear so the team has the proper tools to deal with the situation and can ignore any ancillary noise.

The second focus is on data. It took an hour and a half from when we realized there was an issue until we had enough clarity to know the affected customers and associated pages. In a situation where the impact is more widespread, that time to triage is simply too long. We’re partnering with our internal Pipelines team to see how we can get faster access to page URLs and customer identifiers, bringing this time more in line with our time to discovery, which is now minutes.

Takeaways

We’ve learned a lot over the past year. Beyond the data aspects, though, there are takeaways that can be applied to any venture, engineering or otherwise. In no particular order:

You never know when lightning will strike

For us, an unrelated dashboard provided the jumping-off point from which we were able to solve a problem that was at one point regarded as insurmountable. So always be vigilant, and never write off a body of work as unrelated; you never know when it will provide the spark that lights a fire.

You will not get it right the first time

We didn’t get it right the first time, nor the second, nor the third. It took try after try of re-evaluating what we had learned and adjusting the next approach to fill some gap (and, of course, create others). If we had been satisfied with our first approach, we would never have reached the level of granularity and coverage we enjoy today. And it’s because of this that…

Perseverance is everything

“Don’t give up” is a cliché, but every cliché has a reason it exists.

Teamwork makes the dream work

This is by far the most important takeaway. It took three different teams, many different brains, and many areas of expertise to get us where we are today. No matter how much you know about a topic, you will always have blinders of some sort, and it might surprise you what can be uncovered when someone else steps in.

Plus, victory is always sweeter when it’s shared.
