Real User Monitoring at Dailymotion
Collecting real user performance metrics through the browser to improve QA
We have been using RUM at Dailymotion for a few years now, and we’d like to share how we do it. While there are quite a few big players offering RUM (AppDynamics, New Relic, Pingdom, …), we wanted complete control over what data is sent and how we process it. That’s why we went for free solutions that allow full customization. We’ll review them and then walk through a concrete use case where RUM helped us highlight possible improvements.
It all started with Boomerang
From Yahoo Boomerang to an in-house solution
Yahoo Boomerang is a wonderful piece of JavaScript that measures a whole bunch of performance characteristics of your user’s web browsing experience (https://github.com/yahoo/boomerang/blob/master/README.md). Boomerang is open source and released under the BSD license, so it was a perfect fit for our first experiments with RUM.
Boomerang, like other JS RUM libraries, works by collecting performance metrics through the Navigation Timing JS API and by providing a few polyfills for old browsers that lack it.
Here are all the timings that are collectable through the JS window.performance.timing object:
Boomerang also lets you add your own timings, which is extremely useful for measuring the critical loading steps of your web application. In the end, Boomerang produces a beacon containing all the collected timings, to be sent to an endpoint of your choice.
While Boomerang was working perfectly, we were only using a subset of its features. That’s why, a year ago, we decided to replace it with a lighter in-house solution that only targets modern browsers (those supporting the JS timing API) and is tailored strictly to our requirements: rum.js.
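To make the idea of a beacon more concrete, here is a minimal sketch of how timings could be turned into one. It is neither Boomerang's code nor rum.js: the function names (buildBeacon, toBeaconUrl), the metric names, the playerReady custom timing and the /rum endpoint are all assumptions made for illustration. Only the attribute names read from the timing object come from the Navigation Timing API.

```javascript
// Hypothetical sketch: turn a Navigation Timing-like object into a beacon.
// `timing` is expected to look like window.performance.timing.
function buildBeacon(timing, customTimings = {}) {
  const start = timing.navigationStart;
  const beacon = {
    dns: timing.domainLookupEnd - timing.domainLookupStart,
    connect: timing.connectEnd - timing.connectStart,
    ttfb: timing.responseStart - start,
    domReady: timing.domContentLoadedEventEnd - start,
    load: timing.loadEventEnd - start,
  };
  // Custom timings are absolute timestamps (ms); store them as offsets
  // from navigationStart so they are comparable with the other metrics.
  for (const [name, timestamp] of Object.entries(customTimings)) {
    beacon[name] = timestamp - start;
  }
  return beacon;
}

// Serialize the beacon as a query string on the (assumed) endpoint.
function toBeaconUrl(endpoint, beacon) {
  const query = Object.entries(beacon)
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return `${endpoint}?${query}`;
}

// Possible browser usage. The setTimeout lets the load event finish first,
// since loadEventEnd is still 0 while the load handlers are running.
// window.addEventListener("load", () => setTimeout(() => {
//   const beacon = buildBeacon(window.performance.timing, { playerReady: Date.now() });
//   new Image().src = toBeaconUrl("/rum", beacon);
// }, 0));
```

Keeping buildBeacon free of any direct access to window makes it easy to exercise with a fake timing object outside the browser.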
Aggregating and graphing data
Collecting data through Boomerang or any other JS library is only the first step. What we’re really interested in is aggregating the data sent by all clients and graphing it. Many different technologies could handle this kind of aggregation, but 3 years ago we went for a stack built on nginx, pinba, collectd and drraw.
A few months ago we switched the graphing layer to a more customizable one: Graphite. Here is the lifecycle of a RUM beacon emitted from a client:
- Client sends beacon to RUM endpoint
- nginx handles beacons, keeping only those that contain coherent values and sanitizing those values
- nginx sends values to pinba
- pinba aggregates values by country / page
- collectd retrieves data from pinba and archives them
- drraw / Graphite use data from collectd to build graphs
- Tessera embeds Graphite graphs into functional dashboards
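The filtering and aggregation steps of this lifecycle can be illustrated with a small sketch. In our stack this work is done by nginx and pinba, not by JavaScript, so the code below only shows the logic. The field names (country, page, load), the 60-second upper bound and the use of a mean are assumptions chosen for the example.

```javascript
// Hypothetical sketch of "keep coherent beacons, sanitize, aggregate".
const MAX_DURATION_MS = 60000; // assumed upper bound for a coherent timing

// Returns a cleaned beacon, or null when the values are not coherent.
function sanitizeBeacon(raw) {
  const country = String(raw.country || "").toUpperCase();
  const page = String(raw.page || "");
  const load = Number(raw.load);
  if (!/^[A-Z]{2}$/.test(country) || page === "") return null;
  if (!Number.isFinite(load) || load <= 0 || load > MAX_DURATION_MS) return null;
  return { country, page, load: Math.round(load) };
}

// Groups valid beacons by country / page and computes the mean load time.
function aggregate(beacons) {
  const groups = new Map();
  for (const raw of beacons) {
    const beacon = sanitizeBeacon(raw);
    if (beacon === null) continue; // drop incoherent beacons
    const key = `${beacon.country}/${beacon.page}`;
    const group = groups.get(key) || { count: 0, total: 0 };
    group.count += 1;
    group.total += beacon.load;
    groups.set(key, group);
  }
  const result = {};
  for (const [key, { count, total }] of groups) {
    result[key] = { count, meanLoad: total / count };
  }
  return result;
}
```

Rejecting negative, non-numeric or absurdly large values before aggregating matters because a handful of broken clients can otherwise skew a whole country / page series.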
Here is the kind of graph that we’re able to build thanks to RUM:
Having such a graph is a great way to keep watch over what happens “live” on our website.
Keeping an eye on performance
On a daily basis, we use RUM to closely watch the impact of every release we make. This allows us to quickly detect changes in QA that are hard to measure before a feature or piece of code goes live.
Another interesting use of RUM metrics is to detect what proportion of users leave a player page before watching a video. This could indicate playback errors, or that the video stream took too long to load for various reasons (advertising, bugs, etc.). Thanks to all those metrics, we were able to write very concrete performance fixes.
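As a rough illustration of that "left before playing" metric, the proportion can be derived from per-visit records. The record shape used here (a playStart field that stays null when playback never started) is an assumption for the example, not our actual beacon format.

```javascript
// Hypothetical sketch: share of player-page visits that ended before playback.
// Each visit is assumed to carry playStart (ms offset), or null if it never played.
function abandonRate(visits) {
  if (visits.length === 0) return 0;
  const abandoned = visits.filter((visit) => visit.playStart === null).length;
  return abandoned / visits.length;
}
```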
Improving loading performance thanks to RUM
We’ll present one situation where RUM led us to rethink our resource loading strategy, and the positive impact this had on page loading time.
By observing the RUM graphs of the player page across various countries, we noticed that the video stream was taking quite some time to load. We started digging and, thanks to WebPageTest and the network panel of Chrome, we saw that a lot of third-party resources (display ads, stats, social tools, …) were loaded before our player stream. It surprised us at first because most of those third-party resources were loaded asynchronously after the player was injected into the page, so there was theoretically no possible competition between our player and those resources. However, it turned out that once included, our Flash player’s requests were no longer prioritized over other page resources. In fact, because a browser can only issue a limited number of simultaneous requests, our player did not get a chance to trigger most of its initialization requests in the first 2 seconds of the page lifetime.
We decided to introduce a very simple “resource planner” mechanism into our pages. The principle was that each resource to load was given a specific priority and was then automatically loaded at the appropriate time. Of course, to achieve this delayed loading of resources, all third-party JS had to be async-loadable.
In the end, delaying non-critical resources allowed us to reduce the time to play of a video stream by 1 second! This gain was much more than we anticipated. The change in our RUM graphs was a great visual confirmation, which we used to communicate the need to be more careful about what we add to our pages.
Here are the lessons we learned from this experiment, about RUM and about third-party resource handling:
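A resource planner of this kind could look like the sketch below. This is not our production code: the function names (planBatches, runPlan, loadScript), the numeric priorities and the rule "start the next priority level once the previous one has finished" are assumptions chosen to keep the example short.

```javascript
// Hypothetical sketch of a priority-based resource planner.
// Groups resources by priority; a lower number means "load earlier".
function planBatches(resources) {
  const byPriority = new Map();
  for (const { url, priority } of resources) {
    if (!byPriority.has(priority)) byPriority.set(priority, []);
    byPriority.get(priority).push(url);
  }
  return [...byPriority.keys()]
    .sort((a, b) => a - b)
    .map((priority) => byPriority.get(priority));
}

// Loads each batch in order; resources within a batch load in parallel.
// `load` is any function that returns a promise for a given URL.
async function runPlan(resources, load) {
  for (const batch of planBatches(resources)) {
    await Promise.all(batch.map((url) => load(url)));
  }
}

// Browser loader: injects an async script tag and resolves on load or error,
// so that a failing third-party script never blocks the following batches.
function loadScript(url) {
  return new Promise((resolve) => {
    const script = document.createElement("script");
    script.src = url;
    script.async = true;
    script.onload = script.onerror = () => resolve(url);
    document.head.appendChild(script);
  });
}

// Possible usage in a page (URLs are placeholders):
// runPlan([
//   { url: "/js/player.js", priority: 0 },
//   { url: "/js/stats.js", priority: 1 },
//   { url: "/js/ads.js", priority: 2 },
// ], loadScript);
```

Separating the ordering (planBatches) from the loading (runPlan) keeps the ordering logic a pure function, and the loader can be swapped for images or other resource types.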
- RUM is awesome and essential when dealing with front-end performance improvements.
- It’s quite easy to set up a minimalist RUM graph stack and there is a lot to learn in the process.
- Your RUM metrics can tell you much more than just “this page took x seconds to load” depending on what custom metrics you add (like player QA metrics in our case). It also makes everyone in your company much more conscious of what your users are really experiencing.
- Resource planning can also be applied to your own resources (to lazy load images, non-essential features, etc.).
- Last but not least, don’t believe third parties when they claim: “it’s ok, our script is asynchronous, adding it has no cost and we strongly advise you to add it at the top of your page’s body” (sound familiar?). Asynchronous is far from meaning “without any cost”, especially when you consider that the script you add is very likely to trigger an unknown number of network calls (through new script injections, beacon emissions, etc.). Hence, always ask yourself: is this third-party resource more important than my own page resources? If not, add it asynchronously, in a delayed way. Doing so, you should be able to reason more clearly about the room left for performance gains around your own components. Without Real User Monitoring, we would probably never have been able to prove it so effectively.