How Updating Legacy Iframe Creation Saved GumGum Thousands In Lost Revenue Due to DSP Discrepancies

John Bauzon
Published in GumGum Tech Blog
This was not fine…

The Problem:

Discrepancies are a common issue every ad exchange platform encounters. These differences between metrics, most often impression numbers, arise because each party’s trackers fire at slightly different times and the resulting events are processed and tabulated by two different systems. As a result, discrepancies have become a commonplace, unwanted yet tolerated part of the industry, so long as the differences stay within acceptable levels. Over the years, as both web and ad technologies have evolved, this discrepancy problem has worsened here at GumGum, leading to thousands of dollars of lost revenue daily.

Given such significant losses, the problem had to be fixed, and our team was tasked with doing a deep dive into our systems to find the cause.

Preliminary Investigations:

To begin with, the team had to solidify the definition of the problem and outline a set of goals for the investigation and any prospective solutions.

The discrepancies that needed to be addressed were differences in impression numbers between GumGum and Demand Side Platforms (DSPs), with GumGum reporting more impressions than the DSPs. This causes a loss in revenue because GumGum is contractually obligated to pay publishers based on GumGum’s numbers, as per the defined Terms of Service, while in most cases the DSPs are similarly obligated to pay GumGum based on their own numbers.

A simple example:

  • GumGum records 15,000 impressions
  • DSP records 10,000 impressions
  • DSP pays GumGum for 10K impressions but GumGum pays the publishers for 15K impressions, leading to a revenue loss for the difference of 5K impressions

The overarching goal of the investigation was to figure out the main causes of the discrepancy and apply a fix that would bring the current DSP discrepancy percentage down to acceptable levels. The ideal target was a discrepancy of less than 2% between GumGum and DSP impressions.
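For concreteness, here is a minimal TypeScript sketch, not GumGum’s actual reporting code, of how such a discrepancy percentage can be computed from the two impression counts. The convention of measuring against GumGum’s count is an assumption made for illustration.

```typescript
// Minimal sketch; not GumGum's actual reporting code.
function discrepancyPct(gumgumImpressions: number, dspImpressions: number): number {
  // Expressed here relative to GumGum's (higher) count; the exact
  // convention used in real reporting may differ.
  return ((gumgumImpressions - dspImpressions) / gumgumImpressions) * 100;
}

// Using the example above: (15,000 - 10,000) / 15,000 ≈ 33.3%,
// far above the eventual ~2% target.
console.log(discrepancyPct(15_000, 10_000).toFixed(1)); // "33.3"
```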

Having better defined the problem and the goal, the team set out to begin its investigation. It started with a deep dive on reports, trying to identify features (e.g. browser, device, etc.) that seemed to trigger the discrepancy. This was done through an analysis of both GumGum and DSP reporting.

In addition to this, the team also reviewed our systems for common causes of discrepancies. We then measured the likelihood of each potential cause to help narrow down our investigation. Some of these common causes include:

  • Bad markup code
  • Missing or incorrect tags and mappings
  • Poorly performing scripts
  • Inefficient tracking pixels
  • Excessive firing or use of trackers and scripts
  • Server latency
  • Pixel or event misfires

You will notice something in common among the causes listed above: most of them come down to verifying the integrity of the trackers and confirming that they were firing correctly. To close the loop on the whole event pipeline, the team also collaborated with GumGum’s data engineering team to ensure that the events were being recorded correctly.

Initial Findings:

Initial analyses of reports were not the most fruitful, as they showed no clear features that stood out as contributing factors. The problem was not isolated to a single DSP, nor to a group of DSPs that shared commonalities. The numbers indicated that the problem was widespread across multiple dimensions: browser types, devices, products, and ad unit types, to name a few. Given this, the team formed a conjecture: either an issue was occurring in one of the core technologies central to our ad serving and rendering pipeline, or a multitude of smaller issues were compounding to make the problem appear widespread. This conjecture would prove vital as the investigation continued.

With a list of common causes of discrepancies as a guidepost, the team did a general review of the integrity of the ad delivery pipeline. Ad serving was scrutinized to ensure that the creative markups were not being malformed and that the correct trackers were being appended. Ad rendering was also tested: a number of seemingly discrepant ads of different formats were pulled and reviewed to ensure that they were rendering and that their trackers were firing correctly. These fired events were then validated by GumGum’s data engineering team to ensure that there were no apparent holes in our event recording pipeline. Unfortunately, no glaring issues significantly correlated with the problem were found. However, an important area of inquiry was brought to light: the GumGum tracker fires before the DSP tracker, and, more importantly, the GumGum tracker fires upon attaching the ad container onto the DOM, outside of the creative markup, while the DSP tracker lives inside the creative markup.
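To make that distinction concrete, here is a simplified TypeScript sketch of the ordering just described. The function signature, helper names, and pixel URL are illustrative rather than GumGum’s actual rendering code, and the way the creative markup is written into its iframe is deliberately left abstract.

```typescript
// Illustrative only: the names and pixel URL below are hypothetical, and
// the mechanism for writing the creative into its iframe is left abstract.
function renderAd(
  slot: HTMLElement,
  creativeMarkup: string,
  writeCreative: (container: HTMLElement, markup: string) => void
): void {
  // 1. The ad container is attached to the DOM, and the GumGum impression
  //    pixel fires at this point, outside of the creative markup.
  const container = document.createElement("div");
  slot.appendChild(container);
  new Image().src = "https://tracker.example/gumgum-impression"; // hypothetical pixel URL

  // 2. The creative markup, which carries the DSP's own impression pixel,
  //    is then written into the container. That pixel only fires if this
  //    step renders and executes correctly.
  writeCreative(container, creativeMarkup);
}
```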

After this first wave of inquiry, the team took a step back to re-evaluate and redirect its approach, reviewing the key insights it had gleaned so far in the process:

  • That the issue was widespread and indicated that there might be a problem with one of our core systems, or that there might be a multitude of issues compounding to make it appear that way.
  • That with our current set of data, testing, and monitoring tools, no major issues with our ad delivery pipeline were found. However, the differences in firing timing and location between the GumGum pixel and the DSP pixels were a promising area of inquiry.

With these two salient pieces of information in hand, the team began the next stage of investigation.

The Need for Data:

For the next phase of investigation, the team decided to do a deeper dive on the ad rendering code. There were a few reasons for this:

  1. It was a promising lead for the investigation due to the pixel timing and location differences.
  2. The code in question was one of our core legacy systems.

A bug there would have a widespread effect across reporting dimensions, which would match the trend we saw in the reporting data, another sign that we were on the right track.

To simulate how such a bug might occur, the team conducted a simple test: errors were manually thrown in various parts of the ad rendering code to see if we could reproduce the behavior wherein the GumGum tracker fires but the DSP tracker does not.

This had favorable results. We found that if an error crashed the creative markup code after the GumGum impression fired, the DSP impression did not fire. This tracked with the discrepancy data, and answering how and why this happens in production became the focal point of the investigation.
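As a rough illustration of that test, the snippet below shows the kind of fault we injected: a stand-in creative (with a hypothetical tracker URL, not a real DSP creative) that crashes before its own tracker code runs.

```typescript
// Stand-in creative used for fault injection; not a real DSP creative,
// and the tracker URL is hypothetical.
const faultyCreative = `
  <script>
    // Simulate the creative crashing partway through its own code.
    throw new Error("simulated creative failure");

    // The DSP impression pixel would normally fire further down in this
    // same script, so the crash above means it never executes.
    new Image().src = "https://tracker.example/dsp-impression";
  </script>
`;

// The GumGum impression has already fired when the ad container was
// attached to the DOM, so this failure leaves GumGum counting an
// impression that the DSP never records.
```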

The team’s next step was to push changes adding new pixels and improving error handling in the code under investigation. Pixels were added to mark certain checkpoints in the ad rendering flow, the most significant being a debug pixel in the creative markup to emulate the DSP pixel. Error handling was improved to fire more descriptive error events in both the rendering process and the creative markup itself. The team then created dashboards in Grafana to capture this data and plot it onto graphs, making it easier to visualize and analyze. The code was eventually pushed to production and left to run for a few days to gather data.
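In spirit, the instrumentation looked something like the sketch below. The endpoints and reporting format here are hypothetical and only illustrate the kind of debug pixel and error reporting described above.

```typescript
// Hypothetical endpoints, shown only to illustrate the kind of
// instrumentation described above.

// Debug pixel injected into the creative markup to emulate the DSP pixel:
const debugPixelTag =
  '<img src="https://tracker.example/debug-impression" width="1" height="1">';

// More descriptive error reporting in the rendering flow (similar handling
// was also added inside the creative markup itself):
window.addEventListener("error", (event: ErrorEvent) => {
  const detail = encodeURIComponent(
    `${event.message} @ ${event.filename}:${event.lineno}`
  );
  new Image().src = `https://tracker.example/render-error?detail=${detail}`;
});
```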

What happened afterwards was puzzling. Based on the graphs, the debug pixel behaved as expected and seemed to match DSP pixel counts, further supporting the hypothesis that something was happening in the creative markup. However, whatever that something was, it was ghostlike. The additional error handling didn’t catch any errors correlated with the discrepancy; for some reason, things seemed to be failing silently. This was disheartening. The team knew it was following a good lead, only to be met with another dead end. Still, we believed we were on the right track and just had to get over this hump. And we eventually did.

Breakthrough and a “Simple” Solution:

The breakthrough came after one last round of brainstorming. At this point we had the following pieces of information to work with:

  • The discrepancy seemed to be caused by the DSP impression pixel not firing within the creative markup.
  • The creative markup and rendering code weren’t throwing any errors that appeared correlated with the issue.

As it turns out, it would take a fresh pair of eyes alongside another deep dive into the code to find the issue. The Eureka moment came when I was reviewing the code with GumGum’s engineering team based in Australia. I pointed out the following:

1. After firing the GumGum impression, the code would simply attach the Node containing the creative markup onto the DOM.

2. Once attached, the creative markup would then run its own code, which contained the aforementioned debug pixel and additional error handling.

I then half-jokingly noted that the only part of the process with a blind spot was the attachNode method itself, which literally just appends the element onto the DOM, so I hadn’t thought much of it.

Paraphrasing the events of that day: the team asked, “What exactly was being appended to the DOM?” I answered, “The container holding the ad markup,” as I shared the screen containing the code. “Here you can see its base is just simple HTML, to which we append an iframe containing our ad creative.” The key question came next: “What are we using to write that iframe?” I replied, “document.write, but I don’t see how that could be the issue; that’s legacy code that’s been running for years.”
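For context, the legacy pattern under discussion looked roughly like the sketch below (an approximation, not GumGum’s verbatim code): the creative markup is written into a freshly created iframe via document.write.

```typescript
// Approximation of the legacy iframe creation pattern; not verbatim code.
function writeCreativeIntoIframe(
  container: HTMLElement,
  creativeMarkup: string
): HTMLIFrameElement {
  const iframe = document.createElement("iframe");
  // The iframe must be in the DOM before its document can be written to.
  container.appendChild(iframe);

  const doc = iframe.contentWindow!.document;
  doc.open();
  doc.write(creativeMarkup); // the call the warnings below are about
  doc.close();

  return iframe;
}
```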

Well, as it turns out, MDN states:

Warning: Use of the document.write() method is strongly discouraged.

As the HTML spec itself warns:

This method has very idiosyncratic behavior. In some cases, this method can affect the state of the HTML parser while the parser is running, resulting in a DOM that does not correspond to the source of the document (e.g. if the string written is the string “<plaintext>” or “<!--”). In other cases, the call can clear the current page first, as if document.open() had been called. In yet more cases, the method is simply ignored, or throws an exception. User agents are explicitly allowed to avoid executing script elements inserted via this method. And to make matters even worse, the exact behavior of this method can in some cases be dependent on network latency, which can lead to failures that are very hard to debug. For all these reasons, use of this method is strongly discouraged. Therefore, avoid using document.write() — and if possible, update any existing code that is still using it.

Aftermath and Takeaways:

After this revelation, we got to work migrating from document.write to iframe.srcdoc. We then tested the new iframe creation method gradually with a phased rollout, fixing issues with each iteration to ensure the new iframes were functionally equivalent to the old ones. As for the resulting effect on the discrepancy? It was more than worth the effort.
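Before looking at the numbers, here is a minimal sketch of the replacement approach; a production version would likely also handle sizing, sandboxing, and rollout gating, all omitted here.

```typescript
// Minimal sketch of srcdoc-based iframe creation; a production version
// would likely also handle sizing, sandboxing, and rollout gating.
function createCreativeIframe(
  container: HTMLElement,
  creativeMarkup: string
): HTMLIFrameElement {
  const iframe = document.createElement("iframe");
  iframe.srcdoc = creativeMarkup; // parsed as the iframe's own document
  container.appendChild(iframe);
  return iframe;
}
```

Unlike document.write, srcdoc is declarative: the browser parses the markup as the iframe’s own document, without the parser-state side effects described in the warnings above.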

From 7.5%+ (Yellow-Orange Line) to 2.5% (Red Line)

Discrepancy percentages across multiple DSPs dropped close to target values, with the one shown above settling around 2.5%, close to our ideal target of 2%.

Even with the goal mostly achieved, more work still needed to be done: the team had to ensure that similar issues would be handled better in the future. We did this by building on top of the new pixels and dashboards, developing an alerting system that notifies us when discrepancies hit a certain percentage. The team also created an online playbook for DSP discrepancy investigations, documenting best practices so that similar problems can be addressed more quickly.

While discrepancies are less of a worry for GumGum now, this may not always be the case. The best we can do to prepare for an unknown future is to be and do better. I came out of this rollercoaster of an investigation with two reflections that I will hopefully take to heart and apply to my craft:

  • The best way to guard against bugs is to prevent them from happening in the first place. Proactive measures, like creating tests for key functionality and having quick feedback loops, are better than reactive ones.
  • “Legacy” code can become problematic. What used to work for years may not always work for the present and future. Learn when it is time to make a change.

We’re always looking for new talent! View jobs.

Follow us: Facebook | Twitter | LinkedIn | Instagram
