The Shift from Batch Data to Streaming Data

AppNexus
The Internet Insider
5 min read · Mar 10, 2016

With its roots in the punch card machines that allowed the United States to carry out its first machine-tabulated census in 1890 (machines that helped lay the groundwork for IBM), batch data processing is, by today's standards, a slow-moving, primeval process. It essentially works like this: individual data points are allowed to accumulate over a certain, pre-specified amount of time. Once the allotted time runs out, the accumulated data is processed and analyzed as a single unit (a "batch"), yielding a report that summarizes what changed over that interval: which "old data" was cast aside and which "new data" arose in its place.
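To make that pattern concrete, here's a minimal sketch in Python. The event records, values, and hourly window are hypothetical, invented for illustration rather than drawn from any real pipeline; the point is simply that events accumulate untouched until their window closes, and only then get summarized.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical (timestamp, value) event records -- not real platform data.
events = [
    (1457568000, 3), (1457568120, 5),   # land in one hourly window
    (1457571650, 2), (1457571700, 7),   # land in the next window
]

WINDOW = 3600  # batch window length in seconds (one hour)

def batch_report(events, window=WINDOW):
    """Bucket events by window, then summarize each completed batch at once."""
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[ts // window].append(value)   # accumulate only; no analysis yet
    return {
        datetime.fromtimestamp(b * window, tz=timezone.utc).isoformat(): sum(vals)
        for b, vals in buckets.items()        # analysis happens once, batch-wide
    }

print(batch_report(events))
```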

What does batch data offer for ad tech? Generally speaking, batch data presents buyers with an "aerial snapshot" of how their campaigns are faring hour by hour, and gives sellers a glimpse into how much of their inventory is being filled within a given timeframe. Buyers and sellers can use batch data reports to identify long-term, systemic changes and trends they might not otherwise be aware of. The AppNexus platform alone supplies buyers with 96 key data points on the hour, every hour, for each viewable campaign impression. While slower to process than real-time, event-based (otherwise known as "streaming") data, batch data still offers strategic insight that brands and publishers can leverage to shape their business decisions… if they're willing to put up with the latency that goes hand in hand with the processing time.
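A seller-side fill-rate check against one of those hourly reports might look like the sketch below; the field names and numbers here are invented for illustration, not an actual report schema.

```python
# Hypothetical hourly seller report rows; fields and figures are made up.
hourly_report = [
    {"hour": "2016-03-10T09:00Z", "available": 120000, "filled": 93000},
    {"hour": "2016-03-10T10:00Z", "available": 110000, "filled": 71500},
]

for row in hourly_report:
    fill_rate = row["filled"] / row["available"]  # share of inventory that sold
    print(f"{row['hour']}: fill rate {fill_rate:.1%}")
```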

But the wider social implications of batch data processing go well beyond the field of digital marketing. They also demonstrate why batch data processing, for all its strategic advantages, is of limited use in many instances that call for real-time, actionable intelligence.

Case in point: through batch data processing, we know that the Amazon Basin in Brazil is being destroyed at a ferocious rate. By aggregating and analyzing millions upon millions of data points gleaned by NASA Earth-observation satellites, we can even crunch enough numbers to see which parts of the world's largest rainforest are being clear-cut the fastest on a monthly and quarterly basis. We can watch with satellite-enabled eyes which parts of the rainforest have been newly burned away to make room for illegal cattle ranches and palm oil plantations, or where new logging roads have sprouted in the course of 30-odd days.

But that’s about it. Batch data processing might be able to give us a monthly glimpse into Amazon deforestation (or an hourly report on how well a digital advertising campaign is doing). But what it can’t do is “stream” us the information we need right away, like the specific wheres and whens of a forest being cut down in real time. In other words, batch data can’t provide us with enough actionable insight on where and when we can stop illegal logging while it’s actually happening. That would require a new set of data measurement tools altogether.

The fact of the matter is that in today's world, too many fast-breaking events occur within any given hour: events that we need to register and react to in real time. Just as batch data doesn't provide actionable, on-the-ground intelligence that can actually prevent deforestation, neither does it give programmatic buyers any leeway to change their campaigns when new, split-second, impactful developments arise (which inevitably they do). With the latency that goes into processing a batch data report, advertisers are left flying blind. As a result, oversights pile up at a fast clip: advertisers keep throwing their media budgets at websites where there's no longer enough of an audience to justify the spending. And companies are at a disadvantage if they want to test new strategies at a moment's notice and see if they perform better.

In contrast to batch data processing, data streaming is a child of our own age, the age of big data. According to a report by (the modern-day) IBM, 90% of the world's data was created within the past two years alone. While a stat like that might seem bewildering to some, for businesses, organizations, and government bodies with the necessary processing power and with smart data scientists ready to analyze it, big data streams give us a level of granular insight that none of us would have thought possible only 20 years ago.

By using the right algorithms on a streaming data set, we can actually quantify event-based changes in a matter of microseconds. And, while event-based data might not always let us know why changes occur, it does let us know what changes are occurring at any given instant with minimal latency. Presented with enough event-based, real-time data, we can suddenly notice patterns of correlation and causation where we’d never thought to look beforehand. By “mining” this stream of big data and by studying frequent correlations between events, we’re given the ability to forecast the probability of event-based outcomes with a surprising level of accuracy.
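As a hedged illustration of that idea, the sketch below processes each event the instant it arrives, using an exponential moving average as a running baseline and flagging any value that strays far above it. The alpha and threshold values are arbitrary assumptions, and production streaming systems use far richer models than this.

```python
class StreamChangeDetector:
    """Flag event-rate shifts as each event arrives, not at batch close.

    A sketch only: an exponential moving average serves as the baseline,
    with a simple multiplicative threshold. Real systems use richer models.
    """

    def __init__(self, alpha=0.05, threshold=2.0):
        self.alpha = alpha          # how quickly the baseline adapts
        self.threshold = threshold  # how far a value may stray from baseline
        self.baseline = None

    def observe(self, value):
        if self.baseline is None:   # first event seeds the baseline
            self.baseline = value
            return False
        is_shift = value > self.threshold * self.baseline
        # Update the running baseline incrementally: O(1) work per event.
        self.baseline = (1 - self.alpha) * self.baseline + self.alpha * value
        return is_shift


detector = StreamChangeDetector()
for count in [10, 11, 9, 10, 12, 40, 45]:   # e.g. per-second event counts
    if detector.observe(count):
        print(f"shift detected at count {count}")
```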

One surprising result of real-time data streaming comes from Shazam, an app that lets users identify song segments they're hearing on TV, the radio, the Internet, or in an elevator. By analyzing the peaks and troughs of the millions of song identification queries that Shazam users make each day, Shazam can predict with a credible measure of certainty whether up-and-coming artists are destined to become tomorrow's hit makers. Through careful monitoring and mining of real-time data analytics, Shazam has in fact been able to identify "breakout" artists like Lana Del Rey in 2012 and French Montana in 2013.
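Shazam hasn't published the details of its pipeline, but the underlying idea, flagging an artist whose latest query volume surges far above their own trailing average, can be sketched like this (the artist names, counts, and surge factor are all made up):

```python
from statistics import mean

# Hypothetical daily tag counts per artist; Shazam's real method is not public.
daily_tags = {
    "artist_a": [900, 950, 1000, 980, 4200],  # sudden surge on the latest day
    "artist_b": [500, 520, 510, 530, 540],    # steady interest, no breakout
}

def breakout_candidates(history, surge_factor=3.0):
    """Flag artists whose latest count far exceeds their own trailing average."""
    flagged = []
    for artist, counts in history.items():
        trailing_avg = mean(counts[:-1])
        if counts[-1] > surge_factor * trailing_avg:
            flagged.append(artist)
    return flagged

print(breakout_candidates(daily_tags))  # -> ['artist_a']
```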

Needless to say, real-time data analysis enables those who react shrewdly to its insights to gain a true advantage over their competitors. Whereas batch data processing only permits strategic, long-haul observation, real-time data processing offers tactical acumen that one can use either to predict the near future… or even change the near future by altering a particular behavior pattern. Nowadays, marketers can use data streaming to adjust the ways in which they engage their audiences in real time. To give a concrete example, it's now entirely possible for a marketer to learn that an important audience segment they're looking to engage with on native display has begun, within the past few minutes, to shift over to mobile video. With that level of granular discernment at their fingertips, marketers can begin engaging with that audience on a meaningful level using an entirely new format and device.
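A toy version of that monitoring loop might look like the following; the format labels, window size, and "dominant format" heuristic are assumptions for illustration, not how any particular platform does it. The idea is simply to keep only the segment's most recent events and ask which format currently dominates.

```python
from collections import Counter, deque

# Keep only the segment's most recent ad interactions; older ones age out.
recent_events = deque(maxlen=1000)

def dominant_format(events):
    """Return the ad format this audience has engaged with most recently."""
    counts = Counter(e["format"] for e in events)
    return counts.most_common(1)[0][0] if counts else None

# Simulated stream: the segment drifts from native display to mobile video.
for fmt in ["native_display"] * 600 + ["mobile_video"] * 700:
    recent_events.append({"format": fmt})

if dominant_format(recent_events) == "mobile_video":
    print("shift detected: steer creative and budget toward mobile video")
```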

As of this moment, our world is beginning to cast aside the old model of batch data processing in favor of real-time, event-based data streaming. And it isn't just businesses that stand to benefit: big data stands to benefit everyone. With the data collection tools currently at our disposal, we can only watch via satellite as brush fires drift lazily across the Amazon Basin. But who knows? Given the proper real-time data streaming capabilities on the ground, we might actually come up with a predictive model that helps put out such fires for good.
