Less than 10% of bid-stream location data is high-quality — and we know how to find it

A step-by-step guide to finding the advertising publishers that report accurate location data within the bid-stream

Natasha Whitney
Aug 30, 2017

By Natasha Whitney and Noah Yonack, SafeGraph

Recently, one of SafeGraph’s partners — a very large player in the advertising ecosystem — was trying to draw insights from the massive number of location signals they saw through the bid-stream. The bid-stream is a constant stream of advertising bid-request information from devices on ad exchanges. A bid-request occurs when a publisher — like a mobile app or website — auctions off a slot to advertisers, which then bid to have their ads shown to the user.

Along with a bunch of other ad-related information, these advertising requests often come with latitude and longitude coordinates representing the location of the device where an ad is to be shown. The stream is a promising source of scale, generating over 20 billion daily bid-requests across over 200 million mobile devices. There’s just one problem: the vast majority of this location data is incomplete, inaccurate, or illegitimate.

Our partner couldn’t draw valid insights from invalid data, so we were faced with two options:

  1. Ignore the location data from the bid-stream
  2. Come up with a clever way to separate out the good location data from the bad

Needless to say, we chose option 2.

SafeGraph identified the publishers that report high-quality data and filtered out more than 90% of the bid-stream in the process

Most of the location data in the advertising ecosystem is inaccurate and sometimes even fraudulent. For example, publishers sometimes don’t have access to GPS data, so they’ll infer a device’s location through an IP address, which is a coarse indicator of location. Other times, users will intentionally fake their locations by using a VPN, fooling publishers into reporting false locations.

Our hypothesis is this: if you look at the bid-stream on a publisher-by-publisher basis, some publishers will send accurate location data — via GPS, for instance — and others won’t.

How do we know what’s real and what’s not? We already know what good data looks like — because we work with it every day.

Over the past two years, we’ve developed the SafeGraph Movement Panel, our database of ultra-accurate GPS-location data that comes from anonymized mobile devices. By combining this database with our expertise in GPS location data, we devised a handful of filters for sorting out the good publishers from the bad. Here’s how we did it:

  1. Sink detection: We can tell that a publisher uses a coarse method of location inference (like IP addresses) when it reports a particular coordinate far more often than we would expect. After all, there are about 1.5 quadrillion unique latitude/longitude pairs in the U.S. at six decimal places of precision, so the probability that two devices are at the exact same latitude and longitude is outrageously low. When we see specific coordinates occurring at impossibly high rates, we call them sinks, and they indicate that a publisher isn’t providing accurate location data. These points are unfortunately common. The most famous sink, in case you’re wondering, is a Kansas farm near the geographic center of the U.S., which just so happens to be the default location that some publishers report when they’re unsure of where an IP address is located.
  2. Teleporting detection: Another red flag is when devices appear to move at impossibly fast speeds, ostensibly traveling large distances between data points that occur next to each other in the bid-stream. We call these teleports, and sometimes they’re quite dramatic. If the publisher tells us that a device is in New York one second and California the next, we assume that its location is being spoofed.
  3. Jumpiness detection: A phenomenon related to teleporting is when two consecutive points are fairly close together but still farther apart than the device could plausibly have moved in the time between them. These jumps might occur if the GPS receiver in your phone lags or if a nearby skyscraper interferes with the location signal. Not even the best publishers can fix this problem entirely, but we can’t trust those that report too many jumpy points.
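The three checks above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not SafeGraph's actual implementation: the `Ping` record, the function names, and every threshold (`min_share`, `min_devices`, and the 50 m/s and 300 m/s speed cutoffs) are hypothetical values picked for the example.

```python
from collections import Counter
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


@dataclass(frozen=True)
class Ping:
    """One location signal from one publisher (hypothetical record layout)."""
    device_id: str
    ts: float   # Unix timestamp in seconds
    lat: float
    lon: float


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates, in meters."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def find_sinks(pings, min_share=0.01, min_devices=2):
    """Coordinates (rounded to 6 decimals) that recur implausibly often.

    A coordinate counts as a sink when it accounts for at least `min_share`
    of the pings and is reported for at least `min_devices` distinct devices.
    Both thresholds are assumptions for illustration.
    """
    counts = Counter()
    devices = {}
    for p in pings:
        key = (round(p.lat, 6), round(p.lon, 6))
        counts[key] += 1
        devices.setdefault(key, set()).add(p.device_id)
    total = sum(counts.values())
    return {
        key for key, n in counts.items()
        if n / total >= min_share and len(devices[key]) >= min_devices
    }


def hop_rates(pings, jump_mps=50.0, teleport_mps=300.0):
    """Return (jump_rate, teleport_rate) over consecutive pings per device.

    Implied speed at or above `teleport_mps` is a teleport; speed between
    `jump_mps` and `teleport_mps` is a jump. Both cutoffs are assumptions.
    """
    tracks = {}
    for p in sorted(pings, key=lambda p: p.ts):
        tracks.setdefault(p.device_id, []).append(p)

    hops = jumps = teleports = 0
    for track in tracks.values():
        for a, b in zip(track, track[1:]):
            dt = max(b.ts - a.ts, 1.0)  # floor at 1 s to avoid dividing by zero
            speed = haversine_m(a.lat, a.lon, b.lat, b.lon) / dt
            hops += 1
            if speed >= teleport_mps:
                teleports += 1
            elif speed >= jump_mps:
                jumps += 1
    if hops == 0:
        return 0.0, 0.0
    return jumps / hops, teleports / hops
```

Running `find_sinks` and `hop_rates` separately on each publisher's pings yields per-publisher sink, jump, and teleport rates. The speed cutoffs would need tuning against real data; a device on a plane, for instance, legitimately moves faster than one in a car.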

Comparing to truth data: Luckily, SafeGraph has access to super-accurate, anonymized location data from over 5% of all mobile phones in the U.S. (as of August 2017), and we’re increasing that percentage every day. Granted, the data problems we’ve identified show up even among the most accurate data producers (we know this because we work with them), but some publishers show these problems much more often than others. If a bid-stream publisher is worse on all of these metrics than all of our high-quality data partners (and almost all bid-stream publishers are), then we know to ignore it.
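One way to read that last rule, sketched in Python. The metric names and the `should_ignore` helper are hypothetical, and this is only a literal reading of the sentence above, not a confirmed description of SafeGraph's pipeline.

```python
# Hypothetical metric names; each value is a rate between 0 and 1.
METRICS = ("sink_rate", "jump_rate", "teleport_rate")


def should_ignore(publisher, trusted_partners, metrics=METRICS):
    """True if the publisher is worse on every metric than every trusted partner.

    `publisher` and each entry of `trusted_partners` are dicts keyed by
    metric name. Higher rates are worse, so "worse than all partners" on a
    metric means exceeding the worst (largest) partner rate for that metric.
    """
    if not trusted_partners:
        raise ValueError("need at least one trusted partner to compare against")
    return all(
        publisher[m] > max(partner[m] for partner in trusted_partners)
        for m in metrics
    )
```

Under this reading, a publisher that matches or beats even the weakest trusted partner on any single metric is kept. A stricter policy could require beating the partners on several metrics at once.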

Most bid-stream publishers report tons of jumpy points and sinks, but SafeGraph’s data partners don’t.

Using these filtering techniques, we threw out troublesome publishers and found that less than 10% of bid-stream data meets our quality standards. Though this number is relatively small, the publishers we whitelist still have a lot of data. And it’s accurate, too.
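The share quoted above is measured in data volume rather than in number of publishers. As a small sketch of that calculation, assuming per-publisher request counts and a whitelist are already available (the `kept_share` name and both inputs are hypothetical):

```python
def kept_share(requests_by_publisher, whitelist):
    """Fraction of total bid-request volume that comes from whitelisted publishers.

    `requests_by_publisher` maps a publisher ID to its request count;
    `whitelist` is a set of publisher IDs that passed the quality filters.
    """
    total = sum(requests_by_publisher.values())
    if total == 0:
        return 0.0
    kept = sum(n for pub, n in requests_by_publisher.items() if pub in whitelist)
    return kept / total
```

Because volume is concentrated in a few large publishers, the share of publishers kept and the share of data kept can differ substantially.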

Whitelisting only the best location data allows quality publishers to get higher CPMs for their inventory. In the past, many ad-tech companies have been wary of using the bid-stream because of its seemingly low quality, thereby lumping the good publishers in with the bad. Now we know that there’s a better way to handle this data, making the bid-stream, and the high-quality publishers within it, all the more valuable.

SafeGraph is the source of truth for human movement data — that is, where people are going and when they go there — and that requires a massive amount of high-quality data. After all, how can we expect our customers to answer society’s toughest questions if our data isn’t ultra accurate?

Excited about the work we’re doing at SafeGraph? Follow us on Medium to stay in the loop!

Join Us: We’re bringing together a world-class team. See our open positions.
