A Quick Primer on Data
From Advertising to Propaganda
Data plays a role in almost everything we do online, but most users just don’t think about it. This is a breakdown of data and how advertisers and propagandists use it.
Like it or not, your personal data’s probably been leaked in some way or another. It could have been the Facebook data breach, the Equifax data breach, the Uber data breach, the OPM data breach, or any of the other breaches that get announced every news cycle. Do you know why that data existed in the first place? Here’s a quick rundown of what data is and how it’s being used.
What do we mean when we talk about data?
An individual datum (the singular of data) is simply a piece of information on its own. You might think of it as something with which you would fill a cell in a spreadsheet. For example, a single datum could be of the form [Name: John].
We use individual data points all the time without thinking about it. When we go to an event and wear a name tag, the name on the tag is all that people know about us (although observers can combine it with other observed data, such as the style of clothes we wear or our hair color). When we receive a call from an unknown number, our phones display the number on our screens.
We often combine data together to do useful things. We create lists of names to know who’s been invited to a party. We create lists of phone numbers, associate them with names, and put them in our phones to make calling one another easier. We create lists of who owes us money, when they owe us money, and how to contact them when they don’t pay.
General Uses of Data: Data on an Individual vs Aggregated Data
Data on an Individual
When we identify individuals, we associate them with various pieces of information. For instance, we might create a new contact in our phone. We can associate [Name: John], [Birthday: 1 January 1980] and [Phone Number: 555–123–456] with that contact. Once the contact is created, we can treat it as an individual datum itself and add it to a list of contacts (our contact book).
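As a rough illustration, the contact above can be sketched in Python. The `Contact` class, its field names, and the `find_by_name` helper are made up for this example; they are not how any particular phone stores contacts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    """One record: several individual data points tied to one person."""
    name: str
    birthday: str
    phone_number: str


# The three data points from the example above, combined into one record.
john = Contact(name="John", birthday="1 January 1980", phone_number="555–123–456")

# The record is itself a datum that can go into a larger list (the contact book).
contact_book = [john]


def find_by_name(book, name):
    """Return every contact in the book whose name matches exactly."""
    return [contact for contact in book if contact.name == name]


print(find_by_name(contact_book, "John"))
```

Each field is a single datum, the record groups them under one individual, and the contact book is a list of such records.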
Companies use data to keep track of individuals all the time. Imagine if banks didn’t keep track of their loans. Likewise, if you want to keep money in a bank, the bank needs to know who you are and how much money you’re keeping with them.
While banks need to keep track of their customers for extended periods of time, other companies need to keep histories of individual transactions. When you order a product online, the company needs to have your address to send you the product, your billing details to get paid, and a variety of other data including the specifics of what you ordered.
Many companies use data on individuals to match users to products. The Netflix algorithm uses data on an individual’s viewing history to suggest movies that a user might like. Spotify uses listening likes to help users discover new music.
Companies also keep track of individuals for advertising purposes. They could simply try to advertise to large populations, say, by going to the local TV station or newspaper. By having databases of people’s likes and dislikes, companies can save money and increase the effectiveness of advertising by individually targeting people based on what they would find interesting or useful.
There are all sorts of other reasons we might want to have data on individuals. Sometimes, though, we care more about overall trends than about what specific people are doing, and in those cases we can aggregate, or combine, the data without tying it back to individuals.
We deal with aggregated data every day, but might not think of it that way. When choosing how early to leave for work on our morning commute, we have to consider what the traffic’s going to be like. We don’t particularly care about the names of any of the people on the road, what cars they drive, or whether any of them have birthdays. We just care how many people are going to be on the road at a given time: the aggregated population.
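The commute example can be sketched in a few lines of Python. The trip records and the `cars_per_hour` function are invented for illustration; the point is only that the aggregate keeps the counts and drops the names.

```python
from collections import Counter

# Hypothetical individual-level records: (driver name, hour they are on the road).
trips = [
    ("Alice", 7),
    ("Bob", 8),
    ("Carol", 8),
    ("Dave", 8),
    ("Erin", 9),
]


def cars_per_hour(records):
    """Aggregate trips into a count per hour, discarding who made each trip."""
    return Counter(hour for _name, hour in records)


counts = cars_per_hour(trips)
busiest_hour, busiest_count = counts.most_common(1)[0]
print(f"Busiest hour: {busiest_hour}:00 with {busiest_count} cars")
```

Nothing in `counts` can be traced back to Alice or Bob, which is what separates aggregated data from data on an individual.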
Aggregated data is important to governments. We undergo a census periodically (every ten years in the United States) so that governments know certain things about populations. For instance, the US Census Bureau posts aggregated data about the US population, such as age and gender distributions (see the image to the left).
Aggregated population data enables governments to plan for the future. It’s hard to allocate funding for roads if one doesn’t know where people are and how they travel. Likewise, different levels of government need data to allocate funding for schools, draw voting districts, and so forth.
Governments can abuse aggregated data. For instance, there’s a lot of debate over the use of gerrymandering, the re-drawing of district lines to change the demographics within a district without actually moving people. The below graphic from the Washington Post explains it pretty well.
Outside of government, companies want aggregated data to know how to market products. When you turn on the TV, you see a lot of ads. Advertisers need to know what their target market looks like so that audiences will like the ads and buy their products. They need to know what audiences find funny, sad, and so forth. Aggregated data lets the advertisers build their ads based on those statistical preferences within populations.
Buying, Selling, and Stealing Data
Legitimate companies have made fortunes off the trading of personal data. You may have heard of a few such as Acxiom (NASDAQ: ACXM), Nielsen (NYSE: NLSN), TransUnion (NYSE: TRU), Experian (LON: EXPN), and Equifax (NYSE: EFX). The feature you’re probably most familiar with is your credit score; the credit bureaus on that list (TransUnion, Experian, and Equifax) estimate whether or not you’re likely to pay back a loan based on your past borrowing and payment behavior, which they acquire from a variety of creditors.
There’s also a black market for data. Thieves can use data for identity theft, credit card fraud, and other schemes. Thieves steal passwords and disseminate them through black markets. Data is transferred from country to country, and there’s not much anyone can do about that.
While the data is valuable, there are also privacy issues that restrict the transfer and retention of data. Laws differ across the world, although the European Union’s GDPR will make collecting data on Europeans harder.
Although there’s a large market for buying and selling data, what the sellers are really doing is copying the data for other people. The originating company generally keeps a copy of the data, as does the buying company. This means that companies that hold data on populations are reluctant to hand the raw individual data to other companies, because the other company could then start selling that data itself.
In advertising, a solution has evolved that allows the social media companies to target users without giving away their users’ data. The social media companies provide aggregated data to their advertising customers, and then the social media companies themselves target individuals based on the individual data. That being said, there are edge cases with social media where third-party apps are able to request access to users’ data from the users themselves (who don’t always understand what sort of data they’re giving up). The third parties can then siphon off that data for their own use, as in the case of Cambridge Analytica.
Combining Individual Data with Aggregated Data
The large online advertisers have large amounts of individual data. In order to sell eyeballs of users to advertisers, they aggregate their users’ data and show it to the advertisers while keeping the data on individual users to themselves.
Online ad companies show the advertisers the size of populations that have certain attributes: Democrats vs Republicans, high income vs low income, new mothers, retirees, students; the list goes on. The advertisers then come up with various messages to send to groups based on the aggregated data, and give the ads to the ad companies, who show them to the individuals.
Using Facebook Ads, for example, an advertiser can find the size of audiences that have certain interests. The below graphic shows that if an advertiser wanted to show an ad to users who are interested in dogs, the target audience would consist of over 300 million users worldwide.
Advertisers can narrow their audience down further based on other interests. In the below example, the target audience is Americans who like dogs and wine but dislike beer, reading, and cats. The advertiser gets to see that the aggregated audience is 160,000 people, but never gets the names of any of those people. Instead, the advertiser tells Facebook to show the ads to those people, and Facebook has the individual data to make that happen.
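To make that split concrete, here is a minimal Python sketch. It is not Facebook’s API: the user records, `audience_size`, and `deliver_ad` are all hypothetical, and “dislikes” is simplified to “is not listed as liking”.

```python
# Hypothetical platform-side data on individuals (never shown to advertisers).
users = [
    {"name": "Ann", "country": "US", "likes": {"dogs", "wine"}},
    {"name": "Ben", "country": "US", "likes": {"dogs", "wine", "beer"}},
    {"name": "Cat", "country": "US", "likes": {"dogs", "wine", "hiking"}},
    {"name": "Dan", "country": "CA", "likes": {"dogs", "wine"}},
    {"name": "Eve", "country": "US", "likes": {"cats", "reading"}},
]


def matches(user, country, include, exclude):
    """True if the user is in the country, likes every included topic,
    and likes none of the excluded topics."""
    return (
        user["country"] == country
        and include <= user["likes"]
        and not (exclude & user["likes"])
    )


def audience_size(country, include, exclude):
    """What the advertiser sees: a single aggregated count."""
    return sum(matches(u, country, include, exclude) for u in users)


def deliver_ad(ad_text, country, include, exclude):
    """What the platform does internally: pair the ad with named individuals."""
    return [(u["name"], ad_text) for u in users if matches(u, country, include, exclude)]


targeting = ("US", {"dogs", "wine"}, {"beer", "reading", "cats"})
size = audience_size(*targeting)
deliveries = deliver_ad("Wine for dog lovers", *targeting)
print(size)
```

The advertiser only ever calls something like `audience_size`; the list of names produced by `deliver_ad` stays inside the platform.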
The amount of data companies use for this sort of targeting can be surprising. If you’ve used the internet at all in the last ten years (and I assume you have because this post is on the internet), you’ve probably noticed that when you search for an item or buy an item, that item starts showing up in ads. Well, that’s because the ad companies took the datum recording your search or action and added it to the rest of the data on you. They can add your browsing and searching behavior as an interest. They even allow advertisers to target you based on whether or not you bought that item recently.
Alternate Uses of Data: Divisive Propaganda
You can use data for divisive propaganda the same way that you’d use data for advertising, except you add an extra step. The concept is similar to the gerrymandering example above.
Consider a population.
Using data points on individuals in the population, we can create a model of the population in as many dimensions as we have topics of interest (e.g. one dimension each for preference towards dogs, cats, beer, and so forth; dimensions can then be combined through dimension reduction). For purposes here, we’ll just assign each individual arbitrary values.
We can then run a classifier over the population in order to categorize it into groups. Let’s say that a Z value of -10 is labeled Republican and +10 is labeled Democrat.
With advertising, we would use these categories to create and target advertisements towards these groups.
However, for divisive propaganda, we want to split the groups up, which just takes another step (we can already create propaganda to promote the differences between Red and Blue, but that’s trivial).
We identify one of the groups and run another classifier over that sub-group. Here, I ran a classifier over the +10 group, which divided the group somewhere along the x axis at x = 5.
Looking at it in 3D again, we can see that while x=5 is close to the boundary, the classifier has a few points that overlap in each of the resulting sub-groups of the Democrats.
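The two-step split can be sketched in plain Python. The post does not say which algorithms produced the figures, so this is only an illustration under stated assumptions: a simple threshold on z for the first split, a small hand-rolled two-cluster k-means for the second, and made-up points chosen so that the second split lands at x = 5.

```python
from itertools import combinations
from math import dist

# Hypothetical population: one (x, y, z) point per person.
# z is the party axis (-10 or +10); x and y stand in for other interests.
population = [
    (1, 3, -10), (6, 4, -10), (3, 6, -10), (8, 5, -10),
    (1, 3, 10), (2, 5, 10), (3, 4, 10), (2, 6, 10),
    (8, 3, 10), (9, 5, 10), (7, 4, 10), (8, 6, 10),
]


def mean_point(points):
    """Coordinate-wise average of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def two_means(points, max_iter=20):
    """Split points into two clusters with a basic k-means (k = 2).

    Centroids start at the two points that are farthest apart, which keeps
    the result deterministic for this small example.
    """
    centroids = list(max(combinations(points, 2), key=lambda pair: dist(*pair)))
    for _ in range(max_iter):
        clusters = ([], [])
        for p in points:
            nearest = 0 if dist(p, centroids[0]) <= dist(p, centroids[1]) else 1
            clusters[nearest].append(p)
        new_centroids = [mean_point(c) for c in clusters]
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return clusters, centroids


# Step 1: split the whole population on the party axis.
red = [p for p in population if p[2] < 0]
democrats = [p for p in population if p[2] >= 0]

# Step 2: split only the +10 group again, this time letting the data decide.
(blue, green), (blue_center, green_center) = two_means(democrats)
boundary_x = (blue_center[0] + green_center[0]) / 2
print(len(red), len(blue), len(green), boundary_x)
```

With real data the second split would rarely be this clean, which is why a few points near the boundary can end up looking like they belong to either sub-group.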
At this point, we now have three distinct groups in the population classified by color: Red (Republicans), Blue (Democrats, partial) and Green (Democrats, partial).
We can amplify the divisions between the Blue and Green groups by promoting the differences in the features that the classifier used to determine the difference between the Green group and the original Blue group. These differences already exist in the population because they were in the data that the classifier used to classify Green as a subset of Blue in the first place!
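One simple way to surface those differences is to compare the average value of each feature in the two sub-groups and rank the gaps. This is a sketch with invented feature names and scores, and `fracture_lines` is a hypothetical helper, not the method used to build the figures.

```python
from statistics import mean

# Hypothetical interest scores (0 to 10) for members of each sub-group.
FEATURES = ("dogs", "cats", "beer")

blue = [
    {"dogs": 8, "cats": 2, "beer": 5},
    {"dogs": 9, "cats": 3, "beer": 5},
]
green = [
    {"dogs": 2, "cats": 3, "beer": 5},
    {"dogs": 1, "cats": 4, "beer": 5},
]


def fracture_lines(group_a, group_b):
    """Rank features by how far apart the two groups' averages are.

    Returns (feature, gap) pairs, largest absolute gap first, where
    gap = mean in group_a minus mean in group_b.
    """
    gaps = [
        (f, mean(p[f] for p in group_a) - mean(p[f] for p in group_b))
        for f in FEATURES
    ]
    return sorted(gaps, key=lambda item: abs(item[1]), reverse=True)


ranking = fracture_lines(blue, green)
for feature, gap in ranking:
    print(feature, gap)
```

The feature at the top of the ranking is the one on which the two sub-groups already disagree the most, so it is the most obvious candidate for a message that drives them apart.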
In order to really exploit the divisive propaganda, it’s helpful to have the data on the individuals in addition to the aggregate data. It is not necessary though. Aggregate data can easily show fracture lines involving a small number of topics: for instance, a social media advertising tool can tell us what percent of the population likes cats and what percent doesn’t. The Facebook tool above even lets the user break down the population using multiple topics; however, the issue there is that the user has to manually select the features in the population to target and is guessing at what might be effective. Using data on individuals, the propagandist can allow the computer to identify the fracture lines algorithmically.
Just like with gerrymandering, there’s no change to the population itself. What’s changed is the aggregate manner in which the views of the population are expressed; rather than the discourse of the population expressing the difference between Red and Blue, the discourse expresses the difference between Blue and Green.
What’s next with our Data?
Unfortunately, data is like Pandora’s box: once opened, it’s out there. A lot of the information about you doesn’t change over time. Your birthday won’t change, your mother’s maiden name won’t change, and neither will the papers you wrote in college or your past browsing history. Short of hunting down that data on every server it’s stored on, that data is going to be around on the internet for a while.
That’s no reason to give up on security though. Just because the data about your past behavior is out there and ripe for exploitation doesn’t mean that you need to continue to give up your current data. As far as how to go about doing that, well, that goes beyond what I’m covering here.