Data Collection: Public Opinion

Published in The Startup · Nov 26, 2020

“If you are not the customer, you are the product being sold” Andrew Lewis

Our online activity is constantly stored, accessed, and analyzed, and our behaviors are profiled and sold. As technology has evolved, the capacity of companies and governments to extract this data has grown dramatically, and the public is becoming increasingly aware of how accessible their private preferences actually are. To date, arguably the most consequential proof of this is the Cambridge Analytica Facebook scandal of 2018, when Cambridge Analytica obtained the private Facebook data of tens of millions of users and sold psychological profiles of these users to political campaigns. What ensued was a two-day congressional hearing in which lawmakers questioned Mark Zuckerberg on his handling of user data. No one knew whether what happened was even illegal, and Congress was playing catch-up.

“How do you sustain a business model in which users don’t pay for your service?” Sen. Orrin Hatch (R-UT)

In the end, the most glaring takeaway from this hearing was that no one really knew what data Facebook stores or how the company uses it.

The public was angry, and concern about how private companies and governments collect and use personal data remains strong. Still, much like Congress in 2018, the public has very little understanding of the subject.

Sample taken by random phone call interviews from those over 18 living in the United States.

Source: “Survey conducted on June 3–17.” Pew Research Center, Washington, D.C. (2019)

Fear of the unknown is one of the fundamental fears from which most other fears originate. For this reason, I hypothesize that the more people learn about how their data is collected and used, the less concerned they will be about it. To study this, I use the Pew Research Center’s 2019 survey on privacy and surveillance. With this data, I test whether there is a link between respondents’ self-perceived understanding of data collection and their concern about it. I also look for similar links with differing degrees of knowledge about technology in general and with differing levels of education.

In the data Pew Research collected through this survey, it is clear that the majority of the American people have very little digital knowledge.

Americans’ understanding of their digital footprint varies by topic. While most people know how to detect a phishing scam, very few know the extent of the security that private browsing can offer, and most respondents overestimate its capacity.

To test respondents’ aptitude on this topic against their levels of concern about data collection, only those who correctly answered one of the most difficult questions (“Do you understand what using a private browser does?”) were selected, and their levels of concern were examined separately.

Visually, those who answered this digital knowledge question correctly appear just as concerned as the sample population as a whole. Still, a chi-squared test did show a statistically significant association between how people answered this question and their level of concern.

Cross-tabulation of levels of concern against whether the respondent answered questions about digital knowledge correctly.

The cross-tabulation pictured above shows why the visual does not align with the statistics. The statistical significance comes solely from those who are “not at all concerned.” Respondents who answered this difficult digital knowledge question correctly were the least likely to say that they were “not at all concerned” about how companies are using the data they collect. When this column is dropped from the cross-tabulation, the correlation disappears. Still, it is interesting to learn that the more extreme stance of “no concern at all” diminishes among a more savvy sample.
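To illustrate the kind of test described here, the following is a minimal Python sketch of a chi-squared test of independence on a cross-tabulation, run once with all four concern levels and once with the “not at all concerned” column dropped. The counts are invented for illustration and are not the Pew figures. The helper names `chi_square_stat` and `chi_square_pvalue` are hypothetical and use only the standard library; this is not necessarily how the original analysis was coded.

```python
import math


def chi_square_stat(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


def chi_square_pvalue(stat, dof):
    """Upper-tail probability of the chi-squared distribution (closed form)."""
    half = stat / 2.0
    if dof % 2 == 0:
        terms = (half ** i / math.factorial(i) for i in range(dof // 2))
        return math.exp(-half) * sum(terms)
    terms = (half ** (i + 0.5) / math.gamma(i + 1.5) for i in range((dof - 1) // 2))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * sum(terms)


# Hypothetical counts, NOT the Pew data.
# Columns: very, somewhat, not too, not at all concerned.
answered_correctly = [118, 163, 59, 5]
everyone_else = [240, 320, 120, 60]

full = [answered_correctly, everyone_else]
reduced = [row[:-1] for row in full]  # drop "not at all concerned"

for label, table in (("all columns", full), ("without 'not at all'", reduced)):
    stat, dof = chi_square_stat(table)
    p_value = chi_square_pvalue(stat, dof)
    print(f"{label}: chi2={stat:.2f}, dof={dof}, p={p_value:.4f}")
```

With these invented counts, the full table is significant at the 5% level and the reduced table is not, which is the same pattern the cross-tabulation above describes.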

To make the categorical data easier to work with in a statistical coding environment, the responses were converted to ordinal values, and a regression line between levels of concern and self-perceived levels of understanding was attempted. This yielded limited results, and the OLS regression output made clear that it was not a useful tactic. This is because the survey’s random sampling did not draw equally from across social groups. For example, college graduates represent about 60% of the sample population in this data.
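As a rough sketch of what this step involves, the Python below maps categorical answers to ordinal codes and fits a simple least-squares line. The response labels, their numeric codes, the sample answers, and the helper `ols_fit` are all assumptions made for illustration. They are not the survey's actual coding or data.

```python
# Assumed response labels, coded in order from least to most.
UNDERSTANDING = {"Nothing": 0, "Very little": 1, "Some": 2, "A great deal": 3}
CONCERN = {"Not at all": 0, "Not too": 1, "Somewhat": 2, "Very": 3}


def ols_fit(xs, ys):
    """Fit y = intercept + slope * x by least squares.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical (understanding, concern) answers, NOT the Pew data.
responses = [
    ("Nothing", "Somewhat"), ("Very little", "Very"), ("Some", "Somewhat"),
    ("Some", "Very"), ("A great deal", "Somewhat"), ("Very little", "Not too"),
    ("Some", "Not at all"), ("A great deal", "Very"),
]
xs = [UNDERSTANDING[u] for u, _ in responses]
ys = [CONCERN[c] for _, c in responses]

slope, intercept, r_squared = ols_fit(xs, ys)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")
```

Coding ordinal answers as evenly spaced integers is itself an assumption, since the gap between “Some” and “A great deal” need not equal the gap between “Nothing” and “Very little.” That is one more reason to read a regression on such data cautiously.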

Instead, it is better to test who in the sample is more likely to believe that the benefits of data collection outweigh the risks. Public opinion on data collection is clearer in this survey question, as it makes respondents weigh their concern in a more practical and inclusive way.

While testing how level of education may predict someone’s concern about data collection, the aforementioned over-representation of college graduates in the data had to be accounted for. To do this, the number of respondents at each education level who answered that the risks outweigh the benefits was divided by the total number of respondents at that level, giving the percentage of each education level that answered this way.
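The normalization described here can be sketched in a few lines of Python. The records below are invented, with college graduates deliberately making up 60% of them to mimic the imbalance, and `share_by_group` is a hypothetical helper rather than the code used for the original analysis.

```python
from collections import Counter


def share_by_group(records, target):
    """Percentage of each group whose answer equals `target`."""
    group_sizes = Counter(group for group, _ in records)
    target_counts = Counter(group for group, answer in records if answer == target)
    return {group: 100 * target_counts[group] / size
            for group, size in group_sizes.items()}


RISKS = "Risks outweigh benefits"
BENEFITS = "Benefits outweigh risks"

# Hypothetical (education level, answer) records, NOT the Pew data.
records = (
    [("High school or less", RISKS)] * 17
    + [("High school or less", BENEFITS)] * 3
    + [("Some college", RISKS)] * 16
    + [("Some college", BENEFITS)] * 4
    + [("College graduate", RISKS)] * 51
    + [("College graduate", BENEFITS)] * 9
)

shares = share_by_group(records, RISKS)
for level, pct in shares.items():
    print(f"{level}: {pct:.0f}% say the risks outweigh the benefits")
```

Dividing within each education level, rather than by the whole sample, keeps the largest group from dominating the comparison simply because it has more respondents.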

A vast majority of respondents, no matter their level of education, believe that the risks of data collection outweigh the benefits: between 82% and 86%, depending on education level.

Contrary to the hypothesis, instead of finding a correlation between understanding/education and concern, a correlation between concern and age was found. While most respondents believe the risk outweighs the benefit, it is very clear that the group most likely to believe in the benefits of data collection is the 18–29 age bracket.

The reason behind this generational divide is not clear in the data. Following the original hypothesis, I tested whether levels of understanding of data collection vary between age groups and found no correlation. One might assume that younger generations use technology more and therefore more often encounter situations in which their data is collected and used. From this one could infer that familiarity breeds indifference, but that is a large leap to make without any data to back it up. What the data did show was a statistical correlation between age and whether respondents think advertisers have successfully tailored ads to their actual interests. So in theory, the more useful a private entity’s data collection is to the public, the more the public might appreciate it.

In conclusion, my hypothesis was rejected. The only evidence of a correlation between understanding data collection and concern about it related to the most extreme stance of not caring at all. This showed that, contrary to my initial expectation, people who understand how companies and the government collect and use their data are more likely to be somewhat or very concerned, while those who have no concern at all tend to understand less about the subject. The one unexpected correlation was with age, and the only explanation found was that younger generations find companies’ collection of their data useful. This suggests that the most promising way to bring public opinion over to the side of big data is to make sure it comes in handy for the public as well.

Written by Áine G