Termite Part Four: Assessing Data Collection, Information Sharing, and Changes to Policies

A data science solution to privacy policies that nobody actually reads

Michael Steckler
5 min read · Oct 3, 2020

In this post, I will explain the remaining three of our six privacy categories: Data Collection and Usage, Information Sharing and Selling, and Policy Changes.

4. Data Collection and Usage

Services can collect many different forms of data from you. The following list only begins to scratch the surface of the types of data websites can collect. Your IP address, software details, account username and password, email address, location, private messages, user data, usage data, voice data, prior purchases you’ve made, items you’ve viewed, your preferred settings and themes, clicking behavior, and other forms of browser history are all subject to data collection.

Initially, we wanted to create a list of all of the types of data that services might collect, and their respective sensitivities. We grappled with whether we wanted to maintain a spectrum of sensitivities, or just two sub-lists: sensitive and non-sensitive data types. A glaringly apparent example of a sensitive data type is your social security number. A less obvious example is your full name. Conversely, an immediately apparent example of a non-sensitive data type is your email address. A less obvious example is your IP address. Collecting this data may seem invasive, but in actuality, IP addresses are an essential part of engaging with the internet.

We quickly realized that the sensitivity of many data types being collected is ultimately subjective. The conundrum here centers on value-based decisions. A service can use your data for good and bad purposes, but determining good from bad is subjective, and not straightforward. Being transparent is certainly good. Yet opinions differ about which data types are appropriate for a given use case, and people feel differently about the privacy/security tradeoff for a given type of service.

We then considered applying Nissenbaum’s Contextual Integrity framework, which centers the user and emphasizes that the ethics of data collection depend on the type of data, the type of company, and the purported use of the data. Using Nissenbaum’s Contextual Integrity framework to determine good from bad practices essentially boils down to user expectations of the service.

We decided against using either of these approaches. In theory, we might be able to identify some listed uses of data within a privacy policy via a complex NLP model. However, it would be too difficult to reach consensus about what data the average user thinks is appropriate to collect and use for a given website or company, so in practice, whatever we designed could be biased and inconsistent. Technically, it would also be too difficult for an algorithm to understand the context of a privacy policy: the most advanced models would require too much compute power, and thus would hurt the latency of our service. As this would be especially difficult to encapsulate in a script, we deferred to precedents set by policymakers with the CCPA and GDPR for documentation about data sensitivity. The EU’s GDPR classifies personal data into three categories: classical personal data, digital personal data, and sensitive data.
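To make this concrete, here is a minimal sketch of what matching policy text against those three GDPR categories could look like. The keyword lists and function below are illustrative placeholders, not Termite’s actual implementation:

```python
import re

# Hypothetical keyword lists, bucketed by GDPR's three-way grouping.
# Termite's actual lists would be far more extensive.
GDPR_CATEGORIES = {
    "classical_personal": ["full name", "postal address", "phone number", "date of birth"],
    "digital_personal": ["ip address", "cookie", "device id", "browsing history"],
    "sensitive": ["social security number", "health", "biometric", "religion"],
}

def flag_collected_data(policy_text: str) -> dict:
    """Return, per GDPR category, the data types the policy text mentions."""
    text = policy_text.lower()
    return {
        category: [t for t in terms if re.search(rf"\b{re.escape(t)}\b", text)]
        for category, terms in GDPR_CATEGORIES.items()
    }

print(flag_collected_data(
    "We collect your full name, IP address, and browsing history."
))
# {'classical_personal': ['full name'],
#  'digital_personal': ['ip address', 'browsing history'],
#  'sensitive': []}
```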

With a narrowed-down, feasible approach, we evaluate data collection policies as follows:

[Image by author: evaluation criteria for data collection and usage]

5. Information Sharing and Selling

Websites can sell or share your information for a variety of purposes. Typically, a service will share or sell your information for the purpose of targeted advertising.

With that in mind, there are cases when services sell your information for purposes unrelated to advertising, such as for medical or research purposes. Businesses may also need to share information for other legal purposes, such as compliance with the Fair Credit Reporting Act (FCRA), the Health Insurance Portability and Accountability Act (HIPAA), or the Gramm-Leach-Bliley Act (GLBA). Hospitals and medical groups are often privately funded, yet not huge centers of profit; they might anonymize personal data when selling or sharing it, but they still need to disclose that. In another example, tax preparation companies like H&R Block need to collect and share information for certain legal purposes. Furthermore, they may aggregate information from many tax returns and sell the resulting analytics and insights to banks, who in turn use that information to devise budgeting programs for their customers. In yet another example, advisory firms often sell statistics in exchange for prospective work. In these examples, the services sell your information to aggregators, or to partners adjacent to the business’s primary service. This all raises the distinction between sharing to use and sharing to sell. Data is often exchanged for other data.

There are also sharing practices that are different from selling, and independent of anything related to advertising, such as when information is shared as part of government requests, criminal investigations, mergers, acquisitions, buyouts, and bankruptcy proceedings.

These are all legitimate reasons, though whether the practices are morally good or bad is debatable. We resolved this conundrum by asserting that not sharing information at all is the ideal. In addition, there are levels to the way a service shares your information: some websites choose to share only anonymized and aggregated data, while others are transparent about sharing personally identifiable information (PII).
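As a rough illustration of how such levels might be detected automatically, here is a hedged sketch. The phrase lists and the ordering are hypothetical; the criteria we actually used are summarized in the rubric below:

```python
# Hypothetical phrase lists for ranking a policy's sharing practices,
# ordered from worst (selling) to best (no sharing disclosed).
SELL_PHRASES = ("sell your information", "sell personal information")
PII_SHARE_PHRASES = ("share your personal information", "personally identifiable")
ANONYMIZED_PHRASES = ("anonymized", "aggregated", "de-identified")

def sharing_level(policy_text: str) -> str:
    """Rough ordering: sells > shares PII > shares anonymized only > none."""
    text = policy_text.lower()
    if any(p in text for p in SELL_PHRASES):
        return "sells information"
    if any(p in text for p in PII_SHARE_PHRASES):
        return "shares PII"
    if any(p in text for p in ANONYMIZED_PHRASES):
        return "shares anonymized/aggregated data only"
    return "no sharing disclosed"

print(sharing_level("We may share your personal information with partners."))
# -> 'shares PII'
```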

We evaluate data sharing and selling policies as follows:

[Image by author: evaluation criteria for information sharing and selling]

6. Policy Changes

New features are always being added. Best practices, industry standards, and laws are constantly being crafted, molded, revised, and updated. Companies can also decide to change a policy of their own volition. All of this can require a service to change its policies in part or in whole. As proxies for transparency, we used 1) the mention and provision of policy archives, 2) the date of last update, and 3) change notifications.
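Here is a minimal sketch of how these three proxies might be pulled from raw policy text. The regular expressions and helper name are illustrative assumptions, not our production code:

```python
import re

# Illustrative pattern for a "last updated" date line in a policy.
LAST_UPDATED = re.compile(
    r"last\s+(?:updated|modified|revised)[:\s]+([A-Za-z]+ \d{1,2}, \d{4})",
    re.IGNORECASE,
)

def policy_change_signals(policy_text: str) -> dict:
    """Check for an archive mention, a last-update date, and a notification promise."""
    lowered = policy_text.lower()
    match = LAST_UPDATED.search(policy_text)
    return {
        "mentions_archive": "archive" in lowered or "previous version" in lowered,
        "last_updated": match.group(1) if match else None,
        "promises_notification": bool(re.search(r"notify you .{0,40}chang", lowered)),
    }

print(policy_change_signals(
    "Last updated: March 3, 2020. We will notify you of any material changes."
))
# {'mentions_archive': False,
#  'last_updated': 'March 3, 2020',
#  'promises_notification': True}
```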

What if a service updates its policy and discloses a questionable practice that it previously did not disclose? Although this may seem unsatisfactory, we felt that it shows a respectable level of transparency. In this category, we valued transparency over actual content.

We evaluated policy changes as follows:

[Image by author: evaluation criteria for policy changes]

In “Termite Part Five: Model Evaluation and Validation,” I narrate the process the Termite team took to evaluate and validate our privacy tool.


Michael Steckler

Data Scientist, Tech Policy Consultant, and Educator. My views do not reflect the views of any organization I am affiliated with.