The Who: Filtering Personal Data
The Internet is the world’s largest repository of user-generated data. Webpages, social media, forums, reviews, blog posts, and search data, when analyzed at scale, can reveal profound insights into consumer preferences and behavior. We at Quilt.AI specialize in interpreting the Internet to lead organizations toward better business decisions.
How do we make sure that the information we gather is anonymized and does not infringe on the privacy of the individual? Although all of the information we collect is publicly available, the open Web holds a large quantity of personally identifiable information (PII), including names, phone numbers, and email addresses.
We are extra careful about personal privacy for both ethical and compliance reasons. Of course, we neither want nor need PII in order to extract insights for a given demographic — we only need aggregate data. We are not in the business of the individual; we are in the business of the cohort. So, how do we discard PII seamlessly while retaining important information?
To address the issue of data anonymization, we needed to build a PII filter. When an item of content is pulled from the Internet, the PII filter should discard any sensitive data that might exist in that content item.
An engineering team might choose to process each content item manually, performing
- a keyword lookup using a table containing all possible values for specific PII categories (e.g., a table containing all possible names), and
- a pattern search to identify phone numbers and email addresses.
This solution is not very robust: a previously unseen name would slip through such a filter. Furthermore, it is not context-aware: in a sentence like “My name is Apple”, the word ‘Apple’ is a PII item that should be discarded, but a simple lookup would not catch it.
With recent advances in machine learning (ML), and specifically in natural language processing, it is possible to filter PII in a more contextual way. For our use case, we found the right tool in Presidio, an open-source Python library from Microsoft that offers pre-trained models for identifying and removing PII from text.
In order to use Presidio, we need two packages: presidio-analyzer and presidio-anonymizer. The former is responsible for the heavy processing and outputs a format that is used by the latter to anonymize and replace information within a sentence with the appropriate tag. Both packages can be installed with the usual pip commands:
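For example:

```shell
pip install presidio-analyzer presidio-anonymizer
```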
After both packages are installed, we need to download a model. Presidio can use either spaCy (the default) or Stanza as its underlying NLP engine. When looking into the different models available from these repositories, we decided to stay with the default spaCy engine and its default English model, which can be downloaded with:
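This is the large English spaCy model that Presidio's documentation recommends:

```shell
python -m spacy download en_core_web_lg
```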
After the model is downloaded, we need to run it and specify the entities we want to detect as well as the language our input text is in. All entities and languages supported by each model can be checked on their respective repo websites. For our test case, we’ll be using the PERSON and EMAIL_ADDRESS entities and the English language.
Let’s instantiate the model, then pass a sentence to it, and see the results:
As we can see, the analyzer outputs a list containing all identified PII entities, including their location within the sentence.
After this, we instantiate our Presidio anonymizer and feed the results of the analyzer to it:
The final result is the text with masked PII entities.
Once we have our anonymized text, we can proceed with our analytics (sentiment, semiotics, etc.) with no risk of infringing on individual privacy.
An open question remains around the “lossiness” of the PII filter. Since a sentence such as “I love Luke but hate Anakin” is transformed into “I love <PERSON> but hate <PERSON>”, do we dilute our insights by applying the filter? While the intuitive answer is yes, it is interesting to note that on large real-world Internet datasets we found little difference in the quality of insights obtained. This is likely attributable to the nature of our datasets: we choose data about brands, places, and experiences rather than data about people. Nevertheless, an intelligent masking system that distinguishes between PERSON1 and PERSON2 might be useful to explore.
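As a sketch of what such numbered masking could look like, the helper below (our own illustration, not part of Presidio) takes the character spans an analyzer would produce and assigns a per-entity-type counter, reusing the same tag whenever the same surface form repeats:

```python
def mask_numbered(text, spans):
    """Replace each (start, end, entity_type) span with a numbered tag.

    `spans` mimics the output of a PII analyzer; repeated surface forms
    (e.g. two mentions of the same name) share the same numbered tag.
    """
    counters = {}   # entity_type -> next number to assign
    tags = {}       # surface form -> assigned tag
    pieces, prev = [], 0
    for start, end, entity_type in sorted(spans):
        surface = text[start:end]
        if surface not in tags:
            counters[entity_type] = counters.get(entity_type, 0) + 1
            tags[surface] = f"<{entity_type}{counters[entity_type]}>"
        pieces.append(text[prev:start])
        pieces.append(tags[surface])
        prev = end
    pieces.append(text[prev:])
    return "".join(pieces)

print(mask_numbered(
    "I love Luke but hate Anakin",
    [(7, 11, "PERSON"), (21, 27, "PERSON")],
))
# -> I love <PERSON1> but hate <PERSON2>
```

Because the tags preserve which mentions refer to distinct people, downstream analytics could, for instance, attribute opposing sentiments to different (still anonymous) individuals.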
At Quilt.AI, we use machine learning to extract cultural meaning from publicly available, anonymized Internet data. Reach out to us at firstname.lastname@example.org for more information!