Termite Part Five: Model Evaluation and Validation

A data science solution to privacy policies that nobody actually reads

Michael Steckler
6 min read · Oct 3, 2020

In this post, I walk through the process the Termite team followed to evaluate and validate our privacy tool.

Modeling and Validation

Validating one’s model is a crucial part of the data science pipeline. In designing a product that could inform users about the impact of a privacy policy on their lives and rights, the Termite team had to decide what is important to our users. The intention is to distill information: it is a classic consumer safety scenario, like an Energy Star eco-friendly rating or a Moody’s bond rating. For all of the validations below, we used the privacy policy of my graduate degree program’s learning management system provider, 2U Inc.

As mentioned previously, our categorization rubric was informed by user surveys of 60+ people, interviews with I School privacy experts, and our own ideas. Curious how our categorization would hold up to scrutiny and scale to other users, we conducted a validation analysis from the point of view of the model as well as from the point of view of our manual analysis. To assess the model’s output, we wanted to evaluate it against the judgment of a privacy expert as well as that of an average internet user, who has not studied data ethics and privacy as closely as, say, MIDS students have. In addition to output quality, we realized we could simultaneously validate the integrity of our own categorization rules, and therefore the consistency of our analytical approach. We therefore used the 2U privacy policy as input and compared our model’s outputs against those of human annotators who evaluated the policy using our rubric.
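
To make that comparison concrete, here is a minimal sketch (using scikit-learn, with placeholder labels rather than our actual annotations) of how agreement between the model, a privacy expert, and an average internet user can be quantified:

```python
# Hypothetical sketch: quantifying agreement between the model and two human
# annotators on the same 2U policy sentences. The label lists are placeholders,
# not our actual annotations.
from sklearn.metrics import accuracy_score, cohen_kappa_score

model_labels  = ["Data Collection and Usage", "Cookies and Tracking", "Policy Changes"]
expert_labels = ["Data Collection and Usage", "Cookies and Tracking", "Policy Changes"]
user_labels   = ["Info Sharing and Selling",  "Cookies and Tracking", "Policy Changes"]

print("model vs expert accuracy:", accuracy_score(expert_labels, model_labels))
print("model vs user accuracy:  ", accuracy_score(user_labels, model_labels))
# Cohen's kappa corrects for agreement that would occur by chance.
print("expert vs user kappa:    ", cohen_kappa_score(expert_labels, user_labels))
```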

First and foremost, we wanted to evaluate our model’s accuracy improvements over different phases of model development. The chart below shows F1 scores, a metric that combines precision and recall, across three development stages: a Linear Support Vector Classifier (LSVC) trained on 1,450 data points, the same LSVC trained on 38,202 data points, and a Multinomial Naive Bayes model trained on 38,202 data points.

Image by author
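
For readers unfamiliar with the metric, here is a minimal sketch of how an F1 score can be computed with scikit-learn. The labels are invented, and macro averaging across categories is assumed here for illustration rather than stated reporting policy:

```python
# Minimal sketch of the F1 computation used to compare development stages.
# y_true / y_pred are invented stand-ins, and macro averaging (which weights
# every privacy category equally) is an assumption for illustration.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [3, 3, 4, 5, 4, 6, 1, 2]   # actual category per sentence
y_pred = [3, 4, 4, 5, 4, 6, 1, 2]   # category predicted by the model

print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall:   ", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1:       ", f1_score(y_true, y_pred, average="macro", zero_division=0))
```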

Category Recognition

We wanted to know how well our model recognizes categories. In the confusion matrix below, the rows represent the privacy category our model predicts a sentence belongs to, while the columns represent the sentence’s actual category, so the diagonal shows the number of correctly predicted sentences for each category; everything off the diagonal is a false positive or false negative. The majority of misclassifications stem from Data Collection and Usage being confused with Info Sharing and Selling, and vice versa, which suggests that our model has trouble discerning between those two categories. On the upside, our model does a great job of correctly identifying the Cookies and Tracking and Policy Changes categories. (A minimal sketch of this evaluation appears after the category key below.)

Confusion Matrix (LSVC — 1450 data points)

Image by author

1: Censorship and Suspension

2: Deletion and Retention

3: Cookies and Tracking

4: Data Collection and Usage

5: Info Sharing and Selling

6: Policy Changes
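
For reproducibility, here is a minimal sketch of the kind of pipeline that produces a matrix like this one, using scikit-learn with a handful of invented stand-in sentences (category numbers follow the key above):

```python
# Minimal sketch of how a confusion matrix like the one above can be produced
# with a TF-IDF + LinearSVC pipeline. The sentences are invented stand-ins for
# our labeled policy data; the category numbers follow the key above.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

train_sents = [
    "We may suspend or terminate your account at any time.",         # 1
    "You may request deletion of your data; backups are retained.",  # 2
    "We use cookies and similar tracking technologies.",             # 3
    "We collect your name, email address, and device information.",  # 4
    "We share your information with third-party partners.",          # 5
    "We may update this policy and will notify you of changes.",     # 6
]
train_labels = [1, 2, 3, 4, 5, 6]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_sents, train_labels)

test_sents = ["Cookies help us track how you use the site.",
              "We may sell aggregated data to advertising partners."]
test_labels = [3, 5]
# Note: scikit-learn puts true labels on the rows and predictions on the
# columns; transpose if you want the orientation described above.
print(confusion_matrix(test_labels, clf.predict(test_sents), labels=[1, 2, 3, 4, 5, 6]))
```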

We also wanted to investigate which words (unigrams) and pairs of words (bigrams) our model flagged as most important for attributing a sentence to a privacy category, and how that changed over the course of model development. The following six tables show the top 10 unigrams and bigrams for three different stages of model development. This analysis suggests that our model is picking up noise in the form of mentions of company names, denoted in red text in the tables. In the future, we could tweak our model to treat company names as stop words and ignore them; a sketch of that tweak appears after the tables below.

Category Recognition by Top 10 Unigrams (LSVC — 1450 data points)

Category Recognition by Top 10 Unigrams (LSVC — 38202 data points)

Category Recognition by Top 10 Unigrams (Multinom. — 38202 data points)

Category Recognition by Top 10 Bigrams (LSVC — 1450 data points)

Category Recognition by Top 10 Bigrams (LSVC — 38202 data points)

Category Recognition by Top 10 Bigrams (Multinom. 38202 data points)
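
Here is a hedged sketch of how tables like these can be generated from a fitted TF-IDF + LinearSVC pipeline, reusing the `clf` from the confusion-matrix sketch; the company stop-word list illustrates the future tweak mentioned above rather than anything we shipped:

```python
# Hypothetical sketch: pull the top-weighted unigrams and bigrams per category
# out of a fitted TF-IDF + LinearSVC pipeline (reusing `clf` from the
# confusion-matrix sketch). The company stop-word list is an invented example
# of the future tweak mentioned above.
import numpy as np

company_stop_words = {"2u", "google", "facebook"}   # illustrative only

vectorizer = clf.named_steps["tfidfvectorizer"]
svc = clf.named_steps["linearsvc"]
feature_names = np.array(vectorizer.get_feature_names_out())

for row, category in enumerate(svc.classes_):
    ranked = feature_names[np.argsort(svc.coef_[row])[::-1]]
    top10 = [f for f in ranked if f not in company_stop_words][:10]
    print(category, top10)
```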

Subcategory Recognition

We also evaluated how well our model classifies subcategories under two different ways of structuring its input. Termite extracts a policy excerpt, determines which of our subcategories it relates to, and outputs a confidence score for that classification. In the first approach, Termite is allowed to extract multi-sentence policy excerpts as well as single sentences. In the second, the model is constrained to pull only individual sentences. Comparing confidence scores between the two methods, we observed that breaking the policy into individual sentences yields improved accuracy.

It is worth noting that sentence breaks affect the policy language that Termite pulls out. Below is an example of how the two approaches yielded different confidence scores.
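
As a rough illustration of the comparison (not our production code), the sketch below scores a multi-sentence excerpt as one unit and then sentence by sentence, reusing the `clf` pipeline from earlier and treating the SVM decision margin as a stand-in for Termite’s confidence score:

```python
# Hedged sketch: score a multi-sentence excerpt as one unit versus breaking it
# into sentences first (reusing `clf` from above). The SVM decision margin is
# a rough stand-in for the calibrated confidence score Termite reports.
from nltk.tokenize import sent_tokenize   # requires nltk.download("punkt")

excerpt = ("We use cookies to remember your preferences. "
           "We may also share usage data with analytics partners.")

# Approach 1: classify the whole excerpt at once.
excerpt_margin = clf.decision_function([excerpt]).max()

# Approach 2: split into sentences and classify each one separately.
sentence_margins = clf.decision_function(sent_tokenize(excerpt)).max(axis=1)

print("excerpt-level margin:  ", round(float(excerpt_margin), 3))
print("sentence-level margins:", [round(float(m), 3) for m in sentence_margins])
```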

Despite augmenting our data to increase the size of our training data set, our model still suffered from category and subcategory imbalance. For example, we had significantly fewer examples of Deletion and Retention policies (2) relative to the other categories. In the future, we will need to annotate more full policies by hand to train our models adequately.
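
One complementary, hedged mitigation while we collect more annotations is to reweight rare categories during training; the sketch below uses scikit-learn’s balanced class weights with invented label counts:

```python
# One hedged mitigation while more annotations are gathered: reweight the rare
# categories (such as Deletion and Retention) during training. The label
# counts below are invented to mimic the imbalance, not our real data set.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.utils.class_weight import compute_class_weight

labels = np.array([4] * 50 + [5] * 40 + [3] * 30 + [6] * 20 + [1] * 10 + [2] * 2)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(labels), y=labels)
print(dict(zip(np.unique(labels).tolist(), np.round(weights, 2).tolist())))

# Or let the classifier handle the reweighting directly:
svc = LinearSVC(class_weight="balanced")
```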

Future Work

Model Tuning and Statistical Analysis

We want to conduct a more extensive exploration of model performance. This would involve testing and fine-tuning our grading mechanism and developing a better public accessibility score. In particular, we want to cover more topics and subtopics, such as the company’s stipulated court of legal jurisdiction and encryption practices; we simply did not have enough time or bandwidth to devise a grading mechanism for these topics. We would also explore giving users the option to adjust their scores if they mark themselves as being within the jurisdiction of the GDPR or CCPA.

Considering that we mostly used frequency-based models, we want to explore models that utilize a greater degree of language contextualization. Full-document sequence models for initial subtopic labelling could also work, but we did not have the time or data for them; this could become achievable with more training data. It would also be interesting to assess how contextualized model performance differs between two strategies: training on entire text paragraphs versus training on single sentences.
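
As a hedged sketch of what that direction could look like (the model name and label count are assumptions for illustration, not a design decision), a small transformer classifier accepts either single sentences or whole paragraphs as input:

```python
# Hedged sketch of the contextualized direction: fine-tune a small transformer
# on our six categories, feeding it either single sentences or whole
# paragraphs. The model name and label count are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=6)

texts = ["We share your information with third-party advertising partners."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits     # shape: (batch_size, 6)
# Before fine-tuning the classification head, these probabilities are
# essentially uniform; the point here is the input/output shape.
print(logits.softmax(dim=-1))
```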

We want to improve the web scraper to assess new terms and conditions whenever you are online and a Terms and Conditions box (i.e., the prompt with “I agree”) pops up. We want to implement local caching of agreements so that we do not have to ping the server for sites the user visits regularly. Additionally, we want to incorporate more caching on the user side (for model scores and website metadata) to similarly decrease server pings. For future model training, we would want to store text bodies in a long-term database.
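
A minimal sketch of what that user-side cache could look like, with a placeholder file name and a hypothetical remote_termite_score function standing in for the server call:

```python
# Hedged sketch of the client-side cache described above: key cached scores by
# a hash of the policy text so repeat visits don't ping the server. The file
# name, cache structure, and `remote_termite_score` function are placeholders.
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("termite_cache.json")

def cached_score(policy_text: str, score_fn):
    key = hashlib.sha256(policy_text.encode("utf-8")).hexdigest()
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    if key not in cache:                      # only hit the server on a miss
        cache[key] = score_fn(policy_text)
        CACHE_FILE.write_text(json.dumps(cache))
    return cache[key]

# Usage: cached_score(policy_text, score_fn=remote_termite_score)
```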

Usability Research

Now that we have launched our minimum viable product, an immediate priority for future work is usability research. One can conduct usability tests actively, such as facilitating a product test over Zoom. Alternatively, a passive approach can involve devising a pre- and post-test survey, providing written test instructions, and perhaps recording the user’s engagement with the product. The former offers more controlled insights, but also risks introducing facilitation bias: having a person guide you through the test does not accurately reflect the reality of end users, who will likely use our product on their own, and user behaviors are influenced by their environments. The latter can be more easily scaled using platforms such as Mechanical Turk or UserTesting. Ideally, we would use a mixture of both methods and report the results of both.


Michael Steckler

Data Scientist, Tech Policy Consultant, and Educator. My views do not reflect the views of any organization I am affiliated with.