Automating Analytics For Large Organisations: The Case For Privacy
Elegant solutions to fix and prevent personal data leaks.
This article will cover a use case related to privacy and how websites handle Personally Identifiable Information (PII), also known as “sensitive information”. We will see how a simple digital marketing tool, such as Google Analytics, can unknowingly collect PII, and how we can deploy elegant solutions to fix these personal data leaks.
Note: The problems and solutions described in this article are tool agnostic and can be applied to all analytics tools, including Adobe Analytics, Piwik Pro, Piano, Matomo, and more.
Evolution of Analytics: From Early 2000s to Now
I have spent most of my career working as a data consultant assisting large, worldwide organisations. This has meant juggling a great number of brands, product lines, and markets — each with their own sets of requirements.
Back in the early 2000s, I set out to implement a then-emergent Omniture SiteCatalyst analytics solution for a printer manufacturer with a digital real estate spanning 42 countries, 16 languages, and 6 business units. I knew I had to take into account the requirements of the global data governance team and tactical teams.
Back in the day, there were no tag management systems; everything was done with on-page JavaScript. Using ASP, JSP, or PHP was not an option.
So, I set out to build a precursor of sorts to continuous integration and deployment (CI/CD): a process that regularly read a specification stored in an Excel file, then generated tens of thousands of webpages with the right tracking code and context-accurate dimensions and attributes (s.props, eVars, and events back in the day). The HTML pages were then automatically posted to the right Web servers. Sweet, sweet automation was purring like a kitten, sites were visited, data was collected in SiteCatalyst, reports were generated, and life was good. Now let’s accelerate to 88 miles per hour.
Fast forward almost 25 years and we have dynamic websites, front-end frameworks, server-side data collection… and more data than we were ever meant to handle.
This also means that private and personal information has become part of this all-you-can-eat data buffet. The worst part? It’s not even intentional! Nine times out of ten, PII is collected out of negligence, incompetence, or ignorance.
Essential Practices in Data Input Sanitization for Analytics
Before we even get started with pesky consent banners, the main problem with PII is developers not realizing that they are exposing form data in URLs as parameters.
There is a very high likelihood that the next newsletter you sign up for will display your e-mail address in the URL, to the tune of &email=your.name@company.com.
This means that any digital marketing technology installed on the website will pick up that URL parameter (i.e., the user’s email) and send it to the analytics tool, breaching both the user’s privacy rights and the tool’s privacy rules.
In Google Analytics 4, for instance, the Page Path + Query String report would include entries such as:
/forms/thankyou?name=Your%20Name&email=john.smith@acme.com
Unfortunately, this potentially applies to all kinds of sensitive information, including social security numbers, credit card numbers, and phone numbers. This usually stems from developers defaulting to using the GET method when coding a form, which exposes form field values in the resulting page URL.
Encourage developers to use the POST method whenever possible: form data is then sent in the body of the request and does not appear in the URL.
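To make the difference concrete, here is a minimal sketch using only Python's standard library. It builds the same form submission twice, once as GET and once as POST, without sending anything over the network. The example.com URL and the form values are placeholders, not taken from a real site.

```python
from urllib.parse import urlencode, urlsplit
from urllib.request import Request

# Fictitious form values, for illustration only.
form_data = {"name": "Your Name", "email": "john.smith@acme.com"}
encoded = urlencode(form_data)

# GET: the form fields are appended to the URL as a query string,
# so any tag reading the page URL can see them.
get_request = Request(
    f"https://www.example.com/forms/thankyou?{encoded}", method="GET"
)

# POST: the form fields travel in the request body; the URL stays clean.
post_request = Request(
    "https://www.example.com/forms/thankyou",
    data=encoded.encode("utf-8"),
    method="POST",
)

print("GET query string: ", urlsplit(get_request.full_url).query)
print("POST query string:", urlsplit(post_request.full_url).query)
print("POST body:        ", post_request.data)
```

With GET, the email address ends up in the query string of the page URL; with POST, the query string is empty and the values only exist in the request body.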
The year is 2024, and any developer worth their salt needs to know the difference between GET and POST methods — and not just for REST APIs.
First-year computer science students are told to sanitize their inputs to avoid the Bobby Tables effect. More broadly, there is a very real need to teach data ethics across tech, and in development in particular.
Development agencies should be incentivized to adhere to data ethics and protection laws, not just to satisfy functional specifications.
Tactics to Prevent PII Data Leaks in Analytics
Most companies don’t realise they are at risk of PII data leakage until after the fact. When it comes to modern marketing technology products, your firewall is equipped with four layers: your development team, your client-side tag management system, your server-side tag management system, and your data collection platform.
Mobilize your Development Team
Train or hire developers who write clean code with data ethics and privacy in mind. When developers are knowledgeable about data ethics and advocate for data privacy, companies reduce the risk of PII getting caught up in URLs and other data elements.
Your client-side tag management system
Any element of a webpage and its JavaScript context can be leveraged for martech tags, so using your TMS to redact or provide replacement values for sensitive data elements is crucial.
For instance, you can use a Google Tag Manager variable to run custom JavaScript and clean up the variables that make up the page’s URL before making it available to individual tags.
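In a tag manager this logic would be written in JavaScript, but the idea is the same in any language. The following Python sketch shows one possible approach: parse the query string and replace the values of known sensitive parameters. The SENSITIVE_PARAMS deny-list and the redact_url helper are hypothetical names chosen for this illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical deny-list: adapt it to the parameter names your forms use.
SENSITIVE_PARAMS = {"email", "name", "phone", "ssn"}


def redact_url(url: str, placeholder: str = "redacted") -> str:
    """Return the URL with the values of sensitive query parameters replaced."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    cleaned = [
        (key, placeholder if key.lower() in SENSITIVE_PARAMS else value)
        for key, value in pairs
    ]
    # Rebuild the URL, leaving scheme, host, path, and fragment untouched.
    return urlunsplit(parts._replace(query=urlencode(cleaned)))


print(redact_url(
    "https://www.example.com/forms/thankyou"
    "?name=Your%20Name&email=john.smith@acme.com&utm_source=newsletter"
))
# → https://www.example.com/forms/thankyou?name=redacted&email=redacted&utm_source=newsletter
```

A deny-list only catches the parameter names you already know about, so it works best alongside pattern-based checks on the values themselves.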
Your server-side tag management system (SST)
Server-side data collection is hot these days because it gives you control over the data you collect before you let said data loose on its way to martech vendors. Using your own SST code, you can essentially run your data pipeline complete with data enrichment, Extract/Transform/Load (ETL) capabilities, and native exports to your favorite cloud platform.
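As a rough illustration of the "transform" step in such a pipeline, the Python sketch below walks through an incoming event payload and masks anything that looks like an email address before the event is forwarded. The event structure, the field names, and the scrub function are assumptions made for this example and do not reflect any particular vendor's schema.

```python
import re
from typing import Any

# Deliberately simple email pattern; it does not catch URL-encoded addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub(value: Any, placeholder: str = "[redacted]") -> Any:
    """Recursively replace email-like substrings in a JSON-like payload."""
    if isinstance(value, str):
        return EMAIL_RE.sub(placeholder, value)
    if isinstance(value, dict):
        return {key: scrub(item, placeholder) for key, item in value.items()}
    if isinstance(value, list):
        return [scrub(item, placeholder) for item in value]
    return value  # numbers, booleans, and None pass through unchanged


# Hypothetical incoming event, as a server-side endpoint might receive it.
event = {
    "event_name": "generate_lead",
    "page_location": "https://www.example.com/forms/thankyou?email=john.smith@acme.com",
    "params": {"notes": ["contact jane.doe@example.org"], "value": 10},
}

print(scrub(event))
```

The function returns a new structure rather than editing the original in place, which makes it easy to log how many values were changed before the cleaned event is passed on.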
Your data collection platform itself
A data collection platform such as Google Analytics 4 provides data redaction by way of the property admin panel for your data stream.
If PII gets past that fourth and final layer, it gets written to your analytics platform’s data, and retroactively editing that data becomes much harder, or even impossible. In the case of Google Analytics 4 data being exported to Google BigQuery, just because the data is modifiable in BQ does not mean that the origin data (in GA4) gets modified!
Leveraging Automation for Enhanced Data Privacy in Analytics
In my team, we automate everything we do because our time is precious. In the case of PII, we have set up the following automated processes for client work:
- Automate the quality assurance phase with headless browsing: replay user journeys and inspect outgoing tag data and any cookies being set for signs of PII. Test data includes fictitious account identifiers, email addresses, names, and other PII-like data elements. Audit data is automatically posted to BigQuery for data health reporting in Google Looker Studio, for instance.
- Automate data audit in Google Analytics 4: using both the Admin API and the Data API, we automate the retrieval of every available dimension before querying each of them for PII patterns such as the “@” sign.
- CI/CD with code repositories: the QA phase and data audit processes are run automatically when a new version of the site’s code is deployed.
- Automation of notifications of changes to data collection: your tag management platform will likely have a feature that sends you an e-mail when a new version of the tag container is published. When that happens, this too can trigger the QA and audit processes.
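The pattern-matching part of the data audit can be sketched in a few lines of Python. The sketch assumes the dimension values have already been retrieved through the reporting API and are passed in as a plain dictionary. The patterns, the find_pii helper, and the sample values are illustrative assumptions, not an exhaustive or production-grade rule set.

```python
import re

# Illustrative patterns only; real audits need broader, locale-aware rules.
PII_PATTERNS = {
    # Matches both a literal "@" and its URL-encoded form "%40".
    "email": re.compile(r"[A-Za-z0-9._%+-]+(?:@|%40)[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # US social security number written as 123-45-6789.
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(dimension_values: dict[str, list[str]]) -> list[tuple[str, str, str]]:
    """Return (dimension, pattern label, offending value) for each match."""
    findings = []
    for dimension, values in dimension_values.items():
        for value in values:
            for label, pattern in PII_PATTERNS.items():
                if pattern.search(value):
                    findings.append((dimension, label, value))
    return findings


# Hypothetical audit input: dimension name -> values seen in reports.
sample = {
    "pagePathPlusQueryString": [
        "/forms/thankyou?name=Your%20Name&email=john.smith@acme.com",
        "/products/printers",
    ],
    "pageTitle": ["Thank you"],
    "customEvent:account_id": ["123-45-6789"],
}

for dimension, label, value in find_pii(sample):
    print(f"{dimension}: possible {label} in {value!r}")
```

In practice the findings would be written to a table for reporting rather than printed, and the offending values themselves should be handled with the same care as any other PII.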
As far as the required tech for these processes goes, you cannot go wrong with a combination of Python or Node.js to call APIs and drive a headless browser with a tool such as Selenium, Playwright, or Cypress.
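Whichever browser tool you pick, the QA check itself comes down to searching the captured outgoing requests for the fictitious values you typed into the forms. Here is a minimal Python sketch of that check. It assumes the headless browser has already produced a list of request URLs; the seeded values, the vendor endpoint, and the helper names are all hypothetical.

```python
from urllib.parse import unquote_plus

# Fictitious test data entered during the replayed user journey.
SEEDED_VALUES = ["qa.tester@example.com", "Quality Tester", "ACC-000123"]


def fully_decode(text: str, max_rounds: int = 3) -> str:
    """Undo nested URL encoding (a page URL is often encoded inside a hit URL)."""
    for _ in range(max_rounds):
        decoded = unquote_plus(text)
        if decoded == text:
            break
        text = decoded
    return text


def leaked_values(captured_urls, seeded_values=SEEDED_VALUES):
    """Return (request URL, seeded value) pairs where test data shows up in a hit."""
    leaks = []
    for url in captured_urls:
        decoded = fully_decode(url).lower()
        for value in seeded_values:
            if value.lower() in decoded:
                leaks.append((url, value))
    return leaks


# Hypothetical requests captured by the headless browser's network listener.
captured = [
    "https://collect.vendor.example/hit?dl=https%3A%2F%2Fwww.example.com"
    "%2Fforms%2Fthankyou%3Femail%3Dqa.tester%2540example.com",
    "https://collect.vendor.example/hit?dl=https%3A%2F%2Fwww.example.com%2F",
]

for url, value in leaked_values(captured):
    print(f"Seeded value {value!r} found in request: {url}")
```

Using distinctive seeded values keeps false positives low: if one of them appears in a vendor request, it can only have come from the form data.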
Embracing AI and Automation for Data Privacy
The automation described in this post technically falls under a broad definition of Artificial Intelligence: machines performing tasks that typically require human intelligence.
Are we there yet, though? No, but very soon, with enough examples of PII to learn from, we will be able to improve detection patterns and apply them to automated QA and audit processes.
What I have observed is that automation for this type of use case saves a sizeable amount of resources and can be easily reused and adapted for future projects.
In the current context of stronger data protection regulation (CCPA, GDPR, ePrivacy, DMA), using automation to detect PII helps show that your company is doing its best to comply with privacy regulations. Furthermore, automation helps detect and correct errors and violations that might otherwise be caught too late, after complaints and legal actions start. The intent of this post is not to use regulation and compliance as scare tactics but to present them as an opportunity to show that your company cares about privacy.
I for one encourage you to try out these methods and let me know what you did with all the precious time you saved!