MiQ’s Data Governance Process on Terabytes of Data (Data Minimisation)

Nagaraj Tantri
MiQ Tech and Analytics
Jan 4, 2022

Data Governance is an important part of Data Management: it is the capability that enables an organization to ensure high data quality throughout the data lifecycle. Getting Data Governance wrong can be expensive: under the EU GDPR, fines can reach €20 million (about £18 million) or 4% of annual global turnover, whichever is greater, and under the CCPA the penalty is $7,500 for every intentional violation of the law.

Data quality covers a wide range of areas: availability, usability, consistency, integrity, and data security.

In this blog, we outline how MiQ handles data security under Data Governance, a practice we call Data Minimisation.

What is Data Minimisation?

Data Minimisation means reducing security risk by abstracting or pseudonymising the actual data, so that raw identifiers are never exposed more widely than they need to be.

So how does this work?

MiQ ingests more than 10 terabytes of AdTech data every day through its microservices architecture. When we ingest data at this magnitude, we also apply data governance while the data is in transit, enabling our analytical capabilities. We onboard 170+ datasets to build our solutions, and our business needs these governed datasets delivered on time.

Focus on data security

MiQ’s datasets serve the AdTech ecosystem, and we onboard users’ PII (Personally Identifiable Information) such as Device ID, IP Address, Lat-Long, and DSP User Cookie ID, which we then use for our custom solutions and product suite. While this data is onboarded with the user’s consent, we still need to provide the right level of security for these datasets and ensure they are used appropriately.

For instance, if you are building insights on a user’s entire journey across a website, you don’t actually need the raw/original user identifiers: such insights rely on aggregations (sums, averages, and so on), which work just as well on pseudonymised identifiers.
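To make that concrete, here is a minimal sketch with a hypothetical salt and toy data (the function and variable names are ours, for illustration): a per-user aggregate comes out the same whether it is grouped by the raw identifier or by a salted hash of it.

```python
import hashlib
from collections import Counter

SALT = "example-salt"  # hypothetical: the real salt would be a managed secret

def pseudonymise(user_id: str) -> str:
    # Salted MD5, as a stand-in for any consistent one-way hash
    return hashlib.md5((user_id + SALT).encode()).hexdigest()

# Toy clickstream: (user identifier, page) pairs
events = [("user-1", "/home"), ("user-1", "/pricing"), ("user-2", "/home")]

# Pages per user: the aggregate is identical whether we group by the raw
# ID or by its hash, so Insights never needs the raw identifier.
pages_per_user = Counter(pseudonymise(uid) for uid, _ in events)
print(pages_per_user)
```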

Our main focus was to create a streamlined data ingestion process in which data, once onboarded, is kept secure and used only for the right purpose, with no path left open for misuse. To build this, we brainstormed extensively on possible solutions and incorporated the right measures to increase data security.

An overview of MiQ’s Data Ingestion Platform

MiQ’s Ingestion Service using Data Governance while transferring the data

For the size of data we ingest at MiQ, we need to handle data security at scale.

What does Security at Scale look like?

To secure users’ PII, we have two main requirements:

  • Ensure the user’s PII is pseudonymised/hashed for Insights — we don’t need the real/original user identifiers to perform insights such as aggregations or to understand a user’s ad journey.
  • Secure the raw user PII for Targeted Advertising (we call this “Activation”, the process of bidding on users to deliver advertisements) — the real/original raw user identifiers must be available so advertisements reach the right audience. This raw data needs to sit behind proper security with access control and must not be easy to access for any other purpose.

Here are the technical solutions we considered:

1. Centralized Cache for User PII Mapping

This solution stores all the user identifier mappings in a centralized cache such as Redis, where the key is a hash of the user identifier and the value is the actual user identifier.
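A minimal sketch of this idea, assuming a locally reachable Redis instance and a hypothetical salt (the function names are ours, for illustration): the hash serves as the cache key, and Activation can resolve it back to the raw identifier.

```python
import hashlib
from typing import Optional

import redis  # third-party client: pip install redis

r = redis.Redis(host="localhost", port=6379, db=0)  # hypothetical endpoint
SALT = "example-salt"  # hypothetical; the real salt would be a managed secret

def store_mapping(raw_id: str) -> str:
    """Cache hashed-ID -> raw-ID so Activation can recover the original."""
    key = hashlib.sha256((raw_id + SALT).encode()).hexdigest()
    r.set(key, raw_id)
    return key

def resolve(hashed_id: str) -> Optional[str]:
    """Look up the raw identifier behind a hash; None if unknown."""
    value = r.get(hashed_id)
    return value.decode() if value is not None else None
```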

Advantages:

  • There would be one data source for Insights and for our Re-targeting requirements
  • We can introduce better security and governance in one place

Disadvantages:

  • Costly to store all the data in one centralized cache, given the sheer size of the cache (a rough sizing sketch follows this list)
  • Cross-region replication of a distributed centralized cache (like Redis) would be burdensome to maintain and, as mentioned, comes at a high cost.
  • A surge in incoming traffic could turn the cache into a bottleneck if it can’t handle the volume of requests. In our case, uploading 10 million user identifiers at a time means hitting this cache for every use case.
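A back-of-envelope estimate of that size, with every figure an assumption for illustration rather than MiQ’s actual numbers:

```python
# Back-of-envelope cache sizing (all figures are assumptions, not MiQ's):
hash_key_bytes = 64          # hex digest of SHA-256
raw_id_bytes = 36            # e.g. a UUID-style device identifier
overhead_bytes = 50          # rough per-key Redis overhead
ids_per_upload = 10_000_000  # "10 million user identifiers" per upload

gb = ids_per_upload * (hash_key_bytes + raw_id_bytes + overhead_bytes) / 1e9
print(f"~{gb:.1f} GB per 10M identifiers, before replication")  # ~1.5 GB
```

That is per upload and per region, before cross-region replication, so the footprint grows quickly across 170+ datasets.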

2. Separate Copy of User Data

We would store two separate copies of the data: one for Insights, containing all the hashed data, and another with the original data required for re-targeting users.

Advantages:

  • Less hassle than storing a global lookup for all identifiers in one place
  • Storing the data in object storage like S3 is cheaper than running a highly available cache for all the user mappings, and the data is distributed and easy to maintain.
  • We can replicate the same data for our testing workloads

Disadvantages:

  • Two copies of the data are stored, i.e., one for Insights and another for ad targeting.
  • Maintaining the integrity of the hashed and raw data, providing the right security, and managing stakeholder expectations all become more challenging.

We ended up choosing the latter solution because of its lower cost and ease of maintenance.

Here is the flow for storing a separate copy of the data:

The entire MiQ Data Governance process

MiQ Data Governance process — here’s a sneak peek of how the entire process works:

  • Load the data into a temporary staging area within S3
  • Verify the data, i.e., if it needs hashing or pseudonymising, apply that logic via a Spark-based template.

The data minimisation techniques we use are listed below, with a short code sketch after the list:

  • User ID (a 64-bit integer) — add a MiQ salt, then apply MD5 hashing.
  • Device ID — convert to lowercase, add a MiQ salt, then apply SHA-256 hashing.
  • IP Address — we use custom logic to derive the Zip Code and other details for the given IP address; we then add a MiQ salt and hash the address using MD5.
  • Lat-Long — truncate to 2 decimal places to reduce location precision.
  • Referral URL — remove query parameters, which might carry PII from the originating website.
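Here is a hedged sketch of those techniques in plain Python. The salt value, the function names, and the placement of the salt (appended after the value) are assumptions for illustration, and the IP geo lookup is omitted:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

MIQ_SALT = "example-salt"  # hypothetical; the real salt is a managed secret

def minimise_user_id(user_id: int) -> str:
    # User ID (64-bit integer): salt it, then MD5-hash
    return hashlib.md5(f"{user_id}{MIQ_SALT}".encode()).hexdigest()

def minimise_device_id(device_id: str) -> str:
    # Device ID: lowercase first, then salt, then SHA-256
    return hashlib.sha256(f"{device_id.lower()}{MIQ_SALT}".encode()).hexdigest()

def minimise_ip(ip: str) -> str:
    # IP address: the zip-code enrichment happens upstream and is omitted
    # here; the address itself is salted and MD5-hashed
    return hashlib.md5(f"{ip}{MIQ_SALT}".encode()).hexdigest()

def minimise_lat_long(lat: float, lon: float) -> tuple:
    # Truncate to 2 decimal places (roughly 1 km) to coarsen the location
    return int(lat * 100) / 100.0, int(lon * 100) / 100.0

def minimise_referral_url(url: str) -> str:
    # Drop the query string and fragment, which may carry PII
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```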

Once hashed or modified, the data is stored for Insights purposes. If the data is also needed for targeted advertising, we keep the raw/original copy in a separate S3 bucket that is hidden from Insights and from any other common usage or misuse.
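A minimal PySpark sketch of this split, assuming Parquet data and hypothetical bucket paths; Spark’s built-in md5 and sha2 functions stand in here for the Spark-based template described above:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("miq-governance-sketch").getOrCreate()
SALT = F.lit("example-salt")  # hypothetical; the real salt is a secret

# Hypothetical S3 paths for the staging, insights, and restricted raw copies
df = spark.read.parquet("s3://example-staging/dsp-logs/dt=2022-01-04/")

# Pseudonymise the identifier columns with Spark's built-in hash functions
hashed = (
    df.withColumn("user_id", F.md5(F.concat(F.col("user_id").cast("string"), SALT)))
      .withColumn("device_id", F.sha2(F.concat(F.lower("device_id"), SALT), 256))
)

# Hashed copy for Insights; the untouched raw copy lands in a restricted bucket
hashed.write.mode("overwrite").parquet("s3://example-insights/dsp-logs/dt=2022-01-04/")
df.write.mode("overwrite").parquet("s3://example-raw-pii/dsp-logs/dt=2022-01-04/")
```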

The Important Security Aspects that we follow

To ensure the raw data is secure (i.e. it should not be directly accessible to anyone), we have made it accessible only through our in-house services, Analytics Studio and Trading Lab. Our stakeholders use these platforms to act on the raw/original data and trigger advertisements; a sketch of the kind of access control that backs this follows the list below.

  • Analytics Studio, an internal platform responsible for analytics, hides direct access to the actual data and provides a wrapper on top of it for analytics and re-targeting. It acts like a clean-room solution within MiQ.
  • Trading Lab, another proprietary platform, lets our traders re-target users for advertisements using the required user PII, while hiding the original data from stakeholders.
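As one illustration of locking a raw-PII bucket down to specific service roles, here is a hedged sketch using boto3; the bucket name, account ID, and role names are all hypothetical, not MiQ’s actual configuration:

```python
import json

import boto3  # AWS SDK for Python

# All names below are hypothetical, for illustration only.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyAllButPlatformRoles",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-raw-pii-bucket/*",
        # Deny reads unless the caller assumes one of the platform roles
        "Condition": {
            "StringNotLike": {
                "aws:PrincipalArn": [
                    "arn:aws:iam::123456789012:role/analytics-studio",
                    "arn:aws:iam::123456789012:role/trading-lab",
                ]
            }
        },
    }],
}

boto3.client("s3").put_bucket_policy(
    Bucket="example-raw-pii-bucket", Policy=json.dumps(policy)
)
```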

Today, we apply this process to terabytes of ingested data, and these data pipelines run periodically. Integrating ELT (Extract, Load & Transform) with data governance has been very important and has helped us achieve better security.

In the future, we are exploring tools like Apache Ranger and Databricks Delta Lake to reduce the duplication of data in S3. Keep watching this space for more blogs like this.
