Data Privacy & Governance: A seed VC’s perspective
Part 1 of our series on the modern data stack
Over the last few years, I’ve come to appreciate the “build in public” movement. While we cannot yet share all the nitty-gritty of running a seed VC in India, I would like to at least start by sharing our perspectives on specific sectors.
In this and the next few pieces, I aim to deep-dive into our perspectives on data privacy, governance, data in the context of generative AI, and so on.
Data may be the new oil, but it sure isn’t depleting or scarce. We’re talking about the sheer volume of data being created globally.
Clearly, we are in the midst of a data explosion, with the volume expected to grow from 45 zettabytes in 2019 to 175 zettabytes by 2025. For a business, internal stakeholder data (such as that of employees, customers, and other businesses) exists across different applications, regions, and devices (like cloud servers and company-run data centers).
The evolution of AI/ML, the lower cost of storing data, and the rising need for businesses to become more data-driven have led to businesses becoming custodians of large amounts of identifiable data. Employees and other stakeholders often have easy and liberal access to this data.
Data privacy has two broad angles to it: an individual’s perspective and a business’s perspective. Over the last few years, much has been said about how users can and should safeguard their own data and what they should know about their privacy. Far less has been discussed about how businesses store, treat, and protect their stakeholders’ data. It is easy to ask individuals for data, but it is technically difficult to safeguard it, and hard for businesses to identify the right mechanisms to protect it, irrespective of how the data is being used.
Now, within the realm of personal data, there are varying levels of identifiability. The article Data Privacy in the Age of Big Data by Matthew Stewart makes for a good read in this space; it offers a clear breakdown of the different levels of data identifiability.
This raises the question of how useful data remains at the lowest levels of the pyramid shown above. Privacy and utility tend to sit at opposite ends of a spectrum: the more thoroughly data is stripped of identifying detail, the less useful it generally becomes for analysis.
Matthew Stewart cites multiple examples where simple anonymization of only one or two fields did little to actually protect the identity of the underlying users. Further, the users could be re-identified by combining multiple data sets.
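To make the re-identification risk concrete, here is a minimal sketch in Python of how two data sets might be joined on shared fields. It is not taken from Stewart’s article; every record, field name, and the helper reidentify below are invented purely for illustration. The “anonymized” orders carry no names, yet the combination of zip code, birth year, and gender is enough to link some rows back to a person in a second, hypothetical directory.

```python
# Hypothetical toy data: all names, fields, and values are invented for illustration.
anonymized_orders = [
    {"zip": "560001", "birth_year": 1988, "gender": "F", "purchase": "glucose monitor"},
    {"zip": "560001", "birth_year": 1975, "gender": "M", "purchase": "running shoes"},
    {"zip": "400050", "birth_year": 1988, "gender": "F", "purchase": "baby stroller"},
]

# A second data set that contains names alongside the same fields.
public_directory = [
    {"name": "Asha", "zip": "560001", "birth_year": 1988, "gender": "F"},
    {"name": "Ravi", "zip": "560001", "birth_year": 1975, "gender": "M"},
    {"name": "Meera", "zip": "400050", "birth_year": 1988, "gender": "F"},
    {"name": "Kiran", "zip": "400050", "birth_year": 1988, "gender": "F"},
]

# Fields that are not identifying on their own but can be in combination.
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def reidentify(anonymized, directory, keys=QUASI_IDENTIFIERS):
    """Return (name, purchase) pairs for rows that match exactly one person."""
    # Index the directory by the combination of quasi-identifier values.
    index = {}
    for person in directory:
        combo = tuple(person[k] for k in keys)
        index.setdefault(combo, []).append(person["name"])

    matches = []
    for row in anonymized:
        candidates = index.get(tuple(row[k] for k in keys), [])
        # A unique match means the "anonymous" row is effectively re-identified.
        if len(candidates) == 1:
            matches.append((candidates[0], row["purchase"]))
    return matches


linked = reidentify(anonymized_orders, public_directory)
for name, purchase in linked:
    print(f"{name} -> {purchase}")
```

In this toy example, two of the three orders link back to a single person. The third stays ambiguous only because two people in the directory happen to share the same combination of fields, which hints at why the size of such groups matters when assessing how anonymous a data set really is.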
Within enterprises, it is not unusual to find data lakes and data warehouses fed by numerous data sources and used for analytics, machine learning, and simple storage. It is also common for employees and (in some cases) even vendors to have indiscriminate access to these data sources.
Let’s look at an example from the e-commerce industry. For an e-commerce company, PII such as addresses is often used by multiple teams within the business. Marketing teams leverage the data to understand campaign performance within certain demographics, operations teams use it to understand their logistical effectiveness, customer support teams use it to troubleshoot problems faced by users, ML teams use it to train product recommendation models for users in a certain area, and so on. The same data therefore gets used across teams, each of which may also store its own copy. This can result in multiple versions of the data that are neither centrally stored nor tracked.
So, are there ways to protect the privacy of a stakeholder and also ensure that the utility of the data is not impacted?
While some industries, such as healthcare, have regulations that define how data should be stored, handled, and managed, in most industries few laws have kept up with technological advancements.
As the cost of storing data continues to fall and businesses continue to store and use increasing amounts of data, it is crucial for them to have the right set of products and policies at each level of their data and tech stack to prevent indiscriminate use of the data stored on their servers.
In the next few articles, I aim to dig deeper into the underlying categories of the data privacy and governance stack as well as the underlying opportunities for businesses in each of these areas.
Meanwhile, if you’re building a product that helps businesses better secure and govern their data, we would love to hear from you!
Arali Ventures is a pre-seed, seed-stage VC from India, investing in entrepreneurs building enterprise-tech solutions for the world. We help shape their journeys through product-market-fit and beyond and scale the offerings to greater heights.
Keep circling back to read our perspectives on enterprise-tech, our portfolio, and seed-stage investing in India.