Delivering continuous data privacy compliance in a complex environment

The year 2018 will be remembered as a turning point for companies that collect, store and process information belonging to other people. Legislation is finally catching up with the idea that electronic data has value, both to its owner and to those who collect and process it. As such, regulations like the General Data Protection Regulation (GDPR) are being introduced to protect the individual in much the same way consumers are protected when they purchase or use physical products.

The internet has gone through three revolutions since the 1970s. The first was the building of the network itself. Pioneers in computer science and networking developed protocols so that one computer could send messages to, and receive messages from, other computers on the same network. The number of computers on the network was low, the bandwidth of the messages was also low, and only the very technically minded could create or understand the messages that were sent.

In the second major phase of the internet, the network expanded to non-technical people. The messages increased in bandwidth and complexity. At this stage the data was mostly text with some images. However, generating and distributing the data was still relatively niche, and the general population were just consumers. During this phase, information creators such as newspapers or large companies generated the data and put it onto their websites as read-only content. Just like Gutenberg’s printing press in the 15th century, the technology enabled media companies to accelerate the production and distribution of data to information-hungry consumers.

Enter Web 2.0 and the ability for consumers to become creators. From newsgroups, to forums, to blogs, the technology evolved so that it was easier and faster for people to add data to the internet…and boy did people like their voice being heard. Internet access moved from a stationary computer at home, to laptops that could be carried around, and eventually to computing devices that fit inside your pocket. But who owned this data? I suppose the question was never really asked.

It was clear in Web 1.0 that the creators were the owners of the material they produced. They stored their news stories on their own expensive hardware and published on their own branded websites. However, when a person writes a status update or a tweet, where does the ownership lie? Well, clearly the tweet is the opinion of the person who wrote it, so I guess the owner is the creator. However, someone is paying for the hardware to collect, store and share my opinion, so I must owe them something for this, right? Well, if I want my opinion to get out to the world, I suppose I don’t mind if the company that stores that data uses it for other purposes…so they can generate some money to keep the lights on in their data center. Where’s the harm in that?

It turns out that companies can make quite a lot of money from other people’s opinions, tweets and status updates - not to mention things like photos, web search habits, or geo-location. The more people contribute, the more the companies learn about them, and the more money they can leverage from that data. Around this time, experts in machine learning realized that the most accurate predictive models were built from the largest training datasets. And so the cycle began: companies wanted data to feed their data scientists, and therefore made it ever easier for people to contribute their data. Again, the question of ownership was not addressed.

Up to this point, we hadn’t seen any downsides to the arrangement. The companies were providing a service and we paid for it with our data, opinions and clicks. Then the system started to look for new ways of making money, and turned its attention from just taking data from the users to pushing targeted advertising in the direction of the creators. It was shown that advertisers could use a person’s data to influence that person’s decisions. Then in March 2018, it was revealed how Facebook data had been used to try to influence votes in a general election.

In light of these revelations, governments decided that the companies needed to be held accountable for how they used the data they collected. And the onus was put on the owners of the data…the individual.

So now we have reached a new era of the internet, one where the companies who store the data are accountable to the individual, and the individual is responsible for their own data and how it is used. Very few companies or individuals know what this means or how it affects them yet. The first regulations have set out some guidelines for how this should be implemented. GDPR focuses its attention on the following issues:

  1. Data must be collected for specific and explicit purposes
  2. Data must be accurate and maintained
  3. Data can be retained only for as long as it is needed
  4. Data must be processed lawfully, transparently and fairly
  5. Data must be processed securely and you must be able to prove this
  6. Data held must be adequate, relevant and limited to what is needed

So how can companies fulfill these obligations? Firstly, they need to understand the data that they store and have strict definitions on how they use it. Secondly, they need to have explicit consent from the individuals for any processing on that data.

Managing this process is a non-trivial task, especially in complex environments. In healthcare, finance or government, an individual’s data is spread across a vast landscape of applications, databases and formats within the same company or organization. Generating one-off data inventories or documenting data flows is futile, because the documentation becomes obsolete the moment it is renamed “CompleteDataInventory_final.pdf” & “CompleteWorkflow_final.pdf” and copied to a buried folder on a shared network drive.

Staying on top of compliance in a dynamic environment requires an understanding of the systems and real-time monitoring of the data flows between those systems. Consent needs to evolve from qualitative written documents to a programmatic contract between the owners of the data and the data custodians.

The first step is understanding the environment and the inter-connectivity of the applications and data stores. This monitoring should not only collect up-to-date information from each application, but also calculate a parametric Risk Score based on key information about connectivity, access, data sensitivity and number of users. Over time, it may even be possible to identify anomalous activity and flag it to the security team.
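As a minimal sketch, such a parametric Risk Score could be a weighted sum of normalized risk factors. The specific weights, normalization caps and factor names below are illustrative assumptions, not a prescribed model:

```python
from dataclasses import dataclass

@dataclass
class SystemProfile:
    connections: int       # links to other applications/data stores
    external_access: bool  # reachable from outside the network
    sensitivity: float     # 0.0 (public) .. 1.0 (special-category data)
    users: int             # accounts with access to the data

def risk_score(p: SystemProfile,
               max_connections: int = 50,
               max_users: int = 1000) -> float:
    """Weighted sum of normalized risk factors, scaled to 0-100."""
    factors = [
        (0.25, min(p.connections / max_connections, 1.0)),
        (0.20, 1.0 if p.external_access else 0.0),
        (0.35, p.sensitivity),
        (0.20, min(p.users / max_users, 1.0)),
    ]
    return 100 * sum(weight * value for weight, value in factors)

# A hypothetical electronic health record system
ehr = SystemProfile(connections=12, external_access=True,
                    sensitivity=1.0, users=400)
print(round(risk_score(ehr), 1))
```

Because the score is parametric, the weights can be tuned per organization, and re-scoring a system as its connectivity or user base changes keeps the risk picture current rather than a one-off snapshot.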

From this initial map of the system environment, we can discover and rank personal data using data discovery techniques. Essentially, we search for known personal data (first name, email, Medical Record Number) and then search in the proximity of that data for new personal data (telephone number, health record, emergency contact) to add to our catalog.
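A rough sketch of that proximity search, assuming plain-text records and deliberately simplified regex patterns (real discovery tooling would use far more thorough patterns and classifiers):

```python
import re

# Patterns for personal data we already know how to recognize
KNOWN_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "mrn":   re.compile(r"\bMRN[- ]?\d{6,}\b"),
}
# Candidate patterns we only trust when found near known personal data
CANDIDATE_PATTERNS = {
    "phone": re.compile(r"\b\d{3}[- ]\d{3}[- ]\d{4}\b"),
}

def discover(text: str, window: int = 80) -> set:
    """Find known identifiers, then scan nearby text for candidates."""
    found = set()
    for label, pattern in KNOWN_PATTERNS.items():
        for match in pattern.finditer(text):
            found.add(label)
            # Search a window of characters around the hit
            lo, hi = max(0, match.start() - window), match.end() + window
            nearby = text[lo:hi]
            for c_label, c_pattern in CANDIDATE_PATTERNS.items():
                if c_pattern.search(nearby):
                    found.add(c_label)
    return found

record = "Patient MRN-0012345, contact jane@example.com, tel 555-010-2233"
print(sorted(discover(record)))  # ['email', 'mrn', 'phone']
```

Requiring candidates to appear near known personal data is what keeps the catalog growing without flooding it with false positives from unrelated numbers elsewhere in the data store.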

To ensure integrity of data usage, which may include third party data custodians, we can use distributed ledger and blockchain technologies. Then we can record transactions on the data in a shared, tamper-evident audit trail. This audit would focus solely on transactions specific to compliance (access, transfer, export or processing).

Using a distributed ledger with blockchain technologies, we can ensure integrity of the data transaction audit

A blockchain links all entries in an audit trail using cryptographic hashes: each entry references the hash of the previous entry, right back to the very first entry in the log. If a malicious actor tries to change an entry, they break the chain, because the value referenced in every subsequent link would change. To cover up the change, they would need to rewrite all subsequent entries. This is especially difficult if there are multiple nodes or organizations sharing the same ledger (a distributed ledger).
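The hash-chaining idea can be sketched in a few lines. This is a minimal single-node illustration assuming SHA-256 over JSON-serialized entries; a real distributed ledger adds replication and consensus across nodes:

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def entry_hash(entry: dict) -> str:
    """Deterministic SHA-256 over the entry's event and prev fields."""
    return hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()

def append(chain: list, event: dict) -> None:
    """Add an audit event, linking it to the previous entry's hash."""
    entry = {"event": event,
             "prev": chain[-1]["hash"] if chain else GENESIS}
    entry["hash"] = entry_hash(entry)  # hash covers event + prev only
    chain.append(entry)

def verify(chain: list) -> bool:
    """Each entry must reference its predecessor and hash correctly."""
    prev = GENESIS
    for e in chain:
        expected = entry_hash({"event": e["event"], "prev": e["prev"]})
        if e["prev"] != prev or e["hash"] != expected:
            return False
        prev = e["hash"]
    return True

trail = []
append(trail, {"action": "access", "subject": "patient-42", "by": "dr-lee"})
append(trail, {"action": "export", "subject": "patient-42", "by": "billing"})
print(verify(trail))           # chain is intact
trail[0]["event"]["by"] = "x"  # tamper with an early entry...
print(verify(trail))           # ...and verification fails
```

Altering any entry changes its hash, so every later entry's `prev` reference no longer matches, which is exactly the tamper-evidence the paragraph above describes.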

We now have to make sure that all the accessing and processing taking place on the data is listed in the contract/consent with the data owner. Therefore, the third task is to collect and formalize all existing consents. Most consents for web services are collected online. However, in healthcare and some financial organizations, consent is still collected on paper. To be fully transparent to an end user on compliance, each consent needs to be specific about the data attributes it refers to and the exact usage or recipients of the data.

A model of a consent in a continuously compliant environment

Using this formal definition of a consent, we can continuously check that each transaction on the data is properly consented and block any transaction that is not covered.
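A hypothetical sketch of such a machine-readable consent and the transaction gate built on it; the field names and structure below are illustrative assumptions, not any standard consent schema:

```python
from dataclasses import dataclass, field

@dataclass
class Consent:
    subject: str                                   # the data owner
    attributes: set = field(default_factory=set)   # data fields covered
    purposes: set = field(default_factory=set)     # permitted uses
    recipients: set = field(default_factory=set)   # permitted processors

def is_permitted(consent: Consent, attribute: str,
                 purpose: str, recipient: str) -> bool:
    """Allow a transaction only if the consent explicitly covers it."""
    return (attribute in consent.attributes
            and purpose in consent.purposes
            and recipient in consent.recipients)

c = Consent(subject="patient-42",
            attributes={"email", "mrn"},
            purposes={"treatment"},
            recipients={"hospital"})
print(is_permitted(c, "email", "treatment", "hospital"))  # covered
print(is_permitted(c, "email", "marketing", "hospital"))  # blocked
```

The key design choice is default-deny: anything not explicitly named in the consent is blocked, which is what makes continuous checking of every transaction meaningful.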

I believe that this era of democratizing the data on the internet is set to be yet another revolution. Companies who commit to respecting ownership rights and allowing transparent transactions for online data will be pioneers for how people will use the internet in the future. Implementing the steps above will not only allow continuous compliance across a large and complex environment, but also enable an organization to adapt and add value to their data and processes.