A summary of “Trusted Data — A New Framework for Identity and Data Sharing”
I read “Trusted Data”, an MIT Connection Science and Engineering book. This is a brief summary of its revolutionary framework for an Internet of Trusted Data.
Trusted Data — A new Framework for Identity and Data Sharing edited by Thomas Hardjono, David L. Shrier and Alex Pentland introduces a revolutionary architecture and framework to build the Internet of Trusted Data. The aim of the framework is to allow efficient real-time data and insights sharing while preserving data privacy — a challenge much discussed these days as demand for data increases while international data protection regulations put new restrictions on how data can be used.
This is a summary of the core concept, and it only touches on the framework at a very high level. I admit the 380 pages of theory, research and analysis won’t keep you up at night, but the project is super interesting. Also, Shrier is the main lecturer of the Blockchain Strategy programme at Oxford University’s Saïd Business School, which I took. Shrier does a lot of work in the Blockchain, Digital ID and data sharing space, which is another reason why I was keen to read more about this work.
Who should read it?
The book is interesting for anyone who wants to learn more about alternative frameworks for data sharing and digital IDs. You should have at least a basic understanding of data frameworks and Blockchain technology; otherwise the first section of the book, which discusses the technical execution of the framework, can be challenging to follow.
What is the issue at hand?
Data has widely been dubbed the world’s most valuable resource. No one denies that in the digital age data is power, and that data-driven decision making will change life as we know it. However, our current data sharing ecosystem is flawed, outdated and not fit for purpose. The book addresses the following issues:
Lack of identification online
Although we’re constantly connected, there is still no safe and secure way to identify yourself online. As the book puts it, “on the internet, no one knows whether you’re a dog”. Our existing identification system is still analog in a world that is increasingly digital. This makes it hard to verify who you are, and it also makes identity theft easy, with cases now regularly making headlines.
The existing infrastructure doesn’t address data privacy
Over the years, our existing infrastructure was built on the go to meet the growing demand for data, without much consideration for data privacy. Although data protection regulations are being introduced internationally to address this, robust data privacy online will be almost impossible to achieve without a technical architecture that has “Privacy by design” principles at its core.
Data is largely stored in silos
Huge amounts of valuable data are collected every minute. However, that data is largely inaccessible. Traditionally, IT systems have been built in silos and aren’t compatible with each other. This is extremely inefficient for two reasons:
- The real value of data is unlocked when it is combined. For example, combining medical data with location data could detect the threat of a potential pandemic at its very root (imagine we could have detected and tracked down the coronavirus right from the very beginning).
- It drains massive amounts of capacity. The same data is stored many times over across different datastores. Just think about how often you have shared your home address. On top of that, data is often outdated and inaccurate, and there is no system or record in place to track and trace any changes made.
What is the solution?
The framework and architecture are called the Internet of Trusted Data. It allows efficient and accurate data sharing while preserving data privacy. Essentially, the book proposes a framework that is fit and worthy to handle “the oil of modern society”.
The book discusses in depth how the software architecture works, including the deployment plan, security, technical requirements, governance and how it fits into today's society. This summary touches on the main points and characteristics of the OPAL framework:
- Personal Data Storage or PDS
- Data insights through vetted algorithms
- Consent to access data
- Digital IDs
1. Personal Data Storage or PDS
A key part of the framework is the data repositories, also called “Personal Data Storages” (PDS). Rather than having your data all over the place, duplicated and copied, all of your raw data is stored in a designated PDS. The PDS has some unique characteristics:
a) The data never leaves the repository
That seems at first counter-intuitive as the goal of the framework is to allow for more and accurate data sharing — we’ll get to that. For now, the thought process is that you know where your data is stored, that there is only one copy of it and that the data never leaves its place unless you want to move it elsewhere.
b) The PDS is offered as a service to you
These repositories could be managed by existing companies and institutions (e.g. banks), or a new type of service provider could offer PDS as a service to you. You could have several PDS with different providers, each storing a different part of your data (health data, financial data, etc.). The idea is the same in every case: the repository provider stores your data on your behalf, and even they don’t have access to it.
c) The data is encrypted and storage is distributed
Storing this valuable data in one single place would make it a prime target for hacking attacks. To protect your data, the framework has several measures in place. The two most important are encryption and distributed storage.
The data is, and always stays, encrypted, while the repository is split and distributed over many different places, each containing fragments of your encrypted data. This makes it almost impossible for malicious hackers to get access to it. How does that work? Imagine all of your data was locked up in one safe. The hacker would need to “unlock” only a single safe to get access to everything. Instead, imagine your data is shredded into pieces and distributed among many different safes that are positioned in unknown places. If a malicious player accesses one safe, they only get fragments of unreadable data that have no value to them. They’d need to hack into all of the safes at the same time and decrypt the data to be able to access it.
Only you hold the key that can grant third parties access to your data.
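To make the “many safes” picture a bit more concrete, here is a minimal Python sketch. It uses plain XOR-based secret splitting, which is my own stand-in and not necessarily the scheme the book has in mind; the function names and the sample record are made up for illustration. The point it shows is the one from the analogy: every fragment on its own is random noise, and only all fragments together give the data back.

```python
import secrets


def split_into_fragments(data: bytes, n: int) -> list[bytes]:
    """Split data into n fragments; all n are needed to rebuild it."""
    if n < 2:
        raise ValueError("need at least two fragments")
    # The first n-1 fragments are pure random bytes, as long as the data.
    fragments = [secrets.token_bytes(len(data)) for _ in range(n - 1)]
    # The last fragment is the data XORed with every random fragment.
    last = data
    for frag in fragments:
        last = bytes(a ^ b for a, b in zip(last, frag))
    fragments.append(last)
    return fragments


def rebuild(fragments: list[bytes]) -> bytes:
    """XOR all fragments together to recover the original data."""
    result = bytes(len(fragments[0]))  # starts as all zero bytes
    for frag in fragments:
        result = bytes(a ^ b for a, b in zip(result, frag))
    return result


record = b"home address: 1 Example Street"  # hypothetical sample data
safes = split_into_fragments(record, 4)     # one fragment per "safe"
restored = rebuild(safes)                   # needs every single safe
```

In a real deployment the record would also be encrypted before splitting, and the fragments would sit with different operators. The sketch leaves both out to keep the idea visible.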
2. Data insights through vetted algorithms
Remember we said that your raw data never leaves your PDS? So the question is how someone can access it if they need to. The software architecture to manage that is called OPAL or “open algorithms”. Service providers usually need data to offer their service to you, e.g. life insurers need certain data about you to give you a quote. At the moment they’d collect the data (hoping the data is true and correct) and run an analysis on their local systems to come up with a quote.
With OPAL, instead of copying the data and storing it somewhere else, the insurer can send insight queries to your PDS using algorithms. The algorithms come to your data storage, run the analysis behind the firewall and return the requested result or insight to the data querier. Hence, the repository needs to have the capacity to receive, execute and evaluate a query against the available data. The insurance company doesn’t need to know what your health history looks like; they just need a yes or no on certain aspects of your health to give you a quote. Similarly, governments could query insights on the daily commute data of millions of people in real-time to assess traffic and improve infrastructure, without ever knowing who you are or being able to track you down. Rather than sharing identifiable information, the querier receives anonymous insights.
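Here is how I picture the “algorithm comes to the data” idea, as a small Python sketch. The class `PersonalDataStore`, the function `is_non_smoker_under_50` and the sample record are all my own hypothetical names, not OPAL's actual interface. What it shows is only the direction of travel: the function goes in, a yes or no comes out, and the records stay where they are.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PersonalDataStore:
    """Hypothetical PDS: raw records stay inside, only answers leave."""
    _records: dict = field(default_factory=dict)

    def run(self, algorithm: Callable[[dict], bool]) -> bool:
        # The algorithm executes here, next to the data. It receives a
        # copy, and the caller only ever sees a single True or False.
        return bool(algorithm(dict(self._records)))


def is_non_smoker_under_50(records: dict) -> bool:
    """Example insight an insurer might ask for (made up for this sketch)."""
    return records["age"] < 50 and not records["smoker"]


pds = PersonalDataStore({"age": 41, "smoker": False, "diagnoses": ["asthma"]})
answer = pds.run(is_non_smoker_under_50)  # the insurer receives only this
```

Forcing the result to a boolean limits what a single query can reveal, but it doesn't stop someone from asking many cleverly chosen questions, so restricting which algorithms may run still matters.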
In addition, every (data insight) transaction has a unique identifier and is irreversibly logged with a timestamp on a distributed ledger, making every data transaction traceable and transparent, giving a single source of truth to the authenticity and history of a given data set.
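The logging idea can be sketched as a hash-chained log in Python. This is a single-machine stand-in that I wrote for illustration: it has no distribution and no consensus, and the class name `QueryLog` and its fields are my own assumptions. It only shows why changing an old entry becomes detectable once every entry commits to the one before it.

```python
import hashlib
import json
import time
import uuid


class QueryLog:
    """Append-only log; each entry stores the hash of the previous entry."""

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(entry: dict) -> str:
        # Hash every field except the stored hash itself.
        payload = {k: v for k, v in entry.items() if k != "hash"}
        encoded = json.dumps(payload, sort_keys=True).encode()
        return hashlib.sha256(encoded).hexdigest()

    def append(self, querier: str, algorithm: str) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        entry = {
            "id": str(uuid.uuid4()),    # unique transaction identifier
            "timestamp": time.time(),   # when the insight was requested
            "querier": querier,
            "algorithm": algorithm,
            "prev_hash": prev_hash,     # link to the previous entry
        }
        entry["hash"] = self._digest(entry)
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash and check that the chain is unbroken."""
        prev_hash = "0" * 64
        for entry in self.entries:
            if entry["prev_hash"] != prev_hash:
                return False
            if entry["hash"] != self._digest(entry):
                return False
            prev_hash = entry["hash"]
        return True


query_log = QueryLog()
query_log.append("insurer", "is_non_smoker_under_50")
query_log.append("city-council", "average_commute_minutes")
chain_is_intact = query_log.verify()
```

If anyone edits an earlier entry afterwards, `verify()` returns False, because the stored hash no longer matches the entry's content.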
You might wonder where these algorithms come from. In essence, algorithms that can run on encrypted data to give insights have to be vetted by an official body (subject matter experts in the field) against a set of rules and criteria that are decided on by the participants of a trusted framework. For example, this could be a trusted framework of participants in the health care system, like hospitals and health information system operators. When an insurance company requests insights, the algorithm runs against all the PDS you hold with any of the providers. The algorithms themselves can be imagined as approved commands the querier can choose from.
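A rough Python sketch of “approved commands the querier can choose from” could look like this. The registry class, the algorithm name and the two sample stores are hypothetical, and the real vetting process is of course an organisational one that no code can replace. The sketch only shows the two mechanics described above: an unapproved algorithm is refused, and an approved one runs against every PDS a person has.

```python
from typing import Callable


class VettedRegistry:
    """Hypothetical catalogue of algorithms approved by a vetting body."""

    def __init__(self):
        self._approved = {}

    def approve(self, name: str, algorithm: Callable[[dict], bool]) -> None:
        self._approved[name] = algorithm

    def get(self, name: str) -> Callable[[dict], bool]:
        if name not in self._approved:
            raise PermissionError(f"algorithm {name!r} has not been vetted")
        return self._approved[name]


def query_all(registry: VettedRegistry, name: str, stores: list) -> list:
    """Run one vetted algorithm against each of a person's data stores."""
    algorithm = registry.get(name)
    return [algorithm(store) for store in stores]


def has_listed_condition(records: dict) -> bool:
    """Made-up example check; stores without diagnoses answer False."""
    return bool(set(records.get("diagnoses", [])) & {"diabetes", "asthma"})


registry = VettedRegistry()
registry.approve("has_listed_condition", has_listed_condition)

# Two stores held with different providers (health and finance).
stores = [{"diagnoses": ["asthma"]}, {"account_balance": 1200}]
results = query_all(registry, "has_listed_condition", stores)
```

Asking for a name that was never approved raises `PermissionError` instead of touching any data.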
Achieving this requires organisations, businesses and institutions to collaborate. The idea is that your data is stored in designated places, and data queriers can make data requests using vetted algorithms. You can be sure your data is safe, and queriers can be sure to get authenticated insights in real-time.
3. Consent to access data
Another important aspect of this architecture is that while repository operators offer PDS as a service, they do not have the power to decide what happens with the data. The data belongs to the data owner, who decides what happens with it. In our current data economy, your data is shared with and sold to third parties without you knowing it. In the OPAL framework, a query can only run once the data owner has given consent to the querier.
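As a small illustration of owner-controlled consent, here is a Python sketch in which the owner grants and revokes permission per querier and per algorithm. The granularity and all names are my own assumptions; the book may model consent differently.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentLedger:
    """Hypothetical record of which querier may run which algorithm."""
    _grants: set = field(default_factory=set)

    def grant(self, querier: str, algorithm: str) -> None:
        self._grants.add((querier, algorithm))

    def revoke(self, querier: str, algorithm: str) -> None:
        self._grants.discard((querier, algorithm))

    def allows(self, querier: str, algorithm: str) -> bool:
        return (querier, algorithm) in self._grants


consent = ConsentLedger()
consent.grant("insurer", "is_non_smoker_under_50")

insurer_may_run = consent.allows("insurer", "is_non_smoker_under_50")
broker_may_run = consent.allows("data-broker", "is_non_smoker_under_50")
```

The repository operator would check `allows(...)` before executing anything, so nothing runs by default and the owner can withdraw consent at any time with `revoke(...)`.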
4. Digital IDs
The Trusted Data framework offers a potential solution for a digital identity that is both verifiable and secure. First, there is the core identity of a person. This core identity is similar to an ID card or passport in digital format and can be issued by trusted entities, such as the government. It is accessible exclusively to its owner and is never to be shared with anyone. From there, the owner can create so-called “personas”, or pseudonymous identities. A persona bundles a set of attributes of a person: you can have a “work” persona, a “government” persona and a “health” persona that are unique to you yet don’t leak any sensitive information.
For example, when you apply for a job online, you shouldn’t have to share information about your marital status, birth date, religion or gender. Instead you share your “work” persona with the company. As the attributes are connected to your core identity, any information shared with your work persona is verified and can be trusted. Your diplomas, work certificates, etc. can be connected to that persona, and companies have the assurance that the information shared is true, while their decision making isn’t influenced by data such as gender, religion or age. This offers greater equality and protects against bias. The same goes for applying for a mortgage, for instance. You can share your “finance” persona, which gives the broker insights about your financial stability, spending behaviour etc. but no data about what you spent your money on, or about your age and gender, which (even though they shouldn’t) could influence the decision negatively.
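One way to picture how several personas can hang off a single private core identity is to derive a separate pseudonym for each persona from a secret only the owner holds. The Python sketch below does this with HMAC-SHA256. This is a technique I picked for illustration, not the mechanism described in the book, and it leaves out the part where an issuer vouches for the attributes attached to a persona.

```python
import hashlib
import hmac
import secrets


def derive_persona_id(core_secret: bytes, persona: str) -> str:
    """Derive a stable pseudonym for one persona from the core secret."""
    return hmac.new(core_secret, persona.encode(), hashlib.sha256).hexdigest()


# Hypothetical core secret; in this sketch it never leaves the owner.
core_secret = secrets.token_bytes(32)

work_id = derive_persona_id(core_secret, "work")
health_id = derive_persona_id(core_secret, "health")
finance_id = derive_persona_id(core_secret, "finance")
```

Each identifier is stable, so an employer always sees the same “work” persona. Without the core secret there is no practical way to tell that `work_id` and `health_id` belong to the same person.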
A system like this requires a global effort on Identity and Access Management, with authority distributed among many trusted actors; otherwise system security could be at risk.
Trusted Data goes into more depth on how the governance of this architecture works, on the testing and research work, and on the technical setup of the architecture, which is largely based on Blockchain technology. But this gives you a good overview of the concept.
Summary and main benefits
The Framework for Identity and Data Sharing is a revolutionary solution that can dramatically change how we access and process data while preserving and protecting privacy. MIT and all the parties involved are leading the way into the “Internet of Trusted Data”. The benefits of this architecture can have a tremendous impact:
The data owner is in control and privacy is protected
Obviously it’s not all black and white, and data ownership isn’t always straightforward: what data belongs to the individual and what data belongs to the corporation? While this is a question of definition, the framework gives control over the data to the person owning it. Further, as data never leaves its repository and is always encrypted, privacy is guaranteed.
Real-time, verified, insights and analysis at large scale
This model opens up a whole new world of opportunities. Research, emerging tech (such as AI) and governments can get combined, verified insights in real-time with the assurance that the information is coming from a single source of truth. Governments, for example, can improve overall well-being by understanding exactly how citizens move and interact in an area on a day-to-day basis without ever infringing on privacy. Cyber innovation will be fostered as new data insights become available, and research institutions can get authentic insights without having to go through lengthy studies and surveys.
It is secure and transparent
Built on principles of Blockchain technology, the framework is largely decentralised. No single entity (or powerful few) has control over the infrastructure. Instead, it is distributed among many nodes, which provide the computational resources and execute the consensus mechanism in place. This offers a high level of protection against attacks. Further, it allows the origins of data, algorithms and processes to be traced back, supporting a “chain of provenance” for auditing while guaranteeing transparency.
Network scalability and interoperability
Using PDS, a single source of truth for data and a decentralised architecture, the framework doesn’t have the same challenges as our current system, solving issues around interoperability and allowing scalability.
As we’re moving at the speed of light into a digital society that primarily works and functions with data, a new infrastructure that supports it and is fit for purpose is much needed. The existing system was built over time as needed without thinking much about the model in terms of efficiency and safety. Malicious players have it too easy, as the current architecture was never built or designed to protect data at its core. So moving to an “Internet of Trusted Data” is not only admirable but desirable.
My main point of critique is execution. Although various examples are given of how the software architecture is being trialled in so-called “living labs” (cities and environments participating in real-time experiments to test the feasibility of the OPAL framework), the implementation challenge is massive. The approach is very much strategic and top-down. MIT works closely with partners consisting of governments, universities and research institutions, and to be fair, to achieve something as big as this you do need to have those connections. However, what you also need is collective buy-in from people.
The truth is that even with the introduction of data protection regulations and scandals around data breaches and identity theft, most people aren’t really concerned about this if you measure by their day-to-day behaviour. Numerous studies, whether in the UK or in the EU, have found this contradiction. What people state about their privacy concerns in surveys doesn’t reflect their actual behaviour on a daily basis. The convenience gained from sharing data is perceived as greater than the issues around data privacy. So it seems the need for a privacy-preserving framework isn’t a public concern at large.
If data owners don’t really care, or lack the tools to understand the issue, most governments won’t invest in a revolutionary framework that only works if all parts of the information ecosystem are involved, above all the data owner at the centre of it. Further, the book highlights several times that for this framework to work, particularly the digital ID, it requires global collaboration, recognition and consensus. The challenge around collective buy-in and the need for large-scale adoption is unfortunately often one of the main reasons why Blockchain-based solutions fail. The solutions are usually groundbreaking, but they require a whole ecosystem to participate to really show their benefits and value.
Shrier does touch on these challenges and also mentions that many startups are innovating in the space. Yet, for now, the execution plan lacks the involvement of the individual.
I believe a solution that enables people to understand their complete digital footprint through one single touch point, and that allows direct interaction on data usage consent between businesses and individuals, would be a first step towards creating the needed awareness and getting public support for implementing an Internet of Trusted Data.