Returning Power to the People:
the Fourth Era of the Internet

human.ai
Mar 4, 2021

Was Tim Cook right when he wrote that we are just the products of Google and Facebook, to be sold to whomever at their whim? Was Sir Tim Berners-Lee right to rage about Mark Zuckerberg turning the internet into a closed instrument of surveillance capitalism and electoral manipulation? If they were right, then what does the future hold for us online?

We seek to answer these questions below by describing the still-early evolution of the internet, building in particular on the reflections of Chris Dixon and Andrej Karpathy. We describe the evolution towards corporate control, particularly in the age of AI. While much of this evolution is indeed bleak, we contend that a fundamental change is underway. The building blocks are now in place to return the internet to the people. Tomorrow looks a lot brighter than today.

Code & Power to the People

The first era of the internet was based on open protocols controlled by the internet community. These protocols extended from TCP/IP to a full internet protocol suite, to HTTP for the web, to XMPP and VoIP for chat and voice, and to much more besides. This first era ran from the 1980s to the early 2000s.

Code & Power to the Corporates

The second era, overlapping with the first, started in the mid-1990s. It was characterised by closed-source products and services controlled by centralised corporate giants like Apple, Amazon, Google and Facebook.

The switch from the first to the second era was the result of incentives. Open protocols had no scalable business model at the time. This made it increasingly easy to recruit the brightest minds to build centralised monopolies.

Code to the People. Power to the Corporates

The third era started in 2006 with the launch of Hadoop. In this third era, the corporate giants increasingly open-sourced code. Rather than treating proprietary code as a core differentiator, they now pursued monopoly control through network effects.

Google’s Android provides an illustrative case study. Google was able to use an open-source codebase to lower development costs through collaboration, to improve hiring pipelines, to improve customer and partner acquisition, and to establish and control industry standards. Google was also able to entrench monopoly control through multiple network effects — based on platform, personal utility, market, market network and data.

In the Age of AI, Data + Talent = Power

2016 marked a significant change in this era. This was when AI started eating the software that had been eating our world. As before, the corporate giants continued to open-source code while still pursuing monopoly control. However, their path to control focused specifically on eating the talent and eating the data that fuels the AI.

This focus on talent and data followed from the fundamentally different way that AI software is written. AI software is not written by humans. Indeed, it is often not even human-readable. Instead, it is the result of a search process conducted by machines, with human experts directing, framing and managing the process. Human experts provide their computers with a suitable goal (e.g. “satisfy a dataset of input-output pairs”). Human experts also provide bounds on the program space that the computers must search (e.g. with some neural net architecture). Human experts then carefully manage the search. The focus on AI talent stemmed from the extreme shortage of experts able to direct, frame and manage the development of AI software. The focus on expanding, joining, curating, massaging, cleaning and labelling datasets stemmed from the need for better and farther-reaching goals.
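To make this concrete, here is a minimal sketch of AI software as a machine-led search, written in Python with NumPy. The toy dataset, the linear "architecture" and the learning rate are all illustrative assumptions; real systems search vastly larger program spaces under far richer goals.

```python
import numpy as np

# Goal: "satisfy a dataset of input-output pairs" (here, a toy synthetic dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Bounds on the program space: a single linear layer with weights w.
w = np.zeros(3)

# The machine searches the bounded program space; the human merely frames the
# goal (mean squared error) and manages the search (learning rate, step count).
learning_rate, steps = 0.05, 500
for _ in range(steps):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the loss
    w -= learning_rate * grad              # one step of the search

print("program found by the machine:", w)
```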

Google has been the leader in this new age of AI. The company reimagined itself to be AI-first, where “AI is everything — and everywhere,” with all teams encouraged to build on top of a single AI infrastructure stack (which Google open-sourced). The new AI-first Google set out to monopolise the world’s limited AI talent, so much so that Google’s academic output on AI now dwarfs that of MIT, Stanford and every other university. There was also a step change in the company’s approach to data. The company expanded and merged its surveillance data to create “super profiles” for all Google users that increasingly record our personally identifiable activity across non-Google websites, across all our devices, deep into our homes and when we venture outside — whether we are using our phones or not. This reached the point where senior executives felt compelled to leave. A subdivision of the company even struck secret deals to acquire our medical data, and these efforts were then merged into Google proper.

Returning Code & Power to the People

We are now entering the fourth era of the internet. In this fourth era, code will tend increasingly towards being fully rather than partly open source. People will become exclusive controllers of their data. In particular, they will become the exclusive commercial controllers of their data. The brightest minds will also be incentivised to leave their gilded cages.

Giving Everyone a Data Wallet

The inventor of the web, Sir Tim Berners-Lee, took us a step closer to this fourth era when he launched the Solid Project in 2016. Conceived as a way to break down the walled gardens of Facebook, the Solid Project is now being trialled by the NHS as a way to let patients determine who can access their medical data.

The Solid Project aims to provide each of us with a shareable, interoperable data wallet. According to Sir Tim’s vision, each of us should be able to put all our data into our own data wallet: our images, our social profiles, our medical data, and any data we might store today in Dropbox or Google Drive. These wallets should be shareable, allowing us to share select information with select trusted parties. These wallets should also be interoperable, allowing developers to build applications and services on top of all these wallets (e.g. a decentralised alternative to Facebook).
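As a rough illustration of the data-wallet idea, the sketch below models a wallet whose owner grants per-record read access to selected parties. The class and method names are purely hypothetical; Solid itself stores data as linked-data resources in "pods" with its own access-control mechanisms.

```python
from dataclasses import dataclass, field

@dataclass
class DataWallet:
    """Toy model of a personal data wallet: the owner stores records and
    grants or revokes read access per record, per party (names illustrative)."""
    owner: str
    records: dict = field(default_factory=dict)  # record name -> data
    grants: dict = field(default_factory=dict)   # record name -> set of parties

    def put(self, name, data):
        self.records[name] = data

    def share(self, name, party):
        self.grants.setdefault(name, set()).add(party)

    def read(self, name, party):
        if party == self.owner or party in self.grants.get(name, set()):
            return self.records[name]
        raise PermissionError(f"{party} has no access to {name}")

wallet = DataWallet(owner="alice")
wallet.put("medical/genome", {"rs4988235": "TT"})
wallet.share("medical/genome", "nhs-cardiology")
print(wallet.read("medical/genome", "nhs-cardiology"))  # allowed
```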

Insights Without Data

The Solid Project solves the problem of data sharing and collaboration between trusted parties. However, some of the most important advances we can make with data require data to be accessible by untrusted parties.

Precision medical data provides an illustrative example. This data offers the chance of amplified upside. If the brightest minds are given access to our collective precision medical data, then they can materially advance our understanding of disease, patient experience and treatment response. However, precision medical data also carries the risk of amplified downside. At the heart of this data is personal genomic data. This data is inherently identifying. It is also predictive, and therefore sensitive: it influences the likelihood of ability, behaviour and disease.

If we are to maximise the utility of this type of data while also minimising the potential for harm, then we need to solve two problems:

  • Data Leakage: When untrusted parties run computations across our personal data, it must not be possible for these parties — or any other parties — to have direct access to our raw data.
  • Compromising Inference: It must not be possible for anyone to take the results of any computations and infer anything personal about us.

Importantly, we need to solve these problems while also satisfying two constraints. The first constraint is “computation integrity”. It must not be possible for anyone to tamper with the computations. The second constraint is “remote attestation”. It must be possible for all parties to confirm remotely that computations have indeed completed as expected.

No Compromising Inference

While the second problem, compromising inference, is challenging, we can reason about it precisely. Real-world solutions have also been deployed.

Differential privacy is a mathematically provable guarantee that inference is not possible — no matter what record linkage, differencing or other attacks are used. We say that an algorithm is ε-differentially private if the algorithm’s output is essentially the same, no matter whether any individual’s data is included in the input dataset or not. The number we choose for ε precisely bounds what can be learned about an individual as a result of their private information being included in a differentially private analysis. If an individual’s information is used in multiple analyses, then more can be learnt. Differential privacy also guarantees that the increasing amount that can be learnt is a known function of ε and the number of analyses performed.
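For a concrete (and deliberately simplified) illustration, the sketch below applies the standard Laplace mechanism to a counting query. The dataset, the predicate and the choice of ε are all illustrative assumptions.

```python
import numpy as np

def dp_count(values, predicate, epsilon, rng):
    """Counting query with Laplace noise. A count has sensitivity 1: adding or
    removing one individual changes the true answer by at most 1, so noise with
    scale 1/epsilon makes the released answer epsilon-differentially private."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(0)
ages = [34, 51, 29, 62, 45, 38, 70, 55]
# Each repeated analysis spends more of the privacy budget: running this query
# k times at epsilon each is (k * epsilon)-differentially private in the worst case.
print(dp_count(ages, lambda a: a >= 50, epsilon=0.5, rng=rng))
```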

Notable real-world solutions have now been deployed by Google, Apple, Microsoft and the U.S. Census Bureau. While these solutions are impressive, they struggle when a large number of analyses are run over time. We expect these solutions to improve, not least because differentially private machine learning depends on exactly this kind of repeated computation. Aircloak’s Diffix also suggests that there may be other ways to handle repeated analyses, ways that do enough to satisfy our real-world privacy needs even if they do not meet the strict definition of differential privacy.

Attempting to Prevent Data Leakage

The first problem of data leakage during computation has proven more challenging.

Initially, cryptographic approaches were proposed. These include secure multiparty computation, homomorphic encryption and zero-knowledge proof systems. Unfortunately, despite ongoing work since the late 1970s, these approaches have seen limited use over the years. This is because they suffer significant performance overheads. They also tightly constrain how the different parties need to collaborate. Further, the interplay between these solutions and solutions to the inference problem is not straightforward.

Since then, two alternative approaches have been proposed. These alternatives are more practical, but they also require participants to trust at least one other party. The two approaches are centrally coordinated trusted execution environments (TEEs, 2016) and federated learning/federated analytics (2017/2020).

TEEs use hardware to ensure that code and data are fully isolated while in use. TEEs also provide remote attestation, meaning that all parties can confirm that computations have completed as expected. For anything other than toy problems, multiple TEEs need to be coordinated, and central coordination leaves all parties needing to trust the central coordinator.
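The sketch below gives a rough feel for remote attestation. It is not how any real TEE works: production systems such as Intel SGX use hardware-rooted keys and signed quotes, whereas here an HMAC with a stand-in key plays the part of the hardware signature, and all names are illustrative.

```python
import hashlib, hmac, json

HARDWARE_KEY = b"stand-in for a key fused into the TEE hardware"  # illustrative

def enclave_run(code: str, data: list):
    """Run a computation 'inside' the enclave and return the result together
    with an attestation over the measurement of the code and the result."""
    result = sum(data) / len(data)  # the isolated computation
    measurement = hashlib.sha256(code.encode()).hexdigest()
    report = json.dumps({"measurement": measurement, "result": result})
    signature = hmac.new(HARDWARE_KEY, report.encode(), hashlib.sha256).hexdigest()
    return report, signature

def verify(report: str, signature: str, expected_measurement: str) -> bool:
    """A remote party checks the signature and that the attested code is the
    code it expected, without ever seeing the raw data."""
    expected_sig = hmac.new(HARDWARE_KEY, report.encode(), hashlib.sha256).hexdigest()
    ok_sig = hmac.compare_digest(signature, expected_sig)
    ok_code = json.loads(report)["measurement"] == expected_measurement
    return ok_sig and ok_code

code = "mean(data)"
report, sig = enclave_run(code, [3, 5, 7])
print(verify(report, sig, hashlib.sha256(code.encode()).hexdigest()))  # True
```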

Federated analytics involves two layers of computation: (i) local computation and (ii) cross-source computation. During local computation, queries are brought to each independent data holder, and computations are then performed in the local environment using the local data. During cross-source computation, the locally computed outputs are shared and aggregate outputs are produced.
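A minimal sketch of these two layers, with toy data and hypothetical function names, might look like this:

```python
# Layer (i): each data holder answers the query locally, on its own data.
def local_compute(local_data, predicate):
    return {"matches": sum(1 for x in local_data if predicate(x)),
            "total": len(local_data)}

# Layer (ii): only the locally computed outputs are shared and aggregated.
def cross_source_compute(local_outputs):
    matches = sum(o["matches"] for o in local_outputs)
    total = sum(o["total"] for o in local_outputs)
    return matches / total

holders = [
    [120, 145, 133, 160],       # hospital A's (toy) blood-pressure readings
    [150, 142],                 # hospital B
    [128, 135, 170, 155, 149],  # hospital C
]
query = lambda reading: reading >= 140  # share of hypertensive readings
outputs = [local_compute(data, query) for data in holders]
print(cross_source_compute(outputs))
```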

Federated analytics limits the extent to which researchers can determine population-level correlations. This is because correlations may not be evident during the local computation. Federated analytics also demands that researchers trust data owners. This is because federated analytics exposes queries to data owners. If knowledge about the queries is valuable, or if the AI models being trained on the data are valuable, then data owners need to be trusted not to misuse these valuable assets. Exposure of queries at each data owner also amplifies the risk of attack, by increasing both motive and means. For example, it increases the chance that an individual working for a data owner will take issue with some specific use of data. Such an individual can also manipulate queries and local outputs more easily than the data itself.

On its own, federated analytics also struggles when data is sparse at any of the owners. In seeking to prevent the platform coordinator from being able to infer personal information, federated analytics requires the results of each local computation to be differentially private. When data is sparse at some data owners, this becomes difficult. This is because we add noise to ensure differential privacy, and computations across smaller datasets require proportionally more noise to prevent inference about any individual. We can address this difficulty by combining federated analytics with secure multiparty computation. When we do this, differential privacy can be enforced once, during the cross-source computation step. However, depending on the way secure multiparty computation is added, participants either need to trust the coordinator or they need a trusted third party. In both cases, this leaves governments with a target for coercion.
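As a rough sketch of this combination, the example below uses pairwise additive masking as a stand-in for full secure multiparty computation. The masks cancel in the sum, so the coordinator only ever sees the aggregate, and differential privacy is enforced once at the cross-source step. All values and the choice of ε are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
local_counts = [3, 1, 5]  # e.g. sparse per-wallet counts from three holders
n = len(local_counts)

# Pairwise additive masking (a stand-in for secure multiparty computation):
# each pair of holders agrees on a random mask; one adds it, the other subtracts it.
masks = {(i, j): rng.normal(scale=1000.0) for i in range(n) for j in range(i + 1, n)}
masked = []
for i, value in enumerate(local_counts):
    share = float(value)
    for (a, b), m in masks.items():
        if a == i:
            share += m
        elif b == i:
            share -= m
    masked.append(share)  # individually meaningless to the coordinator

# The masks cancel in the sum, so only the aggregate is revealed ...
aggregate = sum(masked)
# ... and differential privacy is enforced once, at the cross-source step.
epsilon = 0.5
noisy_aggregate = aggregate + rng.laplace(scale=1.0 / epsilon)
print(round(noisy_aggregate, 2))
```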

Solving the Data Leakage Problem

Researchers from UC Berkeley and Cornell finally solved the problem of data leakage in 2018. They provided a solution that enables privacy-preserving analysis across personal (sparse) data wallets, like those of the Solid Project. Further, they provided a solution that keeps queries and AI models secret.

Their solution started from the observation that a single TEE does prevent data leakage when used to analyse tiny multi-source datasets. They found a way to scale this prevention of data leakage from single-TEE analysis to multi-TEE analysis without imposing any trust requirements on any party. They achieved this using a provable security model that allows for independent verification. They also achieved this with negligible performance overheads.

They achieved this by using blockchains to coordinate TEEs. Crucially, they used blockchains only to coordinate: they cleanly separated execution from consensus. They showed that all computation over personal data can happen off chain in TEEs, which then attest on chain to their correct execution. This allowed the researchers to avoid the performance limitations associated with blockchains. Importantly, this was done with a verifiable model that does not introduce additional security vulnerabilities.
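The sketch below is a purely illustrative rendering of this separation of consensus from execution; it is not the researchers' actual protocol or API. A toy "chain" only orders tasks and records attestations, while the computation over private inputs happens off chain.

```python
import hashlib

class Chain:
    """Toy 'consensus layer': it only orders tasks and records attestations;
    it never sees the personal data itself."""
    def __init__(self):
        self.tasks, self.attestations = [], []

    def post_task(self, contract_hash: str) -> int:
        self.tasks.append(contract_hash)
        return len(self.tasks) - 1

    def post_attestation(self, task_id: int, result_hash: str, measurement: str):
        self.attestations.append({"task": task_id,
                                  "result_hash": result_hash,
                                  "measurement": measurement})

def tee_execute(chain: Chain, task_id: int, contract_code: str, private_inputs: list):
    """Toy 'execution layer': the computation runs off chain, inside a TEE,
    over the private inputs; only hashes and the attestation go on chain."""
    result = sum(private_inputs)  # stays off chain
    chain.post_attestation(
        task_id,
        result_hash=hashlib.sha256(str(result).encode()).hexdigest(),
        measurement=hashlib.sha256(contract_code.encode()).hexdigest(),
    )
    return result

chain = Chain()
code = "sum(private_inputs)"
task = chain.post_task(hashlib.sha256(code.encode()).hexdigest())
tee_execute(chain, task, code, private_inputs=[4, 8, 15])
print(chain.attestations[0]["task"], chain.attestations[0]["measurement"][:12])
```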

Making This Real: Giving People Control

Professor Dawn Song, arguably the world’s most cited security researcher, led the development of this new solution. Soon after the research was published, a16z led a $45m investment into Professor Song’s spinout, Oasis Labs, to bring this solution to the world. After two years of development, in September 2020, the first major real-world deployment of this solution was announced in partnership with Nebula Genomics.

Aligning Incentives

While the performance improvements of the new solution cannot be overstated, they were not the only reason that a16z and others invested so heavily. The investment also reflected an extraordinary alignment of incentives. Each of us will be encouraged to pull ourselves and our data out of centralised services. The brightest minds will be encouraged to leave the corporate giants and build open-source services. Further, as time passes, these incentives will multiply.

Augmenting the Solid Project to allow privacy-preserving analysis across our data is clearly important. However, suppose it is difficult or costly for us to collect some of the personal data that interests analysts (e.g. wellness data). Would we collect this data simply because analysts are interested? And if we did collect the data for our own needs, would we necessarily let others analyse it? One of the things that makes the new solution exciting is that it has smart contracts built in by necessity. This means that each of us can be financially incentivised to collect and expose valuable personal data for privacy-preserving analysis.

These tokenised smart contracts will also incentivise developers to build open-source tools and services on top of, and in support of, the new solution. These incentives take two forms: pre-mined tokens and an operational share of each transaction. Open source is a defining feature of these tokenised, decentralised services, and the need for it is amplified in this particular case by one of the requirements of the new solution: it must be open source to provide verifiable security.

As new companion services are built, these will enable and encourage further tokenised, open-source services to be built — each with new users and with new customers. This leads to accelerating growth of an ecosystem around privacy-preserving analysis of our personal, valuable data. Traditional centralised services are inherently egotistical and competitive, seeking to own as much of the user’s experience and the customer’s experience as possible. Tokenised, decentralised services are different. They are inherently collaborative, seeking to encourage others to build higher-level services as more higher-level services lead to more value being captured lower down. This difference has been described as a transition from thin to fat protocols.

Where It All Starts: The Killer App in the Killer Territory

While Professor Song and her team are based in California, the most pressing need for their work comes from healthcare providers in Europe. We expect that this is where scaled real-world deployment will begin.

This expectation follows from new regulatory requirements. The EU’s proposed Data Governance Act compels public health providers within the EU to make their data shareable in order to stimulate health innovation. However, histories of eugenics and secret surveillance mean that privacy cannot be sacrificed. Data sharing must be done in strict accordance with GDPR. Additionally, technical measures must be in place to prevent third-country authorities from accessing European patient data.

Concluding Thoughts

Chris Dixon wrote in a typically prescient post: “Centralized platforms have been dominant for so long that many people have forgotten there is a better way to build internet services.” It is possible to be free of data and algorithmic rent seekers. It is possible to be free of fear that our data will be used against us. It is possible to make collaboration rather than competition our collective focus.

In this post, we have described the transition from an open internet to an increasingly closed internet. In particular, we have described how this has unfolded in the age of AI. In 2018, a fundamental technological innovation, described above, gave hope of a brighter tomorrow, one in which the internet becomes open once again and power transitions from centralised monopolies to the people.

The first real-world instance of change was seen in September 2020. This was in the US, which was to be expected, as most of the technical development has been in the US. However, it is in Europe that privacy matters most, both to policy makers and to the people. We predict that real-world deployment will start scaling in Europe. Indeed, given the new regulations, we predict that this will happen in partnership with public health providers.
