#SSI101: Encryption and Correlation
We talk a lot on this blog about how DIDs and VCs preserve privacy “by default and by design,” and how DIDs enable a more secure way of binding identities to data. Unless you have a firm grasp on how SSI manages keys and on the underlying mechanics of how keys secure data, that might sound like an assertion of faith that is hard to assess. After this crash course, though, you should understand enough to make up your own mind or do your own research on the security and privacy needs of your particular use-case or situation. Even if you’re not a math person, I promise it won’t hurt or require a calculator.
The term “encryption” is often used as a general term for a wide range of cryptographic ways of interacting with information, but the original, precise meaning of the term is a good place to start. Let’s say you have a discrete, simple, small piece of information like an email. In its basic, readable form, it is what cryptographers call “plain text” or “clear text”. To en-crypt something is to use a secret called a “key” to mathematically convert the clear message into a form that is opaque, mysterious, unintelligible (the “crypt” comes from the Greek “kryptos,” meaning hidden); this is called the “cipher text,” and converting a cipher text back to plain text is called “de-cryption”.
A cipher text that cannot be decrypted is called a hash, and the one-way mathematical operation of creating one from plain text is called a hashing function. Hash functions are “deterministic,” which means they are consistent — the same cleartext run through the same function produces the same hash no matter where, when, or how many times it’s hashed. A hash has few other functions aside from comparing two clear texts without having to know them — or one clear text at two points in time. Hashes are thus key to tamperproofing, or confirming information like a password without having to transmit it.
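To make the idea of a deterministic, one-way function a little more concrete, here is a minimal sketch using SHA-256 from Python's standard library. The helper name `sha256_hex` and the sample messages are made up for illustration.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Return the SHA-256 hash of a string as hexadecimal text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

original = "Meet me at noon."
tampered = "Meet me at moon."

# Deterministic: the same clear text always yields the same hash.
same_every_time = sha256_hex(original) == sha256_hex(original)

# A one-letter change yields a different hash, so two parties can detect
# tampering by comparing hashes instead of comparing the texts themselves.
detects_tampering = sha256_hex(original) != sha256_hex(tampered)

print(sha256_hex(original))
print(same_every_time, detects_tampering)
```

Real password checks build on the same comparison trick but typically add a per-user salt and a deliberately slow hashing function, which this sketch leaves out.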
There are many different kinds of ciphers, and many ways of combining them into encryption/decryption schemes to protect the transmission of private or important information. Securing military communications has historically been the primary driver of encryption technology and related mathematical research. Nowadays, with the trust-poor and fraud-rich internet becoming more and more of a warzone every day, we actually benefit from automatic encryption and decryption systems working quietly in the background of essentially everything we do online.
The simplest form of cipher is a “linear” or 1-to-1 cipher, where each letter of the alphabet is swapped for the same alternate letter each time it occurs in the clear text message to create the cipher text. A puzzle that asks you to guess the cipher from a simple, natural-language cipher text is called a “cryptogram,” and even though there are 403,291,461,126,605,635,584,000,000 possible ciphers for a 26-letter alphabet (that is, 26 factorial), a trained cryptogrammer can still guess the cipher to a 50-word, natural-language message within an hour by a deductive process of elimination, relying on the small number of common 2- and 3-letter words. In the 19th century, before the discovery of Sudoku, people actually did this kind of guesswork for fun!
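For readers who like to see things run, a 1-to-1 cipher fits in a few lines of Python. This is only a sketch: the function names, the seed, and the sample message are invented for the example, and the shuffled alphabet stands in for whatever cipher two parties might agree on.

```python
import math
import random
import string

ALPHABET = string.ascii_lowercase

def make_cipher(seed: int) -> dict:
    """Build a 1-to-1 letter mapping by shuffling the alphabet."""
    shuffled = list(ALPHABET)
    random.Random(seed).shuffle(shuffled)
    return dict(zip(ALPHABET, shuffled))

def substitute(text: str, mapping: dict) -> str:
    """Swap each letter using the mapping; leave spaces and punctuation alone."""
    return "".join(mapping.get(ch, ch) for ch in text)

cipher = make_cipher(seed=2024)
reverse_cipher = {swapped: letter for letter, swapped in cipher.items()}

message = "meet me at the old mill"
cipher_text = substitute(message, cipher)
round_trip = substitute(cipher_text, reverse_cipher)

print(cipher_text)
print(round_trip)
print(math.factorial(26))  # number of possible 26-letter mappings
```

Notice that word lengths and repeated letters survive the swap untouched, which is exactly the foothold a cryptogram solver uses.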
On the other hand, modern cryptography of the sort used to secure most of the world’s internet traffic relies on complex mathematical “curve” ciphers that take substantial computational work to use for encrypting or decrypting even when you know the cipher in advance. Guessing those ciphers, by contrast, would take a supercomputer years to attempt, not counting the work of confirming the “clarity” (meaningfulness or authenticity) of the decrypted text. This inconceivably massive guesswork is referred to as a “brute-force attack.”
Cryptogram ciphers, in addition to being linear and simpler, are also “symmetrical,” meaning that the exact same cipher is needed to encrypt and decrypt a given message. Like the “decoder rings” that children bought from comic books during the Cold War or the cipherbooks used by spies in World War II to encrypt and decrypt messages sent by telegram, the two parties have to share a cipher in advance, without their enemies getting a copy. Over long distances or with new internet friends, this is not very practical!
Asymmetrical Encryption: the foundation of modern internet security
Instead, most modern computing ciphers are “asymmetrical”, in the sense that one key encrypts the message and a different key decrypts it. The relationship between the two keys is not arbitrary, however: they share an important mathematical relationship. This is why they must be generated together, and why we often talk about them as “keypairs,” since their real value lies in that relationship.
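Here is a deliberately tiny illustration of a keypair whose two halves are generated together. It swaps in textbook RSA with toy-sized primes rather than the curve-based schemes mentioned earlier, simply because the arithmetic fits on one screen; the primes, variable names, and helper function are all assumptions made for the example, and nothing this small is remotely secure.

```python
# Toy RSA keypair built from tiny primes. Illustration only: real keys use
# numbers hundreds of digits long (or elliptic curves) and vetted libraries.
p, q = 61, 53                # two secret primes, chosen together
n = p * q                    # modulus shared by both keys
phi = (p - 1) * (q - 1)
e = 17                       # public exponent
d = pow(e, -1, phi)          # private exponent (modular inverse, Python 3.8+)

public_key = (e, n)          # safe to publish
private_key = (d, n)         # must stay secret

def apply_key(numbers, key):
    """Raise each number to the key's exponent, modulo n."""
    exponent, modulus = key
    return [pow(x, exponent, modulus) for x in numbers]

message = "hello"
plain_numbers = list(message.encode("utf-8"))          # every byte is below n
cipher_numbers = apply_key(plain_numbers, public_key)  # anyone can encrypt
decrypted = bytes(apply_key(cipher_numbers, private_key)).decode("utf-8")

print(cipher_numbers)
print(decrypted)
```

The two exponents only undo each other because they were computed from the same pair of primes, which is the "important mathematical relationship" in miniature.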
One further complication that often eludes nonspecialists is derivation, a process that creates so-called “child keys” deterministically, meaning that a given private key combined with the same ingredients produces the exact same derived key on any computer or in any context. From one private key, this process can derive thousands of keypairs to use in different contexts or to create distinct secure channels. In so-called “hardened” derivation, gathering hundreds of these public child keys generated from the same private key does not enable a person (or a computer working around the clock) to guess the private key that made them all. The mathematical structure that makes possible all these informational superpowers is not simple, but it’s also not impossible for the motivated non-mathematical reader to understand; luckily, that understanding is entirely optional for our purposes.
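The deterministic part of derivation can be sketched with nothing more than a keyed hash. What follows is a simplified stand-in, assuming HMAC-SHA256 as the derivation function and made-up context labels; it derives child secrets rather than full keypairs, and it is not an implementation of BIP32 or any other specific wallet standard.

```python
import hashlib
import hmac

def derive_child(parent_secret: bytes, context: str) -> bytes:
    """Derive a context-specific child secret from a parent secret."""
    return hmac.new(parent_secret, context.encode("utf-8"), hashlib.sha256).digest()

# Placeholder parent secret for the demo; a real one would be random.
parent_secret = hashlib.sha256(b"example seed, never use a guessable one").digest()

bank_child = derive_child(parent_secret, "bank.example")
forum_child = derive_child(parent_secret, "forum.example")
bank_child_again = derive_child(parent_secret, "bank.example")

print(bank_child.hex())
print(forum_child.hex())
print(bank_child == bank_child_again)
```

Each context gets its own secret, the same inputs always reproduce it, and because the keyed hash is one-way, seeing the children does not reveal the parent.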
What matters here is that a public key can safely be made public, since the private key is not at any risk of being “guessed” or deduced by brute force, no matter how widely or how often its public key(s) are published. Anyone who has a public key can assume that only people who have the corresponding private key will be able to read a message encrypted by that public key. Anyone can prove they have the private key corresponding to a public key (often loosely referred to as “control” of that public key) by answering a simple question encrypted with the public key, which is the cryptographic equivalent of “send me a selfie with today’s paper” or “how many fingers am I holding up?”. Many “proofs” and “signatures” are ways to automate this operation. Invisibly, millions of these interactions are carried out every second around the world, as pieces of software use private keys to “sign” and encrypt arbitrary tidbits.
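The “how many fingers am I holding up?” exchange can be sketched as a challenge and a response. As before, this assumes a toy textbook-RSA keypair with tiny primes, and the function names are invented for the example; production systems use standardized signature schemes instead of raw arithmetic like this.

```python
import secrets

# Toy keypair (tiny primes, illustration only, not secure).
p, q = 61, 53
n = p * q
e = 17
d = pow(e, -1, (p - 1) * (q - 1))   # needs Python 3.8+
public_key, private_key = (e, n), (d, n)

def make_challenge(secret_number, public_key):
    """Verifier: encrypt a secret number with the published public key."""
    exponent, modulus = public_key
    return pow(secret_number, exponent, modulus)

def respond(challenge, private_key):
    """Prover: decrypt the challenge, which takes the matching private key."""
    exponent, modulus = private_key
    return pow(challenge, exponent, modulus)

# The verifier picks a fresh random number so old answers cannot be replayed.
secret_number = secrets.randbelow(n - 2) + 2
challenge = make_challenge(secret_number, public_key)

# The prover answers; the verifier compares the answer with its own number.
answer = respond(challenge, private_key)
print("control proven:", answer == secret_number)
```

The verifier never sees the private key; it only learns that whoever answered must hold it.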
Public Key Infrastructure: The invisible hierarchy of internet security
This one-way relationship between private keys and public keys is actually very fundamental to the modern internet. Publishing your public key while guarding your secret private key creates a secure (but one-way) communication channel between you and anyone who has that public key, since they can send a message knowing only you can decrypt it, but you have no way of sending private messages back unless their public key was inside the message. Trading public keys with a stranger you meet online creates a secure two-way communication channel… if you can trust that stranger to actually keep their private key private, that is.
After the first few decades of global internet usage proved that a “security layer” (or three) was essential for even the most trivial of transactions and social interactions, different ways of publishing public keys and encrypting direct communications became more widespread and sophisticated in an incremental escalation of key-based security schemes and layers. This system has been used for decades to secure most of the internet, albeit in a hierarchical way that has ossified into one of many centralized trust backbones structuring the internet.
The reason for this is that “publishing” is a tricky business involving a third party that we take for granted and don’t tend to think about very much. As long as both parties trust a third to reliably, objectively, and fairly publish one another’s current public keys and contact information, any two parties can interact securely. Nowadays, a distributed network of these third parties called “Certificate Authorities” (CAs) publishes registries of public keys for internet servers that want to allow unknown parties to contact them securely.
From the earliest days of the internet, networking and even computers in general have been optimized for hierarchical and client/server data flows at their lowest levels. For this reason, the “CA system” quickly developed into a whole toolkit of stable, reliable, and by modern standards simple ways to exchange keys between individuals and federated trust servers. The system is somewhat complex, but to simplify, we could say that each time you initiate a connection with a server, you verify its published keys and then it asks your browser to generate a disposable, single-use keypair on the spot, based on an arbitrary piece of information. After a few messages back and forth you have a secure two-way channel on which to start interacting, usually establishing further layers of security as necessitated by the interaction.
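The “disposable keypair” step can be sketched with a Diffie-Hellman-style exchange, a technique this post does not name but which gives the flavor of how two sides arrive at a shared session secret. Everything here is an assumption made for illustration: the toy parameters `P` and `G`, the function names, and above all the omission of the certificate check that a real connection performs first.

```python
import hashlib
import secrets

# Toy public parameters, known to everyone. Illustration only; real systems
# use vetted groups or elliptic curves.
P = 2**127 - 1   # a prime modulus
G = 3            # public base

def ephemeral_keypair():
    """Create a throwaway keypair meant for a single session."""
    private = secrets.randbelow(P - 3) + 2
    public = pow(G, private, P)
    return private, public

def session_key(own_private, peer_public):
    """Combine our private half with the peer's public half."""
    shared = pow(peer_public, own_private, P)
    return hashlib.sha256(shared.to_bytes(16, "big")).hexdigest()

# Each side generates a fresh keypair and sends only the public half.
browser_private, browser_public = ephemeral_keypair()
server_private, server_public = ephemeral_keypair()

browser_key = session_key(browser_private, server_public)
server_key = session_key(server_private, browser_public)

print(browser_key == server_key)
```

Both sides end up with the same session key even though neither private half ever crossed the wire, and throwing the keypairs away afterwards keeps one session's secrets from unlocking another's.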
Establishing a secure, mutually-encrypted direct connection discreetly, while leaving no trace of the lookup on central servers, has not advanced as much over the same decades. Creating a system as robust as the CA system that allows individual actors to exchange public keys without a public intermediary, just by chains of referrals and peer-to-peer interactions along a social graph, has proven an elusive goal for decades.
One scheme for establishing end-to-end encryption in a peer-to-peer and decentralized way was the wittily-named “Pretty Good Privacy” (PGP) scheme, first developed in 1991 and distributed to power-users and activists via a worldwide movement to democratize the internet through “cryptoparties” and the like. This essay is not the place to sketch out a thorough history of PGP or of its central place in the ideological tradition driving much of decentralized identity. Some contemporary projects are worth mentioning, though, since they keep PGP powerful and relevant in our contemporary privacy landscape until such a time as SSI tools reach widespread adoption.
Though far from mainstream, PGP-encrypted email is the closest thing we have yet seen to a stable and worldwide open standard for end-to-end communication-channel encryption that relies minimally on intermediaries to trade public keys and prove control of private keys. It also leaks very little traceable data, making it far more privacy-preserving than other forms of encryption that rely on auditable central servers administered by central authorities to establish and oversee direct connections. How these two schemes differ in privacy terms requires a bit more explanation, though.
Correlation: the final frontier in encryption-secured privacy
Many different concepts get conflated under the heading of privacy. For example, anonymity, where no identity can reliably be attributed to an actor, often gets confused with pseudonymity, where someone uses a context-specific, arbitrary identity unrelated to their identity outside that context. Even more dangerous than conflating the two, however, is trusting in either too uncritically. A user’s privacy in a private system is completely evacuated if a link can be established between their identity in the outside world and their pseudonym, or even just their anonymous-yet-tracked behavior in that system.
What creates that link, sometimes against the best intentions and meticulously engineered designs of a pseudonymous or anonymous information system, is correlation — the establishment of a link between one identity in one context and another identity in another. In software design, we mostly use the term negatively, to connote an unfortunate, accidental, or malicious correlation that “breaks” or compromises the privacy of a system not intended to link out to other contexts.
Technically speaking, correlation can sometimes be a positive thing as well, such as when an anonymous ballot is correlated to a voting record to make sure each voter only gets one ballot (ideally without correlating the contents of that ballot!). There are many kinds of correlation, and identity systems by definition seek to structure and control correlation so that it happens everywhere it is positive, and only there.
In the real world, however, and in a world full of malicious actors and misaligned incentives, lots of negative correlation has to be minimized, preempted, and policed. Correlation can be the work of any actor in or outside the contexts being linked; it can happen accidentally or on purpose; it can happen in advance, at the time of action, or years later. It can be definitive or merely probabilistic, material or circumstantial.
A golden age of correlation, and liability
Unfortunately, unwanted, external correlation is experiencing a period of rapid expansion, as new technologies like machine learning, quantum-powered brute force, behavioristic tracking, and browser-based “fingerprinting” are applied to metadata or third-party data completely external to the context.
This last approach is worth stopping to explain: it exploits every nook and cranny of cross-browser interoperability standards (particularly those designed to facilitate tracking for advertising purposes, aka adtech), server logs, and routing metadata to track individual browsers and users even in private mode and without cookies or overt tracking mechanisms. Thankfully the last few years have seen browser standards and the browser sector of the industry move progressively to close these loopholes (often at the expense of advertising technology). Some might even say that privacy is driving a new round of “browser wars” as strategies (and business interests) diverge on how best to preserve the relative privacy of individuals on the way.
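To see why fingerprinting needs no cookies, consider this small sketch. The attribute names and values are hypothetical stand-ins for the kinds of details a browser may expose, and hashing them together is just one simple way to picture how a stable identifier could fall out of them.

```python
import hashlib
import json

def fingerprint(attributes: dict) -> str:
    """Collapse a set of browser attributes into one short identifier."""
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Hypothetical attributes seen by a shop site (no cookies involved).
visit_to_shop = {
    "user_agent": "ExampleBrowser/99.0",
    "screen": "2560x1440",
    "timezone": "Europe/Berlin",
    "language": "de-DE",
    "fonts": ["Fira Sans", "Noto Serif"],
}

# The same browser later visits an unrelated forum in private mode.
visit_to_forum = dict(visit_to_shop)

# A different visitor whose setup differs in a single detail.
other_visitor = dict(visit_to_shop, screen="1920x1080")

print(fingerprint(visit_to_shop) == fingerprint(visit_to_forum))
print(fingerprint(visit_to_shop) == fingerprint(other_visitor))
```

Two sites that never share a cookie can still compare identifiers like these and conclude that the same browser visited both, which is correlation through metadata alone.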
In many ways, privacy requires not just encrypting the data (i.e., the “contents”) of communications and transactions between parties, but obscuring or shielding the metadata (i.e., the “context”) as well. Simply put, the more sophisticated correlation techniques get, the less metadata they require to tie private activity to real-world identity. Much of the sophistication of real-world SSI systems is about layering security, introducing “indirection” mechanisms, and double-blinding components even within one shared system to minimize the risk of correlation. Proper correlation-resistant design works even in cases of access breaches or external technical failures.
In the context of the General Data Protection Regulation (GDPR) that came into effect across Europe in 2018 and similar laws being passed since all around the world, correlation is an absolutely crucial benchmark and criterion. Absolutely any data becomes “personal data” once it has been correlated to a data subject, in some cases even if that correlation is done maliciously by third parties after the fact. What’s more, even data that seems inherently impersonal like automotive driving data or geolocation data also becomes personal data if connected to a specific person’s driving or location. This makes correlability of data a legal liability!
Even though precise technical limits and best practices for keeping personal and non-personal data distinct are still evolving, the spirit of the law is clear: all data needs to be treated as potentially personal data. This obligates the stewards of all data to make good-faith efforts to minimize data leakage, expose their data flows to security audits, and mitigate unavoidable correlation risks.
Self-sovereign identity systems, in and of themselves, are not intrinsically GDPR-compliant or correlation-proof, but mature SSI systems all take on considerable complexity to maximize correlation-resistance. One particularly well-documented case of correlation-proofing can be found in the Sovrin Foundation’s GDPR report, although more piecemeal technical and legal documentation has been published by other major SSI systems.
Fundamentally, in a context where no correlation-proof system can be 100% future-proof, how do you know when you have engineered the system to be correlation-proof enough? As in matters of ethics or politics, the answer is always community and the widest-possible dialogue. In this case, that means standards groups and open-source code review, to which we will now turn.