Decentralized Content Networks for a Permanent Science Data Commons: IPFS

Caleb Tuttle
Published in OpSci
Dec 3, 2021
Photo by Alina Grubnyak

This article is the first in a series introduced in Rich in Data, Poor in Wisdom: Science Needs a Decentralized Data Commons.

A unified science data commons needs a single unified file system. The InterPlanetary File System (IPFS) is a peer-to-peer protocol that enables the “storing and accessing [of] files, websites, applications, and data” (source). Note that IPFS is not intended to be a cloud storage solution; it is only a protocol and a network. Nonetheless, because the protocol underpins some decentralized storage solutions (such as Filecoin), we outline some of its important features. After discussing why IPFS is a core component of Web3, we review its mechanisms: content addressing, Merkle trees, and distributed hash tables.

IPFS is at the Heart of Web3

Peer-to-Peer

IPFS is fully peer-to-peer, which is the most basic requirement of a Web3 technology. Anyone can set up a node and participate in the IPFS network. A node can host and retrieve content, and any content a node hosts can be discovered and retrieved by other nodes on the network. Of the graphs below, the IPFS network resembles the third: the distributed network, in which all nodes have equal authority. By contrast, in centralized and decentralized networks, some nodes have more authority than others.

Source: Paul Baran’s On Distributed Communications: Introduction to Distributed Communications Networks.

IPFS nodes are highly configurable. By default, if a node downloads content from another node, that content is cached so it can easily be served if another node requests it. Cached content that has not been pinned is not kept forever: it is removed when the node runs garbage collection, which, when automatic garbage collection is enabled, happens every hour by default. Because IPFS has no incentive system, people and organizations often host only their own data.

IPFS Behind Many Web3 Applications

IPFS is used by many decentralized applications (dApps) to accomplish the original dream of the web as a truly peer-to-peer network. The music streaming app Audius stores its music using IPFS. The Ceramic protocol — “the smart document protocol for an open dataweb” — uses IPFS. Plenty of new storage services use IPFS (e.g., Textile, OrbitDB, Pinata, Fleek, Space, Estuary, Web3.storage, and NFT.Storage). One can easily spin up a single-page website with IPFS. Many non-fungible token (NFT) smart contracts include links to files stored on IPFS because the immutable content addresses on IPFS work well in the context of immutable smart contracts.

Content Addressing

IPFS uses content addressing instead of the familiar location addressing. In location-based addressing, a user must know the location of a file to retrieve it (e.g., /Users/user/Desktop/file or www.wikipedia.org). With content addressing on IPFS, each address is derived from the file’s contents. The address is built from a hash of the contents and so looks like a string of random characters (e.g., QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco). One can retrieve a file from the network without knowing its location because IPFS nodes keep track of which nodes are storing which files. This setup is similar to the DOI (Digital Object Identifier) system, which gives a resource a unique identifier and then maintains a record of the resource’s location. Content addressing differs in that a resource’s address is derived from the resource itself and used by the protocol to locate the file, while each DOI is created and managed by an agency.

How Content Addressing Works

Generating an address on IPFS for a piece of content takes a few steps. When one adds a piece of content to IPFS, the content is given a content identifier, or CID. Each CID has two main components: a codec and a multihash. The codec “holds information about how to interpret the data” (source). The multihash has two components: the hash of the content and metadata about which hash function produced the hash and how long the hash is. All in all, a version 1 CID is a sequence of bytes with the following structure: a version number, the codec, and the multihash (itself a hash function identifier, the hash length, and the hash).
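As a rough sketch of that layout, the Python below assembles a version 1 CID by hand for a single small block of bytes. It assumes the “raw” codec and SHA-256, and encodes the result in the lowercase base32 string form. The function names are hypothetical, and a real IPFS node may chunk and wrap a file before hashing it, so the output should not be expected to match the CID a node reports for an arbitrary file. (The example address quoted earlier in this article is in the older version 0 string form, which looks different.)

```python
import base64
import hashlib

# Single-byte codes from the multiformats tables. All are below 128, so each
# fits in one byte without multi-byte varint encoding.
CID_VERSION_1 = 0x01   # CID version
RAW_CODEC = 0x55       # "raw" codec: the bytes are uninterpreted binary
SHA2_256_CODE = 0x12   # identifier for the SHA-256 hash function


def multihash_sha256(data: bytes) -> bytes:
    """Return <hash function code><digest length><digest> for the data."""
    digest = hashlib.sha256(data).digest()
    return bytes([SHA2_256_CODE, len(digest)]) + digest


def cid_v1_raw(data: bytes) -> str:
    """Return <version><codec><multihash> as a base32 string (hypothetical helper)."""
    cid_bytes = bytes([CID_VERSION_1, RAW_CODEC]) + multihash_sha256(data)
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    # The leading "b" is the multibase prefix announcing lowercase base32.
    return "b" + encoded


if __name__ == "__main__":
    print(cid_v1_raw(b"hello science"))
```

Because the address is a pure function of the bytes, calling `cid_v1_raw` twice on the same content gives the same string, and changing even one byte gives a different one.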

There are a couple of interesting features of IPFS’s content addressing. A CID is determined entirely by the content it identifies, so any change to the content produces a different CID. As a consequence, addresses are permanent and immutable. Permanent addresses make versioning easy but present difficulties for those wanting to store mutable content, although IPFS has tools for dealing with mutable content.

The Importance of Content Addressing

Trust and Security

On a web that uses location addressing, users are blind to whether malicious content is hosted at a given location because the address of a web page says nothing about its content. This highlights the main problem with location addressing: relatively high trust requirements. In a Web2 environment, a user must trust the party who controls the location of a file. For example, I trust that there is no malicious code at www.wikipedia.org; I can navigate to it without risking my computer or sensitive information. That is, I trust the Wikimedia Foundation. This trust requirement makes the web less safe than it needs to be. For example, a bad actor might earn trust and subsequently direct people to a web address that runs malicious code, or the bad actor might hack a trusted domain (such as Wikipedia). Users are thus required to trust centralized authorities.

Content addressing on IPFS reduces the trust requirement. Because an address derived from the content changes whenever the content changes, there are some cases in which we can know a file isn’t malicious. We know a CID corresponds to non-malicious content if we already have the file, have retrieved it before, or know someone who has retrieved it. This only partially solves the trust problem, but it is still superior to the blindness of location addressing.
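To illustrate why a content-derived address reveals tampering, here is a minimal sketch in Python. It assumes a plain SHA-256 digest stands in for the full CID, and `fetch_verified` and its `fetch` callback are hypothetical names, not part of any IPFS library. The idea is simply that the requester rehashes whatever an untrusted peer returns and rejects it if the hash does not match the address that was asked for.

```python
import hashlib
from typing import Callable


def digest_of(data: bytes) -> bytes:
    """Content address used in this sketch: the SHA-256 digest of the bytes."""
    return hashlib.sha256(data).digest()


def fetch_verified(expected: bytes, fetch: Callable[[bytes], bytes]) -> bytes:
    """Fetch content from an untrusted source and check it against its address."""
    data = fetch(expected)
    if digest_of(data) != expected:
        # The peer returned something other than what the address names.
        raise ValueError("retrieved content does not match the requested hash")
    return data


if __name__ == "__main__":
    original = b"trusted dataset v1"
    address = digest_of(original)
    # An honest peer returns the original bytes, so verification passes.
    print(fetch_verified(address, lambda _: original))
```

Note what this does and does not establish: the check proves the bytes are the ones the address names, not that those bytes are harmless, which is why the trust problem is only partially solved.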

Redundancy and Duplication

A smaller problem of location addressing is unnecessary duplication. For example, if the same photo is on two different blog posts from two different blogs, the photo is probably stored twice (once for each blog) and has two different addresses. This is often inefficient, especially because the blogs might be hosted at the same physical server locations even if they have different domain names. Duplication is beneficial, however, if the file is in high demand or if the copies are spread across a geographically diverse network. In IPFS, this beneficial kind of duplication is the default behavior, while the unnecessary kind is avoided. Plus, each file has only one address.

Content Linking with Merkle Trees

IPFS uses Merkle trees to link directories, files, and pieces of files together. A Merkle tree is a tree data structure in which each node’s ID is a hash of the node’s contents. The graph below represents a Merkle tree.

Source: Wikipedia
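A binary Merkle tree like the one pictured can be sketched in a few lines of Python. This is a generic illustration rather than the exact structure IPFS builds: the function names are hypothetical, SHA-256 is assumed, and duplicating the last hash when a level has an odd number of nodes is just one common convention.

```python
import hashlib
from typing import List


def sha256_digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks: List[bytes]) -> bytes:
    """Hash each block, then hash pairs of hashes upward until one root remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [sha256_digest(block) for block in blocks]  # leaf IDs
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumed convention: pair an unpaired last node with itself.
            level.append(level[-1])
        # A parent's ID is the hash of its two children's IDs.
        level = [
            sha256_digest(level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]


if __name__ == "__main__":
    print(merkle_root([b"a", b"b", b"c", b"d"]).hex())
```

Changing any single block changes its leaf hash, which changes every hash on the path up to the root, so the root alone is enough to detect a change anywhere in the tree.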

There are three significant benefits of using Merkle trees to link content: verifiability, distribution, and deduplication. One can verify that a certain piece of content corresponds to a certain CID by simply hashing the content. “This offers both permanence … and protection against malicious manipulation” (source). The distribution benefit is that any Merkle tree, including any subtree of a Merkle tree, can be retrieved on IPFS. This makes content, directories, and datasets more modular: one can retrieve a whole dataset, just half the dataset, or half of a dataset from one peer and the other half from another peer. The deduplication benefit is that identical content never needs to be stored twice. For example, if two distinct datasets have one image in common, this image only needs to be stored once. Content addressing and Merkle trees allow us to split files and directories into their smallest parts, store the smallest parts only once, and reconstruct the content as needed.
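The deduplication idea can be sketched with a toy content-addressed block store in Python. Everything here is illustrative: `BlockStore` is a hypothetical class, the 4-byte chunk size is chosen only to keep the example readable (real chunk sizes are far larger), and a flat list of chunk keys stands in for the tree that links a file's pieces together.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4  # tiny on purpose, for illustration only


class BlockStore:
    """Toy store in which each chunk is kept under the hash of its bytes."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}

    def put(self, chunk: bytes) -> str:
        key = hashlib.sha256(chunk).hexdigest()
        # Identical chunks hash to the same key, so they occupy one slot.
        self.blocks[key] = chunk
        return key

    def add_file(self, content: bytes) -> List[str]:
        """Split content into chunks, store each, and return the chunk keys."""
        return [
            self.put(content[i:i + CHUNK_SIZE])
            for i in range(0, len(content), CHUNK_SIZE)
        ]

    def read_file(self, keys: List[str]) -> bytes:
        """Reconstruct content from its ordered list of chunk keys."""
        return b"".join(self.blocks[key] for key in keys)


if __name__ == "__main__":
    store = BlockStore()
    keys_a = store.add_file(b"AAAABBBBCCCC")
    keys_b = store.add_file(b"AAAAXXXXCCCC")
    # Six chunks were added, but the two files share two of them.
    print(len(store.blocks))
```

Both files can still be reconstructed in full from their key lists, even though the shared chunks are stored only once.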

Content Discovery with DHTs

If the address doesn’t specify a file’s location, how does the network know where to get a file? IPFS uses a distributed hash table (DHT) to store information about which nodes are storing which files. First, what is a DHT? A hash table (HT) is a data structure that stores key-value pairs, where each key is used to find the location of its value. A DHT is a hash table that is stored across a network of devices. The graph below represents a simple hash table.

Source: Wikipedia
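To make “a hash table stored across a network of devices” concrete, here is a toy sketch in Python. It assumes a Kademlia-style rule in which each key is stored on the node whose ID is closest to the key's hash under XOR distance. The class and node names are hypothetical, and the sketch leaves out what a real DHT needs in practice: here every node is known up front, there is no routing between peers, and each record lives on a single node rather than being replicated.

```python
import hashlib
from typing import Any, Dict, List


def to_id(name: str) -> int:
    """Map a node name or key into one shared numeric ID space."""
    return int.from_bytes(hashlib.sha256(name.encode("utf-8")).digest(), "big")


class ToyDHT:
    def __init__(self, node_names: List[str]) -> None:
        self.node_ids: Dict[int, str] = {to_id(n): n for n in node_names}
        # Each node holds only its own slice of the overall table.
        self.tables: Dict[str, Dict[str, Any]] = {n: {} for n in node_names}

    def responsible_node(self, key: str) -> str:
        """Pick the node whose ID is closest to the key's ID by XOR distance."""
        key_id = to_id(key)
        closest = min(self.node_ids, key=lambda node_id: node_id ^ key_id)
        return self.node_ids[closest]

    def put(self, key: str, value: Any) -> None:
        self.tables[self.responsible_node(key)][key] = value

    def get(self, key: str) -> Any:
        return self.tables[self.responsible_node(key)].get(key)


if __name__ == "__main__":
    dht = ToyDHT(["node-a", "node-b", "node-c"])
    dht.put("some-content-hash", ["peer-1"])
    print(dht.responsible_node("some-content-hash"), dht.get("some-content-hash"))
```

Any participant can compute which node is responsible for a key without a central index, which is the property that lets the table be spread over many machines.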

The IPFS DHT stores three kinds of records: provider records (which map data identifiers to the peers who host the content), IPNS records (which map IPNS keys to IPNS records), and peer records (which map peer IDs to the multiaddresses that locate the peers).

On IPFS, there are three steps in retrieving a file: discovery, routing, and exchange. First (discovery), query the provider records in the DHT, using the content’s multihash as the key, to see which peers are hosting the content. Second (routing), query the peer records in the DHT to figure out where the peers are. Third (exchange), request the desired content from those peers by sending them a wantlist; this wantlist is a list of blocks, where “a block [is] a single unit of data, identified by its key (hash)” (source).
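The three steps can be walked through with a small in-memory simulation in Python. This is only a sketch under stated assumptions: `ToyNetwork` is a hypothetical class, plain dictionaries stand in for the DHT's provider and peer records, a SHA-256 hex digest stands in for the multihash, and the addresses use a reserved documentation IP range. No real networking or IPFS library is involved.

```python
import hashlib
from typing import Dict, List, Set


def block_key(data: bytes) -> str:
    """Key of a block in this sketch: the SHA-256 hex digest of its bytes."""
    return hashlib.sha256(data).hexdigest()


class ToyNetwork:
    def __init__(self) -> None:
        self.provider_records: Dict[str, Set[str]] = {}    # block key -> peer IDs
        self.peer_records: Dict[str, str] = {}             # peer ID -> address
        self.peer_stores: Dict[str, Dict[str, bytes]] = {}  # address -> blocks held

    def publish(self, peer_id: str, address: str, data: bytes) -> str:
        """Store a block on a peer and announce that the peer provides it."""
        key = block_key(data)
        self.peer_records[peer_id] = address
        self.peer_stores.setdefault(address, {})[key] = data
        self.provider_records.setdefault(key, set()).add(peer_id)
        return key

    def retrieve(self, wantlist: List[str]) -> Dict[str, bytes]:
        received: Dict[str, bytes] = {}
        for key in wantlist:
            # Step 1, discovery: which peers provide this block?
            providers = sorted(self.provider_records.get(key, set()))
            for peer_id in providers:
                # Step 2, routing: where is that peer?
                address = self.peer_records[peer_id]
                # Step 3, exchange: ask the peer for the wanted block.
                block = self.peer_stores[address].get(key)
                if block is not None and block_key(block) == key:
                    received[key] = block
                    break
        return received


if __name__ == "__main__":
    net = ToyNetwork()
    k1 = net.publish("peer-1", "/ip4/192.0.2.1/tcp/4001", b"block one")
    k2 = net.publish("peer-2", "/ip4/192.0.2.2/tcp/4001", b"block two")
    print(net.retrieve([k1, k2]))
```

The requester rehashes each block it receives before accepting it, so a block from any peer is as trustworthy as the key that was requested.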


Conclusion

The brilliance of IPFS is its integration of technologies and data structures. Its design truly allows it to serve as the interplanetary file system. Content addressing enables permanence and security. Merkle trees enable directories and modular file storage. DHTs help the network stay connected. IPFS doesn’t, however, have the cryptoeconomic incentives that define blockchain systems. The decentralized storage network Filecoin is a natural evolution of IPFS, as it has a blockchain and is built on top of the technologies that power IPFS.

Join the Decentralized Open Science Movement

Does the idea of a free, open internet of science strike a chord with you? Consider joining the Opscientia community to learn, connect, and collaborate with others building a commons for co-discovery.

References

Benet, J. (n.d.). IPFS - Content Addressed, Versioned, P2P File System. Retrieved November 20, 2021, from https://ipfs.io/ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf

Distributed Hash Tables (DHTs) | IPFS Docs. (2021, February 21). IPFS Docs. Retrieved October 31, 2021, from Distributed Hash Tables (DHTs) | IPFS Docs

DWeb Tutorial | Content Addressing on the Decentralized Web (Lesson 2) | ProtoSchool. (n.d.). ProtoSchool. Retrieved October 30, 2021, from DWeb Tutorial | Content Addressing on the Decentralized Web (Lesson 2) | ProtoSchool

Host a single-page website on IPFS | IPFS Docs. (2021, August 24). IPFS Docs. Retrieved November 20, 2021, from Host a single-page website on IPFS | IPFS Docs

How IPFS Works | IPFS Docs. (2021, June 22). IPFS Docs. Retrieved October 30, 2021, from How IPFS works | IPFS Docs

Immutability | IPFS Docs. (n.d.). IPFS Docs. Retrieved October 24, 2021, from Immutability | IPFS Docs

IPLD Tutorial | Merkle DAGs: Structuring Data for the Distributed Web | ProtoSchool. (n.d.). ProtoSchool. Retrieved October 30, 2021, from https://proto.school/merkle-dags/05

Multiformats Tutorial | Anatomy of a CID | ProtoSchool. (n.d.). ProtoSchool. Retrieved October 28, 2021, from https://proto.school/anatomy-of-a-cid

Rumburg, R., Sethi, S., & Nagaraj, H. (2020). Audius: A Decentralized Protocol for Audio Content. https://whitepaper.audius.co/AudiusWhitepaper.pdf

What is IPFS? | IPFS Docs. (2021, June 22). IPFS Docs. Retrieved October 24, 2021, from What is IPFS? | IPFS Docs

Work with blocks | IPFS Docs. (2021, February 21). IPFS Docs. Retrieved October 30, 2021, from Work with blocks | IPFS Docs

Originally published at https://hack.opsci.io on December 3, 2021.
