IPFS: What it is, How it works, and Why it’s needed
What is IPFS?
IPFS, or the InterPlanetary File System, is a distributed system for storing and accessing files, websites, applications, and data. On its website, the project describes itself as “a peer-to-peer hypermedia protocol designed to preserve and grow humanity’s knowledge by making the web upgradeable, resilient, and more open.” Because IPFS is a peer-to-peer storage network, content is accessible through peer nodes located anywhere in the world. Those peer nodes might relay information, store it, or do both. Three main principles are key to understanding IPFS. The first is unique identification via content addressing: IPFS finds what you’re looking for by using the content’s unique address rather than its location. The second is that content is linked via Directed Acyclic Graphs (DAGs). The third is that content discovery is facilitated via Distributed Hash Tables (DHTs). These principles build upon each other to enable the IPFS ecosystem. In the remainder of this article we will examine how this is accomplished and what the benefits are.
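To make the third principle a little more concrete, here is a minimal, illustrative Python sketch of what a distributed hash table does at its core: it maps a content hash to the set of peers able to provide that content. The real IPFS network uses a libp2p Kademlia DHT spread across many nodes; the single in-memory table, class name, and peer name below are simplifications invented for this example.

```python
import hashlib

# Toy stand-in for the libp2p Kademlia DHT that IPFS actually uses:
# a table mapping a content hash to the set of peers that can provide it.
class ToyDHT:
    def __init__(self):
        self.providers = {}  # content hash -> set of peer IDs

    def provide(self, content: bytes, peer_id: str) -> str:
        """A peer announces that it can serve this content."""
        key = hashlib.sha256(content).hexdigest()
        self.providers.setdefault(key, set()).add(peer_id)
        return key

    def find_providers(self, key: str) -> set:
        """Ask who can serve the content with this hash."""
        return self.providers.get(key, set())

dht = ToyDHT()
key = dht.provide(b"hello interplanetary world", "peer-A")
print(dht.find_providers(key))  # {'peer-A'}
```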
How does IPFS work?
The way things currently work, if you want to download a photo from the internet, you tell your computer exactly where to find it. That location is identified by an IP address or a domain name, which is why this is called “location-based addressing.” The data is mostly stored in centralized data farms controlled by single companies or entities. So if you tell the computer where to get the information but that location isn’t accessible (the server is down), you won’t get the photo. When that happens, there is a high probability that someone else out there has downloaded that picture before and still has a copy of it, yet your computer won’t be able to grab it from that other person. To fix this, IPFS moves from “location-based addressing” to “content-based addressing.” Instead of saying where to find a resource, you simply say what it is you want. This works because every file has a unique hash, which can be compared to a fingerprint. When you want to download a file, you ask the network, “Who has the file with this hash?”, and someone on the IPFS network will provide it to you. Because IPFS uses content-based addressing, once something is added it can’t be changed. The network acts as an immutable datastore, very similar to a blockchain.
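To illustrate the difference between the two addressing styles, here is a small, hypothetical Python sketch. The URL, the stores, and the photo bytes are made up for the example, and a bare SHA-256 digest stands in for a real IPFS CID (which also encodes metadata such as the hash function and data codec used).

```python
import hashlib

def content_address(data: bytes) -> str:
    # A bare SHA-256 digest stands in for a real CID in this sketch.
    return hashlib.sha256(data).hexdigest()

photo = b"<photo bytes>"  # placeholder content for the example

# Location-based addressing: keyed by WHERE the data lives.
location_store = {"https://example.com/cat.jpg": photo}

# Content-based addressing: keyed by WHAT the data is.
content_store = {content_address(photo): photo}

# If example.com goes down, the location lookup fails. But any peer that
# holds bytes hashing to the same fingerprint can answer this request:
assert content_store[content_address(b"<photo bytes>")] == photo
```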
So what is actually going on under the hood? Files are stored inside IPFS objects. Each object can hold up to 256 KB of data and can also contain links to other IPFS objects. A small text file fits in a single IPFS object, but file types like videos or images can easily be much larger than 256 KB. IPFS handles this by splitting the file into multiple 256 KB objects and creating one additional, otherwise empty object that links to all the pieces of the file. When your file is split into smaller chunks, those chunks are cryptographically hashed and given a unique “fingerprint” called a content identifier (CID). This CID acts as a permanent record of your file as it exists at that point in time. When other nodes in the network look up your file, they ask their peers, “Who is storing the content with this file’s specific CID?” When they then view or download your file, they cache a copy and become another provider of your content until they choose to clear their cache. A node can pin content in order to keep and provide it forever, or discard content it hasn’t used or no longer wants in order to save space. If you add a new version of your file to IPFS, the cryptographic hash of the new version (and thus its CID) will be different. This helps make files stored on IPFS resistant to censorship and tampering, because changes to a file never overwrite the original, and common chunks across versions can even be reused to minimize storage costs. Luckily, this doesn’t mean you need to remember a long string of CIDs: IPFS can find the latest version of your file through the IPNS decentralized naming system, or through DNSLink, which maps CIDs to human-readable DNS names.
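As a rough sketch of this chunk-and-link idea, the Python snippet below splits a file into 256 KB chunks, hashes each chunk, and builds a root object containing only links to those chunks. This is not the actual format IPFS uses (real nodes are UnixFS/Merkle DAG protobufs with proper CIDs); the hashing of concatenated links is a simplification for illustration.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # 256 KB, matching the object size described above

def chunk_and_link(data: bytes):
    """Split a file into 256 KB chunks, hash each chunk, and build a root
    object that holds no file bytes itself, only ordered links."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    links = [hashlib.sha256(c).hexdigest() for c in chunks]
    # The root identifier is derived from the chunk hashes, so changing
    # any byte of any chunk changes the root identifier too.
    root_cid = hashlib.sha256("".join(links).encode()).hexdigest()
    return root_cid, links

root_cid, links = chunk_and_link(b"x" * 600_000)
print(len(links))  # 3 chunks for a ~600 KB file
```

Because the root identifier is derived from the chunk hashes, editing a single byte anywhere in the file yields a new root. This is exactly why new versions get new CIDs, while unchanged chunks can still be shared between versions.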
Now you might be thinking: if you are retrieving these files from peer nodes rather than some trusted centralized server, how can you be sure that the file you requested hasn’t been tampered with? Since you use a hash to request the file, you can verify what you received: upon receiving a file, simply check whether the hash of the file received matches the hash of the file requested. And since each updated version of a file gets a new hash and CID, you can be sure that nothing inside the file has been changed. This is like having built-in security! Another great feature of using hashes to address content is deduplication. When multiple people publish the same file on IPFS, it is only stored once, making the network more efficient.
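A toy Python version of both ideas, verification and deduplication, might look like the following; the store and helper functions are invented for illustration.

```python
import hashlib

def verify(requested_hash: str, received: bytes) -> bytes:
    """Accept a block from an untrusted peer only if it hashes to
    exactly what we asked for."""
    if hashlib.sha256(received).hexdigest() != requested_hash:
        raise ValueError("content does not match the requested hash")
    return received

store = {}

def publish(data: bytes) -> str:
    """Deduplication falls out for free: identical bytes hash to the
    same key, so publishing the same file twice stores it once."""
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key

assert publish(b"same file") == publish(b"same file")
assert len(store) == 1  # only one copy is kept
```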
The biggest problem IPFS faces is keeping files available. Every node on the network keeps a cache of the files it has downloaded and helps share them when other people request them. However, if a certain file is hosted by three specific nodes and those nodes all go offline at the same time, that file becomes unavailable for anyone to grab. This is obviously problematic. There are two main solutions being deployed to fix this: incentivize people to store files and make them available, or proactively distribute files and make sure a certain number of copies always remains available on the network. Filecoin is one project aiming to solve this problem by building an incentive layer on top of IPFS, and you can check it out to learn more about how this is being accomplished.
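The second approach, proactively keeping a minimum number of copies pinned, could be sketched like this in Python. The Peer class, the replication target of three, and the example CID are all assumptions made up for the sake of illustration:

```python
REPLICATION_TARGET = 3  # assumed policy: keep at least three pinned copies

class Peer:
    """Stand-in for a storage node that can be asked to pin content."""
    def __init__(self, name: str):
        self.name = name
        self.pinned = set()

    def pin(self, cid: str):
        self.pinned.add(cid)

def ensure_replicas(cid: str, peers: list):
    """Ask additional peers to pin the content until enough copies exist."""
    providers = [p for p in peers if cid in p.pinned]
    for peer in peers:
        if len(providers) >= REPLICATION_TARGET:
            break
        if peer not in providers:
            peer.pin(cid)
            providers.append(peer)

peers = [Peer(f"peer-{i}") for i in range(5)]
ensure_replicas("QmExampleCid", peers)  # "QmExampleCid" is a made-up CID
print(sum("QmExampleCid" in p.pinned for p in peers))  # 3
```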
Why is IPFS needed?
So you might be thinking that IPFS sounds like an interesting idea, but is it really necessary? Things are working just fine the way they are, aren’t they? Despite how well the internet currently works, there are a few key areas where IPFS raises the bar and provides a better, fairer, and freer online experience.
IPFS makes the web more efficient and less expensive
When using HTTP, files are downloaded from a single server at a time. A peer-to-peer system like IPFS retrieves pieces of a file from multiple nodes at once, which can enable bandwidth savings of up to 60% for things like videos. This makes it possible to efficiently distribute high volumes of data without duplication.
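A simplified Python sketch of that parallel retrieval pattern is shown below. The peers, their chunk inventories, and the download plan are invented for the example; a real client would discover providers through the DHT and exchange blocks over Bitswap.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up peers, each holding some chunks of the same file. Pulling
# different chunks from different peers at once spreads the load that a
# single HTTP server would otherwise carry alone.
peer_chunks = {
    "peer-A": {0: b"chunk0-", 1: b"chunk1-"},
    "peer-B": {1: b"chunk1-", 2: b"chunk2"},
}

def fetch(peer: str, index: int) -> bytes:
    return peer_chunks[peer][index]

# Assign each chunk index to a peer that has it, then download in parallel.
plan = [("peer-A", 0), ("peer-B", 1), ("peer-B", 2)]
with ThreadPoolExecutor() as pool:
    chunks = list(pool.map(lambda job: fetch(*job), plan))

print(b"".join(chunks))  # b'chunk0-chunk1-chunk2'
```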
IPFS can help preserve humanity’s history
The average lifespan of a web page is about 100 days. The goal of IPFS is to preserve humanity’s history by letting users store data while minimizing the risk of that data being lost or accidentally deleted. IPFS accomplishes this by making it simple to set up resilient networks for mirroring data, and thanks to content addressing, files stored on IPFS are automatically versioned. However, storage is still finite, so nodes need to clear out some of their previously cached resources to make room for new ones. This process is called garbage collection. To make sure data persists on the network, it can be pinned to one or more IPFS nodes, giving you control over disk space and data retention.
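Here is a minimal Python sketch of how a node might combine caching, pinning, and garbage collection. The BlockStore class and its least-recently-used eviction policy are assumptions for illustration; the real IPFS garbage collector simply removes all unpinned blocks when triggered (e.g. via `ipfs repo gc`).

```python
import time

class BlockStore:
    """Toy node-local cache: pinned blocks survive garbage collection,
    unpinned blocks are evicted (least recently used first) when full."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = {}      # cid -> data
        self.pinned = set()   # cids exempt from garbage collection
        self.last_used = {}   # cid -> timestamp of last access

    def put(self, cid: str, data: bytes, pin: bool = False):
        self.blocks[cid] = data
        self.last_used[cid] = time.monotonic()
        if pin:
            self.pinned.add(cid)
        self.gc()

    def gc(self):
        """Evict the least recently used unpinned blocks until we fit."""
        while len(self.blocks) > self.capacity:
            unpinned = [c for c in self.blocks if c not in self.pinned]
            if not unpinned:
                break  # everything left is pinned; nothing may be evicted
            victim = min(unpinned, key=lambda c: self.last_used[c])
            del self.blocks[victim], self.last_used[victim]

node = BlockStore(capacity=2)
node.put("cid-a", b"...", pin=True)  # pinned: survives garbage collection
node.put("cid-b", b"...")
node.put("cid-c", b"...")            # triggers eviction of "cid-b"
```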
IPFS helps make the web more decentralized
Centralization of the web is not a good thing. The internet has helped drive innovation and has been one of the greatest equalizers in human history, providing low-cost, barrier-free access to information and education (almost) anywhere in the world. As of January 2021 it was estimated that there are 4.66 billion active internet users worldwide, approximately 59.5% of the global population. That number is only going to keep climbing, and the increasing risks that come with consolidating control over the network threaten this progress. IPFS aims to stay true to the original vision of an open web.
The way the web currently works, it is much easier for entities like governments to censor dissenters or journalists, because content is hosted on only a small number of huge servers. This is not just hypothetical. A real-world example took place in 2017, when the Turkish government ordered internet providers operating in the country to block access to Wikipedia, citing the crowd-sourced encyclopedia as a “threat to national security.” In response, people in the IPFS community took the Turkish version of Wikipedia and put a copy of it on IPFS. Because IPFS is distributed and there are no central servers, the government could no longer block access when the files were retrieved over the IPFS network.
IPFS helps support this vision of a more resilient and decentralized internet by making it possible to download a file from many locations that aren’t managed by one organization. If someone were to attack Wikipedia’s servers, or an engineer working on the project made a huge internal mistake that crashed them, you could still access the same pages from somewhere else. Another positive of the IPFS model is that you can retrieve a file from someone nearby instead of from a server thousands of miles away, and often get it faster. This can be especially valuable in parts of the world that have communities networked locally but lack a strong connection to the wider internet, resulting in better connectivity for the developing world, during natural disasters, or even when you’re just on some questionable coffee shop Wi-Fi.
Wrap Up
Despite the complex technology used to facilitate IPFS, the fundamental ideas underneath focus on changing how networks of computers and people communicate. IPFS uses a very powerful data structure, and the architecture of the network allows us to use it like a filesystem. It will be very interesting to watch whether IPFS can continue to accomplish its goal of being a “low-level tool that allows a rich fabric of communities, business, and cooperative organizations to form a distributed web that is much more reliable, robust, and equitable than the internet we currently have today.”