What’s really happening when you add a file to IPFS?
From raw data to Merkle DAGs and a few steps in between
When you ask someone for their favorite cat video, they probably aren’t going to say something like “oh haha, the one on this server, at this sub-domain, under this file path, slash hilarious dash cat dot mp4”. Instead, they’re probably going to describe the content of the video: “oh haha, the one where the cat knocks the glass off the counter, thug style… classic”. This is obviously an intuitive way for humans to think about content, but it is generally not how we access content on the web today. Having said that, decentralized protocols such as IPFS actually do use this type of content addressing to find content on the decentralized web. In this article, we’ll explore a little of how this whole process works, taking a look under the hood to find out exactly what happens when you add a file to IPFS. While we’re at it, we’ll spend a good chunk of time learning about IPLD, the underlying data structure of the Interplanetary File System.
So, first things first: to support content addressing, we need to come up with some way to create a ‘fingerprint’ or summary of the content that we can use to reference said content, much like the ISBN we use to find a book. In practice, content addressing systems on the web such as IPFS use cryptographic hash functions to create fingerprints. Basically, we take the raw content (in this case, a cat photo) and run that data through a hash function to produce a digest. For all practical purposes, this digest is unique to the contents of the file (or image or whatever), and to that file only: finding two different files with the same digest is computationally infeasible. If I change that file by even one bit, the hash will become something completely different.
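To make the fingerprint idea concrete, here is a minimal sketch using Python's standard `hashlib`. The `fingerprint` helper and the sample bytes are made up for illustration (a real cat photo would simply be a longer byte string), and SHA-256 is assumed as the hash function:

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 digest of the content as a hex string."""
    return hashlib.sha256(content).hexdigest()


# Stand-in for the raw bytes of an image (hypothetical sample data).
original = b"pretend these bytes are a cat photo"

# Flip the lowest bit of the first byte to simulate a one-bit change.
tampered = bytes([original[0] ^ 0x01]) + original[1:]

print(fingerprint(original))
print(fingerprint(tampered))  # a completely different digest
```

Running it prints two 64-character hex strings that have nothing obvious in common, even though the inputs differ by a single bit.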
So we’ve hashed our image (created a digest), now what? Well, what we’re after is a content address/identifier. So we now need to take that digest and convert it into something that IPFS and other systems can use to locate it… but this is not all that simple. What if things change in the future, and we want to change the way we address content? What if someone invents a better hash function? Even the IP system we have now has had to undergo upgrades. Well, the good folks at IPFS have thought of this too!
Have you ever noticed that IPFS hashes all seem to start with Qm? This is because those hashes are actually something called a multihash. This is cool, because the hash itself specifies which hash function was used, and the length of the resulting hash, in the first two bytes of the multihash. In most of our examples, the first byte in hex is 12, which denotes the SHA256 hash function, and the second byte is the output length, 20 in hex (or 32 bytes)… which is where we get the Qm from when we base58 encode the whole thing. So then you might ask, why base58 encode the whole thing? Well, because similar-looking characters are omitted: 0 (zero), O (capital o), I (capital i) and l (lower case L), and the non-alphanumeric characters + (plus) and / (slash) are dropped, making it slightly more human readable. And all of this because we want a future-proof system that allows for multiple different fingerprinting mechanisms to coexist. So if that awesome new hashing function does get invented, we’ll simply change the first few bytes of the multihash, and voila… new IPFS hashes no longer start with Qm… but because we are using multihashes, the old ones will still work, along with the new ones… cool!
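The byte layout described above is easy to try out. The following is a rough sketch, not the IPFS implementation: `base58_encode` and `sha256_multihash` are hand-rolled helpers written for this example, using the Bitcoin-style base58 alphabet. Note that it hashes the raw bytes directly, so the result will not match what `ipfs add` returns for the same content (IPFS hashes a wrapped, chunked representation, as we'll see next).

```python
import hashlib

# Base58 alphabet without 0, O, I and l (the Bitcoin-style ordering).
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(data: bytes) -> str:
    """Encode bytes as base58 by repeated division of the big-endian integer."""
    number = int.from_bytes(data, "big")
    encoded = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        encoded = B58_ALPHABET[remainder] + encoded
    # Each leading zero byte is represented by the first alphabet character.
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded


def sha256_multihash(content: bytes) -> bytes:
    """Prefix a SHA-256 digest with its function code (0x12) and length (0x20)."""
    digest = hashlib.sha256(content).digest()
    return bytes([0x12, len(digest)]) + digest


multihash = sha256_multihash(b"Hello World!\n")
encoded = base58_encode(multihash)

print(multihash[:2].hex())  # prints 1220: function code, then digest length
print(encoded)              # a 46-character string starting with "Qm"
```

Swapping in a different hash function would change those two leading bytes, and with them the familiar prefix.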
Merkle DAG ➞ IPLD
Ok, so I’ve got my file, I’ve hashed and encoded it. But that’s not really the whole story. What is actually happening is something more like this…
The content is chunked up into smaller parts (about 256k each), each part is hashed, a CID is created for each chunk, and then these chunks are combined into a hierarchical data structure, for which a single, base CID is computed.
This data structure is essentially something called a Merkle DAG, or directed acyclic graph.
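Here is a toy version of that chunk-then-hash process. It is only a sketch under simplifying assumptions: plain SHA-256 hex digests stand in for CIDs, the root is just a hash over the list of child digests, and the helper names (`chunk`, `digest`, `root_digest`) are invented for this example.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # roughly the chunk size mentioned above


def chunk(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def digest(data: bytes) -> str:
    """Stand-in for a CID: the SHA-256 hex digest of the bytes."""
    return hashlib.sha256(data).hexdigest()


def root_digest(data: bytes) -> tuple[str, list[str]]:
    """Hash every chunk, then hash the list of chunk digests to get a root."""
    child_digests = [digest(piece) for piece in chunk(data)]
    root = digest("\n".join(child_digests).encode())
    return root, child_digests


# 600 KiB of sample data, so we expect chunks of 256 KiB, 256 KiB and 88 KiB.
data = bytes(i % 251 for i in range(600 * 1024))
root, children = root_digest(data)

print(len(children))  # 3
print(root)
```

Because the root is computed from the child digests, changing any single chunk changes the root as well, while the digests of the untouched chunks stay the same.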
Here’s an awesome video of Juan Benet of Protocol Labs explaining how IPFS uses Merkle DAGs as its core data structure… for what is called the Interplanetary Linked Data (or IPLD) structure:
Linked data is actually something that folks in the decentralized web community have been talking about for quite some time. It’s something Tim Berners-Lee has been working on for ages through his Solid project, and his new company, Inrupt, is building a business around it.
Essentially, what we are talking about is a structure that models everything as a series of linked objects. In the IPLD world, we have objects, each with Data and Links fields (where Data can be a small blob of unstructured, arbitrary binary data, and Links is an array of Link structures, which are simply links to other IPFS objects). Speaking of which, Links each have a Name, a Hash (or CID) of the linked object, and a Size, which represents the size of the linked object. This last bit of info is really just so we can estimate object/file sizes without having to pre-fetch too much data, but it’s extremely nice to have.
To summarize, an IPFS object has two fields:
Data: blob of unstructured binary data of size < 256 kB.
Links: array of Link structures. These are links to other IPFS objects.
A Link structure has three data fields:
Name: name of the Link
Hash: hash of the linked IPFS object
Size: cumulative size of the linked IPFS object, including the objects reached by following its links
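If it helps to see those fields as code, here is one way they could be modelled. This is a sketch, not the actual IPFS wire format: the `Node` and `Link` classes, the `link_to` helper, the way the digest is computed, and the file names are all assumptions made for illustration, and the sizes here ignore the serialization overhead that real IPFS objects include.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Link:
    name: str  # name of the link
    hash: str  # stand-in for the CID of the linked object
    size: int  # cumulative size of the linked object


@dataclass
class Node:
    data: bytes = b""
    links: list[Link] = field(default_factory=list)

    def cumulative_size(self) -> int:
        """Own data plus the cumulative size recorded on every link."""
        return len(self.data) + sum(link.size for link in self.links)

    def digest(self) -> str:
        """Hash the data together with every link, so links affect identity."""
        hasher = hashlib.sha256(self.data)
        for link in self.links:
            hasher.update(f"{link.name}:{link.hash}:{link.size}".encode())
        return hasher.hexdigest()


def link_to(name: str, node: Node) -> Link:
    """Build a Link that points at `node` under the given name."""
    return Link(name=name, hash=node.digest(), size=node.cumulative_size())


# Two leaf objects holding data, a directory linking to them, and a root.
leaf_a = Node(data=b"a" * 10)
leaf_b = Node(data=b"b" * 20)
directory = Node(links=[link_to("a.txt", leaf_a), link_to("b.txt", leaf_b)])
root = Node(links=[link_to("my_dir", directory)])

print(root.links[0].size)  # 30: known without fetching either leaf
```

The last line is the point of the Size field: the root can report how much data sits beneath a link without fetching anything below it.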
Learning by doing
We can actually explore IPLD objects using the IPFS command line tools. So first, make sure you have IPFS installed and are comfortable playing around with the command line. If you need an introductory tutorial, check out Session 1 of our Textile Build Series. Once you’re ready, we’ll take a quick look at the object structure for a different cat image (using the handy dandy jq tool). Start with the following command, which pipes (|) the result of getting the IPFS object to jq:
ipfs object get QmW2WQi7j6c7UgJTarActp7tDNikE4B2qXtFCfLPdsgaTQ | jq
Producing the following output:
Notice that this object contains a single Link, which we can further explore using the same commands:
ipfs object get Qmd286K6pohQcTKYqnS1YhWrCiS4gz7Xi34sdwMe9USZ7u | jq
Which, in turn, produces the following output. Notice that the two Links are each < 256K in size:
"Data": "\b\u0002\u0018ކ\u001b ��\u0010 ކ\u000b"
This is pretty cool, and due to the flexible nature of DAGs (simple link-based graphs), we can represent just about any data structure we want using IPLD. For instance, let’s say you had the following directory structure, and you wanted to add it to IPFS. Firstly, it’s amazingly easy to do this (see below), and secondly, the benefits of using a DAG to represent data in IPFS become immediately apparent, as we’ll see in a moment.
In this example, assume that all three files marked with an asterisk, testing.txt among them, contain the same text: “Hello World!\n”. Now let’s add them to IPFS:
ipfs add -r test_dir/
When you do this, you end up with a DAG that looks something like this:
Where (depending on the actual contents of the files in your directory), you end up with a series of objects, linked via their CIDs. At the top level we have the actual folder, without a name but with a CID. From there we have direct links to its entries, including bigfile.js and the underlying my_dir. From my_dir (in the middle) we have links to two files, one of them testing.txt, both of which actually reference the same CID! This is pretty cool. Because we reference content (not the files themselves), we get deduplication ‘for free’! Lastly, on the bottom left, we have our bigfile.js, which has been chunked into three smaller pieces, each of which has its own CID, and which together form the larger file. If you follow all of these CIDs up the tree, each level gives you a CID that describes the contents below it. This is a critical concept…
The fact that we have Links gives our collection of IPFS objects a graph-like structure (or a tree). Again, DAG means Directed Acyclic Graph, and Merkle comes from the name of the inventor, Ralph Merkle, who actually patented hash trees in 1979. Anyway, what Merkle DAGs get us is content addressing, such that all content is uniquely identified by its cryptographic hash, including links to things it references. This makes the structure tamper-evident, because all content is verified against its hash: right hash, right content. And again, since we are hashing the contents of the files, we have no duplication, because in the Merkle DAG world, all objects that hold the same content are considered equal (i.e. their hash values are the same), and so we only store them once. De-duplication by design.
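Both properties, de-duplication and verification, fall out of keying storage by hash. The sketch below uses a hypothetical in-memory `BlockStore` class (nothing like the real IPFS datastore, and with plain SHA-256 hex digests standing in for CIDs) to show the idea:

```python
import hashlib


class BlockStore:
    """A toy content-addressed store: blocks are keyed by their own hash."""

    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        """Store the content under its SHA-256 digest and return that key."""
        key = hashlib.sha256(content).hexdigest()
        self._blocks[key] = content  # identical content lands on the same key
        return key

    def get(self, key: str) -> bytes:
        """Return the content, refusing anything that no longer matches its key."""
        content = self._blocks[key]
        if hashlib.sha256(content).hexdigest() != key:
            raise ValueError("content does not match its hash")
        return content

    def __len__(self) -> int:
        return len(self._blocks)


store = BlockStore()

# Three "files" with identical content, plus one that differs.
keys = [store.put(b"Hello World!\n") for _ in range(3)]
other = store.put(b"something else")

print(len(set(keys)))  # 1: all three copies share one key
print(len(store))      # 2: only two blocks are actually stored
```

The check in `get` is the "right hash, right content" rule: a block that has been altered after the fact can no longer be served under its old key.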
We can play around with this idea of Merkle DAGs and chunking up large objects ourselves from the command line. For example, let’s grab a nice big jpg to play with. You can ipfs cat it, or just download it directly from GitHub if you want:
ipfs cat QmWNj1pTSjbauDHpdyg5HQ26vYcNWnubg1JehmwAE9NnU9 > cosmos.jpg
Now you can add it locally, and if you cat’d it initially, make sure the hashes match (here we’re assigning the returned hash to the shell variable hash):
hash=`ipfs add -q cosmos.jpg`
You should get back a CID hash that looks exactly like this one (plus some progress):
Now, let’s take a look at the underlying ipfs object for that particular image:
ipfs ls -v $hash
Note that each linked object is about 256k. Together, these chunks make up the whole image. So when requesting this file from the network, we can actually grab bits from different peers, and then our peer will put it all together at the end to give us the file we want. Truly decentralized!
Hash Size Name
Now, just to show you that the above four chunks do indeed make up the single image, you can use the following code to ‘manually’ join the chunks together to create the image file, which is essentially what cat is doing in the background when you reference the base CID:
ipfs cat \
Alternatively, you could do this more succinctly using pipes again:
ipfs refs $hash | ipfs cat > test.jpg ; open test.jpg
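The same join-the-chunks idea can be sketched without a running IPFS node. In this toy model, a plain dict plays the role of the block store, SHA-256 hex digests stand in for CIDs, and the `put`, `add_file` and `cat` functions are simplified stand-ins written for this example rather than what the real commands do internally. The chunk size is deliberately tiny so the example stays small.

```python
import hashlib


def put(store: dict, data: bytes, links: tuple = ()) -> str:
    """Store a block as (data, child keys) under a hash of both; return the key."""
    key = hashlib.sha256(data + "".join(links).encode()).hexdigest()
    store[key] = (data, tuple(links))
    return key


def add_file(store: dict, content: bytes, chunk_size: int) -> str:
    """Chunk the content into leaf blocks, then link them from a single root."""
    leaves = [
        put(store, content[i:i + chunk_size])
        for i in range(0, len(content), chunk_size)
    ]
    return put(store, b"", tuple(leaves))


def cat(store: dict, key: str) -> bytes:
    """Rebuild content by walking the links depth-first, in order."""
    data, links = store[key]
    return data + b"".join(cat(store, child) for child in links)


store: dict = {}
content = bytes(range(256)) * 10  # 2,560 bytes of sample data
root = add_file(store, content, chunk_size=1000)

print(len(store[root][1]))          # 3 chunks: 1000 + 1000 + 560 bytes
print(cat(store, root) == content)  # True: the chunks join back into the file
```

Only the root key is needed to get the whole file back, which is exactly why sharing a single CID is enough.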
There you go. It’s links all the way down. And on top of that, we’ve learned a few IPFS command-line tools for manipulating IPFS DAG objects. Handy!
So let’s do a quick recap. Merkle DAGs are a core concept of IPFS, but they are also at the core of many other technologies like git, bitcoin, dat, etc. These DAGs are basically hash ‘trees’ made up of content blocks, each with a unique hash. You can reference any block within that tree, which means you can build up a tree from any combination of subblocks. Which brings us to another awesome thing about DAGs, particularly when working with large files: to reference a large data file, all you need is the base CID, and you actually have a verified reference to the whole object. For large, popular files stored in multiple places on a network, sending around CIDs and then requesting bits from multiple peers makes file sharing a breeze, and means you only need to share around a few bytes, rather than a whole file.
But of course, you will rarely interact with DAGs or objects directly. Most of the time, your friendly ipfs add command will simply create the Merkle DAG from data in files that you specify, creating the underlying IPLD objects for you, and you’ll go on your merry way. So the answer to the question “what’s really happening when you add a file to IPFS?” is… cryptography, math, networking, and some magic!
And that’s all folks. You now know pretty much exactly what happens when you add a file to IPFS. What happens next is a topic for a future post. In the meantime, why not check out some of our other stories, or sign up for our Textile Photos waitlist to see what we’re building with IPFS. While you’re at it, drop us a line and tell us what cool distributed web projects you’re working on. We’d love to hear about it!