The technology behind IPFS

What can IPFS do?

James · Published in Coinmonks · 5 min read · Jul 19, 2018



In short, IPFS is a P2P file system. Some terms:

  1. IPNS — a name system. Maps a cryptographic key to a file in IPFS.
  2. IPRS — a record system. An abstract record that can be validated and queried for. All IPNS records are (well, would be) IPRS records but not all IPRS records would be IPNS records.
  3. DHT, Blockchains, Pubsub, etc. — Ways to distribute these records (and more).

How does it work?

IPFS gives each file a unique hash (a fingerprint derived from its content). Even a one-character difference produces a completely different hash. IPFS can therefore locate a file by its content rather than by a domain name, which is how HTTP addresses resources.
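The effect of content addressing is easy to demonstrate. A minimal sketch in Python, using plain SHA-256 as a stand-in for the multihash-encoded, base58 digests IPFS actually uses:

```python
import hashlib

def content_address(data: bytes) -> str:
    # Stand-in for IPFS content addressing: hash the bytes themselves.
    # (Real IPFS wraps the digest in a multihash and base58-encodes it.)
    return hashlib.sha256(data).hexdigest()

a = content_address(b"hello ipfs")
b = content_address(b"hello ipfs!")  # one extra character

print(a != b)  # True: any change yields a completely different address
```

The same bytes always produce the same address, which is also what lets IPFS deduplicate identical files across the network.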

IPFS removes redundant files across the whole network and provides version control: every edit is recorded in the history and can easily be traced back.

When a search query is fired, IPFS looks up the document by its hash. Since the hash is unique, the query is straightforward.

Both IPFS hashes and HTTP's IP addresses are hard for humans to remember. Just as domain names were invented to stand in for IP addresses, IPNS stands in for IPFS hashes. Every node in IPFS stores a hash table that records where the corresponding files are located.

Technology

Distributed Hash Tables

In the case of IPFS, the key is a hash of the content. When a node queries for the content with hash QmcPx9ZQboyHw8T7Afe4DbWFcJYocef5Pe4H3u7eK1osnQ, it looks up in the DHT which nodes have that content.

How a specific value is found efficiently (fast, with as few network requests as possible), and how the DHT absorbs changes (nodes entering or leaving the network, or new entries in the table), differs between the various DHT implementations.

The DHT is used in IPFS for routing, in other words:

  1. to announce added data to the network
  2. and help locate data that is requested by any node.

The white paper states:

Small values (equal to or less than 1KB) are stored directly on the DHT. For values larger, the DHT stores references, which are the NodeIds of peers who can serve the block.
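The 1 KB rule from the white paper can be sketched with an illustrative in-memory table (a real DHT partitions this table across peers; the dict and the NodeIds here are assumptions for illustration only):

```python
import hashlib

MAX_INLINE = 1024  # the 1 KB threshold from the white paper

dht = {}  # key -> raw bytes, or a list of NodeIds (illustrative only)

def dht_put(value: bytes, provider_id: str) -> str:
    key = hashlib.sha256(value).hexdigest()
    if len(value) <= MAX_INLINE:
        dht[key] = value  # small value: store it directly on the DHT
    else:
        # large value: store only references (NodeIds of peers
        # that can serve the block)
        dht.setdefault(key, []).append(provider_id)
    return key

small = dht_put(b"tiny record", "QmPeerA")
large = dht_put(b"x" * 4096, "QmPeerB")
print(dht[small])  # b'tiny record': stored inline
print(dht[large])  # ['QmPeerB']: providers that can serve the block
```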

Block Exchanges — BitTorrent

A notable difference is that in BitTorrent each file has its own swarm of peers (forming a P2P network with each other), whereas IPFS is one big swarm of peers for all data.
When peers connect, they exchange which blocks they have (have_list) and which blocks they are looking for (want_list).

BitSwap strategy:
- the strategy is based on previous data exchanges between the two peers
- when peers exchange blocks, they keep track of the amount of data they share (building credit) and the amount of data they receive (building debt)
- this accounting between two peers is kept in the BitSwap ledger
- if a peer has credit (has shared more than it received), our node sends the requested block
- if a peer has debt, our node decides whether to share via a probabilistic function in which the chance of sharing shrinks as the debt grows
- a data exchange always starts with an exchange of ledgers; if they are not identical, our node disconnects
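The debt-dependent decision above can be sketched as follows. The white paper models the probability of sending as a sigmoid of the debt ratio r = bytes_sent / (bytes_recv + 1); the function and helper names below are illustrative:

```python
import math
import random

def debt_ratio(bytes_sent: float, bytes_recv: float) -> float:
    # Ratio from our node's view: how much we sent vs. received from a peer.
    return bytes_sent / (bytes_recv + 1)

def p_send(r: float) -> float:
    # Sigmoid from the IPFS white paper: probability of serving a block
    # drops off sharply once the peer's debt ratio passes ~2.
    return 1 - 1 / (1 + math.exp(6 - 3 * r))

def should_send(bytes_sent, bytes_recv, rng=random.random) -> bool:
    return rng() < p_send(debt_ratio(bytes_sent, bytes_recv))

print(round(p_send(0.0), 3))  # 0.998: peer in good standing
print(round(p_send(2.0), 3))  # 0.5: break-even point
print(round(p_send(4.0), 3))  # 0.002: heavy debt, almost never send
```

Because the cutoff is probabilistic rather than a hard threshold, a peer cannot game the system by hovering exactly at the boundary.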

Version Control Systems — Git

Properties:
1) Immutable objects represent Files (blob), Directories (tree), and Changes (commit).
2) Objects are content-addressed, by the cryptographic hash of their contents.
3) Links to other objects are embedded, forming a Merkle DAG. This provides many useful integrity and workflow properties.
4) Most versioning metadata (branches, tags, etc.) are simply pointer references, and thus inexpensive to create and update.
5) Version changes only update references or add objects.
6) Distributing version changes to other users is simply transferring objects and updating remote references.
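These properties can be illustrated with a toy content-addressed object store (JSON serialization and SHA-256 are stand-ins for Git's actual object format):

```python
import hashlib
import json

store = {}  # content-addressed object store: hash -> immutable object

def put(obj: dict) -> str:
    data = json.dumps(obj, sort_keys=True).encode()
    h = hashlib.sha256(data).hexdigest()
    store[h] = obj
    return h

# Immutable objects linked by hash, as in Git: blob -> tree -> commit.
blob = put({"type": "blob", "data": "hello"})
tree = put({"type": "tree", "entries": {"hello.txt": blob}})
commit = put({"type": "commit", "tree": tree, "parent": None, "msg": "init"})

# Editing a file adds new objects; the old ones stay addressable (history).
blob2 = put({"type": "blob", "data": "hello, world"})
tree2 = put({"type": "tree", "entries": {"hello.txt": blob2}})
commit2 = put({"type": "commit", "tree": tree2, "parent": commit, "msg": "edit"})

print(store[commit2]["parent"] == commit)  # True: the Merkle DAG links versions
```

A branch or tag would just be a mutable name pointing at a commit hash, which is why such references are cheap to create and update.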

Self-Certified Filesystems — SFS

This is used to implement the IPNS name system for IPFS. It lets us generate an address for a remote filesystem whose validity the user can verify.

SFS introduced a technique for building Self-Certified Filesystems: addressing remote filesystems using the following scheme:

/sfs/<Location>:<HostID>

where Location is the server network address, and:

HostID = hash(public_key || Location)
Thus the name of an SFS file system certifies its server.
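A sketch of the scheme, assuming SHA-256 in place of the SHA-1 digest SFS historically used, with a made-up key and location:

```python
import hashlib

def host_id(public_key: bytes, location: str) -> str:
    # HostID = hash(public_key || Location)
    return hashlib.sha256(public_key + location.encode()).hexdigest()

def sfs_address(public_key: bytes, location: str) -> str:
    # /sfs/<Location>:<HostID>
    return f"/sfs/{location}:{host_id(public_key, location)}"

addr = sfs_address(b"-----FAKE PUBLIC KEY-----", "sfs.example.org")
print(addr)
```

A client holding such an address can verify the server it connects to: fetch the server's public key, recompute hash(public_key || Location), and compare it with the HostID embedded in the address.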

Applications

IPFS also supports publish/subscribe (pubsub) applications.

‘Publishers’ send messages classified by topic or content and ‘subscribers’ receive only the messages they are interested in, all without direct connections between publishers and subscribers.
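A minimal in-process sketch of the topic-based pubsub model (real IPFS pubsub propagates messages across the P2P overlay rather than through a local dict; the class and names here are illustrative):

```python
from collections import defaultdict

class PubSub:
    """Toy topic-based pub/sub: publishers and subscribers only share a topic."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of handlers

    def subscribe(self, topic: str, handler) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> None:
        # Deliver to every subscriber of this topic; publishers never
        # need a direct connection to (or knowledge of) subscribers.
        for handler in self.subscribers[topic]:
            handler(message)

bus = PubSub()
received = []
bus.subscribe("ipfs-news", received.append)
bus.publish("ipfs-news", "new release")
bus.publish("other-topic", "ignored")
print(received)  # ['new release']
```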

CLI Usage: https://ipfs.io/blog/25-pubsub/

Example:

Content publishing / monetization platform: http://www.alexandria.io/learn/#integrated-technologies

github:
https://github.com/dloa/alexandria-librarian

P2P chat application:
https://orbit.chat
https://github.com/orbitdb/orbit

Online player: input a hash to view the corresponding video
http://www.ipfs.guide/
http://ipfser.org/2018/04/08/r36/

Media network:
https://akasha.world/

List of Examples:
https://github.com/ipfs/awesome-ipfs

Drawbacks

Slow:
Copying a 100 MB file and a 1 GB file over scp took 1.5 s and 17 s respectively, whereas the same transfers took 16 s and 170 s over IPFS.

Downloading a 373 MB binary file:

  • wget: 12.6 MB/s
  • ipfs get (direction A): 9.10 MB/s
  • ipfs get (direction B): 2.07 MB/s

And for latency:

  • ping: 33.018 ms
  • ipfs ping: 32.93 ms
  • ipfs cat (direction A): 151 ms
  • ipfs cat (direction B): 443 ms

Reference: https://github.com/ipfs/go-ipfs/issues/2111

Private IPFS:

All files in IPFS are available to the public, which is not a desirable feature for most applications, especially for enterprise applications. One may want a private IPFS that can only be accessed by certain entities.

This feature is still in an experimental phase.
Repo: https://github.com/ipfs/ipfs-cluster

Roadmap: https://cluster.ipfs.io/roadmap/
Short term (Q2 2018):

  • Project website (ongoing)
  • Key functionality extraction from go-ipfs / importers (ongoing)
  • Sharding support prototype (ongoing)
  • Improve UX to handle larger files / concurrent pin operations (several fronts, some ongoing)
  • Efficient repository disk usage in IPFS (ongoing)
  • Reference guide to setup, manage and operate a production cluster.
  • First metrics are exposed.
  • Live large-storage cluster operated and maintained by us (IPFS) consolidates all our pinsets. (ongoing)
  • Discussions and collaborations started with players in the “large dataset” space.

Midterm (6–8 month horizon; this section was written in July 2018):

  • Stable collaborations with different players interested in using ipfs-cluster/ipfs to store large datasets.
  • Good sharding support for at least ~1TB datasets (package repositories).
  • Metrics that give an idea of how big a cluster can grow (peers, pinset, repository size), where degradation starts, and the critical paths in application performance.
  • Collaborative archival efforts (between strangers)
