Using IPFS For Distributed File Storage Systems
--
IPFS (InterPlanetary File System) is a decentralized storage system that is often used alongside blockchains to hold content that is too large to store on-chain. IPFS uses a P2P (peer-to-peer) network model for file sharing, distributed across many computers or nodes. Files are broken into chunks and stored across the network, and nodes track each chunk by its hash. When the chunks are reassembled according to their hash values, they recreate the original file.
Decentralized File Storage
The use of a Distributed Hash Table (DHT) for storage and retrieval is at the core of IPFS. It is similar to the BitTorrent protocol, but differs in how it points to a file for sharing. IPFS itself is not a blockchain: the DHT holds key-value pairs that map a content hash to the nodes that can provide it. The data is broken up into 256 KB chunks and spread across a network of nodes or computers, and the DHT coordinates lookup and access between those nodes. BitTorrent relies on torrent files to point to content instead. Different torrents can point to the same file, but in IPFS a single hash ID points to a given file.
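To make the chunking idea concrete, here is a minimal sketch in Node.js. It assumes plain SHA-256 hex digests and a fixed 256 KB chunk size; real IPFS content identifiers use a multihash encoding and link chunks in a Merkle DAG, so the function names and the in-memory lookup table below are illustrative only.

```javascript
const crypto = require('crypto');

// Chunk size mentioned in the text: 256 KB (taken here as 256 * 1024 bytes).
const CHUNK_SIZE = 256 * 1024;

// Hex-encoded SHA-256 digest of a Buffer.
function sha256(buf) {
  return crypto.createHash('sha256').update(buf).digest('hex');
}

// Split a Buffer into fixed-size chunks and label each chunk with its hash.
function chunkFile(data, chunkSize = CHUNK_SIZE) {
  const chunks = [];
  for (let offset = 0; offset < data.length; offset += chunkSize) {
    const bytes = data.subarray(offset, offset + chunkSize);
    chunks.push({ hash: sha256(bytes), bytes });
  }
  return chunks;
}

// Rebuild the file by looking chunks up by hash, in the recorded order.
function reassemble(orderedHashes, chunkLookup) {
  return Buffer.concat(orderedHashes.map((h) => chunkLookup.get(h)));
}

const original = crypto.randomBytes(600 * 1024); // 600 KB of sample data
const chunks = chunkFile(original);
const lookup = new Map(chunks.map((c) => [c.hash, c.bytes]));
const rebuilt = reassemble(chunks.map((c) => c.hash), lookup);

console.log(chunks.length);            // number of chunks produced
console.log(rebuilt.equals(original)); // whether the rebuilt file matches
```

The ordered list of chunk hashes plays the role of the file's "recipe": anyone holding it can fetch the chunks from whichever nodes have them and put the file back together.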
Files are not posted to IPFS in the same way as posting a file to the cloud. All data on IPFS is addressed by its hash ID. When someone requests that data, they request it by its hash ID and not by the location of the file. IPFS thus provides an abstraction over the actual location of the file, so the physical location does not matter to the application. This abstraction removes complexity for application developers.
Nodes host the file on the network. IPFS itself does not pay nodes to do so; a digital asset like Filecoin, which runs as a separate network built on IPFS technology, can give nodes an incentive for providing storage space on their computer or server. Hosted files are identified by a hash ID, which can then be shared across the network. Other nodes can also host the same file, allowing many copies of it to exist. Users who want the file retrieve it by its hash from the nodes that have it, ideally the ones nearest to them.
All nodes that host the file reference a root hash, which is the hash ID of the file. Whenever a file request is made, the user looks up the root hash and downloads the file from the nearest node that stores it. Identical content is never stored under two different addresses on IPFS, because the same file or chunk always produces the same hash.
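The "no duplicates" property follows directly from using the hash as the key. The sketch below is a toy in-memory store, not the IPFS API; the class name `ContentStore` and its methods are made up for illustration, and plain SHA-256 stands in for real content identifiers.

```javascript
const crypto = require('crypto');

// A toy content-addressed store: the key is always the SHA-256 of the value.
class ContentStore {
  constructor() {
    this.blocks = new Map();
  }

  // Store content and return its hash; identical content maps to the same key.
  put(content) {
    const bytes = Buffer.from(content);
    const hash = crypto.createHash('sha256').update(bytes).digest('hex');
    if (!this.blocks.has(hash)) {
      this.blocks.set(hash, bytes);
    }
    return hash;
  }

  // Fetch content by hash, or undefined if this store does not hold it.
  get(hash) {
    return this.blocks.get(hash);
  }
}

const store = new ContentStore();
const first = store.put('hello ipfs');
const second = store.put('hello ipfs'); // same bytes, uploaded again
const other = store.put('hello ipfs!'); // one character different

console.log(first === second);  // same content gives the same hash
console.log(store.blocks.size); // number of distinct entries held
```

Uploading the same bytes twice does not create a second entry, while changing a single character produces a completely different address.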
Once a file is added to IPFS, it remains available on a node until it is unpinned and a garbage collection routine is run. Different nodes can point to the file by its hash, and different nodes can also host the file as long as the hash points to it. A link or name can be updated to point to a different hash, but anyone with the original hash can still access that data, provided at least one node is hosting it now or at any time in the future.
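The relationship between pinning and garbage collection can be modeled in a few lines. This is a simplified sketch of one node's local repository, with hypothetical names (`LocalRepo`, `add`, `unpin`, `gc`); on a real node the corresponding CLI commands are `ipfs pin rm` and `ipfs repo gc`.

```javascript
const crypto = require('crypto');

// Toy model of one node's local repository (names are illustrative only).
class LocalRepo {
  constructor() {
    this.blocks = new Map(); // hash -> content
    this.pins = new Set();   // hashes protected from garbage collection
  }

  // Add content; by default it is pinned so garbage collection keeps it.
  add(content, { pin = true } = {}) {
    const hash = crypto.createHash('sha256').update(content).digest('hex');
    this.blocks.set(hash, content);
    if (pin) this.pins.add(hash);
    return hash;
  }

  // Unpinning only removes the protection; the data stays until gc() runs.
  unpin(hash) {
    this.pins.delete(hash);
  }

  // Remove every block that is not pinned and return the removed hashes.
  gc() {
    const removed = [];
    for (const hash of [...this.blocks.keys()]) {
      if (!this.pins.has(hash)) {
        this.blocks.delete(hash);
        removed.push(hash);
      }
    }
    return removed;
  }
}

const repo = new LocalRepo();
const kept = repo.add('keep me');
const dropped = repo.add('drop me');

repo.unpin(dropped);
const removedHashes = repo.gc();
console.log(repo.blocks.has(kept), repo.blocks.has(dropped));
```

Note that this only describes a single node. The content disappears from the network only when every node holding it has dropped it.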
Storage Addressing Schemes
What sets IPFS apart from typical cloud storage systems on the Internet is that it is content-based (content addressed) and not location-based (location addressed). An example of a location addressed system is HTTP. When a storage system is based on location, a file is found by identifying the server that holds it, using a DNS server to resolve the host name. DNS maps a user friendly name to a logical address (e.g. an IP address). If the host changes its name or address, the entry in the name service table must also be updated.
Content-based addressing uses the content itself to get data from the network. This requires a content identifier that is derived from the data and is independent of the physical location of the file. The data is accessed by its cryptographic hash rather than a logical address, much like a digital fingerprint of the file. The network will always return the same content for that hash regardless of who uploaded the file, or where and when it was uploaded.
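One practical consequence of addressing by hash is that the requester can check any reply against the address it asked for, no matter which node answered. A minimal sketch, assuming plain SHA-256 as the fingerprint and two hypothetical replies:

```javascript
const crypto = require('crypto');

// Hex-encoded SHA-256 digest, used here as the content's "fingerprint".
function sha256Hex(bytes) {
  return crypto.createHash('sha256').update(bytes).digest('hex');
}

// Accept content only if it hashes to the address that was requested.
function verifyContent(requestedHash, receivedBytes) {
  return sha256Hex(receivedBytes) === requestedHash;
}

const published = Buffer.from('diploma #1234');
const address = sha256Hex(published);

// Two hypothetical peers answer the same request.
const honestReply = Buffer.from('diploma #1234');
const tamperedReply = Buffer.from('diploma #9999');

console.log(verifyContent(address, honestReply));
console.log(verifyContent(address, tamperedReply));
```

With location addressing, by contrast, the client has to trust whatever the server at that address returns.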
When it comes to speed and reliability, IPFS can perform better than HTTP in some cases. Rather than relying on a single server location to get a file, a content addressed storage system can provide the file from any of several hosts (e.g. peers or nodes on the IPFS network) that are nearest to the user. In other words, a user can request a file without having to reference its location, i.e. the server name or address. Instead they reference it by the file’s hash, and it is served from the nearest available nodes on the network.
Installing IPFS
There are two common options for installing an IPFS node.
- IPFS Desktop — Host and share files directly from a computer (laptop or desktop PC). The IPFS Companion browser extension can be installed to allow access to a local node from a web browser. This is the type of installation used for peer-to-peer file sharing.
- IPFS Cluster — For hosting and sharing files at scale, the cluster orchestrates and coordinates pinsets across a swarm of IPFS nodes. This allows a large scale file storage system to be built from distributed nodes.
After installing IPFS Desktop, configure the node by initializing the repository. The following commands are typed in Windows PowerShell or a Mac/Linux terminal.
ipfs init
> initializing ipfs node at /Users/<username>/.go-ipfs
> generating 2048-bit RSA keypair...done
> peer identity: Qmcpo2iLBikrdf1d6QU6vXuNb6P7hwrbNPW9kLAH8eG67z
> to get started, enter:
>
This initialization is performed when using IPFS for the first time. The next step is to run the IPFS daemon process to join the node to the network.
ipfs daemon
> Initializing daemon...
> API server listening on /ip4/127.0.0.1/tcp/5001
> Gateway server listening on /ip4/127.0.0.1/tcp/8080
This initializes and runs the daemon process on local machine 127.0.0.1. It launches an API Server that listens on TCP port 5001 and a Gateway Server on TCP port 8080. Now you should be able to see other IPFS nodes on the network by issuing the swarm command. It should look like the following:
ipfs swarm peers
> /ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ
> /ip4/104.236.151.122/tcp/4001/p2p/QmSoLju6m7xTh3DuokvT3886QRYqxAzb1kShaanJgW36yx
> /ip4/134.121.64.93/tcp/1035/p2p/QmWHyrPWQnsz1wxHR219ooJDYTvxJPyZuDUPSDpdsAovN5
> /ip4/178.62.8.190/tcp/4002/p2p/QmdXzZ25cyzSF99csCQmmPZ1NTbWTe8qtKFaZKpZQPdTFB
As explained in the IPFS documentation, the peers take the format:
<transport address>/p2p/<hash-of-public-key>
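As an illustration of that format, the following sketch splits one of the peer addresses above into its parts. It is plain string handling that assumes the simple `/ip4/<ip>/tcp/<port>/p2p/<id>` shape seen in the sample output; the function name `parsePeerAddress` is made up for this example, and a real application would more likely use a dedicated multiaddr library.

```javascript
// Split a peer address of the form <transport address>/p2p/<hash-of-public-key>.
function parsePeerAddress(addr) {
  const marker = '/p2p/';
  const idx = addr.lastIndexOf(marker);
  if (idx === -1) {
    throw new Error(`not a peer address: ${addr}`);
  }
  const transport = addr.slice(0, idx);
  const peerId = addr.slice(idx + marker.length);
  // The transport part alternates protocol names and values: ip4/<ip>/tcp/<port>.
  const [, ipProto, host, l4Proto, port] = transport.split('/');
  return { transport, ipProto, host, l4Proto, port: Number(port), peerId };
}

const peer = parsePeerAddress(
  '/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ'
);
console.log(peer);
```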
Here is an example command to get a file on the network:
ipfs cat /ipfs/QmW2WQi7j6c7UgJTarActp7tDNikE4B2qXtFCfLPdsgaTQ/cat.jpg > cat.jpg
open cat.jpg
This fetches the object named ‘cat.jpg’ under the given hash, saves it to a local file, and then opens it (the open command is specific to macOS).
JavaScript With IPFS
The following is test code for writing data to the IPFS network, run on RunKit with an npm package and the Infura gateway (free to the public).
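const IPFS = require('ipfs-mini'); // version 1.1.5 was used here
// Point the client at the Infura IPFS endpoint over HTTPS.
const ipfs = new IPFS({host: 'ipfs.infura.io', port: 5001, protocol: 'https'});
const data = "Writing a test message on the network";
// Add the string to IPFS; the callback receives an error or the content hash.
ipfs.add(data, (err, hash) => {
  if (err) {
    return console.log(err);
  }
  // Print a gateway URL where the stored data can be viewed.
  console.log('https://ipfs.infura.io/ipfs/' + hash);
});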
In this code, I load the ‘ipfs-mini’ package in Node.js using the require function. I then configure access to the Infura IPFS gateway ‘ipfs.infura.io’ and set the data to the string “Writing a test message on the network”. The callback logs the error if there is a problem; otherwise it takes the returned hash and logs the gateway URL plus the hash value.
The result will return the unique hash: QmQhadgstSRUv7aYnN25kwRBWtxP1gB9Kowdeim32uf8Td
I can now input the URL link: https://ipfs.infura.io/ipfs/QmQhadgstSRUv7aYnN25kwRBWtxP1gB9Kowdeim32uf8Td
This will show the data I just put on the Infura gateway. Data is not persistent and will be removed after a few days or weeks of inactivity. For persistent data storage, a dedicated server is required either on premise or on a cloud.
The Pros Of IPFS
Decentralization — Files are stored across a network of nodes and referenced by hashes. Nodes can be incentivized to host files through Filecoin.
Fault Tolerance — If one node fails, the file is still available so long as there are nodes hosting the file. There is no single point of failure.
Scalability — The more nodes there are hosting files, the faster and more available those files become to users on the network.
Persistent Storage — The main point of IPFS is storage of data: as long as the objects corresponding to the original data and any new versions remain accessible, the entire file history can be retrieved. Because data blocks are stored locally across the network and can be cached indefinitely, IPFS objects can be kept permanently without being modified, provided some node keeps hosting them.
Censorship Resistance — Once content has been uploaded to IPFS, no central authority can remove it because it is distributed across an entire network. Removing it from only one node does not delete the file entirely. It means there are still copies of it available on other nodes.
The Cons of IPFS
Not User Friendly
The way files are indexed on an IPFS network is not very user friendly. For example, accessing content by its hash-based identifier requires typing an address like:
ipfs.io/ipns/QmeQe5FTgMs8PNspzTQ3LRz1iMhdq9K34TQnsCP2jqt8wV
Developers can share files using links, but this can become a tedious and time consuming process. IPFS uses IPNS (InterPlanetary Name System) to give content a stable name that can be resolved to its current hash. IPNS aims to make name resolution more user friendly, much as DNS does on the Internet.
There is a GUI and a browser extension, IPFS Companion, that users can use for easier access. However, it is still not as user friendly or simple to use as a regular smartphone app, since the learning curve is steeper. It is not as simple as clicking buttons on a web page. Users have to know how IPFS works to be able to use it.
Data Privacy and Compliance
It is not best practice to put customer data such as Personally Identifiable Information (PII), including KYC records, on a public shared storage system using IPFS. First, it violates storage compliance rules which state that KYC data cannot and should not be exposed on a public cloud or shared storage space, and that would include IPFS. Being on a public network gives the organization less control over managing the data. Financial institutions are strictly required to keep the data and its backups on a regulated, non-public storage system. Another issue is that since it is a public network, any node can host the KYC data. That is a further violation of laws that strictly govern who can store the data and where.
The second problem is that all nodes would have to comply with the rules and regulations of financial systems, meaning they must have backups, strong security, fault tolerance, etc. On a public network, the nodes are arbitrary and cannot be made to comply, because they have no obligation to your system. They can also make the KYC data available to other users on the network, so bad actors can obtain it even if it is encrypted. Once they hold a copy, they can attempt to decrypt it on their own.
Data Inconsistency
On IPFS there is also little incentive for nodes to maintain long term backups of data on the network. Nodes can choose to clear cached data to save space, meaning theoretically files can end up ‘disappearing’ over time if there are no remaining nodes hosting the data. At current adoption levels this isn’t a significant issue but in the long term, backing up large amounts of data requires strong economic incentives.
The problem here is that if a company uses the public IPFS network for file storage, the nodes can at any time choose to stop hosting the file. If all nodes decide to do this, there is no way to keep the file on the network unless IPFS is hosted on a private network. In practice, if the file you added to the IPFS network is not accessed by many people, it fades away as nodes clear their caches. Your data needs to be popular on the network to remain available. If you never want your data to fade out from the IPFS network, you must pin it on your own node. Pinning ensures that at least your node keeps that data.
Since IPFS is decentralized, every hosting node has its own copy of the file you uploaded. Normally, files are removed if they are not active or often used. This can be a contentious issue, because some files are archived and rarely used while others need to be deleted immediately. When data that is already stored on IPFS changes, its hash must also change. If there is a new version, you have to upload it, but it doesn’t overwrite the older version. Existing links keep pointing to the original, which remains unchanged, so you need to create a new link for the new version.
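The effect of an update on links can be sketched as follows. The `names` table and `publish` function are hypothetical and only imitate the idea of a mutable name pointing at the latest hash, similar in spirit to the IPNS lookup mentioned earlier; plain SHA-256 stands in for real content identifiers.

```javascript
const crypto = require('crypto');

// Hex-encoded SHA-256 of a string, standing in for a content address.
const hashOf = (text) => crypto.createHash('sha256').update(text).digest('hex');

// Hypothetical mutable name table: a stable name points at the latest hash.
const names = new Map();

// Hash the content and repoint the name at it; older hashes are untouched.
function publish(name, content) {
  const hash = hashOf(content);
  names.set(name, hash);
  return hash;
}

const v1 = publish('report', 'quarterly report, draft');
const v2 = publish('report', 'quarterly report, final');

console.log(v1 === v2);                  // whether the two versions share a hash
console.log(names.get('report') === v2); // whether the name resolves to the new version
```

Anyone who saved the first hash still gets the draft, which is why links based on raw hashes have to be reissued after every change.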
This can be a challenge when updating KYC data, which includes passports and driver’s licenses. When those documents expire, a new version must be uploaded to replace the old one. IPFS supports versioning, but it becomes tricky on a public network because many versions can exist on different nodes. The old version is not automatically updated. It must either be archived or destroyed, and IPFS cannot archive the file in the same way as AWS or Azure.
IPFS does support version control. The Merkle DAG structure of IPFS allows you to build a distributed version control system (VCS). The most popular example of a Merkle DAG in version control is Git, the system behind GitHub, which allows developers to collaborate on projects simultaneously. Files in a Git repository are stored and versioned using a Merkle DAG. It allows users to independently duplicate and edit multiple versions of a file, store these versions and later merge edits with the original file. However, on IPFS this is still largely theoretical according to many developers, and not yet a fully tested and proven technology (as of this writing). Implementing it would require more time and development cost, although that could pay off in the long run.
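To show how hash links can carry a version history, here is a minimal sketch of a linear chain of versions, where each version stores the hash of its parent. It is a simplification of a Merkle DAG (no branching or merging), and the `commit`, `save` and `history` helpers are illustrative names rather than any IPFS or Git API.

```javascript
const crypto = require('crypto');

// Each version records its content and the hash of its parent version.
// Hashing both together means a version's ID commits to its whole history.
function commit(content, parent = null) {
  const body = JSON.stringify({ content, parent });
  const id = crypto.createHash('sha256').update(body).digest('hex');
  return { id, content, parent };
}

// In-memory object store keyed by version ID.
const objects = new Map();
function save(version) {
  objects.set(version.id, version);
  return version;
}

const vA = save(commit('passport scan 2015'));
const vB = save(commit('passport scan 2025', vA.id));

// Walk back from the newest version to the first one, following parent links.
function history(id) {
  const out = [];
  for (let cur = objects.get(id); cur; cur = objects.get(cur.parent)) {
    out.push(cur.content);
  }
  return out;
}

console.log(history(vB.id));
```

Because the parent hash is part of what gets hashed, changing an old version would change the ID of every version built on top of it.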
Synopsis
IPFS is better suited to permanent data storage, such as digital music, works of art and accreditations (e.g. certificates, diplomas, awards, donations). These are data types that don’t need to change, so putting them on a hash-addressed storage system makes more sense. It gives the creator or artist a digital proof that cannot be altered by anyone and that is verifiable through a hash that is unique to one item.
IPFS storage is also public by default, so confidentiality of data cannot be assumed. This could be a violation for certain types of data storage where private data is exposed (e.g. under GDPR rules). There are also scalable solutions for data storage on the AWS and Azure clouds that meet privacy, security and compliance requirements. In my opinion, there is no need to distribute personal information the way IPFS stores data. For content that is made publicly available, IPFS can provide a proof of authenticity for that content. This can help secure a creator’s work, such as art, by supporting their claim of ownership, which then allows them to collect royalties and prevent others from taking credit for something they did not create.
IPFS appears to deliver fast, secure and fault tolerant file storage for content. However, it may not be suitable for financial and personal data that requires strict regulation of how the data is stored and protected. It is also not recommended for files that change frequently, like active log files that continuously record data. Storing data on a blockchain serves a different purpose, and IPFS complements it by holding the content itself. As IPFS evolves, it could gain a privacy layer that hides personal data and encrypts it at rest, so that nothing confidential is exposed.