Setting up your first distributed private storage network on IPFS: Part 2

vasa · Published in towardsblockchain
9 min read · Apr 24, 2018

Welcome back to the IPFS private network series. If you are wondering why I am welcoming you back, then you should take a look at this previous post...

If you like high-tech Web3 concepts like IPFS Cluster explained in simple words with interactive tutorials for setting up a multi-node IPFS Cluster, then head here.

Now, assuming you have read the above post, let's pick up where we left off last time.

Static cluster membership considerations

We call a cluster static when the set of cluster.peers is fixed, when "peer add"/"peer rm"/bootstrapping operations don't happen (at least not normally), and when every cluster member is expected to be running all the time.

Static clusters are a way to run ipfs-cluster in a stable fashion. Since the membership of the consensus remains unchanged, they don't suffer the dangers of dynamic peer sets, where it is important that operations modifying the peer set succeed on every cluster member.

Static clusters expect every member peer to be up and responding. Otherwise, the Leader will detect missing heartbeats and start logging errors. When a peer is not responding, ipfs-cluster will detect that the peer is down and re-allocate any content pinned by it to other peers. ipfs-cluster will keep working as long as there is a Leader, which requires that at least half of the peers plus one are still running. In the case of a network split, or if a majority of nodes is down, the cluster will be unable to commit any operations to the log and thus its functionality will be limited to read operations.
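For reference, here is a minimal shell sketch of what this looks like in practice, assuming the default ~/.ipfs-cluster/service.json location, illustrative addresses and the jq tool for reading JSON:

  # Inspect the fixed peer set of a static cluster.
  jq '.cluster.peers' ~/.ipfs-cluster/service.json
  # Hypothetical output: a fixed list of multiaddresses, one per member, e.g.
  # [
  #   "/ip4/192.168.1.10/tcp/9096/ipfs/<peer-id-1>",
  #   "/ip4/192.168.1.11/tcp/9096/ipfs/<peer-id-2>",
  #   "/ip4/192.168.1.12/tcp/9096/ipfs/<peer-id-3>"
  # ]

  # Verify that every member is up and responding.
  ipfs-cluster-ctl peers ls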

Dynamic cluster membership considerations

We call a cluster dynamic when the set of cluster.peers changes: nodes are bootstrapped to existing cluster peers (the cluster.bootstrap option), the "peer rm" operation is used, and/or the cluster.leave_on_shutdown configuration option is enabled. The latter allows a node to abandon the consensus membership when shutting down, reducing the cluster size by one.

Dynamic clusters allow greater flexibility at the cost of stability. Leave and, especially, join operations are tricky, as they change the consensus membership, and they are likely to fail in unhealthy clusters. All operations modifying the peerset require an elected and working Leader. Note that peerset modifications may also trigger pin re-allocations if any of the pins held by the departing peer end up below the replication_factor_min threshold.

Peers joining an existing cluster should not have any consensus state (contents in ./ipfs-cluster/ipfs-cluster-data). Peers leaving a cluster are not expected to re-join it with stale consensus data. For this reason, the consensus data folder is renamed when a peer leaves the current cluster. For example, ipfs-cluster-data becomes ipfs-cluster-data.old.0, and so on. Currently, up to 5 copies of the cluster data will be left around, with old.0 being the most recent and old.4 the oldest.

When a peer leaves or is removed, the remaining cluster peers are saved in its configuration as bootstrap peers, so that it is easier to re-join the cluster later by simply re-launching it. Since the state has been cleaned, the peer will be able to re-join and fetch the latest state cleanly. See "The consensus algorithm" and the "Starting your cluster peers" sections here for more information.
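As a rough sketch of the dynamic case, joining an existing cluster from a fresh peer could look like the following (hypothetical bootstrap address; newer releases start the daemon with ipfs-cluster-service daemon, so check ipfs-cluster-service --help for your version):

  # On the new peer: create a fresh configuration, with no old consensus state.
  ipfs-cluster-service init

  # Point cluster.bootstrap at a known healthy cluster peer (requires jq).
  CONFIG=~/.ipfs-cluster/service.json
  BOOTSTRAP="/ip4/192.168.1.10/tcp/9096/ipfs/<peer-id>"
  jq --arg b "$BOOTSTRAP" '.cluster.bootstrap = [$b]' "$CONFIG" > "$CONFIG.tmp" && mv "$CONFIG.tmp" "$CONFIG"

  # Start the peer: it contacts the bootstrap peer and joins the consensus.
  ipfs-cluster-service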

This does not mean that it is impossible to end up with a broken cluster membership. The best way to diagnose and fix it (a scripted sketch follows the list below) is to:

  • Select a healthy node
  • Run ipfs-cluster-ctl peers ls
  • Examine carefully the results and any errors
  • Run ipfs-cluster-ctl peers rm <peer in error> for every peer not responding
  • If the peer count differs depending on which peers respond, remove those peers too. Once they are stopped, remove their consensus data folder and bootstrap them to a healthy cluster peer. Always make sure to keep 1/2 + 1 peers alive and healthy.
  • ipfs-cluster-ctl --enc=json peers ls provides additional useful information, like the list of peers as seen by every responding peer.
  • In cases where leadership has been lost beyond recovery (meaning faulty peers cannot be removed), it is best to stop all peers and restore the state from the backup (currently a manual operation).
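A scripted sketch of the steps above, run from a healthy node (the peer ID is a placeholder):

  # List cluster members and any errors; the JSON encoding also shows each responder's own peer list.
  ipfs-cluster-ctl peers ls
  ipfs-cluster-ctl --enc=json peers ls

  # Remove every peer that is reported as not responding.
  ipfs-cluster-ctl peers rm <peer-id-in-error>

  # Confirm that the remaining peers agree on the membership again.
  ipfs-cluster-ctl peers ls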

Remember: if you have a problematic cluster peer trying to join an otherwise working cluster, the safest way is to rename the ipfs-cluster-data folder (keeping it as backup) and to set the correct bootstrap. The consensus algorithm will then resend the state from scratch.
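A minimal sketch of that clean-up on the problematic peer, assuming the default ~/.ipfs-cluster directory (adjust paths to your setup):

  # Stop the problematic peer first, then keep its stale consensus data as a backup.
  mv ~/.ipfs-cluster/ipfs-cluster-data ~/.ipfs-cluster/ipfs-cluster-data.backup

  # With cluster.bootstrap in service.json pointing at a healthy peer (see the sketch above),
  # restart the peer; the consensus algorithm will resend it the full state from scratch.
  ipfs-cluster-service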

ipfs-cluster will fail to start if cluster.peers do not match the current Raft peerset. If the current Raft peerset is correct, you can manually update cluster.peers. Otherwise, it is easier to clean and bootstrap.

Finally, note that when bootstrapping a peer to an existing cluster, the new peer must be configured with the same cluster.secret as the rest of the cluster.
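One way to do that, as a sketch assuming the default config location: copy the secret from an existing member before initializing the new peer (ipfs-cluster-service init also honours the CLUSTER_SECRET environment variable):

  # On an existing cluster member: read the shared secret.
  jq -r '.cluster.secret' ~/.ipfs-cluster/service.json

  # On the new peer: initialize with the same secret.
  export CLUSTER_SECRET=<secret value copied from the existing member>
  ipfs-cluster-service init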

Pinning an item

ipfs-cluster-ctl pin add <cid> will tell ipfs-cluster to pin (or re-pin) a CID.

This involves:

  • Deciding which peers will be allocated the CID (that is, which cluster peers will ask ipfs to pin the CID). This depends on the replication factor (min and max) and the allocation strategy (more details below).
  • Forwarding the pin request to the Raft Leader.
  • Committing the pin entry to the log.
  • At this point, a success/failure is returned to the user, but ipfs-cluster has more things to do.
  • Receiving the log update and modifying the shared state accordingly.
  • Updating the local state.
  • If the peer has been allocated the content, then:
      • Queueing the pin request and setting the pin status to PINNING.
      • Triggering a pin operation.
      • Waiting until it completes and setting the pin status to PINNED.

Errors in the first part of the process (before the entry is committed) are returned to the user and the whole operation is aborted. Errors in the second part of the process will result in pins with a status of PIN_ERROR.
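In practice the whole flow can be followed from the command line; for example, with a placeholder CID:

  # Ask the cluster to pin a CID; a success here only means the entry was committed to the log.
  ipfs-cluster-ctl pin add <cid>

  # Watch the per-peer status move from PINNING to PINNED (or to PIN_ERROR on failure).
  ipfs-cluster-ctl status <cid>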

Deciding where a CID will be pinned (which IPFS daemon will store it — receive the allocation) is a complex process. In order to decide, all available peers (those reporting valid/non-expired metrics) are sorted by the allocator component, depending on the value of their metrics. These values are provided by the configured informer. If a CID is already allocated to some peers (in the case of a re-pinning operation), those allocations are kept.

New allocations are only provided when the allocation factor (the number of healthy peers holding the CID) is below the replication_factor_min threshold. In those cases, the new allocations (along with the existing valid ones) will attempt to total as much as replication_factor_max. When the allocation factor of a CID is within the margins indicated by the replication factors, no action is taken. The value "-1" in replication_factor_min and replication_factor_max indicates a "replicate everywhere" mode, in which every peer will pin the CID.

Default replication factors are specified in the configuration, but every Pin object carries its own values in its entry in the shared state. Changing the replication factor of existing pins requires re-pinning them (it does not suffice to change the configuration). You can always check the details of a pin, including its replication factors, using ipfs-cluster-ctl pin ls <cid>. You can use ipfs-cluster-ctl pin add <cid> to re-pin at any time with different replication factors, but note that the new pin will only be committed if it differs from the existing one in the way specified in the paragraph above.

In order to check the status of a pin, use ipfs-cluster-ctl status <cid>. Retries for pins in error state can be triggered with ipfs-cluster-ctl recover <cid>.
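Putting the last few paragraphs together, inspecting and repairing a pin looks roughly like this (placeholder CID):

  # Show the pin's details in the shared state, including its replication factors and allocations.
  ipfs-cluster-ctl pin ls <cid>

  # Re-pin (for example after changing the default replication factors in the configuration).
  ipfs-cluster-ctl pin add <cid>

  # Check the status on every peer and retry any that ended up in an error state.
  ipfs-cluster-ctl status <cid>
  ipfs-cluster-ctl recover <cid>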

The reason pin (and unpin) requests are queued is that ipfs only performs one pin at a time, while any other requests hang in the meantime. All in all, pinning items which are unavailable in the network may create significant bottlenecks (this is a problem inherited from ipfs), as the pin request takes a very long time to time out. Working around this problem currently involves restarting the ipfs node.

Unpinning an item

ipfs-cluster-ctl pin rm <cid> will tell ipfs-cluster to unpin a CID.

The process is very similar to the “Pinning an item” described above. Removed pins are wiped from the shared and local states. When requesting the local status for a given CID, it will show as UNPINNED. Errors will be reflected as UNPIN_ERROR in the pin local status.
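For example, with a placeholder CID:

  # Ask the cluster to unpin a CID, then verify it is gone from the state.
  ipfs-cluster-ctl pin rm <cid>
  ipfs-cluster-ctl status <cid>    # should report UNPINNED, or UNPIN_ERROR on failure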

Cluster monitoring and pin failover

ipfs-cluster includes a basic monitoring component which gathers metrics and triggers alerts when a metric is no longer renewed. There are currently two types of metrics:

  • informer metrics are used to decide on allocations when a pin request arrives. Different "informers" can be configured. The default is the disk informer using the freespace metric.
  • a ping metric is used to signal that a peer is alive.

Every metric carries a Time-To-Live associated with it. This TTL can be configured in the informers' configuration section. The ping metric TTL is determined by cluster.monitoring_ping_interval and is equal to 2x its value.

Every ipfs-cluster peer pushes metrics to the cluster Leader regularly. This happens at TTL/2 intervals for the informer metrics and every cluster.monitoring_ping_interval for the ping metric.

When a metric for an existing cluster peer stops arriving and previous metrics have outlived their Time-To-Live, the monitoring component triggers an alert for that metric. monbasic.check_interval determines how often the monitoring component checks for expired TTLs and sends these alerts. If you wish to detect expired metrics more quickly, decrease this interval. Otherwise, increase it.
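To inspect or tune these intervals you can look at the relevant fields of service.json; a sketch using jq, where the exact key names (particularly for the informer TTL) may differ between ipfs-cluster versions:

  # Ping interval; the ping metric TTL is twice this value.
  jq '.cluster.monitoring_ping_interval' ~/.ipfs-cluster/service.json

  # How often the basic monitor checks for expired metrics and raises alerts.
  jq '.monitor.monbasic.check_interval' ~/.ipfs-cluster/service.json

  # TTL of the default disk/freespace informer metric.
  jq '.informer.disk.metric_ttl' ~/.ipfs-cluster/service.json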

ipfs-cluster will react to ping metric alerts by searching for pins allocated to the alerting peer and triggering re-pinning requests for them. These re-pinning requests may result in re-allocations if the CID's allocation factor crosses the replication_factor_min boundary. Otherwise, the current allocations are maintained.

The monitoring and failover system in cluster is very basic and requires improvements. Failover is likely not to work properly when several nodes go offline at once (especially if the current Leader is affected). Manual re-pinning can be triggered with ipfs-cluster-ctl pin add <cid>. ipfs-cluster-ctl pin ls <cid> can be used to find out the current list of peers allocated to a CID.

Using the IPFS-proxy

ipfs-cluster provides a proxy to ipfs (which by default listens on /ip4/127.0.0.1/tcp/9095). This allows ipfs-cluster to behave as if it were an ipfs node. It achieves this by intercepting the following requests:

  • /add: the proxy adds the content to the local ipfs daemon and pins the resulting hash[es] in ipfs-cluster.
  • /pin/add: the proxy pins the given CID in ipfs-cluster.
  • /pin/rm: the proxy unpins the given CID from ipfs-cluster.
  • /pin/ls: the proxy lists the pinned items in ipfs-cluster.

Responses from the proxy mimic ipfs daemon responses. This makes it possible to use ipfs-cluster with the ipfs CLI, as the following examples show:

  • ipfs --api /ip4/127.0.0.1/tcp/9095 pin add <cid>
  • ipfs --api /ip4/127.0.0.1/tcp/9095 add myfile.txt
  • ipfs --api /ip4/127.0.0.1/tcp/9095 pin rm <cid>
  • ipfs --api /ip4/127.0.0.1/tcp/9095 pin ls

Any other requests are forwarded directly to the ipfs daemon, and its responses are sent back unchanged.

Intercepted endpoints aim to mimic the format and response code from ipfs, but they may lack headers. If you encounter a problem where something works with ipfs but not with cluster, open an issue.
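Since the proxy speaks the ipfs HTTP API, you can also call it directly over HTTP; for example, assuming the usual /api/v0 path prefix of the ipfs API:

  # Add a file through the proxy: ipfs stores it and ipfs-cluster pins the resulting hash(es).
  curl -F file=@myfile.txt "http://127.0.0.1:9095/api/v0/add"

  # List the pins tracked by ipfs-cluster.
  curl "http://127.0.0.1:9095/api/v0/pin/ls"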

vasa
Entrepreneur | Co-founder, TowardsBlockChain, an MIT CIC incubated startup | SimpleAsWater, YC 19 | Speaker | https://vaibhavsaini.com