Apr 24, 2018 · 10 min read

IPFS Private Storage Network Series

This post marks the first in a new IPFS series I am starting, in an effort to provide easy-to-read instructions for topics whose online material I found vast and hard to get started with.

In this post and the next two, we are only going to cover the basics of IPFS Cluster: collective pinning and composition for IPFS. But if you want to get your hands dirty right away, jump to this post:

Setting up a multi-node private storage network on IPFS: Part 5

Stay tuned for the next post.

For now, let's get started.

IPFS Cluster


IPFS Cluster consists of two main applications:

ipfs-cluster-service : for starting your cluster peer.

ipfs-cluster-ctl : for interacting with the cluster peer through its built-in APIs.

The consensus algorithm:

ipfs-cluster peers coordinate their state (the list of CIDs which are pinned, their peer allocations and their replication factor) using a consensus algorithm called Raft. It is strongly recommended that you read this Raft explanation before proceeding.

Raft is used to commit log entries to a “distributed log” which every peer follows. Every “Pin” and “Unpin” request is a log entry in that log. When a peer receives a “Pin” log operation, it updates its local copy of the shared state to indicate that the CID is now pinned.
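The log-replay step can be sketched as follows. This is an illustrative model, not ipfs-cluster's actual Go implementation; the entry field names (`op`, `cid`, `repl`) are assumptions made for the example.

```python
# Illustrative model of how a peer replays the shared Raft log into
# its local copy of the shared state (a pinset).

def apply_log(entries):
    """Replay ordered log entries into a pinset: cid -> replication factor."""
    state = {}
    for entry in entries:
        op, cid = entry["op"], entry["cid"]
        if op == "pin":
            state[cid] = entry.get("repl", -1)  # pinned, with replication factor
        elif op == "unpin":
            state.pop(cid, None)                # forget the CID entirely
    return state

log = [
    {"op": "pin", "cid": "QmA", "repl": 2},
    {"op": "pin", "cid": "QmB", "repl": -1},
    {"op": "unpin", "cid": "QmA"},
]
print(apply_log(log))  # {'QmB': -1}
```

Because every peer applies the same entries in the same order, every peer ends up with the same shared state.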

In order to work, Raft elects a cluster “Leader”, which is the only peer allowed to commit entries to the log. A Leader election can only succeed if more than half of the nodes are online. Log entries, and other parts of ipfs-cluster functionality (initialization, monitoring), can only happen when a Leader exists.
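The majority requirement has a simple arithmetic consequence, sketched below: a cluster of n peers needs a strict majority (a quorum) online, so an even-sized cluster tolerates no more failures than the next smaller odd size.

```python
# Raft needs a strict majority (quorum) of peers online to elect a Leader.

def quorum(n_peers: int) -> int:
    """Smallest number of peers that forms a majority."""
    return n_peers // 2 + 1

def max_failures(n_peers: int) -> int:
    """How many peers can be down while a Leader can still be elected."""
    return n_peers - quorum(n_peers)

for n in (3, 4, 5):
    print(f"{n} peers: quorum={quorum(n)}, tolerated failures={max_failures(n)}")
# 3 peers: quorum=2, tolerated failures=1
# 4 peers: quorum=3, tolerated failures=1
# 5 peers: quorum=3, tolerated failures=2
```

Note that 4 peers tolerate only 1 failure, the same as 3 peers, which is why odd-sized clusters are the usual choice.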

For example, a commit operation to the log is triggered with ipfs-cluster-ctl pin add <cid>. This will use the peer's API to send a Pin request. The peer will in turn forward the request to the cluster's Leader, which will perform the commit of the operation. This is explained in more detail in the "Pinning an item" section of this post.

The “peer add” and “peer remove” operations also trigger log entries (internal to Raft) and likewise depend on a healthy consensus status. Modifying the cluster peerset is a tricky operation because it requires informing every peer of the new peer’s multiaddresses. If a peer is down during this operation, the operation will fail, as otherwise that peer would not know how to contact the new member. Thus, it is recommended to remove any offline peers, and have them bootstrap again later (bootstrap peers are the peers to which a new peer connects on boot), before making changes to the peerset.

By default, the consensus log data is kept in the ipfs-cluster-data subfolder, next to the main configuration file (see the next section). This folder stores two types of information: the BoltDB database holding the Raft log, and the state snapshots. Snapshots of the log are taken regularly when the log grows too big (see the “Starting your cluster peers” section below for options). When a peer is far behind in catching up with the log, Raft may opt to send a snapshot directly, rather than sending every log entry that makes up the state individually. This data is initialized on the first start of a cluster peer and maintained throughout its life. Removing or renaming the ipfs-cluster-data folder effectively resets the peer to a clean state. Only peers with a clean state should bootstrap to already-running clusters.

When running a cluster peer, it is very important that the consensus data folder does not contain data from a different cluster setup or from diverging logs: different Raft logs must not be mixed. Removing or renaming the ipfs-cluster-data folder cleans all consensus data from the peer but, as long as the rest of the cluster is running, the peer will recover the last state upon start by fetching it from a different cluster peer.

On clean shutdowns, ipfs-cluster peers will save a human-readable state snapshot in ~/.ipfs-cluster/backups, which can be used to inspect the last known state for that peer.

The configuration file

The ipfs-cluster configuration file is usually found at ~/.ipfs-cluster/service.json. It holds all the configurable options for cluster and its different components. The configuration file is divided into sections. Each section represents a component, and each item inside a section represents an implementation of that component and contains its specific options.

After installing IPFS Cluster, a default configuration file can be generated with ipfs-cluster-service init. It is recommended that you re-create the configuration file after an upgrade, to make sure that you are up to date with any new options.

The cluster section of the configuration stores a secret: a 32-byte (hex-encoded) key which must be shared by all cluster peers. Using an empty key has security implications (see the Security section here). Peers using different keys will not be able to talk to each other.
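A suitable secret is simply 32 random bytes encoded as 64 hex characters; a minimal sketch of generating one (you can paste the result into the "secret" field of service.json):

```python
# Generate a 32-byte, hex-encoded cluster secret suitable for the
# "secret" field in service.json (64 hex characters in total).
import secrets

secret = secrets.token_hex(32)  # 32 random bytes -> 64 hex chars
print(len(secret))  # 64
print(secret)
```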

Now, let's look at some sections of the configuration file. (I have skipped some sections, as the file is too big. You can see the full file with explanations here.)

cluster section (service.json)

{ // main cluster component configuration
  "cluster": {
    // peer ID
    "id": "QmZyXksFG3vmLdAnmkXreMVZvxc4sNi1u21VxbRdNa2S1b",
    // node private key
    "private_key": "<base64 representation of the key>",
    // the above-mentioned secret
    "secret": "<32-byte hex-encoded secret>",
    // List of peers' multiaddresses
    "peers": [],
    // List of bootstrap peers' multiaddresses
    "bootstrap": [],
    // Abandon cluster on shutdown
    "leave_on_shutdown": false,
    // Cluster RPC listen multiaddress
    "listen_multiaddress": "/ip4/",
    // Time between state syncs
    "state_sync_interval": "1m0s",
    // Time between ipfs-state syncs
    "ipfs_sync_interval": "2m10s",
    // Replication factor minimum threshold. -1 == all
    "replication_factor_min": -1,
    // Replication factor maximum threshold. -1 == all
    "replication_factor_max": -1,
    // Time between alive-pings. See the cluster monitoring section
    "monitor_ping_interval": "15s"
  }
}
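A few of these fields can be sanity-checked programmatically. The sketch below is illustrative, not an official schema: it only checks the constraints described above (secret length, replication factor ordering), using made-up example values.

```python
# Minimal sanity checks for the "cluster" section of service.json.
# The checks and example values are illustrative assumptions, not a schema.
import json

raw = """
{
  "cluster": {
    "secret": "%s",
    "replication_factor_min": -1,
    "replication_factor_max": -1
  }
}
""" % ("ab" * 32)  # placeholder 64-hex-char secret

cfg = json.loads(raw)["cluster"]

# The shared secret must be 32 bytes, hex-encoded (64 hex characters).
assert len(bytes.fromhex(cfg["secret"])) == 32

# -1 means "replicate everywhere"; otherwise min must not exceed max.
rmin = cfg["replication_factor_min"]
rmax = cfg["replication_factor_max"]
assert (rmin == -1 and rmax == -1) or 0 < rmin <= rmax

print("cluster section looks sane")
```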

We will talk about other sections later in this post as we explore further.

Starting your cluster peers

ipfs-cluster-service will launch your cluster peer. If you have not configured any cluster.peers in the configuration, nor any cluster.bootstrap addresses, a single-peer cluster will be launched.

When filling in peers with some other peers' listening multiaddresses (i.e. /ip4/), the peer will expect to be part of a cluster with the given peers. On boot, it will wait to learn who the leader is (see the raft.wait_for_leader_timeout option below) and sync its internal state up to the last known state before becoming ready.

consensus section (service.json)

"consensus": {
  "raft": {
    // How long to wait for a leader when there is none
    "wait_for_leader_timeout": "15s",
    // How long to wait before timing out a network operation
    "network_timeout": "10s",
    // How many retries to make before giving up on a commit failure
    "commit_retries": 1,
    // How long to wait between commit retries
    "commit_retry_delay": "200ms",
    // Here and below: Raft options
    "heartbeat_timeout": "1s",
    "election_timeout": "1s",
    "commit_timeout": "50ms",
    "max_append_entries": 64,
    "trailing_logs": 10240,
    "snapshot_interval": "2m0s",
    "snapshot_threshold": 8192,
    "leader_lease_timeout": "500ms"
  }
}
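The snapshotting behaviour described earlier ties into the last few options. A simplified sketch of the compaction policy, assuming the defaults above (the real interplay with snapshot_interval timing is more involved):

```python
# Simplified sketch of Raft log compaction: once the log exceeds
# snapshot_threshold entries, it is compacted into a snapshot and
# only the most recent trailing_logs entries are retained.

SNAPSHOT_THRESHOLD = 8192  # from "snapshot_threshold"
TRAILING_LOGS = 10240      # from "trailing_logs"

def entries_after_compaction(log_len: int) -> int:
    """Log entries retained after a compaction check."""
    if log_len <= SNAPSHOT_THRESHOLD:
        return log_len                   # log still small: no compaction
    return min(log_len, TRAILING_LOGS)   # compacted down to trailing logs

print(entries_after_compaction(100))    # 100
print(entries_after_compaction(20000))  # 10240
```

Keeping some trailing entries lets peers that are only slightly behind catch up from the log itself, while peers that are far behind receive the snapshot directly.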

If you are using the peers configuration value, then it is very important that it is the same for all cluster members: it should contain the multiaddresses of the other peers in the cluster. It may contain a peer's own multiaddress too (but that will be removed automatically). If peers is not correct for all members, your node might not start, or might misbehave in non-obvious ways (due to problems reaching consensus).

You are expected to start the majority of the nodes at the same time when using this method (for operating on multiple peers simultaneously you can use this tool). If a majority are not started, they will fail to elect a cluster leader before raft.wait_for_leader_timeout expires, and will then shut themselves down. If some peers are missing, the cluster will not be in a healthy state (error messages will be displayed), but it will operate as long as a majority of peers is up.

Alternatively, you can use the bootstrap variable to provide one or several bootstrap peers. In short, bootstrapping will use the given peer to request the list of cluster peers and fill in the peers variable automatically. The bootstrapped peer will, in turn, be added to the cluster and made known to every other existing (and connected) peer. You can also launch several peers at once, as long as they bootstrap from the same already-running peer. The --bootstrap flag lets you provide a bootstrapping peer directly when calling ipfs-cluster-service.

Use the bootstrap method only when the rest of the cluster is healthy and all current participating peers are running. If you need to, remove any unhealthy peers with ipfs-cluster-ctl peers rm <pid>. Bootstrapping peers should be in a clean state, that is, with no previous Raft data loaded. If they are not, remove or rename the ~/.ipfs-cluster/ipfs-cluster-data folder.

Once a cluster is up, peers are expected to run continuously. You may need to stop a peer, or it may die due to external reasons. The restart behaviour depends on whether the peer has left the consensus:

  • The default case — peer has not been removed and cluster.leave_on_shutdown is false: in this case the peer has not left the consensus peerset, and you may start the peer again normally. Do not manually update cluster.peers, even if other peers have been removed from the cluster.
  • The left the cluster case — peer has been manually removed or cluster.leave_on_shutdown is true: in this case, unless the peer died, it has probably been removed from the consensus (you can check if it's missing from ipfs-cluster-ctl peers ls on a running peer). This will mean that the state of the peer has been cleaned up (see the “Dynamic Cluster Membership considerations” here), and the last known cluster.peers have been moved to cluster.bootstrap. When the peer is restarted, it will attempt to rejoin the cluster from which it was removed by using the addresses in cluster.bootstrap.

Remember that a clean peer bootstrapped to an existing cluster will always fetch the latest state. A shut-down peer which did not leave the cluster will also catch up with the rest of the peers after restarting.

If the startup initialization fails, ipfs-cluster-service will exit automatically after a few seconds. Pay attention to the INFO and ERROR messages during startup. When ipfs-cluster is ready, a message will indicate it along with a list of peers.

The shared state, the local state and the ipfs state

It is important to understand that ipfs-cluster deals with three types of states:

  • The shared state is maintained by the consensus algorithm and a copy is kept in every cluster peer. The shared state stores the list of CIDs which are tracked by ipfs-cluster, their allocations (peers which are pinning them) and their replication factor.
  • The local state is maintained separately by every peer and represents the state of CIDs tracked by cluster for that specific peer: status in ipfs (pinned or not), modification time etc.
  • The ipfs state is the actual state in ipfs (ipfs pin ls) which is maintained by the ipfs daemon.

In normal operation, all three states are in sync, as updates to the shared state cascade to the local and the ipfs states. Additionally, syncing operations are regularly triggered by ipfs-cluster. Unpinning cluster-pinned items directly from ipfs will, for example, cause a mismatch between the local and the ipfs state. Luckily, there are ways to inspect every state:

  • ipfs-cluster-ctl pin ls shows information about the shared state. The result of this command is produced locally, directly from the state copy stored at the peer.
  • ipfs-cluster-ctl status shows information about the local state in every cluster peer. It does so by aggregating local state information received from every cluster member.

ipfs-cluster-ctl sync makes sure that the local state matches the ipfs state. In other words, it makes sure that what cluster expects to be pinned is actually pinned in ipfs. As mentioned, this also happens automatically. Every sync operation triggers an ipfs pin ls --type=recursive call to the local node.
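The comparison at the heart of a sync operation can be sketched like this. It is an illustrative model, not ipfs-cluster's actual code: the status strings and example CIDs are assumptions for the example.

```python
# Illustrative model of a sync: compare what cluster expects to be
# pinned (the local state) against what ipfs actually reports
# (think "ipfs pin ls --type=recursive").

def find_missing_pins(local_state, ipfs_pins):
    """Return CIDs that cluster tracks as pinned but ipfs no longer has."""
    expected = {cid for cid, status in local_state.items() if status == "pinned"}
    return sorted(expected - set(ipfs_pins))

local_state = {"QmA": "pinned", "QmB": "pinned", "QmC": "remote"}
ipfs_pins = ["QmA"]  # someone unpinned QmB directly from ipfs

print(find_missing_pins(local_state, ipfs_pins))  # ['QmB']
```

Items found this way end up flagged in the local state, making them candidates for ipfs-cluster-ctl recover.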

Depending on the size of your pinset, you may adjust the interval between the different sync operations using the cluster.state_sync_interval and cluster.ipfs_sync_interval configuration options.

As a final note, the local state may show items in error. This happens when an item took too long to pin/unpin, or the ipfs daemon became unavailable. ipfs-cluster-ctl recover <cid> can be used to rescue these items. See the "Pinning an item" section in this post for more information.




Written by


Entrepreneur | Co-founder, TowardsBlockChain, an MIT CIC incubated startup | SimpleAsWater, YC 19 | Speaker |