Scale your Gluster Cluster, 1 node at a time!!

One of the growing pains of any cluster is scaling it out, and any tool or company that makes scaling easy tends to do pretty well for itself. One of the reasons the open-source, software-defined storage community likes Gluster is that it provides scale-out options for storage and, to a large extent, handles them smoothly!

But regardless of all the improvements over the years, when we (I represent the Gluster developer community) have to migrate data as part of scale-out operations, it is never completely painless! User expectations and reality have always ended up with some mismatches.

In this blog, I will try to explain a few steps an admin can take to make life easier when scaling a cluster is involved! From here on, the post gets a little more technical about Gluster’s design and a few recent features that help people planning to scale their storage cluster, 1 (or N) node at a time.

Step 1: Create a volume with more bricks than the number of hosts.

A general assumption about a Gluster volume is that each peer involved should export just 1 brick. Also, in most cases, we developers recommend a homogeneous setup (i.e., machines of the same size and configuration) across all the involved nodes for better performance and supportability.

Starting with 3.12 we have the brick multiplexing feature, and the statfs enhancement (which divides the reported free space by the number of bricks sharing a backend mount), both of which make it practical to create more bricks on a node. If an admin wants high availability, a minimum of 3 nodes is required. It is also recommended to enable ‘brick-multiplexing’ when using this approach!
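
Brick multiplexing is enabled as a cluster-wide option rather than per volume; a minimal sketch of turning it on, assuming Gluster 3.12 or later where the option is named cluster.brick-multiplexing:

n1$ gluster volume set all cluster.brick-multiplexing on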

n1$ gluster peer probe n2
n1$ gluster peer probe n3
n1$ gluster volume create demo replica 3 \
      n1:/br/b1 n2:/br/b1 n3:/br/b1 \
      n1:/br/b2 n2:/br/b2 n3:/br/b2 \
      n1:/br/b3 n2:/br/b3 n3:/br/b3 \
      n1:/br/b4 n2:/br/b4 n3:/br/b4 \
      n1:/br/b5 n2:/br/b5 n3:/br/b5 \
      n1:/br/b6 n2:/br/b6 n3:/br/b6 \
      n1:/br/b7 n2:/br/b7 n3:/br/b7 \
      n1:/br/b8 n2:/br/b8 n3:/br/b8

Notice that there are a total of 24 bricks from just 3 nodes.
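
If you want to double-check the layout before starting the volume, ‘gluster volume info’ lists every brick and the replica arrangement; for the volume above, the “Number of Bricks” line should read 8 x 3 = 24:

n1$ gluster volume info demo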

Step 2: Start the volume, and consume the storage

Nothing special here. Use the volume as you would use any Gluster volume, no special treatment!
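
For completeness, a minimal sketch of starting the volume and mounting it from a client (the mount point /mnt/demo here is just an example path):

n1$ gluster volume start demo
client$ mount -t glusterfs n1:/demo /mnt/demo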

Step 3: Add a new node to the cluster, and expand your storage!

Now comes the fun part! All the trouble you took to create the volume now pays off!

When you add the new node, all you have to do is run a handful of replace-brick commands (or add-brick + remove-brick in the case of a plain distribute volume).

n1$ gluster peer probe n4
n1$ gluster volume replace-brick demo n1:/br/b2 n4:/br/b2 commit force
n1$ gluster volume replace-brick demo n1:/br/b6 n4:/br/b6 commit force
n1$ gluster volume replace-brick demo n2:/br/b3 n4:/br/b3 commit force
n1$ gluster volume replace-brick demo n2:/br/b7 n4:/br/b7 commit force
n1$ gluster volume replace-brick demo n3:/br/b4 n4:/br/b4 commit force
n1$ gluster volume replace-brick demo n3:/br/b8 n4:/br/b8 commit force

Notice that the above commands make sure that bricks from different replica sets move to the new node. So, technically, each of the original nodes goes from 8 bricks down to 6, and the new node ends up with 6 bricks as well.
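
For a plain distribute volume (no replication), the equivalent of one such move would be an add-brick followed by a remove-brick that migrates the data off the old brick. A rough sketch, reusing the hostnames and brick paths from above purely as an example (run ‘commit’ only after ‘status’ shows the migration has completed):

n1$ gluster volume add-brick demo n4:/br/b2
n1$ gluster volume remove-brick demo n1:/br/b2 start
n1$ gluster volume remove-brick demo n1:/br/b2 status
n1$ gluster volume remove-brick demo n1:/br/b2 commit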

Step 4: Happy Scaling!

With the above steps, you will notice a significant difference in the way Gluster handles scale-out. This approach doesn’t need any further rebalance operations, which solves the issue of more data than required being migrated during a Gluster scale-out operation!

All the above steps migrate data through ‘self-heal’ instead of rebalance. Make sure that the ‘healing’ process completes before adding any further nodes.
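
One way to track this is the heal info output, which lists the pending entries per brick; wait until every brick reports zero entries before moving on:

n1$ gluster volume heal demo info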

Step 5: Grow!

As and when you need to grow your storage, repeat steps 3 and 4.

Let’s take the example of adding one more machine to this volume after some time.

n1$ gluster peer probe n5
n1$ gluster volume replace-brick demo n1:/br/b3 n5:/br/b3 commit force
n1$ gluster volume replace-brick demo n2:/br/b4 n5:/br/b4 commit force
n1$ gluster volume replace-brick demo n3:/br/b5 n5:/br/b5 commit force
n1$ gluster volume replace-brick demo n4:/br/b6 n5:/br/b6 commit force

Notice that at this point the bricks are balanced as 5, 5, 5, 5, 4 across the nodes respectively. Also, this can be performed only after all the pending self-heal counts from the previous set of operations in Step 3 have come down to 0.

NOTE:

Limitations and concerns with this approach:

  • Snapshots won’t work seamlessly, as there are brick movements involved here.
  • Quota: technically, it should work fine, but this has not been validated with tests.
  • The approach is prone to manual errors while issuing the ‘replace-brick’ commands.
  • Choose the proper bricks to migrate, or else you may lose the good copy of the data.
  • With this approach, you can’t scale to an arbitrary number of machines unless you run a rebalance operation as before.
  • Notice that the ‘inode’ usage on the brick mount point will be very high in this model. Decide to use this model only after you are clear about your inode usage (see the quick check after this list).
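
A quick way to keep an eye on inode consumption is to check the backend mount on each node, assuming the bricks live under a dedicated /br filesystem as in the examples above:

n1$ df -i /br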