Setting up a Proxmox VE cluster with Ceph shared storage

Pavel Ladyzhenskyi
5 min read · May 11, 2019


Sooner or later there comes a time when you have to start thinking about the resiliency and high availability of your services. This article is about how to configure a Proxmox HA cluster on 3 nodes with shared storage, so that VMs can be live-migrated between nodes. We thought a lot about what to choose as the shared storage (the choice was between Ceph and Gluster) and finally settled on Ceph. Here and here you can find pros and cons of the two storage systems, so feel free to pick the one that fits your case.

Creating a cluster

List of IP addresses and DNS names used in our setup:

192.168.25.61 machine1
192.168.25.62 machine2
192.168.25.63 machine3

First of all, we need to set up 3 Proxmox nodes. For that we can use the Proxmox ISO from the official site or install it from the repository on a fresh Debian.

For the install from the repository I recommend following the official guide.

When the installation is completed, update your system:

sudo apt-get update && sudo apt-get dist-upgrade

After that we edit /etc/hosts on each node for convenience:

127.0.0.1 localhost.localdomain localhost
192.168.25.61 machine1.local machine1 pvelocalhost
192.168.25.62 machine2.local machine2
192.168.25.63 machine3.local machine3

Check via ping that each node can reach the others.
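
For example, from machine1, using the names we just added to /etc/hosts:

ping -c 3 machine2.local
ping -c 3 machine3.local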

Add the non-subscription repository for Proxmox:

echo "deb http://download.proxmox.com/debian/pve stretch pve-no-subscription" > /etc/apt/sources.list

Now we are ready to create a cluster. On the node that will act as a master node, enter the command

pvecm create <clustername>
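
For example (the cluster name here is arbitrary, pick your own):

pvecm create pve-cluster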

Add the remaining nodes to the cluster by running the following on each of them:

pvecm add <masternode IP or DNS name>
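
Assuming machine1 is the master node, on machine2 and machine3 that would be, for example:

pvecm add 192.168.25.61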

Check that all of the nodes are in the cluster:

pvecm status
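
If you just want to see which nodes have joined, pvecm nodes prints a compact member list as well:

pvecm nodes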

If everything has gone well, we can access the web GUI and manage all nodes from one browser window. This works regardless of which node you are logged into (8006 is the default port for the Proxmox web GUI).
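
With the addresses from this setup that would be, for example:

https://192.168.25.61:8006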

Configure Ceph

Let's configure Ceph storage. For that I recommend using a separate network for the VMs and a dedicated network for Ceph (a 10 Gb NIC would be nice, especially if you want to use SSDs).

For live migration of VMs between nodes you may need to create a VLAN so the VMs can see each other even if they are on different nodes.
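
One way to do that (just a sketch for PVE 5.x, assuming a physical NIC named eno1 and the addressing used above) is to make the Proxmox bridge VLAN-aware in /etc/network/interfaces and then set the VLAN tag on each VM's network device:

auto vmbr0
iface vmbr0 inet static
        address 192.168.25.61
        netmask 255.255.255.0
        bridge_ports eno1
        bridge_stp off
        bridge_fd 0
        bridge_vlan_aware yes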

Make sure that you allow ports 6789 and 6800–7100 in your firewall: Ceph monitors listen on port 6789 by default, and the other Ceph daemons bind to ports within the 6800–7100 range.
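
As a sketch with plain iptables, using the Ceph network that is configured later in this article (10.10.10.0/24); adapt it to whatever firewall tooling you actually use:

# allow Ceph monitors (6789) and the daemon port range (6800-7100) from the Ceph network
iptables -A INPUT -p tcp -s 10.10.10.0/24 --dport 6789 -j ACCEPT
iptables -A INPUT -p tcp -s 10.10.10.0/24 --dport 6800:7100 -j ACCEPT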

Install Ceph on all nodes:

pveceph install --version luminous

Initialize Ceph only on the master node (change 10.10.10.0/24 to your CIDR block):

pveceph init --network 10.10.10.0/24

Create a monitor; repeat this on each node:

pveceph createmon

After creating the Ceph monitors we can use the GUI for the remaining steps.

Create OSDs via the web GUI. Do this on each node in the cluster.
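
If you prefer the command line, the same can be done with pveceph; the device name below is only an example, use your actual disk:

pveceph createosd /dev/sdb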

By the way, that's basically all: you can create Ceph storage pools using the web GUI and you will be fine, buuuut…

By default, when you create a storage pool, it tries to allocate all discovered OSDs. More often than not a Ceph cluster consists of several nodes with multiple disk drives, and these drives can be of mixed types. We will create a pool named ssd-pool backed by SSD disks, and another pool named sata-pool, which is backed by SATA disks.

In this case osd.0, osd.1 and osd.2 are SSD disks. Similarly, osd.3, osd.4, osd.5, osd.6, osd.7 and osd.8 are SATA disks.

1. Get the current CRUSH map and decompile it:
ceph osd getcrushmap -o crushmapdump
crushtool -d crushmapdump -o crushmapdump-decompiled

2. Edit the crushmapdump-decompiled CRUSH map file and add the following section after the root default section:

root ssd {
    id -20
    alg straw
    hash 0
    item osd.0 weight 0.010
    item osd.1 weight 0.010
    item osd.2 weight 0.010
}
root sata {
    id -21
    alg straw
    hash 0
    item osd.3 weight 0.010
    item osd.4 weight 0.010
    item osd.5 weight 0.010
    item osd.6 weight 0.010
    item osd.7 weight 0.010
    item osd.8 weight 0.010
}

3. Create the CRUSH rules by adding the following under the rules section of the CRUSH map, then save and exit the file:

rule ssd-pool {
    ruleset 1
    type replicated
    min_size 2
    max_size 10
    step take ssd
    step chooseleaf firstn 0 type osd
    step emit
}
rule sata-pool {
    ruleset 2
    type replicated
    min_size 2
    max_size 10
    step take sata
    step chooseleaf firstn 0 type osd
    step emit
}

4. Compile and inject the new CRUSH map in the Ceph cluster:

crushtool -c crushmapdump-decompiled -o crushmapdump-compiled
ceph osd setcrushmap -i crushmapdump-compiled

5. Check the OSD tree view for the new arrangement, and notice the ssd and sata root buckets:

ceph osd tree

6. Create and verify the ssd-pool.

ceph osd pool create ssd-pool 128

128 is the pg_num; you can use this calculator to work out the number of placement groups you need for your Ceph cluster.
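
As a rough sanity check, a common rule of thumb is (number of OSDs × 100) / replica count, rounded up to the nearest power of two. For the ssd-pool with its 3 OSDs and the default replication size of 3 that gives (3 × 100) / 3 = 100, which rounds up to 128.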

Verify the ssd-pool and notice that crush_rule is 0, which is the default:

ceph osd dump | grep -i ssd

Let's change the crush_rule so that the new pool is created on the SSD disks:

ceph osd pool set ssd-pool crush_rule ssd-pool

Verify the pool and notice the change in crush_rule:

ceph osd dump | grep -i ssd

7. Similarly, create and verify sata-pool.

ceph osd pool create sata-pool 128
ceph osd dump | grep -i sata
ceph osd pool set sata-pool crush_rule sata-pool
ceph osd dump | grep -i sata

8. Let's check that everything works as expected.

Since these pools are new, they should not contain any objects, but let’s verify this by using the rados list command:

rados -p ssd-pool ls
rados -p sata-pool ls

Now we add an object to these pools using the rados put command. The syntax should be:

rados -p <pool_name> put <object_name> <file_name>

rados -p ssd-pool put dummy_object1 /etc/hosts
rados -p sata-pool put dummy_object1 /etc/hosts

Using the rados list command, list these pools. You should get the object names that we stored in the last step:

rados -p ssd-pool ls
rados -p sata-pool ls

9. Verify that the objects are getting stored on the correct set of OSDs. Check the osd map for ssd-pool using the syntax:

ceph osd map <pool_name> <object_name>

ceph osd map ssd-pool dummy_object1

You should get output with the IDs of the OSDs where dummy_object1 is actually stored. In our case, the object created on ssd-pool is stored on the OSD set [0,2,1]. This output was expected, and it verifies that the pool we created uses the correct set of OSDs, as we requested.
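
Once the placement has been verified, the test objects can be removed again if you like:

rados -p ssd-pool rm dummy_object1
rados -p sata-pool rm dummy_object1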

To sum it up, this is a complete solution that greatly increases the fault tolerance of your virtualization platform without much effort. On top of that, Ceph provides superb reliability and scalability, allowing us to increase the capacity of our storage by hot-adding OSDs.

If you found this article useful and you liked it, feel free to click and hold the clap button! :)
