How to gently break a Ceph cluster
It’s not an abusive relationship, it’s rough foreplay!
Good integration testing must cover every integration point. One of those points is the metrics pulled by an external system; in my case it was ceph_health_status, displayed in a management application.
It has three possible values:
- 0 — HEALTH_OK, booooring
- 1 — HEALTH_WARN, funny
- 2 — HEALTH_ERR, party hard!
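For context, here is a minimal sketch of how a scraper might map that number back to a health state. Only the metric name comes from this post; the sample exposition text and the mapping are assumptions based on the 0/1/2 values listed above.

```shell
# Hypothetical exporter output; the surrounding exposition lines are illustrative.
sample='# HELP ceph_health_status Cluster health status
# TYPE ceph_health_status untyped
ceph_health_status 1'

status=$(printf '%s\n' "$sample" | awk '$1 == "ceph_health_status" { print $2 }')
case "$status" in
  0) echo "HEALTH_OK"   ;;  # booooring
  1) echo "HEALTH_WARN" ;;  # funny
  2) echo "HEALTH_ERR"  ;;  # party hard!
esac
```

To test all three values end to end, the cluster itself has to be pushed into each state, which is what the rest of this post is about.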
And, of course, I need to test all of them. How do you make a cluster go HEALTH_ERR quickly, reversibly, reliably and, preferably, without messing with real OSDs?
After a few tries I found a way to cause a lot of screaming without doing any damage (no actual data movement, no service restarts).
HEALTH_WARN
Any pool with size=1 makes Ceph raise a health warning, even if the pool is empty:
ceph osd pool set mypool size 1
And you get
HEALTH_WARN: 1 pool(s) have no replicas configured (POOL_NO_REDUNDANCY)
To reverse it, just set the pool size back to a sane value. It’s super fast for an empty pool.
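The whole warn-and-revert cycle can be sketched as below. The pool name mypool and the restored size of 3 are assumptions (3 is a common replica count), and a stub ceph function stands in for the real binary so the sketch is harmless to run anywhere; delete the stub to hit a real cluster.

```shell
# Stub so this runs without a live cluster; remove it on a real one.
ceph() { echo "ceph $*"; }

# Trip POOL_NO_REDUNDANCY: an empty pool with a single replica.
ceph osd pool set mypool size 1

# ...let the external system scrape ceph_health_status == 1 here...

# Revert: restore a normal replica count (adjust 3 to your pool's real size).
ceph osd pool set mypool size 3
```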
HEALTH_ERR
That one was harder to find. Most ‘bad’ scenarios you can imagine are just warnings for Ceph.
Except…
ceph osd set-full-ratio 0.0
Which annoys Ceph a lot.
HEALTH_ERR: full ratio(s) out of order
Reversal is simple: set it back to 0.9:
ceph osd set-full-ratio 0.9
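The same pattern works for the error state. Again, a stub ceph keeps the sketch self-contained; 0.9 mirrors the post, though the upstream default full ratio is 0.95, so check what your cluster started with before reverting.

```shell
# Stub so this runs without a live cluster; remove it on a real one.
ceph() { echo "ceph $*"; }

# Trip HEALTH_ERR: a full ratio of 0.0 is "out of order" because it
# drops below the nearfull and backfillfull ratios.
ceph osd set-full-ratio 0.0

# ...let the external system scrape ceph_health_status == 2 here...

# Revert (0.9 as in the post; the upstream default is 0.95).
ceph osd set-full-ratio 0.9
```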
Bonus: Breaking Ceph completely but gently
ceph mon add bad1 30.0.0.1
ceph mon add bad2 30.0.0.2
ceph mon add bad3 30.0.0.3
Boom! Your cluster loses quorum forever (until you fix the monmap manually). Even the ceph
command stops working.