How to gently break a Ceph cluster
It’s not an abusive relationship, it’s rough foreplay!
Good integration testing must cover every integration point. One of those points is the metrics pulled by an external system; in my case it was ceph_health_status, displayed in a management application.
It has three possible values:
- 0 — HEALTH_OK, booooring
- 1 — HEALTH_WARN, funny
- 2 — HEALTH_ERR, party hard!
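For context, here is a minimal sketch of how a scraper might map that number back to a health state. Only the metric name comes from this post; the sample exposition text and the mapping are assumptions based on the 0/1/2 values listed above.

```shell
# Hypothetical exporter output; the surrounding exposition lines are illustrative.
sample='# HELP ceph_health_status Cluster health status
# TYPE ceph_health_status untyped
ceph_health_status 1'

status=$(printf '%s\n' "$sample" | awk '$1 == "ceph_health_status" { print $2 }')
case "$status" in
  0) echo "HEALTH_OK"   ;;  # booooring
  1) echo "HEALTH_WARN" ;;  # funny
  2) echo "HEALTH_ERR"  ;;  # party hard!
esac
```

To test all three values end to end, the cluster itself has to be pushed into each state, which is what the rest of this post is about.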
And, of course, I need to test all of them. How do you make a cluster go HEALTH_ERR quickly, reversibly, reliably and, preferably, without messing with real OSDs?
After a few tries I found a way to cause a lot of screaming without doing any damage (no actual data movement, no service restarts).
HEALTH_WARN
Any pool with size=1 makes Ceph raise a health warning, even if the pool is empty:
ceph osd pool set mypool size 1
And you get
HEALTH_WARN: 1 pool(s) have no replicas configured (POOL_NO_REDUNDANCY)
To reverse it, just set the pool size back to a sane value. It’s super fast for an empty pool.
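The whole warn-and-revert cycle can be sketched as below. The pool name mypool and the restored size of 3 are assumptions (3 is a common replica count), and a stub ceph function stands in for the real binary so the sketch is harmless to run anywhere; delete the stub to hit a real cluster.

```shell
# Stub so this runs without a live cluster; remove it on a real one.
ceph() { echo "ceph $*"; }

# Trip POOL_NO_REDUNDANCY: an empty pool with a single replica.
ceph osd pool set mypool size 1

# ...let the external system scrape ceph_health_status == 1 here...

# Revert: restore a normal replica count (adjust 3 to your pool's real size).
ceph osd pool set mypool size 3
```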
HEALTH_ERR
That one was harder to find. Most ‘bad’ scenarios you can imagine are just warnings for Ceph.
Except…
ceph osd set-full-ratio 0.0
Which annoys Ceph a lot.
HEALTH_ERR: full ratio(s) out of order
Reversal is simple: set it back to 0.9:
ceph osd set-full-ratio 0.9
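The same pattern works for the error state. Again, a stub ceph keeps the sketch self-contained; 0.9 mirrors the post, though the upstream default full ratio is 0.95, so check what your cluster started with before reverting.

```shell
# Stub so this runs without a live cluster; remove it on a real one.
ceph() { echo "ceph $*"; }

# Trip HEALTH_ERR: a full ratio of 0.0 is "out of order" because it
# drops below the nearfull and backfillfull ratios.
ceph osd set-full-ratio 0.0

# ...let the external system scrape ceph_health_status == 2 here...

# Revert (0.9 as in the post; the upstream default is 0.95).
ceph osd set-full-ratio 0.9
```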
Bonus: Breaking Ceph completely but gently
ceph mon add bad1 30.0.0.1
ceph mon add bad2 30.0.0.2
ceph mon add bad3 30.0.0.3
Boom! Your cluster loses quorum forever (until you fix the monmap manually). Even the ceph
command stops working.