How to gently break Ceph cluster

It’s not an abusive relationship, it’s a rough foreplay!

George Shuklin
OpsOps
1 min readJan 21, 2021

--

A good integration testing must test every integration point. One of those points is metrics pulled by an external system. In my case it was ceph_health_status to display in a management software.

It has three possible values:

  • 0 — booooring
  • 1 — funny
  • 2 — party hard!

And, of course, I need to test them. How to make cluster to become HEALTH_ERR quickly, reversibly, and reliably and, preferably, without messing with real OSDs?

After some tries I found a way to cause a lot of screaming without causing any damage (without actual data movement or service restarts).

HEALTH_WARN

Any pool with size=1 causes Ceph to raise a health warning. Even if this pool is empty

And you get

To reverse it just, set back a good pool size. It’s super fast for an empty pool.

HEALTH_ERR

That one was harder to find. Most ‘bad’ scenarios you can imagine are just warnings for Ceph.

Except…

Which annoys Ceph a lot.

Reversal is simple: set it back to 0.9:

Bonus: Breaking Ceph completely but gently

Boom! And your cluster looses quorum forever (until you fix it manually). Even ceph command stops to work.

--

--

George Shuklin
OpsOps

I work at Servers.com, most of my stories are about Ansible, Ceph, Python, Openstack and Linux. My hobby is Rust.