Recovering Ceph from “Reduced data availability: 3 pgs inactive, 3 pgs incomplete”
When your pool is stuck and you don’t know what to do.
I have a gory story to tell. It’s less gory than the situation some readers may be in, as it happened in my private laboratory with no important data whatsoever.
Nevertheless, I was able to recover every single bit of that unimportant data. Therefore, if you, the reader, have THE PRODUCTION in the same state, stay calm: MAYBE I can help you.
Close inspection
It was my benchmark pool with size=1. That means all my data was stored in a single copy. After some jerking around with reboots (and maybe this bug, but I’m not sure), I got this picture:
cluster:
id: bbc3c151-47bc-4fbb-a0-172793bd59e0
health: HEALTH_WARN
Reduced data availability: 3 pgs inactive, 3 pgs incomplete
At the same time, my IO to this pool stalled. Even rados ls
got stuck in the middle of its output, never finishing.
I evaluated PGs:
ceph pg ls incomplete

PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
2.19 0 0 0 0 0 0 1500 1500 incomplete 2018-10-29 15:40:22.301233 1033'4056498 1644:142 [2] 2 [2] 2 1033'4056498 2018-10-26 01:02:58.706233 1033'4056498 2018-10-20 21:04:09.887854
2.50 0 0 0 0 0 0 0 0 incomplete 2018-10-29 15:40:22.301294 0'0 1644:118 [2] 2 [2] 2 1033'4267614 2018-10-25 15:06:35.887580 1033'3946410 2018-10-19 13:02:17.281720
2.57 0 0 0 0 0 0 0 0 incomplete 2018-10-29 15:40:22.301368 0'0 1644:113 [2] 2 [2] 2 1033'4050786 2018-10-25 21:22:40.186352 1033'4050786 2018-10-22 05:25:13.534321
Bueh… what a mess of an output. Sorry for the width. Here is a terse version:
Each affected PG is in the incomplete state and resides on OSD.2 (see the ACTING column); each has 0 objects, and only PG 2.19 has a non-empty DISK_LOG (1500 entries).
Here should be a long and thrilling story of my unsuccessful attempts, but I’ve spent all my enthusiasm on a corporate blog entry; therefore, I’ll keep my story short and successful.
THE OBJECT COUNT MATTERS (it must be 0)
Before continuing, an important notice:
I was lucky (my random pile of useless bytes was lucky) that each of those PGs had 0 objects. 0 means ZERO. None.
I could do anything I wanted with those PGs and wouldn’t lose data.
If you have a non-zero number of objects, I have to ask you to stop reading. You have a different issue, and my way won’t do you any good.
So, OBJECTS=0
for EACH incomplete PG. This is a MUST.
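The safety check above can be scripted instead of eyeballed. Here is a minimal sketch that parses the machine-readable output of ceph pg ls incomplete -f json; the exact JSON layout (a pg_stats array with pgid, stat_sum.num_objects and acting fields) is my assumption based on recent Ceph releases, so verify it against your version before trusting the result:

```python
import json

def unsafe_incomplete_pgs(pg_ls_json: str):
    """Return (pgid, num_objects, acting) for every incomplete PG that is
    NOT empty, i.e. the PGs this recipe must NOT be applied to."""
    data = json.loads(pg_ls_json)
    # Some releases emit a bare list instead of {"pg_stats": [...]}.
    stats = data["pg_stats"] if isinstance(data, dict) else data
    return [
        (pg["pgid"], pg["stat_sum"]["num_objects"], pg["acting"])
        for pg in stats
        if pg["stat_sum"]["num_objects"] != 0
    ]

if __name__ == "__main__":
    # Sample shaped like the output in this post: all three PGs are empty.
    sample = json.dumps({"pg_stats": [
        {"pgid": "2.19", "stat_sum": {"num_objects": 0}, "acting": [2]},
        {"pgid": "2.50", "stat_sum": {"num_objects": 0}, "acting": [2]},
        {"pgid": "2.57", "stat_sum": {"num_objects": 0}, "acting": [2]},
    ]})
    print(unsafe_incomplete_pgs(sample))  # an empty list means it is safe to proceed
```

As a bonus, the acting field it returns tells you which OSDs host each PG, which you’ll need in the next step.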
Low-level OSD trickery
Next, I located the OSD where those PGs were. Let me remind you that my pool had size=1, so each data chunk was stored with no redundancy, and I had to deal with a single OSD per PG. If you have size>1, you may need to do this with more than one OSD for each PG.
I found the OSD responsible for those PGs: it’s in the ACTING field. In my case it was just a ‘[2]
’, which means OSD.2
I stopped OSD.2 (systemctl stop ceph-osd@2
), then checked the info for each PG with the low-level utility ceph-objectstore-tool
:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op info --pgid 2.19
I had two kinds of injured PGs: one with a log and two without.
I removed the PG that had log entries but no data:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op remove --pgid 2.19 --force
And I marked the two completely empty PGs as ‘complete’:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op mark-complete --pgid 2.50
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op mark-complete --pgid 2.57
After that, I started the ceph-osd service (systemctl start ceph-osd@2
) and forced re-creation of the removed PG (2.19):
ceph osd force-create-pg 2.19
After that, ceph pg ls
showed all of them as ‘active+clean
’, all my useless data was available, and ceph -s
was happy:
health: HEALTH_OK
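The decision rule above (remove the PG that still has a log so it can be re-created, mark the empty-logged ones complete) can be mechanized when you have many PGs. A hypothetical helper that only builds the command lines and runs nothing; the function name and the “has log entries” criterion are my assumptions, so review every command before executing it:

```python
def recovery_commands(osd_path: str, pgs: dict) -> list:
    """Map each empty incomplete PG (OBJECTS=0!) to a ceph-objectstore-tool
    invocation: PGs with log entries are removed (to be re-created afterwards
    with `ceph osd force-create-pg`), PGs with an empty log are marked complete.
    `pgs` maps pgid -> DISK_LOG entry count from `ceph pg ls incomplete`."""
    base = f"ceph-objectstore-tool --data-path {osd_path}"
    cmds = []
    for pgid, disk_log in pgs.items():
        if disk_log > 0:
            cmds.append(f"{base} --op remove --pgid {pgid} --force")
        else:
            cmds.append(f"{base} --op mark-complete --pgid {pgid}")
    return cmds

if __name__ == "__main__":
    # The three PGs from this post: 2.19 had 1500 log entries, the rest none.
    for cmd in recovery_commands("/var/lib/ceph/osd/ceph-2",
                                 {"2.19": 1500, "2.50": 0, "2.57": 0}):
        print(cmd)
```

Remember that the OSD must be stopped while you run the generated commands, and that every removed PG still needs a ceph osd force-create-pg afterwards.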
Conclusion
If you have stuck incomplete PGs with a zero object count (and you are sure the count is really zero), you can remove them or mark them complete. It’s an offline operation, but it lets your cluster continue to thrive. The operation itself is fast, so you may get away with a downtime of ‘discovery time’ plus 1–2 minutes.