rados objects: omaps and xattrs

George Shuklin
Published in OpsOps
3 min read · Oct 11, 2019

If you take the basic assumptions of a beginner’s introduction to Ceph for granted, you risk missing one foundational detail of ‘how Ceph works’.

That foundational detail is ‘what is an object in Ceph?’, and the basic beginner’s assumption is ‘an object is like a file: you can put it and get it.’

This assumption is a rough approximation, and it’s good enough for an initial understanding of the relations between PGs and OSDs, CRUSH, etc.

Unfortunately, it’s not good enough if you dig deeper. So, listen up.

A Ceph object consists of…

Each object may have four components:

  • Name
  • Data content (a data stream)
  • xattrs (attributes)
  • OMAP (more on that below)

While ‘name’ and ‘data content’ are part of the “beginner’s assumption”, xattrs and OMAPs aren’t. That said, xattrs normally don’t cause any misunderstanding: normal files have attributes too.

OMAP is a completely different beast.

What is OMAP?

OMAP is a key-value database. Each object may have its own database, which supports the following operations (a quick rados CLI reference follows the list):

  • insert/update a key-value pair
  • delete a pair by its key
  • get a value by key
  • list keys
  • list key-value pairs
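
For reference, each of these maps onto a rados command-line verb (we will try them live at the end of this post; pool, object, key, and value names here are placeholders):

rados -p <pool> setomapval <obj> <key> <value>   # insert/update a pair
rados -p <pool> rmomapkey <obj> <key>            # delete a pair by key
rados -p <pool> getomapval <obj> <key>           # get a value by key
rados -p <pool> listomapkeys <obj>               # list keys
rados -p <pool> listomapvals <obj>               # list key-value pairs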

This database can be arbitrarily large, although Ceph does not like oversized OMAPs; it can even complain about them in the health report:

HEALTH_WARN 1 large omap objects
LARGE_OMAP_OBJECTS 1 large omap objects
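
If you hit this warning, ceph health detail names the pool (and PG) holding the large object, so you can hunt it down with rados. The warning is raised during deep scrub, with thresholds controlled by the osd_deep_scrub_large_omap_object_key_threshold and osd_deep_scrub_large_omap_object_value_sum_threshold options:

ceph health detail
ceph config get osd osd_deep_scrub_large_omap_object_key_threshold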

This database is ‘attached’ to the object and is replicated between OSDs, just like the data stream. When someone queries that database, the requests are sent directly to the primary OSD, which answers them from its local RocksDB (LevelDB on older installations). Every write request (insert/update) is replicated to all secondary OSDs (or, at least, to min_size live secondaries). Moreover, when an object is moved between OSDs, its OMAP is replicated as well.

Why?

The reason is performance. If you need to perform database-like operations, doing them as read/write operations on the data storage can be overly expensive. Even with the best tree data structures you still need on the order of log(n) read operations per lookup, and on the order of log(n) operations per write. If you performed those against the object’s data stream, each one would cost a network round trip. As a back-of-the-envelope illustration: with a million keys and a 0.5 ms RTT, a lookup touching ~20 tree levels would spend ~10 ms on the network alone. Combine that with replication latency (the primary OSD confirms a write only after enough secondaries confirm it), and you get a significant bottleneck. Moving that multitude of small operations onto the OSD itself gives an enormous performance boost.

Is it used?

Yes, it is used, primarily by Ceph’s own applications: RGW and CephFS. The former is an excellent example of ‘how to use OMAPs’: RGW bucket indexes are stored as OMAPs of special ‘index’ objects with an apparently zero-size data stream (that means the size of those objects is zero, but each one’s OMAP can be a few megabytes).
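
You can peek at this on any cluster with RGW. A quick sketch, assuming a default-zone setup (the index pool name varies per installation, and .dir.<bucket-id> stands in for a real index object name):

rados -p default.rgw.buckets.index ls
rados -p default.rgw.buckets.index stat .dir.<bucket-id>
rados -p default.rgw.buckets.index listomapkeys .dir.<bucket-id>

stat reports a zero size, while listomapkeys returns (roughly) one key per object in the bucket.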

Look and feel

In modern Ceph (circa 14.2/Nautilus at the time of writing) one can see OMAP usage in the output of ceph osd df (I trimmed the output a bit):

ID   SIZE     RAW USE   DATA      OMAP     META     AVAIL    %USE   VAR   PGS
 5   931 GiB  128 GiB   126 GiB   193 MiB  1.2 GiB  803 GiB  13.70  0.75  78
 4   931 GiB  213 GiB   211 GiB   110 MiB  1.7 GiB  718 GiB  22.86  1.25  73
 2   931 GiB  169 GiB   167 GiB    96 MiB  1.6 GiB  762 GiB  18.17  1.00  86
 3   931 GiB  212 GiB   210 GiB   289 MiB  1.8 GiB  719 GiB  22.81  1.25  82
 0   931 GiB   85 GiB    84 GiB   175 MiB  977 MiB  846 GiB   9.14  0.50  58
 1   931 GiB  212 GiB   210 GiB   166 MiB  1.7 GiB  719 GiB  22.78  1.25  87
Tot  5.5 TiB  1019 GiB  1009 GiB  1.0 GiB  9.0 GiB  4.5 TiB  18.24

When objects with OMAPs have been removed, it’s easy to ‘compact’ the OMAP storage:

ceph tell osd.* compact

(Beware: it’s a resource-hungry, latency-inducing operation.)
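
To soften the impact, you can compact one OSD at a time instead of the whole cluster at once:

ceph tell osd.0 compact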

Internally, OMAPs can be stored on a separate partition or device for an OSD (e.g. on NVMe) while keeping the data on a slower medium (e.g. on rusty spindles).
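
With BlueStore this is done by giving the OSD a separate DB device at creation time. A minimal sketch with ceph-volume (the device paths here are just examples):

ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1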

You can perform OMAP operations on an object with rados:

rados -p pool1 setomapval test foo bar

(test is the object name, foo is the key, bar is the value)

rados -p pool1 listomapkeys test
foo
rados -p pool1 listomapvals test
foo
value (3 bytes) :
00000000 62 61 72 |bar|
00000003
rados -p pool1 getomapval test foo
value (3 bytes) :
00000000 62 61 72 |bar|
00000003

(yep, a value can be binary, so it’s displayed as a hexdump)

and we can clean up the mess:

rados -p pool1 rmomapkey test foo
rados -p pool1 clearomap test

Takeaways

Objects in Ceph are more complex than just ‘name + data’: they have a built-in key-value database (OMAP) which plays a big role in RGW and CephFS operations.
