Working with Ceph Object Storage

What is Ceph?

Ceph is an open source distributed object store and file system designed to provide excellent performance, reliability and scalability.

Why should I use it?

In many ways Ceph is the only storage solution that delivers four critical capabilities:

  • open source
  • software-defined
  • enterprise-class
  • unified storage (object, block, file)

Many other storage systems offer two or three of these, but almost none offers all four together. They may be open source, scale-out, software-defined, unified, or have enterprise features, and some combine a few of these.

To make each point clearer, here is what it means in practice:

  • Open source means lower cost for you or for your company
  • Software-defined means flexibility in deployment, faster hardware upgrades, and lower cost
  • Scale-out means it’s less expensive to build large systems and easier to manage them
  • Block + Object means more flexibility (most other storage systems are block only, file only, object only, or some combination of two)
  • Enterprise features mean replication (or erasure coding), snapshots, thin provisioning, automated storage tiering (the ability to shift data between flash and hard drives), and self-healing capabilities

Figure 1: Venn diagram showing that Ceph might be unique

Ceph organizes data automatically using CRUSH, the algorithm responsible for the intelligent distribution of objects inside the cluster. I’m not going to describe how CRUSH works in further detail, but you can get an idea of what it can do from this article.
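
To give a feel for what "computing where an object lives" means, here is a minimal Python sketch. It is not CRUSH: it uses rendezvous (highest-random-weight) hashing as a much simpler stand-in, and it ignores weights, failure domains, and placement groups. The function name place and the OSD names are made up for illustration.

```python
import hashlib


def place(object_name, osds, replicas=3):
    """Pick `replicas` storage daemons for an object by hashing alone.

    Simplified stand-in for CRUSH (rendezvous hashing): every client that
    knows the same list of OSDs computes the same placement, so no central
    lookup table is needed.
    """
    def score(osd):
        # Combine the OSD name and object name into one pseudo-random score.
        digest = hashlib.sha256(f"{osd}:{object_name}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The highest-scoring OSDs hold the object and its replicas.
    return sorted(osds, key=score, reverse=True)[:replicas]


# Hypothetical cluster of eight OSDs.
cluster = [f"osd.{i}" for i in range(8)]
print(place("photo.jpg", cluster))
```

With this scheme, removing an OSD that does not hold a given object leaves that object's placement untouched, which hints at why hash-based placement limits data movement when the cluster changes.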

How to use it?

At eMAG we use it together with OpenStack, though not exclusively. Our cloud infrastructure needs scale-out storage, and the best option for that is Ceph. It is tightly integrated with OpenStack and scales with user needs. The first Ceph cluster built at eMAG was a Supermicro cluster used to test Ceph as Block Storage for OpenStack, and later as Object Storage for various internal teams.

Currently it is the main backend for the “DEV” OpenStack cluster, and it still holds some legacy Object Storage buckets that need to be migrated off. Future plans include using it as “warm storage” via Object Storage for backups and other data that is not accessed frequently but is not suited to offline storage (i.e., on tape).

Ceph Object Storage supports two interfaces:

  1. S3 — provides object storage functionality with an interface that is compatible with a large subset of the Amazon S3 RESTful API.
  2. Swift — provides object storage functionality with an interface that is compatible with a large subset of the OpenStack Swift API.

I integrated Ceph as Object Storage in the application that I work on, which is written in PHP, using the AWS SDK. Here you will find a demonstration of how you can work with Ceph in PHP. It’s very easy to use, and Amazon S3 has a well-documented API.
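
Because the S3 interface is compatible, any S3 client can talk to Ceph once it is pointed at the Ceph gateway instead of Amazon. The integration described above is in PHP; the sketch below shows the same idea in Python with the boto3 library, purely as an illustration. The endpoint URL, credentials, bucket, and key are placeholders, and the helper names make_client and store_and_fetch are hypothetical.

```python
def make_client(endpoint_url, access_key, secret_key):
    """Build an S3 client that targets a Ceph gateway rather than AWS."""
    import boto3  # third-party dependency: pip install boto3

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,  # the only Ceph-specific setting
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )


def store_and_fetch(client, bucket, key, data):
    """Create a bucket, upload one object, and read it back as bytes.

    Assumes the bucket does not exist yet.
    """
    client.create_bucket(Bucket=bucket)
    client.put_object(Bucket=bucket, Key=key, Body=data)
    response = client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()


# Example usage (placeholder endpoint and credentials):
# client = make_client("http://ceph-gateway.example.com", "ACCESS_KEY", "SECRET_KEY")
# print(store_and_fetch(client, "demo-bucket", "hello.txt", b"hello ceph"))
```

The application code stays the same whether the backend is Amazon S3 or Ceph; only the endpoint and credentials change.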

Ceph performance conclusion

If you are wondering about the performance tests we ran at eMAG when we decided to use Ceph, our internal tests covered many scenarios.

Test cases

Object size:
- between 150KB and 300KB
- between 900KB and 1100KB
Number of objects:
- 100K
- 500K
- 2M
Number of concurrent clients:
- 50
- 200
- 1000
- 5000

Test scenarios

1. 150KB - 300KB files, 100K files, 50 threads
2. 150KB - 300KB files, 100K files, 200 threads
3. 150KB - 300KB files, 100K files, 1000 threads
4. 150KB - 300KB files, 100K files, 5000 threads
5. 150KB - 300KB files, 500K files, 50 threads
6. 150KB - 300KB files, 500K files, 200 threads
7. 150KB - 300KB files, 500K files, 1000 threads
8. ............................................
9. 900KB - 1100KB files, 2M files, 1000 threads
10. 900KB - 1100KB files, 2M files, 5000 threads
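
The scenario list can be generated rather than written by hand. The Python sketch below assumes the scenarios are the full cross product of the test cases above (2 size ranges × 3 object counts × 4 client counts, i.e. 24 combinations); the list in this article is abbreviated, so that is an assumption. The names build_scenarios, format_count, and describe are made up for illustration.

```python
from itertools import product

# Test-case dimensions taken from the lists above.
OBJECT_SIZES_KB = [(150, 300), (900, 1100)]
OBJECT_COUNTS = [100_000, 500_000, 2_000_000]
CLIENT_COUNTS = [50, 200, 1000, 5000]


def build_scenarios():
    """Return every size/count/threads combination, with threads varying fastest."""
    return [
        {"size_kb": size, "objects": count, "threads": threads}
        for size, count, threads in product(
            OBJECT_SIZES_KB, OBJECT_COUNTS, CLIENT_COUNTS
        )
    ]


def format_count(n):
    """Render 100000 as '100K' and 2000000 as '2M'."""
    return f"{n // 1_000_000}M" if n >= 1_000_000 else f"{n // 1000}K"


def describe(scenario):
    """Format a scenario the same way the list above is written."""
    low, high = scenario["size_kb"]
    count = format_count(scenario["objects"])
    return f"{low}KB - {high}KB files, {count} files, {scenario['threads']} threads"


for number, scenario in enumerate(build_scenarios(), start=1):
    print(f"{number}. {describe(scenario)}")
```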

During all our tests, Ceph did its job without any problems. Performance remained very stable even under high load and regardless of the amount of data stored. In our opinion, Ceph is an excellent choice for storing large amounts of data, and it gives you a reliable, inexpensive, and performant option for distributed storage.

We also have to say that the observed performance is the result of a carefully tuned hardware and software configuration. So while Ceph is an excellent choice for distributed object and file storage, installing and configuring it requires substantial knowledge of Unix, networking, and Ceph internals.