Exploring the Architecture of Object-Based Storage: A Comprehensive Guide to Understanding the Fundamentals

Kamal Maiti
10 min read · Oct 14, 2022


Overview

It is fascinating to observe how compute and storage have been separated in recent years to support massive scale at both levels. On the storage side, the industry has long relied on file- and block-based storage. However, managing metadata such as inodes, other attributes, and journal information can be a burden for the compute node that manages the file system.

To alleviate this burden, a new way of storing files on disk has emerged rapidly in recent years: object-based storage. By using an “object ID,” specific files can be tracked, retrieved, and deleted. Essentially, some of the metadata has been offloaded to a separate compute node called a “metadata server.”

Features of Object Based Storage

Here are its key features:

  • Scalable performance via an offloaded data path
  • Robust & shared access by many clients
  • Intelligent space management in the storage layer
  • Data-aware pre-fetching and caching
  • Strong, fine-grained, end-to-end security, e.g. TLS, encryption, ACLs, IAM role-based access, etc.

Two Major Benefits

1. The Object Storage Architecture provides a method for compute nodes to access storage devices directly (over a distributed network) and in parallel (through multipath), providing very high performance.

Example: uploading a file to an S3 bucket in multipart mode, or retrieving data similarly using an object ID (i.e. the whole path from bucket to file name). It is worth mentioning that a sample object ID in the AWS cloud looks like this: “s3://mybucket/examplefolder/testfile.csv”
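As a concrete illustration, here is a minimal Python sketch of a multipart upload with boto3, the AWS SDK; the bucket and file names are hypothetical, matching the sample object ID above. boto3 switches to multipart transfers automatically once a file exceeds the configured threshold.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Use multipart mode for files larger than 8 MiB, uploading 4 parts in parallel.
config = TransferConfig(multipart_threshold=8 * 1024 * 1024,
                        multipart_chunksize=8 * 1024 * 1024,
                        max_concurrency=4)

# upload_file transparently splits the file into parts and uploads them in parallel;
# "mybucket" and the key are hypothetical placeholders.
s3.upload_file("testfile.csv", "mybucket", "examplefolder/testfile.csv",
               Config=config)
```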

2. It distributes the system metadata, allowing shared file access without a central bottleneck.

The metadata server is itself scalable, so client requests don’t all hit a single server. The system also slices data files and replicates the slices across multiple disks.

How data is accessed in parallel

  • This architecture defines a new and more intelligent disk interface called the Object-based Storage Device (OSD). The OSD is a network-attached device containing the storage media (disk or tape) and sufficient intelligence to manage the data that is locally stored.
  • The compute nodes communicate directly with the OSD to store and retrieve data (bypassing the metadata server). Since the OSD has intelligence built in, there is no need for a file server to mediate the transaction. And because the file system stripes the data across a number of OSDs, the aggregate I/O and data throughput rates scale linearly.

As an example, a single OSD attached to Gigabit Ethernet may be capable of delivering 400 Mbps of data to the network and 1000 storage I/O operations, but if the data is striped across 10 OSDs and accessed in parallel, the aggregate data rates achieve 4,000 Mbps and 10,000 I/O operations.
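The linear-scaling arithmetic behind this example is straightforward; the per-OSD figures below are the illustrative numbers from the text, not measured values.

```python
# Illustrative per-OSD figures from the example above (not measurements).
per_osd_mbps = 400    # data rate of one OSD on Gigabit Ethernet
per_osd_iops = 1000   # storage I/O operations per second of one OSD
num_osds = 10         # OSDs striped across and accessed in parallel

# With striping, aggregate rates scale linearly with the number of OSDs.
print(per_osd_mbps * num_osds, "Mbps")  # 4000 Mbps
print(per_osd_iops * num_osds, "IOPS")  # 10000 IOPS
```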

What happens behind the scenes with distributed metadata

A traditional storage system has a single monolithic metadata server that serves two primary functions.

>> First, it provides the compute node with a logical view of the stored data (the Virtual File System or VFS layer), the list of file names, and typically the directory structure in which they are organized.

>> Second, it organizes the data layout in the physical storage media (the inode layer).

The Object Storage Architecture separates the logical view of the stored data (the VFS layer) from the physical view (the inode layer) and distributes the workload, unlocking the performance potential of the OSDs and avoiding the metadata-server bottlenecks found in today’s NAS systems.

The VFS portion of the metadata typically represents approximately 10% of the workload of a typical NFS server, while the remaining 90% of the work is done at the inode layer with the physical distribution of data into storage media blocks.

We’ll explore the architecture of object-based storage and understand how it interacts with clients. AWS S3, GCP Cloud Storage, and Azure Blob Storage are some examples of object-based storage.

Current Architecture

Object Based Storage Architecture, Image by Author

OBJECT STORAGE COMPONENTS

There are five major components to the Object Storage Architecture.

1. Object

It contains the actual data and enough additional information to allow the data to be autonomous and self-managing. This additional information can define, on a per-file basis, the RAID level, data layout, and quality of service. These additional attributes are tracked in the metadata server.

All objects are accessed via a 96-bit object ID within the device. The object is accessed with a simple interface based on the object ID, the beginning of the range of bytes inside the object, and the length of the byte range that is of interest (<object, offset, length>); a sketch of this interface appears after the attribute list below. There are three different types of objects.

  • The “Root” object on the storage device identifies the storage device and various attributes of the device itself, including its total size and available capacity.
  • A “Group” object provides a “directory” to a logical subset of the objects on the storage device.
  • A “User” object carries the actual application data to be stored.

The user object is a container for application data and two types of attributes.

  • Application Data — The application data is essentially equivalent to the data that a file would hold in a conventional system. It is accessed with file-like commands such as Open, Close, Read, and Write.
  • Storage Attributes — These attributes are used by the storage device to manage the block allocation for the data. This includes the object ID, block pointers, logical length, and capacity used. This is similar to the inode-level attributes inside a traditional file system.
  • User Attributes — These attributes are opaque to the storage device and are used by applications and metadata managers to store higher-level information about the object, such as file system attributes like ownership and access control lists (ACLs).
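To make the <object, offset, length> interface concrete, here is a hypothetical Python sketch of a user object and the read/write entry points of an OSD. The class and field names are illustrative inventions, not part of the OSD standard.

```python
from dataclasses import dataclass, field

@dataclass
class UserObject:
    object_id: int                                        # 96-bit ID in a real OSD
    data: bytearray = field(default_factory=bytearray)    # application data
    storage_attrs: dict = field(default_factory=dict)     # block pointers, length, ...
    user_attrs: dict = field(default_factory=dict)        # opaque to the device (ACLs, ...)

class OSD:
    """Hypothetical object store exposing the <object, offset, length> interface."""

    def __init__(self):
        self.objects: dict[int, UserObject] = {}

    def read(self, object_id: int, offset: int, length: int) -> bytes:
        # Serve only the requested byte range of the object.
        return bytes(self.objects[object_id].data[offset:offset + length])

    def write(self, object_id: int, offset: int, payload: bytes) -> None:
        obj = self.objects.setdefault(object_id, UserObject(object_id))
        if len(obj.data) < offset + len(payload):          # grow the object as needed
            obj.data.extend(b"\x00" * (offset + len(payload) - len(obj.data)))
        obj.data[offset:offset + len(payload)] = payload
```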

2. Object based Storage Device (OSD)

An intelligent evolution of today’s disk drive that can store and serve objects rather than simply putting data on tracks and sectors. This is innovation at the hardware level.

The OSD is an intelligent device that contains the disk, a processor, RAM, and a network interface, allowing it to manage the local object store and autonomously serve and store data over the network. It is the foundation of the Object Storage Architecture, providing the equivalent of the SAN fabric in conventional storage systems.

In the “Object SAN,” the network interface is gigabit Ethernet instead of fibre channel and the protocol is iSCSI, the encapsulation of the SCSI protocol transported over TCP/IP. SCSI supports several command sets, including block I/O, tape drive control, and printer control. The new OSD command set describes the operations available on Object-based Storage Devices.

The result is a group of intelligent disks (OSDs) attached to a switched network fabric (iSCSI over Ethernet) providing storage that is directly accessible by the compute nodes. Unlike conventional SAN configurations, the Object Storage Devices can be directly addressed in parallel, without an intervening RAID controller, allowing extremely high aggregate data throughput rates.

Check the HPE and Dell links in the references at the end of this article to get an idea of the object-based storage hardware they offer.

The OSD provides four major functions for the data storage architecture:

a) Data Store: The primary function of any storage device is to reliably store and retrieve data from physical media. The data is not accessible outside the OSD in block format, only via object IDs. The compute node requests a particular object ID, an offset at which to start reading or writing within that object, and the length of the data block requested (as in the <object, offset, length> sketch earlier).

b) Intelligent Layout: The OSD uses its processor and memory to handle intelligently how data is written to disk and how it is fetched. It manages which set of sectors is selected and how much data is written to each.

c) Metadata Management: The OSD manages the metadata about each object. This is similar to the inode information we know from Linux or Unix. The client computer thus reduces its burden by handing all of this metadata management over to the storage controller. The metadata server maintains only the object ID and leaves sector-level and other details to the OSD, which reduces overhead on the metadata server.

d) Security: Every action and authorization is validated by this device; it can handle encryption as well.

3. Installable File System

Integrates with compute nodes, accepts POSIX file system commands and data from the operating system, addresses the OSDs directly, and stripes the objects across multiple OSDs. Example: “s3fs-fuse”.
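As a hedged illustration of that idea, here is how an S3 bucket could be mounted as a POSIX file system with s3fs-fuse, wrapped in Python for consistency with the other sketches. The bucket name, mount point, and credential file are hypothetical, and s3fs must already be installed.

```python
import os
import subprocess

# Hypothetical bucket and mount point; s3fs-fuse exposes the bucket as a
# POSIX file system so ordinary file commands work against object storage.
subprocess.run(
    ["s3fs", "mybucket", "/mnt/mybucket",
     "-o", "passwd_file=" + os.path.expanduser("~/.passwd-s3fs")],
    check=True,
)
```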

4. Metadata Server

Mediates among the multiple compute nodes in the environment, allowing them to share data while maintaining cache consistency on all nodes.

It takes care of authentication, file or directory access management, cache control, capacity management, scaling, etc. It distributes the block/sector management to the OSDs (approximately 90% of the workload) and maintains the file/directory metadata management (10% of the workload) in a separate server that can also be implemented as a scalable cluster. The scalability of the MDS is the key to allowing the entire object storage system to scale, balanced in both capacity and performance.

5. Network Fabric

It ties the compute nodes to the OSDs and Metadata Servers, providing the connectivity infrastructure that binds the Object-based Storage Devices, Metadata Server, and compute nodes into a single fabric.

With the advent of inexpensive gigabit Ethernet, it became possible to run storage traffic at speeds that meet or exceed specialized storage transports like Fibre Channel. This gives the Object Storage Architecture two advantages:

(i) The lower component costs implied by the commodity status of Ethernet and

(ii) Lower management costs associated with the widespread knowledge of building reliable Ethernet fabrics.

However, the Object Storage Architecture is wedded only to TCP/IP, not to Ethernet itself. It is possible to build an object storage system on other transports such as Myrinet and InfiniBand using their support for TCP/IP.

How READ operations take place

1. Client initially contacts the Metadata Server.

2a. Metadata Server returns a list of objects.

2b. Metadata Server also returns a security capability, such as an authentication token carrying authorization details (role, permissions, verbs, etc.).

The capability, or security token, authorizes the node to access the specific component objects, at specific offsets, with a specific set of permissions, for a specific length of time.

3. Client sends read requests directly to the OSD, along with the capability.

4. Direct data transfer takes place between client and OSD.
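A minimal Python sketch of this flow, assuming an HMAC-signed capability shared between the metadata server and the OSDs; the key, field names, and token format are hypothetical illustrations, not a real OSD protocol. It reuses the hypothetical OSD class from the earlier sketch.

```python
import hashlib
import hmac
import time

SECRET = b"shared-osd-secret"  # hypothetical key shared by the MDS and OSDs

def issue_capability(object_id: int, verb: str, ttl: int = 60) -> dict:
    """Metadata server (step 2b): grant time-limited access to one object."""
    expiry = int(time.time()) + ttl
    msg = f"{object_id}:{verb}:{expiry}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return {"object_id": object_id, "verb": verb, "expiry": expiry, "sig": sig}

def osd_read(osd, cap: dict, offset: int, length: int) -> bytes:
    """OSD (steps 3-4): verify the capability, then serve bytes directly."""
    msg = f"{cap['object_id']}:{cap['verb']}:{cap['expiry']}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    if (cap["verb"] != "read" or time.time() > cap["expiry"]
            or not hmac.compare_digest(expected, cap["sig"])):
        raise PermissionError("invalid or expired capability")
    return osd.read(cap["object_id"], offset, length)
```

Note that the metadata server never touches the data itself; it only signs a short-lived token, which is what keeps it off the data path.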

How WRITE operations take place

1. Client initially contacts the Metadata Server.

2a. Metadata Server returns a list of objects.

2b. Metadata Server also returns a security capability.

3. Client sends write requests directly to the OSD, along with the capability.

4. Direct data transfer takes place between client and OSD.

Performance

In a traditional system, the WRITE commands generate a series of blocks written out to sectors and tracks on disk, all of which must be managed by the NAS filer head in order to optimize the placement of the data on the media, creating a significant burden on the filer head.

In an Object Storage system, compute nodes send data to the OSDs, which autonomously optimize the placement of the data on the media. This minimizes the burden on the compute node and allows it to write to multiple OSDs in parallel, maximizing the throughput of a single client while also allowing the storage system to handle the aggregated writes of a large cluster (see the sketch below). This can be especially important in applications that use checkpointing, where all compute nodes run to a barrier in the application and then write their physical memory out to storage.
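Here is a minimal sketch of parallel striped writes, again reusing the hypothetical OSD class from earlier; the stripe size and round-robin placement are illustrative assumptions, not a prescribed layout.

```python
from concurrent.futures import ThreadPoolExecutor

STRIPE = 64 * 1024  # hypothetical 64 KiB stripe unit

def striped_write(osds: list, object_id: int, data: bytes) -> None:
    """Split data into stripe units placed round-robin across the OSDs."""
    chunks = [(i, data[i:i + STRIPE]) for i in range(0, len(data), STRIPE)]

    def drain(osd_idx: int) -> None:
        # Each worker owns one OSD and writes that OSD's stripes in order,
        # so the devices operate in parallel without sharing state.
        for offset, chunk in chunks:
            if (offset // STRIPE) % len(osds) == osd_idx:
                osds[osd_idx].write(object_id, offset, chunk)

    # Aggregate throughput scales with the number of OSDs rather than
    # being capped by a single device's limit.
    with ThreadPoolExecutor(max_workers=len(osds)) as pool:
        list(pool.map(drain, range(len(osds))))
```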

These peak rates are important, but for most Linux cluster applications, the aggregate sustained I/O and throughput rates from storage to large numbers of compute nodes are even more important. The level of performance offered by the Object Storage Architecture is not achievable by any other storage architecture.

Nowadays, OLAP, OLTP, and ML/AI compute nodes consume data from object storage; essentially, they externalize their data store. For OLAP, a node may copy a huge amount of data from the OSD to its local compute cache or local disk, and this copying can increase overall execution time, although prefetching mechanisms reduce it. As an example, Snowflake uses S3 as external storage, which is slower than its result-set cache or local disk. However, Snowflake ensures private connectivity to AWS S3 and keeps compute and buckets in the same region, which reduces the overall execution time of a query.

Scalability

The OSD offloads 90% of the workload from the metadata servers. The metadata server is itself scalable and handles the remaining 10% of the workload from the compute nodes. At the network fabric level, multiple high-bandwidth multipath links are present, so there is no compromise on scalability. Today, data is also replicated across geographic locations so that multiple clients can access the same data from different OSD data centers.

Security

  • Authentication of compute nodes to the storage system through IAM, roles, access keys, secret keys, multi-factor and session-based mechanisms, etc.
  • Authorization of compute node commands to the storage system via permissions, roles, verbs, ACLs, etc.
  • Authentication of the storage devices to the storage system, i.e. validating the user who is trying to access data through an internal identity or an external IdP.
  • Integrity checking of all commands via CRC checks (see the sketch below)
  • Privacy of data and commands in flight via IPsec
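A minimal sketch of the CRC idea, using Python’s zlib.crc32; the framing format is a hypothetical illustration, not the actual OSD wire protocol.

```python
import zlib

def frame_command(payload: bytes) -> bytes:
    # Append a CRC32 checksum so the receiver can detect corruption in flight.
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def verify_command(frame: bytes) -> bytes:
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC mismatch: command corrupted in flight")
    return payload
```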

In conclusion

Due to compliance strategies and competition among cloud vendors, they may choose not to share all the details of their architectures. However, I have found various white papers and architecture documents that are quite informative. In this article, I’ve provided some insights into object-based storage. It may not be suitable for heavy I/O operations, so it’s important to carefully consider your storage options based on the recommendations of your cloud vendor. You should analyze whether your use case can be adequately supported on this storage, taking into account factors such as security, scalability, separation of storage from compute, and backup.

References:

https://docs.aws.amazon.com/s3/index.html

https://cloud.google.com/storage/docs

https://azure.microsoft.com/en-gb/products/storage/blobs/#documentation

https://www.netapp.com/data-storage/storagegrid/

https://www.panasas.com/

https://www.dell.com/nl-nl/dt/learn/data-storage/object-storage.htm

https://www.hpe.com/in/en/storage/file-object/scality.html
