SeaweedFS Distributed Storage Part 1: Introduction.

Ali Hussein Safar
5 min read · Sep 5, 2023


SeaweedFS is a distributed storage system inspired by Facebook’s Haystack photo storage system. It is designed to store blobs, objects, and files with predictable low latency and an O(1) disk seek (a single disk read operation is required to fetch the requested data from storage).

How SeaweedFS Stores Files

In a SeaweedFS cluster, files are stored in volumes located on volume servers. A volume is simply one large preallocated file of a predefined size that is used to store many smaller files. For instance, a 10 GB volume could store 10,000 files with an average size of 1 MB.
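The volume idea can be sketched in a few lines of Python. This is a toy model for illustration only, not SeaweedFS’s actual on-disk format: it appends each blob to one large file and keeps an in-memory index mapping a file ID to its offset and size, so a read is one index lookup plus one disk seek.

```python
import os
import tempfile

class Volume:
    """Toy model of a SeaweedFS volume: one large file holding many blobs,
    plus an in-memory index of file_id -> (offset, size)."""

    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}  # file_id -> (offset, size)

    def write(self, file_id, data):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()          # record where this blob starts
        self.f.write(data)
        self.f.flush()
        self.index[file_id] = (offset, len(data))

    def read(self, file_id):
        # One in-memory lookup, then a single seek + read on disk.
        offset, size = self.index[file_id]
        self.f.seek(offset)
        return self.f.read(size)

path = os.path.join(tempfile.mkdtemp(), "volume_1.dat")
vol = Volume(path)
vol.write(1, b"first blob")
vol.write(2, b"second blob")
print(vol.read(1))  # b'first blob'
```

Ten thousand small files stored this way occupy a single file on the underlying file system, which is exactly what makes the metadata savings in the next section possible.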

The Issues that SeaweedFS tries to solve:

  1. File System Inode Disk Size Limitation: Every file in every operating system must have metadata (known as an inode in Linux), which includes the file name, path, permissions, creation date, modification date, and location on disk. Some of this metadata is crucial, but the rest wastes disk space. In the XFS and EXT4 file systems, the default inode size is 512 bytes when CRC is enabled, which means every file costs its own size plus 512 bytes for the inode, and this includes empty files. For instance, storing 1 million small files requires 512 MB of extra storage for the inodes alone. SeaweedFS resolves this issue by storing multiple files in a single larger file (a volume). In the SeaweedFS configuration, the default volume size is 30 GB, which means all the required metadata is just 512 bytes for the volume’s own inode plus 16 bytes per file stored in the volume (the 16 bytes cover per-file metadata such as the offset within the volume and the size). Therefore, storing 1 million small files costs only 512 bytes + (1 million × 16 bytes), which is approximately 16 MB. This approach proved helpful for Facebook’s photo storage, which holds more than 260 billion uploaded images.
  2. Required Disk Operations to Read a File from Disk: In typical file systems, including XFS and EXT4, three or more disk read operations are required to read a single file: one to map the file path to an inode number, one to retrieve the inode from disk into memory, and one to retrieve the file itself from disk. SeaweedFS overcomes this by keeping the files’ metadata in memory, so only a single disk read operation is needed; memory access times are measured in nanoseconds, while HDD access times are measured in milliseconds.
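The metadata arithmetic from point 1 can be checked directly. The constants below are the figures quoted in the text (512-byte inodes with CRC enabled, 16 bytes of per-file metadata inside a volume):

```python
# Overhead of storing 1 million small files, per the figures in the text.
FILES = 1_000_000
INODE_SIZE = 512   # bytes per inode (XFS/EXT4 default with CRC enabled)
NEEDLE_META = 16   # bytes of per-file metadata kept inside a volume

# Classic file system: one inode per file.
fs_overhead = FILES * INODE_SIZE
# SeaweedFS: one inode for the whole volume file, plus 16 bytes per file.
seaweed_overhead = INODE_SIZE + FILES * NEEDLE_META

print(fs_overhead // 10**6, "MB")       # 512 MB
print(seaweed_overhead // 10**6, "MB")  # 16 MB
```

A roughly 32× reduction in metadata overhead, which is what lets the whole index fit comfortably in memory.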

SeaweedFS Cluster Components:

A typical SeaweedFS cluster consists of the following components:

  1. Master Service: In other distributed file systems, the masters manage the files stored on the storage servers (volume servers), but in SeaweedFS the masters only manage the volumes stored on the volume servers; the volume servers themselves manage the files stored inside the volumes. This means the master server stores only metadata about the volumes (such as volume sizes and which volume server each volume is stored on). Storing the actual file metadata is the role of the volume servers, not the master servers. The master is also responsible for choosing which volume server will store new files and for telling clients which volume server to read stored files from. Finally, the master server is responsible for converting volumes to read-only once they reach their maximum size.
  2. Volume Service: The volume servers each store a number of volumes, each 30 GB by default (configurable). The volume service is responsible for storing many objects (files and chunks of files) efficiently in its volumes. It also stores each file’s metadata (file name, size, offset within the volume, etc.) on disk and caches it in memory to provide fast access and the O(1) disk read operation. When a master is elected leader of the cluster, every volume server sends the newly elected master the metadata of every volume it stores.
  3. Filer Service: Interacting with the master and volume servers directly to upload and download files can be tedious, so the SeaweedFS team created the Filer service, which provides the following:
  • Connecting to the master servers for up-to-date volume locations, and requesting the volume server IP, volume ID, and file ID during file writes.
  • Handling file and directory operations, such as create, delete, read, write, and rename.
  • Keeping track of the metadata of files and directories, such as the file ID, file size, last modified time, and access time.
  • Communicating with the master server to keep track of the cluster state.
  • Logging and auditing file and directory operations.

The filer also provides entry points to access the data in different ways, some of which are:

  • HTTP entry point to upload and download files.
  • Read and write files directly as a local directory via FUSE mount point.
  • S3 compatible API.
  • File access from Hadoop/Spark/Flink/etc.
  • WebDAV.
  • Kubernetes CSI driver.

4. Filer Store Service: This is an important component used by the filer service to store file metadata (such as the volume ID, file ID, etc.). The filer store is also used to scale the filer service. There are two types of filer stores:

  1. Shared: All filers talk to a shared filer store such as MariaDB, Redis, or Cassandra. This option eases the scaling of filer servers, since all the filers use the same shared database.
  2. Embedded: Every filer stores metadata locally using LevelDB, an on-disk key-value store with a memory cache designed by Google. With the embedded filer store, all metadata changes are propagated to the other filers, if any exist, ensuring that every filer stays up to date with the latest metadata changes.

5. S3 Service: This is an optional service that offers AWS-style S3 buckets. It can be launched on its own or alongside the filer.
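As a rough illustration of the master’s bookkeeping described above, here is a toy model (not the real SeaweedFS API; names and sizes are illustrative): it tracks which volume lives on which volume server, picks a writable volume for new writes, and flips a volume to read-only once it reaches the 30 GB limit.

```python
MAX_VOLUME_SIZE = 30 * 10**9  # 30 GB default volume size

class Master:
    """Toy model of the master's role: it tracks volumes, not files."""

    def __init__(self):
        # volume_id -> {"server": str, "used": int, "writable": bool}
        self.volumes = {}

    def register(self, volume_id, server):
        # A volume server reports a volume it hosts.
        self.volumes[volume_id] = {"server": server, "used": 0, "writable": True}

    def assign(self, size):
        """Pick a writable volume with room for `size` bytes and return
        (volume_id, server); the client then talks to that server directly."""
        for vid, vol in self.volumes.items():
            if vol["writable"] and vol["used"] + size <= MAX_VOLUME_SIZE:
                vol["used"] += size
                if vol["used"] >= MAX_VOLUME_SIZE:
                    vol["writable"] = False  # full: read-only from now on
                return vid, vol["server"]
        raise RuntimeError("no writable volume available")

master = Master()
master.register(1, "volume-server-a:8080")
vid, server = master.assign(1_000_000)
print(vid, server)  # 1 volume-server-a:8080
```

Note what is absent: the master never sees a file name or file contents. It only hands out a volume location, which is why it stays lightweight even with billions of files in the cluster.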

The following diagram shows the cluster components and their connections:

SeaweedFS Cluster Components

Note: The connection between the client and the volume server is optional. It is used, for example, when the client mounts the cluster via a FUSE mount.

Part 2: Reading and Writing Files’ Process.
Part 3: Features.

Resources

https://github.com/seaweedfs/seaweedfs/wiki
https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf

Thank you for reading my article on SeaweedFS. I hope you found it informative and helpful. If you enjoyed the article and would like to support my work, follow me, or you can buy me a coffee at https://www.buymeacoffee.com/ahsifer. Your support is greatly appreciated.
