Distributed File Systems and Object Stores on Linode (Part 1)
We live in the Age of Data. Data analytics, machine learning and big data have become critical factors — and in some cases, even unique selling points and outright products — for many businesses and verticals.
This explosion in data often involves persisting large volumes of it to storage. The distributed, scalable nature of big data processing systems imposes similar demands of scalability and performance on their storage layers.
Distributed File Systems and Object Stores are the two storage models most prevalent in big data ecosystems.
Distributed File Systems (DFS) provide the familiar directories-and-files hierarchical organization we find in our local workstation file systems. Each file or directory is identified by a path that includes all the other components in the hierarchy above it. What is unique about DFSs compared to local filesystems is that files or file contents may be stored across disks of multiple servers instead of a single disk.
Object Stores, in contrast, store data in a flat non-hierarchical namespace where each piece of data is identified by an arbitrary unique identifier. Any other details about the piece of data are stored along with the data itself, as metadata. Metadata may include permissions, file size, file type, or even details of relationships to other objects, such as to a “parent” object or a “group” object, where the meanings of “parent” or “group” are defined by the client application that added the metadata.
In an Object Store, each consuming application can decorate the same object with its own application-specific metadata entries that carry meaning only to itself. This flexibility and openness are lost in the more rigid directories-and-files hierarchy of a distributed file system, forcing applications to store their application-specific metadata away from the files, such as in external databases. This separation of data and metadata storage complicates application architectures. Object Stores, in contrast, lead to fewer moving parts and simpler application architectures. However, a distributed file system is sometimes unavoidable, especially if the system contains components that expect a file system interface.
In this part 1 of the series, we explore GlusterFS, a popular distributed file system.
GlusterFS is a scalable, POSIX-compliant, distributed file system with a fairly simple architecture.
In Gluster, a Brick is just a storage area. It can be a physical disk, a virtual disk, a simple disk partition or even a mere directory. A brick has, or is part of, a regular disk file system such as ext4, XFS, btrfs or any other filesystem that supports extended file attributes.
A Storage Pool is a set of nodes, each hosting one or more bricks and running the GlusterFS daemons.
A Volume is a notional storage area that in reality can span bricks of multiple nodes. Volumes are what client applications see and operate on. Each set of logically related data may go into a separate volume — for example, all videos go to a “videos” volume, or all documents to a “documents” volume.
Volumes are mounted as regular directories on client machines, but under the covers, they may span many bricks across multiple nodes of the storage pool and orchestrate those nodes to carry out file operations requested by applications.
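As a sketch of what that looks like from the client side (the hostname and volume name here are hypothetical), mounting a Gluster volume needs only the native FUSE client:

```shell
# Install the Gluster client package (Debian/Ubuntu shown; package names vary).
apt-get install -y glusterfs-client

# Mount the hypothetical "videos" volume exported by storage node gluster1.
# The client contacts gluster1 only to fetch the volume's layout; after that
# it talks to all bricks of the volume directly.
mkdir -p /mnt/videos
mount -t glusterfs gluster1.example.com:/videos /mnt/videos
```

From here on, applications read and write under /mnt/videos like any local directory.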
For each volume, GlusterFS supports different brick layouts, such as distributed, replicated, striped or erasure-coded, to achieve different levels of scalability, availability, data durability and performance. Each layout is explained well in the GlusterFS documentation; choose one per volume by analyzing the application workloads that will run against it.
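The layout is chosen at volume-creation time. As a sketch (node names and brick paths here are hypothetical):

```shell
# Distributed volume: files are spread across bricks, no redundancy.
gluster volume create documents \
    gluster1:/data/brick1 gluster2:/data/brick1

# Replicated volume: every file is stored on all three bricks.
gluster volume create videos replica 3 \
    gluster1:/data/brick2 gluster2:/data/brick2 gluster3:/data/brick2

gluster volume start documents
gluster volume start videos
```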
Gluster is scalable. When you need more storage, just create more nodes in the storage pool and add their bricks to the required volumes, making sure each brick is added to only one volume.
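Expanding a volume is a two-step operation: grow it with the new bricks, then rebalance so existing data spreads onto them. A sketch with a hypothetical new node (note that for replicated volumes, bricks must be added in multiples of the replica count):

```shell
# Bring the new node into the trusted pool, then grow the volume.
gluster peer probe gluster4
gluster volume add-brick documents gluster4:/data/brick1

# Redistribute existing files onto the new brick.
gluster volume rebalance documents start
gluster volume rebalance documents status
```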
Apart from cluster-level replication, GlusterFS also supports asynchronous geo-replication, that is, replication to an entirely different GlusterFS cluster in another datacenter. A variant of this is cascading geo-replication, where the remote replica may itself have a remote replica, which in turn may have another, and so on, in a cascade.
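Geo-replication is configured per volume. A sketch, with a hypothetical remote hostname and assuming passwordless SSH from the local master nodes to the remote cluster has already been set up:

```shell
# One-way asynchronous replication of the local "videos" volume to a
# volume of the same name on a remote cluster.
gluster volume geo-replication videos remote1.example.com::videos create push-pem
gluster volume geo-replication videos remote1.example.com::videos start
gluster volume geo-replication videos remote1.example.com::videos status
```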
With that brief introduction to GlusterFS architecture, let’s move on to planning a Gluster deployment on the Linode cloud.
So how many Linodes do you need for X TB of storage, and what type of nodes should they be? Apart from pricing, there are a number of technical factors to understand here.
Storage capacity and replication
Replication is primarily for data durability and secondarily for availability.
A minimum replica factor of 2 means every file in the cluster has one additional copy, doubling the total storage capacity required to 2X TB.
A replica factor of 3 means every file has two additional copies, trebling the total storage required to 3X TB. Any factor higher than 3 in the same datacenter is probably overkill; geo-replication is probably a better solution if your data really warrants that many replicas.
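The capacity math is simple enough to sanity-check in a couple of lines of shell; here the usable (pre-replication) requirement is taken as 10 TB:

```shell
usable_tb=10

# Raw capacity needed = usable capacity * replica factor.
for replica in 1 2 3; do
    raw_tb=$(( usable_tb * replica ))
    echo "replica ${replica}: ${raw_tb} TB raw storage"
done
# prints:
# replica 1: 10 TB raw storage
# replica 2: 20 TB raw storage
# replica 3: 30 TB raw storage
```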
Apart from capacity, the replica count also affects latency. Higher replica factors are not a problem for read-heavy loads, but they worsen latency for moderate- and heavy-write loads.
Disks and filesystems
Storage disks should be separate from OS disks. On a Linode, this means at least three disks — one for the brick (there can be more than one, too), one for the OS and one for swap. The OS disk can have any filesystem recommended for that OS, but XFS is recommended for brick disks (ext4 or btrfs will work but are not recommended).
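Preparing a brick disk is a one-time setup per node. A sketch, with a hypothetical device name (the 512-byte inode size is the commonly recommended XFS setting for Gluster, as it leaves room for extended attributes):

```shell
# Format the dedicated brick disk as XFS with 512-byte inodes.
mkfs.xfs -i size=512 /dev/sdc

# Mount it persistently and create the brick directory inside the mount point.
mkdir -p /data
echo '/dev/sdc /data xfs defaults 0 0' >> /etc/fstab
mount /data
mkdir -p /data/brick1
```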
Gluster’s architectural elements, such as quorum-based strong consistency, client-side failover, and a client-side fanout architecture for replication, mean that a single cluster should ideally have between three and a hundred or so storage nodes. A small number of high-capacity nodes is better than a large number of low-capacity ones.
Incoming data is unlimited and free
Incoming data on Linodes is completely free, unlimited and unmetered, and clocks in at 40 Gbps for all configurations, all of which is good for writing large volumes of data to a GlusterFS cluster. The only factor to analyze is whether 40 Gbps is fast enough for client applications; all other incoming network factors, like costs or quotas, can be ignored.
Analyze if outgoing network quotas affect your applications
Outgoing data quotas (that is, the volume of data that can leave a Linode datacenter for free) are pooled, which means the total limit is proportional to the number of nodes under your Linode account. Outgoing data rates, however, are not pooled.
For external-facing applications that involve high volumes of outgoing data, such as video sites or Gluster geo-replication, you may want to maximize the number of nodes to benefit from the pooling of limits.
For external-facing applications that require a high rate of outgoing data, select the nodes with the highest data transfer rates.
For internal applications that require high rates of internal data transfer, such as distributed processing systems like Hadoop or Spark that store incoming data, process it, and serve internal consumers like reporting, outgoing quotas don’t matter at all, but network throughput between nodes does. The throughput of a Linode over its private network interface is the same as its plan’s outgoing throughput, so choose plans with the highest throughputs, such as the 10 Gbps plans.
Deploy DNS for private network
Volumes should refer to bricks via hostnames rather than IP addresses, because changing a node’s IP address in a Gluster trusted pool is inconvenient. Since clients outside the datacenter also see the same hostnames for a mounted volume, these hostnames should resolve to public IPs for clients and private IPs for other storage nodes. Deploy a Split-horizon DNS server for this.
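With BIND, for example, the split-horizon setup can be sketched with two views (the zone name, network range and zone-file paths here are hypothetical):

```
// named.conf: storage nodes querying from the private network get private
// IPs; everyone else gets public IPs for the same hostnames.
view "internal" {
    match-clients { 192.168.128.0/17; };
    zone "gluster.example.com" {
        type master;
        file "/etc/bind/zones/gluster.example.com.private";
    };
};

view "external" {
    match-clients { any; };
    zone "gluster.example.com" {
        type master;
        file "/etc/bind/zones/gluster.example.com.public";
    };
};
```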
Gluster supports application-layer security by accepting file operations requests only from authorized client IP addresses, and supports transport-layer security for those communications. These same authorized client IP addresses can be used in the network firewall layer to protect ports on all storage nodes that need to be kept open to support Gluster’s client-side fanout architecture.
Additionally, since Linode does not provide an internal firewall to protect private IP addresses from other machines in the same datacenter, the private network interface should be as strictly firewalled as the public interface. Transport security is not necessary for this private communication because it can’t be sniffed by other machines in the datacenter, but if the data is confidential or subject to information security standards, it should be implemented.
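A sketch of both layers, with hypothetical client addresses and a hypothetical volume name: restrict the volume to known clients with Gluster's own access control, and mirror the same list in iptables on every storage node.

```shell
# Application layer: only these client IPs may mount and use the volume.
gluster volume set videos auth.allow '203.0.113.10,203.0.113.11'

# Network layer: allow the Gluster management port (24007) and brick ports
# (49152 and up on recent releases) only from the same clients and peers.
for ip in 203.0.113.10 203.0.113.11; do
    iptables -A INPUT -p tcp -s "$ip" --dport 24007 -j ACCEPT
    iptables -A INPUT -p tcp -s "$ip" --dport 49152:49251 -j ACCEPT
done
iptables -A INPUT -p tcp --dport 24007 -j DROP
iptables -A INPUT -p tcp --dport 49152:49251 -j DROP
```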
Lastly, to reduce the overall attack surface, it’s a good idea to put GlusterFS behind an HTTP interface or a proxy, and require external clients to connect via HTTP, WebDAV or something like SSHFS.
GlusterFS-Linode Configuration Options
Assume your application needs 10TB of storage initially with a single volume and no replication (bad idea!). The table below gives an idea of the trade-offs between price, data volume and data transfer rate:
With a replication factor of 2:
Notice that the Linode 64GB configuration results in the least overall cost, with a respectable outgoing transfer quota of 360 TB relative to a storage capacity of 10 TB, and an outgoing transfer speed of 6 Gbps.
GlusterFS Performance on Linode
I explored how a GlusterFS deployment performs by creating a small cluster on Linode and measuring the IO throughputs of distributed, replicated and striped volumes while subjecting them to a series of iozone tests. All tests were done on machines in the Newark, NJ datacenter, which I presumed is probably one of the busier ones and, therefore, likely to give realistic results over time.
Factors affecting performance
In addition to the Gluster volume type, read throughput of a cluster depends on
- read throughputs of disks in its storage nodes
- outgoing network throughputs of storage nodes
- incoming network throughputs of client nodes.
Write throughput of a cluster depends on
- write throughputs of disks in its storage nodes
- incoming network throughputs of storage nodes
- outgoing network throughputs of client nodes.
Among these factors, cluster administrators have no control over disk throughputs, but they can control network throughputs by selecting the right Linode plans. Ignoring incoming network throughput, which is 40 Gbps (or 5 GB/s) for all plans, the remaining factor under administrator control is the outgoing network throughput.
Raw disk throughputs
I started by analyzing raw disk throughputs for sequential and random read and write loads in four modes: normal cached mode, fully synced (file integrity completion) cached mode, data-only synced (data integrity completion) cached mode, and non-cached direct IO mode. These are the last-stage throughputs seen by the Gluster daemons reading from or writing to bricks.
For file sizes less than available RAM, read throughputs reached 4.2–5GB/s in cached modes, and 2.4–9.7 GB/s in direct mode (with larger block sizes giving far higher throughputs).
For file sizes more than available RAM, they reached around 1.2–1.3 GB/s.
Write throughputs were around 850 MB/s in normal cached mode, 400–800 MB/s in cached sync modes, and between 870 MB/s and 1.6 GB/s in direct mode.
Since similar disk throughput values were observed on low-end and high-end Linode plans, it appears that disk performance is the same for all plans. So, assuming normal cached mode is the norm, most applications would see around 850 MB/s writes and 1.2 GB/s reads on average on Linode nodes.
This means that in a GlusterFS Linode cluster, for the network to keep up with the disks and avoid becoming the bottleneck, storage nodes should have at least 6Gbps outgoing, which corresponds to Linode 64GB (with 1.1TB storage) and higher plans.
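The back-of-envelope conversion behind that claim, in decimal units and ignoring protocol overhead:

```shell
# Convert a plan's outgoing bandwidth in Gbps to MB/s: 1 Gbps = 1000/8 MB/s.
gbps_to_mbs() { echo $(( $1 * 1000 / 8 )); }

disk_write_mbs=850              # observed cached-mode disk write throughput
plan_out_mbs=$(gbps_to_mbs 6)   # Linode 64GB plan: 6 Gbps outgoing

echo "6 Gbps outgoing = ${plan_out_mbs} MB/s vs ${disk_write_mbs} MB/s disk writes"
# prints: 6 Gbps outgoing = 750 MB/s vs 850 MB/s disk writes
```

At 750 MB/s, a 6 Gbps plan roughly keeps pace with the ~850 MB/s disk writes, while lower plans (1–4 Gbps, i.e. 125–500 MB/s) would leave the disks waiting on the network.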
Single server tests
Once raw disk throughputs were known, I next set up a simple non-replicated, non-striped Gluster volume consisting of a single 1 TB brick on a Linode 64GB with 6 Gbps outgoing bandwidth. The test client was a single Linode 2GB with 125 Mbps outgoing (~15 MB/s), performing IO operations from a single process.
As expected, the client’s outgoing network throughput proved to be the bottleneck, capping write throughput at ~15 MB/s, which matches its 125 Mbps network limit. Had a client with higher network throughput been selected, write throughput would have been much higher.
Expanding the test workload to two Linode 2GB client nodes, each doing IO from two processes, doubled all aggregate throughputs.
Aggregate read throughputs reached 780 MB/s to 1.6 GB/s.
Write throughput of each client node was still limited to ~15 MB/s by the client’s outgoing network, but the aggregate doubled as expected.
The storage pool was then expanded to two Linode 64GB servers working in distributed mode. Since the storage pool was doubled, doubling the workload still gave the same throughput for every client, as expected.
What happens if the volume is a striped volume with two bricks? There was no obvious deterioration — which is good — but no obvious improvement, either, which suggests the cluster was not stressed enough by this workload.
What happens if it’s a replicated volume with two bricks? Read throughput did not drop, but write throughputs halved, because every write has to be performed twice: with Gluster’s client-side fanout, the client itself writes to both replica bricks, splitting its outgoing bandwidth.
This makes the outgoing network throughput of client nodes even more critical for replicated volumes.
Comparison with Amazon EFS
I was also interested in how a self-hosted Gluster deployment compares with Amazon’s distributed file system service, EFS.
I set up a fairly similar test workload with two t2.small clients (2 GB RAM and 1 vCPU, a configuration similar to a Linode 2GB) and an EFS filesystem in the same availability region with max throughput performance mode enabled.
Iperf reported outgoing network bandwidth of 580–640Mbps from the EC2 test clients, which is about five times what Linode 2GB supports (although later investigation suggested this was probably a burst value and would have probably dropped down after some time).
In single stream tests from a single t2.small instance, EFS write throughput reached 83–97MB/s in regular cached mode and about 15MB/s in synced and direct modes.
In multi stream tests with two t2.small clients running two processes each, it reached about 100MB/s in aggregate write throughputs, about 150MB/s in aggregate sequential read throughput, and about 300MB/s in aggregate random read throughput.
In contrast, the Linode cluster had achieved about five times higher read throughputs: 800 MB/s and 1.6 GB/s.
Deploy GlusterFS on Linode
Glusterfs-linode is a set of interactive scripts to easily create and provision secure GlusterFS clusters on the Linode cloud. They are written in Python, use Ansible for deployment and support Ubuntu 14.04/16.04 LTS, Debian 8 and CentOS 7.
This is still a work in progress, and the first production-ready release is expected around February 15, 2017. See https://github.com/pathbreak/glusterfs-linode for installation instructions and documentation. Contributions and suggestions are welcome.
GlusterFS on Linode showed excellent read throughputs when deployed on higher configuration plans.
Since write throughput depends on outgoing network bandwidths of writing nodes, applications should be deployed in such a way that components writing to the storage cluster are deployed on higher configuration nodes or on the storage nodes themselves.
From a cost perspective for big data, deploying compute components on the storage cluster nodes is more economical than deploying them separately. Even then, it’ll probably still cost more than EFS, but the higher performance and the savings from pooled network quotas may be worth it.
About me: I’m a software consultant and architect specializing in big data, data science and machine learning, with 14 years of experience. I run Pathbreak Consulting, which provides consulting services in these areas for startups and other businesses. I blog here and I’m on GitHub. You can contact me via my website or LinkedIn.
Please feel free to share below any comments or insights about your experience using a distributed file system or, in particular, GlusterFS. And if you found this blog useful, consider sharing it through social media.
While Karthik’s views and cloud situations are solely his and don’t necessarily reflect those of Linode, we are grateful for his contributions.