Building an Object Storage Service for your Application with DigitalOcean Block Storage

When the squad wants to store their meme stash as objects, but hates trying to figure out what S3 costs.

Many cloud providers offer extensive storage options, usually in the form of block storage at a per-GB/month price, but a block store and an object store are two different things, and, depending on what you want to do, they are usually used for different types of data.

Some providers offer object storage as a service (Rackspace’s Cloud Files Object Storage, powered by OpenStack Swift; SoftLayer’s similar OpenStack Swift-based Object Storage product, extended by their own stellar APIs; and, most notably, Amazon’s S3 storage service), but using any of these products can become pricey, and the cost can be difficult to estimate, even with tools engineered to do so.

DigitalOcean now offers Block Storage, which is an excellent place to start when building your own S3 API-compatible object store. One benefit is that you manage the space by block device rather than on a pay-as-you-go basis, so you control your quota in aggregate and can add capacity as required.

My interest stemmed from trying to find a good, fairly universal use case for Block Storage that might address a need many of our users may or may not have but, until now, couldn’t address effectively: automating the deployment of a storage cluster, and not just any storage cluster, but a low-cost (yet still highly-available) one powered by some kind of Object Storage/S3-compatible overlay. I landed on Minio. It seemed like a reasonable place to start, provided you don’t mind being responsible for your own “eleven nines” of reliability.

Minio, aside from being equipped with an S3-compatible API (meaning you can interact with it from your application much as you would with AWS’ object storage service), has seen adoption in some of the most robust products currently available for running a cloud environment; for example, Deis, a PaaS utility, leverages Minio for a number of its features.

Here is what, ultimately, was produced.

When this was originally written, Block Storage was not yet available on DigitalOcean, so the above applies with a few caveats:

  1. Your storage volume would be the set of Block Storage volumes allocated to a droplet (up to 112 TB per droplet, at 16 TB per drive, and this space can be pooled; I’ll cover this in a moment).
  2. Because GlusterFS is being used for redundancy (and, if you’d like, to pool resources further), you will want to start Minio on the Gluster mountpoint rather than the volume mountpoint, as sketched below. If you do not want this replication, you can use the volume mountpoint directly.
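
To make the second caveat concrete, here is a minimal sketch of the difference. It assumes the Gluster volume is named object-storage (as created later in this guide), that /mnt/gluster is a hypothetical mountpoint for it, and that the minio service user from the Supervisor section exists:

# With replication: mount the Gluster volume and point Minio at that mountpoint
mount -t glusterfs localhost:/object-storage /mnt/gluster
sudo -H -u minio /home/minio/minio server /mnt/gluster

# Without replication: point Minio directly at the Block Storage mountpoint
sudo -H -u minio /home/minio/minio server /mnt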

The components required were:

  1. A lot of storage space. In your DigitalOcean dashboard, when you create a new droplet in the NYC1 and SFO2 regions, you will have the option to add drives:

You can add up to 7 of these drives, at up to 16 TB each.

To make a logical device of multiple drives (you can skip this step if you plan to only use a single drive), you can

i) Create a software RAID (a minimal mdadm sketch follows this list)

ii) Use btrfs (recommended)
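
If you would rather go the software RAID route, a rough sketch with mdadm might look like the following; the RAID level, device names, and filesystem are assumptions to adjust for your own drives:

sudo apt-get install mdadm -y
# Combine four of the drives into a single striped-and-mirrored md device (RAID 10 as an example)
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd{a,b,c,d}
# Format and mount the resulting device like any other volume
mkfs.ext4 /dev/md0
mount /dev/md0 /mnt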

If you plan to use btrfs, the process is very straightforward.

Verify the ID of the devices:

root@bucket:~# blkid
/dev/vda1: LABEL="DOROOT" UUID="050e1e34-39e6-4072-a03e-ae0bf90ba13a" TYPE="ext4"
/dev/sda: UUID="f543d6db-4963-4336-8d82-1f0250a3b797" UUID_SUB="4588ce05-4325-4db0-921d-8b64c30ff79c" TYPE="btrfs"
/dev/sdc: UUID="f543d6db-4963-4336-8d82-1f0250a3b797" UUID_SUB="34f51698-bd6a-42fb-bd47-3787ac705c7b" TYPE="btrfs"
/dev/sdb: UUID="f543d6db-4963-4336-8d82-1f0250a3b797" UUID_SUB="889990d4-6460-4a0d-b155-df3728045297" TYPE="btrfs"
/dev/sdd: UUID="f543d6db-4963-4336-8d82-1f0250a3b797" UUID_SUB="c2dc8dbc-a943-421c-963e-55d181e807b1" TYPE="btrfs"
/dev/sdf: UUID="f543d6db-4963-4336-8d82-1f0250a3b797" UUID_SUB="5967bd6b-edb5-4db2-b628-6585e3956df3" TYPE="btrfs"
/dev/sde: UUID="f543d6db-4963-4336-8d82-1f0250a3b797" UUID_SUB="dd10b2cf-6489-4756-b35a-b6804aafe1e7" TYPE="btrfs"
/dev/sdg: UUID="f543d6db-4963-4336-8d82-1f0250a3b797" UUID_SUB="0a91b12b-7354-4d89-8be5-1647548a3cbb" TYPE="btrfs"

In my example, these are already formatted, but on your fresh system, you’ll see your devices listed with the `/dev/sdX` format; in this case `/dev/sda` through `/dev/sdg`.

The fastest way to pool this space into a single filesystem is to install `btrfs-tools`:

sudo apt-get install btrfs-tools -y

and then create and mount the filesystem (this will take a while for large volumes, or a high number of drives):

mkfs.btrfs /dev/sd{a,b,c,d,e,f,g}
mount /dev/sda /mnt
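
You will probably also want the filesystem mounted automatically at boot; a hypothetical /etc/fstab entry, using the shared filesystem UUID from the blkid output above, might look like:

# /etc/fstab: mount the pooled btrfs filesystem by UUID at boot
UUID=f543d6db-4963-4336-8d82-1f0250a3b797  /mnt  btrfs  defaults,nofail  0  0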

Adding new devices down the road, for example, is simple:

btrfs device add /dev/sd{whatever} /mnt

and then balance it with the rest of the volume:

btrfs filesystem balance /mnt

There are many options available with btrfs for how these drives are configured and how metadata is managed (and you can, indeed, still configure analogous RAID levels in this array, with a simpler interface than utilities like `mdadm`).
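
For example, a minimal sketch of converting the data and metadata profiles of an existing multi-device filesystem to RAID 1 (the profile choice here is just an illustration):

# Convert data and metadata to the RAID 1 profile across the pooled devices
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
# Confirm how data and metadata are now laid out
btrfs filesystem df /mnt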

As I said, if you don’t care about replication, you can skip to part 3 (Block Storage is very reliable upstream on DigitalOcean’s end, so integrity is less of an issue if your application does not require the data to be available across datacenters, or in a volume that requires more availability).

2. Some way to replicate data across hosts:

i) In my configuration, I had four servers, each with a 4 TB block device attached.

ii) A load-balancer was configured to point to the fastest responding Minio installation.

iii) The data on these volumes was replicated using GlusterFS; in this instance, I configured one main disk, and 3 remote replicas.
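
Before any peering can happen, each node needs the GlusterFS server installed and a brick directory to contribute to the replicated volume (in this guide, /storage; the package name below assumes Ubuntu):

sudo apt-get install glusterfs-server -y
# The directory each node contributes as a brick
mkdir -p /storage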

You would start by probing each of the servers from the other (over the private network, if they’re all in a single region):

gluster peer probe <IP of the other server>

and once you confirm that the peering is set:

gluster peer status

you can proceed to create the volume (the directory you plan to replicate) and set the replica:

gluster volume create object-storage replica 2 transport tcp 1.1.1.1:/storage 2.2.2.2:/storage
gluster volume start object-storage
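
With the four servers described earlier (rather than this two-server example), the create command might instead look something like the following, with the IPs as placeholders for your nodes’ private addresses:

gluster volume create object-storage replica 4 transport tcp 1.1.1.1:/storage 2.2.2.2:/storage 3.3.3.3:/storage 4.4.4.4:/storage
gluster volume start object-storage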

Since Minio reads from the disk directly, a Gluster mount was not required, so you can limit connections to the localhost and disable NFS:

gluster volume set object-storage auth.allow localhost
gluster volume set object-storage nfs.disable on
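
At any point, you can check the volume’s status and confirm the options took effect with:

gluster volume info object-storage
gluster volume status object-storage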

This data was kept in-sync over a private network, which I recommend using if your provider offers it.

3. Keeping the Minio Server running

I did this using Supervisor, which is super easy to install and configure. Since Minio has to run as a non-root user, I set up a script for Supervisor to call to start the service as my user:

[program:minio]
command=/root/minio.sh
autostart=true
autorestart=true
stderr_logfile=/var/log/minio.err.log
stdout_logfile=/var/log/minio.out.log

Since Supervisor is running as root by default, and Minio requires that it be run by a non-root user, I set `command` to run the following script (/root/minio.sh):

#!/bin/bash

# or /mnt directly, if you did not set up replication
sudo -H -u minio bash -c "/home/minio/minio server /storage"
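
This assumes a minio user already exists and owns the storage path, and that the [program:minio] block above has been placed where Supervisor will pick it up (typically /etc/supervisor/conf.d/ on Ubuntu). Roughly:

# Create the service account and hand it the storage directory
useradd -m minio
chown -R minio:minio /storage
# Load the new program definition and check its state
supervisorctl reread
supervisorctl update
supervisorctl status minio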

You’ll notice that the volume you’re replicating is the same volume you’re starting Minio to manage (/storage). This will ensure that you’re sharing the same replicated space.

Once you’ve downloaded the Minio client, and pointed your storage hostname to your load-balancer address (with each of the nodes’ private IPs as the backend addresses), you can edit the client config (usually in `~/.mc/config.json`):

"myminio": {
  "url": "http://coolstorageusa.biz:9000",
  "accessKey": "<Your Access Key>",
  "secretKey": "<Your Secret Key>",
  "api": "S3v4"
}

The server-side config is a little more straightforward: You would set your Access Key and Secret on each of the hosts (using the same config) in `/home/minio/.minio/config.json`:

{
  "version": "2",
  "credentials": {
    "accessKeyId": "<Access Key of your choosing>",
    "secretAccessKey": "<Secret Key of your choosing>",
    "region": "us-east-1"
  },
  "mongoLogger": {
    "addr": "",
    "db": "",
    "collection": ""
  },
  "syslogLogger": {
    "network": "",
    "addr": ""
  },
  "fileLogger": {
    "filename": ""
  }
}

so that when Minio is started, all of the nodes will use the same credentials as stored in your client, and you are ready to connect through the load balancer.

The last step of the Minio setup is to connect your client to the host (in this case, just referencing the LB address and the shared credentials):

mc config host add myminio http://coolstorageusa.biz myaccesskeyissuitablylong secret/key/is/definitely/not/p@$$w0rd

If your domain is “coolstorageusa.biz” (see the above client config), for example, you might point that to a server running HAProxy and your backends may look like this:

backend storage_pool
    server obj-1 node-1.coolstorageusa.biz:9000 check
    server obj-2 node-2.coolstorageusa.biz:9000 check
    ...
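
For completeness, a minimal frontend to pair with that backend might look like the following; the frontend name and listening port are assumptions, and this is a sketch rather than a hardened config:

frontend storage_in
    bind *:9000
    default_backend storage_pool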

Or, if you’re more security conscious, start Minio on your private interface, and use those addresses in your HAProxy config as the backend, and then firewall your private environment to ensure that connections to port 9000 only come from your load balancer:

iptables -A INPUT -p tcp --dport 9000 -s <LB IP Address> -j ACCEPT
iptables -A INPUT -p tcp --dport 9000 -j DROP
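
Rules added this way will not survive a reboot on their own; one common approach on Ubuntu is the iptables-persistent package:

sudo apt-get install iptables-persistent -y
# Save the current ruleset so it is restored at boot
sudo netfilter-persistent save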

Securing your Minio installation’s UI with Let’s Encrypt, for example, is trivially simple, but in a highly-available setup, this can also be done at the load balancer level with HAProxy (using your Minio web UI endpoints as your backends).
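
As a sketch of that load-balancer approach, HAProxy can terminate TLS on the frontend if you hand it a combined certificate-and-key PEM; the path below is hypothetical:

frontend storage_in_tls
    bind *:443 ssl crt /etc/haproxy/certs/coolstorageusa.biz.pem
    default_backend storage_pool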

Because this is just an example, and not production-ready out of the box, I’ll skip the above security recommendations for the sake of demonstration, but everything will still function in much the same way.

With the setup completed, you should be able to log into http://coolstorageusa.biz:9000 with your access credentials (which you can set manually in your config) and add a file through the web UI.
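
If you would rather do the same from the CLI, the equivalent with mc is to create a bucket and copy a file into it, for example:

mc mb myminio/memes
mc cp ./12813981_825573214238454_8396023985127052921_n.jpg myminio/memes/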

The `mc` CLI will reflect the change when you hit your domain:

jmarhee@iampizza ~ $ mc ls myminio
[2016-03-24 09:04:52 CDT] 0B memes/
jmarhee@iampizza ~ $ mc ls myminio/memes/
[2016-03-24 13:24:40 CDT] 21KiB 12813981_825573214238454_8396023985127052921_n.jpg

Once the replica is created across however many “active” Minio nodes you configured HAProxy to use at once, you should see the file on all of the nodes if you check each manually. In my case, the data I’m storing is small, so I had two active and two passive storage nodes, but for the sake of availability you could, for example, have one active and three passive replicas.

After all of this, you should be able to access your files through `mc` as seen above, as well as begin generating public links:

From the above, to generate a link, just specify the file using this format:

minio_name/bucket_name/filename

it’ll look something like this:

jmarhee@iampizza ~ $ mc share download myminio/memes/12813981_825573214238454_8396023985127052921_n.jpg
URL: http://coolstorageusa.biz/memes/12813981_825573214238454_8396023985127052921_n.jpg
Expire: 7 days 0 hours 0 minutes 0 seconds
Share: http://coolstorageusa.biz/memes/12813981_825573214238454_8396023985127052921_n.jpg...

and there ya have it.

Final Thoughts

If you do wish to build this into a production-ready environment (Minio’s developers, at this time, recommend not using it in production), I recommend that TLS be used (covered above), that anything that can be done over private networks (internal to your datacenter or LAN and not accessible over the Internet) be done that way (i.e., Gluster peering, load balancing), and that your component droplets be firewalled to communicate only with each other where public access is not required (only your load balancer might be Internet-facing, for example).

Another neat, fairly extensive, production-quality object storage product, which likely includes many of the features I describe configuring yourself above, is Basho’s Riak S2 (which has a free community version).

There are also many good tutorials for deploying and using Ceph as a clustered object store in production, but this one is my favorite.

Like Minio, both of these alternatives can be deployed on DigitalOcean and can likewise make use of the new Block Storage product.