How does Tantan meet up with MinIO

于乐
Aug 26, 2019


It has been a while since we at Tantan (a globally popular Chinese dating app) shared our experience with MinIO at GopherCon China 2019. If you are interested in that talk, you can watch it here.

Tantan and MinIO liked each other

The Tantan team has worked with the MinIO software for quite some time now and has found it to be great in both design and implementation. There are, however, some limitations that we have needed to work around. This post describes our work and our workarounds.

Before reading this article, you should have some knowledge of MinIO. You can start with the quick start guide.

Two Problems

Infinite bucket?

When running MinIO in distributed mode, each tenant is limited to a minimum of 2 servers and a maximum of 32 servers.

The team at MinIO explains their thinking as follows:

It was a design philosophy of limiting each cluster by the maximum failure domain operators can tolerate. Buckets do not scale indefinitely even on Amazon S3. Listing will take forever to complete. Any failure of cluster or breach will lead to large catastrophic failure. That is why we force users to not do that.

While reasonable, this is not convenient if you are used to object storage services like Amazon S3. Clients usually don’t care about the size of buckets or clusters; they just PUT and GET objects from the bucket as if it were infinite. MinIO addresses the recommended limitation by providing a federation solution that coordinates multiple clusters: you can look up which cluster a bucket belongs to via CoreDNS and etcd. Each bucket’s size is still limited, but you can create an unlimited number of buckets.

Keeping the size and count of each cluster below a reasonable number is meaningful from a management perspective, so we focused on coordinating multiple clusters rather than scaling a single cluster.

The solution is pretty simple: Nginx!

We add an HTTP header, x-amz-meta-minio-cluster-id, to each request sent to MinIO, and Nginx forwards the request to the corresponding cluster based on the value of that header.

How Nginx forwards requests to different clusters

The first step is to add this header to every request, which is easy to do with the minio-go package.

// Create the MinIO client.
client, err := minio.New(endpoint, accessKeyID, secretAccessKey, secure)
if err != nil {
	panic(err)
}
tr, err := minio.DefaultTransport(secure)
if err != nil {
	panic(err)
}
// httpHeaderClusterID holds the header name Nginx routes on,
// e.g. "x-amz-meta-minio-cluster-id" (see the Nginx map below).
header := make(http.Header)
header.Add(httpHeaderClusterID, clusterID)
client.SetCustomTransport(newTransportWithFixedHeader(header, tr))
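Note that newTransportWithFixedHeader is not part of minio-go; it is a small custom wrapper around http.RoundTripper. A minimal sketch, with type and function names of our own choosing, might look like this:

import "net/http"

// transportWithFixedHeader injects a fixed set of headers into every
// outgoing request before delegating to the wrapped transport.
type transportWithFixedHeader struct {
	header http.Header
	rt     http.RoundTripper
}

func newTransportWithFixedHeader(header http.Header, rt http.RoundTripper) http.RoundTripper {
	return &transportWithFixedHeader{header: header, rt: rt}
}

func (t *transportWithFixedHeader) RoundTrip(req *http.Request) (*http.Response, error) {
	// Clone the request so the caller’s copy is not mutated.
	clone := req.Clone(req.Context())
	for k, vs := range t.header {
		clone.Header.Del(k)
		for _, v := range vs {
			clone.Header.Add(k, v)
		}
	}
	return t.rt.RoundTrip(clone)
}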

How can Nginx tell which cluster a request is destined for from the header?

The following is a sample Nginx config snippet:

upstream c00 {
    server 10.189.153.36:9000;
    server 10.189.153.37:9000;
    server 10.189.153.38:9000;
    server 10.189.153.39:9000;
}

upstream c01 {
    server 10.189.153.32:9000;
    server 10.189.153.33:9000;
    server 10.189.153.34:9000;
    server 10.189.153.35:9000;
}

# Map the x-amz-meta-minio-cluster-id request header to an upstream;
# unrecognized or missing values fall back to c00.
map $http_x_amz_meta_minio_cluster_id $clusters {
    default "c00";
    c00     "c00";
    c01     "c01";
}

server {
    listen 80;
    ignore_invalid_headers off;
    client_max_body_size 0;
    proxy_buffering off;

    location / {
        proxy_set_header Host $http_host;
        proxy_pass http://$clusters;
    }
}

As you can see, we use an Nginx map ($clusters) to route each request to the matching upstream dynamically, based on the value of the header. Requests without a recognized header fall back to the default cluster c00.

A question that comes up is, “How can I get the cluster ID of the objects I uploaded?”

The answer is that the client that generates the object key is responsible for encoding the cluster ID into the key, so that the cluster ID can be decoded from the key later.
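As an illustration, one simple scheme is to prefix the key with the cluster ID. The exact format below is hypothetical; the point is only that the key itself carries the cluster ID:

import (
	"fmt"
	"strings"
)

// encodeKey embeds the cluster ID into the object key as a prefix,
// e.g. "c01/avatars/12345.jpg".
func encodeKey(clusterID, objectKey string) string {
	return clusterID + "/" + objectKey
}

// decodeKey recovers the cluster ID (used for the Nginx routing header)
// and the original object key.
func decodeKey(key string) (clusterID, objectKey string, err error) {
	parts := strings.SplitN(key, "/", 2)
	if len(parts) != 2 {
		return "", "", fmt.Errorf("no cluster id encoded in key: %q", key)
	}
	return parts[0], parts[1], nil
}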

Small files performance

Another challenge Tantan encountered was MinIO’s performance when handling small files on HDDs. At Tantan we mainly store media images, roughly 200 KiB each. Random reads and writes on HDDs are notoriously slow, and MinIO amplifies the issue by storing and fetching data across multiple drives. Of course, you can deploy a MinIO cluster on SSDs if you don’t care about the cost; here we discuss a more economical approach.

First of all, let me show you the write performance of the original MinIO on small files.

Write latency with a 64 KiB payload on a 12-node cluster; each node was equipped with 12 × 6 TB HDDs

While investigating various object storage systems, we noticed that some of them merge small files into bigger ones on PUT. This improves write performance dramatically because random writing is avoided.

So we decided to re-implement the POSIX part of MinIO as a file volume: small files are saved into a number of bigger (4 GiB) files, and the offset and big-file ID are recorded in RocksDB.
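Here is a much-simplified sketch of the write path. The type and function names are ours, an in-memory map stands in for RocksDB, and durability concerns are trimmed:

import (
	"fmt"
	"os"
	"sync"
)

const maxBigFileSize = 4 << 30 // big files grow up to 4 GiB

// indexEntry records where one small object lives: which big file,
// at what offset, and how many bytes. The real system keeps these
// entries in RocksDB; here a map stands in for it.
type indexEntry struct {
	FileID uint32
	Offset uint64
	Size   uint32
}

type volume struct {
	mu      sync.Mutex
	current *os.File // big file currently being appended to
	fileID  uint32
	offset  uint64 // next append position in the current big file
	index   map[string]indexEntry
}

// newVolume creates an empty volume rooted in the current directory.
func newVolume() *volume {
	return &volume{index: make(map[string]indexEntry)}
}

// rotate closes the current big file and opens a fresh one.
func (v *volume) rotate() error {
	if v.current != nil {
		if err := v.current.Close(); err != nil {
			return err
		}
	}
	v.fileID++
	f, err := os.OpenFile(fmt.Sprintf("vol-%06d.data", v.fileID),
		os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	v.current, v.offset = f, 0
	return nil
}

// put appends the object to the current big file and records its
// location in the index. Random small writes become sequential appends.
func (v *volume) put(key string, data []byte) error {
	v.mu.Lock()
	defer v.mu.Unlock()

	if v.current == nil || v.offset+uint64(len(data)) > maxBigFileSize {
		if err := v.rotate(); err != nil {
			return err
		}
	}
	if _, err := v.current.WriteAt(data, int64(v.offset)); err != nil {
		return err
	}
	v.index[key] = indexEntry{FileID: v.fileID, Offset: v.offset, Size: uint32(len(data))}
	v.offset += uint64(len(data))
	return nil
}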

Let’s have a look at the latency after optimization.

Write latency with a 64 KiB payload on a 12-node cluster; each node was equipped with 12 × 6 TB HDDs

Everything went as expected, with the exception of read performance, which was much worse than before. Note: you have to disable the kernel page cache when benchmarking, otherwise it results in misleading GET numbers.

By analyzing the flow of a GET, we figured it out. One GET operation consists of two parts: reading the metadata and reading the data. Both are stored in big files, so each of them takes one RocksDB lookup for the index entry plus one read of the payload itself. That makes four I/O operations in total, twice as many as before. In practice it is usually three, because the two index entries sit physically close together in RocksDB, so the second lookup is served from cache or memory.

To lower read latency, the obvious approach is to reduce the number of reads. So we encoded the whole metadata file with protobuf to shrink it and stored it directly in RocksDB, and we equipped each server with an SSD to make index reads fast. This allows Tantan to fetch an object with just one I/O operation per node. Cool!
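Continuing the earlier sketch, the read path then becomes one index lookup (in the real system this hits RocksDB on the SSD, which now also holds the protobuf-encoded metadata) followed by a single positioned read from the big file on HDD:

// get performs the optimized read: one index lookup, then exactly one
// positioned read against the big file on HDD.
func (v *volume) get(key string) ([]byte, error) {
	v.mu.Lock()
	entry, ok := v.index[key]
	v.mu.Unlock()
	if !ok {
		return nil, fmt.Errorf("object %q not found", key)
	}

	f, err := os.Open(fmt.Sprintf("vol-%06d.data", entry.FileID))
	if err != nil {
		return nil, err
	}
	defer f.Close()

	buf := make([]byte, entry.Size)
	// The only HDD I/O for this GET: one positioned read.
	if _, err := f.ReadAt(buf, int64(entry.Offset)); err != nil {
		return nil, err
	}
	return buf, nil
}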

Besides, you may decide to set MINIO_ERASURE_SET_DRIVE_COUNT to a smaller value to reduce the number of drives involved in each request, which does speed things up.

Another trick

Heal exactly one disk

Disks fail frequently. In MinIO, you have to heal the entire cluster after replacing a disk, a long job that may last several days when there are millions of objects.

To address this, Tantan came up with a new API that heals exactly those parts of objects that previously existed on the failed disk. Tantan no longer has to check the integrity of all parts of an object, nor heal parts on any disk other than the failed one. It is much faster!

We dump the object list directly from RocksDB and heal the objects on the specified disk at a client-controlled QPS.
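A sketch of the driving loop, with a hypothetical healObject standing in for the new heal API and a time.Ticker enforcing the client-controlled QPS:

import (
	"log"
	"time"
)

// healObject is a hypothetical stand-in for the custom heal API: it
// repairs only the parts of one object that lived on the failed disk.
func healObject(disk, object string) error {
	// ... issue the heal request for object, restricted to disk ...
	return nil
}

// healDisk heals every object in the dumped list at a fixed QPS so
// that heal traffic does not starve foreground requests.
func healDisk(disk string, objects []string, qps int) {
	ticker := time.NewTicker(time.Second / time.Duration(qps))
	defer ticker.Stop()
	for _, obj := range objects {
		<-ticker.C // wait for the next QPS slot
		if err := healObject(disk, obj); err != nil {
			log.Printf("heal %s on %s failed: %v", obj, disk, err)
		}
	}
}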

MinIO recommends running a deep heal, which checks the integrity of every object, every six months to guard against bit rot.

Summary

Currently, Tantan has four clusters with more than 1 PB under management. The MinIO team has been very kind in supporting us over these months, and their Slack channel is very active and delivers rapid responses. If you are looking for a better object storage system, MinIO should be one of your candidates.
