Using S3 as a datastore

Sai Peddy
2 min read · Feb 20, 2018


This post will be fairly short, covering some must-know information we encountered while trying to query our data via Thrift. Using S3 as a datastore is great in terms of cost and ease of use; however, S3 can also make certain aspects more complicated. One of our main concerns was being able to effectively use the partitions and metadata of the data to speed up querying. For that to work, the filesystem must support something called pushdown predicate, and two of the concerns below relate to it. Finally, knowing how S3 behaves as a datastore in terms of request capacity is also very useful when working with a high volume of data. Some of the features discussed in this post and the linked documentation are still experimental, so be warned that things may change. Test everything before taking my word for it 😛

The Concerns:

  1. Partial Reads
  2. Pushdown Predicate
  3. Efficient Storage

S3 is an object store, not a filesystem, so it lacks some capabilities you might otherwise expect. The first is partial reads, which let a reader fetch just the parts of a file it needs instead of reading/downloading the whole thing. This is one of the requirements for pushdown predicate, which we will get to shortly. This doc is pretty useful for figuring out which settings to apply. The docs suggest setting fs.s3a.experimental.input.fadvise to random in order to allow random IO, aka partial reads.
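As a minimal sketch of what that looks like in practice (assuming a Spark job reading through the S3A connector; the app name below is just a placeholder), the setting can be passed through the Hadoop configuration:

```python
from pyspark.sql import SparkSession

# Enable S3A random IO so readers can seek within objects rather than
# streaming them end to end. "random" suits seek-heavy columnar formats
# (ORC/Parquet); the default favors full sequential scans.
spark = (
    SparkSession.builder
    .appName("s3a-random-io")  # placeholder name
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
    .getOrCreate()
)
```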

The documentation also suggests enabling filter pushdown, which is the same thing as pushdown predicate. Pushdown predicate essentially refers to pushing the filtering of your data down to the filesystem/storage layer. For example, say you query a table holding a month's worth of data but only want a specific time range. Without pushdown filtering, your query would download the entire month's worth of data before filtering it down to the rows it needs to return. With the setting enabled, the filtering happens below the query engine and you download just the data the query requires (based on the partitioned fields).
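To make that concrete, here is a sketch of a query that benefits from pushdown, continuing the Spark session above; the bucket, path, and partition columns are hypothetical:

```python
# Assumes data laid out as s3a://my-bucket/events/year=.../month=.../day=...
# (hypothetical bucket and partition scheme).
df = spark.read.parquet("s3a://my-bucket/events/")

# With partition pruning and filter pushdown in effect (Spark's
# spark.sql.parquet.filterPushdown defaults to true), only the partitions
# and row groups matching this predicate are fetched from S3, not the
# whole month of data.
week = df.where("year = 2018 AND month = 2 AND day BETWEEN 12 AND 18")
week.count()
```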

Finally, another aspect to pay attention to is how your data is distributed when saved to S3. This usually only matters once your request rates exceed certain thresholds, which can be found in the official documentation. The documentation explains ways to optimize your S3 object keys for better performance (an object key is the equivalent of an S3 object's filepath).
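One technique that documentation describes is introducing some randomness at the front of the key, so that objects spread across many key prefixes instead of piling up in one lexicographic range (as date-based paths tend to do). A minimal sketch; the helper and layout are hypothetical:

```python
import hashlib

def spread_key(path: str) -> str:
    """Prepend a short hash prefix so sequential, date-ordered paths
    don't all land in the same S3 key range (hypothetical helper)."""
    prefix = hashlib.md5(path.encode("utf-8")).hexdigest()[:4]
    return f"{prefix}/{path}"

print(spread_key("logs/2018/02/20/part-0000.parquet"))
# -> something like "a1b2/logs/2018/02/20/part-0000.parquet" (hash illustrative)
```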
