Making Sense of Cloud File Storage Options

What to do when storage gets cloudy

Randy Pitcher II
Hashmap, an NTT DATA Company
9 min read · Aug 23, 2018

At Hashmap, we have clients with varying levels of cloud experience. Some customers are completely at home in a cloud operations console and have no trouble understanding how to make the most of the cloud and the value that it can bring to their organization. However, many others are still new to this space and deal with uncertainty when comparing cloud and on-premises costs.

For many organizations looking to make the move, one of the first use cases considered for the cloud is simple file storage. In this post, we’ll explore the most common ways to store files in AWS (equivalent options are available in Azure and Google) and discuss which tiers fit an example Oil and Gas refining use case.

Hashmap’s Data & Cloud Migration and Modernization Workshop is an interactive, two-hour experience for you and your team to help understand how to accelerate desired outcomes, reduce risk, and enable modern data readiness. We’ll talk through options and make sure that everyone has a good understanding of what should be prioritized, typical project phases, and how to mitigate risk. Sign up today for our complimentary workshop.

Benefits of Cloud Storage

Before we dive into the main file storage options in AWS, let’s take a moment to discuss the value case for a cloud-based storage approach.

In general, there are 2 big reasons to consider a cloud-driven storage approach:

Flexibility

When dealing with on-premises storage decisions, you must always plan for peak storage needs. Because of the generally long enterprise procurement process, organizations often end up purchasing and maintaining much more storage than they actually need. Additionally, the quality of available storage is locked in from one procurement cycle to the next, which means organizations may be years behind advances in storage technology due to very slow and expensive upgrade cycles.

By moving to the cloud, you only pay for exactly what you need; not just in terms of capacity, but also in terms of performance. As your needs change, you can adapt your services within hours, not months.

Durability

What would it mean to your business if all your data were lost? What would be the consequences of a week-long system outage? I imagine the results would be catastrophic.

With the distributed, disaster-ready AWS infrastructure managing your files, you do not need to worry about building multiple data centers with expensive redundancy layers across the globe. You can directly utilize one of the most robust and battle-proven storage networks in the world.

AWS Storage Options

While there are a ton of ways to actually store a file in AWS, there are 2 main services that are most commonly used: S3 and Glacier.

S3

S3 stands for Simple Storage Service and is one of the oldest services in AWS. S3 is an object store, meaning you can store any and all files (documents, spreadsheets, text, etc.), but you cannot store block-level data here. In practice, think of S3 as you would something like Google Drive: good for storing files, bad for storing operating systems. S3 also places no limit on total storage and can scale to meet your needs (individual objects are capped at 5 TB).
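
To make this concrete, here is a minimal sketch of storing and retrieving a file with boto3, the AWS SDK for Python (the bucket and file names are hypothetical placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Upload a local file to S3 as an object.
s3.upload_file(
    "quarterly_report.pdf",          # local file
    "my-example-bucket",             # bucket (hypothetical)
    "reports/quarterly_report.pdf",  # object key
)

# Download it back to disk later.
s3.download_file(
    "my-example-bucket",
    "reports/quarterly_report.pdf",
    "quarterly_report_copy.pdf",
)
```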

S3 exists in 3 tiers:

  • Standard
  • IA (Infrequent Access)
  • IA One Zone

Standard S3 has the best availability at 99.99%, IA has slightly less at 99.9%, and IA One Zone has the lowest availability with 99.5%.

From a durability standpoint, S3 Standard and IA replicate your data across at least 3 different data centers (Availability Zones). IA One Zone keeps its replicas within a single data center.
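
In the S3 API, the tier is just a parameter on the upload. Here is a small sketch, reusing the hypothetical bucket from above; the StorageClass values shown are the API names for these tiers:

```python
import boto3

s3 = boto3.client("s3")

# The three tiers map to these StorageClass values in the S3 API:
#   Standard    -> "STANDARD" (the default if omitted)
#   IA          -> "STANDARD_IA"
#   IA One Zone -> "ONEZONE_IA"
s3.upload_file(
    "lab_results.csv",
    "my-example-bucket",
    "lab/lab_results.csv",
    ExtraArgs={"StorageClass": "STANDARD_IA"},
)
```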

Glacier

Glacier is a low-cost option for archiving the same kinds of file data (like S3, it is an object store, not block storage). It has the same replication as S3 Standard (across at least 3 data centers), but access times are very slow, ranging from 1–5 minutes for expedited retrievals to 5–12 hours for bulk retrievals. Standard retrievals take between 3 and 5 hours.

In exchange for this slow access, Glacier’s storage rates are very low. Faster retrieval tiers cost more than slower ones, but all of Glacier’s storage rates are lower than S3’s.

Glacier is a great fit for archiving important, hard-to-recreate data that is accessed very rarely and never needs sub-second access times.
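
For objects that S3 has archived into Glacier, retrieval is an explicit, asynchronous request, and the tier you pick determines the wait. A sketch, with a hypothetical bucket and key:

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to restore an archived object and keep the copy available for 7 days.
# The Tier value controls the retrieval latency discussed above:
#   "Expedited" -> roughly 1-5 minutes (most expensive)
#   "Standard"  -> roughly 3-5 hours
#   "Bulk"      -> roughly 5-12 hours (cheapest)
s3.restore_object(
    Bucket="my-example-bucket",
    Key="archive/2017/old_lab_report.pdf",
    RestoreRequest={
        "Days": 7,
        "GlacierJobParameters": {"Tier": "Standard"},
    },
)
```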

Pricing

While it is tempting to simply compare standard S3 file storage rates with on-premises costs, this is not a true apples-to-apples comparison, and it misses the flexibility of cloud offerings to adapt to each client’s business needs.

Depending on your redundancy and access time needs, you can choose a storage strategy that fits your business needs while keeping costs to a minimum.

While pricing varies region-by-region (see exact pricing here), the us-east-1 pricing for each service at the time of writing is approximately:

  • S3 Standard: $0.023 per GB-month, with no per-GB retrieval fee
  • S3 IA: $0.0125 per GB-month, plus $0.01 per GB retrieved
  • S3 IA One Zone: $0.01 per GB-month, plus $0.01 per GB retrieved
  • Glacier: $0.004 per GB-month, plus $0.01 per GB for standard retrievals

Let’s break these prices down a bit into some general trends.

The first trend to notice is that the faster a service responds, the higher its monthly storage cost. This makes sense, as fast response times require more resources to provide than slow ones.

However, latency is not the only factor that impacts total cost. You are also charged based on how frequently you read and write your data. This pricing trend is an inversion of the storage cost trend; that is, the slower your storage access times, the more expensive it is to read or write to that storage.

The general incentives, therefore, are to keep your “hot” data (data that is accessed and edited often) in an S3 Standard store and your “cold” data (data that is rarely accessed or edited) in Glacier.

As this “hot” to “cold” need is a spectrum, S3 IA and IA One Zone are there to fill the gap. The pricing is similar, but One Zone is slightly cheaper as it has less redundancy in the data being stored. Therefore, use One Zone for data that can be easily recreated or has limited impact if it is lost.

Lastly, the cost rate per GB read incentivizes bulk reading from your “colder” stores and provides less penalty for iterative reads from your “hotter” stores.
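
To see how the storage and retrieval rates interact, here is a rough back-of-envelope comparison in Python, using the approximate rates listed above (illustrative only; request fees and retrieval tiers are simplified, so check the AWS pricing page for current numbers):

```python
# Approximate 2018 us-east-1 rates: (storage $/GB-month, retrieval $/GB).
# Illustrative only -- always check current AWS pricing.
TIERS = {
    "S3 Standard": (0.023, 0.00),
    "S3 IA": (0.0125, 0.01),
    "S3 IA One Zone": (0.010, 0.01),
    "Glacier": (0.004, 0.01),  # standard retrieval tier
}

size_gb = 1024           # 1 TB stored
reads_gb_per_month = 50  # how much of the data you pull back each month

for tier, (storage_rate, retrieval_rate) in TIERS.items():
    monthly = size_gb * storage_rate + reads_gb_per_month * retrieval_rate
    print(f"{tier:>15}: ${monthly:7.2f}/month")
```

Rerunning this with different read volumes shows the crossover: as reads grow, the cheap-to-store “cold” tiers lose their advantage to S3 Standard.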

Use Case — O&G Refining Hybrid Cloud Architecture

To build more intuition for these storage tiers, let’s walk through a use case that touches all of the storage types.

Imagine that you work in Oil and Gas Refining and are designing a hybrid cloud approach for storing and distributing important files in the cloud.

The following requirements impact your design decisions:

  • Keep cloud costs to a minimum.
  • Store lab-generated reports of product quality for compliance reasons. These reports are not often accessed, but must be accessible immediately when needed and absolutely cannot be lost or destroyed.
  • Collect maintenance reports for global distribution to your engineering teams. These reports are accessed often for the first month after being generated and are almost never referenced afterward, though you’re planning on having your data science team look at them sometime this year. The reports are not backed up anywhere else and cannot be recreated if they are lost.
  • Store auto-generated weekly maintenance summaries, used by only a handful of managers (10 people globally), for an experimental Natural Language Processing program you are rolling out. The reports can get rather large due to high-quality chart graphics and the number of reports currently being generated (a total of 1 GB generated weekly).

How would you design a cloud-based file storage architecture that meets all the above requirements? Take a second to consider what you would do (for real, this will help you build intuition for AWS file storage).

Lab Reports

To start, let’s consider the lab reports. These should be a no-brainer as a bad fit for S3 Standard, since they are so infrequently accessed. Glacier would be the cheapest storage option, but its access latency (~4 hours for a standard retrieval) is not a good fit for the compliance requirement of immediate access. This leaves you to consider S3 IA or IA One Zone. Because the files are so critical and cannot be lost, One Zone’s reduced redundancy is a bad fit. For this case, I would recommend storing the reports in S3 IA for now and exploring a Glacier archive, as permitted by regulatory compliance, once the reports exceed a certain retention period.

Maintenance Reports

Secondly, we’ll need to store these maintenance reports. This one is interesting and would likely benefit from a tiered storage approach.

For the first month, since the reports are distributed globally across engineering teams, we’ll need to keep access costs down. This makes S3 Standard a good choice. In addition, S3 Standard’s built-in redundancy will make sure that we don’t lose these important files.

However, after the first month we’ll be spending too much by storing these files in S3 Standard if no one will be accessing them. One option is to discard the reports, but since we’re hoping to have the data science team mine these reports at some point, we need to keep them.

I’m willing to bet the data science team will be able to wait 4 hours to access these reports when they’re ready to begin their analysis (we can transition to a more frequently accessed storage option at that time), so I would suggest using Glacier to archive the reports after 30 days in S3 Standard. Glacier has the same built-in redundancy as S3 Standard, so there’s still no worry of losing reports.
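
As a sketch, that 30-day transition can be expressed as an S3 lifecycle rule rather than something you manage by hand (the bucket name and key prefix are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Automatically move maintenance reports to Glacier 30 days after creation.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-maintenance-reports",
                "Filter": {"Prefix": "maintenance-reports/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```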

NLP Summaries

Since so few people access these reports, we should treat these files as infrequently accessed, so S3 Standard is out. The managers accessing the reports likely need to see them immediately as well, so Glacier is also out.

The key in this use case for deciding between S3 IA and IA One Zone is in the required durability of this data. Is there a big impact if we lose this data?

Since these reports are derived from durably-stored raw data by an NLP algorithm, it should be very cheap and fast to recreate the summaries if they’re ever lost. Because of that, I would recommend storing these summaries in the cheaper S3 IA One Zone storage class.

However, since this is a pilot program that may expand if successful, we may end up needing to upgrade to S3 Standard as access rates increase. Luckily, this is simple to monitor with AWS’s built-in usage monitoring, and we can transition to S3 Standard when usage begins to rise (at high access rates, Standard’s lower request costs make it the cheaper option overall).
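
If the pilot does take off, promoting a summary back to S3 Standard is just an in-place copy. A minimal sketch with the same hypothetical names:

```python
import boto3

s3 = boto3.client("s3")

# Change an object's storage class by copying it over itself with a new class.
key = "nlp-summaries/week-32-summary.pdf"
s3.copy_object(
    Bucket="my-example-bucket",
    Key=key,
    CopySource={"Bucket": "my-example-bucket", "Key": key},
    StorageClass="STANDARD",
    MetadataDirective="COPY",  # keep the object's existing metadata
)
```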

Get Started

I hope this post has jogged your imagination around how your organization could benefit from simple file storage in the cloud, and you should now have a good understanding of the basic costs and benefits of file storage in AWS.

There are many ways to onramp your organization to the cloud and a lot more to learn on this topic, so stay subscribed to the Hashmap blog for more Big Data, IoT, and Cloud tips.

Need Snowflake Cloud Data Warehousing and Migration Assistance?

If you’d like additional assistance in this area, Hashmap offers a range of enablement workshops and consulting service packages as part of our consulting service offerings, and would be glad to work through your specifics in this area.

To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast, Hashmap on Tap, on Spotify, Apple, Google, and other popular streaming apps.

Randy Pitcher is a Cloud and Data Engineer (and OKC-based Regional Technical Expert) with Hashmap, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries alongside a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.

Be sure to connect with Randy on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes, or to schedule a hands-on workshop to help you go from Zero to Snowflake.
