How We Slashed Our S3 Costs by Over 60% Using Zesty and AWS Native Tools

Eran Levy · Published in Zesty Engineering · 4 min read · Jul 2, 2024

In today’s cloud-native world, optimizing costs is crucial for maintaining both a lean, efficient infrastructure and a well-managed budget. As organizations increasingly rely on cloud services, striking the right balance between performance and cost-effectiveness becomes paramount for sustainable growth and operational success.

Zesty helps organizations minimize cloud waste with optimization insights and automations. We constantly “dog-food” our platform to minimize our own waste, and we recently tackled the challenge of rising Amazon S3 storage and request costs. This post walks through our journey to optimize that cost and usage, detailing the tools and techniques we used to identify the main cost drivers.

Understanding S3 Cost Components

Before diving into our optimization process, it’s essential to understand the primary factors contributing to S3 costs:

  1. Storage: The amount of data stored in your S3 buckets.
  2. Requests: API calls made to interact with your S3 objects.
  3. Data Retrieval: Costs associated with accessing data, especially from colder storage tiers.

Our specific use case centered around a data lake S3 bucket containing various prefixes and serving multiple workflows, including data pipelines, data science, and analytics processes.

Leveraging Zesty Platform for Cost Visibility

(Screenshots: Zesty S3 Service Cost & Usage Explorer and Cost & Usage Breakdown)

Our optimization journey began with the Zesty platform, which provides comprehensive visibility into S3 costs. Zesty’s S3 service view offers:

  • Total S3 cost visibility with a breakdown of storage, requests, and transfer costs
  • A holistic table view of S3 cost and usage data
  • Actionable recommendations generated by Zesty’s engine

(Screenshot: S3 Potential Savings & Recommendations)

These recommendations range from simple suggestions, like optimizing data transfer, to more complex strategies, like storage tiering. The insights from Zesty prompted us to dig further into the root causes of our high storage volume and request rates.

Diving Deeper with AWS S3 Storage Lens

To gain more granular insights into our S3 prefix-level usage, we turned to AWS S3 Storage Lens. This tool allows you to drill down into prefix-level analytics. Here’s how to set it up:

1. Navigate to the S3 console and select “Dashboards” under “Storage Lens” in the left sidebar

2. Click “Create dashboard” and provide a name for your dashboard

3. Select the metrics and buckets you want to track (note that prefix-level aggregation requires the advanced metrics tier)

NOTE: S3 Storage Lens advanced metrics come with additional charges based on the number of objects monitored, among other parameters. Refer to the Amazon S3 pricing page for more information.

Using S3 Storage Lens, we quickly identified the prefixes consuming the most storage in our data lake bucket.

Analyzing S3 Access Patterns to Optimize Requests

To understand our high request costs, we needed to analyze access patterns. This required enabling S3 server access logging, which can be done as follows:

1. Open the S3 console and select your bucket

2. Go to the “Properties” tab

3. Scroll down to “Server access logging” and click “Edit”

4. Enable logging and choose a target bucket for the logs

Note: For high-traffic buckets, access logs accumulate quickly, and storing them can add noticeable costs to the target bucket. To mitigate this, start by enabling logging for a brief period.

Once logging was enabled, we created an external table in Amazon Athena to query the logs efficiently.

Here’s the SQL we used to create the table; refer to this guide for more information:

CREATE EXTERNAL TABLE `s3_access_logs_db.mybucket_logs`(
`bucketowner` STRING,
`bucket_name` STRING,
`requestdatetime` STRING,
`remoteip` STRING,
`requester` STRING,
`requestid` STRING,
`operation` STRING,
`key` STRING,
`request_uri` STRING,
`httpstatus` STRING,
`errorcode` STRING,
`bytessent` BIGINT,
`objectsize` BIGINT,
`totaltime` STRING,
`turnaroundtime` STRING,
`referrer` STRING,
`useragent` STRING,
`versionid` STRING,
`hostid` STRING,
`sigv` STRING,
`ciphersuite` STRING,
`authtype` STRING,
`endpoint` STRING,
`tlsversion` STRING,
`accesspointarn` STRING,
`aclrequired` STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://DOC-EXAMPLE-BUCKET1-logs/prefix/';
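
Once the table exists, a quick sanity check (a minimal query against the table and columns defined above) confirms that Athena can parse the delivered log objects:

SELECT bucket_name, requester, operation, "key", httpstatus
FROM s3_access_logs_db.mybucket_logs
LIMIT 10;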

Identifying High-Cost Operations

With our logging table in place, we ran queries to identify the most frequent requesters and operations:

SELECT requester, operation, request_uri, COUNT(*) AS total 
FROM "mybucket_logs"
GROUP BY requester, request_uri, operation;

To find the most frequently accessed keys:

SELECT split_part("key", '/', 1) AS grouped_key, COUNT(*) AS total 
FROM "mybucket_logs"
WHERE operation='REST.GET.OBJECT'
GROUP BY split_part("key", '/', 1)
ORDER BY total DESC;
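
Beyond raw request counts, it can also help to look at response statuses: clients that retry on errors can quietly multiply request volume. The following is an illustrative sketch against the same table, not part of our original analysis:

-- Illustrative: spot error responses that may indicate heavy client retries
SELECT httpstatus, errorcode, COUNT(*) AS total
FROM "mybucket_logs"
GROUP BY httpstatus, errorcode
ORDER BY total DESC;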

Important: When running queries on large datasets, always use date filters to control costs.

For example:

WHERE parse_datetime(requestdatetime,'dd/MMM/yyyy:HH:mm:ss Z') 
BETWEEN timestamp '2024-07-01 00:00:00' AND timestamp '2024-07-02 08:00:00'
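
Putting the pieces together, a date-bounded version of the requester breakdown looks like this (same table as above; adjust the time window to your investigation period):

SELECT requester, operation, COUNT(*) AS total
FROM "mybucket_logs"
WHERE parse_datetime(requestdatetime,'dd/MMM/yyyy:HH:mm:ss Z')
BETWEEN timestamp '2024-07-01 00:00:00' AND timestamp '2024-07-02 08:00:00'
GROUP BY requester, operation
ORDER BY total DESC;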

Our analysis revealed an unoptimized Athena query running as part of a daily job, which was unnecessarily scanning a large set of partitions. Refining that query significantly reduced our API request costs. In addition, one of the prefixes was growing exponentially because of an unstable background process; fixing that process significantly reduced the amount of data we store.
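
The same access logs are also useful on the storage side: counting daily writes per top-level prefix can surface that kind of runaway growth. A hypothetical query along these lines (illustrative rather than the exact one we used) would be:

-- Illustrative: daily PUT volume per top-level prefix to spot abnormal growth
SELECT date_trunc('day', parse_datetime(requestdatetime,'dd/MMM/yyyy:HH:mm:ss Z')) AS day,
split_part("key", '/', 1) AS grouped_key,
COUNT(*) AS total_puts
FROM "mybucket_logs"
WHERE operation = 'REST.PUT.OBJECT'
GROUP BY 1, 2
ORDER BY 1, 3 DESC;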

Conclusion

By combining Zesty’s recommendations with AWS native tools, we reduced our S3 costs by more than 60%. The process not only saved us money but also gave us valuable insight into our data usage patterns, allowing us to optimize our workflows further.

Cloud cost optimization is an ongoing process.
