How to reduce the cost of Data Engineering pipelines while improving performance in AWS?

Puneet Saha
Towards Data Engineering
3 min read · Dec 9, 2023
North Gate, Section 1, Zhongxiao West Road, Zhongzheng, Taipei City, Taiwan (source: Unsplash)

Data pipelines serve data engineering, data analytics, or ML training. As we know, these pipelines are data-heavy and compute-heavy, since they process humongous volumes of data. They can, however, execute in offline mode and need not sit in the path of real-time, critical workflows. So we can collect data from all the different regions and push it to one zone/region (let's call it the compute site). Once data from the different regions has been ingested into that one region/zone, compute instances in the same AZ/region fetch it and execute the data engineering, data analytics, and ML training pipelines.
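To make the push-to-one-place idea concrete, here is a minimal sketch, assuming hypothetical bucket, dataset, and collector names: producers anywhere simply upload their batches to a single bucket in the chosen compute region.

```python
import boto3

COMPUTE_REGION = "us-east-1"          # the chosen compute site
CENTRAL_BUCKET = "pipeline-raw-data"  # hypothetical central bucket in that region

s3 = boto3.client("s3", region_name=COMPUTE_REGION)

def push_batch(local_path: str, source_site: str, dataset: str) -> None:
    """Upload one collected batch, keyed by dataset and source site."""
    filename = local_path.rsplit("/", 1)[-1]
    key = f"{dataset}/source={source_site}/{filename}"
    s3.upload_file(local_path, CENTRAL_BUCKET, key)

# Example: a collector running anywhere (e.g. Europe) still writes to the
# compute region, so all downstream pipelines read from one place.
push_batch("/tmp/events-2023-12-09.parquet",
           source_site="eu-collector-1",
           dataset="clickstream")
```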

Co-location

What are the benefits of having co-located storage and compute in a single region/AZ? A low total cost of ownership plus highly performant pipelines. Let's look at the aspects that impact performance and cost.

  1. We all know that data transfer into an AWS region is free, so we can push all the raw data into one region and run the pipelines there on a schedule or in response to triggers. These pipelines produce processed output that is much smaller in volume than the raw input collected from the different regions, so if that output ever needs to leave the region, egressing it is cheap, and far cheaper than running the pipelines in every region where the data was collected and then merging the results into one region (see the back-of-envelope sketch after this list).
  2. Provisioning compute in every region for data engineering tasks would be more expensive and less efficient than processing all the data in one place. The operational burden of supporting a single AZ/region is also lower than that of supporting multiple AZs/regions.
  3. Latency drops sharply when storage and compute sit in the same AZ/region because the network path is shorter. Data-analytics, machine-learning, and data-engineering workloads also tend to read many small objects rather than stream a few huge ones, and retrieving a small object is cheap: its seek time, fetch time, and network transfer are all lower than for large objects that must be streamed. Taken together, these pipelines generally run much faster than pipelines whose compute and storage live in different AZs/regions.
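Here is the back-of-envelope comparison promised in point 1: moving raw data between regions versus ingesting it for free and egressing only the small processed output. Every number below is a made-up assumption for illustration, not AWS's published pricing, so plug in the current rates and your real volumes.

```python
# All numbers below are illustrative assumptions only.
TRANSFER_OUT_PER_GB = 0.02   # assumed inter-region/egress price, USD per GB
RAW_GB_PER_REGION = 500      # assumed raw data collected per region per day
PROCESSED_GB_TOTAL = 20      # assumed size of the daily processed output
NUM_REGIONS = 4

# Option A: ship all the raw data out of every collection region to be merged.
ship_raw = NUM_REGIONS * RAW_GB_PER_REGION * TRANSFER_OUT_PER_GB

# Option B: push raw data into one compute region (inbound transfer is free),
# process it there, and egress only the much smaller processed output.
ship_processed = PROCESSED_GB_TOTAL * TRANSFER_OUT_PER_GB

print(f"egress all raw data:          ${ship_raw:,.2f} per day")
print(f"egress processed output only: ${ship_processed:,.2f} per day")
```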

Amazon S3 Express One Zone

And now let's talk about the new storage class, which is the cherry on top: Amazon S3 Express One Zone. This storage class provides extremely low, single-digit millisecond latency, with request costs up to 50% lower than S3 Standard, so we save major bucks on request-heavy pipelines. AWS quotes data access speeds up to 10x faster than S3 Standard, so the total cost of ownership of such pipelines goes down while performance goes up. So what is the catch? Obviously, single-AZ storage means we are trading off redundancy and availability of data. If, in the unlikely event of a natural disaster or for any other reason, we lose an AZ, the data stored there is gone forever. So we should only keep data there that does not halt the main applications if lost, or data that can be restored from a durable copy such as cold storage. Availability also drops from 99.99% to 99.95%, a degradation we can live with for offline jobs.
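Below is a minimal sketch of staging pipeline data in this storage class, assuming the boto3 directory-bucket API introduced with the S3 Express One Zone launch. The bucket name, Availability Zone ID, and object keys are hypothetical, and the exact parameter shapes should be double-checked against the AWS documentation.

```python
import boto3

AZ_ID = "use1-az5"                             # assumed AZ ID of the compute fleet
BUCKET = f"pipeline-scratch--{AZ_ID}--x-s3"    # directory buckets use this naming suffix

s3 = boto3.client("s3", region_name="us-east-1")

# Create a single-AZ "directory bucket" backing the S3 Express One Zone class,
# placed in the same AZ as the compute instances.
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={
        "Location": {"Type": "AvailabilityZone", "Name": AZ_ID},
        "Bucket": {"Type": "Directory", "DataRedundancy": "SingleAvailabilityZone"},
    },
)

# Stage an intermediate dataset; reads from compute in the same AZ see the
# very low latencies described above.
s3.put_object(Bucket=BUCKET, Key="intermediate/features.parquet", Body=b"...")
```

Because everything in this bucket lives in a single AZ, it is best reserved for intermediate or reproducible datasets, with the durable copy of the raw data kept in a multi-AZ class such as S3 Standard.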

Oh, and one more minor gotcha with this new storage class: at the moment it is available in only four regions, US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Stockholm), with plans to expand to more regions over time.

Conclusion

TL;DR: co-locate compute and data for these pipelines, and use the Amazon S3 Express One Zone storage class (on AWS).

Hope this helps. Please share your opinion.

References: https://aws.amazon.com/s3/storage-classes/express-one-zone/
