Efficiency in the Fast Lane: Refining Redshift in a World That’s Always Rushing

Unraveling Complex Challenges and Discovering Opportunities in a Half-Baked Tech Landscape

Alberto Jaen
Ostinato Rigore
4 min read · Nov 2, 2023


In the fast-paced world of technology, where companies often prioritize rapid deployment over meticulous optimization, inefficiencies can creep in. I experienced this firsthand when I joined a new team that was grappling with an AWS Redshift cluster already maxed out on both node count and instance type.

The cluster was struggling, and a closer inspection revealed a complex web of issues, largely a by-product of the team’s fast-moving approach. But, as with many hastily assembled modern systems, the path back to efficiency was surprisingly straightforward.

The Issues at Hand:

The ingestion system was built mainly from the following services:

  • Lambdas: Serverless functions used for short and frequently run tasks.
  • Kinesis Streams: Where data is stored temporarily to decouple processes when moving data downstream.
  • Kinesis Firehose: A service that writes data to S3 in a partitioned and compressed manner.
  • ECS Fargate tasks: Containers running the team’s images on AWS-managed infrastructure.
  • Redshift cluster: AWS’s proprietary data warehouse.
  • The broker: A third-party service used to decouple the apps generating data from the analytics infrastructure.
Starting point
  1. Double Trouble with Data Ingestion: The team had introduced a new data ingestion system to enrich data before it entered Redshift. However, in the hustle to innovate, they never retired the old system. As a result, both were left running, effectively doubling our data ingestion. Worse, the older pipeline sourced events directly from our broker, and because the broker could deliver duplicate events, it continuously ran deduplication queries against the already overburdened cluster (a sketch of this kind of query follows this list).
  2. The Piling ECS Tasks: New data was funneled from S3 to Redshift using ECS tasks every 15 minutes. But these tasks lacked the intelligence to determine if their predecessors had completed their job. On days with heightened cluster latencies, these tasks would pile up, locking tables and creating significant query backlogs.
  3. The Missing Test Environment: Our data team was operating in the dark, with no testing environment. This meant that any hefty query was directly executed on our primary database, adding more pressure to the system.
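
For context, the deduplication work the old pipeline kept pushing onto the cluster typically looks something like the staging-table pattern below. This is only an illustrative sketch, not our actual job: the table names, key column, cluster identifiers, and the use of the Redshift Data API are assumptions.

```python
# Illustrative only: table and column names, and the cluster identifiers,
# are hypothetical, not the team's real schema.
import boto3

client = boto3.client("redshift-data")

# Classic Redshift staging-table dedup: drop incoming rows that already exist
# in the target, insert only distinct rows, then clear the staging table.
dedup_statements = [
    """DELETE FROM events_staging
       USING events
       WHERE events_staging.event_id = events.event_id;""",
    "INSERT INTO events SELECT DISTINCT * FROM events_staging;",
    "TRUNCATE events_staging;",
]

response = client.batch_execute_statement(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster name
    Database="analytics",
    DbUser="etl_user",
    Sqls=dedup_statements,
)
print(response["Id"])  # statement id; progress can be polled with describe_statement
```

Run every time a batch arrives from the broker, statements like these hold locks and burn compute on the cluster, which is exactly the continuous strain described above.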

The Road to Resolution:

  1. Simplifying Data Streams: Initially, we archived all data from the previous ingestion process to S3 using UNLOAD queries (see the sketch after this list). Subsequently, we migrated the tables that hadn’t yet transitioned to the new system. Once everything was consolidated, we decommissioned the outdated, redundant ingestion system. This decisive measure halved the amount of data being fed into Redshift.
  2. Smarter Task Management: A swift yet effective fix was introduced. We equipped the ECS tasks with a simple mechanism to check whether another task of the same type was already running, preventing unnecessary pile-ups (a minimal version of that check is also sketched below).
  3. Creating a Safety Net with Test Environments: Two distinct testing environments were set up, one with 3 months of data and the other with 6 months. This provided a safe sandbox for the data team to test their weighty queries without affecting the main database.
  4. Shedding Unnecessary Weight: A significant part of our optimization journey involved scrutinizing the existing data storage. As we dove deeper, we recognized that a massive chunk of our Redshift cluster was occupied by obsolete data. We unloaded the data from the old ingestion process and purged other test datasets. This effort bore fruit: the cluster’s storage footprint plummeted from 90 TB down to a streamlined 30 TB. Not only did this clear out superfluous information, it also improved the speed and responsiveness of our Redshift operations, proving that less is often more.
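
For reference, archiving a table to S3 before decommissioning it boils down to an UNLOAD statement along these lines. This is a minimal sketch with placeholder names: the table, bucket, IAM role, and cluster identifier are not the values we actually used.

```python
import boto3

client = boto3.client("redshift-data")

# UNLOAD writes the query result to S3 as compressed Parquet files.
# Table name, bucket, and IAM role below are placeholders.
unload_sql = """
UNLOAD ('SELECT * FROM legacy_events')
TO 's3://example-archive-bucket/legacy_events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload-role'
FORMAT AS PARQUET;
"""

client.execute_statement(
    ClusterIdentifier="analytics-cluster",  # hypothetical identifier
    Database="analytics",
    DbUser="etl_user",
    Sql=unload_sql,
)
```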
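
The “is a previous run still going?” guard added to the ECS tasks can be as small as a single boto3 call at startup. Here is a rough sketch of that idea; the cluster and task-family names are hypothetical.

```python
import sys

import boto3

ecs = boto3.client("ecs")

def another_copy_running(cluster: str, family: str) -> bool:
    """Return True if more than one task of this family is currently RUNNING."""
    running = ecs.list_tasks(cluster=cluster, family=family, desiredStatus="RUNNING")
    # The calling task counts itself, so anything beyond one means a
    # predecessor is still loading data.
    return len(running["taskArns"]) > 1

# Hypothetical cluster and task-family names.
if another_copy_running("ingestion-cluster", "s3-to-redshift-loader"):
    print("Previous load still in progress; exiting without doing any work.")
    sys.exit(0)
```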

The Takeaway:

The technical enhancements we implemented led to clear, measurable efficiency gains, with the Redshift cluster now operating at its peak. This exercise, however, offered insights that went beyond mere technological progress. It illuminated a recurring theme in the contemporary tech landscape: in the fervor to innovate, organizations sometimes roll out half-completed solutions. Such a rush often masks glaring inefficiencies that, though readily fixable, are frequently ignored.

Our actions also resulted in considerable financial savings. By refining our data processes and system infrastructure, we could notably downsize our Redshift resources. The economic impact was clear: monthly expenditure was cut from $20,000 to $6,000, which brought yearly costs down from an exorbitant $240,000 to a manageable $72,000. These numbers not only emphasize the critical nature of operational efficiency but also showcase the monetary benefits of systematic, careful optimization.

For the astute observer, scenarios like these unearth immense opportunities. The inadvertent benefit of the tech world’s accelerated pace is that rectifying its resultant inefficiencies often requires straightforward solutions. This entire episode stands as a testament to the value of pausing, evaluating, and refining. The potential rewards, both in performance and financial savings, are immense.

