How to build a highly scalable Data Persistence Pipeline with the AWS S3 Sink Connector

Ishita Virmani · Naukri Engineering · Feb 28, 2024

Introduction

Naukri.com serves millions of active job-seekers per day, generating billions of events across platforms. As part of the Data Engineering team handling data for multiple such sites under the Info Edge umbrella, one of our many challenges is to make this data available on persistent storage as quickly as possible to cater to near-real-time use cases.

In our case, AWS S3 is the persistent store, and the data sits in ~2500 Kafka topics, waiting to be persisted.

Traditional Approach

Our initial strategy was to run a Spark job in which a Kafka consumer sequentially subscribed to each of these topics, polled the data, and wrote it to S3. To achieve parallelism, we grouped Kafka topics into logical clusters, allowing us to run a single Spark job for each group.
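For illustration, here is a minimal sketch of what such a job might look like. The topic group, bootstrap servers, and bucket paths are hypothetical placeholders, not our actual code:

```python
# Illustrative sketch of the traditional approach (topics, servers and paths are hypothetical).
# Requires the spark-sql-kafka and hadoop-aws packages on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sequential-kafka-to-s3").getOrCreate()

# One logical group of topics handled by this job; other groups run as separate jobs.
topic_group = ["jobseeker_clicks", "jobseeker_searches", "recruiter_views"]

for topic in topic_group:
    # Batch-read whatever is currently available in the topic...
    df = (spark.read
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", topic)
          .option("startingOffsets", "earliest")
          .load())

    # ...and persist it to S3, one topic at a time.
    (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
       .write
       .mode("append")
       .parquet(f"s3a://example-bucket/raw/{topic}/"))
```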

Quite straightforward, isn’t it? But the simplicity came at a cost: a single job could run for more than 4 hours to persist ~350 Kafka topics with an average of 3M records per topic. Moreover, there was significant overhead in managing multiple such long-running jobs.

Efficient & Effective Approach

Out came the Kafka S3 sink connector to our rescue!

The most important feature that makes the AWS S3 Sink Connector “the catch” is its ability to efficiently upload objects to S3 through AWS multipart uploads. It continuously uploads smaller chunks of an object to S3 as parts. The multipart upload is completed, and the Kafka offsets are committed back, only once the number of records reaches the flush.size configuration value. This ensures improved throughput as well as resilience.
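As a reference, a minimal connector configuration might look like the sketch below, registered through the Kafka Connect REST API. The connector name, topics, bucket, region, and flush.size value are illustrative assumptions, not our production settings:

```python
# Illustrative S3 Sink Connector config (values are assumptions, not production settings),
# registered via the Kafka Connect REST API.
import json
import requests

connector = {
    "name": "s3-sink-jobseeker-events",                    # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "4",
        "topics": "jobseeker_clicks,jobseeker_searches",   # hypothetical topics
        "s3.bucket.name": "example-interim-store",
        "s3.region": "ap-south-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
        "path.format": "'dt'=YYYY-MM-dd",
        "partition.duration.ms": "3600000",
        "timestamp.extractor": "Record",
        "locale": "en-IN",
        "timezone": "Asia/Kolkata",
        # The multipart upload is completed, and offsets are committed back,
        # only after this many records have been written.
        "flush.size": "100000",
    },
}

resp = requests.post(
    "http://connect:8083/connectors",        # Kafka Connect REST endpoint (assumed host)
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```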

Is that it?

The Kafka S3 Connector is just one of the components within our Persistence Pipeline. In the next section, we’ll look at the entire process of persistence.

Optimized Persistence Pipeline

Here are some of the reasons we need an entire persistence pipeline around the connector, which we call the “Optimized Persistence Pipeline”:

  1. Files written by the sink connector cannot be efficiently read directly via Spark or any other distributed computing framework, as they are too small to give us optimal read times.
  2. Moreover, we have many ETL jobs scheduled to run daily, for which we need to identify whether the respective event’s data for yesterday is available.
  3. We also have a couple of derived columns that need to be included along with the raw data.

We perform the following set of tasks (a minimal sketch of step 2 follows the list):

  1. Once all partitions’ data of a Kafka topic for yesterday has been written to the interim S3 store, a listener tags the same in meta.
  2. Based on the records in meta, a Spark job performs compaction/transformations and writes to the final S3 store, subsequently tagging the event within meta as Available. This final S3 store is used to read all the data.
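As a rough illustration of the compaction step, here is a minimal PySpark sketch. The paths, event name, derived column, and the way meta is consulted are hypothetical assumptions, not our actual job:

```python
# Illustrative compaction/transformation sketch (paths, event name and derived column are hypothetical).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("compact-interim-to-final").getOrCreate()

event, dt = "jobseeker_clicks", "2024-02-27"   # hypothetical event and date, picked from meta

# Read the many small files the sink connector wrote to the interim store for yesterday.
interim = spark.read.json(f"s3a://example-interim-store/{event}/dt={dt}/")

# Add derived columns needed by downstream ETL jobs (hypothetical example).
transformed = interim.withColumn("event_date", F.to_date(F.col("timestamp")))

# Compact into a small number of larger files in the final store for efficient reads.
(transformed
    .coalesce(16)
    .write
    .mode("overwrite")
    .parquet(f"s3a://example-final-store/{event}/dt={dt}/"))

# The job would then mark the event as Available in meta;
# that update depends on the meta store and is omitted here.
```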

Conclusion

With the Optimized Persistence Pipeline, most of the data is persisted by 1 a.m., allowing all our ETL jobs to run during the night itself. Moreover, since the data is available in the interim S3 store almost instantly, we can also power near-real-time use cases (for smaller datasets).
