Late-Arriving Data: Challenges and Traditional Solutions

Rishika Idnani
Art of Data Engineering
2 min read · Feb 9, 2024

In my experience with data engineering, a recurring challenge is late-arriving data: records such as user attributes or events that land in the data store after the day they logically belong to. ETL processes typically run with today’s partition date as the reference point, so they process only the data that arrived in the store that same day. As a result, they overlook the attributes of late-arriving records, which are omitted from the aggregated or transformed datasets the ETLs generate. The derived datasets end up incomplete, missing crucial data from the source.

Figure: data loss in traditional ETL (the job runs for today’s partition date only).
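To make the failure mode concrete, here is a minimal sketch of such a job in PySpark, assuming a Hive-style table partitioned by date; the table and column names (events, daily_agg, event_date, user_id) are illustrative, not taken from any particular system:

```python
# A minimal sketch of a "today's partition only" job, assuming a Spark
# batch pipeline over a Hive-style date-partitioned table. All names
# here are illustrative; daily_agg is assumed to already exist.
from datetime import date

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-etl").getOrCreate()

run_date = date.today().isoformat()  # the partition this run is scheduled for

daily = (
    spark.read.table("events")
    .where(F.col("event_date") == run_date)  # reads today's partition only
    .groupBy("user_id")
    .agg(F.count("*").alias("event_count"))
    .withColumn("event_date", F.lit(run_date))
)

# Replace only today's partition of the output table. Anything that lands
# in an older partition *after* this run is never read again, so that
# day's aggregate stays permanently incomplete.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
(daily.select("user_id", "event_count", "event_date")
 .write.mode("overwrite")
 .insertInto("daily_agg"))
```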

In traditional data engineering practice, a common approach is to widen the ETL’s scope beyond the current day’s data: the pipeline processes today’s partition and also reprocesses the partitions from the preceding N days (a look-back window) to catch any late-arriving data. This strategy does account for late arrivals, but only by recomputing data that has already been processed, so every run consumes extra resources and overwrites previously written partitions, wasting compute and storage.

Figure: the look-back workaround (the ETL reprocesses the last N partitions on each run).
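Here is the same sketch with a look-back window, under the same assumed names; the only change is that each run reads and rewrites the last N partitions instead of one:

```python
# The earlier sketch extended with an N-day look-back window (names still
# illustrative). Every run recomputes today plus the previous N partitions
# to pick up late arrivals, at the cost of re-reading processed data.
from datetime import date, timedelta

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-etl-lookback").getOrCreate()

LOOKBACK_DAYS = 3
today = date.today()
window = [(today - timedelta(days=n)).isoformat()
          for n in range(LOOKBACK_DAYS + 1)]

reprocessed = (
    spark.read.table("events")
    .where(F.col("event_date").isin(window))  # today + preceding N partitions
    .groupBy("event_date", "user_id")
    .agg(F.count("*").alias("event_count"))
)

# Dynamic partition overwrite replaces only the partitions in the window,
# but each of those partitions is still fully recomputed every day.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
(reprocessed.select("user_id", "event_count", "event_date")
 .write.mode("overwrite")
 .insertInto("daily_agg"))
```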

Furthermore, this approach is not foolproof, particularly when data is delayed for an extended period. Covering arrivals that are months late would require a look-back window spanning months of old partitions, and reading and rewriting that much history on every run is impractical and resource-intensive.

To address this challenge, the industry has embraced technologies that enable incremental processing, avoiding the need to reprocess all historical data repeatedly. Concepts such as change data capture (CDC) have emerged, along with corresponding technologies like Apache Hudi. These solutions let data engineers capture and process only the changes or updates to existing datasets, significantly improving efficiency and reducing the resource overhead of traditional processing methods.
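As a rough illustration of the incremental pattern, the sketch below reads only the records committed to an Apache Hudi table since a checkpointed commit time; the table path and checkpoint value are assumptions for illustration, and the option names reflect recent Hudi releases:

```python
# A sketch of reading only new changes from an Apache Hudi table instead
# of re-scanning old partitions. Assumes the hudi-spark bundle is on the
# classpath; the path and checkpoint value below are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental").getOrCreate()

last_commit = "20240209103000000"  # commit time checkpointed by the previous run

changes = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_commit)
    .load("s3://warehouse/events_hudi")
)

# Only records committed after last_commit are returned, no matter how old
# the partition they landed in, so months-late data costs no full rescan.
changes.groupBy("event_date", "user_id").count().show()
```

In practice the consumer persists the latest processed commit time between runs, so each run picks up exactly the changes since the last one.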

Thank you for reading! If you found this interesting, follow me on Medium and subscribe for my latest articles. You can also catch me on LinkedIn.
