Infrastructure failures during big data processing

Nikhil (Srikrishna) Challa
Google Cloud - Community
6 min read · Jan 15, 2024


How is hardware failure accounted for in large-volume data processing that lasts for days, weeks, or even months?

It’s important to plan for recovery and restarts of data processing applications, especially when processing large volumes of data. To discuss a fault-tolerant design for a big data processing application, this blog uses Google Cloud’s Dataproc as the execution engine.

There are quite a few scenarios where we need to process very large volumes of data, whether it is big data processing or machine learning model training. This might go on for days or weeks, and occasionally months (large language models, for example, can take months to train).

One of the biggest risks in such a scenario is infrastructure failure. The machines the jobs run on may fail, leading to the failure of the job itself. This blog is about how to handle that, and it discusses the possible options and recommendations.

Use case: let’s say we are implementing a fault-tolerant data processing pipeline that processes large volumes of transactional data for sales trend insights and customer behaviour analytics.

The technologies of choice here are Apache Spark and Google Cloud Dataproc.

Apache Spark → A unified engine and distributed processing framework that processes data in memory, resulting in faster computation compared to MapReduce, which reads and writes data from disk.

Google Cloud Dataproc → A managed Hadoop service on Google Cloud that acts as an execution engine for Spark and MapReduce jobs.

In large-volume data processing that runs over a long period of time, it is important that progress is not lost due to node failures or node restarts. The good news is that Spark has options to tackle this challenge and make the system fault tolerant.

A reference architecture that reads data from sources and processes it into data storage solutions in GCP:

Option 1:

Checkpointing:

A feature that saves the state of computations at specified intervals. This is very useful in case of failures: the job can be restarted from the last checkpoint rather than starting over. You can set up checkpointing to a reliable distributed storage system like HDFS or GCS.

# Enable checkpointing - directory in GCS
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-pipeline").getOrCreate()

# Use a durable GCS bucket so checkpoints survive node failures
checkpoint_directory = "gs://your-temporary-bucket/checkpoints/"
spark.sparkContext.setCheckpointDir(checkpoint_directory)

Assuming this job runs for 20 days, the best option is to take a checkpoint at the end of each day for the data processed on that day. That way, if the job fails on day 2 or day 3, the previous day’s processed output is already checkpointed and processing can be reinstated from that point, as sketched below.
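A minimal sketch of that daily pattern, assuming the transactional data carries a transaction_date column; the bucket paths and column names here are placeholders, not from a real pipeline:

# Sketch: checkpoint each day's processed output before moving to the next day.
# Bucket paths and column names (transaction_date, customer_id, amount) are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-checkpoint-sketch").getOrCreate()
spark.sparkContext.setCheckpointDir("gs://your-temporary-bucket/checkpoints/")

transactions = spark.read.parquet("gs://your-data-bucket/transactions/")

for day in ["2024-01-01", "2024-01-02", "2024-01-03"]:
    daily = transactions.filter(F.col("transaction_date") == day)
    processed = daily.groupBy("customer_id").agg(F.sum("amount").alias("daily_spend"))

    # Materialise and checkpoint the day's result; if the job dies on a later day,
    # this output is already saved and does not need to be recomputed.
    processed = processed.checkpoint(eager=True)
    processed.write.mode("overwrite").parquet(f"gs://your-output-bucket/daily/{day}/")

Writing each day’s output to its own GCS prefix also gives a natural restart point even outside Spark’s checkpoint mechanism.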

Every design decision should be weighed for its pros and cons.

Pros of checkpointing:

  • Provides fault tolerance, which is the primary reason to use checkpoints: they reliably allow the application to be restarted after a failure
  • Spark maintains a lineage graph of all the transformations each RDD has undergone (an RDD, or resilient distributed dataset, is Spark’s core distributed data abstraction). For large and complex pipelines these lineage graphs become intricate and very long, and since Spark relies on them after a node failure to work out how to recompute a given RDD, recovering through such a graph becomes slow and hard to reason about. When an RDD is checkpointed, Spark saves its current data and the lineage graph for that RDD is effectively cut off; the checkpointed RDD does not need to remember its lineage before the checkpoint. If there is a failure after checkpointing, Spark does not need to walk all the way back through a long lineage graph and can start recovery from the last checkpoint, which is much quicker. A small sketch after this list illustrates the truncation.
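A purely illustrative sketch of that truncation, printing the lineage with toDebugString before and after a checkpoint (the transformations and data sizes are arbitrary):

# Illustrative only: pile up transformations, then checkpoint to cut the lineage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-sketch").getOrCreate()
sc = spark.sparkContext
sc.setCheckpointDir("gs://your-temporary-bucket/checkpoints/")

rdd = sc.parallelize(range(1_000_000))
for _ in range(10):
    rdd = rdd.map(lambda x: x * 2).filter(lambda x: x >= 0)

print(rdd.toDebugString())  # long lineage: many chained map/filter steps

rdd.checkpoint()  # mark the RDD for checkpointing
rdd.count()       # an action materialises the checkpoint

print(rdd.toDebugString())  # lineage now starts from the checkpointed data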

Cons of checkpointing:

  • I/O Overhead: Writing checkpoints to disk (HDFS or other storage systems) involves I/O operations, which can be expensive and time-consuming, potentially impacting the performance of your Spark job
  • Storage Space: Checkpoints require additional storage space. The frequency and size of the data being checkpointed can lead to significant storage requirements.

Option 2:

Data replication:

Ensure that your data is replicated across multiple nodes in your distributed environment. This way, if one node fails, you have copies of your data on other nodes.

Cloud Dataproc, being a managed service, has an option to create a highly available (HA) cluster with multiple master nodes, providing a higher degree of fault tolerance. Additionally, HDFS on these clusters redundantly stores data blocks across multiple nodes, which means that if one node goes down, a copy of the data is available on another machine and can be retrieved from there.
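As a hedged illustration, an HA cluster with three master nodes can be requested through the Dataproc Python client library; the project ID, region, cluster name and machine types below are placeholders, and the gcloud CLI or the console can achieve the same result:

# Sketch: create a highly available Dataproc cluster (3 master nodes).
# Project ID, region, cluster name and machine types are placeholder values.
from google.cloud import dataproc_v1

region = "us-central1"
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "your-project-id",
    "cluster_name": "ha-sales-pipeline-cluster",
    "config": {
        "master_config": {"num_instances": 3, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 4, "machine_type_uri": "n1-standard-4"},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": "your-project-id", "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is ready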

How HDFS Replication Works in the Event of Failures:

  • Node Failures: If a node in the Dataproc cluster fails, HDFS ensures that additional copies of the data blocks stored on that node exist on other nodes. HDFS constantly monitors the health of its data blocks. If it detects that a block has fallen below its desired replication factor (the HDFS default is 3), it automatically re-replicates the remaining copies to other healthy nodes to restore the replication factor.
  • Data Processing Recovery: For data processing tasks, if a node fails, the processing tasks (e.g., Spark tasks) running on that node will also fail. Spark, running on Dataproc, is designed to handle such failures: it can restart the failed tasks on other nodes where the data is replicated. This means that if part of your data processing job fails due to a node failure, Spark can continue processing by accessing the replicated data from other nodes, without having to start the entire job from scratch (a small configuration sketch follows this list).
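The number of task attempts Spark tolerates before failing a whole stage is configurable. A minimal sketch; spark.task.maxFailures is a standard Spark property (its default is 4), and the value shown is only an example:

# Sketch: allow more task retries before a stage (and hence the job) is failed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("fault-tolerant-sales-pipeline")
    .config("spark.task.maxFailures", "8")  # default is 4
    .getOrCreate()
)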

There are advantages and disadvantages of this option as well:

Pros:

  • Increased fault tolerance, since copies of the data are available across multiple nodes
  • Improved data availability and disaster recovery

Limitations:

  • While HDFS replication handles node-level failures, it does not protect against cluster-wide failures or issues that affect the entire storage system.
  • It’s also important to note that HDFS replication is only effective within the lifespan of the Dataproc cluster. If the entire cluster is terminated, the data in HDFS is lost unless it has been persisted to a more durable storage service like Google Cloud Storage (GCS); a small sketch after this list shows that persistence step.
  • Increased storage costs
  • Write overhead and the associated performance impact
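A minimal sketch of that persistence step, reusing the spark session and processed DataFrame from the earlier checkpoint sketch with a placeholder bucket path; writing the output to GCS decouples its durability from the cluster’s lifespan:

# Sketch: persist processed results to GCS so they outlive the Dataproc cluster.
processed.write.mode("overwrite").parquet("gs://your-durable-bucket/sales-insights/")

# The data can later be read back the same way, even from a brand new cluster.
insights = spark.read.parquet("gs://your-durable-bucket/sales-insights/")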

Option 3:

Incremental Processing: If feasible, design your pipeline to process data incrementally. This means instead of processing the entire dataset in one go, you process smaller batches. This reduces the risk and impact of failures.

This approach is also called chunking: cutting the data into equal or similarly sized chunks and processing them in parallel. However, the process is largely manual and involves additional logic to handle the chunking itself, its offsets, reconciliation and so on, which increases the operational overhead of the pipeline.

Incremental processing reduces the impact of failures and also eases the recovery process, as the sketch below illustrates.
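A hedged sketch of one way to implement this, processing one daily chunk at a time and skipping chunks whose output already exists; the paths, column names and the reliance on Spark’s _SUCCESS marker file are assumptions for illustration:

# Sketch: process transactional data one daily chunk at a time.
# On restart, chunks whose output was already written successfully are skipped.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-sales-pipeline").getOrCreate()
days = ["2024-01-01", "2024-01-02", "2024-01-03"]  # normally derived from the source data

jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

for day in days:
    output_path = f"gs://your-output-bucket/insights/{day}/"

    # Skip a chunk that a previous (possibly failed) run already completed.
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(jvm.java.net.URI(output_path), hadoop_conf)
    if fs.exists(jvm.org.apache.hadoop.fs.Path(output_path + "_SUCCESS")):
        continue

    daily = spark.read.parquet(f"gs://your-data-bucket/transactions/date={day}/")
    insights = daily.groupBy("product_id").agg(F.count("*").alias("units_sold"))
    insights.write.mode("overwrite").parquet(output_path)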

Advantages of this approach:

  • Improved fault tolerance → The impact of failure is limited down to the smaller chunk making the system more resilient
  • Efficient resource management → Processing smaller data sets at a time can lead to more efficient use of computational resources, as it avoids overloading the system

Disadvantages of this approach:

  • Complexity in implementation → Implementing incremental processing can be complex, especially in ensuring that each batch is processed correctly and that data consistency is maintained.
  • Possible data skew → If data is not evenly distributed across batches, some batches may take significantly longer to process than others, leading to inefficiencies

In addition, cluster managers like Kubernetes provide automated recovery options as well, which complement the recovery process.

Automated recovery: Use cluster management tools like YARN or Kubernetes, which offer features to automatically restart failed tasks on different nodes.
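On Dataproc, Spark runs on YARN, and the number of times YARN re-attempts a failed Spark application can be set when the job is submitted. A hedged sketch using the Dataproc Python client; the cluster name, bucket path and property values are placeholders, and spark.yarn.maxAppAttempts is a standard Spark-on-YARN property:

# Sketch: submit the pipeline as a Dataproc PySpark job with YARN-level retry settings.
from google.cloud import dataproc_v1

region = "us-central1"
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "ha-sales-pipeline-cluster"},
    "pyspark_job": {
        "main_python_file_uri": "gs://your-code-bucket/pipeline.py",
        "properties": {
            "spark.yarn.maxAppAttempts": "4",  # caps how often YARN re-attempts the application
            "spark.task.maxFailures": "8",
        },
    },
}

operation = job_client.submit_job_as_operation(
    request={"project_id": "your-project-id", "region": region, "job": job}
)
operation.result()  # waits for the job to finish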

Kubernetes automatically performs health checks, compares the current state to the desired state, and keeps applications running by replacing failed pods. This ensures the job does not grind to a halt and instead continues its execution.

However, it is not Kubernetes’ responsibility to maintain the state of the application; that can only be done with techniques like checkpointing.

Monitoring and Alerts: Implement robust monitoring to quickly detect failures. Set up alerts so that your team is notified immediately in case of failures.

Conclusion:

Considering all the options above, checkpointing combined with a managed service can mitigate the risk of infrastructure failures during large-scale data processing that needs to run for multiple days.

Google Cloud Dataproc is one such managed service on GCP, able to run large-volume data processing jobs written with Apache Spark.
