5 Do’s and Don’ts to restart a Hadoop cluster with no downtime

Criteo Labs
Before each step of a rolling restart, we check that:
  • We don’t have more than x % of dead nodes
  • Compute nodes have reported to the master nodes within the last x seconds
  • We don’t have missing blocks
  • We don’t have under-replicated blocks (a sketch of such a check-up follows this list)
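
A minimal sketch of such a check-up, assuming the namenode JMX endpoint is reachable (the hostname, port and 10% threshold below are illustrative placeholders):

```python
# Sketch of a pre-restart health check against the namenode JMX endpoint.
# Hostname, port (9870 on Hadoop 3.x) and the threshold are illustrative.
import json
from urllib.request import urlopen

NAMENODE_JMX = "http://namenode.example.com:9870/jmx"
MAX_DEAD_RATIO = 0.10  # "no more than x % of dead nodes"

def jmx_bean(query):
    with urlopen(f"{NAMENODE_JMX}?qry={query}") as resp:
        return json.load(resp)["beans"][0]

def cluster_is_healthy():
    state = jmx_bean("Hadoop:service=NameNode,name=FSNamesystemState")
    fsns = jmx_bean("Hadoop:service=NameNode,name=FSNamesystem")

    live = state["NumLiveDataNodes"]
    dead = state["NumDeadDataNodes"]
    dead_ratio = dead / max(live + dead, 1)

    # per-node last-contact times can also be read from the NameNodeInfo
    # bean's LiveNodes attribute if you want the "reported within x seconds" check
    return (
        dead_ratio <= MAX_DEAD_RATIO
        and fsns["MissingBlocks"] == 0
        and fsns["UnderReplicatedBlocks"] == 0
    )

if __name__ == "__main__":
    print("healthy" if cluster_is_healthy() else "NOT healthy - do not restart")
```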

5 Don’ts

DON’T restart all nodes at once;

Avoid restarting all datanodes/nodemanagers at the same time. Doing so increases the load on the namenode (by slowing down HDFS calls) and creates a risk of job failure.
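
As an illustration of batching (the host list, batch size, SSH access and service name are assumptions, not our actual tooling):

```python
# Illustrative only: restart datanodes in small batches instead of all at once.
import subprocess

DATANODES = ["dn001.example.com", "dn002.example.com", "dn003.example.com"]
BATCH_SIZE = 2  # keep most of the fleet serving traffic at any time

def restart_datanode(host):
    # ssh into the node and restart the datanode service (systemd assumed here)
    subprocess.run(
        ["ssh", host, "sudo", "systemctl", "restart", "hadoop-hdfs-datanode"],
        check=True,
    )

for i in range(0, len(DATANODES), BATCH_SIZE):
    for host in DATANODES[i:i + BATCH_SIZE]:
        restart_datanode(host)
    # a real rolling restart would re-run the cluster health checks here
    # before moving on to the next batch
```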

DON’T miss important details

Make sure you immediately halt the rolling restart if you don’t see the recently restarted services coming back online.
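
A hedged sketch of that rule, polling the namenode until a restarted datanode reports back and aborting the run otherwise (the timeout, hostname and output parsing are illustrative):

```python
# Sketch: after restarting a datanode, wait for it to report back to the namenode;
# abort the whole rolling restart if it does not.
import subprocess
import sys
import time

def datanode_is_live(host):
    # `hdfs dfsadmin -report -live` lists the datanodes currently seen as live
    report = subprocess.run(["hdfs", "dfsadmin", "-report", "-live"],
                            capture_output=True, text=True, check=True).stdout
    return f"Hostname: {host}" in report

def wait_for_datanode(host, timeout_s=600, poll_s=15):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if datanode_is_live(host):
            return True
        time.sleep(poll_s)
    return False

if not wait_for_datanode("dn001.example.com"):
    sys.exit("dn001 did not come back: halting the rolling restart")
```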

DON’T start with the most critical services

Avoid implementing the rolling restart on all the components at the same time. It’s better to implement it step by step, from less critical to more critical services to ensure safety.

DON’T lose control

Being able to stop the rolling restart quickly in an emergency is crucial. Make sure you implement a way to stop, pause, or manually trigger the rolling restart.
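
For illustration, a kill switch can be as simple as a flag the orchestrator checks between batches (the flag location below is a made-up example):

```python
# Hypothetical kill switch: the orchestrator checks a flag between batches and
# halts if an operator has asked it to stop. The flag path is an assumption.
import os
import sys

STOP_FLAG = "/var/run/rolling-restart/STOP"  # touched by an operator to halt the run

def check_kill_switch():
    if os.path.exists(STOP_FLAG):
        sys.exit("stop flag found, halting the rolling restart")

# called between every batch of restarts:
check_kill_switch()
```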

DON’T rush it

Keep in mind the important role the nodes play in maintaining uptime. Err on the side of caution when doing a restart so that you have sufficient compute active at any one time.

5 Do’s

DO enforce strong constraints

Have a check-up procedure in place to ensure that the cluster is healthy and that there are no more than 10% dead nodes. For master nodes, we first ensure the availability of the standby nodes before restarting a service (on namenodes, a failover is also triggered beforehand).
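
A sketch of that precaution for namenodes, using the standard hdfs haadmin commands (the service IDs nn1/nn2 are placeholders from a typical HA configuration):

```python
# Sketch of the master-node precaution: check that the standby namenode is healthy
# and trigger a failover before restarting the currently active one.
import subprocess

def service_state(service_id):
    out = subprocess.run(["hdfs", "haadmin", "-getServiceState", service_id],
                         capture_output=True, text=True, check=True).stdout
    return out.strip()  # "active" or "standby"

def failover_before_restart(active_id="nn1", standby_id="nn2"):
    if service_state(standby_id) != "standby":
        raise RuntimeError("no healthy standby namenode, refusing to restart the active one")
    # make the standby active so the current active namenode can be restarted safely
    subprocess.run(["hdfs", "haadmin", "-failover", active_id, standby_id], check=True)
```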

DO apply patches on master nodes first

When rolling out a new Hadoop release, ensure that the updated release is installed on the master nodes before updating the other nodes, e.g. datanodes or nodemanagers.

DO use pre-production for testing

Regularly schedule a full rolling restart of a test cluster to ensure everything works well. Any errors can then be caught before going into production.

DO automate restart processes

Use automated scheduling to regularly test the rolling restart. A rolling restart mechanism based on Chef choregraphie can also be triggered by OS-level changes (e.g. a reboot or a network restart); in such cases, the process is automatically performed in a rolling way.
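
Choregraphie itself is a Chef cookbook, so the following is only a language-neutral illustration of the underlying idea: each node serialises its own restart behind a cluster-wide lock, so OS-triggered restarts still roll through one node at a time. The ZooKeeper hosts, lock path and service name are assumptions, not choregraphie's API.

```python
# Illustration of serialised, automated restarts: each node takes a cluster-wide
# lock before restarting its services, so concurrent OS-level triggers
# (reboots, network restarts) are still applied one node at a time.
import socket
import subprocess

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181,zk2.example.com:2181")
zk.start()

lock = zk.Lock("/hadoop/rolling-restart", identifier=socket.gethostname())
with lock:  # only one node in the cluster holds this at a time
    subprocess.run(["sudo", "systemctl", "restart", "hadoop-hdfs-datanode"], check=True)
    # re-run the cluster health checks here before releasing the lock

zk.stop()
```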

DO monitor the restart process

Ensure you have a good way to monitor the cluster in order to be alerted in case of a rolling restart issue.

Building a reliable restart framework

Running a Hadoop cluster is an evolving process and there will always be more to learn and new ways for improvement. In our experience the best way to do this is to work with a testing and optimisation mindset that’s continually asking: ‘how can we do this better?’ Applying this approach has helped us to effectively remove one of the greatest risks from our environment — unscheduled downtime — saving countless hours of engineer time and making substantial cost savings to the wider business.
