Our Hadoop cluster plays a pivotal role in our business operations. Operating a network of 3,000 advertisers, 5,000 publishers and 1.5 billion active shoppers simply wouldn’t be possible without a storage and compute environment that scales. Since beginning our Hadoop journey in 2012 we’ve continued to build on our big data operations, with a strong and ever-growing technical team that handles 300k jobs and 30 million containers per day.
For the Hadoop cluster to continue to meet the business's needs, regular maintenance and updates are essential. The processes in place to ensure all nodes are operating healthily, including pinpointing failures and identifying slow or over-provisioned nodes, are important to us: it only takes a small number of under-performing nodes to affect the performance of the entire cluster.
As our cluster continues to grow, the process of regularly restarting and checking nodes must be conducted with proper diligence; mishandled, it can have serious implications. A poorly handled daily restart, for example, could result in prolonged downtime due to the lag in nodes coming back online. In a worst-case scenario, all nodes could be rendered inactive at once, resulting in significant downtime and effectively bringing our sales operations to a standstill.
It’s with these implications in mind that our engineering team has paid close attention to what’s worked and what hasn’t, and has noted some important lessons along the way.
How we conduct restarts
Based on these learnings, our processes for conducting restarts have become more sophisticated. Previously, restarts were conducted manually, but as our environment grew it became challenging to operate this way. As a result, all restarts are now automated using the configuration-management tool Chef. We use Criteo’s open-source Choregraphie cookbook for Chef, which provides primitives to protect resources and helps orchestrate service downtime. This library is available on our GitHub.
Once a configuration is updated or a patch is applied on a node, the impacted service must be restarted for the change to take effect. Thanks to Chef’s subscription mechanism on resources, the restart of the impacted service is triggered automatically. To ensure that not all nodes restart at the same time, Chef’s service resources are protected by the Choregraphie cookbook, which runs a set of actions around the restart (take a Consul lock, run health checks, restart, release the lock). Depending on the node type, the lock can be taken per node or per rack in order to speed up the restart.
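The lock/check/restart/release sequence can be sketched as follows. This is not the actual Choregraphie code (which is a Chef/Ruby cookbook); it is a minimal Python illustration of the flow, with the lock, health-check, and restart primitives passed in as hypothetical stand-ins for real Consul and Hadoop calls.

```python
# Sketch of the protected-restart flow: acquire a lock, verify cluster
# health, restart the service, and always release the lock. The callables
# are stand-ins for real Consul/Hadoop operations.

def restart_with_protection(node, acquire_lock, check_health, restart, release_lock):
    """Restart `node` only while holding a cluster-wide lock and only
    if the health checks pass; the lock is released in every case."""
    if not acquire_lock(node):
        return False            # another node (or rack) holds the lock
    try:
        if not check_health():
            return False        # cluster unhealthy: skip this restart
        restart(node)
        return True
    finally:
        release_lock(node)      # never leave the lock held

if __name__ == "__main__":
    held = set()
    ok = restart_with_protection(
        "dn-042",
        acquire_lock=lambda n: held.add(n) is None,   # set.add returns None
        check_health=lambda: True,
        restart=lambda n: print(f"restarting {n}"),
        release_lock=lambda n: held.discard(n),
    )
    print("restarted:", ok)
```

The `try`/`finally` mirrors the key property of the real mechanism: even if a restart fails partway, the lock is released so the rolling restart does not deadlock.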
The platform health checks include:
- The node must run the same Hadoop version as the namenode
- We don’t have more than x% of dead nodes
- Compute nodes have reported to the master nodes within the last x seconds
- We don’t have missing blocks
- We don’t have under-replicated blocks
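The checks above can be expressed as a simple predicate over a snapshot of cluster metrics. This is an illustrative sketch only: the metric field names and thresholds are assumptions, not an actual namenode API.

```python
# Hypothetical evaluation of the pre-restart health checks against a
# dict of cluster metrics (field names and thresholds are illustrative).

MAX_DEAD_NODE_RATIO = 0.10   # assumed "x %" dead-node threshold
MAX_REPORT_AGE_S = 30        # assumed "x seconds" report-staleness threshold

def cluster_is_healthy(metrics, expected_version):
    checks = [
        metrics["version"] == expected_version,                # same version as namenode
        metrics["dead_nodes"] / metrics["total_nodes"] <= MAX_DEAD_NODE_RATIO,
        metrics["seconds_since_last_report"] <= MAX_REPORT_AGE_S,
        metrics["missing_blocks"] == 0,
        metrics["under_replicated_blocks"] == 0,
    ]
    return all(checks)

snapshot = {
    "version": "2.8.5", "dead_nodes": 3, "total_nodes": 100,
    "seconds_since_last_report": 5,
    "missing_blocks": 0, "under_replicated_blocks": 0,
}
print(cluster_is_healthy(snapshot, "2.8.5"))  # True: all checks pass
```

In practice such metrics would come from the namenode (e.g. via `hdfs dfsadmin -report` or JMX), but the gating logic is the same: every check must pass before any restart proceeds.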
By automating this process and applying the safety checks, we’ve managed to make restarts seamless while avoiding the risk of downtime. Based on our experience, here are our top do’s and don’ts for handling daily restarts.
DON’T restart all nodes at once
Avoid restarting all datanodes/nodemanagers at the same time. Doing so increases the load on the namenode (slowing down HDFS calls) and creates a risk of job failures.
DON’T miss important details
Make sure you immediately halt the rolling restart if you don’t see the recently restarted services coming back online.
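These first two points can be sketched together: restart nodes in small batches rather than all at once, and halt immediately if a restarted batch does not come back online. A minimal illustration (batch size and the verification callback are hypothetical, not our actual tooling):

```python
# Sketch: restart nodes batch by batch, aborting the whole rolling
# restart as soon as a batch fails to come back online.

def rolling_restart(nodes, restart, came_back, batch_size=2):
    """Restart `nodes` in batches of `batch_size`; raise if any
    restarted batch does not report back healthy."""
    restarted = []
    for i in range(0, len(nodes), batch_size):
        batch = nodes[i:i + batch_size]
        for node in batch:
            restart(node)
        if not all(came_back(n) for n in batch):
            raise RuntimeError(f"halting rolling restart: {batch} did not come back")
        restarted.extend(batch)
    return restarted
```

Keeping the batch small bounds how much compute is offline at any moment, and the early abort ensures a systemic problem (e.g. a bad package) stops after one batch instead of taking down the whole cluster.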
DON’T start with the most critical services
Avoid implementing the rolling restart on all the components at the same time. It’s better to implement it step by step, from less critical to more critical services to ensure safety.
DON’T lose control
Being able to easily stop the rolling restart in an emergency is crucial. Make sure you implement a way to stop, or manually trigger, the rolling restart.
DON’T rush it
Keep in mind the important role the nodes play in maintaining uptime. Err on the side of caution when doing a restart so that you have sufficient compute active at any one time.
DO enforce strong constraints
Have a check-up procedure in place to ensure that the cluster is healthy and that there are no more than 10% dead nodes. For master nodes, we first ensure the availability of standby nodes before restarting the service (on namenodes, a failover is also triggered beforehand).
DO apply patches on master nodes first
In the case of a new Hadoop release, ensure that the updated release is installed on the master nodes before updating the other nodes, e.g. datanodes or nodemanagers.
DO use pre-production for testing
Regularly schedule a full rolling restart of a test cluster to ensure everything works well. Any errors can then be caught before going into production.
DO automate restart processes
Use automated scheduling to regularly test the rolling restart. The rolling restart mechanism based on Chef Choregraphie can also be triggered by OS-level changes (e.g. a reboot or network restart). In such cases, the process is automatically performed in a rolling way.
DO monitor the restart process
Ensure you have a good way to monitor the cluster in order to be alerted in case of a rolling restart issue.
Building a reliable restart framework
Running a Hadoop cluster is an evolving process and there will always be more to learn and new ways for improvement. In our experience the best way to do this is to work with a testing and optimisation mindset that’s continually asking: ‘how can we do this better?’ Applying this approach has helped us to effectively remove one of the greatest risks from our environment — unscheduled downtime — saving countless hours of engineer time and making substantial cost savings to the wider business.
By reviewing and documenting our learnings, we’ve been able to create a solid framework for handling maintenance and upgrades, which will help inform and enhance the next chapter of our Hadoop journey.
Thanks to Nicolas Fraison and Anthony Rabier for making this article happen.