How to retain Hadoop logs longer on data nodes without worrying about high disk usage?

Abhishek Gupta
Whatfix Engineering Blog
6 min read · Jun 4, 2021
Image courtesy: https://techdifferences.com/

With distributed systems come distributed challenges. When we focus extensively on improving one area without paying enough attention to the others, unforeseen challenges are bound to arise, and things can turn ugly pretty quickly if they are not addressed in time.

At Whatfix, we strive to provide improved experiences to both our customers and our fellow engineers. As part of improving one of our infrastructure components, we had to upgrade Hadoop from version 1 to version 3. The performance gains of Hadoop 3 were clearly visible right from day one of the upgrade, but that is a discussion for another time.

Setting up the groundwork

I will be using job and application interchangeably; they mean the same thing in the context of this article.

By default, for any Hadoop job that runs, the associated logs and temporary resources (found in /tmp/hadoop-<user> by default) get deleted almost immediately. However, we wanted to retain the logs for 15 days to address some use cases required by one of the teams that would be using Hadoop for their jobs. We could achieve that by setting the following property in yarn-site.xml. The value is in seconds.

yarn.nodemanager.delete.debug-delay-sec=1296000
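
For reference, the same setting in yarn-site.xml would look roughly like the snippet below; 1296000 seconds corresponds to the 15-day retention we wanted (15 × 24 × 3600).

    <property>
      <name>yarn.nodemanager.delete.debug-delay-sec</name>
      <!-- 15 days, expressed in seconds -->
      <value>1296000</value>
    </property>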

Good so far. But wait! Little did we know we were falling into a self-built trap. The retention duration applies not only to the logs, but also to the temporary data associated with the jobs, which gets stored in /tmp/hadoop-<user>/nm-local-dir/usercache/<user>/appcache/application_*. That is where the problem starts.

The /tmp resource folder stores and tracks the intermediate resources needed by an application for a successful run. It contains job.xml, relevant jar files, container-wise information, map/reduce attempt information, intermediate and final outputs, etc. Depending on the nature of a job, the size of an application's /tmp resources can vary anywhere from a few KBs to several GBs, or even beyond. Moreover, these resources are managed entirely by Hadoop; the user's application or job has no visibility into the processing that eventually leads to bulky resource folders.
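
To get a feel for how much space these per-application folders occupy on a data node, a quick disk-usage check along these lines can help (the path assumes the default layout mentioned above; substitute your own user):

    du -sh /tmp/hadoop-<user>/nm-local-dir/usercache/<user>/appcache/application_*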

The Impact

One fine morning around 4 a.m. IST, we started getting alerts via AppDynamics about low disk space on a couple of data nodes. At first, we thought it was a temporary glitch that might settle down by itself. However, within a matter of minutes, almost all the data nodes started alerting on low disk space. It took us some time to realise that the /tmp folder was consuming more than 90% of the disk space. On digging further, we found that an application was trying to process data downloaded from an external source. Since there was not enough disk space to store the processed data, Hadoop would kill the container and re-launch the application on another data node, without deleting the data from the previous one. The application would then start downloading and processing the data all over again on the next data node. This single application almost brought down the entire Hadoop cluster. As an immediate resolution, we killed the running application and cleared the /tmp data associated with it. The root cause was the lack of sufficient disk space for the application to process its data.

Solution Approaches

We came up with the approaches described below. It is worth discussing the pros and cons of each one before implementation.

  1. Manual cleaning of /tmp resources — As software engineers, we should never resort to this if we are looking for a long-term solution. Besides, manual monitoring and clean-up become difficult as the system scales up.
  2. Reduce the retention duration — As an immediate fix, this approach could work. But as described above, the retention time applies to application logs as well as /tmp resources, so we would have to compromise on the log retention duration too. It would also go for a toss if the number of applications in a given unit of time increased, or if the applications started processing more data. Hence, this is not a scalable solution either. Besides, any change to the configuration files requires a Hadoop restart and incurs additional downtime.
  3. Hadoop-level management — Unfortunately, Hadoop offers no option to date for managing the /tmp resources on their own. If it did, life would have been full of roses!
  4. Moving logs to a separate location — We started working on shipping the logs to Kibana so that we could reduce the retention duration to just a few hours or minutes before Hadoop clears them up. However, we first had to identify the logging patterns for parsing so that the logs could be queried using the Kibana Query Language (KQL). The challenge was that we did not want the parsing to drop any crucial application information, which meant monitoring the logs through Kibana and double-checking that it surfaced everything we expected. Hence, we could not switch entirely to this solution before being fully confident in the delivery through KQL.
  5. Script to clear /tmp on each data node — We felt it would be easier to have a service that periodically removes the /tmp resources of finished applications. We chose to write a Linux service, as it is a lightweight process that does not carry the overhead of restarting any additional component. This service would need to run on each data node.

This article discusses Solution #5 in detail.

Before we discuss the exact solution, let us take a look at the elements involved in it.

  1. Namenode IP address — If the script runs on the same machine as the namenode, then 0.0.0.0 (or localhost) can work for you. We will use this address to hit the namenode server and extract information about each application.
  2. Temporary resource directory location on disk — usually it is under the /tmp folder by default, but if you have overridden it, you need to provide that path. For example, on my machine the path is /tmp/hadoop-abhishek/nm-local-dir/usercache/abhishek/appcache
  3. Terminal states — Hadoop has three terminal application states: FINISHED, KILLED and FAILED. It is easier to check against just these terminal states than against every possible state. They are used to track the status of each application.
  4. Error handling — Interestingly, Hadoop keeps track of only the latest 1000 finished/completed jobs at a time (this limit is also configurable through a property), and it is only for these that we can query the status. Requesting the state of a job that falls beyond the latest 1000 through the URL below results in an exception thrown by Hadoop, so we need to handle this case in our script.
  5. Get the status of an application: http://<namenode IP>:8088/ws/v1/cluster/apps/<application_id>/state
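
As a quick sanity check, the status URL from point 5 can be hit with a plain HTTP call. Strictly speaking, port 8088 is served by the YARN ResourceManager web service, which in this setup is assumed to run on the same host as the namenode. With a hypothetical application id, a successful call returns a small JSON document roughly like this:

    curl http://<namenode IP>:8088/ws/v1/cluster/apps/application_1622790000000_0001/state
    {"state":"FINISHED"}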

Following is a ready-to-use script: Github Link
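
For readers who cannot open the link, the sketch below illustrates the same idea under the assumptions made in this article. It is not the original script; the field names (namenode_ip, application_dir), the example values and the 120-second interval simply mirror the caveats listed further down.

    #!/usr/bin/env python3
    # Hypothetical sketch of the cleanup service described in this article,
    # not the original script linked above.
    import json
    import os
    import shutil
    import time
    import urllib.error
    import urllib.request

    namenode_ip = "10.0.0.1"  # replace with your namenode/ResourceManager IP
    application_dir = "/tmp/hadoop-abhishek/nm-local-dir/usercache/abhishek/appcache"
    terminal_states = {"FINISHED", "KILLED", "FAILED"}
    sleep_seconds = 120  # cleanup interval

    def get_application_state(app_id):
        """Query the YARN REST API for the state of a single application."""
        url = "http://{}:8088/ws/v1/cluster/apps/{}/state".format(namenode_ip, app_id)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return json.load(response).get("state")
        except urllib.error.HTTPError:
            # Hadoop remembers only the latest ~1000 completed applications;
            # older ones raise an error here, so treat them as already finished.
            return "FINISHED"
        except urllib.error.URLError:
            # Server unreachable -- skip this cycle rather than delete blindly.
            return None

    def cleanup_once():
        """Remove the /tmp resources of every application in a terminal state."""
        if not os.path.isdir(application_dir):
            return
        for entry in os.listdir(application_dir):
            if not entry.startswith("application_"):
                continue
            if get_application_state(entry) in terminal_states:
                shutil.rmtree(os.path.join(application_dir, entry), ignore_errors=True)
                print("Removed /tmp resources of {}".format(entry))

    if __name__ == "__main__":
        while True:
            cleanup_once()
            time.sleep(sleep_seconds)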

This script can be run as a standalone application or as a Linux service on every data node. In production, however, it is preferable to run it as a service. We are using systemd (managed via systemctl) to run our script as a service.
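
For illustration, a minimal systemd unit for such a service could look like the following; the unit name, script path and interpreter are hypothetical, so adjust them to wherever you install the script:

    # /etc/systemd/system/hadoop-tmp-cleaner.service (hypothetical name and path)
    [Unit]
    Description=Cleans up /tmp resources of finished Hadoop applications
    After=network.target

    [Service]
    ExecStart=/usr/bin/python3 /opt/hadoop-tmp-cleaner/cleaner.py
    Restart=always
    RestartSec=30

    [Install]
    WantedBy=multi-user.target

On each data node, the service can then be registered and started with systemctl daemon-reload, systemctl enable hadoop-tmp-cleaner and systemctl start hadoop-tmp-cleaner.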

A few caveats before using the above script:

  1. Provide the correct path to your resource directory that is used by Hadoop to manage application resources. (Field = application_dir)
  2. Provide the correct namenode IP address for your environment. (Field = namenode_ip)
  3. The script runs at a 120-second interval. You might not want that, so make sure to change the sleep time based on your requirements.

Having done this, you are all set to use the Hadoop cluster while retaining the Hadoop application logs longer, without worrying about high disk consumption from finished applications.
