Logging in Spark with Log4j. How to customize the driver and executors for YARN cluster mode.

In this article, I’m going to describe several configurations for logging in Spark. There are a lot of posts on the Internet about logging in yarn-client mode, but there is a lack of instructions on how to customize logging for cluster mode (--master yarn-cluster). By default, Spark uses $SPARK_CONF_DIR/log4j.properties to configure Log4j, and the straightforward solution is to change this file. However, this approach may not be acceptable because you lack write permissions. Moreover, it is even worse if you do have permission, because this file is used by every application deployed to the cluster: solving your own issue, you can ruin the day of another cluster user.

Before you start reading, please note that in the scope of this post the terms “container” and “executor” are interchangeable. The sample application has been tested on Spark 1.6.3 and Spark 2.2. You can download it from my Github.

Common custom log4j.properties
Fortunately, there is a pretty simple solution: a custom log4j.properties file can be shipped alongside the application using the --files option. Please keep in mind that in cluster mode the driver runs inside the cluster, not on the client machine, so a local path on the client cannot be used as the path to the file. The configuration file must be accessible to every container, and HDFS solves this task in a natural way. The source code below shows this approach.
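A minimal sketch of such a submission is shown here; the HDFS path, the application class, and the jar name are assumptions for illustration, not from the article:

```shell
# Ship a custom log4j.properties to every container. The file lands in
# each container's working directory, which is on the classpath, so
# Log4j picks it up for the driver and all executors alike.
spark-submit \
  --master yarn-cluster \
  --files hdfs:///user/me/conf/log4j.properties \
  --class com.example.MyApp \
  my-app.jar
```

Because the file keeps the default name log4j.properties, no extra JVM options are needed: it simply shadows the cluster-wide configuration.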

It’s a quite effective and obvious approach which configures your driver as well as the executors. But what if it’s still not flexible enough? Say, we need the DEBUG level for the driver and WARN for the executors.

Custom log4j.properties for driver only
It’s time to introduce the configuration properties spark.driver.extraJavaOptions and spark.executor.extraJavaOptions. They pass additional options to the driver and executor JVMs respectively. The option for our case is -Dlog4j.configuration. The next listing shows how to configure the driver to use a custom configuration while the executors use the default one.
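One possible shape of such a submission (the file names and paths are illustrative):

```shell
# The driver uses the shipped driver-log4j.properties; the executors
# fall back to the default $SPARK_CONF_DIR/log4j.properties.
# A bare file name in -Dlog4j.configuration is resolved as a classpath
# resource, and --files places the file into the container's working
# directory, which is on the classpath.
spark-submit \
  --master yarn-cluster \
  --files hdfs:///user/me/conf/driver-log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=driver-log4j.properties" \
  --class com.example.MyApp \
  my-app.jar
```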

Distinct custom log4j.properties for driver and executors
If you want more flexibility and want to deploy two different configurations for the driver and the executors, you can extend the previous listing:
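A sketch of that extension, again with illustrative file names: two configuration files are shipped, and each JVM type is pointed at its own one.

```shell
# Two different files: one for the driver, one for the executors.
spark-submit \
  --master yarn-cluster \
  --files hdfs:///user/me/conf/driver-log4j.properties,hdfs:///user/me/conf/executor-log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=driver-log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=executor-log4j.properties" \
  --class com.example.MyApp \
  my-app.jar
```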

How to read it easily
Now we can configure logging in any way in a YARN cluster. But it leads us to the next question: “How to read the logged data?” The point is that every Spark/YARN node writes its log locally, which means we have to collect log files from the driver and all executors, distributed among different physical hosts. It looks non-trivial, but here we can rely on YARN, which ships a utility that does the aggregation for us. Just run on the command line

yarn logs --applicationId <appidentifier>

and the aggregated log for all containers will be sent to the console. Please note that the containers are executed on different physical machines and the path /var/spark-log is notional, referring to any local log directory.

Yarn log aggregation workflow

At the same time, it can be overwhelming for jobs with a lot of containers: you’ll lose yourself in this infinity. This is why it’s better to redirect the output to a file and then analyze it.

Someone can ask: “Does it mean that with the same appender all loggers for the same container will flush to the same file? Can I separate Spark logs from application logs and have them in different files?” Yes, it’s possible: just add one more appender that the application logger will write to. This solution raises the next question: where will that appender store the additional log file, and how to read it? Clearly, every container will create one more log file, so it’s necessary to define a path for it, and that seems tricky. If we set a path such as /var/spark-log/my.log, issues are likely because there are no guarantees that this directory is open for writing. Even if it is, it’s quite exhausting to find out where, say, the driver was executed, go to that machine, and manually download the required file. Let’s split this problem into two subproblems:

· Where to write a separate log file

· How to read that file afterwards

The official Spark doc gives a clear answer to both questions: the property spark.yarn.app.container.log.dir refers to the directory where YARN stores logs, so that YARN can effectively collect them and send them to the console. For example, you can configure your additional logger and appender:
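A minimal log4j.properties fragment might look like the following. The logger name com.example.myapp is hypothetical; the file name bowels.log matches the one read back with yarn logs later in the article:

```properties
# Dedicated appender writing next to the other YARN-managed container logs
log4j.appender.myFile=org.apache.log4j.FileAppender
log4j.appender.myFile.file=${spark.yarn.app.container.log.dir}/bowels.log
log4j.appender.myFile.layout=org.apache.log4j.PatternLayout
log4j.appender.myFile.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n

# Application logger (com.example.myapp is illustrative) writes only to
# the dedicated appender; additivity=false keeps it out of the main log.
log4j.logger.com.example.myapp=DEBUG, myFile
log4j.additivity.com.example.myapp=false
```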

and all the magic will be done by YARN itself: it will place the log into the correct directory and, when requested, aggregate it with the remaining logs and output the result via the yarn logs utility mentioned above. Here it’s worth adding the option --show_container_log_info to find the container with the needed file. It can be convenient if the additional appender was added for the driver only.

yarn logs --applicationId application_1518439089919_3998 --show_container_log_info

Knowing the container identifier, the utility can be used one more time to retrieve log data from the specified file

yarn logs --applicationId application_1518439089919_3998 -containerId container_e34_1518439089919_3998_01_000001 -log_files bowels.log

and only the file we are interested in will be printed out. If you know that there is only one file you want to analyze, you can skip the --containerId option and just print out the content of the requested file

yarn logs --applicationId application_1518439089919_3998 -log_files bowels.log
Log output for specific file

It is worth mentioning that logging gives a lot of opportunities to researchers who want to understand how Spark works internally: they can configure loggers for the packages and classes under inspection, set different levels, and redirect logs to separate files to analyze them without mix-ups.

Examples source code can be downloaded from my Github