Spark logging configuration in YARN

Riccardo Iacomini
8 min read · Jun 10, 2018

How to get logging right for Spark applications in the YARN ecosystem

Logging has traditionally been used for troubleshooting, maintenance and monitoring of applications. It has always been perceived as “something you have to do” rather than something you actually want to take care of. With the paradigm shift brought by the Big Data era, the value of log data has been increasingly recognized, and log analysis has become a direct source of learning. The idea of an application that “self-improves” by analyzing its own logs may seem commonplace nowadays, but again, this has only been made possible by acknowledging the central role that data should play.

In this short article I’ll share what I have learned about logging in the Hadoop ecosystem. In particular, we will see how to configure log4j for Spark applications deployed on YARN.

You will find configuration snippets to run a Spark application in YARN mode, having all your logs from driver and executors collected and stored in HDFS.
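As a preview of the kind of setup this article builds toward, here is a sketch of a spark-submit invocation that ships a custom log4j configuration to both the driver and the executors in YARN cluster mode. The file name `log4j.properties` and the application jar path are illustrative assumptions; the `--files` and `extraJavaOptions` flags are standard Spark options.

```shell
# Ship log4j.properties to every container via --files, then point both the
# driver and executor JVMs at it. Paths and jar name are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  my-application.jar
```

In cluster mode the `--files` option distributes the configuration file into each container’s working directory, which is why a bare relative `file:log4j.properties` URI resolves correctly on every node.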

Article outline:

  • Log4j basics
  • YARN basics
  • Configuration example

Log4j basics

Log4j is one of the most popular logging libraries available in the Java ecosystem. It comes…
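To make the discussion concrete, here is a minimal log4j 1.x properties file of the kind Spark reads at startup. It is a sketch modeled on Spark’s default console-logging template; the appender name and pattern are ordinary log4j conventions, not anything Spark-specific.

```properties
# Root logger: log at INFO level and above, sending output to the "console" appender
log4j.rootLogger=INFO, console

# Console appender writing to stderr with a timestamped pattern layout
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Quiet down a chatty third-party package (example logger name)
log4j.logger.org.apache.spark=WARN
```

Per-package logger lines like the last one are how you tune verbosity selectively without changing the root level.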
