How to Assess a Spark Job Using Sparklens

Well, choosing the options for our spark-submit is quite tricky. On the first attempt we can only guess values such as spark.driver.memory, spark.executor.cores, and so on, based on the amount of data to be processed, the logical complexity of our code, the available cluster resources, and similar considerations. After several rounds of trial and error, voila! Our Spark job finishes running smoothly. So, congratulations! But are we sure that the Spark job uses the provided resources efficiently? What if it wastes resources that could have been allocated to other jobs?

Let's talk about efficiency. Wasted resources mean that we spend our money (or maybe our company's, oops) on unproductive computing power. Luckily, Qubole made a Spark package named Sparklens (https://github.com/qubole/sparklens) specifically for profiling Spark jobs, so we can understand our code's scalability and efficiency through metrics such as the performance of each stage, the time spent on the driver and the executors, and even the wasted driver and executor resources.

How to do the magic on Qubole Notebook

First, set up the interpreter:

  • Open the interpreter settings page in the Qubole notebook
  • Open the repository information by clicking the gear icon at the top right of the panel; the repository settings should appear on the screen
  • After that, expand the user interpreter settings, add qubole:sparklens:0.1.2-s_2.11 to the interpreter's artifacts, and click Save
  • Make sure our user interpreter is set as the default interpreter

After that, add some spells to our notebook paragraphs:

  • Put the listener-registration code in the first paragraph (see the first sketch below)
  • Write the Spark code as usual in the paragraphs between the first and the last one
  • Put the report-collection code in the last paragraph (see the second sketch below) and run the notebook
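
The original post shows these two paragraphs as screenshots, so here is a minimal PySpark sketch of what they contain, based on the notebook example in the Sparklens README. The QuboleNotebookListener class, its registerAndGet helper, and getStats come from that README; the startTime bookkeeping and the variable name sparklens_result are choices made here for illustration, so cross-check the exact calls against the README for your Sparklens version.

    # First paragraph: attach the Sparklens listener to the running SparkContext (sc).
    import time

    # registerAndGet() builds the listener on the JVM side and registers it with Spark.
    QNL = sc._jvm.com.qubole.sparklens.QuboleNotebookListener.registerAndGet(sc._jsc.sc())

    # Remember when profiling started, in epoch milliseconds.
    startTime = int(round(time.time() * 1000))

And the last paragraph, which pulls the report out of the listener once the Spark code in between has finished:

    # Last paragraph: collect the Sparklens report after our Spark code has run.
    import time

    # Give the listener a moment to finish processing the final Spark events.
    time.sleep(10)
    endTime = int(round(time.time() * 1000))

    # getStats() returns the Sparklens report for the profiled window as a string.
    sparklens_result = QNL.getStats(startTime, endTime)
    print(sparklens_result)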

Reading the Sparklens Assessment

If we have done the steps above correctly, the variable sparklens_result should contain the Sparklens report as a unicode string, whose sections are described below.
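
Since notebook output is easy to lose, it can also help to keep a copy of the report. A small optional sketch (the file path is just an example):

    import io

    # Persist the report so it can be reviewed outside the notebook; adjust the path as needed.
    report_path = "/tmp/sparklens_report.txt"
    with io.open(report_path, "w", encoding="utf-8") as f:
        f.write(sparklens_result)

    # Quick sanity check: the report should span roughly a hundred lines.
    print(len(sparklens_result.splitlines()))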

The first section (lines 1–65) reports per-stage metrics, which give us information about how many stages our Spark code ran and how each of them performed, the shuffle process, the computing power wasted (core-hours), PRatio, TaskSkew, TaskStageSkew, and other metrics that the Sparklens output explains in its own legend.
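
To make those ratios a bit more concrete, here is an illustrative calculation on made-up numbers (this is not Sparklens code). The formulas paraphrase the legend Sparklens prints with the report: PRatio is tasks in the stage per available core, TaskSkew is the largest task duration over the median task duration, and TaskStageSkew is the largest task duration over the stage duration; trust the legend in your own output if it differs.

    # Hypothetical numbers for a single stage, just to illustrate the ratios.
    task_durations_ms = [1200, 1350, 980, 1100, 7600]   # one straggler task
    available_cores = 8
    stage_duration_ms = 9000                             # wall-clock time of the stage

    sorted_ms = sorted(task_durations_ms)
    median_ms = sorted_ms[len(sorted_ms) // 2]

    p_ratio = len(task_durations_ms) / float(available_cores)            # parallelism vs. cores
    task_skew = max(task_durations_ms) / float(median_ms)                # straggler vs. typical task
    task_stage_skew = max(task_durations_ms) / float(stage_duration_ms)  # straggler vs. whole stage

    # A TaskSkew well above 1.0, or a TaskStageSkew close to 1.0, points at a straggler-dominated stage.
    print(p_ratio, task_skew, task_stage_skew)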

The second section (lines 67–94) models an estimate of how our Spark job would perform with different numbers of executors. On top of that, it also states how long our code spends on the driver versus the executors.

The last section (lines 96–115) covers cluster utilization and, of course, the easiest metrics for us to understand (phew, finally!): the wasted driver and executor compute (core-hours), which can help us adjust the spark.driver.memory and spark.executor.memory we provide in our spark-submit options.
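
For example, if the report shows a lot of wasted executor compute, a natural follow-up is to re-run the job with a smaller footprint and profile it again. A hedged sketch of the knobs involved, with placeholder values rather than recommendations:

    from pyspark import SparkConf

    # Placeholder values: tune these against what your own Sparklens report says.
    conf = (SparkConf()
            .set("spark.executor.memory", "4g")
            .set("spark.executor.cores", "2")
            .set("spark.executor.instances", "4"))

    # Note: spark.driver.memory (like driver cores) generally has to be set at launch time,
    # e.g. via spark-submit or the cluster/interpreter settings, not from inside a running job.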

PS: If you think this technical write-up might mislead you when assessing your Spark job, or if you have any criticism or suggestions, please let me know. Just greet me at misbach.imaduddin@gmail.com