Beginner’s Troubleshooting Guide for Spark (IBM Analytics Engine)

Wendy Wang
IBM Data Science in Practice
Sep 18, 2018

--

By Wanting Wang, IBM Data Science Elite Team
Ricardo Balduino, IBM Data Science Elite Team

In the previous blog, we presented tips to configure Spark clusters in IBM Analytics Engine (IAE). In this blog, we present tips for troubleshooting Spark in IAE.

Access to Ambari, Resource Manager, and Spark UI

For troubleshooting, the Spark UI is extremely helpful, allowing you to monitor what’s going on in Spark. You first need to go through Ambari to reach the Spark UI, as shown in the following steps:

1. Open your AE service, either through Watson Studio Compute Services or IBM Cloud.

Fig 1. Open AE service through Watson Studio Compute Services
Fig 2. Open AE service through IBM Cloud

2. Launch the console (Ambari) and log in with the given username/password. When you input the username and password, make sure there is no leading or trailing whitespace. (It sometimes sneaks in when copying and pasting.)

Fig 3. AE service

3. On the Ambari dashboard, YARN Memory shows how much memory is occupied by all of your running Spark sessions. We want this number to be over 90% so that we’re utilizing AE as much as possible. From here, go to YARN in the menu on the left-hand side.

Fig 4. Ambari Dashboard

4. Choose ResourceManager UI from Quick Links. You need to input the same username/password to access Resource Manager.

Fig 5. YARN -> Quick Links -> ResourceManager UI

5. The resource manager shows a history of all your Spark sessions, both running and finished, that were created with this Spark cluster. In Cluster Metrics (the table at the top), you can find Memory Used (the memory actually allocated, including overhead memory) and Memory Total (the total memory available). Here, the memory used (39.50GB) is almost exactly (3GB + 0.3GB) * 12 = 39.6GB. Note that the VCores numbers here don’t matter, because YARN doesn’t track the vCores allocated to executors properly. Now you can click on ApplicationMaster for your running Spark session to access the Spark UI.

Fig 6. Resource Manager UI
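
To see where these numbers come from in code, here is a minimal sketch in R with sparklyr (the wrapper we use later in this post). It assumes the configuration from our previous blog, 12 executors with 3GB of heap each; the master string "yarn-client" and the exact property values are illustrative assumptions, not IAE requirements.

    library(sparklyr)

    # Illustrative configuration: 12 executors with 3GB of heap each.
    conf <- spark_config()
    conf$spark.executor.instances <- "12"
    conf$spark.executor.memory <- "3g"

    sc <- spark_connect(master = "yarn-client", config = conf)

    # YARN allocates each executor its heap plus an overhead
    # (spark.yarn.executor.memoryOverhead), about 0.3GB here, so
    # Cluster Metrics reports roughly (3 + 0.3) * 12 = 39.6GB used.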

6. The Spark UI shows active jobs currently running, completed jobs, and failed jobs if there are any. From here you can monitor whether Spark is running jobs and whether any jobs are stuck. From the description of a job, you can get a rough idea of which operation is causing the issue. The page is static, not live, so you will need to refresh it to check the latest progress.

Fig 7. Spark UI Jobs

In the Storage tab, you can find all the data you have cached and where it is stored.

Fig 8. Spark UI Storage

In the Executors tab, you’ll see potential GC issues highlighted in red, if there are any.

Fig 9. Spark UI Executors

7. When there’s an error and the details are not returned in R for some reason, check the Jobs tab of the Spark UI for them. As mentioned before, the name of the job gives you a rough idea of which operation failed (though the name is not always clear), and more details can be found inside the job.

Errors & (Potential) Solutions

All the wrapper libraries that interact with Spark (e.g., PySpark in Python, SparkR and Sparklyr in R) ultimately call commands in Spark, and almost all the annoying errors actually come from Spark, not from the wrapper library you use. Here we use errors in Sparklyr (R) as an example, but the same idea applies to errors returned by other libraries too.
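
Because of this, it often pays to look at Spark’s own diagnostics directly. As a minimal sketch, assuming an open sparklyr connection sc (as in the configuration sketch above), two built-in sparklyr helpers surface them without leaving R:

    library(sparklyr)

    # Print the last 100 lines of the underlying Spark log.
    spark_log(sc, n = 100)

    # Open the Spark UI for this session in a browser.
    spark_web(sc)

Note that both only work while the Spark session is alive; when the session itself dies, as in the first error below, you have to fall back on the resource manager.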

Q. Error: Unexpected state in sparklyr backend, terminating connection: failed to invoke spark command

A. This means that the driver JVM shut down abruptly. The message itself tells you nothing about the cause, and the worst part is that the Spark session is terminated by this error, so you can no longer access the Spark log from R. But you can go to the resource manager, open the Spark UI of the corresponding (now finished) Spark session, check which job and which stage failed, and search the details for “caused by” to see what caused the failure in that stage.

Q. java.lang.OutOfMemoryError: GC overhead limit exceeded

A. This is a typical GC issue that you might see in many situations. As usual, the Spark error tells you at a high level what failed, but not what caused the failure. Usually it means there’s not enough executor memory. For example, in the case of random forest modeling, we overcame this error by increasing the executor memory from 3GB to 6GB.
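
As a minimal sketch of that fix, assuming sparklyr as above (the master string is again an illustrative assumption), raise the executor heap in the config before (re)connecting:

    library(sparklyr)

    conf <- spark_config()
    conf$spark.executor.memory <- "6g"  # was "3g" when the GC error occurred

    sc <- spark_connect(master = "yarn-client", config = conf)

The right value depends on your data and model; doubling the executor memory was simply what resolved it in our random forest case.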

Q. Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 2008.0 failed 4 times, most recent failure: Lost task 7.3 in stage 2008.0 (TID 21479, chs-rbg-016-dn001.bi.services.us-south.bluemix.net, executor 4): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$3: (struct<V10:double,V11:double>) => vector)

A. You may have noticed that “caused by” is not always useful; failures to cast variables from one type to another (here, a struct to a vector) are especially confusing. The following checks may be helpful:

  1. Check the variables named in the error, here V10 and V11. sdf_describe() is a useful function for checking the count, distribution, and range of each variable (see the sketch after this list). The count is the most important here: compare these two variables against the variables that precede them in the names you pass to the vector assembler, and see whether their counts differ. A smaller count indicates missing values, which tells you to find out why missing values remain in this variable after imputation (if you imputed). Obviously, modeling on a data frame with too many missing values causes Spark to throw an error.
  2. We saw a similar error message when creating multilayer perceptron models. We were able to fix it once we realized that the number of units in the input layer (in fact, the first number in the layers parameter) must equal the actual width of the expanded input matrix, where dummy coding needs to be taken into consideration.
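
As promised in check 1 above, here is a minimal sketch of both checks, where my_tbl is a hypothetical stand-in for your Spark data frame and V10/V11 are the columns named in the error:

    library(sparklyr)

    # Check 1: per-column summary statistics, one row each for
    # count, mean, stddev, min, and max.
    sdf_describe(my_tbl, cols = c("V10", "V11"))
    # Compare the count row against a column that assembles cleanly; a
    # smaller count means missing values survived imputation.

    # Check 2: in ml_multilayer_perceptron_classifier(), the first element
    # of `layers` must equal the width of the expanded (dummy-coded) input.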

Acknowledgments

We want to thank Vaibhav M Kulkarni, IBM (IAE Performance Specialist) for the help and guidance he provided regarding the use of IBM Analytics Engine, as well as for reviewing this article.
