Migrating Big Data Applications from Hadoop to IBM Cloud Pak for Data — Part 1

Rishi S Balaji
IBM Data Science in Practice
14 min read · Jun 14, 2021
a set of wavy lines ending in hazy looking letters such as B, V, M, N, and T.
Picture credit: anton maksimov juvnsky

Migrating Spark Jobs from HDP to IBM Cloud Pak for Data

This series of blog posts on migrating big data applications from Hadoop to IBM Cloud Pak for Data is being authored with collective input from the IBM Financial Crimes and Insights (FCI) team, based on their experience migrating the FCI product to IBM Cloud Pak for Data.
Special thanks to the reviewers of the series: Rachit Arora, Software Architect for IBM Analytics Engine (Cloud Pak for Data) and IBM Watson Studio Spark Environments, and Srinivasan Muthuswamy, Senior Technical Staff Member, IBM India Software Labs.

Introduction

This article is the first in a series on migrating big data applications running on Hadoop to IBM Cloud Pak for Data. It focuses on migrating Spark jobs from Hadoop to IBM Analytics Engine Powered by Apache Spark on IBM Cloud Pak for Data.

Why move away from Hadoop?

A number of articles have been written about moving away from Hadoop, from both the storage and the compute perspective (please see the references section of this post for links to some of them). From a compute perspective, a single Kubernetes-based cluster simplifies infrastructure management and reduces cost, because one cluster can serve both Spark and non-Spark payloads, whereas Hadoop requires a dedicated cluster for Spark payloads and separate options for things like user interfaces and developer tools. From a storage perspective, the move from HDFS-based storage to cloud-based storage is driven by the cost effectiveness, scalability, and durability of the latter.

IBM Cloud Pak for Data

IBM Cloud Pak for Data simplifies and automates how data is collected, organized, and analyzed by businesses that want to infuse AI across their organization. The Analytics Engine powered by Apache Spark (hereafter referred to as Analytics Engine) is an add-on that provides the compute engine needed to run analytical and machine learning jobs on the Kubernetes cluster hosted on IBM Cloud Pak for Data. The Analytics Engine provides a single central point of control for creating and managing Spark environments in the cloud. It supports all essential features such as:

  • token-based platform authentication/authorization
  • programmatic REST interfaces
  • job management UI
  • seamless integration with various tools from IBM Watson Platform
  • support for various S3-based data storage options such as Cloud Object Storage (COS) and OpenShift Container Storage (OCS), in addition to the traditional NFS and HDFS

Assumptions

This post describes some of the key steps that can help accelerate the migration of Spark jobs from Hadoop to IBM Cloud Pak for Data. On the Hadoop side, it refers to the Hortonworks Data Platform (HDP), one of the popular platforms for running Spark payloads on HDFS. On the Cloud Pak for Data side, it refers to NFS as the storage option.

Intended Audience and Pre-requisites

This blog is intended for developers with beginner- to intermediate-level experience using Spark on HDP (Yarn) who are getting started with migrating Spark jobs to IBM Cloud Pak for Data.

Note: This document assumes a basic understanding of running Spark jobs on HDP and on Analytics Engine Powered by Apache Spark. Please see the references section for more information on how to install and use Analytics Engine Powered by Apache Spark on IBM Cloud Pak for Data.

The contents of this blog are based on IBM Cloud Pak for Data 3.5.2 (Spark version 2.4) and HDP version 3.0.

Note: At the time of this writing, the Analytics Engine powered by Apache Spark on IBM Cloud Pak for Data supports Spark 2.4, Spark 2.4.7, and Spark 3.0.

Architectural Differences between Yarn and Analytics Engine

Before getting into the details of the actual job migration steps, it is essential to understand the key architectural differences between Yarn and the Analytics Engine with respect to Spark job submissions. The following table summarizes the key differences:

a screenshot of a table with headers “Topic”, “HDP”, and “IBM Cloud Pak for Data”
For an accessible version of this table, see here

The following figure provides a high level view of these differences :

A diagram showing the mapping between Spark Job on HDP with Yarn/HDFS and Spark Job on IBM Cloud Pak for Data, IBM Analytics Engine, and NFS
Figure: Spark cluster on Yarn vs Analytics Engine Powered by Apache Spark — high level view

The key points to note from the above figure are :

  • A Spark job is typically submitted to Yarn on HDP using the spark-submit command-line tool (or Livy APIs), whereas on IBM Cloud Pak for Data, the job is submitted to the Analytics Engine using REST APIs provided by the engine. (Note that an upcoming version of the Analytics Engine will support submitting jobs in a manner similar to spark-submit.)
  • When a Spark job is submitted on HDP, Yarn schedules and runs the job on a static set of nodes in the Yarn cluster. This cluster is set up during the installation of HDP, and thereafter all jobs run across it. On IBM Cloud Pak for Data, when a Spark job is submitted, the Analytics Engine creates a set of pods (one for the master and one for each worker); each Spark job gets its own set of pods and runs on its own cluster. This dynamically created cluster is deleted after the job completes (see the sketch after this list).
  • Application data is typically stored in HDFS or HBase on HDP. The Analytics Engine supports various storage options like Cloud Object Storage, OpenShift Container Storage, NFS and HDFS.
  • Application jars/Python code and their dependencies are stored either in the operating system file system or on the HDFS on HDP. On IBM Cloud Pak for Data, these are all stored on the chosen storage class (for example NFS). The Analytics Engine makes them available to the Spark job at run time.
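
To see this behavior first-hand, you can watch the pods that the Analytics Engine creates and tears down around a job run. This is only an illustrative sketch: the OpenShift project name is an assumption, so substitute the project where Cloud Pak for Data is installed.

# Watch the driver/executor pods that the Analytics Engine creates for a running job
oc get pods -n <cpd-namespace> -w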

Note : This blog focuses specifically on NFS as the storage class for the Spark jobs on the Analytics Engine. For information on other supported storage classes, please refer to the links in the references section of this post.

Steps involved in migrating a Spark job from HDP to Cloud Pak for Data

The key steps involved in this migration process are as follows:

  1. Make the application executable (jar/Python code) available to Spark
  2. Make the dependent jars and libraries available to Spark (Note: this involves several scenarios, explained further in this blog)
  3. Relocate application data from HDFS to storage on IBM Cloud Pak for Data
  4. Relocate log4j properties file
  5. Map the spark-submit command line to the Analytics Engine payload
  6. Cloud cluster sizing considerations

1. Make the application executable (jar/Python code) available to Spark

___________________________________________________________________
On HDP

Place the jar on HDFS :
hdfs dfs -put /root/myJob.jar /user/hdfsuser/myJob.jar
The jar can alternatively be placed on the OS file system as well.
____________________________________________________________________
____________________________________________________________________
On IBM Cloud Pak For Data
1) Obtain the platform token:
curl -i -k -X GET "$CP4D_URL/v1/preauth/validateAuth" -H "username: admin" -H "password: password"

2) Use the token from the above curl result in the following command to copy the Spark job jar to the Spark application volume:
curl -ik -X PUT "https://internal-nginx-svc:12443/zen-volumes/fciiappvolume/v1/volumes/files/myApp%2FmyJob.jar" -H "Authorization: Bearer $TOKEN" -H 'cache-control: no-cache' -H 'content-type: multipart/form-data' -F 'upFile=@/root/myJob.jar'
___________________________________________________________________

A Spark job submission requires a reference to the location of the application jar file. On the HDP cluster, this file is typically located at a path of choice on the operating system's file system on the master and worker nodes. The file can also be stored on HDFS and referenced with an hdfs:// path during job submission.

On IBM Cloud Pak for Data, the application jar needs to be moved to the mount folder on the (NFS) application volume that has been set up for the Spark instance of the Analytics Engine. The Analytics Engine mounts this volume onto the master and worker pods of the Spark job.
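
Because the same two-step sequence (obtain a platform token, then PUT the artifact to the volume files API) is repeated for every jar, library, and data file in the sections that follow, it can be convenient to wrap it in a small shell helper. The sketch below only reuses the endpoints shown in this post; the jq-based extraction of the accessToken field, the use of the platform route for the upload, and the variable defaults are assumptions to adapt to your environment.

#!/bin/bash
# Sketch: reusable upload helper for the zen-volumes files API (assumptions noted above).
CP4D_URL="https://<cp4d-host>"    # platform URL (assumption: set to your cluster route)
VOLUME="fciiappvolume"            # application volume configured for the Spark instance

get_token() {
  # Assumes the validateAuth response contains an "accessToken" field and that jq is installed
  curl -s -k -X GET "$CP4D_URL/v1/preauth/validateAuth" \
       -H "username: admin" -H "password: password" | jq -r '.accessToken'
}

upload_to_volume() {
  local local_file="$1"    # e.g. /root/myJob.jar
  local target_path="$2"   # URL-encoded path on the volume, e.g. myApp%2FmyJob.jar
  curl -ik -X PUT "$CP4D_URL/zen-volumes/$VOLUME/v1/volumes/files/$target_path" \
       -H "Authorization: Bearer $(get_token)" \
       -H 'content-type: multipart/form-data' \
       -F "upFile=@$local_file"
}

# Example usage:
# upload_to_volume /root/myJob.jar myApp%2FmyJob.jar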

2. Migrate dependent jars/Python libraries

Migrate Java/Scala jar dependencies

___________________________________________________________________
On HDP

Place the jar on HDFS :
hdfs dfs -put /root/myDependency.jar /user/hdfsuser/myDependency.jar
The jar can alternatively be placed on the OS file system as well.
____________________________________________________________________
____________________________________________________________________
On IBM Cloud Pak For Data
1) Obtain the platform token:
curl -i -k -X GET "$CP4D_URL/v1/preauth/validateAuth" -H "username: admin" -H "password: password"

2) Use the token from the above curl result in the following command to copy the dependent jar to the Spark application volume:
curl -ik -X PUT "https://internal-nginx-svc:12443/zen-volumes/fciiappvolume/v1/volumes/files/myApp%2FmyDependency.jar" -H "Authorization: Bearer $TOKEN" -H 'cache-control: no-cache' -H 'content-type: multipart/form-data' -F 'upFile=@/root/myDependency.jar'
____________________________________________________________________

The dependent jars for Java and Scala jobs are typically handled the same way as the application jar. To migrate the jar dependencies of a Spark job, copy the dependent jars from the HDP master node and place them under a location of choice on the application volume configured for the Spark job, as shown in the sketch below.
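
Once a dependency is on the application volume, the job also needs to be told where to find it at run time. One option (a sketch, not the only approach) is to reference the mounted path through standard Spark properties in the conf section of the job payload described in Section 5; the /zen-volume-home/myApp prefix below assumes the volume is mounted at /zen-volume-home/ as shown later in this post.

"conf": {
  "spark.jars": "/zen-volume-home/myApp/myDependency.jar",
  "spark.driver.extraClassPath": "/zen-volume-home/myApp/myDependency.jar",
  "spark.executor.extraClassPath": "/zen-volume-home/myApp/myDependency.jar"
}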

Handling Python Library dependencies

On a typical HDP setup, the Python libraries required by the Spark jobs are installed for a specific version of Python through the pip utility. The libraries are made available across all the nodes in the cluster. The PYSPARK_PYTHON environment variable under Spark2-env configuration in the Ambari console points to the location of this version of Python.

On IBM Cloud Pak For Data, the Analytics Engine comes with the Python runtime, the PySpark libraries, and many other Python libraries, such as scikit-learn and pandas, that are commonly used in data analytics applications.
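
To decide which of the scenarios below applies to each library, it helps to first inventory what the HDP cluster's Python provides and at which versions. A minimal sketch, assuming the interpreter that PYSPARK_PYTHON points to is /usr/local/bin/python3.7 (an example path):

# On an HDP node, list the packages installed for the Python used by Spark
/usr/local/bin/python3.7 -m pip list
# Check the exact version of a specific library before comparing it with Cloud Pak for Data
/usr/local/bin/python3.7 -m pip show urllib3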

Scenario 1: The required library already exists in IBM Cloud Pak for Data

___________________________________________________________________
On HDP

pip install urllib3
___________________________________________________________________
____________________________________________________________________
On IBM Cloud Pak For Data
No action required
____________________________________________________________________
  • First, check whether the required library is already available on Cloud Pak for Data. If it is (make sure to check for the right version), nothing more needs to be done to make the library available to the Spark job.

Scenario 2: The required library does not exist in IBM Cloud Pak for Data and has no .so library dependencies:

___________________________________________________________________
On HDP

pip install urllib3
___________________________________________________________________
____________________________________________________________________
On IBM Cloud Pak For Data
1) Package the library on the HDP master node:
cd /usr/local/lib/python3.7/site-packages/
tar -czvf /root/urllib3.tar.gz urllib3
scp /root/urllib3.tar.gz root@ocbasevmhost:/root/

2) Obtain the platform token:
curl -i -k -X GET "$CP4D_URL/v1/preauth/validateAuth" -H "username: admin" -H "password: password"

3) Use the token from the above curl result in the following command to copy the library tar to the Spark application volume:
curl -k -X PUT "$CP4DROUTE/zen-volumes/myappvolume/v1/volumes/files/pylibs%2Furllib3.tar.gz?extract=true" -H "Authorization: Bearer $TOKEN" -H 'Content-Type: multipart/form-data' -F upFile='@/root/urllib3.tar.gz'
____________________________________________________________________
  • If the required library is not available in Cloud Pak for Data, and the library is a simple one that does not have too many dependencies and no C library dependencies (.so files), the library folder can be copied directly from its location on the HDP master and saved on the application volume that is being used by the Spark jobs (see the payload sketch after Scenario 3 for how to put it on the Python path).

Scenario 3: The required library does not exist in IBM Cloud Pak for Data and has .so library dependencies

___________________________________________________________________
On HDP

pip install urllib3
___________________________________________________________________
____________________________________________________________________
On IBM Cloud Pak For Data
1) Install the library in a Conda installation and package it from that installation's site-packages folder. The Conda version needs to be the same as the one available on the Spark instance of IBM Cloud Pak for Data:
cd /usr/local/lib/python3.7/site-packages/
tar -czvf /root/urllib3.tar.gz urllib3
scp /root/urllib3.tar.gz root@ocbasevmhost:/root/

2) Obtain the platform token :
curl -i -k -X GET "$CP4D_URL/v1/preauth/validateAuth" -H "username: admin" -H "password: password"

3) Use the token from the above curl result in the following command to copy the library tar to the Spark application volume:
curl -k -X PUT "$CP4DROUTE/zen-volumes/myappvolume/v1/volumes/files/pylibs%2Furllib3.tar.gz?extract=true" -H "Authorization: Bearer $TOKEN" -H 'Content-Type: multipart/form-data' -F upFile='@/root/urllib3.tar.gz'

____________________________________________________________________

If the library has too many dependent libraries and/or depends on C libraries, it will have to be installed into a Conda installation, and the library folder then copied from that Conda installation to the application volume used by the Spark job. Note that this Conda installation should be the same version as the one used by the Analytics Engine.
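
In Scenarios 2 and 3, the copied packages also need to be visible on the Python path of the driver and executors. The fragment below is a hedged sketch of one way to do this in the job payload described in Section 5, assuming the libraries were extracted to a pylibs folder on the volume mounted at /zen-volume-home/; whether the env section alone reaches the executors can vary, hence the duplicate spark.executorEnv entry.

"env": {
  "PYTHONPATH": "/zen-volume-home/pylibs"
},
"conf": {
  "spark.executorEnv.PYTHONPATH": "/zen-volume-home/pylibs"
}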

Scenario 4: The required library exists in IBM Cloud Pak for Data but the version is different.

___________________________________________________________________
On HDP

pip install urllib3
___________________________________________________________________
____________________________________________________________________
On IBM Cloud Pak For Data
Use the available version of the library. Refactor application code if required.
____________________________________________________________________

When a different version of the required library is available on Cloud Pak for Data, the required version can be made available using the instructions in Scenario 2 or 3 as applicable, as long as the library is not used internally by Spark. If Spark itself depends on the already existing version, introducing a second version may cause runtime errors when the job runs. To avoid this issue, it is recommended to refactor the job to use the version of the library that is already available in the Analytics Engine.

3. Relocate application data from HDFS to storage on IBM Cloud Pak for Data

___________________________________________________________________
On HDP

Copy the data on to HDFS :
hdfs dfs -put /root/appData.csv /user/hdfsuser/data/appData.csv
___________________________________________________________________
____________________________________________________________________
On IBM Cloud Pak For Data
1) Obtain the platform token:
curl -i -k -X GET "$CP4D_URL/v1/preauth/validateAuth" -H "username: admin" -H "password: password"

2) Use the token from the above curl result in the following command to copy the application data to the Spark application volume:
curl -ik -X PUT "https://internal-nginx-svc:12443/zen-volumes/fciiappvolume/v1/volumes/files/myApp%2Fdata%2FappData.csv" -H "Authorization: Bearer $TOKEN" -H 'cache-control: no-cache' -H 'content-type: multipart/form-data' -F 'upFile=@/root/appData.csv'
____________________________________________________________________

Data that is stored on HDFS will have to be moved to NFS or to S3-compatible object storage such as Cloud Object Storage or OCS on IBM Cloud Pak for Data. Migrating the data to NFS is relatively straightforward: the same path used on HDFS needs to be created on the application volume used by the Spark jobs running on Cloud Pak for Data.

Any application code that refers to a file using an hdfs:// URL will have to change to use file:// instead (for example, hdfs:///user/hdfsuser/data/appData.csv becomes file:///zen-volume-home/myApp/data/appData.csv on the mounted volume). If the code is written generically to detect the file system, no change is required.

Note: It is also common practice to use S3-compatible cloud storage in place of NFS. For information on supported storage options other than NFS, please refer to the links in the references section of this blog. Note as well that the Analytics Engine can work with HDFS if the data remains remotely located on HDFS.
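
If the data is moved to an S3-compatible store such as Cloud Object Storage instead of NFS, the connection details can be passed through standard Hadoop s3a properties in the conf section of the job payload. The fragment below is only a sketch: the endpoint, bucket, and credential values are placeholders, and the recommended settings are covered in the Object Storage link in the references section.

"conf": {
  "spark.hadoop.fs.s3a.endpoint": "https://s3.<region>.cloud-object-storage.appdomain.cloud",
  "spark.hadoop.fs.s3a.access.key": "<access-key>",
  "spark.hadoop.fs.s3a.secret.key": "<secret-key>",
  "spark.hadoop.fs.s3a.path.style.access": "true"
}

The application would then read the data with an s3a:// URL, for example s3a://<bucket>/data/appData.csv.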

4. Configuring log4j properties for Spark Jobs

___________________________________________________________________
On HDP

1.) Create and save the log4j.properties file to <HDFS_CONF_DIR>/log4j.properties. A template located in <SPARK_CONF_DIR> can be used as a starting point.

2.) Use the --files option in spark-submit command to specify the location of the log4j.properties file:
/usr/bin/spark-submit --class com.ibm.MySparkMain --files $HDFS_CONF_DIR/log4j.properties --master yarn --deploy-mode cluster --driver-memory 5G --executor-memory 3G /root/myApp.jar

Note: The location of the log4j.properties can be changed to any location on the file system, as long as the user running the Spark job has permissions on the file and the file is accessible on all the nodes in the cluster. Separate properties files for the driver and executors can be provided using the -Dlog4j.configuration option in spark-submit, as follows:
/usr/bin/spark-submit --class com.ibm.MySparkMain --files $HDFS_CONF_DIR/log4j-driver.properties,$HDFS_CONF_DIR/log4j-executor.properties --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-driver.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-executor.properties" --master yarn --deploy-mode cluster --driver-memory 5G --executor-memory 3G /root/myApp.jar
___________________________________________________________________
____________________________________________________________________
On IBM Cloud Pak For Data
1) scp <SPARK_CONF_DIR>/log4j.properties root@ocbasevmhost:/root/

2) Obtain the platform token :
curl -i -k -X GET "$CP4D_URL/v1/preauth/validateAuth" -H "username: admin" -H "password: password"

3) Use the token from the above curl result in the following command to copy the log4j properties file to the Spark application volume:
curl -ik -X PUT "https://internal-nginx-svc:12443/zen-volumes/fciiappvolume/v1/volumes/files/myApp%2Flog4j%2Flog4j.properties" -H "Authorization: Bearer $TOKEN" -H 'cache-control: no-cache' -H 'content-type: multipart/form-data' -F 'upFile=@/root/log4j.properties'

4) Use the following config in the Spark payload to specify the location of the log4j.properties file:
"conf": {
"spark.driver.extraJavaOptions": "-Dlog4j.configuration = /zen-volume-home/myApp/log4j/log4j.properties",
"spark.executor.extraJavaOptions": "-Dlog4j.configuration = /zen-volume-home/myApp/log4j/log4j.properties"
}

____________________________________________________________________

On an HDP environment, the log4j properties template is located in the Spark installation folder (<SPARK_CONF_DIR>/log4j.properties). This file has to be edited and its path provided in the --files option of spark-submit. Separate files can be provided for the driver and executors using the -Dlog4j.configuration option as shown in the code block above.

When the application is migrated to Cloud Pak for Data, the log4j.properties file needs to be copied from HDP to a location under the application volume, such as /zen-volume-home/myApp/log4j/log4j.properties. The location of this file has to be provided in the Spark job submission payload to the Analytics Engine using the -Dlog4j.configuration option as shown in the code block above. Note that, unlike spark-submit on Yarn, there is no --files option here. Please refer to Section 5 below for a full job submission payload.

The log4j properties will take effect and the desired log level can be seen in the Spark job logs.

5. Map the spark-submit command line for Yarn to the Analytics Engine payload

___________________________________________________________________
On HDP


/usr/bin/spark-submit --class com.ibm.MySparkMain --files $HDFS_CONF_DIR/log4j-driver.properties,$HDFS_CONF_DIR/log4j-executor.properties --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-driver.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-executor.properties" --master yarn --deploy-mode cluster --driver-memory 5g --executor-memory 3g --num-executors 3 --driver-cores 2 --executor-cores 2 /root/myApp.jar
___________________________________________________________________
____________________________________________________________________
On IBM Cloud Pak For Data
{
"engine": {
"size": {
"driver_size": {
"memory": "5g",
"cpu": 2
},
"num_workers": 3,
"worker_size": {
"memory": "3g",
"cpu": 2
}
},
"conf": {
"spark.driver.extraJavaOptions": "-Dlog4j.configuration = /zen-volume-home/myApp/log4j/log4j.properties",
"spark.executor.extraJavaOptions": "-Dlog4j.configuration = /zen-volume-home/myApp/log4j/log4j.properties"
},
"type": "spark",
"env": {
"my_env_var": "value"
},
"volumes": [{
"mount_path": "/zen-volume-home/",
"volume_name": "myappvolume"
}]
},
"mode": "local",
"main_class": "com.ibm.MySparkMain",
"application_jar": "/root/myApp.jar",
"application_arguments": []
}
____________________________________________________________________

The code block above shows an example of spark-submit on Yarn and a corresponding payload for the Analytics Engine on IBM Cloud Pak for Data. The following figure shows the mapping between the two. Note that it is not an exhaustive reference of all the supported parameters, but it covers the most commonly used ones.

a diagram showing a mapping between an Analytics engine Job Submission Payload and HDP spark-submit
Figure : Yarn spark-submit to Analytics Engine Payload

The key points to note in the above figure are :

  • The sizing parameters are straightforward except for differences in naming.
  • The conf section takes the same set of configuration parameters supported by the --conf option of spark-submit. Note that some items, such as --files and --jars, that are supported by spark-submit on Yarn are handled through the conf section in the Analytics Engine payload.
  • The env section provides a mechanism to pass environment variables to the Spark job running on IBM Cloud Pak for Data. On HDP, this is done by setting the variables in Ambari (under the Spark config section).
  • Note that the volumes section is specific to the Analytics Engine and has no corresponding mapping in spark-submit on Yarn.

For a complete listing of the Analytics Engine payload, please refer to the references section of this blog.
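
To actually submit this payload, save it to a file and POST it to the Spark jobs endpoint of your Analytics Engine instance with a platform token. This is a hedged sketch: $JOBS_API_ENDPOINT is a placeholder for the jobs endpoint of your instance (see the Analytics Engine documentation in the references for the exact URL for your release), and the jq-based token extraction assumes the validateAuth response contains an accessToken field.

# Obtain a platform token (same call as in the earlier sections)
TOKEN=$(curl -s -k -X GET "$CP4D_URL/v1/preauth/validateAuth" \
             -H "username: admin" -H "password: password" | jq -r '.accessToken')

# Submit the payload shown above (saved as payload.json) to the jobs endpoint
curl -k -X POST "$JOBS_API_ENDPOINT" \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d @payload.json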

Cluster Sizing Considerations

When the Spark payload is moved from HDP to IBM Cloud Pak for Data, the corresponding resource (CPU/memory) requirements move with it, so the IBM Cloud Pak for Data cluster needs to be sized to handle the Spark job requirements. For example, the payload in Section 5 alone needs at least 8 vCPU and 14 GB of memory for its pods (one driver with 2 CPU/5 GB and three workers with 2 CPU/3 GB each). In addition, the cluster needs to provide resources for the IBM Cloud Pak for Data control plane and the Analytics Engine control plane; their requirements can be found in the respective product documentation.

Conclusion

This blog showed the steps involved in migrating a Spark job deployed on Yarn (HDP) to IBM Cloud Pak for Data. It also provided some tips on sizing the IBM Cloud Pak for Data cluster to run the migrated Spark jobs.

Thanks for reading this blog post. Begin your technology transformation journey from Hadoop to IBM Cloud Pak for Data starting with your Spark payloads! This is the first in the series of posts on migrating enterprise applications from HDP to IBM Cloud Pak for Data. Stay tuned for more!

References

Analytics Engine Powered by Spark Documentation
Efficient way to connect to Object storage in IBM Watson Studio — Spark Environments
System Requirements for IBM Cloud Pak for Data
Cloudera Documentation
Spark Documentation
Is it the death of Hadoop
Why is Fortune 500 dumping Hadoop

Rishi S Balaji
IBM Data Science in Practice

Rishi is an Application Architect at IBM Cloud and Cognitive Software