How to run jobs using Analytics Engine powered by Apache Spark on IBM Cloud Pak for Data

Latha
Dec 3, 2019 · 7 min read


IBM® Cloud Pak for Data is a cloud-native solution that enables you to put your data to work quickly and efficiently. Your enterprise has data. Lots of data. You need to use your data to generate meaningful insights that can help you avoid problems and reach your goals.

Your data is useless if you can’t trust it or access it. Cloud Pak for Data lets you do both by enabling you to connect to your data, govern it, find it, and use it for analysis. The Analytics Engine powered by Apache Spark service in IBM Cloud Pak for Data can be used to run a variety of workloads on your IBM Cloud Pak for Data cluster: Spark applications that run Spark SQL, data transformation jobs, and data science and machine learning jobs, all without using Watson Studio.

In this blog, we will learn how to set up an instance of Analytics Engine powered by Apache Spark on IBM Cloud Pak for Data and run Spark jobs on it.

Spark instance creation using the browser interface

Sign in to the home page of IBM Cloud Pak for Data or sign up as a new user.

After signing in, you will land on the home page. Click on Services.

Search for Analytics Engine in the search bar.

Click the three dots next to Analytics Engine powered by Apache Spark and click Provision instance.

For Spark applications that run using Analytics Engine powered by Apache Spark, a common way to reference the Spark job, the input data, or the output data is through external storage volumes that you can manage by using the IBM Cloud Pak for Data volume API.

A Spark instance requires a volume to store logs, such as the Spark master, worker, and driver logs, and Spark events.

Either create a new volume by entering a name for it, or use an existing volume for the Spark instance, and click Next.

Enter a name for the Spark instance and click Provision at the top right.

A Spark instance named myspark is now created. Click the three dots at the end of the myspark row to manage access, view details, or delete the instance.

The instance details page shows the status of the Spark instance, the endpoint to submit Spark jobs, a link to view the Spark history server, and the access token.

Spark instance creation using REST APIs

We can also create a Spark instance programmatically instead of using the UI, by following the steps below.

Generate an access token and save it in the TOKEN variable:

curl -i -v -k -X GET  https://<your_IBM_CP4D_url>/v1/preauth/validateAuth   -H 'password: <password>'  -H 'username: <user_name>'
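The response body contains an accessToken field. One way to save it in the TOKEN variable is the one-liner below, a minimal sketch that assumes the jq JSON processor is installed (the -i and -v flags are dropped so that only the JSON body is piped to jq):

TOKEN=$(curl -s -k -X GET https://<your_IBM_CP4D_url>/v1/preauth/validateAuth -H 'password: <password>' -H 'username: <user_name>' | jq -r .accessToken)
echo "$TOKEN"   # confirm that the token was captured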

Create the instance volume myvol, which will be used to store Spark logs and events:

curl -vk -iv -X POST "https://<your_IBM_CP4D_url>/zen-data/v2/serviceInstance" -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json'  -d '{
  "createArguments": {
    "metadata": {
      "storageClass": "nfs-client",
      "storageSize": "2Gi"
    },
    "resources": {},
    "serviceInstanceDescription": "volume 1"
  },
  "preExistingOwner": false,
  "serviceInstanceDisplayName": "myvol",
  "serviceInstanceType": "volumes",
  "serviceInstanceVersion": "-",
  "transientFields": {}
}'

Create the Spark instance mySpark:

curl -i -k -v  -X POST -H 'Content-Type: application/json' https://<your_IBM_CP4D_url>/zen-data/v2/serviceInstance -H "Authorization: Bearer $TOKEN" -d '{
  "serviceInstanceType": "spark",
  "serviceInstanceDisplayName": "mySpark",
  "serviceInstanceNamespace": "icpd-lite",
  "serviceInstanceVersion": "1.0.0.0",
  "preExistingOwner": false,
  "createArguments": {
    "metadata": {
      "volumeName": "myvol",
      "storageClass": "",
      "storageSize": ""
    }
  },
  "parameters": {},
  "serviceInstanceDescription": "Sample for instance creation",
  "metadata": {},
  "ownerServiceInstanceUsername": "",
  "transientFields": {}
}'

Response

{"_messageCode_":"200","id":"1574964929883","message":"Started provisioning the instance"}

Submit Spark jobs

To submit a Spark job, copy and save the endpoint to submit Spark jobs and the access token from the instance details page. They can also be accessed from the navigation menu on the IBM Cloud Pak for Data home page: click My instances and view the details as shown previously.

We cannot use the volume (myvol) that is associated with the Spark instance myspark for application and dataset storage. Either create a new volume or use an existing volume other than the Spark instance volume to upload the Spark application and datasets.

Generate a token first using the curl below:

curl -i -v -k -X GET  https://<your_IBM_CP4D_url>/v1/preauth/validateAuth   -H 'password: <password>'  -H 'username: <user_name>'

Save the accessToken from the above curl response to a variable called TOKEN, for example with the same jq extraction shown earlier.

Note: user_name and password in the above curl are your IBM Cloud Pak for Data sign-in details.

By using the volume API, we can create one or more volumes of the required sizes, upload data and applications to them, and then pass the volume names as parameters in the Spark jobs API.

To create a new volume named appvol, run the curl below:

curl -vk -iv -X POST "https://<your_IBM_CP4D_url>/zen-data/v2/serviceInstance" -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json'  -d '{
  "createArguments": {
    "metadata": {
      "storageClass": "nfs-client",
      "storageSize": "2Gi"
    },
    "resources": {},
    "serviceInstanceDescription": "volume 1"
  },
  "preExistingOwner": false,
  "serviceInstanceDisplayName": "appvol",
  "serviceInstanceType": "volumes",
  "serviceInstanceVersion": "-",
  "transientFields": {}
}'

Note: The nfs-client value of the storageClass parameter in the above payload can be another storage class name, such as portworx, or whatever storage class name has been provided to you.

Start the file server on the volume appvol, where you want to upload the application file and dataset, using the curl below:

curl -v -ik -X POST 'https://<your_IBM_CP4D_url>/zen-data/v1/volumes/volume_services/appvol' -H "Authorization: Bearer $TOKEN" -d '{}' -H 'Content-Type: application/json' -H 'cache-control: no-cache'

Note: Replace appvol with your volume name in the URL above.

Upload your application to the directory python/app on the volume appvol using the curl below:

curl -v -ik  -X PUT 'https://<your_IBM_CP4D_url>/zen-volumes/appvol/v1/volumes/files/python%2Fapp%2FSparkify.py'  -H "Authorization: Bearer $TOKEN"   -H 'cache-control: no-cache' -H 'content-type: multipart/form-data' -F 'upFile=@/Users/lathaappanna/Desktop/Sparkify.py'

HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Server: openresty
Date: Thu, 21 Nov 2019 11:40:52 GMT
Content-Type: application/json
Content-Length: 113
Connection: keep-alive
X-Frame-Options: SAMEORIGIN
Strict-Transport-Security: max-age=31536000; includeSubDomains

{"_messageCode_":"Success","message":"Successfully uploaded file and created the necessary directory structure"}

Note: %2F is used to separate the directory names in the volume path. Don't forget to replace appvol with your volume name in the URL above. If the upload curl returns a 502 Bad Gateway error, wait a few minutes and retry the upload.

Upload the dataset required for the Spark application to the directory mydata in the same appvol volume. You may put your dataset in a different volume, but in this tutorial we will upload it to the same volume.

curl -v -ik  -X PUT 'https://<your_IBM_CP4D_url>/zen-volumes/appvol/v1/volumes/files/mydata%2Fmini_sparkify_event_data.json'  -H "Authorization: Bearer $TOKEN"   -H 'cache-control: no-cache' -H 'content-type: multipart/form-data' -F 'upFile=@/Users/lathaappanna/Desktop/TutorialImages/mini_sparkify_event_data.json'

Stop the file server on the volume appvol with the curl below:

curl -v -ik -X DELETE 'https://<your_IBM_CP4D_url>/zen-data/v1/volumes/volume_services/appvol' -H "Authorization: Bearer $TOKEN"

Now that our Spark application file and dataset are uploaded to the volume appvol, we can run the Spark application using the Spark jobs endpoint and the access token (saved as the variable JTOKEN) that we copied from the My instances details page.
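For the job API calls that follow, keep that access token in a shell variable. JTOKEN is simply the name used in this tutorial, and the angle-bracket value is a placeholder for the token copied from the instance details page:

export JTOKEN=<access_token_from_instance_details_page>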

Payload for the Spark job (sparkifyJob.json). Because the volume appvol is mounted at /Sparkify (see mount_path below), the application we uploaded to python/app is available inside the job at /Sparkify/python/app/Sparkify.py and the dataset at /Sparkify/mydata/mini_sparkify_event_data.json.

{
  "engine": {
    "type": "spark",
    "env": {
      "export PYSPARK_PYTHON": "/opt/ibm/conda/miniconda3.6/bin/python"
    },
    "size": {
      "num_workers": 2,
      "worker_size": {
        "cpu": 2,
        "memory": "8g"
      },
      "driver_size": {
        "cpu": 1,
        "memory": "4g"
      }
    },
    "volumes": [{
      "volume_name": "appvol",
      "source_path": "",
      "mount_path": "/Sparkify"
    }]
  },
  "application_arguments": [],
  "application_jar": "/Sparkify/python/app/Sparkify.py",
  "main_class": "org.apache.spark.deploy.SparkSubmit"
}

Submit the Spark job with the command below:

curl -ivk -X POST -d @sparkifyJob.json -H "jwt-auth-user-payload: $JTOKEN" <endpoint_to_submit_spark_jobs_from_my_instance_details_page>

Response

{"id":"87db093d-a700-49dc-bea8-e36d83ab45bc","job_state":"RUNNING"}

Alternatively, we may have our Spark application and dataset in IBM Cloud Object Storage and run the Spark job directly, without having to upload the application or data to a volume in IBM Cloud Pak for Data. In that case, use the payload below for submitting the Spark job.

{
  "engine": {
    "type": "spark",
    "conf": {
      "spark.app.name": "MyJob",
      "spark.hadoop.fs.cos.<REPLACE_WITH_COS_SERVICE_NAME>.endpoint": "<ENTER_COS_ENDPOINT>",
      "spark.hadoop.fs.cos.<REPLACE_WITH_COS_SERVICE_NAME>.secret.key": "<ENTER_SECRET_KEY>",
      "spark.hadoop.fs.cos.<REPLACE_WITH_COS_SERVICE_NAME>.access.key": "<ENTER_ACCESS_KEY>"
    },
    "env": {
      "export PYSPARK_PYTHON": "/opt/ibm/conda/miniconda3.6/bin/python"
    },
    "size": {
      "num_workers": 1,
      "worker_size": {
        "cpu": 1,
        "memory": "1g"
      },
      "driver_size": {
        "cpu": 1,
        "memory": "1g"
      }
    }
  },
  "application_arguments": ["cos://<REPLACE_WITH_BUCKET_NAME>.<REPLACE_WITH_COS_SERVICE_NAME>/<REPLACE_WITH_OBJECT_NAME>"],
  "application_jar": "cos://<REPLACE_WITH_BUCKET_NAME>.<REPLACE_WITH_COS_SERVICE_NAME>/<REPLACE_WITH_OBJECT_NAME>",
  "main_class": "org.apache.spark.deploy.SparkSubmit"
}

There are two ways to check the job status: by clicking the Jobs tab on the My instances page, or by submitting the curl below:

curl -ik -X GET -H "jwt-auth-user-payload: $JTOKEN" <endpoint_to_submit_spark_jobs_from_my_instance_details_page>/87db093d-a700-49dc-bea8-e36d83ab45bc

Note: 87db093d-a700-49dc-bea8-e36d83ab45bc in the above endpoint is the value of the id field in the job submit curl response.

Response

{"id":"87db093d-a700-49dc-bea8-e36d83ab45bc","job_state":"RUNNING"}

We can view the logs of the job 87db093d-a700-49dc-bea8-e36d83ab45bc by clicking the Download logs option on the job status page. The logs are downloaded as logs.tar.gz, which is a gzipped tar archive, so extract it rather than trying to cat it directly.
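For example, to list and then extract the downloaded archive:

tar -tzf logs.tar.gz    # list the log files in the archive
tar -xzf logs.tar.gz    # extract them into the current directory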

We can view the Spark history server for the job by copying the history server URL from the My instances details page into a browser.

Once the job moves to the FINISHED state, we can delete it by clicking the Delete option shown on the job status page earlier. We can also delete the job by using curl:

curl -ik -X DELETE -H "jwt-auth-user-payload: $JTOKEN" <endpoint_to_submit_spark_jobs_from_my_instance_details_page>/87db093d-a700-49dc-bea8-e36d83ab45bc

This story is co-authored with Shyamalagowri and Surbhi Bakhtiyar, developers of IBM Watson Studio Spark Environments and Analytics Engine powered by Apache Spark on IBM Cloud Pak for Data.
