PySpark Sentiment Analysis on Google Dataproc

A Step-by-Step Tutorial

Ricky Kim
Towards Data Science
11 min read · Dec 24, 2018


Photo by Joshua Sortino on Unsplash

I recently had a chance to play around with Google Cloud Platform through a Coursera specialization: Data Engineering on Google Cloud Platform. Overall, I learned a lot through the courses, and it was a great opportunity to try various Google Cloud Platform (GCP) services for free while going through the assignments. Even though I'm not using any of GCP's services at work at the moment, if I have the chance I'd be happy to migrate some of my data work to GCP.

However, one thing the course lacks is room for your own creativity. The assignments are more like tutorials than assignments: you basically follow along with already-written code. Of course, you can still learn a lot by reading every single line of code and understanding what it does in detail. Still, without applying what you have learned to your own problem-solving, it is difficult to make that knowledge completely yours. That's also what the instructor, Lak Lakshmanan, advised at the end of the course. (Shout out to Lak Lakshmanan, thank you for the great courses!)

*In addition to the short code blocks I attach, you can find a link to the whole Git repository at the end of this post.

Requirements

Creating a Free Trial Account on GCP

So I decided to do some personal mini-projects making use of various GCP services. Luckily, if you haven't tried GCP yet, Google generously offers a free trial that gives you $300 of credit to use over 12 months.

You can easily start your free trial by visiting https://cloud.google.com/gcp/

The first project I tried is training a Spark sentiment analysis model on Google Dataproc. There are a couple of reasons why I chose it as my first GCP project. I have already written about PySpark sentiment analysis in one of my previous posts, which means I can use that as a starting point and easily turn it into a standalone Python program. The other reason is that I just wanted to try Google Dataproc! I was fascinated by how quick and easy it is to spin up a cluster on GCP and couldn't resist trying it outside the Coursera course.

Once you have clicked “TRY GCP FREE” and filled in information such as your billing account (even though you set up a billing account, you won't be charged unless you upgrade to a paid account), you will be directed to a page that looks like the one below.

Home screen of GCP web console

On the top menu bar, you can see “My First Project” next to Google Cloud Platform. In GCP, a “project” is the base-level entity for using GCP services, enabling billing, and so on. On your first login, you can see that Google has automatically created a project called “My First Project” for you. Click on it to see the ID of the current project, and copy it or write it down; it will be used later. If you click “Billing” in the left-side menu of the web console home screen, you can see that “My First Project” is automatically linked to the free credit you received.

Enabling APIs

GCP offers many different services: Compute Engine, Cloud Storage, BigQuery, Cloud SQL, and Cloud Dataproc, to name a few. To use any of these services in your project, you first have to enable them.

Hover over “APIs & Services” in the left-side menu, then click “Library”. For this project, we will enable three APIs: Cloud Dataproc, Compute Engine, and Cloud Storage.

On the API Library page, search for the three APIs mentioned above one by one by typing each name into the search box. Click into the search result and enable the API by clicking the “ENABLE” button.

When I tried it myself, I only had to enable the Cloud Dataproc API, since the other two (Compute Engine and Cloud Storage) were already enabled when I clicked into them. But if that's not the case for you, please enable the Compute Engine API and the Cloud Storage API as well.

Installing Google Cloud SDK

If this is your very first time trying GCP, you might first want to install the Google Cloud SDK so that you can interact with many GCP services from the command line. You can find more information on how to install it here.

Install Google Cloud SDK by following instructions on https://cloud.google.com/sdk/

Following the instructions at the link, you will be prompted to log in (use the Google account you used to start the free trial), then to select a project and a compute zone. For the project, choose the one you enabled the APIs for in the steps above if there is more than one. For the compute zone, you might want to choose a zone that is close to you to decrease network latency; you can check the physical location of each zone here.

Creating Bucket

Now that you have installed the Google Cloud SDK, you can create a bucket either from the command line or from the web console.

Web Console

Click “Storage” in the left-side menu and you'll see a page like the one above. Click “Create bucket”.

For convenience, enter the project ID you noted at the end of the “Creating a Free Trial Account on GCP” step as the bucket name. You can simply click “Create” without changing any other details, or choose the same location as your project.

Google Cloud SDK

Replace your_project_id with the project ID that you copied and run the line below in your terminal to set the PROJECT_ID variable and make it available to sub-processes (a Bash script you need to run later makes use of it; the bucket name will be the same as your project ID).

export PROJECT_ID='your_project_id'

Then create a bucket by running the gsutil mb command as shown below.

gsutil mb gs://${PROJECT_ID}

The above command creates a bucket with the default settings. If you want to create a bucket in a specific region or multi-region, you can pass the -l option to specify the location. You can see the available bucket locations here.

#ex1) multi-region europe
gsutil mb -l eu gs://${PROJECT_ID}
#ex2) region europe-west1
gsutil mb -l europe-west1 gs://${PROJECT_ID}

Cloning Git Repository

Now clone the Git repository I uploaded by running the command below in your terminal.

git clone https://github.com/tthustla/pyspark_sa_gcp.git

Preparing Data

Once you clone the repository, it will create a folder named pyspark_sa_gcp. Go into the folder and check what files are there.

cd pyspark_sa_gcp/
ls

You will see three files in the directory: data_prep.sh, pyspark_sa.py, and train_test_split.py. To download the training data and prepare it for training, let's run the Bash script data_prep.sh. The script is commented to explain what each line does; roughly, it downloads the Sentiment140 data, calls a Python script to split it into training and test sets, and uploads the results to your bucket.

The training data is the “Sentiment140” dataset, which originated from Stanford University. It contains 1.6 million labelled tweets, 50% with negative labels and 50% with positive labels. More information on the dataset can be found at http://help.sentiment140.com/for-students/

The Bash script calls a Python script, train_test_split.py. Let's also take a look at what that does.
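
The embedded script isn't reproduced here, but roughly speaking it reads the downloaded Sentiment140 CSV, keeps the tweet text and its label, and splits the data into training and test sets. Here is a minimal sketch of such a script; it is not the author's exact train_test_split.py, and the file names, encoding, label mapping, and split ratio are assumptions for illustration.

import pandas as pd

# Sentiment140 column layout: polarity, id, date, query, user, text
cols = ["label", "id", "date", "query", "user", "text"]
df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                 header=None, names=cols, encoding="latin-1")

# Keep only the tweet text and its label, mapping the positive label 4 to 1
df = df[["text", "label"]]
df["label"] = (df["label"] == 4).astype(int)

# Hold out a small random sample as the test set and use the rest for training
test = df.sample(frac=0.02, random_state=42)
train = df.drop(test.index)

train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)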

Now we can run the Bash script to prepare the data. Once it's finished, it will have uploaded the prepared data to the Cloud Storage bucket you created earlier. Uploading the data takes around 5 to 6 minutes.

./data_prep.sh

Checking the Uploaded Data

Web Console

Go to Storage in the left-side menu and click into your bucket -> pyspark_nlp -> data. You will see that two files have been uploaded.

Google Cloud SDK

You can also check the contents of your bucket from your terminal by running the command below.

gsutil ls -r gs://${PROJECT_ID}/**

Creating Google Dataproc Cluster

Cloud Dataproc is a Google Cloud service for running Apache Spark and Apache Hadoop clusters. I have to say it is ridiculously simple and easy to use, and it takes only a couple of minutes to spin up a cluster with Cloud Dataproc. Cloud Dataproc also offers autoscaling if you need it, and you can resize the cluster at any time, even while jobs are running on it.

Web Console

Go to Dataproc in the left-side menu (you have to scroll down a bit; it's under the Big Data section) and click “Clusters”. Click “Create cluster” and you'll see a page like the one below.

Give the cluster a name (for convenience, I used the project ID as its name) and choose a region and zone. To decrease latency, it is a good idea to set the region to be the same as your bucket's region. You also need to change the default settings for the worker nodes a little, because the free trial only allows you to run up to 8 cores. The default setting is one master and two workers, each with 4 CPUs, which would exceed the 8-core quota. So change the setting for your worker nodes to 2 CPUs each, then click “Create” at the bottom. After a couple of minutes of provisioning, you will see the cluster created with one master node (4 CPUs, 15GB memory, 500GB standard persistent disk) and two worker nodes (2 CPUs, 15GB memory, 500GB standard persistent disk each).

Google Cloud SDK

Since we need to change the default settings a little, we have to add one more argument to the command, but it's simple enough. Let's create a cluster with the same name as the project ID and set the worker nodes to have 2 CPUs each.

gcloud dataproc clusters create ${PROJECT_ID} \
    --project=${PROJECT_ID} \
    --worker-machine-type='n1-standard-2' \
    --zone='europe-west1-b'

You can change the zone to be close to your bucket region.

Submitting Spark Job

Finally, we are ready to run the training on Google Dataproc. The Python script (pyspark_sa.py) for the training is included in the Git repository you cloned earlier. The code is a slightly refactored version of what I did in a Jupyter Notebook for my previous post. In case you want to know more about PySpark or NLP feature extraction in detail, please refer to my previous posts.

Let's take a look at what the Python script does.

Since the script itself is commented to explain what each line does, I won't go through the code extensively. In a nutshell, it takes three command-line arguments: the Cloud Storage location where the training and test data are stored, a Cloud Storage directory to store the prediction results on the test data, and finally a Cloud Storage directory to store the trained model. When called, it preprocesses the training data -> builds a pipeline -> fits the pipeline -> makes predictions on the test data -> prints the accuracy of the predictions -> saves the prediction results as CSV -> saves the fitted pipeline model -> loads the saved model -> prints the accuracy on the test data again (to check that the model was saved properly).
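
To make that flow concrete, here is a minimal sketch of what such a training script could look like. It is not the author's actual pyspark_sa.py; the column names, file names, and pipeline stages (a plain tokenizer, TF-IDF features, and logistic regression) are assumptions for illustration.

import sys
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

def main(data_dir, result_dir, model_dir):
    spark = SparkSession.builder.appName("pyspark_sa").getOrCreate()

    # Load the train/test CSVs uploaded by data_prep.sh
    # (assumed file names and schema: a "text" column and a numeric "label" column)
    train = spark.read.csv(data_dir + "train.csv", header=True, inferSchema=True)
    test = spark.read.csv(data_dir + "test.csv", header=True, inferSchema=True)

    # Build a simple text-classification pipeline: tokenize -> TF-IDF -> logistic regression
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="tf")
    idf = IDF(inputCol="tf", outputCol="features")
    lr = LogisticRegression(labelCol="label", featuresCol="features")
    pipeline = Pipeline(stages=[tokenizer, tf, idf, lr])

    # Fit the pipeline on the training data and score the test data
    model = pipeline.fit(train)
    predictions = model.transform(test)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy")
    print("Test set accuracy:", evaluator.evaluate(predictions))

    # Save the predictions as CSV and persist the fitted pipeline model
    predictions.select("text", "label", "prediction").write.csv(result_dir, header=True)
    model.save(model_dir)

    # Reload the saved model and evaluate again to confirm it was saved properly
    reloaded = PipelineModel.load(model_dir)
    print("Reloaded model accuracy:", evaluator.evaluate(reloaded.transform(test)))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2], sys.argv[3])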

Web Console

To run this job through the web console, we first need to upload the Python script to our Cloud Storage bucket so that the job can read it. Let's upload the script by running the command below (I'm assuming you are still in the pyspark_sa_gcp directory in your terminal).

gsutil cp pyspark_sa.py gs://${PROJECT_ID}/pyspark_nlp/

Now go to Dataproc in the web console, click “Jobs”, then click “SUBMIT JOB”.

In the form shown in the screenshot above, replace the blurred parts of the text with your project ID, then click “Submit” at the bottom. You can inspect the job's output by clicking into the job.

The job finished after about 15 minutes. Looking at the output, it seems the cluster struggled a bit, but nonetheless the predictions look fine and the model seems to have been saved properly.

Google Cloud SDK

If you submit the job from the command line, you don't even need to upload your script to Cloud Storage: gcloud will grab the local file and ship it to the Dataproc cluster for execution. (Again, I'm assuming you are still in the pyspark_sa_gcp directory in your terminal.)

gcloud dataproc jobs submit pyspark pyspark_sa.py \
    --cluster=${PROJECT_ID} \
    -- gs://${PROJECT_ID}/pyspark_nlp/data/ gs://${PROJECT_ID}/pyspark_nlp/result gs://${PROJECT_ID}/pyspark_nlp/model

Again the cluster seemed to struggle a bit, but the results and the model were still saved properly. (I tried submitting the same job on my paid account with 4-CPU worker nodes, and it didn't throw any warnings.)

Checking the Results

Go to your bucket, then go into the pyspark_nlp folder. You will see that the results of the above Spark job have been saved into the “result” directory (the prediction data frame) and the “model” directory (the fitted pipeline model).
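
If you want to reuse the saved model later, for example in another job running on the cluster, you can load it straight from the bucket. Below is a minimal sketch, assuming the “model” path used above and a “text” input column as in the earlier sketch; replace your_project_id with your own project ID.

from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("load_sa_model").getOrCreate()

# Load the fitted pipeline saved by the training job
model = PipelineModel.load("gs://your_project_id/pyspark_nlp/model")

# Score a couple of new example tweets with the same preprocessing and classifier stages
new_tweets = spark.createDataFrame(
    [("I love this!",), ("This is terrible...",)], ["text"])
model.transform(new_tweets).select("text", "prediction").show()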

Finally, don't forget to delete the Dataproc cluster you created, to make sure it doesn't use up any more of your credit.

In this post, I went through how to train a Spark ML model on Google Dataproc and save the trained model for later use. What I showed here is only a small part of what GCP is capable of, and I encourage you to explore other GCP services and play around with them.

Thank you for reading. You can find the Git repository with all the scripts at the link below.

https://github.com/tthustla/pyspark_sa_gcp
