Seamless Apache Spark Setup for Big Data Analytics

About Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. In layman's terms, Apache Spark is unified in the sense that it lets you do batch processing, streaming (i.e. real-time) analytics, graph analytics and machine learning all in one engine.

It also handles large-scale data easily because, first, it is a cluster computing framework built to overcome the limitations of MapReduce, and second, its programming paradigm makes it fault-tolerant, fast and strong at in-memory processing.

Building on the MapReduce paradigm (which is the basis of scalable data analysis anyway), Spark has become a force in the big data world. It allows fault-tolerant batch analysis in the programming languages of choice for data scientists, i.e. Python, R and Scala, and lets you write SQL queries over in-memory data. You can also leverage Hadoop's Hive to query your data in Spark using HiveQL, a perfect fit for a lot of semi-structured data, and that's just the tip of the iceberg of what you can use Spark for in your analytics projects.

The Challenge of Setting up Apache Spark

The Apache Spark framework has different components that come together to deliver its capabilities. They include:

  • A cluster for distributed processing
  • A cluster manager
  • Spark core
  • A distributed storage system

Bringing all of this together to provision a real-world engine for trying out Apache Spark can be daunting. Thanks to Google, however, you can provision a real cluster, much like you would in production.

I am going to show you how to seamlessly set up Spark on a cluster of two or more machines using Google Cloud.

Google’s Solution

Let’s get started with what Google has to offer for scalable big data analytics. Their solution is Google Cloud Dataproc. According to Google, Cloud Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way. True to their word, Dataproc is genuinely simple to deploy and excellently cost-efficient.

Dataproc has this to offer seamlessly:

  1. Apache Hadoop and its file system, HDFS, which can use Google Cloud Storage as its backend (Google's version of HDFS)
  2. Apache Spark
  3. Apache Zeppelin
  4. Jupyter Notebook with additional Python packages
  5. Apache Hive
  6. Intel BigDL for scalable deep learning
  7. Apache Zookeeper
  8. Apache Kafka (my favourite for streaming analytics)
  9. And several other Hadoop ecosystem frameworks; you can check the list here, or browse the initialization-actions bucket as shown below
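
If you are curious about what else is available, you can browse the public initialization-actions bucket that the cluster-creation command below pulls its scripts from. This is just a quick peek and assumes gsutil, which ships with the Cloud SDK you install in step 3 below, is already set up:

# list the folders of available initialization-action scripts
gsutil ls gs://dataproc-initialization-actions/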

Let’s dive into the setup.

Basic Requirements

A computer and an internet connection

  1. Create a Google Cloud account here. Note that you must have a Gmail account to create a free Google Cloud account. With this, you are generously given $300 to spend wisely on Google's cloud infrastructure for a year. You can go through the first part of this setup tutorial to create an account and set up a project for your account.
    PS: The setup ends before the topic “Create an image from our provided disk”.
  2. Enable billing for your project. You can do this by clicking Upgrade after your account has been created, then from the left-side menu select Billing, then select Go to linked billing account to initiate billing for the current project. More details can be found here.
  3. Install the Google Cloud gcloud client. You can follow this Google setup guide for installing the gcloud client, depending on your operating system and version.
    PS: I love Ubuntu; Windows 10 now has an Ubuntu interface that you can set up with the tutorials here, here or here.
  4. Provision your Hadoop cluster. Here are the command-line steps that seamlessly provision your Hadoop system:

a. Create a cluster name

$ CLUSTER_NAME=your_cluster_name

b. Create your cluster with some Initialization Actions

gcloud dataproc clusters create $CLUSTER_NAME \
--initialization-actions=gs://dataproc-initialization-actions/bigdl/bigdl.sh,gs://dataproc-initialization-actions/zeppelin/zeppelin.sh,gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh,gs://dataproc-initialization-actions/jupyter/jupyter.sh \
--metadata 'JUPYTER_CONDA_PACKAGES="dask numpy pandas matplotlib",PIP_PACKAGES=pandas-gbq' \
--initialization-action-timeout=10m

PS: Initialization actions are additional setup scripts for your cluster; once specified, they run automatically when the cluster is created. The metadata flag passes extra configuration to those initialization actions (here, the conda and pip packages to install for the Jupyter environment). The optional initialization-action timeout is how long cluster creation waits for the initialization scripts to finish before giving up; I think Google added this to keep resource initialization in check. The default, if you don't pass an argument, is 10m.
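
Cluster creation takes a few minutes. To confirm that it finished and that the initialization actions were applied, you can inspect it from the command line. This is just a quick check; newer gcloud versions may also require a --region flag on these commands:

# list clusters in the current project, then show the full config of ours
gcloud dataproc clusters list
gcloud dataproc clusters describe $CLUSTER_NAME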

5. Start an SSH tunnel (a SOCKS proxy) to the master node with gcloud

gcloud compute ssh --ssh-flag="-nND 1080" $CLUSTER_NAME-m

And with this, your cluster is created and you have an SSH tunnel open to its master node, exposing a SOCKS proxy on local port 1080. Keep this terminal running.
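
Before wiring up a browser, you can sanity-check the tunnel from a second terminal. This is an optional aside; it assumes curl is installed and that CLUSTER_NAME is also set in that shell:

# fetch the Jupyter page through the SOCKS proxy on port 1080;
# any HTTP response back means the tunnel is working
curl --socks5-hostname localhost:1080 http://$CLUSTER_NAME-m:8123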

Connect to Jupyter Notebook and Apache Zeppelin

  1. Connecting to Jupyter Notebook or Apache Zeppelin from your local machine is simple; the idea is to route your local browser through the SOCKS proxy you just opened. The only requirement is that Chrome, or whichever browser you will be using, is installed on your system.
    I prefer Chrome.
$ chrome.exe --proxy-server="socks5://localhost:1080" \ 
--host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
--user-data-dir=/tmp/

PS: chrome.exe stands for the installation path of your Chrome browser; e.g. on Windows it will be something like

"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" 
--proxy-server="socks5://localhost:1080" \
--host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
--user-data-dir=/tmp/
Optionally, you can point the browser straight at your cluster's Jupyter UI so it is clear it is going through the cluster: just append "http://{insert_cluster_name}-m:8123" to the command above.

On the cluster, Jupyter Notebook runs on port 8123 by default and Apache Zeppelin runs on port 8080. In the proxied browser window you can now connect via http://{cluster_name}-m:8123 for Jupyter and http://{cluster_name}-m:8080 for Zeppelin (or via localhost:8123 and localhost:8080 if you forward the ports instead, as in the sketch below).
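
If you would rather skip the browser proxy flags altogether, an alternative worth knowing (not part of the walkthrough above, just a sketch) is plain SSH local port forwarding, which is what makes the localhost addresses above work directly:

# forward Jupyter (8123) and Zeppelin (8080) from the master node to localhost
gcloud compute ssh $CLUSTER_NAME-m \
--ssh-flag="-N" \
--ssh-flag="-L 8123:localhost:8123" \
--ssh-flag="-L 8080:localhost:8080"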

Congratulations, you just provisioned a cluster to try out big data projects.

In the next article, I will be working through a simple analysis of a JSON file in Apache Zeppelin.

Some other helpful tips

SSH into the master node with

gcloud compute ssh $CLUSTER_NAME-m

Restart any service running in your cluster with systemctl. PS: systemctl is part of systemd, so it is available on systemd-based Linux distributions, including the Debian images Dataproc uses.

To restart any of the services: SSH into the master, list the services running on the server, and restart the one you need with the following commands, respectively

systemctl list-unit-files --type=service
systemctl restart {service}
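
For example, to find and restart the Zeppelin notebook server (the unit name zeppelin is an assumption here; check the list output for the exact name on your image):

# grep the service list for the notebook server, then restart it
systemctl list-unit-files --type=service | grep -i zeppelin
sudo systemctl restart zeppelin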
