Using Jupyter Notebook with Big Data: A guide on how to use Jupyter Notebook with big data frameworks like Apache Spark and Hadoop, including recommended libraries and tools.

TechLatest.Net · Jul 19, 2023

Introduction

Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. It is an excellent tool for data analysis and visualization.

In this guide, we will see how Jupyter Notebook can be used with big data frameworks like Apache Spark and Hadoop for data analysis and machine learning on large datasets.

Note

If you are looking to quickly set up and explore the AI/ML & Python Jupyter Notebook Kit, Techlatest.net provides an out-of-the-box setup on AWS, Azure, and GCP. Follow the links below for a step-by-step guide to setting up the kit on the cloud platform of your choice.

For AI/ML KIT: AWS, GCP & Azure.

Why choose the Techlatest.net VM, AI/ML Kit & Python Jupyter Notebook?

  • In-browser editing of code
  • Ability to run and execute code in various programming languages
  • Supports rich media outputs like images, videos, charts, etc.
  • Supports connecting to external data sources
  • Supports collaborative editing by multiple users
  • Simple interface to create and manage notebooks
  • Ability to save and share notebooks

Apache Spark

Apache Spark is a fast and general-purpose cluster computing framework for big data. It provides APIs in Java, Scala, Python, and R.

To use Jupyter Notebook with Spark, you need to:

Install Jupyter Notebook and PySpark. PySpark is the Python API for Spark.

Create a SparkSession within a Jupyter Notebook cell; it is the entry point for the DataFrame API and also exposes the SparkContext:

from pyspark.sql import SparkSession

# The SparkSession is the entry point for Spark's DataFrame and SQL APIs
spark = SparkSession.builder.appName('your app name').getOrCreate()
sc = spark.sparkContext  # the underlying SparkContext

Now you can access Spark SQL, Spark Streaming, MLlib, and GraphX APIs from within Jupyter.

For example, to read a CSV file into a data frame:

df = spark.read.csv('file.csv')
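
Since Spark SQL is available here, the same DataFrame can also be queried with SQL. A minimal sketch (the view name people is just a placeholder):

# Register the DataFrame as a temporary SQL view
df.createOrReplaceTempView('people')

# Query it with SQL; the result is itself a DataFrame
spark.sql('SELECT COUNT(*) AS row_count FROM people').show()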

You can then visualize the data using libraries like Matplotlib, Seaborn, etc.
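
For instance, here is a minimal sketch of plotting with Matplotlib, assuming a hypothetical numeric column named value and a result small enough to collect:

import matplotlib.pyplot as plt

# Pull a bounded number of rows to the driver as a pandas DataFrame
pdf = df.limit(1000).toPandas()

# Histogram of the hypothetical 'value' column
pdf['value'].hist(bins=30)
plt.show()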

Reading CSV Data

You can read both CSV files and CSV directories. When building a data ingestion framework that uses Spark DataFrames to read CSV data, you have to let Spark know the schema of the data. You can either define the schema in the code itself or let Spark infer it, for example by passing inferSchema=True to the CSV reader in Python.
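
For example, a one-line reader that lets Spark infer the types (the file name is a placeholder):

# Scan the file to infer column types; header=True treats the first row as column names
df = spark.read.csv('file.csv', header=True, inferSchema=True)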

For instance, consider the example below, which reads the /users/SchoolBus.csv file from the “folders” container into a DataFrame variable named CF.

from pyspark.sql.types import StructType, StructField, LongType, DoubleType

schema = StructType([
    StructField("start_time", LongType(), True),
    StructField("end_time", LongType(), True),
    StructField("student_count", LongType(), True),
    StructField("route_distance", DoubleType(), True),
    StructField("stops_count", LongType(), True),
    StructField("teacher_count", LongType(), True)
])

# Parentheses let the chained reader options span multiple lines
CF = (spark.read.schema(schema)
      .option("header", "false")
      .option("delimiter", "|")
      .csv("v3io://folders/users/SchoolBus.csv"))

Note that the header and delimiter options are optional here.
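
Once loaded, the result can be checked directly in the notebook:

CF.printSchema()   # confirm the schema was applied as defined
CF.show(5)         # preview the first five rows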

Link to GitHub Repo for CSV data: https://gist.github.com/virtuaCode/b8b74e86227f217b3af8dde6528ff60d

Apache Hadoop

Apache Hadoop is a framework for distributed storage and processing of large datasets.

To use Jupyter Notebook with Hadoop, you will need:

- The Pandas library, to work with data frames
- The PyArrow library, for high-performance data analysis
- The Hadoop client dependencies installed on the Jupyter Notebook server

Then you can access the Hadoop filesystem (HDFS) from within Jupyter using the `hdfs` library. For example:

from hdfs import InsecureClient

# Connect to the NameNode's WebHDFS endpoint
client = InsecureClient('http://hdfs_host:50070')

# read() returns a context manager over the remote file's contents
with client.read('path/to/file') as reader:
    data = reader.read()
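
The PyArrow library listed above can also read HDFS data directly into pandas. A minimal sketch, assuming libhdfs is installed and that the host, port, and file path placeholders are replaced with real values:

import pyarrow.fs as fs
import pyarrow.csv as pv

# Connect to HDFS through libhdfs; 'hdfs_host' and 8020 are placeholders
hdfs_fs = fs.HadoopFileSystem('hdfs_host', port=8020)

# Stream a CSV file from HDFS into an Arrow table, then convert to pandas
with hdfs_fs.open_input_file('path/to/file.csv') as f:
    table = pv.read_csv(f)
pdf = table.to_pandas()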

This gives you a quick start on using Jupyter Notebook with big data frameworks like Apache Spark and Hadoop. You can install other relevant libraries like TensorFlow, Keras, scikit-learn, etc., to build machine learning models on large datasets within Jupyter.
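
As a hedged illustration of that last point, a sample of a Spark DataFrame can be handed to scikit-learn once it fits in driver memory (this sketch reuses the SchoolBus columns from the earlier example; the model choice is arbitrary):

from sklearn.linear_model import LinearRegression

# Sample the large DataFrame down to a pandas-sized subset on the driver
pdf = CF.sample(fraction=0.01).toPandas()

# Fit a simple model on columns defined in the schema above
model = LinearRegression()
model.fit(pdf[['route_distance', 'stops_count']], pdf['student_count'])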

PySpark in Jupyter

There are two ways to get PySpark available in a Jupyter Notebook:

  • Configure the PySpark driver to use Jupyter Notebook: running pyspark will then automatically open a Jupyter Notebook
  • Load a regular Jupyter Notebook and load PySpark using the findspark package

The first option is quicker but specific to Jupyter Notebook, and the second option is a broader approach to getting PySpark available in your favorite IDE.

Method 1 — Configure PySpark driver

Update PySpark driver environment variables: add these lines to your ~/.bashrc (or ~/.zshrc) file.

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Restart your terminal and launch PySpark again:

$ pyspark

Now, this command should start a Jupyter Notebook in your web browser. Create a new notebook by clicking on ‘New’ > ‘Notebooks Python [default]’.

Copy and paste the Pi calculation script (shown in Method 2 below) and run it by pressing Shift + Enter.

Done!

You are now able to run PySpark in a Jupyter Notebook :)

Method 2 — findspark package

There is another, more generalized way to use PySpark in a Jupyter Notebook: use the findspark package to make a Spark context available in your code.

The findspark package is not specific to Jupyter Notebook; you can use this trick in your favorite IDE too.

To install findspark:

$ pip install findspark

Launch a regular Jupyter Notebook:

$ jupyter notebook

Create a new Python [default] notebook and write the following script:

import findspark
findspark.init()

import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    # A point (x, y) drawn uniformly from the unit square
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# The fraction of points inside the quarter circle approaches pi/4
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
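
Because this is a Monte Carlo approximation, the printed value should land close to 3.1416 with 100 million samples.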

Conclusion

Jupyter Notebook, combined with big data frameworks like Apache Spark and Hadoop, equips data scientists with a powerful platform to extract insights from massive datasets. By following the steps outlined in this guide and utilizing recommended libraries and tools, you can seamlessly integrate Jupyter Notebook with big data technologies, facilitating efficient data exploration, analysis, and visualization. Embrace the power of Jupyter Notebook to unlock the potential of big data and drive data-driven decision-making processes.
