How to set up PySpark for your Jupyter notebook

--

Apache Spark is one of the hottest frameworks in data science, because it brings Big Data and machine learning together. Specifically:

  • Spark is fast (up to 100x faster than traditional Hadoop MapReduce) thanks to in-memory operation.
  • It offers robust, distributed, fault-tolerant data objects called RDDs (Resilient Distributed Datasets).
  • It integrates beautifully with the world of machine learning and graph analytics through supplementary packages like MLlib and GraphX.

Spark is typically deployed on top of Hadoop/HDFS and is written mostly in Scala, a functional programming language that runs on the JVM.

However, for most beginners, Scala is not a great first language to learn when venturing into the world of data science.

Fortunately, Spark provides a wonderful Python API called PySpark. This allows Python programmers to interface with the Spark framework — letting you manipulate data at scale and work with objects over a distributed file system.

Why use Jupyter Notebook?
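One common way to make PySpark importable from inside a Jupyter notebook kernel is the `findspark` helper. This is a setup sketch under two assumptions not stated in the article: the `findspark` package is installed, and the `SPARK_HOME` environment variable points at a local Spark installation.

```python
# Setup sketch (assumptions: `findspark` is installed, and SPARK_HOME points
# at a local Spark installation).
import findspark
findspark.init()  # prepends PySpark's libraries to sys.path using SPARK_HOME

# After init(), pyspark imports work inside the notebook kernel as usual.
import pyspark
print(pyspark.__version__)
```

An alternative route is to launch Jupyter through PySpark itself, by setting `PYSPARK_DRIVER_PYTHON=jupyter` and `PYSPARK_DRIVER_PYTHON_OPTS='notebook'` before running the `pyspark` command.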

--


Tirthajyoti Sarkar