How to set up PySpark for your Jupyter notebook
Apache Spark is one of the hottest frameworks in data science, bringing together Big Data processing and machine learning in a single platform. This is because:
- Spark is fast (up to 100x faster than traditional Hadoop MapReduce) due to in-memory operation.
- It offers robust, distributed, fault-tolerant data objects (called RDDs).
- It integrates beautifully with the world of machine learning and graph analytics through supplementary packages like MLlib and GraphX.
Spark is commonly deployed on top of Hadoop/HDFS and is written mostly in Scala, a functional programming language that runs on the JVM.
However, for most beginners, Scala is not a great first language to learn when venturing into the world of data science.
Fortunately, Spark provides a wonderful Python API called PySpark. This allows Python programmers to interface with the Spark framework — letting you manipulate data at scale and work with objects over a distributed file system.