Photo by Eric Han on Unsplash

PySpark on macOS: installation and use

René-Jean Corneille
Published in The Startup · 4 min read · Oct 21, 2019


Spark is a very popular framework for data processing, and it has gradually displaced Hadoop for data analytics. Its in-memory processing can be up to 100x faster than Hadoop MapReduce. One of Spark's main advantages is that there is no longer any need to write MapReduce jobs by hand. Moreover, the Spark engine is compatible with a large number of data sources (text, JSON, XML, SQL and NoSQL data stores). Along with Hadoop, SQL, Python and R, Spark is one of the most sought-after skills for data scientists.
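For a taste of that API, here is a minimal sketch of reading a few of those sources and running an aggregation with the DataFrame interface; the file names and the user_id column are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sources-demo").getOrCreate()

# The same reader interface covers many formats (hypothetical files)
df_json = spark.read.json("events.json")
df_txt = spark.read.text("events.txt")
df_csv = spark.read.csv("events.csv", header=True)

# A group-by aggregation in one line, no hand-written MapReduce job
df_json.groupBy("user_id").count().show()

spark.stop()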

A Spark application is made of:

  • several executor processes, which perform the data processing tasks;
  • a driver process, which manages the resources allocated to the executors and distributes the data processing workload among them (the sketch below illustrates this split).
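To make the split concrete, here is a minimal sketch (assuming a working PySpark installation) in which the driver defines a partitioned dataset and the executors run one task per partition:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-executors").getOrCreate()
sc = spark.sparkContext

# The driver builds the execution plan: a dataset split into 8 partitions
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# The map and the sum run on the executors, one task per partition;
# the driver only gathers the final result
total = rdd.map(lambda x: x * x).sum()
print(total)

spark.stop()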

Users interact with the driver through their code. Spark is written in Scala but also has APIs in other languages: R, Java and, more importantly, Python. Spark is meant to run on a cluster of machines but can also run locally, since the driver and executors are merely processes. This is useful for prototyping applications locally before sending them to the cloud; Google Cloud Dataproc is (among other solutions) a very convenient tool for launching Spark jobs on the cloud.
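As a sketch of that local mode, a SparkSession can simply be pointed at the local machine instead of a cluster: "local[*]" runs the whole application in a single local JVM with one worker thread per core.

from pyspark.sql import SparkSession

# "local[*]" = run driver and workers in one local JVM,
# with as many worker threads as there are cores
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("local-prototype")
    .getOrCreate()
)

print(spark.sparkContext.master)  # prints: local[*]
spark.stop()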

