How I used Spark with Python to analyse and process data
When we look for Big Data tools on the web, the main technologies that come up are Hadoop and Spark.
So, I decided to write this post to share my experience of working with public data and Spark.
What can you expect from this post?
A short introduction to the Hadoop, Spark and PySpark technologies, and examples of how I used PySpark to analyse and process data files.
What is Hadoop?
According to Apache:
"Is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models."
The basic Hadoop modules:
- Common: libraries and utilities that support the other Hadoop modules;
- HDFS: Hadoop Distributed File System;
- MapReduce: a system for parallel processing of large data sets;
- YARN: A framework for job scheduling and cluster resource management;
Overview of Hadoop Ecosystem:

What is Spark?
According to Apache:
"Is a unified analytics engine for large-scale data processing."
Spark is a framework focused on:
- Speed: benchmarks show Spark running up to 100x faster than Hadoop MapReduce for large-scale data processing;
- Ease of use: provides APIs (Application Programming Interfaces) for operating on large datasets in Scala, Python, R and SQL;
- Sophisticated analysis: supports SQL (Structured Query Language) queries, data streaming, Machine Learning and graph processing;
Overview of Spark Ecosystem:

- Spark SQL: a module for working with structured data (SQL and the DataFrame API; see the sketch after this list);
- Spark Streaming: enables building scalable, fault-tolerant streaming applications (integrates with several data sources such as HDFS, Kafka and Twitter);
- MLlib: a scalable Machine Learning library that delivers high-quality algorithms at high speed;
- GraphX: an API for graphs and graph-parallel computation;
- Core: high-level APIs in R, Python, Scala and Java;
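As a quick illustration of the Spark SQL module, here is a minimal sketch that mixes the DataFrame API with plain SQL (the column names and values are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# A small DataFrame built by hand (structured data)
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# Query with the DataFrame API...
df.filter(df.age > 40).show()

# ...or with plain SQL through a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

spark.stop()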
What is RDD?
Resilient Distributed Dataset (RDD) is Spark's basic data unit: a collection of elements distributed across the cluster's nodes.
Some RDD features (a short sketch follows this list):
- In-memory computation: data stored in an RDD can be kept in memory rather than on disk;
- Lazy Evaluation: transformations aren't computed right away; they are only evaluated when an action needs a result;
- Fault Tolerance: when a node fails, only the lost partitions of the RDD are recomputed from the original data;
- Immutability: an RDD can't be modified in place; executing a transformation creates a new RDD;
- Persistence: RDDs that are used often can be kept in memory instead of being reloaded from disk (persist() and cache());
- Partitioning: data is distributed across several cluster nodes;
- Parallelism: RDDs process data in parallel across the cluster;
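To make these features concrete, here is a minimal PySpark sketch (the data and names are just illustrative):
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-features")

# Partitioning: the collection is split into 4 partitions
numbers = sc.parallelize(range(1, 11), 4)

# Immutability + Lazy Evaluation: map() returns a new RDD
# and nothing is computed yet
squares = numbers.map(lambda x: x * x)

# Persistence: keep the result in memory for re-use
squares.cache()

# Actions trigger the actual parallel computation
print(squares.collect())  # [1, 4, 9, ..., 100]
print(squares.sum())      # 385, served from the cached RDD

sc.stop()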
Hadoop vs Spark:
Hadoop is essentially a distributed computation platform that offers two main services:
- Storing any kind of data at low cost and high scale;
- Performing complex data analysis quickly;
Spark is a framework that implements the RDD concept, which allows distributed data to be re-used across a diversity of applications and provides an efficient mechanism for recovering from failures.
PySpark
According to Apache:
“PySpark is the Python API for Spark.”
What can we do with PySpark?
- Distributed processing of large volumes of data on a cluster;
- Extracting data directly from a cluster machine running Spark, without downloading it or making a local copy;
- Creating pandas DataFrames from Spark DataFrames (see the sketch after this list);
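As a sketch of that last item (assuming pandas is installed and the data fits in the driver's memory):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("to-pandas").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

# toPandas() collects the distributed data to the driver as a local
# pandas DataFrame, so only use it on data that fits in memory
pdf = df.toPandas()
print(type(pdf))  # <class 'pandas.core.frame.DataFrame'>

spark.stop()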
Check out the Spark Python API Docs.
To access Spark's functionality we can use:
- pyspark.SparkContext: Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster.
- pyspark.sql.SparkSession: The entry point to programming Spark with the Dataset and DataFrame API.
There’s a really handy cheatsheet that you should definitely check out.
To install PySpark you can use pip and just run:
$ pip install pyspark
To launch PySpark from the terminal:
$ pyspark
Or you can execute a file using spark-submit:
$ spark-submit example_file.py
Example using SparkContext:
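Here is a minimal word-count sketch (sample.txt is a hypothetical input file):
from pyspark import SparkContext

sc = SparkContext("local[*]", "word-count")

# sample.txt is a hypothetical input file; each line becomes one element
lines = sc.textFile("sample.txt")

# Transformations are lazy; reduceByKey aggregates the per-word counts
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# collect() is the action that triggers the computation
for word, count in counts.collect():
    print(word, count)

sc.stop()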
Example using SparkSession:
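A minimal sketch reading a CSV file into a DataFrame (people.csv and its columns are hypothetical):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-example").getOrCreate()

# people.csv is a hypothetical file with a header row
df = spark.read.csv("people.csv", header=True, inferSchema=True)

df.printSchema()
df.groupBy("age").count().show()

spark.stop()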
I hope that this post can help you in some way.
