How I used Spark with Python to analyse and process data

When we’re looking for Big Data tools on the web, the main technologies that come up in the search are Hadoop and Spark.

So, I decided to write this post to share my experience of working with public data and Spark.

What can you expect from this post?

A short introduction to Hadoop, Spark and PySpark, and examples of how I used PySpark to analyse and process data files.

What is Hadoop?

According to Apache:

"Is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models."

The basic Hadoop modules:

  • Commons: libraries and utilities that support the other Hadoop modules;
  • HDFS: Hadoop Distributed File System;
  • MapReduce: a system for parallel processing of large data sets;
  • YARN: a framework for job scheduling and cluster resource management;

Overview of Hadoop Ecosystem:

Hadoop Ecosystem

What is Spark?

According to Apache:

"Is a unified analytics engine for large-scale data processing."

Spark is a framework focused on:

  • Speed: benchmarks demonstrate that Spark can be up to 100x faster than Hadoop for large-scale data processing;
  • Ease of use: it provides APIs (Application Programming Interfaces) to operate on large datasets from Scala, Python, R and SQL;
  • Sophisticated analysis: it supports SQL (Structured Query Language) queries, data streaming, Machine Learning and graph processing;

Overview of Spark Ecosystem:

Spark Ecosystem
  • Spark SQL: a module for working with structured data (SQL and DataFrame API);
  • Spark Streaming: enables building scalable, fault-tolerant streaming applications (it integrates with several data sources like HDFS, Kafka and Twitter);
  • MLlib: a scalable Machine Learning library that delivers high-quality algorithms at high speed;
  • GraphX: an API for graphs and graph-parallel computation;
  • Core: high-level APIs in R, Python, Scala and Java;

What is RDD?

Resilient Distributed Dataset (RDD) is the basic data unit of Spark. Basically, it’s a collection of elements distributed across the cluster nodes.

Some RDD features:

  • In-memory computation: data stored in an RDD is kept in memory rather than on disk;
  • Lazy Evaluation: transformations are not computed right away, only when an action needs the result (see the sketch after this list);
  • Fault Tolerance: when a node fails, only the lost partitions of the RDD are recomputed from the original lineage;
  • Immutability: an RDD can’t be modified; when a transformation is executed, a new RDD is created;
  • Persistence: frequently used RDDs can be kept in memory instead of being reloaded from disk (persist() and cache());
  • Partitioning: data is distributed across several cluster nodes;
  • Parallelism: RDDs process data in parallel across the cluster;
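
A minimal sketch of lazy evaluation and immutability in practice (the collection and the variable names are just assumptions for illustration):

from pyspark import SparkContext

sc = SparkContext(appName="rdd_features_example")

# parallelize() distributes a local collection across the cluster as an RDD
numbers = sc.parallelize([1, 2, 3, 4, 5])

# map() is a transformation: nothing is computed yet (lazy evaluation)
# and it returns a new RDD instead of changing numbers (immutability)
squares = numbers.map(lambda x: x * x)

# collect() is an action: only now Spark actually computes the result
print(squares.collect())  # [1, 4, 9, 16, 25]

sc.stop()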

Hadoop vs Spark:

Hadoop is essentially a distributed computing platform that offers two main services:

  • Storing any kind of data at low cost and high scale;
  • Performing complex data analysis quickly;

Spark is a framework that implements the RDD concept, which allows distributed data to be re-used across a diversity of applications and provides an efficient mechanism to recover from failures.

PySpark

According to Apache:

“Is the Python API for Spark.”

What can we do with PySpark?

  • Distributed processing of large data volumes on a cluster;
  • Extract data directly from a cluster machine running Spark, without downloading it or making a local copy;
  • Create pandas DataFrames from Spark DataFrames;

Check out the Spark Python API Docs.

To access Spark functionality we can use:

  • pyspark.SparkContext: Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster.
  • pyspark.sql.SparkSession: The entry point to programming Spark with the Dataset and DataFrame API.

There’s a really handy cheatsheet that you should definitely check out.

To install pyspark you can use pip and just run:

$ pip install pyspark

To launch pyspark from terminal:

$ pyspark
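
In the interactive shell a SparkContext (sc) and a SparkSession (spark) are already created for you, so you can start working right away, for example (the file name is just an assumption):

>>> rdd = sc.textFile("example_file.txt")
>>> rdd.count()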

Or you can execute the file using spark-submit:

$ spark-submit example_file.py

Example using SparkContext:
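
Here is a minimal sketch, assuming a plain-text file called data.txt (the file name and the word-count logic are just illustrative assumptions):

from pyspark import SparkContext

# assumption: a local text file called data.txt with one record per line
sc = SparkContext(appName="sparkcontext_example")

lines = sc.textFile("data.txt")

# classic word count using RDD transformations and an action
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, total in counts.take(10):
    print(word, total)

sc.stop()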

Example using SparkSession:
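
A minimal sketch, assuming a CSV file called data.csv with a header row (the file name and its columns are illustrative assumptions):

from pyspark.sql import SparkSession

# assumption: a local CSV file called data.csv with a header row
spark = (SparkSession.builder
         .appName("sparksession_example")
         .getOrCreate())

df = spark.read.csv("data.csv", header=True, inferSchema=True)

df.printSchema()
df.show(5)

# DataFrames can also be queried with SQL
df.createOrReplaceTempView("data")
spark.sql("SELECT COUNT(*) AS total FROM data").show()

# and converted to a pandas DataFrame, as mentioned above
pandas_df = df.toPandas()

spark.stop()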

I hope that this post can help you in some way.
