PySpark EDA Basics: Practical Parallel Processing

don’t calculate, delegate

Mark Cleverley
The Startup

--

This is likely the easiest way to get started with Big Data, and it’s still not that easy. However, watching your swarm of worker nodes take 5 minutes on a task that would have taken your computer 5 hours is tremendously rewarding.

Picture it: you've stored your data in S3, configured security in EC2, revved up some clusters on EMR, and launched a PySpark-configured Jupyter notebook through AWS. You're ready to analyze.
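
A minimal sketch of what that first notebook cell might look like (the bucket path and app name are hypothetical; on an EMR notebook a `spark` session is usually pre-created, in which case getOrCreate() simply returns it):

```python
from pyspark.sql import SparkSession

# On an EMR-backed notebook a `spark` session typically already exists;
# getOrCreate() returns it, or builds a local fallback elsewhere.
spark = SparkSession.builder.appName("eda-basics").getOrCreate()

# Hypothetical bucket and file, purely for illustration
df = spark.read.csv("s3://my-bucket/sample.csv", header=True, inferSchema=True)
df.printSchema()
```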

No pandas, no problem

Spark’s core data structure is the Resilient Distributed Dataset (RDD). RDDs are immutable, fault-tolerant and efficient, because their contents are partitioned across multiple nodes. The actual structure is more primordial than a classic relational database table: an RDD is ultimately a distributed collection of JVM (Java/Scala) objects held in memory. In-memory storage makes for much faster computation than reading and writing to disk, especially when you can pool the RAM of many interlinked machines.

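You can see that partitioned, immutable structure directly from the SparkContext. A toy sketch, assuming the `spark` session from above:

```python
# Grab the low-level context from the session
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000), 8)  # data split across 8 partitions
print(rdd.getNumPartitions())              # -> 8

# RDDs are immutable: map() leaves `rdd` untouched and returns a new RDD
squared = rdd.map(lambda x: x * x)
```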

They use “lazy evaluation”, which means they don’t actually crunch the numbers until the last possible moment. This…
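
In practice that distinction shows up as transformations versus actions. Continuing the sketch above, nothing executes until the action:

```python
# Transformations are lazy: this only records lineage, no work happens yet
evens = squared.filter(lambda x: x % 2 == 0)

# Actions trigger execution: count() actually runs the job on the workers
print(evens.count())
```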
