
Apache Spark: What is it and How to Get it on Windows
Apache Spark is a tool used by many data scientists, data engineers, and data analysts. If you work with data, it is worth learning, especially if you work with a LOT of it. In technical terms, Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. In practice, this means you can run large transformations on your data, either in chunks or all at once. You can work with your data locally or in the cloud, since Spark is compatible with services such as Amazon EC2 and IBM Bluemix. It also works with most of the tools that data professionals commonly use, so it’s no surprise that Spark is so widely adopted.
“At a high level, any Spark application creates RDDs out of some input, run (lazy) transformation of these RDDs to some other form(shape), and finally perform actions to collect or store data.”
The quote above eloquently explains what Spark is without diving into the smaller details. But what is an RDD? RDD stands for resilient distributed dataset. It is a distributed memory abstraction that lets users perform in-memory computations on large clusters in a fault-tolerant way, which is where the “resilient” comes from.
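To make the “lazy transformation, then action” idea from the quote more concrete, here is a minimal sketch in plain Python. It is not Spark and does not use Spark’s API: LazyDataset is a hypothetical toy class invented for illustration, and it only borrows the method names map, filter, and collect to mimic the pattern. Transformations are merely recorded, and nothing is computed until the collect action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD (NOT Spark): records steps, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: remember the function, do no work yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: remember the predicate, do no work yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


numbers = LazyDataset(range(1, 6))
pipeline = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 1)
print(pipeline.collect())  # [1, 9, 25]
```

Real RDDs add the parts this toy leaves out, such as splitting the data across machines and recomputing lost pieces, but the shape of the program is similar: build up transformations, then trigger them with an action.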
What programming languages can you use with Spark? Spark itself is written primarily in Scala. However, it can also be used with Python, Java, R, and SQL. If you want to use it with Python, PySpark (Spark’s Python API) is a great library for that. Spark can work with a variety of data formats and connect to a variety of databases.
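As a rough idea of what using Spark from Python looks like, here is a small sketch with PySpark. It assumes PySpark and a working Java installation are already set up on your machine. The function name long_word_lengths, the app name "rdd-sketch", and the min_length parameter are made up for this example; the local[*] master simply runs Spark on your own computer rather than on a cluster.

```python
def long_word_lengths(words, min_length=3):
    """Return the lengths of the words that have at least min_length characters."""
    # Imported inside the function so it can be defined even where PySpark is absent.
    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session that runs locally on all available cores.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("rdd-sketch")  # hypothetical name, any string works
        .getOrCreate()
    )
    try:
        # Create an RDD out of some input.
        rdd = spark.sparkContext.parallelize(words)
        # Lazy transformations: nothing runs yet.
        lengths = rdd.map(len).filter(lambda n: n >= min_length)
        # Action: triggers the computation and brings the results back.
        return lengths.collect()
    finally:
        spark.stop()
```

You could try it with something like long_word_lengths(["spark", "is", "lazy"]) once your installation is working.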
For a deeper dive into what Apache Spark is and why it’s useful, I found this article very helpful:
Installing Apache Spark on Windows

Installing Spark on Windows is a challenge. It isn’t as easy as typing “pip install” or simply downloading and installing the packages you need. A number of small tweaks must be made, or you will run into issues. Instead of reinventing the wheel, I will point to the resources that helped me most in getting Spark onto my computer.
This guide was fairly helpful in understanding the components you need to install before installing Spark. It also covers using Spark with Python and with Jupyter notebooks.
This video was also very helpful, as it guides you step by step through the entire procedure of getting Spark on your PC.
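Once you have worked through an installation, a quick sanity check can save some head-scratching. The sketch below is not taken from the resources above. It assumes the setup commonly described for Windows, where the JAVA_HOME, SPARK_HOME, and HADOOP_HOME environment variables are set and a winutils.exe file sits in the bin folder under HADOOP_HOME. Your own setup may differ, so treat the list of checks as an assumption. The function name check_spark_env is made up for this example.

```python
import os
from pathlib import Path

# Assumed variable names for a typical Windows setup; adjust to match yours.
REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")


def check_spark_env(env=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points to a missing folder: {value}")
    # Assumption: winutils.exe is expected in the bin folder under HADOOP_HOME.
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home and not (Path(hadoop_home) / "bin" / "winutils.exe").is_file():
        problems.append("winutils.exe not found under HADOOP_HOME/bin")
    return problems


if __name__ == "__main__":
    for line in check_spark_env() or ["Environment looks ready for Spark."]:
        print(line)
```

If the script prints any problems, revisit the matching step in whichever guide you followed before trying to launch Spark.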
