Let’s take a peek into PySpark

Photo by Warren Wong on Unsplash

Welcome to my blog on PySpark, in which we will get a gist of PySpark and its use cases. As a first hands-on exercise, we will also read a CSV file using PySpark.

Spark — the setup model

Spark is a platform for cluster computing. Spark lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer). Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data.

Parallelism

In general, Spark relies on the concept of parallelism: files can be read simultaneously, with each node taking on a certain set of rows. This makes the process faster, since the work is done in parallel.

Where to use Spark?

Deciding whether Spark is the best solution for the problem at hand takes some practice and experience. Personally, I have used Spark in situations where the files are large. Files with many rows and columns take time to process with libraries such as pandas; this is where Spark comes in handy, since even big data files can be processed much faster. But a powerful framework like Spark also comes with greater complexity.

PySpark

Since Spark is an open-source engine, it provides APIs in several programming languages, and one such supported language is Python. Other languages include Scala, Java, and R.

Inner workings

Spark usually has the following setup: one master node and multiple worker nodes.

  • The master node is responsible for splitting the data between the worker nodes and for managing the computation.
  • The master sends each worker the data to be processed and the calculations to run, and the workers send their results back to the master once the processing is over.
  • Spark works on the concept of a cluster, where the cluster (consisting of the master and the workers) is hosted on remote machines connected to one another.
  • When starting to work with Spark, it is better to run the cluster locally.
  • To connect to the Spark cluster, we just have to create a SparkContext, which serves as the entry point to Spark.
  • SparkContext takes a few optional arguments to configure the cluster settings, as sketched right after this list.
  • To install PySpark, use the following command: pip install pyspark
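
For instance, those settings can be bundled in a SparkConf object and handed to the SparkContext. Here is a minimal sketch, assuming we run locally and pick the arbitrary app name first_app (note that only one SparkContext can be active at a time):

    from pyspark import SparkConf, SparkContext

    # Bundle the cluster settings: run locally on all cores, with an app name.
    conf = SparkConf().setMaster("local[*]").setAppName("first_app")

    # Create the entry point to Spark using that configuration.
    sc = SparkContext(conf=conf)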

Let’s get into some coding
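
A minimal sketch, this time passing the settings directly as arguments (again, first_app is just a placeholder name):

    from pyspark import SparkContext

    # Create a SparkContext running locally, with an app name of our choice.
    sc = SparkContext(master="local", appName="first_app")
    print(sc)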

In the above code, we create a SparkContext with the master configured as local and an app name supplied for the Spark context. Here, sc is the SparkContext.

Some of the other details available on sc are shown below.
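
For example, assuming the sc created above, a few of its standard attributes can be inspected like this:

    # Inspect the SparkContext we just created.
    print(sc.version)    # version of Spark in use
    print(sc.pythonVer)  # version of Python used by Spark
    print(sc.master)     # URL of the cluster, e.g. "local"
    print(sc.appName)    # name we gave our application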

The core data structure of Spark is the RDD (Resilient Distributed Dataset). This is the object that lets Spark split data across multiple nodes. But RDDs are a bit hard to work with directly, so, as discussed, we will read our CSV file in PySpark using a DataFrame instead.
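
Before moving on, here is a tiny sketch of an RDD in action (the numbers are made up for the example):

    # Distribute a small list across the cluster as an RDD
    # and run a computation on it in parallel.
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    squares = rdd.map(lambda x: x * x)
    print(squares.collect())  # [1, 4, 9, 16, 25]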

A DataFrame is similar to a SQL table, consisting of rows and columns. It is also a better option than an RDD for data processing.

To work with DataFrames, we have to create a SparkSession object on top of our SparkContext. Take a look at the code below.
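
A minimal sketch of that step (getOrCreate will reuse the SparkContext that is already running):

    from pyspark.sql import SparkSession

    # Build (or reuse) a SparkSession, the entry point for DataFrames.
    my_spark = SparkSession.builder.getOrCreate()
    print(my_spark)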

Let us now read the CSV file using the my_spark session we just created.
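
Something along these lines, assuming 5_a.csv sits in the current working directory:

    # Read the CSV file, letting Spark infer each column's data type.
    df = (my_spark.read
          .format("csv")
          .option("inferSchema", "true")
          .load("5_a.csv"))

    df.show()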

In the above code, we read a file called 5_a.csv. In the configuration we set the read format to CSV and inferSchema to true, which helps identify the data types of the columns correctly; however, keep in mind that this type inference is not always accurate in Spark.

df.show() will display the columns and data. As you may have noticed, although y and proba are the column names, they are treated as just another row, and the columns are named _c0 and _c1 instead. To avoid this, we need to pass one extra configuration option, as shown below.
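
The same read, with the extra header option, might look like this:

    # Re-read the file, this time treating the first row as column names.
    df = (my_spark.read
          .format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("5_a.csv"))

    df.show()  # columns now appear as y and proba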

Setting the extra header option to true tells Spark to treat the first row of the CSV file as the column names.

And yes, we have come to the end of this blog. Feel free to share your thoughts in the comments section; I would be happy to answer them.

Happy Coding!!!
