Spark Simple Project Using DataFrames
Project: Finding the most popular movies
1. Downloading the Datasets
2. Steps to run a Spark project: finding the most popular movie
This is a simple project that gives you a feel for how Spark uses DataFrames to manipulate data as required.
We will try to find the most popular movie using Spark. It's a basic project, but it gives you an understanding of how things work in Spark.
I will be using DataFrames for the processing, through PySpark, Spark's Python API. First, we need to download the datasets for our small project.
I am using the dataset published by MovieLens, which you can download from https://grouplens.org/datasets/movielens/
The dataset is named ml-latest-small. You can download the other datasets as well; I picked this one because it is small, but the concept and the code are the same for the larger ones.
When you unzip ml-latest-small.zip you get four CSV files (links.csv, movies.csv, ratings.csv and tags.csv), which you can join and run various operations on. Other tools can do this too, but many of them cannot process a huge amount of data. With Spark, you can process millions of records.
The ratings.csv file contains several columns (userId, movieId, rating and timestamp); we are interested in the movieId and rating columns.
2. Steps to run this project
- Create a Python file
- Execute the Python file on a Spark cluster
I created a Python file called movies.py; you can use any IDE for this. Within that file, I did the following steps:
- Import all required libraries and packages.
- Create a DataFrame.
- Run DataFrame operations.
The DataFrame operations produce the result; it's like running SQL commands against a database.
When I run spark-submit movies.py from the terminal, the result is printed to the console.
- The output shows the top 8 movies along with their movieId.
- The movieId alone is not very informative, so we can join in the other files that came with the downloaded dataset.
- One way to do this is a regular join between the datasets.
- Another is a broadcast variable: Spark ships its value to every executor, so it is available there whenever a task needs it.
- Create a broadcast variable with sc.broadcast() and read the wrapped object through .value.
You can get the same result either with DataFrame operations or with a SQL query. If you want to use a SQL query, you first need to create a temp view, like this: dataframe1.createTempView("ratings")
Now you can perform various operations to get the desired result; for example, I joined the two DataFrames with join.
Which operations you perform on these DataFrames depends entirely on your business requirements.
When the two DataFrames are joined on a column expression, movieId appears twice in the result, once from each side, which is not what we want. You can fix this by renaming the column in either DataFrame before joining.
A typical data engineering project lifecycle looks like this:
Data Collection ==> Data Cleaning ==> Data Analysis ==> Data/Feature Engineering ==> Data Consuming Application.
If you want to install Spark on your machine, then you can have a look here.