Spark Programming with DataFrames

Vaishnavi Nandakumar
8 min readDec 16, 2023

Introduction

In this blog, we will be exploring how we can user Spark DataFrames and Spark SQL to explore the ML-100K dataset.

A DataFrame is a distributed collection of data organized into named columns. Spark SQL provides a programming interface for data manipulation using SQL queries and DataFrame API.

Both DataFrames and Spark SQL provide a higher-level, more user-friendly interface for working with structured and semi-structured data compared to the lower-level RDD API.

Data

The ml-100k dataset consists of 100,000 movie ratings (1–5) from 943 users on 1682 movies. You can download the dataset from here.

Movie Ratings data in u.data

User data in u.user

Movie Data in u.item

The ml-100k dataset consists of 100,000 movie ratings (1–5) from 943 users on 1682 movies. You can download the dataset from here.

Insert Data

val inputFile = sc.textFile("/Users/vaishnavink/Downloads/ml-100k/u.data") 
val df = inputFile.toDF()
val ratings = df.select(split(col("value"), "\t").getItem(0).as("userId"),
split(col("value"), "\t").getItem(1).as("movieId"),
split(col("value"), "\t").getItem(2).as("rating"),
split(col("value"), "\t").getItem(3).as("timestamp"))
scala> ratings.show(10)
+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
| 196| 242| 3|881250949|
| 186| 302| 3|891717742|
| 22| 377| 1|878887116|
| 244| 51| 2|880606923|
| 166| 346| 1|886397596|
| 298| 474| 4|884182806|
| 115| 265| 2|881171488|
| 253| 465| 5|891628467|
| 305| 451| 3|886324817|
| 6| 86| 3|883603013|
+------+-------+------+---------+

Problem Statements

Analyze rating popularity in movie reviews

scala> ratings.groupBy("rating").count().sort("rating").show
+------+-----+
|rating|count|
+------+-----+
| 1| 6110|
| 2|11370|
| 3|27145|
| 4|34174|
| 5|21201|
+------+-----+
scala> ratings.registerTempTable("ratingsData")
scala> spark.sql("select rating, count(*) from ratingsData group by rating order by rating desc;").show
+------+--------+
|rating|count(1)|
+------+--------+
| 5| 21201|
| 4| 34174|
| 3| 27145|
| 2| 11370|
| 1| 6110|
+------+--------+

Generate a sorted list of the most popular movies

scala> ratings.groupBy("movieId").count().sort($"count".desc).show(10)
+-------+-----+
|movieId|count|
+-------+-----+
| 50| 583|
| 258| 509|
| 100| 508|
| 181| 507|
| 294| 485|
| 286| 481|
| 288| 478|
| 1| 452|
| 300| 431|
| 121| 429|
+-------+-----+

scala> ratings.registerTempTable("ratingsData")
scala> spark.sql("select movieId, count(*) as ct from ratingsData group by movieId order by ct desc;").show(10)
+-------+---+
|movieId| ct|
+-------+---+
| 50|583|
| 258|509|
| 100|508|
| 181|507|
| 294|485|
| 286|481|
| 288|478|
| 1|452|
| 300|431|
| 121|429|
+-------+---+

Generate a sorted list of the most highly rated movies

scala> ratings.groupBy("movieId").agg(avg("rating")).sort($"avg(rating)".desc).show(20)
+-------+-----------------+
|movieId| avg(rating)|
+-------+-----------------+
| 1500| 5.0|
| 1536| 5.0|
| 1293| 5.0|
| 1599| 5.0|
| 1653| 5.0|
| 1189| 5.0|
| 814| 5.0|
| 1201| 5.0|
| 1122| 5.0|
| 1467| 5.0|
| 1449| 4.625|
| 1398| 4.5|
| 1642| 4.5|
| 119| 4.5|
| 1594| 4.5|
| 408|4.491071428571429|
| 318|4.466442953020135|
| 169|4.466101694915254|
| 483| 4.45679012345679|
| 114|4.447761194029851|
+-------+-----------------+
scala> spark.sql("select movieId, avg(rating) as avgRating from ratingsData group by movieId order by avgRating desc").show
+-------+-----------------+
|movieId| avgRating|
+-------+-----------------+
| 1500| 5.0|
| 1536| 5.0|
| 1293| 5.0|
| 1599| 5.0|
| 1653| 5.0|
| 1189| 5.0|
| 814| 5.0|
| 1201| 5.0|
| 1122| 5.0|
| 1467| 5.0|
| 1449| 4.625|
| 1398| 4.5|
| 1642| 4.5|
| 119| 4.5|
| 1594| 4.5|
| 408|4.491071428571429|
| 318|4.466442953020135|
| 169|4.466101694915254|
| 483| 4.45679012345679|
| 114|4.447761194029851|
+-------+-----------------+

Find the age distribution of the movie reviewers

scala> val ratingsJoinUser = ratings.join(user, Seq("userId")).groupBy("age").count()
scala> val ageDistribution = ratingsJoinUser.select(col("age").cast("int"), col("count")).sort("age")
scala> ageDistribution.show
+---+-----+
|age|count|
+---+-----+
| 7| 43|
| 10| 31|
| 11| 27|
| 13| 497|
| 14| 264|
| 15| 397|
| 16| 335|
| 17| 897|
| 18| 2219|
| 19| 3514|
| 20| 4089|
| 21| 3020|
| 22| 3979|
| 23| 2683|
| 24| 4556|
| 25| 4013|
| 26| 3503|
| 27| 6423|
| 28| 3619|
| 29| 3650|
+---+-----+
scala> spark.sql("select age, count(*) from ratingsData r join usersData u on r.userId = u.userId group by age;").show(20)
+---+--------+
|age|count(1)|
+---+--------+
| 7| 43|
| 51| 2232|
| 15| 397|
| 54| 214|
| 11| 27|
| 29| 3650|
| 69| 156|
| 42| 1699|
| 73| 56|
| 64| 95|
| 30| 3762|
| 34| 1252|
| 59| 92|
| 28| 3619|
| 22| 3979|
| 16| 335|
| 52| 629|
| 35| 3363|
| 47| 1614|
| 43| 1480|
+---+--------+

Produce a list of the top movies for every age

scala> val ratingsJoinUser = ratings.join(user, Seq("userId"))
scala> val ageMovieGroupDistbn = ratingsJoinUser.select(col("age").cast("int").as("userAge"), col("movieId")).groupBy("userAge", "movieId").count()
scala> ageMovieGroupDistbn.show
+-------+-------+-----+
|userAge|movieId|count|
+-------+-------+-----+
| 44| 515| 8|
| 44| 288| 12|
| 57| 382| 2|
| 57| 101| 1|
| 57| 136| 2|
| 57| 564| 1|
| 34| 154| 4|
| 32| 333| 7|
| 44| 286| 18|
| 36| 808| 1|
| 32| 181| 18|
| 19| 116| 3|
| 49| 307| 3|
| 22| 291| 3|
| 29| 197| 6|
| 40| 760| 2|
| 31| 596| 5|
| 15| 123| 2|
| 39| 580| 4|
| 33| 824| 1|
+-------+-------+-----+

scala> val ageMaxCount = ageMovieGroupDistbn.groupBy("userAge").max("count")
ageMaxCount: org.apache.spark.sql.DataFrame = [userAge: int, max(count): bigint]

scala> ageMaxCount.show
+-------+----------+
|userAge|max(count)|
+-------+----------+
| 31| 18|
| 65| 3|
| 53| 10|
| 34| 11|
| 28| 25|
| 27| 31|
| 26| 24|
| 44| 18|
| 22| 25|
| 47| 11|
| 52| 5|
| 13| 5|
| 16| 5|
| 40| 13|
| 20| 25|
| 57| 7|
| 54| 3|
| 48| 14|
| 19| 21|
| 64| 2|
+-------+----------+

scala> val renamedAgeDistbn = ageMovieGroupDistbn.select(col("userAge").as("age"), col("movieId"), col("count"))
scala> val topMovies = renamedAgeDistbn.join(ageMaxCount, query)

scala> topMovies.show
+---+-------+-----+-------+----------+
|age|movieId|count|userAge|max(count)|
+---+-------+-----+-------+----------+
| 44| 286| 18| 44| 18|
| 73| 580| 1| 73| 1|
| 66| 199| 1| 66| 1|
| 73| 514| 1| 73| 1|
| 73| 216| 1| 73| 1|
| 63| 690| 3| 63| 3|
| 64| 690| 2| 64| 2|
| 62| 100| 2| 62| 2|
| 73| 70| 1| 73| 1|
| 7| 7| 1| 7| 1|
| 51| 286| 17| 51| 17|
| 36| 181| 13| 36| 13|
| 73| 430| 1| 73| 1|
| 54| 237| 3| 54| 3|
| 32| 50| 21| 32| 21|
| 73| 207| 1| 73| 1|
| 73| 191| 1| 73| 1|
| 57| 286| 7| 57| 7|
| 73| 50| 1| 73| 1|
| 11| 151| 1| 11| 1|
+---+-------+-----+-------+----------+

scala> val movieNameList = topMovies.select(col("age"), col("movieId")).join(movies, Seq("movieId")).sort("age")
scala> movieNameList.show(50)
+-------+---+--------------------+
|movieId|age| movieTitle|
+-------+---+--------------------+
| 69| 7| Forrest Gump (1994)|
| 315| 7| Apt Pupil (1998)|
| 2| 7| GoldenEye (1995)|
| 7| 7|Twelve Monkeys (1...|
| 403| 7| Batman (1989)|
| 257| 7| Men in Black (1997)|
| 294| 7| Liar Liar (1997)|
| 531| 7| Shine (1996)|
| 321| 7| Mother (1996)|
| 313| 7| Titanic (1997)|
| 688| 7|Leave It to Beave...|
| 1007| 7|Waiting for Guffm...|
| 683| 7| Rocket Man (1997)|
| 1013| 7| Anaconda (1997)|
| 319| 7|Everyone Says I L...|
| 242| 7| Kolya (1996)|
| 780| 7|Dumb & Dumber (1994)|
| 259| 7|George of the Jun...|
| 304| 7|Fly Away Home (1996)|
| 539| 7| Mouse Hunt (1997)|
| 751| 7|Tomorrow Never Di...|
| 873| 7|Picture Perfect (...|
| 286| 7|English Patient, ...|
| 172| 7|Empire Strikes Ba...|
| 301| 7| In & Out (1997)|
| 174| 7|Raiders of the Lo...|
| 28| 7| Apollo 13 (1995)|
| 135| 7|2001: A Space Ody...|
| 289| 7| Evita (1996)|
| 50| 7| Star Wars (1977)|
| 181| 7|Return of the Jed...|
| 678| 7| Volcano (1997)|
| 538| 7| Anastasia (1997)|
| 252| 7|Lost World: Juras...|
| 231| 7|Batman Returns (1...|
| 161| 7| Top Gun (1986)|
| 255| 7|My Best Friends ....|
| 435| 7|Butch Cassidy and...|
| 29| 7|Batman Forever (1...|
| 892| 7| Flubber (1997)|
| 164| 7| Abyss, The (1989)|
| 258| 7| Contact (1997)|
| 82| 7|Jurassic Park (1993)|
| 71| 10|Lion King, The (1...|
| 501| 10| Dumbo (1941)|
| 627| 10|Robin Hood: Princ...|
| 418| 10| Cinderella (1950)|
| 969| 10|Winnie the Pooh a...|
| 95| 10| Aladdin (1992)|
| 1| 10| Toy Story (1995)|
+-------+---+--------------------+
scala> ratings.registerTempTable("ratingsData")
scala> movies.registerTempTable("moviesData")
scala> user.registerTempTable("userData")

scala> spark.sql("""SELECT t1.age, t1.movieid, t2.mc, m.movietitle FROM
| ( SELECT age, movieid, COUNT(*) AS mcount
| FROM ratingsData r
| JOIN usersData u ON r.userId = u.userId
| GROUP BY age, movieid
| )t1
| JOIN
| (
| SELECT age, MAX(mcount) AS mc
| FROM (
| SELECT age, movieid, COUNT(*) AS mcount
| FROM ratingsData r
| JOIN usersData u ON r.userId = u.userId
| GROUP BY age, movieid
| ) t2
| GROUP BY age
| ) t2
| JOIN moviesData m
| ON t1.age=t2.age and t1.mcount=t2.mc and m.movieId=t1.movieId order by age;""").show(50)

+---+-------+---+--------------------+
|age|movieid| mc| movietitle|
+---+-------+---+--------------------+
| 10| 432| 1| Fantasia (1940)|
| 10| 102| 1|Aristocats, The (...|
| 10| 71| 1|Lion King, The (1...|
| 10| 95| 1| Aladdin (1992)|
| 10| 627| 1|Robin Hood: Princ...|
| 10| 8| 1| Babe (1995)|
| 10| 422| 1|Aladdin and the K...|
| 10| 501| 1| Dumbo (1941)|
| 10| 225| 1|101 Dalmatians (1...|
| 10| 140| 1|Homeward Bound: T...|
| 10| 1219| 1|Goofy Movie, A (1...|
| 10| 878| 1|That Darn Cat! (1...|
| 10| 465| 1|Jungle Book, The ...|
| 10| 151| 1|Willy Wonka and t...|
| 10| 94| 1| Home Alone (1990)|
| 10| 768| 1| Casper (1995)|
| 10| 596| 1|Hunchback of Notr...|
| 10| 99| 1|Snow White and th...|
| 10| 946| 1|Fox and the Hound...|
| 10| 82| 1|Jurassic Park (1993)|
| 10| 404| 1| Pinocchio (1940)|
| 10| 1| 1| Toy Story (1995)|
| 10| 418| 1| Cinderella (1950)|
| 10| 420| 1|Alice in Wonderla...|
| 10| 477| 1| Matilda (1996)|
| 10| 172| 1|Empire Strikes Ba...|
| 10| 932| 1| First Kid (1996)|
| 10| 969| 1|Winnie the Pooh a...|
| 10| 50| 1| Star Wars (1977)|
| 10| 588| 1|Beauty and the Be...|
| 10| 393| 1|Mrs. Doubtfire (1...|
| 11| 1016| 1| Con Air (1997)|
| 11| 815| 1| One Fine Day (1996)|
| 11| 254| 1|Batman & Robin (1...|
| 11| 410| 1| Kingpin (1996)|
| 11| 685| 1|Executive Decisio...|
| 11| 473| 1|James and the Gia...|
| 11| 405| 1|Mission: Impossib...|
| 11| 21| 1|Muppet Treasure I...|
| 11| 849| 1|Days of Thunder (...|
| 11| 363| 1| Sudden Death (1995)|
| 11| 147| 1|Long Kiss Goodnig...|
| 11| 15| 1|Mr. Holland's Opu...|
| 11| 455| 1|Jackie Chan's Fir...|
| 11| 742| 1| Ransom (1996)|
| 11| 109| 1|Mystery Science T...|
| 11| 121| 1|Independence Day ...|
| 11| 117| 1| Rock, The (1996)|
| 11| 1| 1| Toy Story (1995)|
| 11| 222| 1|Star Trek: First ...|
+---+-------+---+--------------------+

Conclusion

To sum up, Spark DataFrames and Spark SQL make working with data in Spark easier. Exploring the ML-100K dataset becomes simpler and more efficient, allowing users to extract valuable insights effortlessly. Embracing these tools simplifies data analysis and enhances the process of extracting valuable insights from structured and semi-structured data.

--

--