RSVP Movie Case Study

Harsh Patel
12 min readApr 28, 2024

--

Introduction

RSVP Movies is an Indian film production company that has produced many super-hit movies. They usually release movies for the Indian audience, but for their next project, they plan to release a movie for the global audience in 2022.

The production company wants to plan its every move analytically based on data and has approached you for help with this new project. You have been provided with data on the movies that have been released in the past three years. You have to analyze the data set and draw meaningful insights to help them start their new project.

Problem Statement

You are a data analyst and an SQL expert. You have to use SQL to analyze the given data and give recommendations to RSVP Movies based on the insights. For your convenience, the entire analytics process has been divided into four segments, where each segment leads to significant insights from different combinations of tables. Write an SQL code to answer each business question to help the company with its new project.

Dataset

Find SQL code to import data here

Find table data and ERD Model here

Questions and SQL Codes

Q1. Find the total number of rows in each table of the schema?

SELECT COUNT(*) FROM director_mapping;
Result
SELECT COUNT(*) FROM genre;
Result
SELECT COUNT(*) FROM movie;
Result
SELECT COUNT(*) FROM names;
Result
SELECT COUNT(*) FROM ratings;
Result
SELECT COUNT(*) FROM role_mapping;
Result

Q2. Which columns in the movie table have null values?

SELECT SUM(
CASE
WHEN id IS NULL THEN 1
ELSE 0
END) AS null_clm_id,
SUM(
CASE
WHEN title IS NULL THEN 1
ELSE 0
END) AS null_clm_title,
SUM(
CASE
WHEN year IS NULL THEN 1
ELSE 0
END) AS null_clm_year,
SUM(
CASE
WHEN date_published IS NULL THEN 1
ELSE 0
END) AS null_clm_date_published,
SUM(
CASE
WHEN duration IS NULL THEN 1
ELSE 0
END) AS null_clm_duration,
SUM(
CASE
WHEN country IS NULL THEN 1
ELSE 0
END) AS null_clm_country,
SUM(
CASE
WHEN worlwide_gross_income IS NULL THEN 1
ELSE 0
END) AS null_clm_worlwide_gross_income,
SUM(
CASE
WHEN languages IS NULL THEN 1
ELSE 0
END) AS null_clm_languages,
SUM(
CASE
WHEN production_company IS NULL THEN 1
ELSE 0
END) AS null_clm_production_company
FROM
movie;
country, worldwide_gross_income, languages, and production_company are the columns that have NULL values in the movies table

Q3. Find the total number of movies released each year? How does the trend look month wise?

SELECT  
year,
COUNT(title) AS number_of_movies
FROM
movie
GROUP BY
year;
Part I Result
SELECT  
MONTH(date_published) AS month_num,
COUNT(*) AS number_of_movies
FROM
movie
GROUP BY
MONTH(date_published)
ORDER BY
MONTH(date_published);
Part II Result

The highest number of movies is produced in the month of March.

Q4. How many movies were produced in the USA or India in the year 2019?

SELECT
COUNT(id) AS number_of_movies,
year
FROM
movie
WHERE
(country LIKE "%USA%" OR
country LIKE "%India%")
AND year = 2019
GROUP BY year;
Total 1059 movies were produced by USA or India in the year 2019

Q5. Find the unique list of the genres present in the data set?

SELECT 
DISTINCT genre
FROM
genre
ORDER BY genre ASC;
There are a total of 13 types of genres of movies.

Q6.Which genre had the highest number of movies produced overall?

SELECT 
COUNT(movie_id) AS movie_count, genre
FROM
genre
GROUP BY genre
ORDER BY movie_count DESC
LIMIT 1;
Drama genre produced the highest number of movies 4285 numbers.

Q7. How many movies belong to only one genre?

WITH one_genre_movies AS(
SELECT
movie_id
FROM
genre
GROUP BY
movie_id
HAVING
COUNT(genre)=1)
SELECT
COUNT(movie_id)
FROM
one_genre_movies;
There are 3289 movies that belong to only one genre.

Q8.What is the average duration of movies in each genre?

SELECT 
genre,
ROUND(AVG(duration)) AS avg_duration
FROM
genre g
INNER JOIN
movie m
ON g.movie_id = m.id
GROUP BY
genre
ORDER BY
avg_duration DESC
Result

Q9.What is the rank of the ‘thriller’ genre of movies among all the genres in terms of number of movies produced?

WITH genre_rank_table AS(
SELECT
genre,
COUNT(movie_id) AS movie_count,
RANK() OVER(ORDER BY COUNT(movie_id) DESC) as genre_rank
FROM
genre
GROUP BY genre)
SELECT *
FROM
genre_rank_table
WHERE
genre = "Thriller";
The rank of the “Thriller” genre regarding movie production is 3rd.

Q10. Find the minimum and maximum values in each column of the ratings table except the movie_id column?

SELECT 
MIN(avg_rating) AS min_avg_rating,
MAX(avg_rating) AS max_avg_rating,
MIN(total_votes) AS min_total_votes,
MAX(total_votes) AS max_total_votes,
MIN(median_rating) AS min_median_rating,
MAX(median_rating) AS max_median_rating
FROM
ratings;

Q11. Which are the top 10 movies based on average rating?

SELECT
m.title,
r.avg_rating,
RANK() OVER (ORDER BY r.avg_rating DESC) AS movie_rank
FROM
movie AS m
JOIN
ratings AS r
ON
m.id = r.movie_id
ORDER BY
movie_rank
LIMIT 10;

Q12. Summarise the ratings table based on the movie counts by median ratings.

SELECT
median_rating,
COUNT(movie_id) AS movie_count
FROM
ratings
GROUP BY
median_rating
ORDER BY
movie_count DESC;
Movies with a median rating of 7 are the highest in number.

Q13. Which production house has produced the most number of hit movies (average rating > 8)?

WITH prod_company_ranks as
(
SELECT
m.production_company,
COUNT(m.id) AS movie_count,
RANK() OVER (ORDER BY COUNT(m.id) DESC) AS prod_company_rank
FROM
movie AS m
JOIN
ratings AS r
ON
m.id = r.movie_id
WHERE
r.avg_rating > 8 and m.production_company is not NULL
GROUP BY
m.production_company
)
SELECT
production_company, movie_count, prod_company_rank
FROM
prod_company_ranks
WHERE
prod_company_rank = 1;
Dream Warrior Pictures and National Theatre Live are the two production houses that have produced the most number of movies.

Q14. How many movies released in each genre during March 2017 in the USA had more than 1,000 votes?

SELECT
g.genre,
COUNT(m.id) AS movie_count
FROM
movie AS m
INNER JOIN
ratings AS r
ON
m.id = r.movie_id
INNER JOIN
genre AS g
ON
m.id = g.movie_id
WHERE
r.total_votes > 1000
AND g.genre IS NOT NULL
AND YEAR(date_published) = 2017
AND country LIKE '%USA%'
AND MONTH(date_published) = 3
GROUP BY
g.genre
ORDER BY
movie_count DESC;
Result

Q15. Find movies of each genre that start with the word ‘The’ and which have an average rating > 8?

SELECT
m.title,
r.avg_rating,
g.genre
FROM
movie AS m
INNER JOIN
ratings AS r ON m.id = r.movie_id
INNER JOIN
genre AS g ON m.id = g.movie_id
WHERE
m.title LIKE "THE%"
AND r.avg_rating > 8
ORDER BY
avg_rating DESC;
Result

Q16. Of the movies released between 1 April 2018 and 1 April 2019, how many were given a median rating of 8?

SELECT
r.median_rating,
COUNT(m.id) AS movie_count
FROM
movie AS m
INNER JOIN
ratings AS r
ON
m.id = r.movie_id
WHERE
r.median_rating = 8
AND date_published BETWEEN '2018-04-01' AND '2019-04-01'
GROUP BY
r.median_rating;
8 movies were median ratings between 1 April 2018 and 1 April 2019.

Q17. Do German movies get more votes than Italian movies?

SELECT
"Italian" as language,
Sum(total_votes) AS total_vote_count
FROM
movie AS m
INNER JOIN
ratings AS r
ON
r.movie_id = m.id
WHERE
languages LIKE '%Italian%'

UNION

SELECT
"German" as language,
Sum(total_votes) AS total_vote_count
FROM
movie AS m
INNER JOIN
ratings AS r
ON r.movie_id = m.id
WHERE
languages LIKE '%German%'
ORDER BY total_vote_count DESC;
Yes, German movies get more votes than Italian movies.

Q18. Which columns in the names table have null values?

SELECT
COUNT(CASE WHEN names.name IS NULL THEN 1 END) AS name_nulls,
COUNT(CASE WHEN names.height IS NULL THEN 1 END) AS height_nulls,
COUNT(CASE WHEN names.date_of_birth IS NULL THEN 1 END) AS date_of_birth_nulls,
COUNT(CASE WHEN names.known_for_movies IS NULL THEN 1 END) AS known_for_movies_nulls
FROM
names;
Height, Date of Birth, and Known for Movies have null values.

Q19. Who are the top three directors in the top three genres whose movies have an average rating > 8?

WITH 
top_genres AS (
SELECT
g.genre, COUNT(m.id) as movie_count
FROM
ratings AS r
INNER JOIN
movie AS m ON r.movie_id = m.id
INNER JOIN
genre AS g ON g.movie_id = m.id
WHERE
r.avg_rating > 8
GROUP BY
g.genre
ORDER BY
movie_count DESC
LIMIT 3
-- This gives Drama, Action, Comedy.)
SELECT
name,
COUNT(g.movie_id) AS movie_count
FROM
names n
INNER JOIN
director_mapping dm ON n.id = dm.name_id
INNER JOIN
movie m ON m.id = dm.movie_id
INNER JOIN
genre g ON m.id = g.movie_id
INNER JOIN
ratings r ON r.movie_id = m.id
WHERE
genre IN (select genre from top_genres)
AND avg_rating > 8
GROUP BY
name
ORDER BY
COUNT(g.movie_id) DESC
LIMIT 3;
The top three directors are James Mangold, Joe Russo, and Anthony Russo.

Q20. Who are the top two actors whose movies have a median rating >= 8?

SELECT 
name AS actor_name,
COUNT(r.movie_id) AS movie_count
FROM
ratings AS r
INNER JOIN
role_mapping AS rm ON rm.movie_id = r.movie_id
INNER JOIN
names AS n ON rm.name_id = n.id
WHERE
r.median_rating >= 8
AND rm.category = 'actor'
GROUP BY
actor_name
ORDER BY
COUNT(*) DESC
LIMIT 2;
Mammootty and Mohanlal are the top two actors.

Q21. Which are the top three production houses based on the number of votes received by their movies?

SELECT
m.production_company,
SUM(r.total_votes) AS vote_count,
RANK()OVER(ORDER BY SUM(r.total_votes) DESC) AS prod_comp_rank
FROM
movie m
INNER JOIN
ratings r
ON m.id = r.movie_id
WHERE
m.production_company is not null
GROUP BY
m.production_company
ORDER BY
prod_comp_rank
LIMIT 3;
The top 3 production companies are Marvel Studios, Twentieth Century Fox, and Warner Bros.

Q22. Rank actors with movies released in India based on their average ratings. Which actor is at the top of the list?

WITH actor_ranking AS (
SELECT
n.name AS actor_name,
SUM(r.total_votes) AS total_votes,
COUNT(m.id) AS movie_count,
ROUND(SUM(r.avg_rating*r.total_votes)/SUM(total_votes),2) AS actor_avg_rating
FROM
movie m
INNER JOIN
ratings r
ON m.id = r.movie_id
INNER JOIN
role_mapping rm
ON m.id = rm.movie_id
INNER JOIN
names n
ON n.id = rm.name_id
WHERE
rm.category = "actor" AND
m.country = "India"
GROUP BY
actor_name
HAVING
movie_count >=5)
SELECT *,
RANK() OVER(ORDER BY actor_avg_rating DESC, total_votes DESC) AS actor_rank
FROM
actor_ranking;
The top three actors are Vijay Sethupati, Fahadh Faasil, and Yogi Babu.

Q23.Find out the top five actresses in Hindi movies released in India based on their average ratings?

WITH actress_ranking AS (
SELECT
n.name AS actress_name,
SUM(r.total_votes) AS total_votes,
COUNT(m.id) AS movie_count,
ROUND(SUM(r.avg_rating*r.total_votes)/SUM(total_votes),2) AS actress_avg_rating
FROM
movie m
INNER JOIN
ratings r
ON m.id = r.movie_id
INNER JOIN
role_mapping rm
ON m.id = rm.movie_id
INNER JOIN
names n
ON n.id = rm.name_id
WHERE
rm.category = "actress" AND
m.country = "India" AND
m.languages LIKE "%Hindi%"
GROUP BY
actress_name
HAVING
movie_count >=3)
SELECT *,
RANK()OVER(ORDER BY actress_avg_rating DESC, total_votes DESC) AS actress_rank
FROM
actress_ranking
LIMIT 5;
The top 5 actresses are 1. Taapsee Pannu, 2. Kriti Sanon, 3. Divya Dutta, 4. Shraddha Kapoor, and 5. Kriti Kharbanda.

Q24. Select thriller movies as per avg rating and classify them in the following category:

Rating > 8: Superhit movies
Rating between 7 and 8: Hit movies
Rating between 5 and 7: One-time-watch movies
Rating < 5: Flop movies

WITH thriller_movie_list AS(
SELECT
m.title,
r.avg_rating
FROM
ratings r
INNER JOIN
movie m
ON r.movie_id = m.id
INNER JOIN
genre g
ON m.id = g.movie_id
WHERE
g.genre = "Thriller")
SELECT *,
CASE
WHEN avg_rating > 8 THEN "Superhit movies"
WHEN avg_rating BETWEEN 7 AND 8 THEN "Hit movies"
WHEN avg_rating BETWEEN 5 AND 7 THEN "One-time-watch movies"
ELSE "Flop movies"
END AS movie_category
FROM
thriller_movie_list;

Find the result table here

Q25. What is the genre-wise running total and moving average of the average movie duration?

SELECT
g.genre,
ROUND(AVG(m.duration),2) AS avg_duration,
SUM(ROUND(AVG(m.duration),2)) OVER W1 AS running_total_duration,
AVG(ROUND(AVG(m.duration),2)) OVER W2 AS moving_avg_duration
FROM
movie m
INNER JOIN
genre g
ON m.id = g.movie_id
GROUP BY
g.genre
WINDOW W1 AS (ORDER BY genre ROWS UNBOUNDED PRECEDING),
W2 AS (ORDER BY genre ROWS 10 PRECEDING)
ORDER BY
g.genre;
Result

Q26. Which are the five highest-grossing movies of each year that belong to the top three genres?

WITH top_three_genre AS (
SELECT
COUNT(m.id) AS number_of_movies,
g.genre
FROM
movie m
INNER JOIN
genre g
ON m.id = g.movie_id
GROUP BY
g.genre
ORDER BY
COUNT(m.id) DESC
Limit 3)
, movie_summary AS (
SELECT g.genre,
m.year,
m.title AS movie_name,
CAST(REPLACE(REPLACE(IFNULL(m.worlwide_gross_income, 0), 'INR', ''), '$', '') AS DECIMAL(10)) AS worlwide_gross_income,
DENSE_RANK() OVER(PARTITION BY m.year ORDER BY CAST(REPLACE(REPLACE(IFNULL(m.worlwide_gross_income, 0), 'INR', ''), '$', '') AS DECIMAL(10))DESC) AS movie_rank
FROM
movie m
INNER JOIN
genre g
ON m.id = g.movie_id
WHERE g.genre IN (SELECT
genre
FROM
top_three_genre)
)
SELECT *
FROM
movie_summary
WHERE
movie_rank <=5
ORDER BY
year;
Result

Q27. Which are the top two production houses that have produced the highest number of hits (median rating >= 8) among multilingual movies?

WITH prod_comp_summary AS (
SELECT
m.production_company,
COUNT(m.id) AS movie_count
FROM
movie m
INNER JOIN
ratings r
ON m.id = r.movie_id
WHERE
r.median_rating >=8 AND
POSITION(',' IN m.languages)>0 AND
m.production_company IS NOT NULL
GROUP BY
m.production_company)
SELECT *,
RANK()OVER(ORDER BY movie_count DESC) AS prod_comp_rank
FROM
prod_comp_summary
LIMIT 2;
Star Cinema and Twentieth Century Fox are the top two production houses.

Q28. Who are the top 3 actresses based on number of Super Hit movies (average rating >8) in drama genre?

WITH actress_summary_table AS (
SELECT
n.name AS actress_name,
SUM(r.total_votes) AS total_votes,
COUNT(m.id) AS movie_count,
ROUND(SUM(r.avg_rating*r.total_votes)/SUM(total_votes),2) AS actress_avg_rating
FROM
movie m
INNER JOIN
genre g
ON m.id = g.movie_id
LEFT JOIN
ratings r
ON m.id = r.movie_id
LEFT JOIN
role_mapping rm
ON m.id = rm.movie_id
LEFT JOIN
names n
ON rm.name_id = n.id
WHERE
rm.category = "actress" AND
r.avg_rating >8 AND
g.genre = "drama"
GROUP BY
actress_name
)
SELECT *,
DENSE_RANK() OVER( ORDER BY movie_count DESC) AS actress_rank
FROM
actress_summary_table
ORDER BY
actress_rank
LIMIT 3;
The top 3 actresses based on the number of Super Hit movies in the Drama genre are: 1. Parvathy Thiruvothu, 2. Susan Brown, 3. Amanda Lawrence.

Q29. Get the following details for top 9 directors (based on number of movies):

Director id
Name
Number of movies
Average inter movie duration in days
Average movie ratings
Total votes
Min rating
Max rating
total movie durations

WITH next_publish_table AS (
SELECT
dm.name_id AS director_id,
n.name AS director_name,
dm.movie_id,
m.duration,
r.avg_rating,
r.total_votes,
m.date_published,
LEAD(m.date_published,1) OVER(PARTITION BY dm.name_id ORDER BY m.date_published) AS next_published_date
From
movie m
INNER JOIN
director_mapping dm
ON m.id = dm.movie_id
INNER JOIN
names n
ON n.id = dm.name_id
INNER JOIN
ratings r
ON m.id = r.movie_id),
director_details AS(
SELECT *,
DATEDIFF(next_published_date, date_published) as date_difference
FROM
next_publish_table)
SELECT
director_id,
director_name,
COUNT(movie_id) AS number_of_movies,
ROUND(AVG(date_difference)) AS avg_inter_movie_days,
ROUND(SUM(avg_rating*total_votes)/SUM(total_votes),2) AS avg_rating,
SUM(total_votes) AS total_votes,
MIN(avg_rating) AS min_rating,
MAX(avg_rating) AS max_rating,
SUM(duration) AS total_duration
FROM
director_details
GROUP BY
director_id
ORDER BY
COUNT(movie_id) DESC
LIMIT 9;
Result

Analysis and Recommendation:

  1. Drama takes the lead among genres, boasting 4285 movies, followed by comedy and thriller. Drama is also the most highly-voted genre.
  2. Thrillers and dramas were the highest-grossing among the top three genres, underscoring their financial success.
  3. Analysis shows that 107 minutes is the ideal movie duration, particularly for dramas.
  4. Considering the high votes and multilingual hit movies, Twentieth Century Fox is a strategic collaboration. Partnerships with Dream Warrior Pictures or National Theatre Live enhance the likelihood of producing a hit movie.
  5. James Mangold is recommended for the film project due to his proven success in directing hit movies in the top three genres.
  6. Actors Mammootty and Mohanlal have acted in the most hit movies. In the drama genre, actress Parvathy Thiruvothu and actor Andrew Garfield have the most super hit movies. For a regional touch, Vijay Sethupathi is a strong candidate. To woo Hindi-speaking audiences, foregoing Shraddha Srinath for Tapsee Pannu seems strategic.

--

--

Harsh Patel

Senior Data Analyst with ETL, PySpark, Python and SQL expertise, increasing data efficiency by 30%. Holds Google Data Analytics certificate, MBA & Data Science.