Lets Do Some Data Analysis with SQL-PostgreSQL

Amany Abdelhalim
Analytics Vidhya
Published in
15 min readNov 19, 2020

--

In this post, I will teach you how to do data analysis by writing some interesting SQL queries using the following tables, which are part of a DVD_Rental Database.

actor (actor_id, first_name, last_name)

film (film_id, title, description, release_year,language_id, rental_duration, rental_rate,replacement_cost, rating)

language(language_id, name)

category(category_id, name)

film_category(film_id, category_id)

film_actor(film_id, actor_id)

I used PostgreSQL - pgAdmin, I opened the query tool to start writing my SQL queries.

I prepared 35 questions that cover different topics such as group by, order by, interesting operators and functions, sub queries, join, subqueries, aggregate functions, cube, rollup, fetch first, limit, exists, window functions (e.g. lead(), lag(), first_value(), rank(), percent_rank(), partition over(),etc). I will take you through how I answered them and provide sample of the queries output. For some of the questions I will provide more than one solution. Let’s start.

1) Select the first_name and the last_name of each actor. This query should result in selecting 2 columns.

2) Select the full name of the actor. This query should result in 1 column.

Notes:

The || is a concatenation operator.

full_name here is called an alias which is how we prefer the column to be named.

We can write an alias after the keyword AS or without the AS.

If you like to use an alias that has a space, surround it with double quotations

e.g. “full name”.

3) Select the actors that have names starting with a ‘D’.

Solution 1: using the LIKE operator.

LIKE ‘D%’ means a name that starts with ‘D’ follwoed by anything else.

LIKE ‘d%’ means a name that starts with ‘d’ follwoed by anything else.

OR is a logical operator that combines both of the conditions and will allow the name to be selected if at least one of the two conditions is true.

Solution 2: using the ILIKE operator

ILIKE ignores the case, so it will retrieve those actors whose first names start with d or D.

4) Select all the actor information sorted by the first name ascending.

ORDER BY will sort the result ascending by default.

5) Count the unique actor’s first names

DISTINCT will select the unique first names.

Count(column_name) is an aggregate function, that will count the rows of a specific column. In this case it is counting the unique first names only.

The actor table has 200 actors, only 128 of them are unique first names.

Let's see how many unique full names do we have of those actors.

The result is 199 unique full names out of 200 full names. That means that 2 actors share their full names.

6) Count the number of films based on their rental duration. rental_duaration in the film table referes to how long is the DVD allowed to be rented.

203 movies have a rental duration as 4 days, 212 have a rental duration as 6 days, etc.

Above, there was no ordering, let’s order based on the rental_duration.

By default the ordering goes ascending, let’s order based on the rental_duration but descending. In order to do that we will add the keyword DESC.

Let’s order by the count(tittle) descending,

7) Select the maximum replacement cost

Max(column_name) is an aggregate function, that will select the maximum value for a specific column.

8) Select the titles of the movies that have the highest replacement cost.

In the above query, the selected part is a subquery in the WHERE clause that will select the maximum replacement_cost in the film table, which we found out before that it is 29.99. Then we will select the titles and replacement costs for those movies that their replacement cost equals the maximum replacement cost (29.99).

9) Select the unique different ratings for the movies in the film table.

Solution 1: using Distinct.

Solution 2: using group by.

Both queries will show the following result:

10) Select the number of movies available under each rating.

11) Change the movie language for the first 20 movies from English language to Italian.

All the movies in the film table originally have a language_id = 1, and English is the language that has a language_id as 1. I will change some of the movies to be in Italian. I need to figure out what is the language_id related to Italian from the Language table first.

We have 1000 movies in the film table, now after the previous update statement, that updates the language_id to equal 2 (the Italian language Id) for all the movies that have a film_id ≤ 20. The number of movies in English language should be 980 and those in Italian should be 20 movies. Lets check that in the next question.

12) Select the count of movies grouped by language

The results here confirms what we stated in the previous question.

13) Select the language that most of the movies belong to.

Based on what we saw in the previous question, Language_id (1) has 980 movies and Language_id (2) has 20 movies, so the result here should be the English language.

We need a subquery here that will first select the language_id that has the highest count of movies, which will give back 1. Then we select the language name which has language_id equivalent to the id retrieved by the subquery, which is supposed to be English.

13) Select movie titles and replacement costs and ratings along with the average replacement cost for movies in the rating that the movie belongs to.

In the above query, I am using a window function OVER (PARTITION BY ) to select the average replacement_cost for movies in each rating category. So I am displaying for each movie its title, its replacement_cost, its rating and the average replacement_cost for the rating that the movie belongs to. This is useful so you can compare the replacement_cost of the movie to the other movies in the same rating category of that movie.

I used the AVG(column_name), which is an aggregate function that calculates the average. I used the ROUND() function to round the result to 2 decimal points.

I can get the same results by writing the following query, where I get the average replacement_cost of the movie’s rating by a subquery in the SELECT clause.

14) Select the ratings and average replacement cost for Italian movies.

I am using a subquery in the WHERE clause to get the Italian language_id from the language table. I get all the ratings and average the replacement_cost for each rating only for the Italian movies.

15) Count the movies that has the maximum replacement_cost in the film table.

53 movies out of the 200 movies have a replacement cost as 29.99, which is the maximum replacement cost.

16) Count the number of movies we have for each language in the film table. (Note we have movies that are in English and Italian languages only).

Solution 1: using Inner JOIN.

JOIN is going to join the rows from both tables that have matching language_id’s and then will select the specified coulmns. Since both tables have language_id, I used alias(f for film table and l for language table) for both tables to differentiate between both.

17) Count the number of movies exist under each of the 6 languages that exisit in the language table (English, Italian, French, Mandarine, Japanese and German).

Solution 1: using RIGHT JOIN.

So the above query will join the rows naturally in the two tabels where language_id from film matches that from language. The next step is to check if there is any language_id in the table on the right which is language that didn’t come in the JOIN result and add it to the result. In this case language table has 3 extra languages. So they will be added in the result and the count will be 0.

Solution 2: using LEFT JOIN.

So the above query will join the rows naturally in the two tabels where language_id from film matches that from language. The next step is to check if there is any language_id in the table on the left which is language that didn’t come in the JOIN result and add it to the result. In this case language table has 3 extra languages. So they will be added in the result and the count will be 0.

Both solutions will result of the following:

18) Count the number of movies grouped by language_id and rating and sort the results by the language id ascending.

19) Select the language_id, rating and count the number of movies under each language_id and rollup on language_id and rating then sort by the language id.

In the query above, we are grouping by language_id and rating, selecting the language_id and rating and the count of movies under each group. Rollup() then gives extra information which is the total count() for each language_id regardless of the rating. You can see below that the movie_count for language_id 1 is 980 and for language_id 2 is 20. The total count of movies (1000) will also be calculated regardless of the rating or the language_id.

In the following query, I switched the order of the columns after the rollup. Check below how the output will be different. The rollup will ignore the language_id and calculate the totals for the each rating then ignore both and calculate the total of the movies regardless of their language_id and rating.

20) Select the language_id, rating and count the number of movies under each language_id, then cube on language_id and rating and sort the results by the language id.

In the query above, we are grouping by language_id and rating, selecting the language_id and rating and the count of movies under each group. Cube() is going to do what ROLLUP() did and then add extra information. So it will give the total count() for each language_id regardless of the rating. You can see below that the movie_count for language_id 1 is 980 and for language_id 2 is 20. In addition to that Cube() will provide the total count() of movies for each rating regardless of the language_id (178 movies under ‘G’, 210 movies under ‘NC-17’, 194 movies under ‘PG’, 223 movies under ‘PG-13’ and 195 movies under ‘R’. Cube() will also provide total count() of movies regardless of the language_id and the rating which is the number of rows in the film table (1000) rows.

21) Select movie titles, rental_rates along with a new column that has the rental rate discounted 20%.

You can use the ROUND() function to round the result, TRUNC() function to remove the decimal part, or CEIL() function to return the smallest integer value that is bigger than or equal to the result.

22) Select title, replacement_cost, rating, along with the maximum replacement cost under that rating.

Solution 1: to get the maximum replacement cost under a rating category . We can use MAX() with the window function OVER PARTITION.

In the first solution, I partitioned by the rating and got the maximum value using the MAX() function.

Solution 2: to get the maximum replacement cost under a rating category . We can use first_value() with the window function OVER PARTITION.

In the second solution, I partitioned by the rating and ordered by the replacement_cost descending and picked the first_value() which is going to be the maximum. So it will give the same result as the first solution.

Solution 3: using a subquery in the SELECT clause.

All three solutions will result in the following :

sample of the result

23) Select for every movie it’s title, replacement_cost, rating and the replacement_cost for the movie in the previous row in the same rating.

As you can see above, because we don’t have a previous movie in the ‘G’ rating for the first movie the lag is null.

24) Select for every movie it’s title, replacement_cost, rating and the replacement_cost for the movie in the next row in the same rating.

As you can see above, because we don’t have a following movie in the ‘NC-17’ rating for the last movie the lead is null.

25) Group the rows in the film table into 4 buckets or 4 order groups.

NTILE() will add a column and this column will have a value representing the group number of each row. Since I asked for 4 groups NTILE(4), the values of the ntile column will be from 1 to 4.

To see how the 1000 movies were distributed among the four groups, I did the following:

26) Select for each movie it’s title, rental_rate and the rental_rate for the next two rows partitioned by the language_id.

Since in the query we were asking for lead of 2 rows the last two rows had a null lag because there is not available 2 rows following them.

27) Select for each movie it’s title, rental_rate and the rental_rate for the two previous rows partitioned by the language_id.

Since in the query we were asking for lead of 2 rows the first two rows have a null lag because there is no 2 rows available ahead of them.

28) Show the rank by rating and order the results by rental_rate.

The first 55 movies have a rental_rate of 4.99, so all 55 movies will have the same rank which is equal to 1, the movie 56 has a different rental_rate of 2.99, so the rank started from 56 to compensate for the previous 55 that all shared the same rank of 1.

29) Show the percentage rank by rating and order the results by replacement_cost.

Above instead of the rank the precentage rank is given.

30) Create “temp_table” table with the WITH statement that has all the movies with titles that start with a ‘b’ or ‘B’. Then select all information from the “temp_table”.

In this example you will see the WITH statement, the ILIKE that will allow us to select titles that begin with a b regardless of its case (upper or lower case).

31) Select all the movies excluding those under ‘PG-13’ or ‘G’ ratings.

Solution 1: using the membership operator (NOT IN) and listing the ratings that you want to exclude from the result.

Solution 2: using the membership operator (IN) and listing the ratings that you want to include in the result.

Solution 3: using the not equal (<>) operator and listing the ratings that you want to exclude from the result.

Solution 4:using the equal (=) operator and listing the ratings that you want to include in the result.

32) Select the actors with actor_id from 1 to 20.

Solution 1: using the between operator.

Solution 2: using comparison operators ≥ and ≤.

33) Select the actor’s first_names that have more than 5 letters.

LENGTH() will count the number of caracters.

34) Select language_ids and language names that have movies in the film table with those language_ids.

In the above query, if the subquery (SELECT * FROM film f WHERE l.language_id = f.language_id) has results that mean exists will evaluate to true and the language_id, name will be returned.

The following is another solution to the query:

35) Select the first 5 rows in the actor table.

Solution 1: using LIMIT.

Solution 2: using FETCH FIRST.

Both solutions will get the first five actors with actor_id from 1 to 5.

I hope you found the post interesting and that I was able to demonstrate the subject in a good and clear way.

--

--

Amany Abdelhalim
Analytics Vidhya

PhD. in Computer Engineering | Research Associate | Computer Science Instructor | love Machine Learning & Big Data.