Coding Challenge #2

Mahdi Shadkam-Farrokhi
7 min readJul 18, 2019

Welcome one and all to the second installment of the coding challenges I’m encountering on my learning journey.

This time, I’m going to tackle some SQL queries. Just like before, I’ll be using CodeWars to source my challenge. Without further adieu, let’s get to it!

You can find the challenge here. Feel free to try it on your own!

As usual, what follows is my unedited attempt at this challenge — I’m literally typing what you’re reading while going back and forth from the challenge.

The problem prompt:

We’re given the schema for 3 tables relating actors (actor), films (film), and a relational table of both (film_actor).

The task is to find two actors who cast together the most and list titles of only those films they were cast in together (three columns: first_actor_full_name, second_actor_full_name, and film_name). Lastly, the results should be ordered by the film title alphabetically. Also, actor pairings should not repeat — the first actor should have a smaller id than the second.

Brainstorm: My first thought is to break this query down into nice easy chunks. I’m also recognizing we’re likely going to have to join tables to get the final solution.

My initial concern is finding two actors with the same film id, which I’m not 100% on, but think a nested query might work for this. Once I get that, the rest should fall into place. Here’s my game plan:

  1. Create list of all actor pairs who have the same film ids (probably nested query)
  2. Group by those pairs to get aggregate count, selecting the results matching the max of that count
  3. Order by film title
Exploring the data

Just to get a sense of everything and get familiar with the database, I joined the tables together and viewed the result, which ends up simply being a running list of all film+actor connections filled in with the individual details for the actor and the film.

Hiccup #1

After being genuinely stumped for longer than I care to admit, I did the first thing I could think of in my debugging strategy…I Googled it!

I searched for “sql select rows with same values in two column” and promptly found this generic solution, which takes the form of a sub-query, which I feel makes sense.

Researching Solution

Translating to the problem at hand, I get..

Translated ‘solution’

Which results in…basically the same table as film_actor😅 (with some extraneous single-actor films removed).

At this point, I fully admit my total ignorance of making complex SQL queries. Nevertheless, onward into the breach! I’m still searching around and I’ve come across the idea of a “self join”.

NOTE: I also stumbled upon the concat() SQL function — which will be helpful in forming the full name as expected.

The way I’m thinking of it, I’ll make a self join and make an actor+actor pairing based on film_id, effectively multiplying each instance of an actor by the number of other actors in each film they’re in (there would be a lot of redundant pairing, though 🤔).

So…how do I make that happen…

Hiccup #2

In all honesty, I had to take a break and clear my head. Also, to get some some paper and a pen — I work better writing out the logic and mulling over my options.

And that’s when I found the golden ticket

‘golden ticket’

This general solution succinctly meets the parameters of the problem, even down to removing those excessive duplicates I mentioned before.

All of the requirements are met:

  1. Self join on a common element (in our case the film_id)
  2. We ensure pairs are not repeated by requiring a dependent relationship in the id of the target (in our case the actor_id)

With renewed energy, here’s the sub-query translated for our problem:

Working code! Yay!

Which returns the following results:

Good results showing all unique combinations
Good Result

As we can see, the logic works beautifully. As the query goes through all combinations of pairings, it only singles out those with unique pairing, due to the restraints of having the first actor id to be always less than the second actor id.

I also tried the same query, but without the restrictive AND p1.actor_id < p2.actor_id line and got back the entire list of all possible combinations.

Poor results showing all possible combinations
Not so good result

I don’t recall learning about using logic directly after the ON clause. I thought it would be more appropriate in a WHERE or HAVING, but clearly it’s a tool worth knowing.

Looking more into it, the ON clause can absolutely take a conditional statement — source. What’s interesting is this conditional statement affects the join directly, as in if the condition is not met, the join doesn’t occur in the first place. My original thought was to join everything and THEN filter, which would have been much less efficient use of memory/resources.

I’m definitely going to have to put this in my bag of tricks!

Now, I need to select only the highest occurring pair of actors from this sub-query. I’m first going to rewrite the sub-query to make it easier to work with. I happened upon the idea of Common Table Expressions (CTE), which is a basic syntax for restating sub-queries in a convenient way. The following will have the same result as before:

Restating sub-query

Hiccup #3

So…yeah, trying to use the MAX(COUNT(film_id)) while grouping by actor_1_id and actor_2_id I’m finding it pretty impossible to do in one statement, meaning…I’m going to need another sub-query?

I’m not hesitating this time, to Google!

And yup, as expected, there’s an issue with using aggregate functions an aliases with HAVING, which would be ideal here. The problem is SQL’s execution order, which oddly doesn’t occur in the order written! the SELECT statement actually executes AFTER the HAVING clause executes. Anyway, here’s a possible solution using a sub-query — anyone else getting a little tired of sub-queries? 😋

Selecting for max pairing count

Essentially, in order to work around the issue of SQL’s order of execution, we had to create redundant sub-queries to have access to the MAX count of groupings for actor_1_id and actor_2_id. And yes, I know my variable names aren’t great! I’ll clean it up later, I’m just trying to grasp what’s going on and get something working first 😋

The logic thus far is as follows (in order of execution):

  1. Pair up all actors and create only unique pairs
  2. Group these pairs and total the number of films they’re in together
  3. Group these pairs AGAIN and select the pairing with the max count from previous step

With this (very obnoxious) query, we get the desired highest frequency single pairing.

Single row of highest occurring pairing!

I can FEEL the answer is so close!

Alright, we have the actor_1_id and actor_2_id, so we can reuse the table with paired actors and select only the rows with those actors, which should get the film_id's! However, I don’t think a join makes sense.

After trying a few joins, it’s clear overlapping the columns with a join is not quite right…

[5 minutes later]

…unless you join using a condition stating join ON actor_1_id AND actor_2_id. Using conditional joins is starting to make sense! Here’s what I have so far:

Selecting all rows with actor ids by pairs with film_id
Returning all rows of most paired actors with film_id

We’re basically near the finish line. Now I just need to join in the actual data for each actor and film, select the human-readable parts, and order by film title!

The final version looks soooo messy, but it follows sound-ish logic:

  1. Pair up all actors and create only unique pairs
  2. Group these pairs and total the number of films they’re in together
  3. Group these pairs AGAIN and select the pairing with the max count from previous step
  4. Join the findings from the previous step with the table of unique pairs on BOTH actors by id
  5. Join the human-readable tables on both actors and film by id
Working product! Woot! 😁

The last bit was a little tricky, since joining the same table twice would NOT appear in the results (probably UI issue), though if selected directly, each table still exists in the result. I’m sure there are better, more succinct ways of doing this, so let’s check out some of the better solutions…

Best practice solution

I actually don’t feel too far off from the other solutions I’m seeing. I’m posting one of the better “best practice” solutions, but there’s honestly so many ways of attacking this problem in SQL, it’s hard to say one is significantly better than another (at least to my naive eyes!).

I like this solution’s brevity and clear delineation of focus. The first CTE succinctly takes care of the pair groupings and max count — we see a cleaver tactic using ORDER BY along with a LIMIT 1 to get the max count without needing MAX. This is followed up by some short, but sweet, SELECT statements and lastly a bunch of joins near the end.

In all, I learned a fair amount from this challenge. Using a CTE and WITH and how joins can be made conditionally. Also, sub-queries! Oh, and I learned it’s possible to join a table to itself and you can make multiple joins using the same table.

That’s it for now! Did you discover anything new from this challenge? What’s an interesting SQL trick you’ve learned that would come in handy for a beginner/intermediate SQL practitioner?

I’m still figuring out my next post topic…maybe something funny. Anyway, until next time!

--

--