Do All Hollywood Actors Know Each Other? Breadth-First Search in “Action”
The task is straightforward: given two actor names, find the connection between them through their movie casts. Such as Emma Watson knows Jennifer Lawrence through Daniel Radcliffe (Harry Potter), who knows James McAvoy (Victor Frankenstein), who finally knows Jennifer Lawrence (Dark Phoenix). GitHub.
$ python degrees.py large
Name: Emma Watson
Name: Jennifer Lawrence
3 degrees of separation.
1: Emma Watson and Daniel Radcliffe starred in Harry Potter and the Chamber of Secrets
2: Daniel Radcliffe and James McAvoy starred in Victor Frankenstein
3: James McAvoy and Jennifer Lawrence starred in Dark Phoenix
The task is about solving a search problem, such as a maze, in which we are given an initial state (Emma Watson) and a goal state (Jennifer Lawrence). We need to find the shortest path between the initial state and the goal state, which means that we ought to take advantage of Breadth-First Search that is based on using a queue-frontier. By using a queue-frontier, we can gradually check all the available neighboring nodes from the initial state, meaning that we will always arrive at the shortest path because we will exhaust all states until we reach the shortest solution.
First, we need to see what data is available to us. The data is provided by IMDb and includes three tables: movies.csv with the list of titles, years, and ids; stars.csv with the list of person_ids and their movie_ids; and people.csv with the list of ids, names, and birth years.
Second, we need to establish what the node will store. In our case, the node class is an object with the
def __init__(self, state, parent, action):
self.state = state # the current actors id
self.parent = parent # the previous actors id
self.action = action # the current movie id
parent would prove really handy once we reach the final state and need to trace the path that we did. See it as a Singly Linked List where we have the pointer pointing at the next node (though, in our case, it points to the previous node).
So when it comes to Breadth-First Search, we should use a queue to store all nodes that we need to visit. Using a queue is essential to BFS as we need to traverse all nodes on the same level sequentially to find the shortest path. There is a nice medium article on BFS.
Having said that, now we can begin our
shortest_path function that takes source_id and target_id for the attributes. We first initialize the
frontier, then add the first node that stores
source_id for the state and
None for the parent and action. It is then important to keep the
explored set in order to avoid checking the same people and their movies.
frontier = QueueFrontier()
frontier.add(Node(state=source_id, parent=None, action=None))
explored = set()
We then traverse until we either find the solution or exhaust all possible nodes from the frontier. If the frontier is empty, we can say there is no connection. If it is not empty, we extract the node from the queue, specifically the one we put there first. If the node’s state is equal to the
target_id, it means that we have reached the solution, and we need to determine the path we did. This is where the node's parent comes in handy (yay!). We traverse back using the node and save all the nodes that we find in the
path. Once we put all the nodes in the list, we reverse it, and return it.
if frontier.empty(): # No more available nodes to check against target_id
raise Exception("No solution")
node = frontier.remove() # Take out the first element from the queue
if node.state == target_id: # We found the final state!
path = 
while node.parent is not None:
node = node.parent
# Don't forget to reverse it because we were moving from the final state to the initial
It is not over yet. Bear with me. If the node’s state is not equal to the
target_id, then we add it to the explored set and add all of its neighbors to the frontier.
explored.add((node.action, node.state)) # Record that we've already seen this actor and movie
for action, state in neighbors_for_person(node.state):
if not frontier.contains_state(state) and (action, state) not in explored:
# New node for the next actor and movie
child = Node(state=state, parent=node, action=action)
frontier.add(child) # By adding to the frontier, we put it in the end of the queue
You might be wondering what is the
neighbors_for_person function doing? Well, it simply returns all the neighboring actors that starred in the same movie. For example, given Emma Watson as the initial state, we extract all the ids of her movies (all Harry Potter movies probably) and then go over them and the people who took part in those movies as well. We finally return the set of all those people's ids and the movie ids.
movie_ids = people[person_id]["movies"]
neighbors = set()
for movie_id in movie_ids:
for person_id in movies[movie_id]["stars"]:
The algorithm, in general, is certainly quite slow. The
shortest_path function has a runtime complexity of around O(n⁴) which is terribly bad, but the point of the project was rather about applying BFS in real-world data. I hope you enjoyed reading this post. Please note that this project is a part of my coursework towards Harvard’s CS50 Artificial Intelligence; you can find more about it here. Please see my GitHub repository as well.
python degrees.py large
Name: Eddie Murphy
Name: Bill Clinton
3 degrees of separation.
1: Eddie Murphy and Halle Berry starred in Boomerang
2: Halle Berry and Robert Downey Jr. starred in Gothika
3: Robert Downey Jr. and Bill Clinton starred in The Last Party