Using the Girvan-Newman Algorithm To Recommend Movies

Taylor Hannan · Published in smucs · Apr 25, 2022

Taylor Hannan is currently a sophomore at SMU studying Computer Science. He is from Austin, TX.

For my project extension, I decided to use the Girvan-Newman algorithm to create a program where a user can enter a movie title from a selection of films and be given a list of recommendations for similar films.

The Girvan-Newman algorithm detects communities in graphs by first performing a breadth-first search from every node in the graph and giving each vertex a score based on the number of shortest paths it appears in. These scores are used to calculate the betweenness of each edge, which measures how many shortest paths pass through that particular edge. After every edge’s betweenness is calculated, the edge with the highest betweenness is removed from the graph. This process repeats until the partition’s modularity falls within a certain range; I chose between 0.3 and 0.995, based both on my research and on testing different values to see which gave the best results. The algorithm returns a graph that is divided into communities. [1] [2]
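To make that stopping rule concrete, here is a minimal sketch of the loop using NetworkX’s built-in girvan_newman and modularity helpers. It is only an illustration of the idea described above, not the implementation used in this project.

from networkx.algorithms.community import girvan_newman, modularity

def communities_by_modularity(graph, min_mod=0.3, max_mod=0.995):
    #girvan_newman yields the partition produced after each round of edge removal
    for partition in girvan_newman(graph):
        q = modularity(graph, partition)
        #Stop at the first partition whose modularity lands in the target range
        if min_mod <= q <= max_mod:
            return [sorted(c) for c in partition]
    return None #No partition ever reached the target range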

The data I used to create the graph comes from the IMDb website [3]. IMDb has a database of useful information about all types of media, not just movies, and it is available to the public in the form of .tsv (tab-separated values) files. The dataset I used comes from the file “title.basics.tsv.gz”, which includes information such as media type, title, genres, runtime, and year released.
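For reference, the column layout of each row in title.basics.tsv looks like this (the annotations are mine, based on IMDb’s published dataset documentation); the numeric indices explain the row[1], row[2], and row[8] lookups that show up in the code below.

#Column layout of each row in title.basics.tsv:
# 0: tconst         - unique title identifier
# 1: titleType      - "movie", "tvSeries", "short", ...
# 2: primaryTitle   - the title used for display
# 3: originalTitle  - title in the original language
# 4: isAdult
# 5: startYear
# 6: endYear
# 7: runtimeMinutes
# 8: genres         - comma-separated list, "\N" if missing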

I wrote the code for processing the data and converting it into .graphml form in Python, while I wrote the user interface and main logic in C++. I start by reading in the data using Python’s csv library, which can also be used to read .tsv files.

import csv

with open("../moviedata/title.basics.tsv", "r") as f:
    contents = csv.reader(f, delimiter="\t")
    rows = list(contents)[1:] #Skip the header row

Since the dataset includes forms of media other than movies, such as TV shows and shorts, I filtered the data to include just movies.

new_rows = []
for row in rows:
    if (row[1] == "movie" and row[8] != "\\N"):
        new_rows.append(row)

new_rows = new_rows[:205] #Ends up with 200 movies added to graph

Although the ideal scenario would be to include all the movies in my graph, the Girvan-Newman algorithm only runs so fast, and it would most likely take hours, if not days, to run on all the data. So, I just included 200 movies.

At this point, I now have a shortened .tsv file of 200 movies. Each line in the file contains several pieces of useful information, but the only information needed is the movie title and its genres. So, I wrote code to extract that information and store it in lists.

def parse_movies():
    with open("../moviedata/moviedataset.tsv", "r") as f:
        contents = csv.reader(f, delimiter="\t")
        rows = list(contents)

    title_list = []
    genre_list = []

    for row in rows:
        title_list.append(row[2]) #primaryTitle column
        genre_list.append(row[8].split(",")) #genres column, split into a list

    return title_list, genre_list

I then create a graph of this data using the NetworkX Python library. Each node in the graph is a movie title, and each edge in the graph connects nodes that share at least one genre.

def create_graph(title_list, genre_list):
    graph = nx.Graph()

    #Add nodes
    for x in title_list:
        graph.add_node(x)

    #Add edges
    for i, x in enumerate(title_list):
        current_genre_list = genre_list[i]
        for j, y in enumerate(title_list):
            if (i != j):
                current_genre_list2 = genre_list[j]
                for g in current_genre_list2:
                    if g in current_genre_list:
                        graph.add_edge(x, y)

    return graph

The graph is then written to a .graphml file to be used in the C++ portion of the program. There is one issue, though, which is that the Boost Graph Library does not have functionality to store the .graphml node id, which in this case is the movie title. To fix this issue, I wrote code to create a .txt file that stores the movie names so that the C++ program can still have access to them.

def make_node_txt_file(graph):
    with open("../moviedata/movienames.txt", "w") as f:
        for i, x in enumerate(graph):
            if (i > 0):
                f.write("\n")
            f.write(x)
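Putting the Python pieces together boils down to a few calls like the ones below. This is only a sketch: the function names follow the snippets above, but the .graphml output path is an assumption rather than something taken from the original code.

import networkx as nx

#Build the graph from the trimmed dataset and export it for the C++ program
title_list, genre_list = parse_movies()
graph = create_graph(title_list, genre_list)
make_node_txt_file(graph)
nx.write_graphml(graph, "../moviedata/moviegraph.graphml") #Illustrative path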

The data is now ready to be run through the Girvan-Newman algorithm. This only needs to be done once, as the community output is written to a .graphml file and can be read back into the program. From here, the rest of the code runs in C++.

The graph and the movie name text file are read in at the initialization of the MovieRec object. The code for the extension has the option of being run after the main code finishes executing.

string movie;
cout << "Would you like to run the movie recommendation code? (type 'yes' in all lowercase): " << endl;
getline(cin, movie);

if (movie == "yes") {
    cout << "Setting up..." << endl;
    string movieFileName = "moviegraph";
    MovieRec mr(movieFileName);
    mr.interface();
}

If the user types yes, the main interface function is called. It starts by asking the user to enter the name of a movie. The title has to be chosen from the options in the .txt file and entered in the same format, though the check isn’t case sensitive. The code checks whether the user input is valid, and if it is, it calls a function that returns a random subset of up to ten movies from the same community.

if (validMovie) {
    vector<string> recommendations = getRecommendations(movieIndex);
    cout << "Here are your recommendations: " << endl;
    for (int i = 0; i < recommendations.size(); i++) {
        cout << i + 1 << ". " << recommendations.at(i) << endl;
    }

    string again;
    cout << "Would you like to submit another search?: " << endl;
    getline(cin, again);

    if (again == "yes") {
        interface();
    }
}

else {
    cout << "Not a valid movie. Read the list and try again (Make sure format is correct. Not case sensitive)" << endl;
    interface();
}
Inside getRecommendations, the program looks up the community that contains the chosen movie and selects a random subset of it:

vector<vertex_descriptor> community = communities.at(movieCommunity);

//Collect every vertex in the community except the movie that was searched for
vector<vertex_descriptor> indexArray;
for (int i = 0; i < community.size(); i++) {
    if (community.at(i) != movieIndex) {
        indexArray.push_back(community.at(i));
    }
}

//Shuffle the collected vertices to randomize the recommendations
random_shuffle(indexArray.begin(), indexArray.end());

//Recommend at most ten movies, or the whole community if it is smaller
int amount;
if (indexArray.size() < 10) {
    amount = indexArray.size();
}
else {
    amount = 10;
}

vector<string> recommendations;
for (int i = 0; i < amount; i++) {
    //Each entry of indexArray is a vertex descriptor, which lines up with nameVector
    recommendations.push_back(nameVector.at(indexArray[i]));
}

return recommendations;

As a whole, I’m pleased with the results. The Girvan-Newman algorithm did a decent job, as it did not make too many communities but also did not make too few. There is always room for improvement, however, both in the algorithm and in the movie recommendations. For the algorithm, I did have to use a slightly higher maximum modularity value (0.995) than the one I used for my main program (0.9). This is because the movie dataset has a higher number of edges per vertex, so it was dividing into far too many communities.
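For readers who want the precise definition behind those numbers, modularity measures how much more densely connected the vertices inside each community are than they would be in a random graph with the same degrees:

Q = \frac{1}{2m}\sum_{ij}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\delta(c_i, c_j)

where A_{ij} is the adjacency matrix, k_i is the degree of vertex i, m is the total number of edges, and \delta(c_i, c_j) is 1 when vertices i and j are in the same community and 0 otherwise. Values closer to 1 indicate stronger community structure.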

One way I would like to improve my movie recommendations is to include more movies in the dataset. The main data file from [3] contained thousands of movies, but I was only able to include 200. Adding more would have significantly increased the runtime of Girvan-Newman and would have been unrealistic given the timeline of this project. It would also be cool to add more categories to recommend by, such as release date, actors or actresses, and directors. Another idea would be to expand to the other forms of media included in the main dataset.

References

[1]: http://infolab.stanford.edu/~ullman/mmds/book0n.pdf

[2]: https://medium.com/analytics-vidhya/girvan-newman-the-clustering-technique-in-network-analysis-27fe6d665c92

[3]: https://www.imdb.com/interfaces/
