Jaccard Similarity — Movie Director

Andrew Dziedzic
Web Mining [IS688, Spring 2022]
5 min read · Mar 10, 2022

A non-obvious insight I would like to extract from my data is which directors have directed movies in more than one country. This could inform decisions about which directors are able to produce a film in a foreign country, and which director a foreign movie company should hire to make that country's next popular film. Such insights would show movie corporations which directors have already worked abroad and are therefore worth considering for another foreign production.

My data comes from Kaggle (kaggle.com); the specific dataset is 'Wikipedia Movie Plots'. It has just under 35,000 movie records, each with the release year, title, origin/ethnicity, director, cast, genre, wiki page, and plot. The features I use to determine similarity are the origin/ethnicity and the director, along with the release year and the director's first name. The similarity metric is Jaccard similarity, also known as the Jaccard index or the Jaccard similarity coefficient, a statistic used for gauging the similarity and diversity of sample sets. It is the size of the intersection of two sets divided by the size of their union, so it does not depend on the order of the elements in the sets. I use the Python programming language to capture, manipulate, and query the entities. Lastly, I downloaded the entire raw dataset and saved it on my local laptop.
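As a minimal sketch of that definition (the helper name `jaccard` and the toy lists are illustrative and not taken from the dataset), the score is the number of shared elements divided by the number of distinct elements overall. Because the inputs are converted to sets, reordering or repeating elements does not change the result.

```python
def jaccard(a, b):
    """Jaccard similarity of two iterables, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty sets score 0.0
        return 0.0
    return len(set_a & set_b) / len(union)

# Toy data (hypothetical): two shared elements out of four distinct ones
left = ["a", "b", "c"]
right = ["b", "c", "d"]
score = jaccard(left, right)                      # 2 / 4
reordered = jaccard(["c", "a", "b", "b"], right)  # order and repeats ignored
print(score, reordered)  # 0.5 0.5
```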

I provide three query entities and, for each query, the corresponding 10 most similar entities. I then compute the Jaccard similarity between the corresponding queries for each of the three query entities. I ran into many issues (bugs) while writing the code for the query entities and the similarity score. The initial data frame loaded from the CSV file had to be split by the two origin/ethnicities being compared, and this df1/df2 split had to be performed for each entity type. Lastly, the code imports jaccard_score from sklearn.metrics, although the scores reported here are computed with a small set-based function, which proved simple and efficient. The main limitation I see is the amount of data: there could be more data points, potentially close to 100,000. Adding other origins/ethnicities would also add value, since more data points would be collected.
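The split by origin/ethnicity can be sketched with pandas as follows. This is an illustration under stated assumptions, not the exact code used for the analysis: the helper `directors_for` is hypothetical, the small inline frame stands in for the CSV and every row in it is made up, and the column names `Release Year`, `Origin/Ethnicity`, and `Director` are assumed to match the dataset's headers.

```python
import pandas as pd

# Tiny stand-in for wiki_movie_plots_deduped.csv; every row is made up
data = pd.DataFrame({
    "Release Year": [1980, 1990, 2010, 1985, 1999, 1970],
    "Origin/Ethnicity": ["British", "British", "British",
                         "Canadian", "Canadian", "Canadian"],
    "Director": ["Charles Alpha", "Dana Beta", "Charles Gamma",
                 "Charles Alpha", "Charles Delta", "Charles Epsilon"],
})

def directors_for(df, origin, first_name=None, years=None):
    """Return the directors for one origin, optionally filtered."""
    mask = df["Origin/Ethnicity"] == origin
    if first_name is not None:
        # Match the first name as a whole leading word
        mask &= df["Director"].str.startswith(first_name + " ")
    if years is not None:
        start, end = years
        mask &= df["Release Year"].between(start, end)  # inclusive bounds
    return df.loc[mask, "Director"].tolist()

british = directors_for(data, "British", "Charles", (1976, 2005))
canadian = directors_for(data, "Canadian", "Charles", (1976, 2005))
print(british)   # ['Charles Alpha']
print(canadian)  # ['Charles Alpha', 'Charles Delta']
```

Building both lists with one helper avoids repeating the df1/df2 split by hand for every pair of origins.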

The main takeaways from my similarity analysis are as follows. First, the American/British combination has no director in common, at least for the first 20 data points that appear in each dataset, so its Jaccard similarity is 0.0. This invites further analysis of which specific genres and release-year ranges could show a high Jaccard similarity score. Second, for the British/Canadian combination, looking specifically at directors whose first name is Charles and whose films were released from 1976 to 2005, the Jaccard similarity is 0.2000. We can infer that Charles Jarrott directed both British and Canadian films and could be hired again to direct a similar movie in the foreign country. Lastly, for the Russian/British combination, looking specifically at directors whose first name is Sergei, the Jaccard similarity is 0.1667. We can infer that Sergei Bodrov directed both a Russian and a British movie during his career. If there were strong interest in a British movie very similar to the movies he directed in Russia, he would be the best choice. The complete code is provided below, along with the corresponding output:

import numpy as np
import pandas as pd

# Load the local copy of the Kaggle 'Wikipedia Movie Plots' dataset
path_dataset = "wiki_movie_plots_deduped.csv"
data = pd.read_csv(path_dataset)
data.head()
len(data)

# List the distinct origin/ethnicity values
np.unique(data['Origin/Ethnicity'])

# Count the movies for the two origins being compared
print("# of American movies" + ' ' + str(len(data.loc[data['Origin/Ethnicity']=='American'])))
print("# of British movies" + ' ' + str(len(data.loc[data['Origin/Ethnicity']=='British'])))
# of American movies 17377
# of British movies 3670

# Keep only the American and British rows
df1 = pd.DataFrame(data.loc[data['Origin/Ethnicity']=='American'])
df2 = pd.DataFrame(data.loc[data['Origin/Ethnicity']=='British'])
data = pd.concat([df1, df2], ignore_index=True)
print("# TOTAL movies in the American & British dataset now" + ' ' + str(len(data)))
# TOTAL movies in the American & British dataset now 21047

finaldata = data[["Title", "Plot"]]
finaldata = finaldata.set_index('Title')
finaldata.head(10)
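# jaccard_score is imported here but not called; the function below computes the score from sets
from sklearn.metrics import jaccard_score

# Jaccard similarity: size of the intersection divided by the size of the union
def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    return float(len(s1.intersection(s2)) / len(s1.union(s2)))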
American = ['Edwin S. Porter',
'Wallace McCutcheon',
'Wallace McCutcheon and Edwin S. Porter',
'Francis J. Marion and Wallace McCutcheon',
'Edwin S. Porter',
'Wallace McCutcheon and Ediwin S. Porter',
'Edwin Stanton Porter',
'D. W. Griffith',
'D. W. Griffith',
'D.W. Griffith',
'D. W. Griffith',
'D. W. Griffith',
'D. W. Griffith',
'D. W. Griffith',
'D. W. Griffith',
'D.W. Griffith',
'D.W. Griffith',
'Sidney Olcott',
'D. W. Griffith',
'Oscar Apfel']
British = [

'Hugh Ford',
'Henry Kolker',
'Alfred Hitchcock',
'Herbert Wilcox',
'Alfred Hitchcock',
'Alfred Hitchcock',
'Alfred Hitchcock',
'Frank Miller',
'Herbert Wilcox',
'Alfred Hitchcock',
'Monty Banks',
'Alfred Hitchcock',
'Alfred Hitchcock',
'Ewald André Dupont',
'Alfred Hitchcock',
'Anthony Asquith',
'Victor Saville',
'Alfred Hitchcock',
'Ewald André Dupont',
'Victor Saville'
]

jaccard_similarity(American,British)
0.0
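
The American list shows the same director under several spellings, such as 'D. W. Griffith' and 'D.W. Griffith', and a set treats these as different elements. One possible cleanup step, added here for illustration and not part of the analysis above, is to normalize names before building the sets. The helper `normalize_name` and its two rules are assumptions; real data would likely need more rules.

```python
import re

def normalize_name(name):
    """Collapse spacing differences, e.g. 'D. W. Griffith' vs 'D.W. Griffith'."""
    name = re.sub(r"\.\s*", ". ", name)   # exactly one space after each period
    name = re.sub(r"\s+", " ", name)      # collapse repeated whitespace
    return name.strip()

names = ["D. W. Griffith", "D.W. Griffith", "D. W.  Griffith", "Sidney Olcott"]
raw_unique = set(names)
clean_unique = {normalize_name(n) for n in names}
print(len(raw_unique), len(clean_unique))  # 4 2
print(sorted(clean_unique))                # ['D. W. Griffith', 'Sidney Olcott']
```

This does not change the American/British score of 0.0, since the two lists still share no names, but it keeps spelling variants from inflating the union in other comparisons.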

The same code as above, now for the British/Canadian pair:

# Directors with the first name "Charles" and release years from 1976 to 2005
from sklearn.metrics import jaccard_score

def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    return float(len(s1.intersection(s2)) / len(s1.union(s2)))
British = ['Charles Jarrott',
'Charles Sturridge',
'Charles Sturridge',
'Charles Sturridge',
'Charles Beeson']
Canadian = [
'Charles Jarrott',
'Charles Martin Smith',
'Charles Binamé',
'Charles Martin Smith',
'Charles Binamé'
]

jaccard_similarity(British,Canadian)
0.2

The same code as above, now for the Russian/British pair:

# Directors with the first name "Sergei"
from sklearn.metrics import jaccard_score

def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    return float(len(s1.intersection(s2)) / len(s1.union(s2)))
Russian = ['Sergei Bodrov Jr.',
'Sergei Loban',
'Sergei Bodrov',
'Sergei Loznitsa',
'Sergei Zhigunov']
British = [
'Sergei Nolbandov',
'Sergei Bodrov'
]

jaccard_similarity(Russian,British)
0.16666666666666666
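
To see which names drive a score, the same set arithmetic can also return the overlap itself. The sketch below reuses the Russian and British "Sergei" lists from the example above; the helper `jaccard_with_overlap` is hypothetical and not part of the original code.

```python
def jaccard_with_overlap(list1, list2):
    """Return (Jaccard score, sorted list of names found in both lists)."""
    s1, s2 = set(list1), set(list2)
    shared = s1 & s2
    union = s1 | s2
    score = len(shared) / len(union) if union else 0.0
    return score, sorted(shared)

Russian = ['Sergei Bodrov Jr.', 'Sergei Loban', 'Sergei Bodrov',
           'Sergei Loznitsa', 'Sergei Zhigunov']
British = ['Sergei Nolbandov', 'Sergei Bodrov']

score, shared = jaccard_with_overlap(Russian, British)
print(round(score, 4), shared)  # 0.1667 ['Sergei Bodrov']
```

One shared name out of six distinct names gives 1/6, which matches the score reported above and names the director behind it.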
