Analysis of Top Hollywood Actors

How can Hollywood directors increase the likelihood of winning an award based on the cast?

Colleen Wang
INST414: Data Science Techniques
May 3, 2024

Stakeholders/Decisions Informed:

In Hollywood, there is fierce competition to hire the most notable, accomplished actors. Studying the most acclaimed actors and what makes them similar can yield useful findings for stakeholders in the film industry who hope to build a notable cast. By studying data about actors and their award nominations and wins, we can identify clusters of similar actors based on their award recognition. Casting actors with a similar level of award recognition may lead to higher-quality movies, more on-screen chemistry, and more popularity, and may therefore increase the likelihood of an award. The key question this data set can help answer is: how can Hollywood directors increase the likelihood of winning an award based on the cast?

The answer to this question is relevant to casting directors and film producers who are trying to build a cast that gives the movie the best chance of success. These are the stakeholders for this question because an analysis of data about the top actors can help increase viewership, notability, and anticipation for the movie. If an actor has a history of being nominated for and winning awards, it suggests that they have significant talent and industry recognition, making them desirable choices for leading roles. The specific decision this analysis could inform is a casting director's or film producer's choice of actors for leading roles, made with the goal of increasing the likelihood of winning an award.

Data:

To answer the proposed question, a data set containing information about the top Hollywood actors and their award recognition is essential. The data should include numeric counts of each actor's award nominations and award wins. Each actor's highest-rated movie, age, and usual movie genre would also be helpful. These fields are relevant because they can be analyzed to produce clusters of actors based on their success. An analysis of these fields can indicate how likely an actor is to win an award or to elevate the movie with their acting, and it can inform stakeholders' casting decisions. I collected a subset of this data from Kaggle, a free resource for open data sets. The fields contained in this data set are:

  • Name
  • Date of Birth
  • Place of Birth
  • Oscars
  • Oscar Nominations
  • BAFTA
  • BAFTA Nominations
  • Golden Globes
  • Golden Globe Nominations
  • Greatest Performances
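As a rough illustration of how the six award-count fields could be pulled out of a file with these columns, here is a minimal sketch using only Python's standard library. The two rows ("Actor A", "Actor B") and their counts are made-up placeholders, not values from the Kaggle data set, and the helper name `load_features` is hypothetical.

```python
import csv
import io

# Column names as listed above; these are the numeric award counts.
FEATURES = [
    "Oscars", "Oscar Nominations",
    "BAFTA", "BAFTA Nominations",
    "Golden Globes", "Golden Globe Nominations",
]

# Placeholder rows standing in for the real file (values are invented).
sample_csv = """Name,Oscars,Oscar Nominations,BAFTA,BAFTA Nominations,Golden Globes,Golden Globe Nominations
Actor A,0,1,0,0,1,2
Actor B,2,8,1,5,3,9
"""

def load_features(handle):
    """Return (names, rows), where each row is the six award counts as ints."""
    names, rows = [], []
    for record in csv.DictReader(handle):
        names.append(record["Name"])
        rows.append([int(record[col]) for col in FEATURES])
    return names, rows

names, rows = load_features(io.StringIO(sample_csv))
```

With a real file, `io.StringIO(sample_csv)` would be replaced by an opened file handle.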

Similarity/K Selection:

In this analysis, similarity between data points is measured using Euclidean distance. The features used for measuring similarity are Oscars, Oscar nominations, BAFTA, BAFTA nominations, Golden Globes, and Golden Globe nominations. These features represent the actor’s achievements and recognition in the industry. I then used the K-means algorithm for clustering to group the actors into clusters based on these features and to identify groups of actors that are similar based on these recognitions. In this analysis I chose k=3 for the number of clusters. This was based on trial and error but also on the levels of recognition for all actors in order to capture those with many wins across all categories and those with few nominations and wins. The analysis is based on 6 features so choosing 3 allows for a meaningful, consistent range of recognition levels. I also chose 3 clusters to mimic the common classification in the industry such as C-list actors, B-list actors, and A-list actors. A higher number of clusters resulted in inconsistent groups which could cause confusion.
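To make the similarity measure and the clustering step concrete, here is a minimal sketch of Euclidean distance and a plain K-means loop (Lloyd's algorithm) in standard-library Python. It is not the code from my repository: the function names, the seed, and the six toy feature vectors are assumptions for illustration, and a library implementation would normally be used in practice.

```python
import math
import random

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=100, seed=0):
    """Assign each point to its nearest centroid, then recompute the means."""
    rng = random.Random(seed)
    # Start from k distinct points chosen at random (an assumed init scheme).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

# Invented vectors in the order: Oscars, Oscar nominations, BAFTA,
# BAFTA nominations, Golden Globes, Golden Globe nominations.
toy_actors = [
    (0, 1, 0, 0, 1, 2), (0, 2, 0, 1, 0, 3),
    (1, 4, 1, 3, 2, 6), (1, 5, 1, 4, 2, 7),
    (3, 12, 3, 8, 6, 17), (2, 10, 2, 7, 5, 14),
]
labels, centroids = kmeans(toy_actors, k=3)
```

Because K-means depends on its starting centroids, different seeds can produce different groupings, which is one reason a larger k can give inconsistent clusters.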

Figures:

Scatterplot of clusters of actors

Cluster Descriptions:

Cluster 1 can be described as a group of actors who have not won or been nominated for many awards. While these actors are notable names in the entertainment industry, they have not been able to achieve as much award recognition. Examples include Hugh Jackman and Nicolas Cage.

Cluster 2 consists of actors who have established names in the industry and have earned many awards and nominations, although not as many as those in Cluster 3. Examples include Clint Eastwood and Peter O’Toole, who have had an exceptional number of nominations across awards.

Cluster 3 consists of the most highly acclaimed actors, who have won multiple awards and have been nominated across award shows. Examples of actors in this cluster are Jack Nicholson and Dustin Hoffman, who have won multiple Oscars, BAFTAs, and Golden Globes.

Answer:

The analysis of the data set revealed groups of Hollywood actors based on their award wins and nominations. By examining these clusters, stakeholders such as casting directors and film producers can make informed casting choices to increase the likelihood of winning awards. Actors in Cluster 3 can be considered “A-list” actors, with Clusters 2 and 1 being “B-list” and “C-list” actors, respectively. Actors in Cluster 3 could be prioritized for leading roles to increase the likelihood of an award win and raise the overall talent of the cast. Actors in Clusters 2 and 1 can be considered for supporting roles to help build anticipation for the film and strengthen cast chemistry. These findings can inform how casting directors and film producers build a cast, both to increase the likelihood of winning an award and to improve the film's overall success.

Data Cleaning/Bugs:

In this analysis, the selected dataset was already fairly clean and consistent, so no major cleaning steps were necessary. The dataset had uniform data types and no missing values in the columns that I selected to work with. I built the clustering input by extracting the six feature columns: ‘Oscars’, ‘Oscar Nominations’, ‘BAFTA’, ‘BAFTA Nominations’, ‘Golden Globes’, and ‘Golden Globe Nominations’. Others analyzing this data set might run into bugs with the remaining columns, such as “Date of Birth” or “Greatest Performances”, which need more context or standardization before they can be formatted and analyzed. The output and graph rely on calculating the centroid of each cluster and sorting the clusters by the sum of their centroid values. This step ensures that the cluster numbers are assigned consistently, so that each cluster corresponds to the correct tier of actors.
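The centroid-sorting step can be sketched as follows. This is one plausible reading of that step rather than the exact code from my repository: the function name `rank_clusters_by_centroid_sum`, the centroid values, and the labels below are all invented for illustration.

```python
def rank_clusters_by_centroid_sum(labels, centroids):
    """Renumber clusters 1..k so a larger centroid sum gets a larger number."""
    # Order the original cluster ids from lowest to highest total recognition.
    order = sorted(range(len(centroids)), key=lambda c: sum(centroids[c]))
    new_id = {old: rank + 1 for rank, old in enumerate(order)}
    return [new_id[lab] for lab in labels]

# Hypothetical centroids in the order: Oscars, Oscar nominations, BAFTA,
# BAFTA nominations, Golden Globes, Golden Globe nominations.
centroids = [
    [2.5, 11.0, 2.5, 7.5, 5.5, 15.5],  # raw cluster 0: highest totals
    [0.0, 1.5, 0.0, 0.5, 0.5, 2.5],    # raw cluster 1: lowest totals
    [1.0, 4.5, 1.0, 3.5, 2.0, 6.5],    # raw cluster 2: middle totals
]
labels = [0, 1, 2, 1, 0]  # raw K-means labels for five hypothetical actors

ranked = rank_clusters_by_centroid_sum(labels, centroids)
# ranked is [3, 1, 2, 1, 3]: the cluster with the largest centroid sum becomes Cluster 3.
```

Without a step like this, K-means numbers its clusters arbitrarily, so "Cluster 3" would not reliably mean the most decorated group from one run to the next.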

Limitations/Bias:

The limitations of this dataset and analysis include the scope of the features used to calculate clusters. The features rely on metrics from only three award shows, which do not cover every award an actor might have received and do not capture an actor’s true talent or influence in the industry. The data set also lacks contextual data, such as each actor's most common genre and when they were nominated for these awards. There is also inherent bias because the analysis relies on award show wins and nominations. Award shows are subjective, and who wins can be influenced by external factors such as industry politics or biased voting. Another source of bias is that the dataset only examines the top male actors in the industry and does not consider actresses, which increases the risk of biased analyses and conclusions.

Here is a link to my GitHub repository that contains the code I developed for this assignment: https://github.com/cwangg/INST414-Modules/tree/main/module-4
