Evolution of Popular Entertainment Over Time According to New York Times Movie Reviews — A Semester Project Report

Megan Resurreccion
Web Mining [IS688, Spring 2021]
May 14, 2021

Authors: Kantida Nanon and Megan Resurreccion

Source: https://www.pexels.com/@cottonbro?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels

Problem and Motivation

Popular entertainment evolves. According to a 2011 article by Cutting et al., English-language films released between 1935 and 2010 decreased in average shot length, motion, and luminance. While these measures describe the production of films rather than their plots, they still indicate a change in how films are made. We can assume that films have continued to evolve since 2011; if Cutting et al. are correct, the trend toward shorter shots and darker images has continued past 2010. With movie review data, we can also determine whether this trend is being received well by critics and regular movie-goers alike.

Source: https://flic.kr/p/53oKrf

Entertainment has become more necessary than ever, particularly during the COVID-19 pandemic. Streaming services grew in popularity during quarantine, allowing many more people to spend time watching films and television. Knowing what trends have been popular over time can help the entertainment industry decide what to write and produce next, and gives a picture of what kind of entertainment society considers, and has considered, “good” and “entertaining.” Figuring out what the general public wants to watch is vital to producers, directors, actors, and anyone else in the film industry.

The Data and Key Ideas

We are studying whether and how popular entertainment such as film has evolved. Some of the questions we are investigating: What does the general public want to watch? What kinds of films do critics consider “good” and worth watching? As members of the general public, we want to assume that films critics believe to be “good” are worth watching. Films also represent and reflect our society, providing a picture of what is important to us overall. This information can help those in the film industry figure out what kind of entertainment the general public enjoys and what should be produced next.

Source: http://attheback.blogspot.com/p/the-ultimate.html

The New York Times (NYT) Movie Reviews API gives us access to movie reviews by NYT critics: over 22,000 reviews reaching back to 1920. It allows us to query review contents along with other data such as byline, critic's pick status, date updated, MPAA rating, title, headline, opening date, article link, and multimedia. We used this API to investigate how popular entertainment (particularly films) has evolved. Python, the language we used, has an API wrapper for this called pynytimes, and the requests library also came in handy for this project. Using the API, we can find movie reviews by title keyword, date, or critic.
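As a quick illustration, here is a minimal sketch of querying reviews through the pynytimes wrapper. The API key is a placeholder, and the method name and result fields follow the pynytimes and NYT API documentation as we understand them, so treat the details as assumptions:

```python
from pynytimes import NYTAPI

# Placeholder key; register for one on the NYT developer site.
nyt = NYTAPI("YOUR_NYT_API_KEY")

# Fetch reviews whose titles match a keyword.
reviews = nyt.movie_reviews(keyword="King Kong")

for review in reviews:
    print(review["display_title"], review["publication_date"])
```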

The interesting key ideas are extracting characteristics that are vital to a movie being popular or unpopular, such as whether one genre has become more popular than another, or whether certain tokens in a critic's review correlate with a film being successful or “good.” Using the NYT API, we extract the genres, review tokens, and sentiments associated with popular entertainment throughout the years. Another key point is the data analysis itself: tokenization and sentiment analysis on the reviews to conclude whether a movie is good or bad based on the critic's review. Additionally, insight from the last decade's data could suggest the top tokens of the next decade, hinting at how movies might evolve. This project could help the public identify good movies, and help filmmakers come up with movies the general public will be interested in watching.

The Process
Data Collection
We used the New York Times Movie Reviews API to collect reviews by its 82 listed movie critics. We extracted each critic's list of reviews, review contents, reviewer name, publication dates, and article links into a CSV file per critic. The data collection process consists of the following steps (a condensed code sketch follows the list):

  1. We used the following libraries in a Python Jupyter Notebook: pynytimes, requests, pandas, json, BeautifulSoup, re, and operator.itemgetter. We also used an API key from a registered account on the NYT website to access this data.
  2. The API offers several ways to extract movie reviews: reviews that are critics’ picks, reviews by title keyword, reviews by U.S. opening date, reviews by publication date, and reviews by a specified critic. [1] There is no way to search for all movie reviews at once, but there is an option to list all the critics’ names. Hence, we first obtained all the critics’ names and then collected each critic’s movie reviews, using the requests, json, and pandas libraries. We exported just the names as a list into a CSV file for future reference.
  3. Once we had our list of critics to refer to, we extracted information on the reviews each has written. First, we built a pandas data frame of general information on the reviews for each critic: the reviewer name, movie title, Motion Picture Association (MPAA) rating, publication date, article link, whether the film was a critic’s pick, etc., as shown in the screenshot below. We then stored each of these data frames in its own CSV file.
  4. We extracted the review content by following the article links under the link column of the data frame obtained previously. Each entry in this column is a dictionary indicating the type of review (normally ‘article’) and the associated URL. To extract just the URLs, we filtered the data frame to the link column, used itemgetter to pull the URL value from each dictionary, and appended the URLs to a list.
  5. This next step is arguably the most important in extracting the article content, and was also the most difficult to figure out: cleaning the text and storing it. This is where requests and BeautifulSoup come into play. We used BeautifulSoup’s HTML parser to extract the text of each article, stripping out the HTML, CSS, and newline (\n) characters, and appended each cleaned article to a new list.
  6. Finally, we converted the lists into another pandas data frame and exported the review contents (text) into a CSV file.
  7. Moving forward, we merged all the CSV files to better analyze them together, adding the article content as its own column to the first CSV files we originally extracted. Keeping the article texts in their own files also helps when studying them separately.
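Below is a condensed, hypothetical sketch of steps 2 through 5. The endpoint paths and field names follow the public NYT Movie Reviews API documentation as we used it, and the API key is a placeholder:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup
from operator import itemgetter

API_KEY = "YOUR_NYT_API_KEY"  # placeholder; register on the NYT developer site
BASE = "https://api.nytimes.com/svc/movies/v2"

# Step 2: list every critic, then keep just their names.
critics = requests.get(f"{BASE}/critics/all.json", params={"api-key": API_KEY}).json()
names = [c["display_name"] for c in critics["results"]]

# Step 3: pull one critic's reviews into a data frame and save it.
resp = requests.get(f"{BASE}/reviews/search.json",
                    params={"api-key": API_KEY, "reviewer": names[0]}).json()
df = pd.DataFrame(resp["results"])
df.to_csv(f"{names[0]}_reviews.csv", index=False)

# Step 4: each 'link' entry is a dictionary; extract just the article URLs.
urls = [itemgetter("url")(link) for link in df["link"]]

# Step 5: fetch each article and strip the HTML down to plain text.
texts = []
for url in urls:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    texts.append(soup.get_text(separator=" ", strip=True))
```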

With 82 critics, we have 164 files (2 per critic) and 1,097 movie reviews in total to analyze. The number of movies each reviewer has reviewed ranges from 1 to 20. We combined everything into a single CSV whose columns include the movie title, MPAA rating, critic’s pick, reviewer name, headline, summary, publication date, opening date, updated date, and article link; the data frame is shown in the pictures below. Each piece of data is stored as a list or a dictionary. The link column, for example, displays its content as dictionaries, and each such dictionary, along with all the other information for a given review by a specific critic, is stored in a list. The use of lists and dictionaries within a data frame allows for easier manipulation of our data when needed. We then cleaned the data by deleting rows with blank review content, leaving 936 rows in our data set.

Figure 1: a preview of the dataset

We analyzed plots and statistics of the data set. Some insights: the average number of movie reviews per critic is 14, each reviewer has reviewed between 1 and 20 movies, and only 16.77% (184 of 1,097) of movies were rated as a critic’s pick. Some statistics tables and graphs of the data set are shown in the following figures.

Figure 2: Statistics of the Film Review Data
Figure 3: Histogram (with Normal Curve) of Date Duration | Figure 4: Individual Value Plot of Majority Date Duration

The figures above show that the average gap between a movie’s opening date and its review’s publication date is 29 days, the range is 0–5,753 days, the standard deviation is 315.6 days, and the majority of gaps fall within 0–25 days (648 movies, or 94%). This insight can be useful to movie productions in terms of how long it takes to get feedback on a movie from NYT reviews.
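A minimal sketch of how such a gap can be computed with pandas; the column names here match the API fields we stored, but treat them as assumptions:

```python
import pandas as pd

# Parse both date columns, then compute the review lag in days.
df["publication_date"] = pd.to_datetime(df["publication_date"])
df["opening_date"] = pd.to_datetime(df["opening_date"])
df["gap_days"] = (df["publication_date"] - df["opening_date"]).dt.days

print(df["gap_days"].describe())  # count, mean, std, min/max of the gap
```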

Data Analysis
We used clustering and content mining techniques to find what NYT critics think of movies over a set period. The clustering and content mining techniques we used include web mining with an API, Natural Language Processing (NLP), tokenization, sentiment analysis, and clustering or topic modeling.

A. Tokenization

We analyzed the summary shorts of movies (the summary_short column in our data, a brief description of the movie rather than the critic’s review) to look at the kinds of movies being reviewed and how they may differ across the years. In the following graphs, we show the most common tokens among these summary shorts for all films, for critic’s picks, and for films in these time ranges: 1920–1959, 1960–1969, 1970–1979, 1980–1989, 1990–1999, 2000–2009, and 2010–present. The dates were taken from the publication date of the reviews. The number of reviews in each time range is uneven, with more reviews in some periods than others, but the tokens still give some context as to the kinds of movies reviewed at those times. This is also why the years 1920–1959 were grouped together: there were fewer films reviewed in each of those decades.

After extracting the summary shorts, we cleaned the data as necessary: we dropped duplicate summaries for films reviewed more than once, excluded NaNs, lowercased all the summaries, and tokenized them. From there, we filtered out unnecessary tokens such as stop words, as well as ‘film’ and ‘movie,’ since they were very frequent across the summaries. The following lists and plots show the thirty most common tokens for each category. Among the summary shorts for all films, it can be presumed that documentaries (94), dramas (53), and comedies (51) are the most reviewed among critics. Other notable tokens describing the films include family (33), love (27), and american (27).
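A minimal sketch of this token counting, assuming an NLTK setup (the summary_short column name is the API’s; the extra stop words are ours):

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

# Drop duplicates and NaNs, then lowercase each summary.
summaries = df["summary_short"].dropna().drop_duplicates().str.lower()

# Filter out stop words plus the overly frequent 'film' and 'movie'.
stop = set(stopwords.words("english")) | {"film", "movie"}
tokens = [t for s in summaries for t in word_tokenize(s)
          if t.isalpha() and t not in stop]

print(Counter(tokens).most_common(30))  # thirty most common tokens
```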

Figure 5: All Films

For more context, a film designated as a critic’s pick is one the reviewing critic recommends for others to view, which is a good sign. There are 184 critic’s pick films and 904 non-picks; the stated percentage of films that are critic’s picks is 20.35%, approximately one-fifth of movies reviewed by New York Times critics. Again, documentary and drama are top tokens for the films reviewed here. Other notable tokens don’t appear as commonly in the other categories, such as king (10), david (10), and adaptation (8). It’s also worth noting that since the critic’s pick subset is much smaller than the overall sample, there is less that can be significantly analyzed and extracted as useful.

Figure 6: Critic Picks

As previously mentioned, the years 1920–1959 were grouped together due to the small sample size of films in each decade; 1920–1929 had only three films, for example. The following decades (the 1930s, 1940s, and 1950s) had more, but individually these samples were still too small, so these years of “oldie” films were combined. Drama (4), war (4), and broadway (3) are notable tokens here.

Figure 7: Films for 1920–1959

The sample size for films in the 60s is also small. For films in the 1960s, notable and unique tokens include london (4), strained (3), and daughter (3).

Figure 8: Films for 1960–1969

Similar to the 60s, films in the 70s also had a small sample size. Notable tokens here include syndicate (3), women (3), and melodrama (2). The token documentary (2) also appears at the bottom of the top thirty. It can also be seen that James Bond movies, denoted by james (2) and bond (2), were released and reviewed during these years.

Figure 9: Films for 1970–1979

For films in the 1980s, it can be seen that more horror and thriller films were released, as noted by the tokens killer (4), gore (4), and horror (3). Other notable tokens related to those genres include kidnapped (2) and whodunit (2).

Figure 10: Films for 1980–1989

For films in the 1990s, janet (5) and maslin (5) are two notable tokens. Upon further investigation, we found that Janet Maslin is one of the NYT film critics and her name appears as part of the summary short. As the summary short is meant to be a brief description of the film, we’re not sure why critic names appear here. Otherwise, notable tokens include family (6), english (4), and class (4).

Figure 11: Films for 1990–1999

The 2000s are the first decade where documentary (21) shows up as a significant top token. Although documentary may have been a token in previous decades, the genre seems to have gained traction during these years. We can also see that the sample size for this decade increased significantly. More names appear among the most common tokens here, including lawrence (20), dave (20), elvis (17), and mitchell (17). The token mr (25), presumed to be “Mr.”, is also the second most common token here.

Figure 12: Films for 2000–2009

Finally, as we saw in the first plot, documentary (70), drama (27), and comedy (23) are the most common tokens here. This is likely because films from this decade had the largest sample size. Therefore, the tokens for summary shorts among films of the 2010s skew the original plot.

Figure 13: Films for 2010-present

B. Sentiment Analysis

Sentiment analysis is a way to extract subjective information from text; oftentimes it ends with a text being classified as positive, neutral, or negative. Analyzing the sentiment of critic reviews helps classify the critic’s opinion of a film. For our analysis, we used the VADER sentiment library. Additionally, moving forward we combined some of the decades into periods, defined as:

  • Oldies (20s-60s)
  • Retro (70s-90s)
  • Y2K (2000s)
  • Modern (2010s-present)

Figures 14 and 15 measure the overall sentiment of each critic’s review for the specified film. The sentiment score ranges from -1 (most negative) through 0 (neutral) to 1 (most positive). Our thresholds for “most negative” and “most positive” are scores below -0.9 and above 0.9, respectively. The figures show some of the films with the most positive and negative sentiments; overall, more films had a positive sentiment than a negative one. Films with the most negative overall sentiment include The Crucible, Mississippi Burning, and Bobby Sands: 66 Days. On the other hand, films with the most positive overall sentiment include Losing Chase, Dr. Ehrlich’s Magic Bullet, and Death Takes a Holiday.
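A minimal sketch of the scoring, using VADER’s compound score (the review_text column name is our assumption):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# VADER's 'compound' score is already normalized to [-1, 1].
df["sentiment"] = df["review_text"].apply(
    lambda text: analyzer.polarity_scores(text)["compound"])

most_positive = df[df["sentiment"] > 0.9]   # thresholds from the analysis above
most_negative = df[df["sentiment"] < -0.9]
```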

Figure 14: Sentiment Analysis of all reviews sorted by most overall negative sentiment | Figure 15: Sentiment Analysis of all reviews sorted by most overall positive sentiment

We apply the same sentiment analysis techniques to films from each period as well, starting with the Oldies. Similarly, we can see that more reviews had a positive sentiment than a negative one. Figures 16 and 17 showcase films with the most positive and negative sentiments. These include Dr. Ehrlich’s Magic Bullet, La Croisière Jaune, and Death Takes a Holiday, and How I Won the War, These Are the Damned, and Diary of a Chambermaid, respectively. Among the thresholds, there were 103 films with a sentiment score above 0.9 and 18 films with a sentiment score below -0.9.

Figure 16: Sentiment Analysis of all Oldies reviews sorted by most overall negative sentiment | Figure 17: Sentiment Analysis of all Oldies reviews sorted by most overall positive sentiment

Among the Retro group of films, there was also a larger proportion of positive reviews than negative with 171 films having a sentiment score above 0.9 and 62 films having a sentiment score below -0.9. Films with the most negative sentiment include The Crucible, Mississippi Burning, and Hell Night. Films with the most positive sentiment include Losing Chase, It’s All True, and Million Dollar Mystery. Figures 18 and 19 display other films with similar sentiments as well.

Figure 18: Sentiment Analysis of all Retro reviews sorted by most overall negative sentiment | Figure 19: Sentiment Analysis of all Retro reviews sorted by most overall positive sentiment

With the Y2K films, we can see that the difference in the ratio between very positive and very negative film reviews is a little smaller. We still have more positive than negative films, but not as many as in the previous periods with 55 films having a sentiment score above 0.9 and 25 films having a sentiment score below -0.9. In Figures 20 and 21, we can see films with the most negative sentiment include Ju-On, Shotgun Stories, and I Can See You, and films with the most positive sentiment include 13 Going On 30, Mean Girls, and America’s Heart and Soul.

Figure 20: Sentiment Analysis of all Y2K reviews sorted by most overall negative sentiment | Figure 21: Sentiment Analysis of all Y2K reviews sorted by most overall positive sentiment

Finally, for the Modern films, we again find that there were more positively rated reviews than negative ones. 150 films had a sentiment score above 0.9 and 63 films had a sentiment score below -0.9. Films with the most positive sentiment scores included Sylvie’s Love, The Incredible Jessica James, and Mulan. Films with the most negative sentiment scores included Bobby Sands: 66 Days, The Devil All the Time, and Buoyancy. Other similarly scored films are in Figures 22 and 23.

Figure 22: Sentiment Analysis of all Modern reviews sorted by most overall negative sentiment | Figure 23: Sentiment Analysis of all Modern reviews sorted by most overall positive sentiment

C. Clustering

Another technique we’ve used to analyze our dataset is clustering, particularly k-means. Clustering groups a set of data such that items in the same group are more similar to each other than to those in other groups; it finds relationships between data points, and we apply it to our text data.

With the k-means clustering algorithm, you have to choose a value k for the number of clusters for your data. One way to find a good number of clusters is the elbow method: you plot a measure of within-cluster variance against k, and the point where the curve bends like an “elbow” marks the number of clusters to select, as in the example on the left-hand side of Figure 24. We applied the elbow method to our data, shown on the right-hand side of Figure 24. There was no clear elbow, so we also did some trial and error: we tried 8, 9, and 10 clusters and felt 9 was the most accurate of them.
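A sketch of how such an elbow plot can be produced with scikit-learn, assuming TF-IDF vectors over the review texts (the vectorization choice and column name here are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the review texts (column name is an assumption).
X = TfidfVectorizer(stop_words="english").fit_transform(df["review_text"])

# Fit k-means for a range of k and record the inertia of each fit.
ks = range(2, 15)
inertias = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.title("Elbow method")
plt.show()
```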

Figure 24: Example of Use of the Elbow Method (left) and Result of Using Elbow Method (right)


To form groups from these reviews, we use topic modeling: we vectorize the reviews and extract topics using non-negative matrix factorization (NMF) from the sklearn library, an appropriate tool for topic extraction. We specify the number of topics we want (our k value), and a list of words for each topic is generated. Those results can be seen in Figures 25 and 26.
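A minimal sketch of the NMF topic extraction, with k = 9 carried over from the clustering step above (the vectorizer settings and column name are illustrative):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(df["review_text"])  # column name is an assumption

nmf = NMF(n_components=9, random_state=0)
W = nmf.fit_transform(X)   # document-topic weights
H = nmf.components_        # topic-term weights

# Print the ten highest-weighted terms for each topic.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(H):
    top = [terms[j] for j in topic.argsort()[::-1][:10]]
    print(f"Topic {i}: {', '.join(top)}")
```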

Figure 25: Output of generated words from topic modeling with a k-value | Figure 26: output of what the NMF matrix looks like

Finally, we can see some results. For all film reviews, we have our 9 topics and clusters, as seen in Figure 27. Topics include words revolving around documentaries, family, and friends, which are fairly generic. The cluster plot in Figure 28 is a little messy, but for the most part we can still make out some defined clusters corresponding to the topics on the left. Despite the small inaccuracies in the clustering, which suggest the model was not perfect, we still felt 9 clusters was best, and we went on to test the NMF model on the rest of our data filtered by the defined periods.

Figure 27: generated topics for all reviews | Figure 28: clustering plot of all reviews

For the Oldies, the topics in Figure 29 mention some nationalities or ethnicities, mostly german or french, suggesting an emphasis on films set in France or Germany or featuring French or German characters. With the token ‘war’ floating around, this may also refer to films revolving around wars, such as World War I or World War II. Additionally, the clusters in Figure 30 are mostly balanced, except for cluster 7.

Figure 29: generated topics for all Oldies reviews | Figure 30: clustering plot of all Oldies reviews

For the Retro movies, the topics in Figure 31 include children, school, and music. The clusters in Figure 32 are cleaner, for lack of a better word; rather, they’re more easily defined. Most films are in clusters 4, 7, or 8. There is an inaccuracy, however, with points from cluster 2 (green) landing near clusters 0 and 7.

Figure 31: generated topics for all Retro reviews | Figure 32: clustering plot of all Retro reviews

For the Y2K films, the topics in Figure 33 seem fairly generic, with tokens like friends, time, and life. The clustering plot in Figure 34 has fewer points, and clusters 1 and 2 could be better defined. The smaller sample size likely limits the effectiveness of the clustering algorithm and model.

Figure 33: generated topics for all Y2K reviews | Figure 34: clustering plot of all Y2K reviews

Finally, the Modern film reviews yield similar topics in Figure 35. One token that stands out here, however, is documentary, which seems more prominent in Modern times. The clustering plot in Figure 36 is somewhat problematic, with some clusters sitting near others (such as clusters 0 and 3); it does only an okay job of grouping these points.

Figure 35: generated topics for all Modern reviews | Figure 36: clustering plot of all Modern reviews

Overall, we found more positive than negative reviews in every period, with reviews in recent years showing slightly more balance between positive and negative. We also found through topic modeling that older movies featured war more prominently; all in all, however, there is not a huge shift or evolution in these topics. Additionally, we want to point out that one critic’s opinion is not the same as everyone else’s.

What We Learned

Kantida: I learned to use the NYT API to extract movie data from the New York Times, which was interesting to me. The API is useful for extracting movie reviews and other movie data. I also learned to extract web data through an API with Python, which, in my experience, saves time compared with C#. Another thing I learned from this project is similarity and clustering techniques, which help reflect the strength of the relationships between data items. Techniques such as tokenization, sentiment analysis, and clustering can be applied to future data mining and analysis.

Megan: This semester’s project proved to be a challenge. Working with textual data can be difficult because language can be interpreted and written differently by different people. Still, I was able to use some techniques from Natural Language Processing and learn what it’s like to work with this type of data. I also gained more experience accessing web data through an API, specifically the New York Times Movie Reviews API. Finally, I learned a great deal about clustering, given the number of ways you can change a clustering model and the number of methods, techniques, and algorithms people have developed to analyze data.

Limitations, Ethics, and Improvements

As far as limitations go, we’d like more information on the cast, crew, production, and budget of films, not just review content. It would also help to explore clustering algorithms beyond k-means: we focused on k-means in this course, but a different algorithm might be better suited to this data, and trying alternatives would be worthwhile.

Another limitation involved using Gephi to extract connections between MPAA ratings and critics; it didn’t work because many data rows were missing, so no connections formed between those nodes. Also, while tokenization of the summary shorts provided some insight into the popular genres and characteristics of films reviewed in each period, there appear to be some inconsistencies that need investigation. It will be important to check whether the names in the token lists refer to a critic, actor, character, director, etc.

Additionally, the number of films reviewed in each decade is not equal, which skewed the results in the first plot. We also considered running a sentiment analysis on the summary shorts, but as these are just descriptions of the movies, it likely wouldn’t have proved very insightful or useful.

The dataset itself is also fairly small. When collecting data for film reviews, we didn’t see a way to collect every film review ever written by the NYT, but we were able to collect a list of all the film critics that had written for the NYT and collect data on each of their reviews. It’s possible that we still did not capture each critic’s review and therefore do not have a comprehensive dataset of every film review ever written by the NYT.

Finally, as all NYT reviews are published publicly online on the NYT website and there is information on films that is easily searchable, we did not come across any ethical issues when it came to using or analyzing this data. You can test out the NYT Movie Reviews API here.

References

[1] Joshi, M.; Das, D.; Gimpel, K.; and Smith, N. A. 2010. Movie reviews and revenues: an experiment in text regression. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT ’10). Association for Computational Linguistics, USA, 293–296.
