Can AI predict a movie’s MPAA rating with just its summary?

Jason Salas
11 min read · Nov 5, 2019


Anyone who knows me knows that I really love movies and the movie industry, and that I'm always thinking about innovative ways to put video that people would find interesting in front of them. It's algorithmic content merchandising.

As such, I’ve always wanted to streamline the process of assigning a rating to a film. So I spent a few hours building a system that analyzes movie studios’ descriptions/plots/synopses/summaries and predicts what rating — G, PG, PG-13, R, NC-17, or Not Rated — a film would most likely receive from the Motion Picture Association of America, THE authority on dictating who should be allowed to see what.

The almighty MPAA. Bow down.

When you think about it, this is a really tricky problem, given that most (not all, as we'll soon see) movie descriptions are tactically choreographed sales drivers using crafty marketingspeak, non-uniform prose, severely terse language, and frequent semantic ambiguity. This clever promotional technique can really shift the perception of a film, both for a human audience and for an automated machine learning system. So I wanted to see if the synopsis data was stable enough to learn from and automate, to at least keep pace with how the MPAA delicately determines which movie is a little too raunchy or features one too many explosions or well-placed s-bombs.

I’ve floated this idea to several of my friends, many of whom work in tech, all smart and logical, but none in artificial intelligence circles. Nearly all told me this would never work. The info’s too short, they said. Not enough context. You can’t do deep learning with a single text feature. Can’t be modeled. Won’t converge. Not commercially viable. There’s no point. No one else has tried it before.

Challenge accepted.

This blog and associated code repository demonstrate how I developed a natural language processing framework in Python and Keras to predict a film’s MPAA rating solely from its synopsis. The system learns patterns from the vocabulary and style used to promote movies to audiences, attempting to influence people to go to the theater, rent, purchase, or stream a film.

My goal was to have a deep neural network learn the general underlying structure of these mini-narratives in order to infer where the film’s content should properly rank on the ratings spectrum.

This project is essentially an exercise in using a data science approach to assess advertising effectiveness.

No one way to skin a cat (or describe a film)

Authoring movie descriptions is a discipline that started back when studios had limited space on a movie poster, print ad, marquee or box cover. And in the more modern digital era, studios have to not only sell a production to the individual, but craftily appeal to search engine indexing and social media buzzworthiness. It’s a creative endeavor that produces wildly inconsistent data due to a multitude of approaches. Studios are not only battling for your money, they’re gunning for your attention. Whatever works in showbiz. And therein lies the problem.

There’s a lot of variance within the composition of movie summaries. Consider this broad range of approaches to make films, regardless of genre, appealing to potential audiences:

Lone Survivor (R), action/adventure

LONE SURVIVOR, starring Mark Wahlberg, tells the story of four Navy SEALs on an ill-fated covert mission to neutralize a high-level Taliban operative who are ambushed by enemy forces in the Hindu Kush region of Afghanistan. Based on The New York Times bestseller, this story of heroism, courage and survival directed by Peter Berg (Friday Night Lights) also stars Taylor Kitsch, Emile Hirsch, Ben Foster and Eric Bana. LONE SURVIVOR will be released by Universal Pictures in platform engagements on Friday, December 27, 2013, and will go wide on Friday, January 10, 2014. © Universal Pictures

Goldfinger (PG), action/adventure

To many, the quintessential Bond film and a brilliant third entry in the series. Here Bond gets his Aston Martin, spars with two statuesque British beauties and pits his wits against a memorable villain, Auric Goldfinger. Add the first Shirley Bassey theme song and some exciting action sequences and the result is an explosive cocktail.

The Assassin (NR), action/adventure

A female assassin is ordered to take out a nobleman she was previously engaged to. Directed by Hou Hsiao-hsien.

The Exorcist (R), horror

Novelist William Peter Blatty based his best-seller on the last known Catholic-sanctioned exorcism in the United States. Blatty transformed the little boy in the 1949 incident into a little girl named Regan, played by 14-year-old Linda Blair. Suddenly prone to fits and bizarre behavior, Regan proves quite a handful for her actress-mother, Chris MacNeil (played by Ellen Burstyn, although Blatty reportedly based the character on his next-door neighbor Shirley MacLaine). When Regan gets completely out of hand, Chris calls in young priest Father Karras (Jason Miller), who becomes convinced that the girl is possessed by the Devil and that they must call in an exorcist: namely, Father Merrin (Max von Sydow). His foe proves to be no run-of-the-mill demon, and both the priest and the girl suffer numerous horrors during their struggles.

See how varied those are across genres and ratings? Another thing that surprised me: movie descriptions have gotten shorter over time. That means for my purposes, there's less information from which to extract patterns. (But possibly also less noise.) In the 1930s, plot synopses were much longer (on average 350 words) than today, with current descriptions being less than half that (around 115 words).

Hypothesis

The way I figured, given the lack of standardization in how the descriptions are written, if I could train a neural net to predict a film's rating based exclusively on its synopsis and hit ~75% accuracy with a margin of error of +/- a single rating in either direction (e.g., a film predicted to be PG instead of its ground-truth rating of either G or PG-13), I'd be happy. The experiment would be a success. It can be done.

For all the naysayers I encountered along the way, this would be giving them history’s biggest middle finger.

However, this is dangerous territory for misclassification. Put in cinematic terms: erroneously labeling a film PG when it should be R is literally the difference between Police Academy 4 and Police Academy. Oof.

Architecture

Textual data carries a lot of sequential semantics and patterns, so I used a recurrent neural network (RNN) architecture with a long short-term memory (LSTM) layer to maintain state within the scope of sentences. Most interesting to me is how the system would have to learn the components that make a film at least PG-13-worthy, namely the presence of nudity, profanity, drug use, sex or violence. And it would have to learn these from text, which may in some cases be explicitly stated, and in others slyly implied.

Another aspect that makes this a tough machine learning problem is that with the description approach, each movie only ever has one official synopsis. In Stanford's Large Movie Review Dataset, reviews from IMDb exist in a one-to-many relationship, with each movie able to have a near-infinite number of reviews. Here, the model needs to learn the distribution of data that distinguishes an R-rated slasher movie from a G-rated animated family film. Halloween versus Bambi, if you will.

Tomatometer to the Rescue!

I discovered the excellent Rotten Tomatoes Movie Database, hosted on Kaggle, which has the fields I needed in a single CSV file. It boasts a healthy 29,811 films, a decent amount of data upon which to run a learning algorithm. Thanks to Ayush Kalla for putting this dataset together; his implementation used it to predict a movie's genre, another cool approach.

The dataset has an eclectic mix of genres — Action, Art & Foreign, Classics, Comedy, Documentary, Drama, Horror, Kids & Family, Mystery, Romance, SciFi. So certain approaches to cater to those niche audiences should help the learning along. I even considered using genre as a secondary input feature, but decided against it. I stuck to my plan.

One thing to note about the data scraped from Rotten Tomatoes: it differs from other online resources in that it includes a healthy amount of foreign films, notably from the Asian market. The verbiage used to promote those productions isn't necessarily the same as the US-domestic strategy, and could skew the algorithm toward a completely different rating based on regional marketing.

Success! The Rotten Tomatoes Movie Database has everything I need in a single source.

Method

Now armed with a solid data source and consistent formatting, I was able to concentrate on processing the vocabulary. Rather than focus on the intra-corpus terminology, I opted to go with a pre-trained set of weights, Stanford's GloVe word embedding vectors, trained on a corpus of 6 billion tokens. It works extremely well as a general-purpose language backend and really saves a ton on the number of parameters my system would otherwise have to spend time learning.
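For illustration, loading GloVe's plain-text format and wiring it into an embedding matrix looks roughly like this. This is a sketch: the toy 3-dimensional vectors stand in for the real 100-dimensional rows of glove.6B.100d.txt, and the hard-coded word_index dict stands in for the fitted Tokenizer's vocabulary.

```python
import io
import numpy as np

# Toy stand-in for glove.6B.100d.txt: each line is a token followed by
# its vector components, separated by spaces.
glove_txt = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "movie 0.4 0.5 0.6\n"
)

def load_glove(handle):
    """Parse GloVe's plain-text format into a {word: vector} dict."""
    index = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        index[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return index

def build_embedding_matrix(word_index, glove_index, dim):
    """Row i holds the GloVe vector for the word whose tokenizer index
    is i; out-of-vocabulary words keep a zero row."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, i in word_index.items():
        vec = glove_index.get(word)
        if vec is not None:
            matrix[i] = vec
    return matrix

glove = load_glove(glove_txt)
matrix = build_embedding_matrix({"the": 1, "movie": 2, "assassin": 3}, glove, 3)
```

The resulting matrix is handed to the Keras Embedding layer as frozen initial weights, which is what spares the model from learning those parameters itself.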

Initially, I built a very lean RNN with all layers and components at their default values; training for a modest 10 epochs achieved predictive accuracy of about 53%. This performance is statistically better than random guessing, so it's a good baseline. To improve on it, I refined the network with increased complexity (deeper layers and more neurons) and increased the number of training epochs.

I further poked around with a variety of hyperparameter tuning, including ramping up the LSTM units, upping the number of words supported by the Tokenizer, increasing the maximum length of the input word vectors, and increasing the batch size. Ultimately, my model achieved its best results, without overfitting, with the following configuration:

  • Tokenizer vocabulary: 10,000 words
  • Training batch size: 128
  • RNN LSTM layer: 128 units; dropout & recurrent dropout: 0.2
  • Output dimensions for embedding layer: 100
  • Maximum length of word vectors: 400
  • Training epochs: 35
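That configuration maps onto a compact Keras model along these lines. This is a sketch rather than the exact training script; in particular, the optimizer choice is an assumption.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

NUM_CLASSES = 6  # G, PG, PG-13, R, NC-17, NR

model = Sequential([
    # 10,000-word vocabulary, 100-dimensional embedding output
    Embedding(input_dim=10_000, output_dim=100),
    # 128 LSTM units, dropout and recurrent dropout of 0.2
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    # Softmax over the six rating classes
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Inputs are padded token-id sequences of length 400.
probs = model(np.zeros((2, 400), dtype="int32"))
```

Training would then call model.fit with a batch size of 128 for 35 epochs, per the list above.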

After training for just short of 70 minutes on Google Colab's cloud GPUs, predictive accuracy reached about 72% without severe overfitting. That's not accurate enough for a practical implementation on its own, but my goal of a margin of error of +/- one MPAA rating was generally achieved. Nice!

Still, the overall system would benefit greatly with more data.

Divergence starts to rear its ugly head sometime around the 18th epoch

Learning tended to be slow going, even with relatively simple models.

Testing on unseen data

After testing and evaluating came the fun part: actually making real predictions! When choosing summaries the net hadn't seen before, I deliberately used descriptions from IMDb, to rule out duplicate examples it might have already seen in the Rotten Tomatoes data.
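The prediction path itself is short: tokenize and pad the unseen synopsis, run the model, and take the argmax over the softmax output. A pure-Python sketch of the padding and decoding steps (the label ordering and helper names here are illustrative, not necessarily what the trained model uses):

```python
RATINGS = ["G", "PG", "PG-13", "R", "NC-17", "NR"]

def pad(sequence, max_len=400):
    """Left-pad (or truncate) a token-id sequence to a fixed length,
    mirroring what Keras pad_sequences does by default."""
    return [0] * max(0, max_len - len(sequence)) + sequence[-max_len:]

def decode(probabilities):
    """Map the model's softmax output back to an MPAA rating."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return RATINGS[best]

# With most probability mass in the third slot, decode returns "PG-13".
rating = decode([0.05, 0.1, 0.6, 0.2, 0.01, 0.04])
```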

It’s functional! But…not perfect. For example:

Scarface

“Al Pacino stars as Tony Montana an exiled Cuban criminal who goes to work for Miami drug lord Robert Loggia Montana rises to the top of Florida crime chain appropriating Loggia cokehead mistress Michelle Pfeiffer in the process Howard Hawks Marks the Spot motif in depicting the story line many murders is dispensed with in the Scarface instead we are inundated with blood by the bucketful especially in the now infamous buzz saw scene One carry over from the original Scarface is Tony Montana incestuous yearnings for his sister Gina Mary Elizabeth Mastrantonio The screenplay for the Scarface was written by Oliver Stone Hal Erickson Rovi.”

Predicted rating: PG-13

I’m not sure there’s a film critic alive (or dead) who’d be brave enough to recommend Scarface be PG-13.

Testing in the wild

I also built a browser-based front-end for people to test the prediction service outside the controlled confines of a Jupyter notebook, forking an excellent demo developed by Gilbert Tanner, which also demonstrates how to deploy machine learning models to real applications. It's a simple web app developed in Flask, invoked via a form.
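The shape of such a Flask app is tiny. In this sketch the route, form field, and helper names are illustrative, and the hard-coded prediction stands in for loading the pickled Tokenizer and the saved Keras model.

```python
from flask import Flask, request

app = Flask(__name__)

def predict_rating(synopsis):
    # Stand-in for: tokenize/pad the synopsis, call model.predict,
    # and decode the argmax. Hard-coded so the sketch runs on its own.
    return "PG-13"

@app.route("/predict", methods=["POST"])
def predict():
    # The form posts the synopsis text; the response is JSON.
    synopsis = request.form.get("synopsis", "")
    return {"rating": predict_rating(synopsis)}
```

A key design point is that the model and Tokenizer should be loaded once at app startup, not per request, since deserializing them is far slower than inference.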

The catch here was not only persisting the RNN model to disk, but also serializing the Tokenizer object, which had been seeded with the vocabulary from the training data. Python's pickle module marshals the object across sessions, processes and devices.
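The round trip looks roughly like this (note that the import path for Tokenizer varies across Keras versions; this sketch uses the tf.keras one):

```python
import pickle
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=10_000)
tokenizer.fit_on_texts(["a female assassin is ordered to take out a nobleman"])

# Persist the fitted vocabulary next to the saved model...
with open("tokenizer.pickle", "wb") as f:
    pickle.dump(tokenizer, f, protocol=pickle.HIGHEST_PROTOCOL)

# ...and restore it in the serving process, so inference tokenizes
# text exactly the way training did.
with open("tokenizer.pickle", "rb") as f:
    restored = pickle.load(f)
```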

Now, there is a gotcha to note: some slight class imbalance. The distribution of ratings is: R (10,884), NR (8,005), PG-13 (5,052), PG (4,172), G (1,606), and NC-17 (91). In practical use, one might expect predictions to side with R, but the common vernacular of summaries actually nudges them toward PG-13. So there's the bias.
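One common remedy for skew like this, sketched below rather than something this model applies, is inverse-frequency class weighting, which Keras accepts via the class_weight argument of model.fit:

```python
# Rating counts from the dataset, as reported above.
counts = {"R": 10_884, "NR": 8_005, "PG-13": 5_052,
          "PG": 4_172, "G": 1_606, "NC-17": 91}

total = sum(counts.values())

# Each class's weight is proportional to 1 / count, normalized so a
# perfectly balanced dataset would give every class a weight of 1.0.
# Rare classes (NC-17) get large weights; dominant ones (R) get small.
class_weight = {label: total / (len(counts) * n)
                for label, n in counts.items()}
```

This makes each misclassified NC-17 example cost the loss far more than a misclassified R example, counteracting the pull of the majority classes.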

Have fun with it! Improve it! Break it! Fork it!

I always find hilarity in breaking my own inventions, so I took a stroll down Amnesia Lane to when I was a music journalist waaaaaaay back, and started plugging song lyrics into the system. It's interesting to see what movie rating the system tries to force-fit onto lyrics for sappy romance jams or death metal (most of which, surprisingly, got a PG-13).

I similarly used plot text from Wikipedia, which produced interesting results: since Wikipedia plot summaries use language that's far more graphic and detailed than the IMDb version or any official mass-media marketing material ever could be, the predictions tend to "run home to mama" and side with PG-13, because the training data was never that explicit.

I’ve also tested the volume of descriptions, gauging the predictive accuracy of multi-paragraph summaries against those that are a sentence or two. The presence (or absence) of strong terms makes a bit of a difference in computing the true rating.

So…did it work?

In the final wash, when evaluating a movie's rating, it all comes down to the films themselves. What signals does the video convey that lead the MPAA to assign it a certain rating? In another domain I heavily tinker in, NCAA college football rankings, a mix of human bias and empirical evidence leads authorities toward certain expectations about the actual product. This phenomenon was nicely captured in the documentary This Film Is Not Yet Rated.

Obviously, the plot synopsis doesn’t weigh as heavily for the fine folks that actually assign a rating as the actual film itself. But while trivial, it is fun as all get out to try a new approach.

And I proved that it is, in fact, possible.

Jason Salas has been writing film critiques since childhood, worked at Blockbuster Video as a college freshman in 1992, has been a diehard HBO addict, has developed recommendation engines and machine perception platforms for video content, has hosted more technical drill-downs about Netflix’s personalization and caching infrastructure than any person should ever be subjected to, co-hosts a weekly movie podcast, and has only ever walked out of the theater three times in his life because he didn’t see the point: Pretty Woman, Footloose and Dirty Dancing. He’s also clearly proficient at authoring run-on sentences.
