FAQ for the “Film Dialogue, By Gender” Project

Apr 10, 2016

I’ve been receiving a lot of questions about the project, almost to the point of “death by a thousand nit-picks.” Here’s an FAQ using real responses from this project.

“You just cherry-picked films that fit your agenda. This isn’t even a properly created sample.”

First, I’d challenge anyone to answer what a proper sample would look like (more about this later on).

Second, here are the main sources for scripts, from which we downloaded every script we could find:

That accounts for about 99% of our “sample,” or about 8,000 scripts. I’d argue that this is essentially every easy-to-find, publicly available screenplay on the Internet. When it comes to sample bias, there is only one way it could happen: publicly available screenplays would have to skew toward male-dialogue-driven films.

It also means that forming a sample of 2,000 screenplays from public Internet sources without any gender-dialogue skew would be impossible. I’ll write that again: I’d challenge anyone to find 1,000 scripts on public screenplay sites that over-index female in dialogue.

From there, we had to eliminate scripts that didn’t parse or didn’t match the IMDB cast very well.

“You’re missing these characters, WTF!” or “X Movie has women with lines and it says 100%, WTF!”

A few things could have happened: 1) the parsing process erroneously excluded the character, 2) the script did not have the character but the film did, or 3) this is a minor character that was excluded from the study.

“Wait…you’re excluding minor characters????”

Yes. And by minor character, I mean fewer than 100 words of dialogue. Matching characters to an IMDB page is a very error-prone process. Usually it’s smooth sailing for major characters, but not so much for minor ones. For example, a minor character might be the “hooker” (sic) from the Trading Places script, who utters a grand total of ~10 words. Technically she’s in the film, but roles like these are poorly labeled on the IMDB cast list. On the cast page, they might call her “lady on street.” And to identify gender, we use a person’s IMDB page. So without a reliable match, we would need pronouns from the script, or we would have to watch the film.

We were willing to exclude these from the analysis because the spirit of the entire project was measuring dialogue by gender across a large sample of films. Major characters typically have 2,000 to 5,000 words of dialogue. We are excluding characters with fewer than 100. So from an overall stats perspective, we’re introducing a very small amount of error. In the case of Trading Places, its stat would move from 95% male dialogue to maybe 94%, assuming that the other remaining minor characters wouldn’t skew it back the other way (which, in most cases, they probably would).
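To make the arithmetic concrete, here is a minimal sketch of how a 100-word cutoff changes a film's male-dialogue share. The cast list and word counts are made up for illustration (they are not from our dataset), and `male_share` is a hypothetical helper, not the code we actually ran.

```python
MIN_WORDS = 100  # the cutoff described above

def male_share(characters, min_words=0):
    """Return the fraction of dialogue words spoken by male characters.

    characters: list of (name, gender, word_count) tuples.
    min_words: characters with fewer words than this are dropped.
    """
    kept = [(gender, words) for _, gender, words in characters
            if words >= min_words]
    total = sum(words for _, words in kept)
    if total == 0:
        raise ValueError("no dialogue left after filtering")
    male = sum(words for gender, words in kept if gender == "m")
    return male / total

# Hypothetical cast: three major roles and three bit parts.
cast = [
    ("Lead A", "m", 4000),
    ("Lead B", "m", 3500),
    ("Lead C", "f", 500),
    ("Bit part 1", "f", 10),
    ("Bit part 2", "m", 40),
    ("Bit part 3", "f", 60),
]

with_minor = male_share(cast)                # every character counted
without_minor = male_share(cast, MIN_WORDS)  # major characters only

print(f"all characters: {with_minor:.1%}")
print(f"major only:     {without_minor:.1%}")
```

With numbers like these, the two shares differ by less than one percentage point, because the bit parts contribute so few words relative to the leads.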

“Ok. But I still don’t see lines for character X.”

Please send any obvious omissions to us at matt@polygraph.cool or tweet at me @matthew_daniels. This is a massive dataset and we’re very aware that there are missing characters from the scripts. That said, we’re still confident that a 90% accurate dataset produces the same general message as a 100% accurate dataset. You’d also have to be confident that the 10% of errors would not have the same characteristics as the remaining 90% in order for the results to shift.
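One way to reason about that claim is as a simple mixture. Assume (and this is my simplification, not a measured property of the dataset) that a fraction `error_fraction` of all dialogue words is missing or mislabeled, that the observed male share on the rest is `observed_share`, and that the affected words have their own male share, `error_share`. The numbers below are placeholders, not project results.

```python
def true_share(observed_share, error_fraction, error_share):
    """Blend the share seen in the clean data with the share in the bad data.

    All three arguments are fractions between 0 and 1.
    """
    for value in (observed_share, error_fraction, error_share):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs must be between 0 and 1")
    clean_fraction = 1.0 - error_fraction
    return clean_fraction * observed_share + error_fraction * error_share

OBSERVED = 0.75   # placeholder male share in the 90% we trust
ERRORS = 0.10     # placeholder: 10% of words missing or mislabeled

same_mix = true_share(OBSERVED, ERRORS, OBSERVED)  # errors look like the rest
all_female = true_share(OBSERVED, ERRORS, 0.0)     # every error hides female lines
all_male = true_share(OBSERVED, ERRORS, 1.0)       # every error hides male lines

print(f"errors resemble the rest: {same_mix:.3f}")
print(f"errors are all female:    {all_female:.3f}")
print(f"errors are all male:      {all_male:.3f}")
```

If the errors look like everything else, the overall share does not move at all. It only shifts when the bad 10% is systematically different, and even the extreme cases move it by at most a tenth of the total.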

That all said, we’re still fixing errors. We survived a front-page mention on Reddit, 4,000 comments, and thousands of tweets. Over a million people have visited the project, and we’ve fixed any errors that they’ve mentioned…and it’s basically the same results :)

“Wait, but let’s talk about gender. How do you know the monster in Monsters Inc. is a boy!”

We don’t. As a rule of thumb, we used the actor/actress’s IMDB info to identify gender. So in some cases, this creates inaccuracies. Sometimes, women voice male characters. Bart Simpson, for example, is voiced by a woman. We’re aware that this means some of the data is wrong, AND we’re still fine with the methodology and approach.

“This study is bullshit because you didn’t…form a representative sample.”

True. At first, we wanted to normalize the sample by using only films in the top 1,000 by box office. But we quickly realized that it would be impossible to get a screenplay for each of these films.

So we went big instead: visualize the data for every screenplay that we could find. This means that the sample could be skewed by the availability of screenplays on the Internet OR we could have nefariously cherry-picked screenplays to prove an agenda.

That all said, I also think that it’s hard to actually find a proper sample. What’s representative of Hollywood? Big budget or small? What genres? Should it be weighted by box office?

These are hard questions, so we decided to visualize everything and open source the data. Plus, we visualized only films in the top 2,500 by box-office (for which we have ~ one-third) — a quasi-normalized set. If people can find a way to normalize a smaller sample, by all means.

“This study is bullshit because you didn’t…have a hypothesis.”

This was an exercise in data gathering. We don’t need to follow a perfectly structured academic study because…

  1. This is the Internet. Not academia.
  2. We’re publishing on a .cool domain, not an MIT journal.

Perhaps you feel that anything on the Internet should pass the academic sniff test. In that case, feel free to ignore this whole project.

In no way have we ever represented this project as proof of sexism. That said, I do think that it’s a pretty heavy signal for gender imbalance…but even then, the conclusions in the article are merely presenting the stats.

Personally, I was hoping that the stats could invalidate many of the over-hyped claims about gender imbalance. But that wasn’t the point. The point was to collect data and open source it.

“This study is bullshit because you didn’t…consider on-screen time or the context of how a character is portrayed.”

Indeed! We didn’t consider that. Which is why we released the data and didn’t say “Research proves sexism.” Personally, I do think that it sheds some light on gender representation in film, but it’s one light: dialogue. There’s also screen time. There’s also the softer side of the role of a character in a plot. These are all valid points. But in no way should they undermine an exercise in data collection that adds a significant piece to the discourse surrounding representation in Hollywood.

I’d also add that screen time is almost impossible to measure because you’d need humans. And humans are far more error-prone than software that counts words in a screenplay.

And on that note…

“Screenplays are bullshit because what really matters is the exact dialogue in the film.”

I’d actually challenge that assumption: shouldn’t the screenplay, the thrust for creating a film, matter more than its final product? The screenplay informs everything: casting, directors, marketing, budget…

That said, if we used final film dialogue, we’d have the same problem as measuring screen time: human error. We’d have to watch every film and have a human count the number of words spoken by each character. In terms of verifying results and identifying errors, this approach would be a nightmare. We have to assume that screenplays, across a large sample, are more or less representative of the final product. For the results to be different, there’d have to be systematic differences (e.g., characters cut in post-production) that over-index male.
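By contrast, counting words from a screenplay can be automated. Here is a rough sketch of the idea, not our actual parser. It assumes a simplified layout where a character cue is a line in capitals (optionally followed by something like "(O.S.)"), the speech runs until the next blank line, and lines wrapped in parentheses are stage directions. Real scripts are far messier, which is exactly why some of them failed to parse. The sample scene and the names `CUE`, `count_words`, and `SAMPLE` are invented for illustration.

```python
import re
from collections import Counter

# Assumed cue format: capitals, spaces, dots, apostrophes, or hyphens,
# optionally followed by a parenthetical such as (V.O.) or (CONT'D).
CUE = re.compile(r"^([A-Z][A-Z .'\-]*?)\s*(\(.*\))?$")

def count_words(script_text):
    """Count dialogue words per character in a simplified screenplay."""
    counts = Counter()
    speaker = None
    for raw in script_text.splitlines():
        line = raw.strip()
        if not line:
            speaker = None  # a blank line ends the current speech
            continue
        if speaker is None:
            match = CUE.match(line)
            if match:
                speaker = match.group(1).strip()
            continue  # action lines outside a speech are ignored
        if line.startswith("(") and line.endswith(")"):
            continue  # skip parentheticals like (quietly)
        counts[speaker] += len(line.split())
    return counts

# Invented scene, not taken from any real script.
SAMPLE = """INT. OFFICE - DAY

ALICE
Did you read the report?

BOB (O.S.)
(quietly)
Not yet. I will tonight.

Alice walks out.
"""

counts = count_words(SAMPLE)
for name, words in counts.items():
    print(name, words)
```

Note that the scene heading also looks like a cue under this rule. It never shows up in the counts only because no dialogue follows it before the blank line, which is the kind of edge case that makes matching parsed names against a cast list error-prone.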

“Wait but you didn’t even form a model, check for statistical significance, look at standard deviation, etc.”

Again, this isn’t about presenting a finding. We’re collecting data, open sourcing it, and allowing other people to draw their own conclusions. If people use this data to prove that gender representation concerns are misplaced, awesome.