A Deep Data Dive Into MIFF’s Doco Shorts from 2004–2019

William He
Published in The Startup
13 min read · Feb 1, 2021


I’ll be up front — it’s been a while since I’ve done any data visualisation, and I’m worried I might be getting rusty, so I thought I’d ease myself back into it with some data I had on hand. With that in mind, here’s a deep dive into all the documentary shorts that screened at MIFF between 2004 and 2019 — complete with some fun graphs, because why not! (Animated versions viewable here.)

Why am I doing this? Well, in the past two years that I’ve gone to MIFF’s doco shorts programmes, there’s been little to no Australian presence — in fact, there was only one Australian film in 2019, and none in 2020. Many of my friends and lecturers hold this opinion as well, and particularly like to voice what a shame it is, given how difficult it’s been for Australian documentary directors to get themselves some clout-bearing laurels to launch their careers.

It’s easy to fall into this trap of an opinion; after all, why else wouldn’t I be raking in the big bucks whilst putting together the pitch document for my next feature documentary about something deeply personal/a sports-based controversy/a fairly niche political interest with a human twist? Of course, I could just be “not particularly noteworthy”, as an external assessor once described me, or “difficult to understand and kind of boring”, as opined by my dad.

But of course, to approach the topic more objectively, and to supply concrete evidence supporting my dad’s opinion of me, I thought it best to use my Mathematical Economics and Applied Statistics background to answer this question quantitatively!

So let’s have a look, shall we?

The Question(s)

The big question to answer here is: Are there fewer and fewer Australian doco shorts in MIFF’s programming over time? Using a bit of the ol’ regression analysis, we can pretty easily answer this, controlling for other factors as well (like language, runtime, festival director, et cetera).

But why stop there? There’s potential to answer plenty of other stuff too! One thing that could be particularly interesting is seeing if we can predict a winner based on readily available data (all that stuff like runtime, director, language — the factors listed earlier that you can easily look up for each film). So, the second question we can ask is: Are we able to predict a winner based on available film data? Same deal here — we can use a bit of fun logistic regression to model the probability of winning given changes in the different factors being examined (you might be more likely to win best doco short if your film runs longer, as an example).

The Data

The data I’m using is manually gathered data from the MIFF archives between 2004–2019 — you can view it here, if you’d like. The data was manually gathered because the MIFF archives are terrible. You might think “oh, it’s an archive, surely if it’s an archive then it’s consistently codified and you’d be able to scrape the data using some fun python code or something” but no — about half the time, films might not even be labelled as documentaries, or short films, or as any kind of film distinct from other things programmed into the festival (sometimes festival talks were coded exactly the same as films, for instance). Really begs the question as to why the archive exists, since it’s clearly unfit for actual research purposes. Maybe you’re trying to find one particular film? Good luck trying to find it if you don’t remember the exact title and/or director (although even director names were sometimes missing, so who even knows).

What that means for me, and the ensuing analysis then, is that I can’t treat any of this as census data (or data that completely accounts for an entire population), since there is absolutely every likelihood that I’ve missed some entries because of how poorly codified this data is. Instead, this is survey data (or data that samples a portion of the whole population). Not a big problem, but it’s worth noting because it changes how I’m supposed to deal with this data.

So I had to enter the data manually. I chose 2004–2019 since 2004 was the first year I could easily identify a specific doco shorts program, and 2019 was the last normal year for the festival (2020 was weird, for obvious reasons). It’s also worth noting that I am only human, so I only compiled a fairly small, somewhat manageable range of data so that I could still get to sleep at a healthy time of night. I’ve also probably messed up one or two of these entries because, as Hannah Montana puts it, nobody’s perfect — but oh well, blame MIFF.

Exploring the data

Here we go — some actual data vis now! Let’s start by taking a look at runtime:

What we see here is that clearly, most MIFF doco shorts run between 10–15 minutes, but a few run all the way up to 40 minutes. Average runtime is a bit over 14 minutes, as reflected in the graph on the left. On the right, looking at the boxplots, we see a slight upward trend in runtime from 2004–2006, before stabilising at the prevailing overall average of about 15 minutes.
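As a quick illustration of that mean-versus-spread picture, here’s a minimal sketch of summarising runtimes. The sample values below are invented to mirror the shape described (mostly 10–15 minutes, with a long tail of outliers), not real MIFF data:

```python
import statistics

# Hypothetical runtimes (minutes) mirroring the shape described:
# most films run 10-15 minutes, with a few long outliers.
runtimes = [11, 12, 13, 14, 15, 10, 12, 13, 15, 14, 11, 25, 40]

mean = statistics.mean(runtimes)
median = statistics.median(runtimes)

# The few long outliers pull the mean above the median.
print(f"mean: {mean:.1f} min, median: {median:.1f} min, max: {max(runtimes)} min")
```

With the real data you’d load the full archive into a dataframe instead, but the same mean-versus-median comparison tells you whether a handful of long films is dragging the average up.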

This reflects some of the tidbits of wisdom I’ve been told about shorts programming at film festivals — it’s a lot easier to program shorter films, because they’ll fit more neatly into a two or three hour package. That, and you want to fit a decent number of films into the program — if all shorts were 40 minutes, you’d get to watch about four films in a program, rather than the 10-ish we usually expect.

So far nothing surprising or out of the ordinary. Let’s move on to IMDb ratings:

Here we see that IMDb ratings are skewed to the left (that is, most ratings cluster at the high end of the scale, with a tail trailing off towards lower scores). The mean and median IMDb ratings are just a bit above seven, which, according to this guy on Quora, makes MIFF’s programming about one point higher than all films on average (and it’s also statistically significant! which is important! but I won’t explain why or how because it’ll get boring!)
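For the curious, left (negative) skew can be checked straight from the sample moments. This is a toy illustration with invented ratings, not the actual dataset:

```python
import statistics

# Invented ratings clustered at the high end, with a tail of low scores.
ratings = [7.2, 7.4, 7.1, 7.5, 7.3, 6.9, 7.0, 6.2, 5.1, 4.0]

mean = statistics.mean(ratings)
median = statistics.median(ratings)
sd = statistics.pstdev(ratings)

# Third standardised moment: negative for a left-skewed sample.
skew = sum((x - mean) ** 3 for x in ratings) / (len(ratings) * sd ** 3)
print(f"skew: {skew:.2f}, mean: {mean:.2f}, median: {median:.2f}")
```

The tail of low scores drags the mean below the median — the classic signature of a left-skewed distribution.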

There’s not really much to say about trends for IMDb ratings across time — but I’ll just point out 2011 for being a bit of a stinker apparently, with an average at around 6.5 (which is still around the same rating as the average movie, to be fair).

Moving on to languages spoken in MIFF doco shorts, things start to get a bit interesting, and arguably a bit racist:

What seems to have happened here is, well, that MIFF seems to disproportionately program English-language documentary shorts, compared to pretty much every other language. What could explain this? Not sure, but I wouldn’t be surprised if it had something to do with the fairly consistently Anglo-leaning names of festival directors (apart from Al Cossar, who only has one year of festival programming present in this dataset).

The other thing worth noting at this point, though, is how sparse the data is on this particular bit of info — sometimes IMDb wouldn’t even have the language spoken listed for the film (hence the huge number of NAs in the barplot, which I decided to include so that everyone can see what I was dealing with), or it would contradict what MIFF had listed in their archive. I would not be surprised if there was some discrepancy between the collected data and reality, but that discrepancy would have to be pretty big given how massive English’s lead is on all the other languages’ presence within MIFF doco shorts’ programming. This is made particularly clear when plotting all other languages against just English and NA values:

Even assuming that all missing values were non-English or had no dialogue, it’s pretty obvious here that English language films are abnormally dominant, particularly for a film festival with “International” in its name.

Given the breakdown of film selections by country of origin, this little Anglo-centric narrative continues to reinforce itself (but also we get to start looking at Australia specifically! Finally!):

Firstly it’s worth noting that Australian doco shorts are clearly programmed the most out of all countries present in the data, by a long shot. The other thing clearly worth mentioning is the next two most common programme selections, country-wise: the USA and the UK. By quite the margin, the majority of short documentaries being programmed by MIFF between 2004 and 2019 are from English-speaking countries, which backs up the narrative highlighted earlier when looking at film languages.

Something you wanna tell us, MIFF? Should we have a quick talk about what the word “International” means?

(To be fair, the UK and US are still technically international, but boy, that’s a lot of English-speaking white people getting programmed into an international film festival)

Looking more directly at the proportion of Australian doco shorts programmed across time, I decided to group this data by festival director. Doco shorts programmes are quite small year on year, and this isn’t necessarily the most robust dataset either, so I compromised on year-by-year comparisons in favour of another way of logically grouping the data chronologically:

Firstly, I just want to reiterate the Anglo-centric narrative we’ve been establishing for the past three or four graphs — Al Cossar’s the only non-Anglo name here, and I wouldn’t be surprised if we looked back further still and the names of the festival directors prior to this dataset were all some permutation of “John Smith”.

Secondly, we see a rough downwards trend in the percentage of Australian films being programmed across time. We start with James Hewison (2004–2006), whose doco shorts programmes were 35.7% Australian, and slowly trend down to Al Cossar, whose doco shorts programme was only 22.2% Australian (although he is only responsible for one year in the data: 2019).

While we’re here, we can look at all the winners too (or at least all of the data I could find on which films won in which year):

From this subset (I wasn’t sure how to visualise this), there are three things worth noting — 50% of these winners are Australian, the majority are in English, and interestingly, the average runtime of a winning documentary short is statistically significantly higher than the overall average (18.57 minutes compared to 14.14).
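That runtime gap (18.57 versus 14.14 minutes) is the kind of thing a two-sample t-test checks. A minimal sketch using Welch’s t statistic on invented numbers — the real comparison would of course use the actual winner and non-winner runtimes:

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))

# Hypothetical runtimes (minutes): winners vs. the rest of the programme.
winners = [18, 20, 17, 19, 21, 16]
others = [13, 14, 12, 15, 14, 13, 15, 14]

t = welch_t(winners, others)
print(f"t = {t:.2f}")  # a large |t| suggests the gap isn't just noise
```

In practice you’d get the p-value from a library routine rather than computing it by hand, but the statistic itself is just the mean difference scaled by the pooled standard error.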

Anyway, everything that I just said is bullshit and worthless unless it’s statistically significant, given that this is pretty much just survey data, so let’s actually do some real shit and answer some questions!

The models and their results, interpreted

Here comes the exciting part for me, and the boring part for you: low-res screenshots of my regression model summaries!

First, let’s take a more statistically robust look at percentages of Australian doco shorts programmed over time:

To the statistically literate: this is a logistic regression model with the variable “Australian” coded as a binary categorical variable, and “MIFF.Festival.Director” also coded as a categorical variable. No other control variables were included in the model because no logical link between other variables and “Australian-ness” could be established, except for a binary “English/Non-English” variable, which was found to be confounding in earlier models, and also not statistically significant. What we see here is that none of the beta coefficients associated with the levels of “MIFF.Festival.Director” statistically significantly shift the intercept, given such high p-values for each coefficient. Earlier models run on the ordinal variable “Year” reached similar conclusions. So, at the 95% confidence level, we are unable to reject the null hypothesis that there is no difference between festival directors in regards to their programming of Australian documentary shorts.
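For anyone wondering what “coded as a categorical variable” means in practice: each non-baseline festival director becomes a 0/1 indicator column, with one level absorbed into the intercept. A small sketch with made-up rows (the field names here are illustrative, not the dataset’s actual columns):

```python
# Made-up film records; the real dataset has one row per programmed short.
films = [
    {"director": "James Hewison", "australian": True},
    {"director": "Richard Moore", "australian": False},
    {"director": "Michelle Carey", "australian": True},
    {"director": "Al Cossar", "australian": False},
]

directors = sorted({f["director"] for f in films})
baseline = directors[0]  # this level is absorbed into the intercept

def dummy_row(film):
    """Response (1 = Australian) plus one 0/1 indicator per non-baseline director."""
    row = {"australian": int(film["australian"])}
    for d in directors:
        if d != baseline:
            row[f"director[{d}]"] = int(film["director"] == d)
    return row

design = [dummy_row(f) for f in films]
```

A library like statsmodels does this dummy coding automatically from a model formula, but the design matrix it actually fits on looks like the rows above — which is why each director gets their own beta coefficient and p-value in the summary.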

To the statistically illiterate: yeah, there aren’t fewer Australian films being programmed across time.

So it looks like my friends, my lecturers, and I are wrong — there aren’t actually proportionally fewer Australian films being programmed over time.

What IS interesting, though, is when we take a look at our second question: are we able to predict a winner based on readily available information?

Let’s skip the smartassery of my last regression interpretation — the asterisk next to the row of values for “Australian” here is essentially saying that yes, the result for this particular variable is statistically significant, and the fact that the number in the second column next to “Australian” is positive means the probability of winning the doco shorts prize is higher if your film is Australian. In fact, with a little bit of maths, we can work out that the probability of winning increases from a bit under 8% to a little over 32% if the selected film is Australian. So there you go — not only is MIFF’s doco shorts programming not decreasing its Australian intake, but a doco short is also more likely to win if it’s Australian. What a fun little find!
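The “little bit of maths” is pushing the fitted log-odds through the logistic function. The coefficient values below are back-calculated from the reported 8% and 32% probabilities, so treat them as illustrative rather than the model’s exact output:

```python
import math

def logistic(x):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-x))

# Illustrative coefficients, back-calculated from the reported probabilities.
intercept = -2.44       # baseline log-odds of winning (non-Australian film)
beta_australian = 1.69  # shift in log-odds for an Australian film

p_non_australian = logistic(intercept)
p_australian = logistic(intercept + beta_australian)

print(f"non-Australian: {p_non_australian:.0%}, Australian: {p_australian:.0%}")
```

This is the standard interpretation step for any logistic regression: the coefficient lives on the log-odds scale, so you add it to the intercept and transform back to get a probability.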

Conclusions and discussions

Wow, what a thrill ride — look at all that data (and I visualised some of it too!) After all that, we’ve discovered that MIFF’s doco shorts programming isn’t becoming less and less Australian. Not only that, but you’re actually more likely to win if your film is programmed and is Australian (but there are things to be discussed about that, which I’ll get to shortly).

What might be fuelling my friends’ and my perception of a decreasing Australian presence is how confusing some of the programming is — essentially, after looking quite closely at the MIFF archives (for an excruciating few days, mind you), it’s become clear that not all doco shorts are included in the doco shorts programme; they are instead spread across the festival in random places, either tied to a separate feature presentation or in another category entirely. Being placed elsewhere doesn’t discount a film’s eligibility to win the doco shorts prize, though — Lost Rambos won in 2019 and wasn’t included in the doco shorts programme. The Australian doco shorts that are programmed into the festival are likely just put elsewhere in the festival, and not in the doco shorts package.

One thing worth noting about the analysis as well is that this is a pretty small dataset to be running something like a regression model on — there were only 176 entries in the dataset to begin with, only 14 recorded winners, and on top of that, the model omitted 83 entries because they were incomplete, so it was really only using 93 datapoints. What this meant was that I really stretched the data to its limits when I was doing these regression analyses, and there are potentially a lot of confounding variables out there that I haven’t been able to look at, because otherwise I’d destroy the model.

One good example of this is including an interaction variable between festival director and Australian films — essentially, checking to see if specific festival directors were more likely to pick an Australian film as a doco short winner compared to others. This would have been interesting and worth a look, given that all festival directors apart from Michelle Carey gave the doco shorts prize to Australian films exclusively (Carey only awarded it to one Australian film during the eight years she was festival director) — but there wasn’t enough data to do that kind of analysis without breaking the model. At the end of the day though, given that the amount of data available to make this model is so limited, and there were still a fair few control variables added to the model, it’s almost impressive that Australian films were still found to be statistically significantly more likely to win than other films (it’s a lot harder to find statistically significant results with smaller pools of data, because you’re likely to have much higher variance).

(Also it’s worth noting that this is based on the presupposition that festival directors play some kind of a role in determining what film wins the doco short prize — I have no idea how judging and prize-giving works at MIFF, so this supposition could also be completely wrong.)

Also, just to reiterate — the archive is horrendous. Here is an archival entry that MIFF has made for the 2004 documentary short, “Oil”.

Thanks MIFF.

It’s also worth noting that there may be many other factors not being considered here that would need to be analysed with data that isn’t publicly available. For example, we could ask the question: how likely is it for an Australian doco short to be programmed if submitted for selection? To do that kind of stuff, we’d need access to the metadata for every single submission to MIFF’s doco shorts programme, as well as whether or not each one got selected. For obvious privacy-related reasons, that data is not immediately available, although if it were, it’d be a great way to examine how Anglophilic MIFF actually is (the disproportionate number of English-language films could have to do with a disproportionate number of English-language submissions, for example, but we’ll never know).

All in all, we’ve determined at the very least that MIFF’s doco shorts programmes aren’t becoming less and less Australian, and that my friends and I are wrong in our intuition. I guess I’ll have to find another excuse as to why I can’t make a decent living off my niche political feature documentaries with a human twist.




Just wanna be a data journalist, writer, video journalist, documentarian, economist, and professional tennis player - is that too much to ask?