Using MIT’s Places365 to Track Location Shooting in American Cinema — The Sample

Amos Stailey-Young
May 10, 2022


  • This post is part of a series that uses MIT’s Places365 dataset to analyze location shooting in Hollywood cinema. The full list can be found here. Code can be found on my GitHub.

In my previous post, I discussed how we can use MIT’s Places365 dataset to track the “exterior ratio” of film frames, a metric that we can use as a proxy for the frequency of location shooting in American cinema. But to achieve this, we first need to create a sample of films to analyze.

If we want to draw any conclusions about the exterior ratio in American cinema, our sample must adequately represent the population that we wish to investigate. Because the focus of my dissertation was on location shooting during the studio system, I limited my historical range to the years between 1915 (roughly when the emerging “studio system” began producing feature-length films) and 1970 (roughly when the “studio system” ended). A representative sample would consist, then, of American feature films produced during this time frame. The goal is to find an exterior ratio for every year so that we can track how it changed over time.

There are two main challenges to building this sample. First, we have to find sources for digital files of Hollywood feature films. Over the years, I have personally collected numerous video files, but the 55 years that we are examining (1915 to 1970) is a long time. Even if I had a sample of 500 digital files, that would be only about 9 films per year. Since Hollywood produced between 200 and 500 films per year, those 9 films would account for roughly 2–5% of the entire output for that year. With such small samples, the yearly averages would be highly sensitive to outliers. Creating a more robust sample required finding sources of films beyond those that I already had.

The best way to create a representative sample would be to start with the entire population and then select films randomly, thus ensuring that the sample looks like the population from which it derives. The essential problem with this method, however, is that not all films produced during the time frame in question are actually accessible, either because of obscurity, destruction, or copyright, so one (enormous) constraint on our sample is availability. Even if a film might be technically available, it may cost either too much money or too much time to acquire. The sampling method I used is generally referred to as “convenience sampling” because observations are selected according to the ease of acquiring them rather than through a systematic strategy. We have to be very careful about introducing bias into our sample when using such a method.

Certain types of films may be more likely to be available to us than others. For instance, big-budget, “prestige” pictures like Gone with the Wind (1939) are far less likely to become “lost” than the B westerns produced by “Poverty Row” in the 1930s. On the opposite end, public domain movies are more likely to be available because they are not copyrighted. Differences in availability would not matter if the films did not also differ in how often they were shot on location, but scholars often implicitly assume that location shooting is more common in higher-budget films than in their lower-budget counterparts. If our sample included a disproportionate number of prestige films, then this could skew the results of our analysis. I will discuss ways to address these problems soon, but first, I want to discuss how I acquired the films.

Gathering the Films

The first place to start in our search for films is in the public domain. Fortunately, The Internet Archive has a section for “Feature Films,” which contains roughly 15,000 entries. While there is no simple way to download all of these films, it is possible to write a Python script that can automatically do so.
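As a rough illustration of what such a script might look like, here is a minimal sketch that uses only the Python standard library and The Internet Archive's public search, metadata, and download URLs. It is not the script from my GitHub. The collection identifier `feature_films`, the `year` range query, and the choice to grab the first `.mp4` file in each item are all assumptions made for the example, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://archive.org/advancedsearch.php"
# Assumption: "feature_films" is the identifier of the Feature Films collection.
COLLECTION = "feature_films"


def build_search_url(start_year, end_year, rows=100, page=1):
    """Build a search URL for collection items dated within a year range."""
    # Assumption: items expose a searchable "year" field.
    query = f"collection:{COLLECTION} AND year:[{start_year} TO {end_year}]"
    params = [
        ("q", query),
        ("fl[]", "identifier"),
        ("fl[]", "title"),
        ("fl[]", "year"),
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def pick_video_file(files):
    """Return the name of the first .mp4 file in an item's file list, or None."""
    for entry in files:
        name = entry.get("name", "")
        if name.lower().endswith(".mp4"):
            return name
    return None


def download_url(identifier, filename):
    """Build the direct download URL for one file of one item."""
    return f"https://archive.org/download/{identifier}/{urllib.parse.quote(filename)}"


def download_item(identifier, destination):
    """Download an item's first .mp4 file; return False if it has none."""
    metadata = fetch_json(f"https://archive.org/metadata/{identifier}")
    filename = pick_video_file(metadata.get("files", []))
    if filename is None:
        return False
    urllib.request.urlretrieve(download_url(identifier, filename), destination)
    return True
```

In use, one would page through `fetch_json(build_search_url(1915, 1970))["response"]["docs"]` and call `download_item` on each `identifier`, ideally with a pause between requests so as not to hammer the server.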

The biggest problem with downloading the films is connecting the metadata to the video files, and if we don’t have that metadata, we really can’t do anything worthwhile with the files. While the information on the films is generally correct when it is present, many entries have missing fields. Further, even on The Internet Archive, the metadata is not always in a consistent format, and naming conventions are suspect. Simply having a digital video file is pointless without having at least a year attached to it. The most reliable way of attaching metadata to these video files is to connect them to an existing database, such as IMDb (which I have linked here). Many of The Internet Archive entries have an associated IMDb link, which contains the ID in the URL, but matching a movie by title and date to its IMDb ID can also be accomplished relatively easily.
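Pulling the ID out of an IMDb link is the easy part. Here is a small sketch, assuming the ID follows the familiar `tt` plus digits pattern and that the link appears somewhere in a free-text field such as an entry's description. The function name is hypothetical.

```python
import re

# IMDb title IDs look like "tt" followed by at least seven digits.
IMDB_ID_PATTERN = re.compile(r"tt\d{7,}")


def extract_imdb_id(text):
    """Return the first IMDb title ID found in a string, or None."""
    if not text:
        return None
    match = IMDB_ID_PATTERN.search(text)
    return match.group(0) if match else None


print(extract_imdb_id("See https://www.imdb.com/title/tt0031381/ for details"))
# prints: tt0031381
```

Entries where this returns `None` are the ones that need the fallback of matching by title and date.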

My other primary source of films is YouTube. Many are available despite their copyright status, and there is a handy tool called youtube-dl that easily allows users to download a single video or an entire playlist. Getting films from YouTube presents many of the same problems as downloading from The Internet Archive, primarily that we do not have reliable metadata. Thankfully, the Python package called Cinemagoer (previously, IMDbPY) can link a film to an IMDb ID, which we can then use to get metadata from IMDb. Although we can match a film solely by title, Cinemagoer is prone to naming errors and conflicts if we don’t have the accompanying year. Having an IMDb ID means each entry in our sample has a unique identifier, which makes removing duplicates much simpler.
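To show why the year matters, here is a hedged sketch of the matching and de-duplication steps. `search_imdb` wraps Cinemagoer's `search_movie` call, while `pick_match`, `normalize_title`, and `drop_duplicates` are hypothetical helpers written for this example, not part of Cinemagoer. The one-year tolerance is an assumption, there to absorb small disagreements between a file's listed date and IMDb's release year.

```python
import re


def normalize_title(title):
    """Lowercase a title and strip punctuation and extra spaces."""
    return " ".join(re.sub(r"[^a-z0-9 ]", "", title.lower()).split())


def pick_match(candidates, title, year, tolerance=1):
    """Return the IMDb ID of the first candidate matching title and year."""
    wanted = normalize_title(title)
    for candidate in candidates:
        if normalize_title(candidate["title"]) != wanted:
            continue
        if candidate["year"] is None or abs(candidate["year"] - year) > tolerance:
            continue
        return candidate["imdb_id"]
    return None


def search_imdb(title):
    """Query IMDb through Cinemagoer (requires `pip install cinemagoer`)."""
    from imdb import Cinemagoer  # imported lazily so the helpers work without it

    results = Cinemagoer().search_movie(title)
    return [
        {"imdb_id": "tt" + movie.movieID,
         "title": movie.get("title", ""),
         "year": movie.get("year")}
        for movie in results
    ]


def drop_duplicates(films):
    """Keep the first film seen for each IMDb ID."""
    seen = set()
    unique = []
    for film in films:
        if film["imdb_id"] in seen:
            continue
        seen.add(film["imdb_id"])
        unique.append(film)
    return unique


# Two films share this title; the IDs below are made-up placeholders.
candidates = [
    {"imdb_id": "tt0000001", "title": "Stagecoach", "year": 1939},
    {"imdb_id": "tt0000002", "title": "Stagecoach", "year": 1966},
]
print(pick_match(candidates, "Stagecoach", 1966))
# prints: tt0000002
```

With only the title, the search would have no way to choose between the two candidates. With the year, the choice is unambiguous.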

Comparing the Sample to the Population

In statistics, we use a “sample” to make inferences about a larger “population.” In this case, the “sample” consists of the films we collected, and the “population” constitutes every single film produced during the time frame under scrutiny. As mentioned above, in order to make claims about the population, we need a sample that adequately represents that population.

What factors might influence our analysis? Certain genres, such as the western, are more associated with exterior scenes than are others, like the musical, so a sample with a similar proportion of films for each genre would be more accurate than one with a dissimilar distribution of genres — all else being equal. Below is a figure comparing the proportion of films by genre.

There are a few genres with sizeable discrepancies between the sample and the population, but for a convenience sample like the one I assembled, I am fairly happy with the results. What other categories might we examine? In The Genius of the System, his foundational book on the classical Hollywood studio system, film scholar Thomas Schatz discusses the “house style” of specific movie studios: the idea that individual film companies developed consistent strategies to differentiate their product from that of their competitors. The exterior ratio could be affected by the house style of an individual movie studio. So, another feature to look at is the proportion of films by their production company, as shown below.

Although there are some slight differences — Twentieth Century-Fox and United Artists are overrepresented for some reason — the distributions are nevertheless quite similar. Moreover, this discrepancy only matters if there are substantial differences in the exterior ratio between major studios, something of which I am skeptical but will explore more minutely in a later post.
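The genre and studio comparisons both reduce to the same calculation: find each category's share of the sample, find its share of the population, and look at the gap. Here is a minimal sketch of that calculation with toy labels. The function names are hypothetical and the data are invented purely to show the mechanics, not drawn from my sample.

```python
from collections import Counter


def proportions(labels):
    """Map each category to its share of the labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def proportion_gaps(sample_labels, population_labels):
    """Sample share minus population share per category, largest gap first."""
    sample = proportions(sample_labels)
    population = proportions(population_labels)
    gaps = {
        label: sample.get(label, 0.0) - population.get(label, 0.0)
        for label in set(sample) | set(population)
    }
    return dict(sorted(gaps.items(), key=lambda item: abs(item[1]), reverse=True))


# Toy data: westerns are overrepresented in this invented sample.
toy_sample = ["western", "western", "western", "musical"]
toy_population = ["western", "western", "musical", "musical"]
gaps = proportion_gaps(toy_sample, toy_population)
print(gaps["western"], gaps["musical"])
# prints: 0.25 -0.25
```

A positive gap means the category is overrepresented in the sample, and a negative gap means it is underrepresented. The same function works for production companies, aspect ratio, or color, since it only needs one label per film.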

Earlier, I said that our “population” consists of all films produced during our given time frame. We can actually parse this further, however, and distinguish between the study population and the target population. So to be more precise, “all films produced during the given historical period” constitute the target population whereas all the films that could possibly be included in the sample (i.e., those with publicly available digital files) represent the study population. As discussed above, it is very important for us, then, to note how — and why — the study population may differ from the target population.

During the time period in question, the American film industry was bifurcated into two markets: “As” and “Bs.” The companies termed the “majors,” which encompass the eight studios in the plot above, were responsible for creating A pictures while the “Poverty Row” studios were largely responsible for the B films that would play on the bottom half of so-called “double-bills” (the major studios did also make some B films, though Poverty Row never made A films, with a handful of exceptions). Unlike the “majors,” the companies comprising Poverty Row were constantly changing, except for the few stalwarts who operated for decades: Republic Pictures, Monogram, and Producers Releasing Corporation (also known by its derisive nickname, “Pretty Rotten Crap”). Because the great majority of these companies went belly-up in only a few years, many of the B films produced from the 1930s through the 1950s no longer survive. Moreover, some of these films likely have no IMDb entry since records of their production, distribution, and reception no longer exist. In other words, there are a number of “ghost films” out there that, while technically part of our target population, remain completely hidden from us.

Fortunately, the tendency of B films to belong to the public domain offsets their obscurity. The two opposing tendencies roughly balance out, with the public domain effect slightly stronger, so our sample actually has a slightly greater proportion of non-major films than the population does. The difference, however, is quite minor.

There are two other factors that could affect how well the sample represents the population: aspect ratio and color. Film scholars have identified both elements as potential causes for increased location shooting after World War II. But we must place both factors in their proper historical context, which I will elaborate on in subsequent posts. The American film industry did not widely adopt widescreen until the 1950s, well into the second half of our time frame. Hollywood first used three-strip Technicolor in a feature film with Becky Sharp (1935), but the technology was proprietary, owned by Technicolor, making color film production challenging to scale. For instance, Technicolor had a limited number of cameras, which they loaned to the studios for individual productions. The 1950s also saw the introduction of Eastmancolor, which made scaling much easier because ordinary 35mm cameras could use the film. We have to be careful about how imbalances in the proportion of widescreen and color films might skew our results. Below are the distributions for both aspect ratio and color.

Again, for a convenience sample, these ratios are pretty close to each other. While there are differences between the sample and the population in certain features of the data, nothing stands out as being particularly problematic. My next post will explore the results of our analysis in depth.


Amos Stailey-Young

I work at the intersection between cultural history and data science, developing new analytical methods and strategies for use in the Digital Humanities.