Museum collections as data: comparing collection stats across four datasets in four days

Erin Canning
Apr 28, 2020 · 12 min read


The challenge: a four-day project to analyze a handful of publicly available museum collection datasets.

I wanted to see what I could do over a long weekend with data that had been made publicly available for analysis. This kind of time-limited sprint was a recommended activity when I was in grad school, a way to keep practicing skills and seeing what was out there. I hadn't done one in the two years since graduating, so it seemed like a nice time to revisit the practice.

Day 1: Scoping the Questions, and Selecting and Preparing the Data

A. Scoping the questions

Many museums have made their collections data publicly available through GitHub repositories, CSV and JSON downloads, and APIs. To begin selecting, the question had to be refined: what kinds of sources and insights am I interested in for this project? I settled on three major parameters:

1. Art museums — collections primarily of artworks, preferably by named artists
2. Data about the makers of the artworks more so than the artworks themselves
3. The selected museums had to offer similar (easily reconcilable) data in the same formats *

* This means that I would get the same kind of data in the same kind of way, creating a baseline of consistency. This approach does require going to the lowest common denominator, which means accepting that the data will likely be flattened and reduced in complexity by the time I get my hands on it. However, that is a reasonable concession to make for a quick analysis project.

B. Selecting the data

Given my parameter decisions, I went to two sources:

1. A list of the largest art museums as identified on Wikipedia. I realize that a SPARQL query to get museums by collection size might be a better way to get similar data, but this would require that museum collections be fully documented on Wikidata, which would exclude too many institutions for this to be a recommended approach.
2. @Ambrosiani's list of Museums on GitHub. Although this list hasn't been updated recently, it's an excellent quick reference, and it comes with a Wikidata query for museums listed as having GitHub accounts, which helps keep the record current! (A sketch of that kind of query follows this list.)
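For the curious, that kind of query is easy to reproduce. Here is a minimal sketch in Python against Wikidata's public SPARQL endpoint, assuming Wikidata's "museum" class (Q33506) and "GitHub username" property (P2037); the User-Agent string is a placeholder:

```
# Minimal sketch: museums on Wikidata that list a GitHub username (P2037).
# Q33506 = "museum"; P31/P279* walks instance-of/subclass-of.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?museum ?museumLabel ?github WHERE {
  ?museum wdt:P31/wdt:P279* wd:Q33506 ;
          wdt:P2037 ?github .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "museum-data-sprint/0.1 (example)"},
)
for row in response.json()["results"]["bindings"]:
    print(row["museumLabel"]["value"], "->", row["github"]["value"])
```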

I cross-referenced these to get a starting dozen possibilities:

1. MoMA (New York City, NY, USA)
2. Tate (London, UK)
3. The Met (New York City, NY, USA)
4. Art Institute of Chicago (Chicago, IL, USA)
5. Carnegie Museum of Art (Pittsburgh, PA, USA)
6. M+ (Hong Kong)
7. Williams College Museum of Art (Williamstown, MA, USA)
8. Victoria and Albert (London, UK)
9. National Gallery of Art (Washington DC, USA)
10. Museum of Fine Arts, Boston (Boston, MA, USA)
11. Minneapolis Institute of Art (Minneapolis, MN, USA)
12. SF MoMA (San Francisco, CA, USA)

I then reviewed the collections data offerings of each of these. For some of the museums, the GitHub accounts did not include collections data, which quickly eliminated them. For others, the data was not available in a format shared by enough of the other institutions. I cut the list in half and grabbed CSV files of collections data from each:

1. MoMA (New York City, NY, USA)
2. Tate (London, UK)
3. The Met (New York City, NY, USA)
4. Carnegie Museum of Art (Pittsburgh, PA, USA)
5. M+ (Hong Kong)
6. Williams College Museum of Art (Williamstown, MA, USA)

C. Break for lunch

Food is always important! Eat something. Drink some water. Go for a walk. Clear your head. It is important to prepare yourself, as well as your data, for the next step: taking a first pass through the datasets.

D. Preparing the Data: First Impressions

To get first impressions of each dataset, particularly its size and cleanliness, I skimmed them all in OpenRefine:

WCMA
· Record Count: 16,340
· CSV of collections data and artist data
· “Maker” column — most clustering suggestions for Makers were between values that differed only in whether they indicated some level of uncertainty (the presence/absence of a “?”). This to me indicates the need for a certainty/uncertainty value column. Because I don't know the level of uncertainty expressed by a question mark, I would make this a basic Boolean flag (see the sketch after these notes). It probably won't be necessary for this question, as the focus of these cases is cultures/places, not individually named persons.
· Dates have variance in formatting, much of which can also be cleaned up with clustering
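As an illustration of that Boolean flag, here is a sketch assuming a pandas DataFrame; the file name and the "Maker" column header are my stand-ins, not necessarily WCMA's actual ones:

```
# Sketch: flag uncertain makers ("?" present), then strip the marker so
# certain and uncertain spellings of a name cluster together.
import pandas as pd

wcma = pd.read_csv("wcma_collection.csv")  # hypothetical file name
wcma["maker_uncertain"] = wcma["Maker"].str.contains(r"\?", na=False)
wcma["Maker"] = wcma["Maker"].str.replace("?", "", regex=False).str.strip()
```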

Tate
· Record Count: 69,201
· CSV of collections data and artist data; independent JSON files also available for each record
· The field synonymous with WCMA's is labeled and formatted differently — here it is called “artist” and formatted as LastName, FirstName, whereas WCMA called it “Maker” and formatted it FirstName LastName. I will need to reconcile this to do comparisons (see the sketch after these notes) — this also speaks to why reconciliation between sets should be done through shared reference to a common authority, e.g. ULAN.
· Provides more object detail fields than the WCMA dataset
· No “Anonymous” or “Unknown” makers in this dataset — do they really not have any, or is this an incomplete dataset in some way?
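A sketch of that name normalization, assuming the simple two-part case (real data will have edge cases like suffixes, corporate names, and multiple commas that need care):

```
# Sketch: normalize Tate-style "LastName, FirstName" into WCMA-style
# "FirstName LastName" so the two maker fields can be compared directly.
def to_display_name(name: str) -> str:
    if "," in name:
        last, _, first = name.partition(",")
        return f"{first.strip()} {last.strip()}"
    return name.strip()

assert to_display_name("Turner, Joseph Mallord William") == \
    "Joseph Mallord William Turner"
```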

MoMA
· Record Count: 139,003
· CSV of collections data and artist data; JSON file of collections data and artist data
· ArtistBio, Nationality, BeginDate, EndDate, and Gender columns all have data contained within brackets — first thing to do is to strip the brackets
· Interesting cases of clustering where the same set of names is listed as co-creators but in different orders. Is the order indicating something valuable to maintain, or is this an artifact of flattening the data into the CSV format? If so, is there a better way to present it? For the timeframe of this project I am cleaning these up into single piles as opposed to breaking them out into individual makers (see the sketch after these notes).
· Again, we primarily see works on paper for most records: drawings, prints, photographs
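Both of those chores are quick in pandas; a sketch using the column names from the notes above (the file name and the comma delimiter for co-creators are assumptions):

```
# Sketch: strip stray brackets, then canonicalize co-creator lists so
# "A, B" and "B, A" fall into the same pile.
import pandas as pd

moma = pd.read_csv("moma_artworks.csv")  # hypothetical file name

for col in ["ArtistBio", "Nationality", "BeginDate", "EndDate", "Gender"]:
    moma[col] = moma[col].str.strip("()[]")

moma["artist_key"] = (
    moma["Artist"]
    .fillna("")
    .str.split(",")  # assumes comma-delimited co-creators
    .apply(lambda names: ", ".join(sorted(n.strip() for n in names)))
)
```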

Met
· Record Count: 474,439
· CSV of collections data
· This file is huge! 275MB, which caused some struggle
· Companies are dominating the most-collected stats

M+
· Record Count: 5,831
· CSV of collections data and artist data; independent JSON files also available
· Mix of Chinese in some fields; English translations also provided
· Several columns all have data contained within brackets — first thing to do is to strip the brackets
· Constituent numbers but not names provided — I will need to reconcile this with the artists CSV to make use of it
· A lot of colour field data provided
· Acquisition date not provided in an independent field but can be parsed out of the Credit Line field
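That parsing is a one-line regular expression; a sketch, assuming a four-digit year appears somewhere in the credit line text (the example string is hypothetical):

```
# Sketch: pull a four-digit year out of a credit line such as
# "Gift of the artist, 2016"; spot-check the real field's formatting.
import re

def acquisition_year(credit_line: str) -> int | None:
    match = re.search(r"\b(19|20)\d{2}\b", credit_line or "")
    return int(match.group()) if match else None

assert acquisition_year("Gift of the artist, 2016") == 2016
```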

Carnegie
· Record Count: 28,269
· CSV of collections data; independent JSON files also available for each record
· Name record provided in two formats: FirstName LastName and LastName, FirstName (full_name, cited_name)
· Very few name clusters found! Looks to be a very clean dataset
· Lots of unknowns, and different types of unknowns — will need to cluster for research questions

Day 1 conclusion:

I removed M+ and The Met for similar reasons: size and collections scope. M+ was considerably smaller than the others in my consideration set, while The Met was considerably larger. Additionally, the M+ set keeps maker names only in a separate artists file (easily reconcilable, but technically outside of my predefined scope), and The Met has large numbers of objects by makers identified by nationality or other geographic or cultural source terms, not names — also technically outside of my predefined scope.

Therefore, the Final Four:

1. MoMA (New York City, NY, USA)
2. Tate (London, UK)
3. Carnegie Museum of Art (Pittsburgh, PA, USA)
4. Williams College Museum of Art (Williamstown, MA, USA)

Days 2 + 3: Cleaning and Reconciling the Data

These steps take by far the most time: cleaning data is time-consuming, and reconcilers require time to run. Therefore, I had to make sure to have adequate time to clean the data, match it to the authority, and pull in any additional fields from that authority that I needed.

E. Cleaning the Data

In order to stay within my timeframe, I limited my cleaning to two field areas: makers and dates.

F. Reconciling the Data

In order to compare the datasets, I needed to be sure that each artist record is, in fact, referring to the same artist. The best way to do this is through a shared mapping to an authority record. @daviskellyk had previously informed me that the Getty now has an OpenRefine reconciliation endpoint for their vocabularies, which I was excited to try out!

I started testing the ULAN reconciliation with the WCMA dataset, as this was my first time working with the Getty's reconciliation service. This dataset already has ULAN IDs for its entries, so it allowed me to see how reliable reconciliation might be for the other datasets. There aren't other Maker attributes to bring in to help — such as nationality, birth date, death date — so this is higher on the “risk” spectrum, and therefore good for this kind of check.
Conclusion… it works great!
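Under the hood, OpenRefine is speaking the standard reconciliation-service protocol, which can also be called directly. A sketch follows; the endpoint URL is my assumption for illustration (check the Getty's documentation for the current address):

```
# Sketch of a raw reconciliation-protocol call. ENDPOINT is an assumed
# URL; OpenRefine normally handles this exchange for you.
import json
import requests

ENDPOINT = "https://services.getty.edu/vocab/reconcile/"  # assumption

queries = {"q0": {"query": "Rembrandt van Rijn"}}
response = requests.post(ENDPOINT, data={"queries": json.dumps(queries)})

for candidate in response.json()["q0"]["result"]:
    print(candidate["id"], candidate["name"], candidate.get("score"))
```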

Taking a look at the WCMA ULAN fields also allowed me to identify some variance in the formatting of values, so I cleaned those up in order to support a robust comparison.

Error message from reconciliation service

However, the Getty ULAN reconciliation service does not currently seem to support grabbing additional information based on the reconciliation, and is therefore limited. I want to pull the ULAN ID directly in as a new column, for example, and am unable to do so here — however, I can go through the Wikidata reconciler to grab this (see the sketch below)! I have to get these set up to run overnight, so the priority here is to be flexible. Again, this is a four-day data project, so I'm teaching myself to be less concerned with academic rigour and instead work with what is available to me to get this done in the four-day timeframe.
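For reference, once makers are matched to Wikidata items, the ULAN pull amounts to reading the "ULAN ID" property (P245) for each matched QID. A minimal sketch, with example QIDs standing in for a real batch:

```
# Sketch: fetch ULAN IDs (Wikidata property P245) for reconciled QIDs.
# In OpenRefine this is "add columns from reconciled values"; the SPARQL
# below is the equivalent, shown for illustration.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
qids = ["Q5582", "Q159758"]  # e.g. Vincent van Gogh, J.M.W. Turner

query = """
SELECT ?artist ?ulan WHERE {
  VALUES ?artist { %s }
  ?artist wdt:P245 ?ulan .
}
""" % " ".join(f"wd:{q}" for q in qids)

response = requests.get(
    ENDPOINT,
    params={"query": query, "format": "json"},
    headers={"User-Agent": "museum-data-sprint/0.1 (example)"},
)
for row in response.json()["results"]["bindings"]:
    print(row["artist"]["value"], "->", row["ulan"]["value"])
```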

At the end of Day 2, I have four clean-enough datasets all running overnight reconciliations to Wikidata. I like to run this service overnight as I can’t do anything with the project while it’s running, so might as well do it while I’m asleep!

On Day 3, I start by accepting the Wikidata reconciliations. Using OpenRefine's tools, I accept the best matches, review the ones with a high likelihood of matching, and set the rest as non-matches. I also use this time to review the datasets for any additional work I might need to do before using the data. I then set the datasets running pulls of ULAN IDs from the reconciled matches.

At the end of Day 3, I have the datasets pulling the ULAN IDs from their Wikidata matches, ready to go for Day 4.

Day 4: Analyzing and Visualizing the Data

Final day — starting to feel like I’m running out of time! Time to see what I can make of all this.

I threw everything into Tableau and started with each museum individually: what does the distribution of Makers in each collection look like? Who are the Top 5 (individually named) makers with the most works in each collection? When in the museum's lifetime had they been acquired, and in what sort of patterns (e.g. one major gift or continuous acquisition)?
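Outside of Tableau, the Top 5 question is a one-line aggregation; a sketch, assuming the four cleaned CSVs share a "maker" column (file and column names here are hypothetical):

```
# Sketch: Top 5 makers by work count within each museum.
import pandas as pd

frames = []
for museum, path in [("MoMA", "moma.csv"), ("Tate", "tate.csv"),
                     ("Carnegie", "carnegie.csv"), ("WCMA", "wcma.csv")]:
    df = pd.read_csv(path)
    df["museum"] = museum
    frames.append(df)
works = pd.concat(frames, ignore_index=True)

top5 = (
    works.groupby("museum")["maker"]
         .value_counts()          # sorted descending within each museum
         .groupby(level="museum")
         .head(5)
)
print(top5)
```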

Individual museum analysis (pages 2–5 of Tableau Story)

While none of these top makers were shared across the four collections' Top 5 lists, some early findings were already emerging:

Proposed Finding 1: When a museum collection has a large number of works by a single artist, it likely acquired the majority of those works in a small number of acquisitions, as opposed to through continuous collecting.

Proposed Finding 2: Broadly speaking, museums collect at a greater rate the longer they have been in existence. Generally, the number of acquisitions per year is on a roughly upwards trajectory. Some exceptions to this: Tate acquired 37,984 works in 1856, by far the most it has ever acquired in any single year — 37,711 of these works comprise a massive donation of J.M.W. Turner artworks — and MoMA's biggest collecting year to date was 1964.

Next was to compare across the four museums — what do their collecting patterns look like side by side?

Running total accessions comparison (page 6 of Tableau Story)

Proposed Finding 3: All four museums experienced heightened acquisition activity during the 25-year period 1960–1985. This seems to be a turning point in acquisition practices, as all four start to experience marked increases in collecting during this window. I am curious as to whether this is a generalizable finding, and if so, what conditions during this timeframe led to it.

Accession hotspot comparison (page 7 of Tableau Story)

Proposed Finding 4: Museums go through a short burst of intense acquisition activity that roughly corresponds with this same 1960–1985 time period. All four museums acquired 25% of their collections within 10–20 years — a very short span of time relative to their total lifetimes.
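For anyone reproducing this, "the shortest window holding 25% of the collection" is a simple sliding-window computation over the sorted acquisition years; a sketch, assuming one non-empty list of acquisition years per museum:

```
# Sketch: the shortest span of years in which 25% of a museum's current
# collection was acquired. `years` holds one acquisition year per object.
def shortest_quarter_span(years: list[int]) -> tuple[int, int]:
    years = sorted(years)
    window = -(-len(years) // 4)  # ceil of 25% of the objects
    best = None
    for i in range(len(years) - window + 1):
        span = (years[i], years[i + window - 1])
        if best is None or span[1] - span[0] < best[1] - best[0]:
            best = span
    return best

# e.g. shortest_quarter_span(acquisition_years) might return (1964, 1978)
```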

Lastly, to compare across the four museums — are many of the same artists represented in multiple collections?

Comparison of common makers across collections (page 8 of Tableau Story)

Proposed Finding 5: Amongst these four collections, the large majority of artists can be said to be unique to a single institution. However, this statement relies on the accurate mapping of the artists to ULAN IDs, and I am not fully confident in the mappings. While I feel confident that this trend would hold up to a more rigorous comparison, I did not spend enough time on matching, searching, cleaning, and re-matching to say that this is a wholly accurate claim.

Comparison of common makers across collections (page 9 of Tableau Story)

Conclusion and final thoughts

With a few days and a couple of tools, I was able to pull out some interesting findings. I am most curious about that notable uptick in acquisition practices starting at roughly the same time for each of these museums. A deeper dive would be to next (a) see who was being collected during, or starting in, this time period in comparison to before; and (b) see what external circumstances may have led to it. Was there a change in tax law regarding object donations, maybe? Or some other incentive? Did one museum start, and the others follow suit?

Some final findings to wrap this up:

Interesting Tidbit: Of the Top 5 Makers (by count) for each museum, only one (and in one case, two) of these artists is represented in all four collections:
· Andy Warhol (Top 5: Williams College Museum of Art)
· W. Eugene Smith (Top 5: Carnegie Museum of Art)
· Joseph Beuys (Top 5: Tate)
· Jean Dubuffet (Top 5: Museum of Modern Art)
· Lee Friedlander (Top 5: Museum of Modern Art)

Unsurprising Tidbit: Of the Top 5 Makers (by count) for each museum, all are men except for one:
· Louise Bourgeois (Top 5: Museum of Modern Art)

Final, final thoughts

And, in conclusion — what if we were to ask Wikidata these same sorts of questions? Again, a question for another day, but I find it interesting that Wikidata's view of the most-represented artists differs significantly from mine for the Williams College Museum of Art and the Carnegie Museum of Art, somewhat less so for the Museum of Modern Art, and is quite accurate for the Tate.

Solution to this discrepancy: document more collections on Wikidata!

Museum of Modern Art on Wikidata

Museum of Modern Art on Wikidata Top 10 Makers (link to SPARQL query)

Tate on Wikidata

Tate on Wikidata Top 10 Makers (link to SPARQL query)

Carnegie Museum of Art on Wikidata

Carnegie Museum of Art on Wikidata Top 10 Makers (link to SPARQL query)

Williams College Museum of Art on Wikidata

Williams College Museum of Art on Wikidata Top 10 Makers (link to SPARQL query)
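The linked queries all share one shape. Here is a hedged sketch of the MoMA version, assuming Wikidata's "collection" (P195) and "creator" (P170) properties and MoMA's item Q188740:

```
# Sketch: the ten creators with the most works whose collection (P195)
# is the Museum of Modern Art (Q188740) on Wikidata.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?creator ?creatorLabel (COUNT(?work) AS ?works) WHERE {
  ?work wdt:P195 wd:Q188740 ;
        wdt:P170 ?creator .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?creator ?creatorLabel
ORDER BY DESC(?works)
LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "museum-data-sprint/0.1 (example)"},
)
for row in response.json()["results"]["bindings"]:
    print(row["creatorLabel"]["value"], row["works"]["value"])
```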
