How We Learned to Stop Worrying and Love Open Data:
A Case Study in the Harvard Art Museums’ API

Co-authored by Andrea Ledesma & Leah Burgin

As it has moved from card catalogues to in-house databases to open data, information about objects in museum collections has become increasingly accessible. Recently, for the first time in the history of cultural institutions, museums are sharing access to their entire collections, sometimes hundreds of thousands of cultural objects, through open data platforms. These platforms make collections information available, often through application programming interfaces (APIs), for free download by anyone with access to the Internet.

This unprecedented level of access to an institution’s objects (or “content” in open data terminology) and information about those objects (aka “metadata”) — ranging from records of acquisition to dimensions — has raised questions in the realm of cultural institutions. Among these concerns, we found that the most prevalent questions include:

  • What types of projects can take advantage of open data access to museum collections?
  • Who is interested in accessing open data for museum collections, and how does “tech-pertise” factor in?
  • What are the risks and benefits to consider when deciding to make an institution’s data available through an open data platform?

While these questions only scratch the surface of the conversations among advocates for open data and cultural institutions, we found that considering them informed our own project with the Harvard Art Museums (HAM). Working with the institution’s API, we collected a small data sample from its extensive art collection to see what we could learn about the public’s engagement with the museum’s objects and the museum’s collecting patterns and history. As this project was the first time we had worked with APIs, we were interested in learning more about other open data-sourced projects, who else is using APIs, and why institutions are keen (or not so keen) on sharing their collections information through open data.

Open Data Projects

A number of institutions have shared their collections data through the use of APIs. In our research, we found inspiring projects that used these, ranging from a variety of visualizations based on the Museum of Modern Art’s (MoMA) collection to interactive modules like the Random Button Television from the Cooper Hewitt Smithsonian Design Museum to fully developed apps from the 2012 Open Culture Data App Competition in the Netherlands: MuseApp, Histagram, and SimMuseum, in which users can, respectively, create “mash-ups” of famous art pieces, send digital postcards based on historic images, and play curator in a virtual art museum.

The San Francisco Museum of Modern Art (SFMOMA) recently hosted Art + Data Day, during which they invited tech experts to explore a beta version of its collections data API. Bailey Hogarty and her colleagues report on the incredible projects born from this collaborative endeavor. Participants divided into several teams, producing a visual display for photographs from the 1970s (Team Pixelmaster); a platform for contextualizing artworks within the wider worlds of politics, economics, weather and sports statistics, etc. (Team Context); a compilation of selfies taken with the collection, pulled from social media (Team Selfie); and an Artwork Sentiment graph, which compared the titles of artworks to the emotions they evoke (created remotely by John Higgins, SFMOMA’s lead software architect).

We were impressed by the existing open data projects circulating in the realm of cultural institutions; however, as is probably clear from the previous examples, most of these projects required a high level of tech-pertise. This observation led us to ask: exactly who is using (and who intends to use) collections-based open data platforms?

The People

Since open data is, in theory, accessible to anyone with Internet access, cultural institutions have to consider how to format, organize, and develop their open data platforms to reach their intended audiences — whether it’s the noobs, the tech savvy, or anyone in between.

These decisions have tradeoffs. We observed that more robust data demands a more complex API, which in turn demands more developed technical skills from potential public users. For example, APIs often return data formatted in JavaScript Object Notation, or JSON. This format is machine readable and can easily move between disparate computer systems. To the untrained eye, however, it’s nearly incomprehensible. Fortunately, there are a number of programs and plug-ins that can translate JSON into more readable forms. Some museums, like the Museum of Modern Art, have saved their users effort by releasing collections data in a downloadable spreadsheet.

Negotiating what open data platforms could look like will be challenging for institutions, as “the public” is an expansive category that encompasses individuals with widely varying degrees of technical expertise. How can an institution make open data relevant and accessible to these groups, while also inspiring experts to create high-caliber projects that reflect the institution’s mission?

Hogarty and her colleagues from SFMOMA commented on this reality, stating:

Ultimately, APIs are not intended as human interfaces. Without an inherent readability, they will likely fail to become tools for exciting interactions designed with humans as the end user. The Collection API, therefore, must be both human readable and machine processable.

But if this is the case — and raw collections data must be simplified to be useful in an open data format — how “open” is that format? What information about the collection is omitted in the name of simplicity, and how are institutions making and communicating those decisions? Conversely, if data isn’t simplified (and therefore only accessible to those with high levels of tech-pertise), is it really “open”?

The answers to these questions seem to come from individual institutions. For example, looking back at its own work with open data, the Cooper Hewitt Smithsonian Design Museum has proposed a modest yet pragmatic plan to restructure its data in a way that facilitates public interaction and appreciation on all levels:

  • Make our databases more user-friendly for content creation, storage, and repurposing (eg, enable easy editing, importing, and exporting/recording/translating)
  • Add fields (‘buckets’) so there’s more flexibility to extract appropriate content for sufficiently similar contexts
  • Create easy-to-follow guidelines to support quick content creation and repurposing.

Whichever decisions institutions make in sharing their data, we found that tradeoffs exist between the openness of the data and the openness to the public, and institutions are finding a balance that fits their missions.

The Pitfalls and Potential

While there may be an “open data for open data’s sake” argument to be made, ultimately, if collections-based open data is made available, participatory projects (the good, the bad, and the ugly) will result. Institutions are carefully considering to what degree they are willing to share authority as part of their open data policy and what their long-term plans are for fostering public engagement with their open data.

In general, we found that most of the conversations surrounding open data seem to hinge on the risks and benefits institutions assess when determining if and how to publicize collections-based open data. Lotte Belice Baltussen and her colleagues neatly summarized these risks (Figure 1) and benefits (Figure 2). According to two groups of Dutch cultural institution professionals — participating in either an Open Culture Data Masterclass or “Boss of your own metadata” Workshop in 2012 — the top five risks include: loss of attribution, loss of control, loss of potential income, loss of brand value, and privacy (in slightly different rankings).

Loss of control is, to us, perhaps the most interesting risk on the list. In a dawning age of the participatory museum, institutions are grappling with issues of shared authority. In her essay in Letting Go? Sharing Historical Authority in a User-Generated World, Nina Simon discusses the possibilities that emerge when museums share their authority and employ participatory models for public engagement. While open data is not mentioned in this essay, Simon does make the distinction between institutions that are “‘about’ something or ‘for’ someone” and institutions that are “created and managed ‘with’ visitors.” Institutions seem to be debating the nuances of those prepositions — about, for, with — when building their open data platforms.

Interestingly, the top five potential benefits of open data differed substantively between the two participating groups. Both listed public mission (number one) and data enrichment (number two), but then diverged: the Open Culture Data Masterclass ranked increasing channels to end users (three), increasing relevance (four), and new customers (five), while the “Boss of your own metadata” Workshop ranked discoverability, increasing channels to end users, and increasing relevance, respectively.

In contrast to the institution-centric risks, these potential benefits focus on what open data can mean for an institution’s public engagement, visitation, and audience growth.

While these rankings result from a small sample size, they speak to the tensions between institutions and the public when navigating issues surrounding open data, especially in terms of what potential benefits open data can have for an institution.

Figure 1: Risks of open data (image by JAM/Europeana. CC BY, http://creativecommons.org/licenses/by/3.0)
Figure 2: Benefits of open data (image by JAM/Europeana. CC BY, http://creativecommons.org/licenses/by/3.0)

Case Study: The Harvard Art Museums

To further explore these issues of shared authority and audiences, we decided to get our hands dirty with open data. Recalling our initial curiosity about how open data stands to change the landscape between cultural institutions and the public, we sought to mine a museum’s collections data for a better understanding of both the collection and the museum itself.

Given MoMA and the Cooper Hewitt’s vibrant and active work with APIs, we were first drawn to their data. After reviewing these institutions’ APIs, however, we found that, as open as the data was, it was in some ways incomplete. In documenting their objects, these institutions focused on some fields of information (accession year, creator, department, medium, etc.) but not others. Records related to visitation, publication, and exhibition were largely absent or incomplete. For this research project, we were most interested in what collections data could reveal about public engagement with objects and, as such, we were asking questions that these data sets couldn’t answer, at least to our satisfaction. Our initial questions included:

  1. How has the museum’s collection grown over time?
  2. What is the relationship between regions represented in the collection and regions represented in museum exhibitions?
  3. How well researched are the objects?
  4. How frequently (and via which modes) does the public engage with the collection?

We found that the HAM API’s data was a good fit for answering these questions.

Managing the API: Gathering & Cleaning Data

There are over 200,000 objects in the HAM, and for the past two years, this collection has been available to the public as JSON-formatted data. As with MoMA and the Cooper Hewitt, the HAM hosts a GitHub page providing users with how-to guides and examples of previous projects. We found the HAM’s data particularly useful because of its focus on public access. In addition to the standard metadata fields, the API also includes metadata related to exhibition and publication history, object verification level, and web page views.

We began our case study by finding a workable data set within the larger collection. With limited computing capabilities and time, working with the entire collection was not feasible. We needed a manageable collection that spoke to our research questions and interests. With these considerations in mind, we chose to work with the HAM’s collection of 14th century sculptures. There are 94 objects within the collection.

Next, we collected our data. The HAM API enables users to collect data according to an impressive number of facets, including:

  • Object
  • Person
  • Exhibition
  • Publication
  • Gallery
  • Spectrum
  • Classification
  • Century
  • Color
  • Culture
  • Group
  • Medium
  • Period
  • Place
  • Technique
  • Worktype
  • Activity
  • Site

Users can query the API directly from a web browser by formatting a URL with the desired data parameters. A key, available by application, is required to access the HAM API. We pulled data related to the 14th century sculptures according to the “Object” method, using the following URL (key omitted):

http://api.harvardartmuseums.org/object?classification=Sculpture&century=14th%20century&apikey=XXXXX&size=100

The call then returns the JSON-formatted data directly to the browser.
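The same request can also be assembled in a script rather than typed by hand. The following is a minimal Python sketch, assuming only the endpoint and parameters shown in the URL above. The function names `build_object_url` and `fetch_records` are illustrative, and the idea that the response carries its objects in a `records` list is an assumption that should be checked against the API documentation.

```python
import json
from urllib.parse import urlencode, quote
from urllib.request import urlopen

# Endpoint taken from the URL shown above.
BASE_URL = "http://api.harvardartmuseums.org/object"


def build_object_url(api_key, classification="Sculpture",
                     century="14th century", size=100):
    """Assemble the query URL from the parameters used in the article."""
    params = {
        "classification": classification,
        "century": century,
        "apikey": api_key,
        "size": size,
    }
    # quote_via=quote encodes spaces as %20 (the default would use "+").
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


def fetch_records(url):
    """Download the JSON response and return its list of object records.

    Assumption: the response is a JSON object with a "records" list.
    """
    with urlopen(url) as response:
        payload = json.load(response)
    return payload.get("records", [])


# "XXXXX" stands in for a real key, as in the article.
url = build_object_url("XXXXX")
print(url)
```

With a valid key, calling `fetch_records(url)` would return the object records as a list of Python dictionaries; the sketch does not make that call itself, so it runs without network access.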

A snapshot of the HAM API data for 14th century objects

As evidenced in the screenshot above, we needed to clean our data before we could glean anything from it. This meant structuring and standardizing it into a format that both we and our computers could read. Admittedly, we found this to be the most difficult (yet essential) phase of our case study. As two graduate students in the Public Humanities, we have limited technical experience. Working with the API thus provided not only a thought exercise in collections data and collaboration, but also a basic lesson in data analytics. In the end, we cleaned our data with the following method:

  1. Uploading the JSON to OpenRefine (a free, open-source program for editing, manipulating, and analyzing data) to create a spreadsheet from the object records.
  2. Using OpenRefine to correct formatting issues, e.g. column names, typos, special characters, and white space.
  3. Downloading the OpenRefine project as an Excel spreadsheet.
  4. Opening this file in Excel to further refine the data. In particular, we used Excel to parse dates associated with accession year, record updates, and page views. We also defined the data type (date, general, text, or number) of each field.
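For readers who prefer a script to a spreadsheet tool, the cleaning steps above can be roughly approximated in Python. This is a sketch of the general idea, not the method we used: the field names (`title`, `culture`, `accessionyear`, `totalpageviews`) are assumptions about the API's output, and the sample records are made-up placeholders rather than real HAM data.

```python
import csv
import io

# Assumed field names, chosen for illustration; check the API
# documentation for the actual ones.
KEEP_FIELDS = ["title", "culture", "accessionyear", "totalpageviews"]


def clean_value(value):
    """Collapse stray white space in strings; turn missing values into ''."""
    if value is None:
        return ""
    if isinstance(value, str):
        return " ".join(value.split())
    return value


def records_to_csv(records, fields=KEEP_FIELDS):
    """Write only the chosen fields of each record to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow({f: clean_value(record.get(f)) for f in fields})
    return buffer.getvalue()


# Made-up placeholder records, not real HAM data.
sample = [
    {"title": "  Example   Object A ", "culture": "French",
     "accessionyear": 1943, "totalpageviews": 12, "colors": ["#000000"]},
    {"title": "Example Object B", "culture": None,
     "accessionyear": 1969, "totalpageviews": 3},
]

csv_text = records_to_csv(sample)
print(csv_text)
```

Listing the fields to keep, rather than the fields to drop, means that unrelated columns never reach the spreadsheet in the first place.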

We’ve since learned that a number of web browser plugins can streamline this process. Postman allows users to “build, test, and document APIs” directly in the browser, while JSONView rearranges JSON data into a more human-readable format.
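The kind of reformatting a JSON viewer performs can be imitated with Python's standard `json` module. The snippet below is only an illustration of pretty-printing, using a made-up placeholder record; it does not reproduce any particular plugin.

```python
import json

# Made-up placeholder response, compressed onto one line the way
# raw API output often appears in a browser.
raw = '{"records":[{"title":"Example Object","century":"14th century"}]}'

# Parse the text, then write it back out with indentation.
pretty = json.dumps(json.loads(raw), indent=2, ensure_ascii=False)
print(pretty)
```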

We deleted a number of columns included in the original records from the final spreadsheet. Fields for color spectrums, contact information, and image links cluttered our data and were ultimately unrelated to our larger inquiry into public engagement and institutional practice.

With the data refined from raw JSON to an Excel spreadsheet, we pulled it into Tableau, proprietary software used to analyze and visualize large data sets. The desktop version of Tableau is available free to students. Though there was a learning curve, Tableau provides training videos and other resources for learning the program.

So, what did these 94 objects have to say about the HAM?

Four Questions: Visualizing the HAM’s 14th Century Sculpture Collection

When we discussed the API with Jeff Steward, Director of Digital Structure and Emerging Technology at the HAM, he expressed his interest in how collections data can be a “tool for understanding a particular [museum]” or, in other words, “a wonderful gateway into gleaning what their perspective is and what their angle is.” Echoing Steward, we used the HAM’s open data to ask: “What can you observe in the API that actually starts to articulate what is important for that particular institution?” Put differently, how do the objects in the collection inform institutional practices, especially public engagement?

In an attempt to answer this question, we asked four collections-based questions of the open data. We would like to note that, even though the collections data is open, the HAM’s GitHub page only provides so much detail on what each field means in terms of museum operations. For example, what does it mean for an object to be “published”? Published in a museum catalogue, scholarly work, blog, etc.? With this limitation in mind, we visualized the open data related to 14th century sculptures, analyzed what we learned, and, instead of answering our initial question, used our results as inspiration for further questions about what’s important to the HAM and why.

#1: How has the museum’s collection of 14th century sculptures grown over time?

Accession Timeline

According to the data, the museum’s current assemblage developed from early acquisitions of a “Fragment Of A Figurine Representing An Underworld Spirit” and a “Bodhisattva Ksitigarbha (jizôbosatsu) Standing” in 1919. The collecting pattern thereafter varied, with curators most actively acquiring objects in:

  • 1943: 23 objects acquired
  • 1949 and 1969: 9 objects acquired
  • 1962: 5 objects acquired

Since the 1960s, the collection has remained relatively stagnant, with curators adding one to three objects every handful of years. The most recent sculpture was accessioned in 2006.
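A tally like the one behind this timeline can be produced with a few lines of Python. The sketch below assumes each record stores its accession year in a field named `accessionyear` (an assumption about the API's output) and uses made-up placeholder records, so the printed counts are not the HAM's.

```python
from collections import Counter


def acquisitions_per_year(records, year_field="accessionyear"):
    """Count how many records were accessioned in each year.

    Records with a missing or empty year are skipped.
    """
    years = [r[year_field] for r in records
             if r.get(year_field) is not None]
    return Counter(years)


# Made-up placeholder records, not real HAM data.
sample = [
    {"accessionyear": 1919},
    {"accessionyear": 1919},
    {"accessionyear": 1943},
    {"accessionyear": None},  # year not recorded
    {},                       # field absent altogether
]

counts = acquisitions_per_year(sample)
for year, n in sorted(counts.items()):
    print(year, n)
```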

This data inspires further questions, such as:

  • What economic/social/cultural factors could have motivated curators to collect sculptures so actively during those peak years?
  • Does the museum have any plans to further develop the collection?
  • It seems nothing has ever been deaccessioned. Why is that? What would that process look like should it occur? How would this be reflected in the data, if at all?

#2: What is the relationship between regions represented in the collection and regions represented in museum exhibitions?

Maps

What does the world look like from the perspective of these 94 objects? We mapped these cultures in Tableau by translating each of the 24 “culture” metadata values into a geographic region that could be read and plotted as a location. As an example, all objects labelled “French” were placed within the borders of contemporary France. We should note that not every object could be traced to a specific region. One sculpture recorded as an object of “Pre-Columbian” culture was not included in the maps and calculations above.
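The translation from culture labels to mappable regions amounts to a lookup table. The Python sketch below shows the idea with three entries drawn from labels mentioned in this article; the full table of 24 cultures is not reproduced here, and the name `CULTURE_TO_COUNTRY` is illustrative. Labels with no clear modern equivalent return `None`, so they can be set aside rather than guessed at.

```python
# Partial, illustrative lookup table: only three of the culture labels
# are shown. A real table would need an entry for every mappable label.
CULTURE_TO_COUNTRY = {
    "French": "France",
    "Thai": "Thailand",
    "Japanese": "Japan",
}


def to_country(culture):
    """Return the modern country for a culture label, or None if unmapped."""
    if not culture:
        return None
    return CULTURE_TO_COUNTRY.get(culture.strip())


for label in ["French", "Pre-Columbian", None]:
    print(label, "->", to_country(label))
```

How to treat tentative labels such as “French?” is a judgment call; this sketch leaves them unmapped unless they are added to the table explicitly.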

The HAM provides a continuous count of the number of times an object has been on exhibit; however, this number does not inform us of how long that exhibit lasted. Nevertheless, combining these two statistics on the map, we could infer which cultures were most visible both within the collection and in exhibitions. The museum has collected the most objects from:

  • France: 18 objects
  • Thailand: 14 objects
  • Japan: 13 objects

With regard to exhibitions, however, the most frequently exhibited objects hail from:

  • France: on exhibit 11 times
  • Mexico: on exhibit 7 times
  • Korea: on exhibit 5 times
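Both rankings above come from grouping the records by culture and then either counting objects or summing exhibition counts. A minimal Python sketch of that grouping follows, assuming fields named `culture` and `exhibitioncount` (assumptions about the API's output) and using made-up placeholder records.

```python
from collections import defaultdict


def summarize_by_culture(records):
    """Return {culture: {"objects": n, "exhibitions": total}}."""
    summary = defaultdict(lambda: {"objects": 0, "exhibitions": 0})
    for record in records:
        culture = record.get("culture") or "Unknown"
        summary[culture]["objects"] += 1
        # A missing exhibition count is treated as zero (an assumption).
        summary[culture]["exhibitions"] += record.get("exhibitioncount") or 0
    return dict(summary)


# Made-up placeholder records, not real HAM data.
sample = [
    {"culture": "French", "exhibitioncount": 2},
    {"culture": "French", "exhibitioncount": 0},
    {"culture": "Thai", "exhibitioncount": None},
]

summary = summarize_by_culture(sample)

# Rank cultures by number of objects, largest first.
by_objects = sorted(summary.items(),
                    key=lambda item: item[1]["objects"], reverse=True)
for culture, stats in by_objects:
    print(culture, stats["objects"], stats["exhibitions"])
```

As noted above, an exhibition count says nothing about how long each exhibition lasted, so the totals measure frequency rather than time on view.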

This data inspires further questions, such as:

  • Why the emphasis on collecting and exhibiting objects from these particular regions?
  • For those objects/regions that are underrepresented, are there plans to expand the scope of the collection? If not, then what is the goal of this collection — breadth or depth?
  • Why were these objects/regions chosen for exhibition?

#3: How well-researched are the objects?

Verification Levels

Verification levels designate how extensively the HAM has researched and vetted the metadata of an object. The HAM places each object on a verification level scale, which runs from 0 to 4:

  • 0: Unchecked. Object information has not been verified for completeness and has not been vetted
  • 1: Poor. Object information is likely incomplete and has not been vetted
  • 2: Adequate. Object is adequately described but information may not be vetted
  • 3: Good. Object is well described and information is vetted
  • 4: Best. Object is extensively researched, well described and information is vetted

About 33% of the 14th century sculptures are “unchecked,” while 36% have been “extensively researched…[and] vetted.” Digging deeper into these numbers, it appears that the museum has concentrated its research efforts within this collection on the French objects. All of them, including the sculpture tentatively labeled “French?”, are ranked at the highest verification level. Little has been done to corroborate the metadata of Japanese, Thai, and Vietnamese sculptures. A majority of the objects in each of these groups have a verification level of 0.
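Percentages like these are each level's share of all records. The Python sketch below shows the calculation, assuming the level is stored in a field named `verificationlevel` (an assumption about the API's output). The sample records are made-up placeholders, so the resulting shares are not the HAM's figures.

```python
from collections import Counter


def level_shares(records, field="verificationlevel"):
    """Return {level: percent of records}, rounded to one decimal place.

    Records without a verification level are left out of the total.
    """
    levels = Counter(r[field] for r in records if r.get(field) is not None)
    total = sum(levels.values())
    if total == 0:
        return {}
    return {level: round(100 * n / total, 1)
            for level, n in sorted(levels.items())}


# Made-up placeholder records, not real HAM data.
sample = [{"verificationlevel": v} for v in (0, 0, 4, 4, 4, 2)]

shares = level_shares(sample)
for level, percent in shares.items():
    print(f"level {level}: {percent}%")
```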

This data inspires further questions, such as:

  • What motivates curatorial research patterns? What does this tell us about staffing at the Museum?
  • What decisions were made in creating this verification process and structure? What other verification levels could be used to describe the data?
  • What can be done to improve the quality of these “dark records,” as Steward calls them?

#4: How frequently (and via which modes) does the public engage with the collection?

Access

For each object in its collection, the HAM tracks frequency of exhibition, publication, and online views. While the definitions of “exhibition” and “publication” are unclear, in general, we can see that some objects, like “Mirror Case With a Noble Couple on a Hunting Party,” are frequently exhibited (N=5) but relatively infrequently viewed online (N=91). Conversely, some objects, like “Portable Buddhist Shrine,” have been rarely exhibited (N=1) but frequently viewed online (N=766). Unsurprisingly, we can see that the majority of 14th century sculptures (N=73) have never been displayed, about half (N=43) have never been published, and, while many objects have single digit pageviews, only one object has never been viewed online.
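Counts of objects that have never been exhibited, published, or viewed can be computed by checking each engagement field for zero. The Python sketch below assumes fields named `exhibitioncount`, `publicationcount`, and `totalpageviews` (assumptions about the API's output) and treats a missing value as zero, which is itself an assumption. The sample records are made-up placeholders.

```python
def count_never(records, field):
    """Count records whose engagement field is zero or missing."""
    return sum(1 for r in records if (r.get(field) or 0) == 0)


# Made-up placeholder records, not real HAM data.
sample = [
    {"exhibitioncount": 0, "publicationcount": 0, "totalpageviews": 5},
    {"exhibitioncount": 1, "publicationcount": 0, "totalpageviews": 0},
    {"exhibitioncount": 0, "publicationcount": 2},  # page views missing
]

never_exhibited = count_never(sample, "exhibitioncount")
never_published = count_never(sample, "publicationcount")
never_viewed = count_never(sample, "totalpageviews")

print("never exhibited:", never_exhibited)
print("never published:", never_published)
print("never viewed online:", never_viewed)
```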

This data inspires further questions, such as:

  • Beyond the 14th century sculpture data, what are the most frequently accessed objects via exhibition, publication, and online, and how do they compare? Why is this?
  • Beyond the 14th century sculpture data, what cultures, eras, and types of objects have never been exhibited, published, or viewed? Why is this?
  • How do these statistics align with the HAM’s curatorial, collections, and overall institutional mission?

Our foray into the HAM API provided us with a glimpse of what open data can reveal about museums, museum collections, and their relationships with the communities and culture they aim to serve. However, our work left us with more questions than answers. To follow through with any of the additional questions inspired by our data visualizations, we’d need to dig deeper. Full and ready access to HAM collections data beyond 14th century sculptures, clarification on the museum’s cataloging processes and vocabulary, and conversations with HAM staff about how they plan to continue developing the API are just a handful of resources that would motivate and inform a larger project involving the HAM API and questions of museum collections and public engagement.

Closing Thoughts on Open Data

When first approaching this project, we were very concerned about the technical expertise needed to ask meaningful questions about open data. However, while there’s a definite learning curve for working with APIs — especially for those in the humanities without a programming background — we found that the sheer amount of information that can be pulled through these systems is enough motivation to learn these skills.

While we acknowledge that our case study only represented a tiny fraction of the HAM’s overall collection, we still think our data revealed something about how the HAM collects, studies, and provides access to its collection. We’re confident that much more can be revealed about the institution from further work with its open data.

According to Steward, about 80 users have requested access to the HAM’s museum collections data via its API. While no projects or papers have been published yet, Steward acknowledges the potential of the API as a medium:

The heart of the API — it’s just really trying to share the scholarship and the knowledge that we produce here in a way that’s going to let people put it into interfaces that makes sense for them.

We think open data presents an astoundingly diverse avenue for collaboration between cultural institutions and the public. However, current scholarship only reveals the tip of the iceberg; much more can be done to explore the potential benefits of open data.

For example, while individual institutions are doing interesting work with their open data by sharing authority with developers, designers, and coders to create dynamic projects, these initiatives are often undertaken in an institutional vacuum. What possibilities emerge when we imagine an open data platform that connects information cross-institutionally? What projects (scholarly, artistic, or otherwise) could emerge from such a participatory system?

_______

Andrea Ledesma (@am_ledesma) is a first year MA candidate in the Public Humanities program at Brown University. She’s intrigued by the intersection of history, culture, and technology in museum exhibition and curation. Learn more about her work at http://bit.ly/1m903Zp.

Leah Burgin (@beahlurgin) is a first year MA candidate in the Public Humanities program at Brown University. With a background in informal education and anthropological archaeology, she’s interested in how cultural institutions facilitate public engagement. Learn more about her work at about.me/leahburgin.