Wikidata in Collections: Building a Universal Language for Connecting GLAM Catalogs

Last year, I was part of a group of writers who discussed the powerful ways in which Galleries, Libraries, Archives and Museums (GLAMs) connect their cultural heritage collections with the world through Wikidata. This year, at Wikimania in Montreal, conversations about Wikidata were at a fever pitch, filling rooms throughout the conference. We learned how Wikimedia communities around the world are using various tactics for enriching, connecting and learning from heritage collections, spreading the data across dozens of Wikipedia language communities and external applications. Moreover, at WikidataCon this year, a slew of sessions highlighted just how vital Wikidata can be for making heritage collections more accessible.

However, among all of those opportunities, one of the most exciting collective interests in these conversations was using Wikidata to describe individual objects within collections at heritage and memory institutions. Not only do cultural heritage organizations want to see or understand the connections between topics in their collections, but they actually want to leverage the connections to make their catalogues more useful for their patrons.

Wait, you say data?

In the cultural heritage community, one of the most vital parts of curating and collecting museum, archival and library collections is keeping track of the many types of content that these organizations preserve. The best way to do this is to describe collection materials with what is called “metadata”. This metadata can be in many different formats: from sentence-like descriptions of the items, to more structured data, like key facts or statements about the objects. When you search at a Museum, Library or Archive for objects within their collection, you are most frequently searching this metadata.

However, as you can probably imagine, with thousands of libraries, archives and museums around the world each building their own collection records and catalogues, it is very hard to look for materials on a subject that might be held in multiple institutions and in multiple languages. For this reason, some of the professional communities, especially in libraries, build shared vocabularies for describing topics and subjects, called authority controls or controlled vocabularies, along with shared rules for describing these materials and constructing names.

These descriptive practices, authorities and vocabularies allow many different organizations to develop shared strategies for describing materials. In turn, this allows computers to connect these collections and confidently say that these items relate. For example, software can use authority control to see the connection between different forms of an author name (e.g. “J.K. Rowling”, “Rowling, J.K.”, “Robert Galbraith”, and “Joanne Rowling”), or use a subject vocabulary to connect different expressions of a topic.
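As a rough illustration of how authority control lets software treat different name forms as one entity, here is a minimal Python sketch. The identifier "AUTH-0001" and the lookup table are hypothetical, invented for this example; real authority files use their own identifiers and far richer records.

```python
# Hypothetical authority file: each variant name form points to one shared identifier.
AUTHORITY_FILE = {
    "J.K. Rowling": "AUTH-0001",
    "Rowling, J.K.": "AUTH-0001",
    "Robert Galbraith": "AUTH-0001",
    "Joanne Rowling": "AUTH-0001",
}


def same_entity(name_a, name_b):
    """Return True when both name forms resolve to the same authority identifier."""
    id_a = AUTHORITY_FILE.get(name_a)
    id_b = AUTHORITY_FILE.get(name_b)
    return id_a is not None and id_a == id_b
```

Two catalogues that record the same identifier can be linked by a simple lookup like this, even when they display the name differently.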

Making Context Visible

Traditionally, library metadata puts the names of authors, publishers, and other tags, like the Dewey Decimal classification, in the context of a bibliographic universe that usually only extends to the library’s systems or other scholarly databases and repositories. Very rarely do libraries put their data and materials within the context of the wider digital landscape. However, with the help of Wikidata, that is beginning to change.

Take, for example, the work being done on the library catalogue at Laurentian University, in Ontario, Canada. Systems librarian Dan Scott has added the ability for the catalogue to display small snapshot cards of information, similar to what you would find in a Wikipedia infobox or a Google Knowledge Graph information box. However, these are special: the data is curated by the Wikidata community. Now when students and other users search for their favourite album in the library catalogue, they can find valuable connective information, such as discographical information from MusicBrainz and Twitter handles for the artist. With this curated information, students can take their research or engagement with the library collection beyond the institution’s collections.

Laurentian University catalogue displaying an info card powered by Wikidata

This same principle of helping researchers go beyond “the collection” can be found on the website MediaLibrary.it, which acts as a portal to library collections in Italy. By first reconciling book authors with the Virtual International Authority File (VIAF), one of these vocabularies, and then verifying those matches against Wikidata items, the MediaLibrary was able to generate descriptions of the authors and their work, along with links to Wikipedia for further exploration of the topic.
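To make the VIAF-to-Wikidata step more concrete, here is a minimal Python sketch that builds a SPARQL query for the public Wikidata Query Service. It relies on P214, the Wikidata property for VIAF IDs. The function name, the choice of Italian Wikipedia links and the overall approach are assumptions for illustration, not a description of how MediaLibrary.it actually implemented its matching.

```python
def viaf_to_wikidata_query(viaf_id):
    """Build a SPARQL query that finds the Wikidata item carrying a given VIAF ID."""
    if not viaf_id.isdigit():
        raise ValueError("VIAF IDs are expected to be numeric strings")
    return (
        "SELECT ?item ?itemLabel ?article WHERE {\n"
        # P214 is the Wikidata property that stores a VIAF ID.
        f'  ?item wdt:P214 "{viaf_id}" .\n'
        # Optionally fetch the linked Italian Wikipedia article, if one exists.
        "  OPTIONAL {\n"
        "    ?article schema:about ?item ;\n"
        "             schema:isPartOf <https://it.wikipedia.org/> .\n"
        "  }\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "it,en" . }\n'
        "}"
    )
```

The resulting string could be sent to the query service endpoint with any HTTP client. A match that returns exactly one item is a good candidate; zero or several results would probably need a human to check.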

This kind of “context beyond the collection” is particularly important for materials that necessarily require interpretation, such as art or historical objects. Take, for example, the Museum of Modern Art in New York (MoMA): the Museum has integrated Wikidata and associated Wikipedia articles into “artist” pages in its online catalogue. When MoMA first developed the catalogue, it relied on Grove Art biographies to supplement the records in its own collection for members of the public. However, these biographies were frequently lacking: either going out of date or not covering all of the general interest questions members of the public would bring to the collection. The Museum decided to supplement the artist profiles first with Wikipedia articles, and then with Wikidata IDs. By doing so, both the public and reusers of the MoMA data can add even more context from Wikidata or the other vocabularies connected to the collection.

Or take, for example, the work of the Social Networks and Archival Context (SNAC) project, a collaborative of archival organizations that is working to develop a website that helps people both navigate archival collections and understand the connections between them. By connecting SNAC IDs with Wikidata, they are able to benefit from the Wikimedia community’s effort to create openly licensed images representing those individuals, making for a much more vibrant and dynamic way to discover those resources. Moreover, the archival collections collated by the SNAC community are now available to other collections using Wikidata.

Bridging the gap between expert vocabularies and community description

One of the greatest opportunities and challenges of using connected or linked data in the Cultural Heritage space is that the vocabularies needed for many collections, topics and intellectual spaces defy the expectations of the larger professional communities.

Take, for example, popular or folk music: while some libraries and archives specialize in collecting this music, standards and rules for describing it in library and archival systems can leave out key information that describes the roles of performers or the nature of the music’s performance. Libraries depend on the presence of a medium, such as an album or piece of music, in order to enter it into the database. Developing a structure for describing folk music may not be something that an institution has the capacity to do independently, especially in a highly controlled or under-resourced environment.

Creating an authority control through a traditional process requires expertise and institutional support across a large number of organizations. With Wikidata, this expertise can be distributed, with organizers focusing on building their own vocabularies. This fall, librarians Stacy Allison-Cassin and Dan Scott are leading an initiative to mark the 150th anniversary of the confederation of Canada by collecting structured data about Canadian music. Moreover, Allison-Cassin is initiating a project with other research libraries in North America to figure out how to represent the cultures of indigenous peoples in Wikidata, topics that have been hard to model because of the many hierarchical and westernized assumptions that the library profession brings to cataloguing.

These new data structures can be the foundation for expanding the work of librarians and archivists for years to come. When asked about the project, Allison-Cassin wrote that “Authority control plays a very powerful role in shaping what knowledge is available in libraries. Because authorities are only created for those responsible for creating works, for example the author of a book, and not necessarily a person written about in a book, many people are absent from authority files. Furthermore, they may be described in a way that is problematic, and current systems make addressing these problems very difficult or even impossible. For example, the Library of Congress, a massively influential source of authority data in North America, is an agency of the government of the United States and describes information according to the needs of the government. Wikidata, on the other hand, creates opportunities for community participation and allows for a greater diversity in the way people can be represented in data, giving people the power to shape knowledge about their own communities”.

Moreover, even when these vocabularies exist, many collections sit at the intersection of different authority control scopes. Take, for example, the instructional slide collections that are part of Project Durchblick at the Humboldt-Universität. As the librarians and art history faculty digitize these instructional slides, they need to be able to provide specific metadata that describes both the individual objects depicted in the slides and the contextual data about their use. Though the project team could have used both German National Library and art history vocabularies to cover some of the topics, the international and often very complex sets of information could not be covered with those controlled vocabularies. Enter Wikidata! Instead of having to choose which of these vocabularies to use, the team can rely on Wikidata as a “meta” vocabulary, which allows them both to label digitized content from their project and subsequently to link to other vocabularies matched with Wikidata.

Already, the relevance of Wikidata for popular metadata ontologies allows for powerful applications. For example, the Finnish broadcasting corporation YLE uses Wikidata as its main vocabulary for tagging pieces of journalism. Unlike historical materials, where topics can be authoritatively described using established knowledge, journalism frequently needs to be connected to topics that are being defined in the moment, for example in politics, where political parties and candidates are constantly changing. This living archive of journalism can then be discovered and reused by researchers based on its vibrant connections with other kinds of data.

Moreover, if a researcher wants to cross-reference the YLE backfile with other research materials on people, places or things in other collections, they can look for other collections using the Wikidata item or other more traditional authority controls or identifiers (like how MediaLibrary.it cross-referenced VIAF IDs with Wikidata). Now the public journalism emerging out of Finland in Finnish and Swedish can be intimately connected with the larger knowledge commons.

The Wikidata-powered interface for the YLE Drupal-based content management system, which allows journalists to tag YLE works with Wikidata items.

Adding context, beyond the collection

However, sometimes the process of tagging or reflecting does more than just “provide contextual data”: it can actually allow for complex analysis, or help the archive develop more active applications of its collection.

Take, for example, the work of the MediaLibrary.it librarians: instead of just matching contextual data, the librarians building the MediaLibrary are able to evaluate their collection for biases and context, which allows them to learn from and better understand the collection they are offering readers. For example, in evaluating the authors that are connected to Wikidata, they discovered that over 90% of the works in their collections were written by men: clearly the collection reflects biases in the publishing industry that prevent women from publishing! Or, since the library focuses on Italian, European and English language works, it’s not much of a surprise that France, Italy, Germany, the United Kingdom and the United States dominate the collections.

A representation of the nationality of authors in the MediaLibrary.it collection: http://tinyurl.com/y8rbc54c
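An analysis like the gender breakdown above comes down to counting a single property across the matched authors. The Python sketch below assumes each matched author has already been reduced to a small record with a "gender" label (in Wikidata this would come from P21, “sex or gender”). The record shape and function name are hypothetical.

```python
from collections import Counter


def gender_shares(authors):
    """Return the share of each gender label among authors matched to Wikidata."""
    # Authors with no recorded gender are left out of the totals.
    counts = Counter(a["gender"] for a in authors if a.get("gender"))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gender: count / total for gender, count in counts.items()}
```

Because authors without a match or without a gender statement are excluded, the resulting shares describe only the part of the collection that could be connected to Wikidata.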

These computer-generated applications of the collections can step beyond basic visualizations. Take, for example, the work of Yale University Library Digital Conservator Katherine Thornton. At the library, Thornton archives digital records, such as old documents and pieces of software. When her collections are tagged with Wikidata items for the software requirements of each of these old files, the library’s software emulators can automatically choose the right software environment for accessing the content. Learn more about the application from Thornton in a presentation at the University of Edinburgh.

Contextual data retrieved from Wikidata can be used creatively for many practical purposes that save time and energy for heritage professionals. In the Netherlands, two Wikimedians (Hanno Lans and Michelle van Lanschot) have founded the non-profit initiative Copyclear, which helps cultural organizations (such as museums) to sort out the copyright status of works in their collections. To do this, Copyclear imports the list of artist names from an art collection into its platform (based on the open source content management system Drupal). These names are then matched with Wikidata and, using the death date and country-specific legal provisions, Copyclear checks whether each artist’s work is in the public domain and, when the work is still copyrighted, whether the artist is a member of any collecting society. With this information, the cultural institution is much better equipped to manage the copyrights of works in its collection, and to make informed decisions about publishing reproductions of artworks online.
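The core of a check like this can be sketched in a few lines of Python. This is not Copyclear’s code: the function, the term table and the single "NL" entry (life of the author plus 70 years) are assumptions for illustration only, and real copyright rules have many exceptions. The death year would come from a Wikidata date of death statement (P570).

```python
# Illustrative copyright terms in years after the author's death.
# Assumption for this sketch only: real rules vary by country and have exceptions.
TERM_YEARS_AFTER_DEATH = {"NL": 70}


def is_public_domain(death_year, country, current_year):
    """Return True/False, or None when the status cannot be decided automatically."""
    term = TERM_YEARS_AFTER_DEATH.get(country)
    if death_year is None or term is None:
        return None  # Unknown death date or unlisted country: needs manual review.
    # Protection is assumed to run until the end of the calendar year
    # that falls `term` years after the year of death.
    return current_year > death_year + term
```

Returning None rather than guessing keeps uncertain cases visible, so that a person can review artists with missing death dates or unusual legal situations.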

Or take the national library of the Netherlands, the Koninklijke Bibliotheek, where the research lab has matched collection tags with Wikidata concepts across its digitized newspapers. With these matched concepts, they have been able to run complex, previously unimaginable searches against the collection. For example, if you were to ask “What newspapers contain content related to members of the Dutch parliament who were not born in the Netherlands?”, you would get this result. What an amazing and arbitrary request! But now, it’s possible! Learn more about the experimental feature from Theo van Veen’s talk at WikidataCon 2017.
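The Wikidata half of a question like this can be expressed as a SPARQL query. The Python sketch below builds one, using the properties P39 (position held), P19 (place of birth) and P17 (country), with Q55 as the item for the Netherlands. The position item ID is left as a parameter because the right item for “member of the Dutch parliament” would need to be looked up; the function name is hypothetical, and this is not the Koninklijke Bibliotheek’s implementation. Matching the resulting people against newspaper tags is a separate step not shown here.

```python
def foreign_born_officeholders_query(position_qid, country_qid="Q55"):
    """Build a SPARQL query for people who held a position but were born outside a country."""
    for qid in (position_qid, country_qid):
        if not (qid.startswith("Q") and qid[1:].isdigit()):
            raise ValueError(f"Not a Wikidata item ID: {qid!r}")
    return (
        "SELECT DISTINCT ?person ?personLabel ?birthplaceLabel WHERE {\n"
        # P39 = position held, P19 = place of birth.
        f"  ?person wdt:P39 wd:{position_qid} ;\n"
        "          wdt:P19 ?birthplace .\n"
        # Keep only birthplaces that are not recorded as being in the given country (P17).
        f"  FILTER NOT EXISTS {{ ?birthplace wdt:P17 wd:{country_qid} . }}\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en" . }\n'
        "}"
    )
```

One caveat of this approach: a birthplace with no country statement at all would also pass the filter, so results would likely need a quick sanity check.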

Help us build the ultimate connective tissue for collections!

These initial applications of Wikidata to collection systems are just the beginning! As more and more collections or vocabularies become connected to Wikidata, we can begin to ask questions like “which collections have content related to a particular topic?” or “what research materials are available on this topic, so that an expert researcher, hobbyist or Wikipedian can write about it?”

Moreover, aggregators of institutional collections see the value of these kinds of cross-cutting vocabularies which connect many institutions, with Europeana even suggesting that a necessary next step for authority controls and metadata is connecting those vocabularies with Wikidata. With vocabularies connected, reusers and researchers can save immense amounts of time and energy cleaning the data for reuse, instead focusing on what they have expertise in: analyzing and representing that content to the public.

Building this vast web of connected heritage collections requires more than just institutions adding these vocabularies to their collections! The Wikidata community could use your help:

Thank you for the feedback and support from: Stacy Allison-Cassin, Beat Estermann, Dan Scott, Ed Erhart and Sandra Fauconnier.