Variants: Comparing LCSH alt labels and Wikipedia redirects

Matt Miller
3 min read · Sep 25, 2019


Library of Congress Subject Headings (LCSH) are often assigned alternate labels when they are created. These tracing terms provide alternate access points to the official terminology. For example, “Hydrothermal vents” has a number of alt labels:
“Black smokers (Oceanography)”
“Hydrothermal deep-sea vents”
“Oceanic hot springs”
“Deep-sea vents, Hydrothermal”
“Vents, Hydrothermal”

The idea is that when someone searches your library discovery system for “Black smokers”, they end up with results for things assigned “Hydrothermal vents”.
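As a rough illustration of that lookup, here is a minimal sketch of a variant index. The function names (`build_variant_index`, `resolve`) are made up for this example, and stripping the parenthetical qualifier so that “Black smokers” matches “Black smokers (Oceanography)” is my assumption about how a discovery system might behave, not a description of any particular product.

```python
import re

def strip_qualifier(label):
    """Drop a trailing parenthetical qualifier such as '(Oceanography)'."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", label)

def build_variant_index(headings):
    """Map every label (preferred or alternate) to its preferred heading."""
    index = {}
    for preferred, alt_labels in headings.items():
        for label in [preferred, *alt_labels]:
            # Index the label both as written and without its qualifier.
            index[label.casefold()] = preferred
            index[strip_qualifier(label).casefold()] = preferred
    return index

def resolve(term, index):
    """Return the preferred heading for a search term, or None if unknown."""
    return index.get(term.strip().casefold())

headings = {
    "Hydrothermal vents": [
        "Black smokers (Oceanography)",
        "Hydrothermal deep-sea vents",
        "Oceanic hot springs",
        "Deep-sea vents, Hydrothermal",
        "Vents, Hydrothermal",
    ]
}
index = build_variant_index(headings)
print(resolve("Black smokers", index))
```

A real system would likely do fuzzier matching than an exact dictionary lookup, but the principle is the same: every variant points back to one official heading.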

Wikipedia has a similar system called redirects. When someone goes to: https://en.wikipedia.org/wiki/Black_smoker they get redirected to a part of https://en.wikipedia.org/wiki/Hydrothermal_vent. These redirects were created ad hoc as needed. The result is an invisible folksonomy created over the history of the article being edited, merged with other articles, and Wikipedians adding additional access points to the article based on how people expect to locate information on Wikipedia.
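If you want to see the redirects for an article yourself, the MediaWiki API can list them through `prop=redirects`. The sketch below only builds the request URL and parses a response; the helper names are mine, the sample response is hand-written for illustration rather than real API output, and continuation for articles with very many redirects is left out.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def redirects_url(title):
    """Build a MediaWiki API URL that lists redirects pointing at `title`."""
    params = {
        "action": "query",
        "prop": "redirects",
        "titles": title,
        "rdlimit": "max",
        "format": "json",
        "formatversion": "2",  # pages come back as a list, not a dict
    }
    return API + "?" + urlencode(params)

def redirect_titles(response):
    """Collect redirect titles from a parsed (formatversion=2) JSON response."""
    titles = []
    for page in response.get("query", {}).get("pages", []):
        for redirect in page.get("redirects", []):
            titles.append(redirect["title"])
    return titles

# Hand-written sample shaped like an API response (illustrative only).
sample = {
    "query": {
        "pages": [
            {
                "title": "Hydrothermal vent",
                "redirects": [{"title": "Black smoker"}, {"title": "Ocean vent"}],
            }
        ]
    }
}
print(redirects_url("Hydrothermal vent"))
print(redirect_titles(sample))
```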

Now that there are many LCSH IDs in the Wikimedia ecosystem, new kinds of automated comparison between the two are possible. I was curious how these two alternate label systems compared to each other. Out of the 430,000 LCSH subject headings, around 34,000 are linked to Wikidata. Of those, 19,600 are connected to English Wikipedia articles where both the LCSH heading and the Wikipedia article have alt labels or redirects.

With our dataset of 19,600 headings and Wikipedia articles, we can see that there are on average more redirects than LCSH alternate labels:

Mean LCSH alt labels: 2.9
Mean Wiki redirects: 9.9

Median LCSH alt labels: 2
Median Wiki redirects: 6
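For anyone wanting to run the same kind of summary on their own data, here is a minimal sketch using Python's `statistics` module. The `label_stats` helper and the three-heading toy dataset are invented for the example; they are not the 19,600-heading dataset and will not reproduce the numbers above.

```python
from statistics import mean, median

def label_stats(labels_by_heading):
    """Return (mean, median) of the number of labels per heading."""
    counts = [len(labels) for labels in labels_by_heading.values()]
    return mean(counts), median(counts)

# Toy data: each heading maps to its list of variant labels (hypothetical).
lcsh_alt_labels = {
    "Heading A": ["a1", "a2", "a3", "a4", "a5"],
    "Heading B": ["b1"],
    "Heading C": ["c1", "c2", "c3"],
}

avg, mid = label_stats(lcsh_alt_labels)
print(f"Mean: {avg:.1f}, Median: {mid}")
```

Reporting both figures is useful here because a handful of articles with a very large number of redirects can pull the mean well above the median.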

These averages tell us the Wiki redirects are much more numerous than the LCSH alt labels. This makes sense, as the Wiki redirects often gather a wide range of topics under a single umbrella article. For example, let's look at the difference for our earlier example:

There are more variations of the words, but there are also additional variants that are more colloquial, like “Ocean vent”. This makes sense, as the redirects are a folksonomy created as needed from the bottom up, as opposed to the top-down approach of thinking up all the possible alternate terminology when the heading is created. Also interesting is that one of the alt terms used by LCSH, “Oceanic hot springs”, does not seem to be an access point in the Wikimedia system.

I created a little utility to explore these alt term comparisons: https://variants.glitch.me

Screenshots from https://variants.glitch.me/

I wanted to see how comprehensive the Wikipedia redirects were compared to the LCSH alternate labels. For each heading, I checked what percentage of its LCSH alt labels were present in the Wikipedia redirects (and Wikidata aliases):

Number of LCSH headings and the percent of their alt labels that were also in the connected Wikipedia redirects and Wikidata labels

This chart shows, for example, that around 4,300 LCSH headings had 100% of their alt labels present as EN Wikipedia redirects or Wikidata aliases. The bulk of the headings had most of their LCSH alt labels present in some form in the Wiki redirects. The 0% matches were often caused by slight variants in spelling.
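The per-heading percentage can be sketched roughly as follows. The normalization rule (drop parenthetical qualifiers, replace punctuation with spaces, ignore case) is an assumption I am making for illustration and may well differ from what the utility does; the helper names and the Wiki terms in the example are likewise made up.

```python
import re
from collections import Counter

def normalize(label):
    """Assumed matching rule: drop qualifiers and punctuation, ignore case."""
    label = re.sub(r"\s*\([^)]*\)", "", label)   # remove '(Oceanography)' etc.
    label = re.sub(r"[^\w\s]", " ", label)       # punctuation becomes a space
    return " ".join(label.casefold().split())

def coverage_percent(lcsh_alt_labels, wiki_terms):
    """Percent of a heading's alt labels that also appear among the Wiki terms."""
    if not lcsh_alt_labels:
        raise ValueError("heading has no alt labels")
    wiki = {normalize(term) for term in wiki_terms}
    matched = sum(1 for label in lcsh_alt_labels if normalize(label) in wiki)
    return 100 * matched / len(lcsh_alt_labels)

def histogram(percentages, width=10):
    """Count headings per bucket (0, 10, ..., 100) for charting."""
    return Counter(int(p // width) * width for p in percentages)

lcsh = [
    "Black smokers (Oceanography)",
    "Oceanic hot springs",
    "Vents, Hydrothermal",
    "Hydrothermal deep-sea vents",
]
# Illustrative Wiki terms, not the actual redirect list.
wiki = ["Black smokers", "Hydrothermal deep sea vents", "Ocean vent"]
print(coverage_percent(lcsh, wiki))
```

With an exact-match rule like this, a small spelling difference is enough to count as a miss, which is one way a heading can land in the 0% bucket.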

This data suggests that Wiki* terminology could not replace the terminology we use in the library world, but it could potentially be a very useful source of alternate labels to use in our systems. Because who really is able to spell “weimaraner” on the first try?
