After 20 years of living in a world with Google in it, many web users consider themselves fairly sophisticated when it comes to finding things. That big search box leads us to likely-looking results, and we know how to tweak our queries if we sense them going down the wrong path. We don’t know much about what’s really happening behind the scenes, but significant amounts of money depend on our interactions with search and our understanding of search results and their relationship with the web pages they lead to.
People who make web pages try hard to make their pages work in this world, to help people find things. And when we build our own search and navigation tools within more specialised sites, we know that people are bringing understanding and expectations influenced by a world with search engines in it.
When we take our search skills and try to find something in an archive, we might find that these skills seem to have lost their magic. We know it’s in there somewhere! But we’re just not finding it. What’s gone wrong?
In the early days of the web, users found their way around with the help of curated intellectual structures — web directories, like Yahoo. These were hierarchies, knowledge schemes. Follow the forking paths and they’ll lead you to your goal — or maybe to something unexpected and interesting.
Then along came AltaVista: it indexed the whole web, and now you could search the text of all those pages. And then came Google, with PageRank, and twenty years later a curated directory seems quaint, hopelessly inadequate to navigate the vast ocean of information. The idea that a human-applied organising scheme could help you find things on the web has faded away. We’ve trained ourselves with different strategies for locating the things we are interested in. We’ve evolved alongside search engines and e-commerce sites and picked up new behaviours for landing on that page we are after.
Archives are curated intellectual structures.
Archivists have organised and described them. But instead of the general thematic, hierarchical subject organisation of a web directory (which could in theory be flattened out into a single long list of subjects), searchable knowledge about the archive lies along the paths, in archival description at each level, not just at the leaves. The structure is not based on subject classification (“show me things about trains”) but on the arrangement of the archive when it was acquired, which usually corresponds to how it was organised by its personal (or corporate, or departmental) creator.
The archivist seeks to explain and give context to the material and its original order, to help people find their way around it, rather than to classify or categorise individual intellectual objects within it.
Our raw material content — whether we use it for browsing or searching — is this archival description. More work might be done on individual objects, by humans or machines, but we start with the archival description. This may have little and even sometimes nothing to say about an individual object.
Unless users are familiar with archival practice, these organising principles are not necessarily expected, or even perceivable. And even if the interface elegantly conveys the structure of the hierarchy, and the user’s current position within it, the reason for the hierarchy — the principles behind it — may be misinterpreted or misunderstood. Even the seasoned genealogist or local history researcher may be thrown off course by a quite different style, or degree of available detail in archival description when they attempt to navigate unfamiliar material from a different source.
Archival description, which may have been written decades earlier for the benefit of the physical visitor to the archives, seeks to avoid repetition of information already provided along its paths, at the different levels. Unless you are lucky enough to be working with an extensively digitised archive collection, and those digitised items are public, the leaves at the ends of the paths are not even visible. Their content — the content of things in the archive — is not part of the searchable space.
If the archive has not been described down to the item level, the slender branches that separate individual intellectual objects such as letters and reports aren’t visible either. The path might end at something called Correspondence 1948–52; some overview of the content of those documents may be given at this level, but a full contextual understanding of the archivist’s description is only gained by a walk up the path from that point. A categorisation of individual intellectual objects may simply not exist. The thing-to-be-found has no page of its own to land on. If you arrive at a leaf from a search result, you have not been down this knowledge-building path. We know that it is hard for many users to make sense of what they are looking at when landing on a web page for an item in an archive, leading to frustration and confusion.
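The walk up the path can be made concrete with a small sketch. The Python below is illustrative only: the `Node` class, its field names and the sample descriptions are invented for this example and are not taken from any real catalogue or archival standard. It shows how the context for one level might be assembled by collecting the descriptions of every level above it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One level of an archival hierarchy (hypothetical model)."""
    title: str
    description: str
    parent: Optional["Node"] = None


def path_context(node: Node) -> List[str]:
    """Collect descriptions from the top of the hierarchy down to `node`."""
    steps = []
    current: Optional[Node] = node
    while current is not None:
        steps.append(f"{current.title}: {current.description}")
        current = current.parent
    steps.reverse()  # top level first, the node itself last
    return steps


# Invented sample data: a three-level path ending at a file of letters.
fonds = Node("Papers of a railway engineer", "Personal and professional papers.")
series = Node("Bridge inspection work", "Records kept while surveying bridges.", fonds)
file_ = Node("Correspondence 1948-52", "Letters received.", series)

for line in path_context(file_):
    print(line)
```

Read on its own, "Letters received" says very little. Only when the two levels above it are printed first does it become clear whose letters they are and what work they relate to.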
What do libraries do?
Maybe we can learn from libraries, or museum catalogues. People are more familiar with libraries. Can they apply some of that awareness in an archive?
Library search or browse involves thousands, or even millions, of intellectual creations (works), and instances of those works. They have been described by professionals; they have catalogue records that conform to international standards, just as archives do. Hierarchies may be present: the Library of Congress Subject Headings has narrower terms and broader terms; it has hierarchical properties, and the thinking behind that hierarchy is familiar and recognisable to the user from their experience of knowledge organisation generally.
But a library doesn’t need to be understood hierarchically. The leaves are the main attraction — consistently defined, and individually described. These consistent standards-driven catalogue entries are also very amenable to alternative discovery mechanisms we can build for them on the web; their metadata can be harnessed to drive Generous Interfaces and unlock collections for many more visitors and curious explorers, through aggregations based on properties of the works, and cross-linking from one work to another through their shared properties.
In a library, the described intellectual object is consistently classifiable. The work has properties like author(s) and subject(s) and creation date. These properties comprise its metadata; on the web, they are properties that help you find it directly, from a general search engine or a library portal. These properties help you land on the thing. On the whole, in the world of bibliographic description, discrete intellectual creations (works) have discrete identifiable descriptions. Works in libraries can be made to play nicely with the mental model of search familiar to users from their experience of the web generally, where web pages (and their representation in search results) can work as context-free units for individual appraisal. Bibliographic metadata can translate into a vocabulary, perhaps using schema.org, that search engines understand.
Can we apply this to archives? Archival descriptions are sometimes about works (distinct intellectual creations). But often they are not. They are also about organisation, arrangement, provenance. In one archival series, an item level description might correspond to a distinct intellectual creation, like a letter. But in others, a whole box of papers (many discrete creations, about different subjects) might have a short paragraph of description devoted to it, and the aim of that description is subtly different from a bibliographic one: it helps you answer the question “should I look in this box?”
Even if the archival description is rich enough to describe individual intellectual objects, that isn’t the whole story. If you just see the leaf, you aren’t getting the archivist’s insight from the level above, organising content that is part of the hierarchical intellectual structure. And if landing on a web page for that leaf via search only happens as a result of the search engine matching that leaf’s direct description and properties, your query space is missing vital information, adding to the feeling of “I think I know what I’m looking for, I think it’s here somewhere, but I’m just not finding it”.
But if, in trying to avoid this, we ensure that any level of the hierarchy can appear as a valid result in an archive’s search interface, or as an entry point from a web search engine, we have a different problem. The user needs a good understanding of exactly what is going on: what is the thing they have landed on? Many visitors expect to be able to see the thing described, and this is only sometimes the case (if the thing has been digitised, which is less common than people expect). And how do they explore further? They need a mental model of how archives work to successfully explore up and down from that point. Knowledge lies along the paths; it is cumulative, building up an understanding of the archive as you go. This is quite different from the approach taken in bibliographic description.
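Both problems can be seen in a toy example. The sketch below is a deliberately naive word match, not how any real search engine or catalogue works, and the records, identifiers and function names are all invented for illustration. It compares matching a query against each level's own description with matching against the text accumulated along its path.

```python
from typing import Callable, Dict, List

# Invented records: each has its own description and the identifier of
# the level above it (None at the top of the hierarchy).
RECORDS: Dict[str, dict] = {
    "fonds": {"text": "Papers of a railway engineer", "parent": None},
    "series": {"text": "Bridge inspection reports", "parent": "fonds"},
    "file": {"text": "Correspondence 1948-52", "parent": "series"},
}


def own_text(record_id: str) -> str:
    """Only the description written at this level."""
    return RECORDS[record_id]["text"]


def cumulative_text(record_id: str) -> str:
    """This level's description plus everything above it on the path."""
    parts = []
    current = record_id
    while current is not None:
        parts.append(RECORDS[current]["text"])
        current = RECORDS[current]["parent"]
    return " ".join(reversed(parts))


def search(query: str, text_for: Callable[[str], str]) -> List[str]:
    """Return ids of records whose text contains every query word."""
    words = query.lower().split()
    return [
        rid for rid in RECORDS
        if all(word in text_for(rid).lower() for word in words)
    ]


print(search("railway correspondence", own_text))         # []
print(search("railway correspondence", cumulative_text))  # ['file']
print(search("railway", cumulative_text))                 # ['fonds', 'series', 'file']
```

The first query finds nothing when only each level's own words are searched, because "railway" and "correspondence" are written at different levels. Accumulating text along the path finds the file, but the last query shows the cost: every level now matches, and the user has to work out which kind of thing each result is.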
Search engines and archives
Web pages can describe themselves to search engines using a vocabulary called schema.org. Recent proposed extensions to this vocabulary offer the possibility that consumers of schema.org descriptions included on web pages could at least understand that a described thing is part of an archival description, and that there is a navigable hierarchy present. But that doesn’t help show us the way to a less baffling user experience for search, browse and discovery, in their various combinations as users explore and comprehend archives.
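For illustration, here is roughly what such a self-description might look like. This is a sketch under assumptions: the post does not name the extension, so the example assumes it means the `ArchiveComponent` and `ArchiveOrganization` types and the `holdingArchive` property from schema.org's archive vocabulary, together with the general `isPartOf` property. The helper function, the URLs and the names are hypothetical.

```python
import json
from typing import Optional


def archive_component(
    name: str,
    url: str,
    parent_url: Optional[str] = None,
    holding_archive: Optional[str] = None,
) -> dict:
    """Build JSON-LD for one level of an archival description (sketch)."""
    doc = {
        "@context": "https://schema.org",
        "@type": "ArchiveComponent",  # assumed type, see note above
        "name": name,
        "url": url,
    }
    if parent_url is not None:
        # Points at the level above, signalling a navigable hierarchy.
        doc["isPartOf"] = {"@type": "ArchiveComponent", "url": parent_url}
    if holding_archive is not None:
        doc["holdingArchive"] = {
            "@type": "ArchiveOrganization",
            "name": holding_archive,
        }
    return doc


# Hypothetical example values; example.org is a placeholder domain.
markup = archive_component(
    name="Correspondence 1948-52",
    url="https://example.org/records/file-1",
    parent_url="https://example.org/records/series-1",
    holding_archive="Example Archive",
)
print(json.dumps(markup, indent=2))
```

Markup like this tells a consumer that the page describes part of an archive and where the level above lives. It says nothing about why the hierarchy is arranged as it is, which is the gap the paragraph above points to.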
How do we reduce this bafflement?
The next post will start to look at ways of reducing this bafflement.
As we progress with Project Alpha, we’re looking to test some of the concepts with people new to The National Archives. If that sounds like you, we’d welcome your help! Register your interest here.