Managing Big Data — Again
I was reading the recent Distillations magazine from the Chemical Heritage Foundation and saw an article on Information Overload. It reminded me of the post I wrote a while ago on big data in the 19th century, along with multiple posts about the American Chemical Society and Libraries in the 19th century. Sarah Everts, the author of the information overload article, rightly points out that having to manage vast amounts of data is not necessarily a new problem, as several other authors have also noted. She concludes by asking “how should we collect this metadata intelligently and in useful moderation when we don’t even know what research questions will be interesting to future generations of scientists?” and suggests that “modern data curators may wish to learn from the classical collectors: natural-history museums.” She also discusses the importance of metadata in facilitating such management.
I wholeheartedly agree with all of Everts’ conclusions, but think that it is also important to look at two other organizations that are particularly relevant to scholarly communication: libraries and scholarly societies. Both of these groups are also essential to managing information overload, and, I think, form a mutual dependency (similar in some ways to the mutual dependencies created by academic journals). Additionally, I think that there is a social dimension to both libraries and scholarly societies (as well as to natural-history museums) that underlies much of what Everts is discussing. In the case of libraries and scholars, there is also a kind of divide between the two groups that provides an interesting twist on Everts’ argument.
So far in my own work I have been focusing largely on the history of “big data” in the nineteenth century, particularly as it relates to the American Chemical Society. Other historians of science have looked more broadly at such issues, however. For example, Alex Csiszar has argued that “The key point was not the increasing volume of papers coming into print,” even though increasing volume is the argument one usually hears in modern discussions of information overload. Rather, according to Csiszar, scientists in the nineteenth century attempted to replicate social organizations that were “safeguarding scientific value that had once been the putative territory of the societies and academies.” I have found similar patterns in my work. Certainly J. Lawrence Smith of the American Association for the Advancement of Science, and later the American Chemical Society, argued that research should be “pure” and free from the interference of the outside forces Csiszar discusses.
What does this have to do with libraries? During the nineteenth century, libraries were also transitioning. My somewhat ancillary study of Theophilus Wylie, the first librarian at Indiana University, demonstrates this fairly well. Wylie argued for a library that reflected the educational curriculum of the university, and also represented a tradition in which academics, not professional librarians, managed collections. Universities, however, were changing to meet the needs for professional education. Libraries changed with universities, and increasingly focused on becoming complete collections of all published work. Thus, there was a tension between the two organizations. On the one hand, scholarly societies were struggling to maintain a social order that differentiated “pure” research from the vast amount of unscientific periodical literature available. Libraries, on the other hand, tried to collect everything and provide tools for their patrons to navigate this sea of information.
Therefore, at least in the late nineteenth century, there were two ways of creating order out of the chaos brought on by information overload. First, there was the scholarly method of using social organization (and eventually peer review and the other mechanisms that came with it). Second, there was a set of methods in libraries that relied on specialists and classification systems to help library users navigate the explosion of information available to them. Csiszar hints at an important aspect dividing these two communities: authority. Libraries and scholars derive their authority from different sources and from different philosophical viewpoints. The question is, given the current explosion in “big data” and the correct assertion quoted by Everts that “Producing and saving a huge amount of data that nobody will reuse has doubtful value,” whether it is even possible to solve this crisis of authority for the problem of big data.
An answer may lie within the discipline of information science. Archival studies has a sub-discipline called diplomatics that endeavors to understand the authority of a particular document within a particular historical context. Modern scholars in diplomatics have developed a concept they call “organic information,” which treats all information (print and electronic) as a kind of living organism whose meaning and authority depend on social context. Philosophers of science have also noticed the link between information and living organisms. Natural-history museums of the type that Everts discusses provide an interesting analogy to this concept of organic information since they, quite literally, collect examples of living organisms. Therefore, in a way, Everts’ article has uncovered an interesting link that needs to be further explored.
The last sentence of Everts’ article on information overload says, “with its overabundance of information, managers and creators of big data may find their inspiration in the most analog of collections.” I agree, but think there are some interesting twists on that line of argument. In the case of nineteenth-century academic information, a divide grew between libraries and scholarly societies that were attempting to manage the first explosion of “big data.” This division between the groups arguably still exists today, and may contribute in part to the problems of scholarly communication. The way to resolve this division, however, goes beyond just the provision of good metadata in the ways Everts suggests. Rather, it may have to rely on the creation of a new method for deriving authority over information that is continually in flux. Diplomatics may provide one framework to help reconcile this division between libraries, scholars, and many other groups. There is one clear lesson from history in this case, however. Given the vast quantities of data that continue to be produced, an explosion that will only grow over time, this is a problem that we, both as a society and as an enterprise of higher education, cannot afford to get wrong the second time around.
Originally published at histscholcomm.wordpress.com on August 4, 2016.