The Secret History of “Data”: Reimagining the Past with AI and Machine Learning

Alex Wright
Oct 26, 2024


Last week I had the opportunity to share a few remarks at Belgium’s KIKK Festival on new directions in machine learning and historical research, alongside CUNY’s Peter Aigner. Following is a rough sketch of my talk (reconstructed from memory and cleaned up a bit for public consumption).

Quads for Authors, Editors, & Devils, ed. Andrew W. Tuer, 1883.

Welcome everyone, and thanks for coming out to spend a few minutes with us here this morning. Today we’re going to talk about how machine learning and related technologies are starting to change the landscape of historical research. But first, I want to acknowledge that this kind of topic isn’t the typical fare here at KIKK, which like most technology-oriented events tends to focus primarily on the future rather than the past.

But if we’re going to engage with the creative possibilities of the future, I’m going to argue, we should also consider the creative possibilities of the past. And while some of us might tend to think of historical research as a dry scholarly pursuit—the stuff of dusty archives and plodding monographs—in truth the practice of history offers us an incredibly rich creative landscape, and is fast becoming the site of a remarkable technological revolution that could fundamentally reshape the way we understand the world and our place in it.

In the world of futures studies, there’s a well-known metaphor called the futures cone, a simple visualization that points to the inherent difficulty of predicting the future. Put simply: the further out you look, the less certain things become. And while none of us can really predict the future, what we can do is prepare for the possibilities of change.

A few years ago, Greek futurist Epaminondas Christophilopoulos proposed a revised version of the futures cone, which he dubbed “The Cones of Everything.” Here we see not just a future-forward perspective, but a twinned concept of divergent possibilities taking shape in two directions at once: towards both the future and the past. Past and future are deeply intertwined, constantly informing each other. And by rethinking our conceptions of the past, we can also imagine new universes of future possibilities that might otherwise elude us.

Christophilopoulos, 2021

Our perceptions of past and future are also always and inevitably anchored in the present tense, shaped by the language, concepts, and mental constructs that we bring to bear from our current historical vantage point. Everything is a moving target.

It has by now become a tired truism that we live in an age of data. But like all truisms, this one is, well, fundamentally true. The term “data,” as commonly used today, is a distinctly twentieth-century construct, one that emerged in tandem with the rise of digital computing starting in the 1940s. But the underlying challenge of collecting and managing large bodies of information long predates our present digital age.

The pre-history of big data

Library of Ashurbanipal, circa 612 BCE. British Museum

As far back as the seventh century BCE, the Assyrian king Ashurbanipal made a concerted effort to collect every last written artifact from across his empire, sending out envoys to commandeer every piece of extant writing and consolidate it into a centralized collection. The Assyrian imperial library was the first of many such attempts throughout history. As I have written elsewhere, the historical relationship between recorded knowledge and political power runs deep.

Bayt al-Hikmah, or the House of Wisdom, circa 8th century | Wikipedia

History is rife with other examples of attempts to construct universal libraries for the sake of accruing epistemic — and therefore political — power: the Library at Alexandria, Emperor Shi Huangdi’s imperial archives, the great medieval Library of Baghdad, King Charlemagne’s library, the Sorbonne, and so on.

Giulio Camillo’s Theatre of Memory, circa 1519

With the emergence of the printing press in Europe starting in the fifteenth century, the proliferation of written texts inspired several early proto-information scientists to wrestle with the problems of distributed knowledge management using a variety of ingenious new methods: Giulio Camillo’s Theater of Memory, Leibniz’s Excerpt Cabinet, and Conrad Gessner’s Universal Bibliography, to name a few.

Leibniz’s Excerpt Cabinet, from Placcius 1689, p. 152.

Starting in the nineteenth century, the challenges of information management took on new levels of complexity with the advent of steam-powered printing presses, cheap wood-pulp paper, and the spread of communications networks like the international post, railways, and telegraphs. The ensuing industrial information explosion soon dwarfed the Gutenberg revolution in sheer intellectual output, paving the way for a new generation of information pioneers who began to imagine ways of engaging with written texts that presaged the advent of distributed computer networks.

Telegraph room, Palais Mondial, circa 1924. Courtesy of the Mundaneum

It was during this time that information-science visionaries like Paul Otlet, H.G. Wells, Vannevar Bush, and others began to posit new methods and tools for managing humanity’s collective intellectual capital (or at least the subset of human knowledge that has ever been written down). Though these early thinkers worked in an era before digital computers, microchips, or electromagnetic storage, their ideas presage some of the foundational concepts of machine learning and neural networks. Both Otlet and Wells envisioned something like autonomous “agents” (albeit human ones) who would collect and curate information and create new synthetic forms of knowledge. Wells called these imagined proto-knowledge workers “samurai.”

Why should we care about these pre-digital efforts at building large-scale knowledge collections? While we should be wary of anachronistic historical interpretations, I believe these precursor efforts hold important lessons for the contemporary challenges of machine learning and large language models. As Ryan Cordell and others have argued, large language models are fundamentally bibliographical in nature. For millennia, librarians and archivists have been wrestling with the challenges of acquiring, classifying, and pattern-matching across large bodies of written information. So it should come as no surprise that historical collections of documents and archival material provide a vibrant petri dish for new modes of research and meaning-making in today’s age of machine learning systems.

Algorithmic histories

With these historical antecedents in mind, let’s fast-forward a century or so to the present day. As more and more primary source material is scanned and digitized, machine learning tools are enabling scholars to “zoom out” and look for connections across large bodies of archival material: making predictions, forming conjectures, and juxtaposing different historical data sets to reveal hidden patterns of meaning. As a result, new forms of historical research are starting to take shape.

Working variously under the mantles of “cultural analytics,” “data archaeology,” or “speculative bibliography” (or the broader umbrella of “digital humanities”), a new breed of technically minded historians is starting to emerge. Bringing a more data-centric, quantitative mode of inquiry to their source material, they are taking advantage of scanned and digitized collections, using machine learning methods to analyze and summarize them, and applying semantic neural-network analysis to search for patterns across documents and other sources, mapping connections and networks of influence that can shed surprising new light on any number of historical topics.
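
To make that last point a bit more concrete, here is a minimal sketch of the kind of semantic similarity search these projects rely on: embed a handful of digitized passages and compare them pairwise. The library, model name, and sample passages are illustrative assumptions on my part, not details drawn from any particular project.

```python
# A minimal sketch of semantic similarity search across digitized passages.
# Assumes the sentence-transformers library; the model name and the sample
# snippets below are illustrative placeholders, not drawn from any real archive.
from sentence_transformers import SentenceTransformer, util

passages = [
    "A notarial record of a grain shipment arriving at the Rialto, 1434.",
    "Customs ledger noting wheat unloaded at the Venetian docks in the 1430s.",
    "A printer's inventory listing type, presses, and reams of paper, 1502.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model
embeddings = model.encode(passages, convert_to_tensor=True)

# Cosine similarity between every pair of passages: higher scores suggest
# thematically related documents that might merit a historian's closer look.
scores = util.cos_sim(embeddings, embeddings)
for i in range(len(passages)):
    for j in range(i + 1, len(passages)):
        print(f"passages {i} and {j}: similarity {scores[i][j].item():.2f}")
```

The appeal of this approach is that related documents surface even when they share no exact keywords, which is precisely the kind of pattern-matching that becomes impossible to do by hand at archival scale.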

A few examples that have crossed my radar recently:

  • Venice Time Machine
    A multi-year effort to digitize and analyze Venice’s public government archives, reconstructing 1,000 years of the city’s urban, social, and economic history. Through advanced text recognition and 3D mapping, it creates an open-access temporal model of Venice that supports historical reconstructions in interactive 2D and 3D formats.
  • Sphaera Project
    Analyzing more than 350 texts from the 1500s on the concept of celestial spheres, this project reveals how these concepts evolved over time through networks of influence between disparate documents published in different European cities over more than a century.
  • Project Sefaria
    An effort to trace the development of Jewish narratives over several centuries (e.g., the Book of Genesis), enabling users to explore the interweaving of thematic, theological, and narrative influences in early Jewish literature.
  • MARKUS
    Linking geographic data to early Chinese novels, this project maps references to physical places across classical texts. By combining text-mining with geospatial data, it reveals the patterns of geographic influence that shaped early Chinese literature.
  • Viral Texts
    This project analyzes a large data set of early American newspapers, using text-mining techniques to explore how editors shared news articles and other items across publications through the common practice of exchanging newspapers via the postal network and “scissors and paste”-style editing (a toy sketch of this kind of text-reuse detection follows this list).
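
As a rough illustration of the kind of text-reuse detection a project like Viral Texts builds on, here is a toy sketch that compares two passages by the overlap of their five-word shingles. The sample snippets and the bare-bones Jaccard measure are my own illustrative assumptions, not the project’s actual pipeline.

```python
# Toy sketch of text-reuse detection in the spirit of Viral Texts: compare two
# newspaper items by the overlap of their five-word "shingles." The snippets and
# the simple Jaccard measure are illustrative only, not the project's real pipeline.
import re

def shingles(text: str, n: int = 5) -> set:
    """Return the set of n-word shingles from a lightly normalized text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: intersection size over union size."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Two hypothetical reprinted items; a high score flags likely text reuse.
item_1 = ("The Hon. Mr. Clay delivered a speech of great length upon the "
          "tariff question, to the evident satisfaction of the gallery.")
item_2 = ("Mr. Clay delivered a speech of great length upon the tariff "
          "question yesterday, to the evident satisfaction of the gallery.")
print(f"Shingle overlap: {jaccard(shingles(item_1), shingles(item_2)):.2f}")
```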

Patchwork histories and their pitfalls

For all the excitement bubbling up around this work in the academy, historically focused data analysis and interpretation comes with any number of challenges and ethical watch-outs. Not least among these is the problem of “presentism,” or temporal bias. The vast majority of the training data that feeds today’s commercial large language models was created in just the past 15 years or so. These tools are simply not yet trained on a sufficiently large corpus of older material, leading to all kinds of implicit cultural biases.

Most of these models also harbor a strong bias towards Western perspectives, given their particular provenance. As a result, they can and do fail to consider non-Western perspectives and narratives. Non-Western regions and languages are woefully underrepresented. And an even more vexing problem looms as well: the vast majority of human knowledge generated in the past 100,000 years has never been written down. How do we expand the historical record without falling into the trap of privileging and perpetuating literate perspectives that tend to reflect the views of people in power over the ages?

Like all applications of generative AI tools, these kinds of projects are also rife with the risk of hallucinations, a risk that becomes even more pronounced when working with relatively limited data sets drawn from a single context. To the extent that they rely on large language models, these projects run the risk of turning into “stochastic parrots”: perpetuating incorrect information or creating entirely fictitious narratives. That risk is further exacerbated when these tools try to extrapolate themes and patterns from a limited set of source documents.

Those caveats aside, these projects feel like the first wave of what is almost sure to be a revolution in historical understanding over the coming decades. I’m looking forward to delving further into this domain over the months ahead as I embark on a new research project (which I hope to share more about soon). We should stay cognizant of the limits of data-driven approaches to historical inquiry, and recognize that history can never be reduced to simple mathematical formulae. Nonetheless, these tools offer us a powerful new lens for rethinking the past, one that combines the best of data analysis with the speculative and storytelling skills that are the historian’s stock in trade. As the historian Carl Becker memorably put it: “History is an imaginative creation.”


Written by Alex Wright

UX Leader at Google; author of Informatica and Cataloging the World. See also: www.alexwright.com
