Final Report on LatAm: a Historical Gazetteer of Colonial Latin America and the Caribbean

Published in Pelagios · 6 min read · Jun 14, 2019

Ben W. Brumfield, 5 February 2019

Pelagios presents researchers with a fantastic set of tools for studying the world and linking texts, artifacts, and maps to historic places. However, not all the places of the past have been findable with the Pelagios set of tools. In the case of colonial Latin America, if a researcher was studying Acamuchitlan in Nueva España — or most of the rest of colonial Latin America — they had a problem: their search within the Pelagios network would return “0 Results”.

However, just because data aren’t online doesn’t mean they don’t exist. Print resources for studying colonial Latin American geography and history are well represented by Antonio de Alcedo’s Diccionario Historico-Geographico, a five-volume dictionary published between 1786 and 1789, and by George Thompson’s 1812 English translation and expansion of Alcedo. As it happens, these sources have been digitized, but converting pixels to machine-readable geographical data is a challenge.

Title pages for Alcedo and Thompson

The LatAm Pelagios Resource Development Grant team coordinated researchers across three continents and five organizations: HD CAICYT in Buenos Aires, Argentina; LLILAS Benson Latin American Studies and Collections in Austin, Texas; the World-Historical Gazetteer in Pittsburgh, Pennsylvania; HGIS de las Indias in Graz, Austria; and Brumfield Labs in Austin, Texas. Before the grant, these projects were pursuing the goal of encoding Alcedo/Thompson separately, each following slightly different methodologies for cleaning OCR and extracting geographical entities. Furthermore, the goals of the projects were similar but not identical. Geo-aware digital scholarly editions, linked data gazetteers, and HGIS resources may share many methods and data, but they are very different scholarly products. As a result, each project had made independent decisions about which entries to encode, which toponyms within entries needed tagging, and what methodology to follow.

Summary of participating projects’ methodologies

  • NER + OCR Clean-up to CSV (WHG)
  • NER + OCR Clean-up to e-text (CAICYT-HD and LLILAS Benson)
  • TEI-XML to HGIS (HGIS de las Indias)
  • OCR Correction + Tagging to CSV (LLILAS Benson and Brumfield Labs)

Textual Challenges

ACAMUCHITLAN, a settlement of the head settlement of the district of Texopilco, and alcaldía mayor of Zultepec. It contains 60 Indian families, whose commerce is in sugar and honey. It produces also maize, and cultivates many vegetable productions. — Five leagues N of its head settlement.

All efforts to process historical gazetteers face a common challenge: extracting geographical hierarchies from a text that contains its own hierarchy. This entry from Thompson’s version of Alcedo has a textual hierarchy in which the headword, Acamuchitlan, is primary, while the other toponyms mentioned within the text of the entry are secondary to it and provide context for the headword. However, the situation is more complicated, since this entry describes an administrative hierarchy in which Acamuchitlan is actually the smallest unit, while secondary toponyms like Texopilco and Zultepec represent higher-level units which contain Acamuchitlan (as well as other units). Any attempt to extract geographical hierarchy from these gazetteers requires a reversal of the textual hierarchy combined with identification of the types of containing entities. Further complicating this process are toponyms that appear within an entry but are not directly part of the administrative hierarchy. These can include neighboring towns, geographic features, or (in the case of entries on indigenous communities) previous locations. The question of how to encode non-containing entities presented recurring problems throughout the project.
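To make the reversal concrete, here is a minimal sketch in Python. It assumes that containing units are introduced by phrases of the form "<unit type> of <Name>"; the list of unit types, the regular expression, and the function name `extract_containers` are illustrative assumptions, not the encoding rules the project actually used.

```python
import re

# Assumed (illustrative) unit types that introduce a containing entity.
UNIT_TYPES = ["district", "alcaldía mayor", "province", "corregimiento"]

# Matches "<unit type> of <Capitalized name>".
CONTAINER_RE = re.compile(
    r"\b(" + "|".join(re.escape(u) for u in UNIT_TYPES) + r") of ([A-Z][\w\-]+)"
)

def extract_containers(entry):
    """Turn 'HEADWORD, description...' into a place with its containing units."""
    headword, _, body = entry.partition(",")
    containers = [
        {"type": m.group(1), "name": m.group(2)}
        for m in CONTAINER_RE.finditer(body)
    ]
    # The headword is textually primary but administratively the smallest unit,
    # so the secondary toponyms are recorded as its containers.
    return {"place": headword.strip().title(), "contained_in": containers}

ENTRY = (
    "ACAMUCHITLAN, a settlement of the head settlement of the district of "
    "Texopilco, and alcaldía mayor of Zultepec. It contains 60 Indian families."
)

result = extract_containers(ENTRY)
print(result)
```

A pattern this simple would not distinguish neighboring towns or geographic features from containing units, which is exactly the class of non-containing toponyms that caused recurring problems.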

An additional challenge presented by the text was anaphora: replacing a headword with a reference to an entry occurring earlier in the text. Because the gazetteer is organized alphabetically, we would see an entry on San Augustin followed by other unrelated entries whose headwords were replaced with “of the same name.” These anaphors prove especially challenging for automatic approaches to encoding.

Acamuchitlan text and encoding

While consistency of encoding is essential to digital gazetteer creation, establishing that consistency in light of the differing perspectives of the member projects and the challenges of the texts themselves caused substantial delays and wasted effort. We strongly recommend that any future projects in this area begin by dedicating early working meetings to collaboratively encoding a few pages of the text together so that the difficulties of the text and the strategies for addressing them can be discussed by all team members, with concrete examples in front of them.

Methods

Drawing on their independent histories, the projects used a combination of methodologies to encode Alcedo/Thompson into formats which could be used for digital gazetteers.

The teams in Austin and Buenos Aires used the open-source digital edition tool FromThePage to correct and tag the OCR, and to correlate the Spanish entries with Thompson’s version. The Pelagios RD grant funded enhancements to FromThePage to support encoding lexicographical text like gazetteers, including new support for headword tagging and for extracting the text of an entry even across column breaks or page breaks. To make this encoded text usable for historical gazetteers, the subject export feature was enhanced to contextualize every tagged toponym with the headword of the entry, the administrative category of that toponym, and a URI for that place.

During the software enhancement and early encoding efforts, the team came into contact with Werner Stangl and his HGIS de las Indias. This is a full historical GIS rather than a gazetteer, and it had already encoded a large proportion of the entries in Alcedo. Working together substantially accelerated the computational approach, in which we pursued data extraction and correlation in parallel with the digital edition effort. Our combined approach consisted of identifying entities of interest for the gazetteer (land forms, different types of settlements, indigenous lands and people). This was accomplished by: (i) looking for occurrences of certain keywords following the headwords of the Alcedo entries; (ii) taking into consideration synonyms and spelling variation in the keywords; and (iii) accounting for anaphora (‘Tiene el mismo nombre’, ‘Otro pueblo’, etc.) between entries that had the same name but represented different entities (a settlement and a river, a province and a capital, two cities in different regions, etc.).
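The three steps can be sketched roughly as follows in Python. This is not the Buenos Aires team's actual code: the keyword table, the anaphora list, and the function names are assumptions chosen for illustration, and a real keyword list would be far longer.

```python
import re
import unicodedata

# Assumed keyword variants (step ii) mapped to a normalized entity type.
KEYWORDS = {
    "pueblo": "Pueblo",
    "laguna": "Laguna",
    "lago": "Laguna",
    "rio": "Río",
    "ciudad": "Ciudad",
}

# Anaphoric openings (step iii), taken from the examples quoted above.
ANAPHORA = ("tiene el mismo nombre", "otro pueblo")

def normalize(text):
    """Lowercase and strip accents so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def first_keyword(text):
    """Step (i): return the type of the first known keyword in the text."""
    for word in re.findall(r"[a-z]+", normalize(text)):
        if word in KEYWORDS:
            return KEYWORDS[word]
    return None

def classify(entries):
    """Assign a headword and an entity type to each entry, in order."""
    results = []
    last_headword = None
    for entry in entries:
        if normalize(entry).startswith(ANAPHORA):
            # Anaphoric entry: reuse the most recent explicit headword.
            headword, body = last_headword, entry
        else:
            head, _, body = entry.partition(",")
            headword = last_headword = head.strip().title()
        results.append((headword, first_keyword(body)))
    return results

entries = [
    "COMAS, Pueblo de la Provincia y Corregimiento de Xauxa en el Perú.",
    "Tiene el mismo nombre una laguna de la Provincia y Gobierno de Venezuela.",
]
classified = classify(entries)
print(classified)  # [('Comas', 'Pueblo'), ('Comas', 'Laguna')]
```

Taking only the first keyword is a deliberate simplification: later keywords in an entry (such as "Provincia" here) usually name containing units, not the entity being described.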

For example, Alcedo’s dictionary presents two entries corresponding to the headword COMAS:

COMAS, Pueblo de la Provincia y Corregimiento de Xauxa en el Perú.
Tiene el mismo nombre una laguna de la Provincia y Gobierno de Venezuela, de figura oval, entre el rio Guarico y la jurisdicción que divide este Gobierno del de Cumaná.

After matching these entries with the registries in the HGIS de las Indias database for adding parent territory information and unique identifiers, the information was encoded as follows:

04359;6001139;Comas;Pueblo;Pueblo;Corregimiento de Xauxa;JUPETAJA
04360;;Comas;Lago;Laguna;Venezuela;CGVE0000
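Records in this semicolon-delimited form are easy to load for further processing. The sketch below uses Python's standard `csv` module. The column names in `FIELDS` are hypothetical, since the field layout is not documented here; they are guesses based on the two example records.

```python
import csv
import io

# Hypothetical column names: the actual field layout is not documented.
FIELDS = [
    "entry_id", "hgis_id", "name", "category",
    "term_in_text", "parent", "parent_code",
]

RAW = """04359;6001139;Comas;Pueblo;Pueblo;Corregimiento de Xauxa;JUPETAJA
04360;;Comas;Lago;Laguna;Venezuela;CGVE0000
"""

def parse_records(text):
    """Parse semicolon-delimited records, turning empty fields into None."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS, delimiter=";")
    return [{key: (value or None) for key, value in row.items()} for row in reader]

records = parse_records(RAW)

# Entries with an empty second field have no identifier in that column yet.
unmatched = [r["entry_id"] for r in records if r["hgis_id"] is None]
print(unmatched)  # ['04360']
```

Keeping the identifiers as strings preserves the leading zeros in values like 04359.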

The HGIS database also allowed the team to identify ambiguities, where an entry did not contain enough information to determine which of several settlements with the same name its headword referred to (cf. the entry CHORILLOS and the two registries of the same name, one in the district of Huarochiri and the other in the district of Lima).
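A check for this kind of ambiguity could look something like the following Python sketch. Everything specific in it is assumed for illustration: the record structure, the placeholder identifiers (which are not real HGIS de las Indias IDs), the example entry texts, and the rule of narrowing candidates by whether their district is named in the entry.

```python
from collections import defaultdict

# Hypothetical records: the identifiers are placeholders, not real HGIS IDs.
HGIS_RECORDS = [
    {"id": "placeholder-1", "name": "Chorillos", "district": "Huarochiri"},
    {"id": "placeholder-2", "name": "Chorillos", "district": "Lima"},
]

def build_index(records):
    """Group records by case-insensitive name."""
    index = defaultdict(list)
    for rec in records:
        index[rec["name"].casefold()].append(rec)
    return index

def match_headword(headword, entry_text, index):
    """Return (status, candidates) for a headword and its entry text."""
    candidates = index.get(headword.casefold(), [])
    if not candidates:
        return "unmatched", []
    # Prefer candidates whose district is named in the entry text.
    text = entry_text.casefold()
    narrowed = [c for c in candidates if c["district"].casefold() in text]
    if len(narrowed) == 1:
        return "matched", narrowed
    if len(candidates) == 1:
        return "matched", candidates
    return "ambiguous", narrowed or candidates

index = build_index(HGIS_RECORDS)

# Invented entry texts, used only to exercise both outcomes.
vague = match_headword("CHORILLOS", "a settlement of Peru.", index)
specific = match_headword("CHORILLOS", "a settlement of the district of Lima.", index)
print(vague[0], specific[0])  # ambiguous matched
```

Entries flagged as ambiguous would then be left for a human to resolve.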

The methods adopted by the different teams contributed to efficiencies in each other’s work. For example, the deep understanding of the Alcedo text that Werner Stangl developed while creating HGIS de las Indias informed the encoding of Alcedo/Thompson, and the computational extraction techniques developed by the Buenos Aires team were reproduced by the Austin teams to programmatically suggest tags.

Results

Both the preliminary extracted data from Alcedo and the HGIS were ingested into a development copy of World-Historical Gazetteer as new data sets. This will substantially expand the coverage of the Western Hemisphere in WHG.

Existing WHG data represented in black; new data imported from HGIS de las Indias and Alcedo’s Diccionario in green.

During the course of the 2018 LatAm Pelagios RDG, the teams at LLILAS Benson and HD CAICYT encoded the entirety of Volume I of Thompson’s version and the corresponding entries from Alcedo’s original, cleaning OCR and tagging 5003 unique toponyms. Members at HGIS de las Indias, HD CAICYT, and WHG matched 2539 entries within Alcedo’s text to entries within HGIS de las Indias, ingesting them into the World Historical Gazetteer. We hope to continue this work in 2019, using the processes we have explored through this grant to complete the encoding of Alcedo and Thompson, and to incorporate them into the WHG.

The LatAm team is

  • HD CAICYT: Gimena del Rio Riande, Nidia Hernández, Romina De León
  • LLILAS Benson / University of Texas-Austin: Albert Palacios, Jennifer Isasi, Joshua Ortiz Baco, Karla Roig
  • World Historical Gazetteer: Karl Grossner
  • HGIS de las Indias: Werner Stangl
  • Brumfield Labs: Ben Brumfield, Sara Carlstead Brumfield, Jack Weinbender
