Mapping library collections with Folium
Crafting interactive, location-based narratives with the Library of Congress API
This technical tutorial was designed for the Library of Congress with the intention of providing digital humanities scholars with an introduction to geodata manipulation and visualization in Python. It also details relevant aspects of digital collections management, including (importantly) how changes to data stewardship over the lifetime of a collection impact data quality and context.
Digital mapping has become an increasingly accessible and valuable complement to traditional interpretive narratives. Working with spatially referenced data offers exciting possibilities for place-based scholarship, outreach, and teaching. It's also a perfect avenue for interdisciplinary collaboration between, say, humanities researchers new to GIS and spatial scientists who've been using it for decades.
Embedded within digital collections available from the Library of Congress website are geographic data, including the locations of items and their local contexts. We can gather those data programmatically (using Python in this case) and plot them on a map, like so:
The story of these data would be incomplete, however, without a critical understanding of the history behind their collection and stewardship. In this tutorial, we demonstrate how loc.gov JSON API users can find and store spatial information from Library content with an awareness toward data quality, provenance, and why this broadened scope is important for informing research projects at the Library.
Historic Buildings as Data
As a graduate student of applied urban science, I was inspired at the outset of my internship with LC Labs to discover content about the built environment across US cities on the Library website. What I found was an expansive dataset of digitized photographs, drawings, and reports recognized collectively as the HHH, which includes material from three programs:
- Historic American Buildings Survey (HABS)
- Historic American Engineering Record (HAER)
- Historic American Landscapes Survey (HALS)
Take this view of the uptown platform at 79th Street in NYC for example:
I decided to dig a bit deeper into the engineering record because it appeared to have the best coverage of the three for spatial references. The Historic American Engineering Record, or HAER, was established in partnership by the National Park Service, the American Society of Civil Engineers, and the Library of Congress in 1969. There are more than 10,000 HAER surveys of historic sites and structures related to engineering and industry. The collection is an ongoing effort with established guidelines for documentation — HAER was created to preserve these structures through rule-based documentation, and those documents have in turn been preserved through time.
Collections in Context
The LC uses two systems to organize HABS/HAER/HALS documentation. The newer system uses the survey number as the call number; the older system assigns each survey a call number based on its location (state/county/city). For example, HABS AL-654 has the call number HABS ALA,1-PRAVI.V,1-, which breaks down as follows (a rough parsing sketch follows the list):
- ALA = Alabama
- 1 = Autauga County (each state's counties are assigned numbers in alphabetical order)
- PRAVI = Prattville
- .v = in the vicinity of a given city/town
- 1- = first place in the vicinity of Prattville surveyed
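To make the older scheme concrete, here is a minimal sketch of pulling that example call number apart in Python. The parsing rules are simplified for illustration only; real HABS/HAER/HALS call numbers vary more than this handles.

# a rough, illustrative breakdown of the location-based call number
call_number = "HABS ALA,1-PRAVI.V,1-"
program, location = call_number.split(" ", 1)        # 'HABS', 'ALA,1-PRAVI.V,1-'
state_county, place_part = location.split("-", 1)    # 'ALA,1', 'PRAVI.V,1-'
state, county_number = state_county.split(",")       # 'ALA', '1'
place, sequence = place_part.rstrip("-").split(",")  # 'PRAVI.V', '1'
in_vicinity = place.endswith(".V")                   # True: surveyed near, not in, Prattville
print(program, state, county_number, place, sequence, in_vicinity)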
Some places documented in rural and unincorporated areas (and even some in urban areas) have good site map, UTM, or decimal degree data; others don't. When a place can't be located, even to the vicinity, or in the rare cases when an address is restricted, city centroid points are used. The National Park Service's Cultural Resources GIS Program is currently working on a project to create an enterprise dataset that includes all HABS/HAER/HALS surveys, as they've done for the National Register of Historic Places.
The NPS guidelines for surveys didn’t initially include spatial data in the way that it exists today. Good site maps are the best data available for surveys from the 1930s. HABS (and HAER) guidelines were later updated to request Universal Transverse Mercator (UTM) coordinates, a global system of grid-based mapping references. All three programs now ask for decimal degree data (in order to comply with the NPS Cultural Resource Spatial Data Transfer Standards), though they still receive data in UTM (and in some cases no geographic reference at all).
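As an aside on formats: if you run into UTM coordinates in older survey documentation, a library such as pyproj can convert them to decimal degrees. The sketch below is illustrative only; the easting, northing, and zone (16N, which covers much of Alabama) are made-up values, and pyproj isn't needed for the rest of this tutorial.

from pyproj import Transformer
# hypothetical UTM coordinates (zone 16N, NAD83), for illustration only
easting, northing = 531000.0, 3614000.0
# build a transformer from UTM zone 16N (EPSG:26916) to latitude/longitude (EPSG:4326)
transformer = Transformer.from_crs("EPSG:26916", "EPSG:4326", always_xy=True)
longitude, latitude = transformer.transform(easting, northing)
print("decimal degrees: {:.5f}, {:.5f}".format(latitude, longitude))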
Data transmitted to the Library of Congress by contributor Justine Christianson, a HAER Historian with the National Park Service, is particularly rich for our purpose of visualizing the spatial distribution of collection items. Over the past few years, she has reviewed all of the HAER records to index them and assign decimal degree coordinates. Because she is often involved in finalizing HAER documentation before it goes to the Library, she is often listed as contributor in the record metadata. Justine has created spatial data (not to mention other data improvements) for many more HAER surveys than her name is attached to. The subset chosen for this tutorial, then, reflects a certain signature of her involvement in developing standards of documentation, verifying historical reports, and performing scholarly research on material in the collection — nearly 1,500 items with accurate latitude and longitude attributes polished and preserved.
Tutorial
To get our data from the digitized HAER collection, we'll use the requests Python module to access the loc.gov JSON API. Reading in coordinates means our data needs to be reorganized, a task for the popular analysis package pandas. Finally, we'll do our visualization with folium to plot the locations on an interactive Leaflet map.
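If you're following along, all three libraries are available from PyPI; assuming you already have Python 3 and pip installed, a typical setup looks something like this:

pip install requests pandas folium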
Folium is a Python wrapper for a tool called Leaflet.js. With minimal instructions, it does a bunch of open-source JavaScript work in the background, and the result is a mobile-friendly, interactive 'Leaflet Map' containing the data of interest.
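As a minimal illustration (the coordinates below are approximately those of the Library of Congress and are only for demonstration), a basic map with a single marker takes just a few lines:

import folium
# center the map on Capitol Hill in Washington, DC (approximate coordinates)
demo_map = folium.Map(location=[38.8887, -77.0047], zoom_start=16)
folium.Marker([38.8887, -77.0047], popup='Library of Congress').add_to(demo_map)
demo_map.save('demo_map.html')  # open the saved HTML file in a browser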
Gathering item geography
Getting up to speed with use of the loc.gov JSON API and Python to access the collection was a breeze, thanks to existing data exploration resources located on the LC for Robots page.
Specifically, you can find tips on using the loc.gov JSON API from the ‘Accessing images for analysis’ notebook.
Many of the prints & photographs in HAER are tagged with geographic coordinates (‘latlong’). Using requests, we can easily ‘get’ data for an item as JSON and parse it for our latlong:
import requests

get_any_item = requests.get("https://www.loc.gov/item/al0006/?fo=json")
print('latlong: {}'.format(get_any_item.json()['item']['latlong']))
To retrieve this sort of data point for a set of search results, we’ll borrow the get_image_urls function from the aforementioned notebook. This will allow us to store the web address for each item in a list, working through the search page by page.
def get_image_urls(url, items=[]):
    '''
    Retrieves the URLs for items that have public URLs available.
    Skips over items that are for the collection as a whole or web pages about the collection.
    Handles pagination.
    '''
    # request pages of 100 results at a time
    params = {"fo": "json", "c": 100, "at": "results,pagination"}
    call = requests.get(url, params=params)
    data = call.json()
    results = data['results']
    for result in results:
        # don't try to get items from the collection-level or web page results
        original_format = result.get("original_format", [])
        if "collection" not in original_format and "web page" not in original_format:
            # store the item's URL (its 'id' field)
            item = result.get("id")
            items.append(item)
    if data["pagination"]["next"] is not None:  # make sure we haven't hit the end of the pages
        next_url = data["pagination"]["next"]
        # print("getting next page: {0}".format(next_url))
        get_image_urls(next_url, items)
    return items
To demonstrate with our subset of HAER listed under 'Justine Christianson', I'll use a search that targets items from HAER with her name listed as the contributor.
url = "https://www.loc.gov/search/?fa=contributor:christianson,+justine&fo=json"# retrieve all image URLs from the search results
image_urls = get_image_urls(url, items=[])# create an empty set to store our latlongs
spatial_set = set()# the parameters we set for our API calls taken the first function
p1 = {"fo" : "json"}# loop through the item URLs
for img in img100:
# make HTTP request to loc.gov API for each item
r = requests.get(img, params=p1)
# extract only from items with latlong attribute
try:
# expose in JSON format
data = r.json()
# parse for location
results = data['item']['latlong']
# add it to our running set
spatial_set.add(results)
# skip anything with missing 'latlong' data
except:
# on to the next item until we're through
pass
Pausing for reflection
So out of the sample of 100 HAER item URLs that we looped through to pull out spatial references, we ended up with a set of 75 latitude and longitude pairs. Not bad! This is certainly not perfect as far as data coverage is concerned, but given what we learned earlier about the lineage of preservation with this collection and dynamics of stewardship, I feel as though we have enough information for a meaningful demonstration and reasonable confidence in the quality of that data to proceed with the dive.
Something to notice, however, is how these data are currently formatted. Each latitude and longitude pair is glued together as a single string. This isn’t how Folium will want to read in coordinates, so as a next step we’ll need to rework them a bit before we get to mapping.
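To preview the shape of that transformation with a made-up example (the values below are hypothetical, not drawn from the collection), a single latlong string becomes a two-element numeric list:

latlong = '38.8887, -77.0047'  # one comma-separated string, as returned by the API
lat, lon = [float(x) for x in latlong.split(',')]
print([lat, lon])  # [38.8887, -77.0047], the form folium expects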
Data manipulation
We've mined out the locations of a digital subset from the HAER collection. Now we'll restructure it with the popular pandas package.
import pandas as pd

# convert latlong set to list
latlong_list = list(spatial_set)
# convert list to pandas dataframe
df = pd.DataFrame(latlong_list)
# split coordinates into two columns
df = df[0].str.split(',', expand=True)
# rename columns with latitude and longitude
df = df.rename(columns={0: 'latitude', 1: 'longitude'})
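It's worth a quick sanity check before mapping; the exact contents will depend on what your own search returned:

print(df.shape)   # (number of coordinate pairs, 2)
print(df.head())  # first few latitude/longitude rows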
Geovisualization
As was called out at the start of the tutorial, the open-source tool folium builds on our earlier data wrangling with pandas and the mapping strengths of the Leaflet.js library to create an interactive experience.
import folium

# convert the coordinate strings to floats and the dataframe back to a list of [lat, lon] pairs
locations = df.astype(float).values.tolist()
# picking a spot in the midwest to center our map around
COORD = [35.481918, -97.508469]
# folium uses lat then long; the bigger the zoom number, the closer in you get
map_haer = folium.Map(location=COORD, zoom_start=3)
# add a marker to the base map for every coordinate pair in the list
for loc in locations:
    folium.CircleMarker(loc, radius=1, color='#0080bb', fill_color='#0080bb').add_to(map_haer)
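In a Jupyter notebook, evaluating map_haer in a cell renders the interactive map inline; outside a notebook, you can write it out as a standalone HTML file:

map_haer.save('haer_map.html')  # open this file in any web browser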
Conclusion
As tools for digital scholarship improve, proliferate, and take hold, we will continue to see interesting questions emerge that make use of new and existing spatial data. GIS has proven its capability to expand humanities research; however, many humanists have yet to incorporate this "spatial turn" into their research. Answering questions about why and how researchers can actually use such data requires a critical understanding of these tools and their outputs.
This tutorial was developed to help beginners get things done when integrating their disciplinary information into a geospatial format. Using digital collections from the Library of Congress API, we touched on foundational data skills such as how to collect and organize historic GIS data, how to deal with data in different formats, how to clean up data, and how to visualize disciplinary data with an interactive digital map. What ties this all together, though, is the ability to evaluate the quality of external data with respect to the motivation, context, and change over time of its stewardship. This type of awareness, while familiar to those engaged in more traditional research, is a critical piece of any project looking to leverage ever-growing and diversifying digital resources.
I’d like to thank Mary McPartland from NPS and Kit Arrington from the Library of Congress for their guidance on the HABS/HAER/HALS collections and the nuanced history (and future!) of its evolution. I’d also like to acknowledge Laura Wrubel, whose LC Labs resources were instrumental in setting up my own projects, and Meghan Ferriter, my coach and mentor throughout the LC Labs internship.