Deriving hospital travel times with population-weighted sampling
April 21, 2020 by Eric Buth
To explore more data on COVID-19, please go to covid19.topos.com
One of the early stories to emerge concerning COVID-19 was the vulnerability of certain U.S. communities that lack access to the critical medical care typically provided in hospitals. Today, Navajo County, Arizona, home to three Native American reservations (the Navajo Nation, the Hopi Indian Reservation, and the Fort Apache Indian Reservation), is experiencing one of the country’s highest per capita rates of COVID-19 (435 cases per 100K people). What makes the county’s high rate especially alarming is the lack of points of medical care able to treat serious COVID-19 patients. To understand vulnerability through the lens of access to critical medical care, we generated a new feature, “median distance to the nearest hospital,” for all counties in the U.S. Here we detail our methodology for generating this feature at the county level.
Working effectively with geographic data from multiple sources often requires a strategy for translating between geographic units. This translation is a non-trivial technological challenge, but in the case of COVID-19 it can prove important in answering simple but critical questions such as: are there enough medical resources available to serve the number of COVID-19 patients in a given area?
At Topos, we maintain a large amount of categorized information about the locations of businesses, public institutions, etc., sometimes referred to as “points of interest” or “POI.” Because POI are points in space, their geographic resolution is effectively infinite, so the core challenge is how to aggregate these points so they can be used alongside features that are only available for coarser geographic units such as counties and states.
The most straightforward way to approach this aggregation is to simply count the number of points that fall within a larger geography, a strategy we take with other relevant POI such as housing units. However, there are cases where these numbers don’t fully capture the relationship that people have with the resources being counted.
What if the nearest hospital to a large part of a county’s population is actually in a neighboring county or several counties away? What if most of the grocery stores in a county are located far away from where the residents of the county live? What if a county represents the outer suburbs of a major city or is abutted by national park land? Simply counting points doesn’t sufficiently capture their accessibility, which is particularly important for critical resources like hospitals, urgent care centers, pharmacies, grocery stores or schools.
Rather than simply counting points within a region, we may want to have a sense of how easily those points can be accessed. To this end, we often look not only at the geographic distance to these points but also calculate how long it takes to reach them via common modes of transportation (driving, walking, etc.).
In this project, we begin with the time it might take an ambulance to reach the nearest hospital that has in-patient services — that is, a hospital with bed count greater than zero.
To decide which of the thousands of hospital locations is nearest to a given address, we use an S2-based geospatial index, which lets us quickly search a radius around each address to build a candidate list. This step is important because of the amount of time and resources it would otherwise take to evaluate the travel time to every hospital in the country. Once we have the reduced list of locations that are close as the crow flies, we then need to determine which one is actually the quickest to drive to on available roads.
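The two steps described above (a cheap straight-line filter, then a comparison of routed travel times) can be sketched as follows. This is a minimal illustration under stated assumptions, not the production pipeline: it swaps the S2 index for a brute-force haversine filter, the hospital records and the 50 km radius are made up, and travel_minutes is a hypothetical stand-in for whatever routing API is used.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def candidate_hospitals(origin, hospitals, radius_km=50.0):
    """In-patient hospitals (beds > 0) within radius_km of origin, nearest first."""
    nearby = [
        (haversine_km(origin, h["location"]), h)
        for h in hospitals
        if h["beds"] > 0
    ]
    nearby = [(d, h) for d, h in nearby if d <= radius_km]
    nearby.sort(key=lambda pair: pair[0])
    return [h for _, h in nearby]

def quickest_hospital(origin, candidates, travel_minutes):
    """Candidate with the shortest routed travel time, or None if none is reachable.

    travel_minutes(origin, destination) is a hypothetical stand-in for a routing
    API call; it should return minutes, or None when no route exists.
    """
    best = None
    for h in candidates:
        minutes = travel_minutes(origin, h["location"])
        if minutes is not None and (best is None or minutes < best[1]):
            best = (h, minutes)
    return best

# Made-up records, for illustration only.
HOSPITALS = [
    {"name": "Hospital A", "location": (44.07, -121.27), "beds": 200},
    {"name": "Hospital B", "location": (44.27, -121.17), "beds": 50},
    {"name": "Hospital C", "location": (45.52, -122.68), "beds": 500},  # too far away
    {"name": "Clinic D", "location": (44.05, -121.30), "beds": 0},      # no in-patient beds
]
```

Calling candidate_hospitals((44.06, -121.31), HOSPITALS) keeps only Hospital A and Hospital B, so the routing function is queried twice rather than once per hospital in the dataset. Note that the straight-line nearest candidate is not necessarily the one quickest_hospital returns, which is why the routed comparison is still needed.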
For the same reasons we needed hospital candidate lists, we now need a method for limiting the number of addresses we use as origins, that is, the point A from which a routing API (here.com, Google Maps, Mapbox, etc.) computes the path to point B. To accomplish this, we construct a sample: a meaningful subset of addresses that roughly represents the entire county.
One way to build this sample would be to pick at random within a given county’s geographic boundaries. However, sampling in this manner risks significantly over-representing less densely populated areas. For example, in Deschutes County, Oregon, this approach results in selecting as many points from national forest land as from the city of Bend.
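To make the problem concrete, here is a minimal sketch of uniform sampling inside a boundary using rejection sampling. It assumes the county is a simple polygon given as planar (x, y) vertices, which ignores map projection; the L-shaped toy county and the one-unit "city" in its corner are invented for illustration.

```python
import random

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon's vertex ring."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def uniform_points_in_polygon(polygon, n, rng):
    """Draw n points uniformly by area: sample the bounding box, keep the hits."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = []
    while len(points) < n:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            points.append((x, y))
    return points

# Toy county: an L-shaped region of area 75, with a "city" in the unit square
# at the origin. Both shapes are made up for this example.
TOY_COUNTY = [(0, 0), (10, 0), (10, 5), (5, 5), (5, 10), (0, 10)]

def in_toy_city(point):
    return point[0] < 1 and point[1] < 1
```

Because the draw is uniform by area, the toy city receives only about 1/75 of the sampled points no matter how many people live there, which is exactly the over-representation of empty land described above.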
The resulting travel times to nearby hospitals appear to be evenly distributed. Intuitively, this seems wrong: human geographic organization tends to concentrate both resources and population, and Deschutes County has at least three cities that should push the distribution away from this apparent randomness. If county residents are significantly more likely to live in a city with a nearby hospital, we’d expect that to be reflected by a concentration of values around a lower median travel time.
We adjust our sampling strategy to account for this issue by using the population counts of census block groups, which are significantly smaller than counties. We’re not producing final metrics at such a low level, but we can use the more granular population counts to weight our random sample of starting points. Imagine that for every person in a county we put one marble, labeled with the block group where that person lives, in a bucket. To get a population-weighted sample, we repeatedly pick a marble from the combined bucket — replacing it each time — and note the label.
The effect is that every person has an equal chance of being selected, even though the resulting block group counts are not themselves equal. Once we have constructed this list of block groups, we then pick a random address within the geographic bounds of each one.
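The marble-and-bucket procedure maps directly onto weighted sampling with replacement. The sketch below is a minimal illustration using Python's random.choices; the block group names, populations, and rectangular bounds are invented, and drawing a point from a bounding box is a simplification of picking a real address inside the block group.

```python
import random

def sample_block_groups(populations, n, rng):
    """Draw n block group IDs with replacement, weighted by population.

    Equivalent to the bucket of marbles: one marble per resident, each
    labeled with that resident's block group.
    """
    ids = list(populations)
    weights = [populations[bg] for bg in ids]
    return rng.choices(ids, weights=weights, k=n)

def sample_origins(populations, bounds, n, rng):
    """For each sampled block group, pick a random point inside its bounds.

    bounds maps a block group ID to (min_x, min_y, max_x, max_y); a real
    pipeline would use the block group polygon or an address list instead.
    """
    origins = []
    for bg in sample_block_groups(populations, n, rng):
        min_x, min_y, max_x, max_y = bounds[bg]
        origins.append((bg, rng.uniform(min_x, max_x), rng.uniform(min_y, max_y)))
    return origins

# Made-up toy county: a small, dense block group and a large, sparse one.
POPULATIONS = {"urban": 9000, "rural": 1000}
BOUNDS = {"urban": (0.0, 0.0, 1.0, 1.0), "rural": (1.0, 0.0, 10.0, 10.0)}
```

With these toy numbers, roughly 90% of sampled origins land in the urban block group even though it covers a small fraction of the area, the reverse of what uniform-by-area sampling produces.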
This sampling strategy now shows points clustered around the three population centers of Deschutes County (Bend, Redmond, and Sisters), with some outliers along Highway 97. The hospital travel time values now form something closer to a normal distribution, with a median around 13 minutes, a stark difference from the likely misleading 50 minutes of the previous example.
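The final aggregation step, reducing each county's sampled travel times to a single median, can be sketched as below. The function name and the (county, minutes) input format are assumptions made for illustration, and the numbers in the usage note are toy values, not results from this analysis.

```python
from collections import defaultdict
from statistics import median

def median_minutes_by_county(samples):
    """Map each county to the median of its sampled travel times in minutes.

    samples is an iterable of (county, minutes) pairs, one per sampled origin.
    """
    by_county = defaultdict(list)
    for county, minutes in samples:
        by_county[county].append(minutes)
    return {county: median(values) for county, values in by_county.items()}
```

For example, toy samples of 10, 12, 13, 15, and 40 minutes for one county give a median of 13: the single long drive barely moves the result, which is one reason to summarize with a median rather than a mean.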
With our Median Distance to Nearby Hospitals metric in hand, we can now examine it in relation to the rapidly unfolding crisis of COVID-19. The visualization below highlights which counties have high per-capita COVID-19 infections with the lowest access to hospitals (as of April 21, 2020).