Working with geography in survey data

Published in Pew Research Center: Decoded

7 min read, May 30, 2019

Survey researchers frequently explore differences in public opinion by demographic group — how men’s views compare with those of women, for example, or how younger people compare with older people. Often, it’s also possible to look at differences by survey respondents’ geographic location.

Yet, when geographic information is available in survey data, it’s not always at the level researchers are looking for. At Pew Research Center, for instance, we typically ask our phone survey respondents for their ZIP code, primarily so we can accurately identify their state and Census region for the purposes of weighting our surveys. On their own, ZIP codes are of little value to researchers interested in doing geographic analyses, in part because there are over 30,000 ZIP codes in the United States.

However, researchers can use ZIP codes to determine location at a more granular level and do geographic analyses that use spatial relationships. Using ZIP codes, it’s possible to locate respondents in a specific place with some degree of accuracy — specifically, the latitude and longitude at the centroid (or geographical center) of the Census’s ZIP code tabulation area.

In this post, I’ll walk through an example from early 2018 that used ZIP codes to conduct geographic analysis. The analysis in question was based on a survey question we asked about public support for offshore drilling. It seemed reasonable to explore whether respondents’ proximity to a coastline was associated with their attitudes toward offshore drilling.

Geolocating respondents

The first step in this kind of analysis is geolocating survey respondents. In this data, the best we can do is locate people within their ZIP codes.

There are a couple of things to keep in mind when working with ZIP codes. First, we have to determine how to translate the ZIP codes into geographic coordinates. In this analysis, we’ll assign respondents to the centroids of their ZIP code tabulation areas. (One great resource for georeferenced Census units is GeoCorr, and you can download the ZIP code data here.)

To begin, we’ll read in the survey data (you can find the survey data here) and the GeoCorr data and merge the two datasets, as shown below. (A side note on working with ZIP codes: Always check the format. ZIP code is stored as a factor variable in the survey data but as a character variable in the GeoCorr data. To avoid problems when merging the two, make sure they are in the same format. Additionally, with ZIP codes in particular, always be mindful of leading zeros.)

library(foreign)

### Read in the datasets
dat <- read.spss("small_jan_data.sav",
                 to.data.frame = TRUE)

## ZIP code data from GeoCorr for all ZIP code centroids
zip <- read.csv("geocorr2018.csv", as.is = TRUE)
head(zip)

## Remove the first row
zip <- zip[-1, ]
zip$lon <- as.numeric(zip$intptlon)
zip$lat <- as.numeric(zip$intptlat)

## Format ZIP code as a string variable
dat$zipcode <- as.character(dat$finalzip)

### Merge data
dat <- merge(dat, zip[, c("zcta5", "lat", "lon")],
             by.x = "zipcode", by.y = "zcta5", all.x = TRUE)

### Plot the ZIP code coordinates
with(dat, plot(lon, lat, pch = 20))
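The side note about formats and leading zeros can be made concrete with a small helper. This is a minimal sketch rather than part of the original analysis: normalize_zip() is a hypothetical name, the example values are made up, and it assumes that anything other than one to five digits should be treated as missing.

```r
# Hypothetical helper: coerce ZIP codes to five-character strings.
# Factors and numbers are converted to text, then left-padded with zeros.
normalize_zip <- function(z) {
  z <- trimws(as.character(z))
  ok <- grepl("^[0-9]{1,5}$", z)          # NA and non-numeric entries fail this check
  out <- rep(NA_character_, length(z))
  out[ok] <- sprintf("%05d", as.integer(z[ok]))
  out
}

# Made-up values for illustration only
example_zips <- normalize_zip(factor(c("2139", "90210", "00501", "abc")))
print(example_zips)
```

Applying the same helper to both the survey ZIP codes and the GeoCorr zcta5 column before merging puts the two keys in the same format. After the merge, mean(is.na(dat$lat)) gives the share of respondents whose ZIP code found no match.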

Calculating distance from coast

The next step is to bring in the map of coastlines and find the distance between respondents’ locations and the coastline.

## Bring in coastline shapefile
library(sp)
library(rgeos)
library(rgdal)

## Available from here: http://www.soest.hawaii.edu/pwessel/gshhg/
map <- readOGR("coastline map/GSHHS_l_L1.shp")

#### Extract the coordinates from the map object
polys <- map@polygons

## Container for coordinates
all.coords <- NULL

for (j in seq_along(polys)) { ## loop through polygons
  ## Find coordinate slots
  if (is.element("coords", slotNames(polys[[j]]))) {
    coords <- polys[[j]]@coords
    all.coords <- rbind(all.coords, coords)
  }
  ## Get polygon slots for more complicated geographies
  if (is.element("Polygons", slotNames(polys[[j]]))) {
    p <- polys[[j]]@Polygons
    ## Extract coordinates from these more complex objects
    coords <- NULL
    for (k in seq_along(p)) {
      coords <- rbind(coords, p[[k]]@coords)
    }
    all.coords <- rbind(all.coords, coords)
  }
}

plot(all.coords, pch = '.')

Plot of the points that make up the coastline shapefile

### Select N. America coastline
x <- which(all.coords[, 1] > -180 & all.coords[, 1] < -45)
y <- which(all.coords[, 2] > 20)
wh <- intersect(x, y)
plot(all.coords[wh, ], pch = '.')

### Cut out NE Canadian coastline
x <- which(all.coords[, 1] > -120)
y <- which(all.coords[, 2] > 55)
wh2 <- intersect(x, y)
wh <- setdiff(wh, wh2)
us <- all.coords[wh, ]

plot(us, pch = '.')
points(zip$lon, zip$lat, pch = '.', col = 'red')

Calculating distance is a little more involved than simply using Euclidean distance. Given that we are dealing with a coordinate system on a globe, we’ll use a great circle distance approximation:

#### Functions for calculating great circle distances in miles
#### Slightly modified from this post: https://www.r-bloggers.com/great-circle-distance-calculations-in-r/

#### Convert degrees to radians for distance calculation
deg2rad <- function(deg) return(deg * pi / 180)

# Calculates the great circle distance between one point and a vector of
# points, all given as latitude/longitude in degrees, using the
# Spherical Law of Cosines (slc)
gcd.slc <- function(long1, lat1, long2, lat2) {
  ## Convert coordinates to radians
  lat1 <- deg2rad(lat1)
  long1 <- deg2rad(long1)
  lat2 <- deg2rad(lat2)
  long2 <- deg2rad(long2)

  R <- 3959 # Earth mean radius [miles]

  ## Container for distances
  d_vec <- rep(NA, length(long2))

  for (i in seq_along(long2)) {
    ## Find distances from each point
    d_vec[i] <- acos(sin(lat1) * sin(lat2[i]) +
                     cos(lat1) * cos(lat2[i]) * cos(abs(long2[i] - long1))) * R
  }

  return(d_vec)
}

## Column for distance to coast
dat$dist_to_coast <- NA

for (j in seq_len(nrow(dat))) {
  dist <- gcd.slc(dat$lon[j], dat$lat[j],
                  us[, 1], us[, 2])
  dat$dist_to_coast[j] <- min(dist)
}

## Create a 3-way distance variable
dat$dist3 <- 99
w <- which(!is.na(dat$dist_to_coast))
dat$dist3[w] <- 3
w <- which(dat$dist_to_coast < 300)
dat$dist3[w] <- 2
w <- which(dat$dist_to_coast < 25)
dat$dist3[w] <- 1

dat$dist3 <- factor(dat$dist3, labels = c("Less than 25 miles",
  "25-300 miles", "More than 300 miles", "Missing"))
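The three-way grouping at the end of that code can also be written with cut(). The version below is a sketch under the same 25- and 300-mile cutoffs; make_dist3() is a hypothetical name, and the distances in the example are made up.

```r
# Hypothetical helper that bins distances (in miles) into the same groups.
make_dist3 <- function(dist_miles) {
  labs <- c("Less than 25 miles", "25-300 miles", "More than 300 miles")
  # right = FALSE gives intervals [a, b), so exactly 25 lands in the middle
  # group and exactly 300 in the last one, matching the "<" comparisons
  d3 <- as.character(cut(dist_miles,
                         breaks = c(-Inf, 25, 300, Inf),
                         labels = labs,
                         right = FALSE))
  d3[is.na(d3)] <- "Missing"               # respondents with no distance
  factor(d3, levels = c(labs, "Missing"))
}

# Made-up distances for illustration only
example_groups <- make_dist3(c(0, 24.9, 25, 299.9, 300, NA))
print(table(example_groups))
```

Setting the levels explicitly keeps all four categories in a fixed order even when one of them happens to be empty.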

A close reading of this code shows that we’re actually calculating the distance to the nearest of the points (vertices) that make up the coastline. It would be better to calculate the distance to the nearest point on the line segments connecting those points, but that would be a much more involved calculation. Because the coastline is made up of a large number of points, the resulting error is small. A larger source of error is the fact that respondents are located at the centroids of their ZIP codes rather than at their actual addresses.
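The nearest-point calculation can also be written without an explicit loop over coastline points. The following is a sketch, not the code used in the analysis: gcd_slc_vec() is a hypothetical name, the toy coastline consists of made-up points on the equator, and the clamping step is an added safeguard against rounding error.

```r
# Vectorized spherical law of cosines; all inputs are in degrees.
gcd_slc_vec <- function(long1, lat1, long2, lat2, R = 3959) {
  to_rad <- pi / 180
  lat1 <- lat1 * to_rad
  lat2 <- lat2 * to_rad
  dlon <- (long2 - long1) * to_rad
  cos_angle <- sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(dlon)
  # Rounding can push the value slightly outside [-1, 1], where acos() gives NaN
  cos_angle <- pmin(1, pmax(-1, cos_angle))
  acos(cos_angle) * R                      # distance in miles
}

# Toy check with made-up points on the equator (not real coastline data)
toy_coast <- cbind(lon = c(1, 2, 90), lat = c(0, 0, 0))
d <- gcd_slc_vec(0, 0, toy_coast[, "lon"], toy_coast[, "lat"])
nearest <- min(d)   # one degree of longitude at the equator, about 69 miles
print(nearest)
```

The spherical law of cosines loses precision for points that are very close together. The haversine formula is a common alternative when very short distances matter, although at ZIP code resolution the difference is unlikely to be important.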

Examining the results

Given the distance measure, we can now examine attitudes toward offshore drilling by respondents’ proximity to the coast:

library(survey)

design <- svydesign(id = ~1, weights = ~weight, data = dat)
svyby(~q90, ~dist3, design = design,
      FUN = svymean, keep.names = FALSE, na.rm = TRUE)

dist3                 q90Favor  q90Oppose
1 Less than 25 miles  0.3457668 0.5562020
2 25-300 miles        0.4507683 0.4783657
3 More than 300 miles 0.4603733 0.4967827
4 Missing             0.3720786 0.4979092

Indeed, there appears to be a sizable difference in attitudes toward offshore drilling between those who live nearer to a coastline and those who live farther away. People who live within 25 miles of the coast were about 10 percentage points less likely to say they favor offshore drilling than those who live more than 25 miles from the coast (35% vs. 45% to 46%). However, Democrats and Democratic-leaning independents live nearer to the coast, on average, than Republicans and Republican leaners do:

svyby(~dist_to_coast, ~partysum, design = design,
      FUN = svymean, keep.names = FALSE, na.rm = TRUE)

partysum         dist_to_coast se
1 Rep/lean Rep   256.6374      12.05583
2 Dem/lean Dem   205.2576      10.72659
3 DK/Ref-no lean 216.7199      28.74201
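As a rough check on the size of that partisan gap, the estimates and standard errors in the table can be combined by hand. This is a back-of-the-envelope sketch: it assumes the two group estimates are independent, which is only an approximation.

```r
# Estimates and standard errors copied from the svyby() output above
est <- c(rep = 256.6374, dem = 205.2576)
se  <- c(rep = 12.05583, dem = 10.72659)

gap    <- est[["rep"]] - est[["dem"]]            # difference in mean miles
gap_se <- sqrt(se[["rep"]]^2 + se[["dem"]]^2)    # assumes independent estimates
z      <- gap / gap_se

print(round(c(gap = gap, se = gap_se, z = z), 2))
```

The gap of roughly 51 miles is about three times its approximate standard error. A design-based test on the microdata, for example with svyttest() from the survey package, would be the more careful way to confirm this.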

In a multivariate framework, there is no statistically significant relationship (at the conventional 0.05 level) between attitudes toward offshore drilling and proximity to the coast after controlling for partisanship:

summary(svyglm(q90 == "Favor" ~ dist3 + partysum,
               design = design, family = 'quasibinomial'))

Call:
svyglm(formula = q90 == "Favor" ~ dist3 + partysum, 
    design = design, 
    family = "quasibinomial")

Survey design:
svydesign(id = ~1, weights = ~weight, data = dat)

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)                0.6551     0.1519   4.314 1.71e-05 ***
dist325-300 miles          0.2855     0.1719   1.661    0.097 .  
dist3More than 300 miles   0.2284     0.1823   1.253    0.210    
dist3Missing               0.1270     0.3351   0.379    0.705    
partysumDem/lean Dem      -2.0681     0.1494 -13.844  < 2e-16 ***
partysumDK/Ref-no lean    -1.4222     0.2350  -6.051 1.81e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasibinomial family taken to be 0.999249)

Number of Fisher Scoring iterations: 4
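Because the coefficients are on the log-odds scale, it can help to convert them before interpreting them. The sketch below uses only the numbers printed in the model output. The 1.96 multiplier is a normal approximation, so the intervals are approximate, and the short vector names (mid, far, dem) are labels introduced here for convenience.

```r
# Coefficients and standard errors copied from the summary() output above
b  <- c(intercept = 0.6551, mid = 0.2855, far = 0.2284, dem = -2.0681)
se <- c(mid = 0.1719, far = 0.1823)

# Odds ratios for the distance groups, relative to "Less than 25 miles"
odds_ratio <- exp(b[c("mid", "far")])

# Approximate 95% intervals on the log-odds scale
ci <- cbind(lower = b[c("mid", "far")] - 1.96 * se,
            upper = b[c("mid", "far")] + 1.96 * se)

# Predicted probability of favoring drilling within 25 miles of the coast
p_rep <- plogis(b[["intercept"]])                 # Rep/lean Rep
p_dem <- plogis(b[["intercept"]] + b[["dem"]])    # Dem/lean Dem

print(round(ci, 3))
print(round(c(p_rep = p_rep, p_dem = p_dem), 3))
```

Both distance intervals include zero, which is consistent with the reading that distance adds little once party is in the model. The party difference is large by comparison: among people within 25 miles of the coast, the predicted share favoring drilling is roughly two-thirds for Republicans and Republican leaners and about one in five for Democrats and Democratic leaners.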

Notes

As mentioned above, locating respondents at the centroid of their ZIP code rather than at their actual address introduces some error. If more granular data (e.g., exact addresses) are available, those addresses can be translated into exact coordinates instead. There are many ways to translate street addresses into geographic coordinates. For example, the ggmap package in R interfaces with the Google Maps API to extract latitude and longitude coordinates for address data.

If you’re interested in trying your own geographic analysis of survey data, it’s possible to get ZIP code data for a subset of Pew Research Center general public surveys conducted by phone, once those datasets have already been publicly released. Because respondent privacy is of utmost concern, we do request additional information from researchers and require them to subscribe to additional data usage agreements. If you have a request along these lines, please send an email to info@pewresearch.org and include the specific survey name and dates (or link to the survey). Please note that detailed geographic variables (like ZIP code) are not available for surveys of rare populations or surveys conducted through the Center’s American Trends Panel. (Due to the longitudinal nature of the panel, there is a great deal of information available about panelists so we need to take extra safeguards to protect their confidentiality.)

Bradley Jones is a research associate focusing on U.S. politics and policy at Pew Research Center.

Written by Brad Jones