Visualizing Geographic Data in R

Maps have become one of the most popular ways of visualizing data. They are aesthetically pleasing and require little explanation. But not all mapping techniques are created equal. This post evaluates several methods of visualizing geographic data and offers alternatives to the standard data map.

R Packages and Dependencies

This post focuses on ggplot2, the R visualization package that is part of the tidyverse. The geofacet package is used to create the tile maps, and the fiftystater package is used to create the choropleth map.

library(tidyverse)
library(geofacet)
library(stringr)
library(fiftystater)

The data sourcing and wrangling code is not included in this post, but can be found here.
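
Since the wrangling code is not shown here, it may help to sketch the shape that the later snippets appear to expect from gdp_area. The column names below (state, state_full, gdp, area) are inferred from how the plotting code uses them. The GDP column is left as a placeholder, and the areas come from R's built-in state.area dataset (total area in square miles), which is an assumption and not necessarily the source used for this post.

```r
# A hypothetical stand-in for gdp_area, built only from base R datasets.
# Column names are inferred from the plotting code in this post.
sq_mi_to_sq_km <- 2.589988

gdp_area_sketch <- data.frame(
  state      = state.abb,                          # two-letter code, e.g. "IN"
  state_full = tolower(state.name),                # lowercase name, assumed to match the map_id
  area       = state.area * sq_mi_to_sq_km / 1e6,  # millions of sq km (total area, not land area)
  gdp        = NA_real_,                           # placeholder: GDP in trillions of dollars
  stringsAsFactors = FALSE
)

head(gdp_area_sketch)
```

With real GDP values filled in, a frame of this shape should be usable wherever gdp_area appears below.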

What is a data map?

A data map is any chart which displays geographic data. When we think of data maps, the first one that comes to mind is probably an election map: blue along the coasts and large red regions in the middle. Creating these visuals used to be a laborious work of art, not available to the common data analyst. With R’s combination of graphing libraries, anyone with a state code and a value can make a data map. Data maps are no longer reserved for elections; they can be used to visualize any data with a geographic component.

Consider the following data map:

####################################
# choropleth map of state GDPs
####################################
low_color <- '#ccdbe5'
high_color <- "#114365"
legend_title <- 'GDP ($, trillions)'
ggplot(gdp_area, aes(map_id = state_full)) +
  # draw each state polygon, shaded by GDP, with thin white borders
  geom_map(aes(fill = gdp), color = "#ffffff", size = .15, map = fifty_states) +
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map() +
  labs(x = "", y = "") +
  scale_x_continuous(breaks = NULL) +
  scale_y_continuous(breaks = NULL) +
  scale_fill_continuous(low = low_color, high = high_color, guide = guide_colorbar(title = legend_title)) + # creates shading pattern
  theme(#legend.position = "bottom",
        panel.background = element_blank()) +
  fifty_states_inset_boxes() +
  ggtitle('State GDP 2016', subtitle = 'California had the highest GDP by a wide margin')

What conclusions about state GDP can you draw from this map? You might notice that in 2016:

  • California had the highest GDP
  • New York and Texas had high GDPs
  • Much of the rest of the country is light blue
  • My home state of Indiana looks to be towards the bottom of GDP producing states

From this map, we can see that there is no obvious relationship between geography and GDP: the states with the highest GDPs are quite far away from each other. Still, we gain a good understanding of which states are our highest producers.
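
One way to see why so much of the map comes out light blue is to look at where each value lands on a continuous fill scale. The sketch below assumes the default behavior of placing values linearly between the data minimum and maximum. The GDP values are made up for illustration, and the base R color ramp only approximates the shades ggplot2 would produce.

```r
# Hypothetical GDP values in trillions of dollars (illustrative, not the post's data).
gdp <- c(big = 2.6, second = 1.6, third = 1.5, mid = 0.35, small = 0.05)

# Position of each value between the data minimum (0) and maximum (1).
position <- (gdp - min(gdp)) / (max(gdp) - min(gdp))
round(position, 2)

# Base R approximation of the low-to-high gradient used in the map.
# ggplot2 may interpolate in a different color space, so shades can differ.
ramp <- colorRamp(c("#ccdbe5", "#114365"))
shades <- rgb(ramp(position), maxColorValue = 255)
names(shades) <- names(gdp)
shades
```

When one value sits far above the rest, most of the remaining values are squeezed into the light end of the gradient.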

While this map looks good and provides a fair amount of geographic information, it has several analytical shortcomings. The main weakness is that land area drives visual weight: a western state such as Montana covers almost as much area as the entire Northeast, so it dominates the map regardless of its GDP. As the following scatter plot shows, there is little relationship between land area and GDP:

####################################
## scatter plot of GDP and land area
####################################
# top states are states which will be annotated
top_states <- gdp_area %>%
  filter(gdp > .9 | area > 1 | state == 'IN')
# calculate r2 between variables for scatter plot
r2 <- str_c("R-squared = ", format(cor(gdp_area$gdp, gdp_area$area)^2, digits = 2, nsmall = 2))
# create scatter plot
gdp_area %>%
  ggplot() +
  geom_point(aes(x = area, y = gdp, color = (area > .5 | gdp > .9 | state == 'IN'))) +
  scale_color_manual(name = '', values = c('black', '#1f77b4')) +
  geom_text(aes(x = area, y = gdp, label = state), color = '#1f77b4', data = top_states, vjust = -.4) +
  ggtitle('State GDP vs. Land Area', subtitle = 'There is little correlation between GDP and state size') +
  xlab('Land Area (Sq. Km, millions)') +
  ylab('GDP ($, trillions)') +
  annotate('text', x = 1.4, y = 2.65, label = r2) +
  scale_y_continuous(limits = c(0, 2.75)) +
  scale_x_continuous(limits = c(0, 1.6)) +
  theme(legend.position = "none")
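
The R-squared annotation in the plot is the squared Pearson correlation between the two variables. As a quick check, the sketch below uses made-up numbers (not the post's data) to show that this matches the R-squared reported by a one-predictor linear model.

```r
# Toy data, made up for illustration (not the post's gdp_area).
area <- c(0.10, 0.25, 0.40, 0.70, 1.50)
gdp  <- c(0.30, 2.60, 0.20, 1.60, 0.05)

# Squared Pearson correlation, as computed for the plot annotation.
r2_from_cor <- cor(gdp, area)^2

# The same quantity read off a simple linear regression with an intercept.
r2_from_lm <- summary(lm(gdp ~ area))$r.squared

all.equal(r2_from_cor, r2_from_lm)
```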

While a larger land area does not go along with a higher GDP, our original map gives larger states far more visual weight.
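
The comparison between Montana and the Northeast can be roughly checked with datasets that ship with R. Note the assumptions: state.area reports total area in square miles (land plus water), and state.region uses the Census Bureau's definition of the Northeast, so this is an approximation of the land-area comparison and not the data used for the plots.

```r
# state.area is total area in square miles, so this is only a rough check.
montana   <- state.area[state.name == "Montana"]
northeast <- sum(state.area[state.region == "Northeast"])

# Ratio of Montana's area to the combined area of the Northeast states.
area_ratio <- montana / northeast
area_ratio

# The states counted as "Northeast" in this dataset.
state.name[state.region == "Northeast"]
```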

Alternatives to the Standard Data Map

While the traditional data map leaves a bit to be desired, there are alternative ways of visualizing this state-level data. The tile map has been growing in popularity among data visualizers, as it removes the land area bias that plagues the traditional map.

####################################
# tile map
####################################
create_gradient_state_tile_map <- function(state, value, title, legend_title, low_color='#ccdbe5', high_color="#114365", state_grid='us_state_grid2') {

  # as.tibble() is deprecated; build the tibble directly
  df <- tibble(state = state, value = value)

  fig <- df %>%
    mutate(x = 1) %>% # size of the bar being plotted. All bars should be the same size to make perfect squares
    mutate(label_y = .5) %>% # the location of state labels
    mutate(label_x = 1) %>%
    ggplot() +
    geom_bar(mapping = aes(x = x, fill = value)) +
    facet_geo(~ state, grid = state_grid) +
    scale_fill_continuous(low = low_color, high = high_color, guide = guide_colorbar(title = legend_title)) + # creates shading pattern
    ggtitle(title) +
    theme_classic() + # theme classic removes many ggplot defaults (grey background, etc)
    theme(plot.title = element_text(size = 28), # format the plot
          plot.margin = unit(c(1, 1, 1, 1), "cm"),
          legend.text = element_text(size = 16),
          legend.title = element_text(size = 16),
          axis.title = element_blank(),
          axis.text = element_blank(),
          axis.ticks = element_blank(),
          strip.text.x = element_blank(),
          axis.line = element_blank()) +
    geom_text(aes(x = label_x, y = label_y, label = state), color = '#ffffff', size = 10)

  return(fig)
}
# Using 2016 values, create a tile map
tile_map <- create_gradient_state_tile_map(gdp_area$state, gdp_area$gdp, title = 'State GDP 2016 \n', legend_title = "GDP ($, trillions)")
tile_map
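
The tile placement comes from a grid: a small table that assigns each state a row and a column. The sketch below builds a hypothetical four-state grid in the row/col/code/name layout that geofacet's built-in grids, such as us_state_grid2, appear to use, and prints the resulting layout with base R only. Passing a data frame like this as the grid argument of facet_geo is an assumption based on the package documentation and is not tested here.

```r
# A hypothetical mini grid: one row per state, with its tile position.
mini_grid <- data.frame(
  row  = c(1, 2, 2, 2),
  col  = c(2, 1, 2, 3),
  code = c("MI", "IL", "IN", "OH"),
  name = c("Michigan", "Illinois", "Indiana", "Ohio"),
  stringsAsFactors = FALSE
)

# Lay the codes out as a matrix to preview where each tile would sit.
layout <- matrix("", nrow = max(mini_grid$row), ncol = max(mini_grid$col))
layout[cbind(mini_grid$row, mini_grid$col)] <- mini_grid$code
layout
```

Every state occupies exactly one cell, which is why each tile has the same size regardless of the state's land area.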

While these maps do not arrange states in a completely accurate geographic way, they allow us to consider each state equally without the influence of a state’s land area. This tile map might be an improvement on our original map; however, it is not without its own faults. Extreme outliers, such as California, make it very difficult to discern differences in GDP among the bottom 40 states. The simplest way to visualize the difference in GDP between each state would be to create a bar chart.

####################################
# simple bar chart of state gdps
####################################
gdp_area %>%
  ggplot() +
  # geom_col() is equivalent to geom_bar(stat = 'identity')
  geom_col(mapping = aes(x = reorder(state, gdp), y = gdp)) +
  coord_flip() +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 2.75)) +
  ylab("GDP ($, trillions)") +
  xlab('State') +
  ggtitle('State GDP 2016') +
  theme(axis.text = element_text(size = 6))

The bar chart is the best visual for communicating just how much larger California’s GDP is than that of any other state. When looking at our original map, you may have noticed Texas and New York were the second and third ranking states; however, it was not clear that California was producing almost 40% more than the Lone Star State and the Empire State!

For us Hoosiers, another notable conclusion is that Indiana’s GDP actually ranks in the top third! On our original map, the difference in GDP between Indiana and the states that rank below it was too small to produce a noticeable change in shading.
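
Both observations, a state's rank and the gap between the top two states, can also be computed directly instead of being read off the chart. The sketch below uses made-up state codes and values purely for illustration; with the real data, the same steps would apply to the gdp column of gdp_area.

```r
# Hypothetical GDP values in trillions of dollars (illustrative only).
gdp <- c(AA = 2.0, BB = 1.5, CC = 0.9, DD = 0.4, EE = 0.3, FF = 0.1)

# Rank states from largest (1) to smallest.
gdp_rank <- rank(-gdp, ties.method = "min")

# Is a given state in the top third?
in_top_third <- gdp_rank <= length(gdp) / 3
in_top_third[["BB"]]

# How much larger is the top state than the runner-up, in percent?
sorted <- sort(gdp, decreasing = TRUE)
pct_gap <- (sorted[[1]] - sorted[[2]]) / sorted[[2]] * 100
pct_gap
```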

While the bar chart best visualizes the difference in GDP, all geographic elements are lost. When choosing between a map and a bar chart, it is important to ask whether geography matters to the intended message of the chart.

What about time-series data?

Data maps usually represent data at one point in time. Using a traditional data map, the only way to include time would be to create multiple maps, to compare data at specific points in time. The recently released library, geofacet, allows us to add the element of time to a tile map:

#########################################
# tile area map
#########################################
# create state tile map
# df is assumed to hold one row per state and year: state, date (numeric year), gdp
ggplot(df, aes(date, gdp)) +
  geom_area(fill = '#114365') +
  facet_geo(~ state, grid = "us_state_grid2", label = "name") +
  # show two-digit years on the x axis, e.g. '97
  scale_x_continuous(labels = function(x) paste0("'", substr(x, 3, 4))) +
  labs(title = "GDP by State 1997-2016 ($, trillions)",
       caption = "Data Source: St. Louis FRED",
       x = "Year",
       y = "GDP") +
  theme(strip.text.x = element_text(size = 8))

When looking at this chart, we can see how GDP changed from 1997 to 2016 for each individual state. This new element gives the reader historical context from which to draw new conclusions.
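
The time-series chart needs the data in long format, with one row per state per year. The sketch below shows a hypothetical frame of that shape (the column names state, date, and gdp are taken from the plotting code, with GDP left as a placeholder) along with the axis label function used in the plot.

```r
# Hypothetical long-format frame: one row per state per year.
# gdp is left as NA; the real values come from the post's data source.
df_sketch <- expand.grid(
  state = c("CA", "IN", "TX"),
  date  = 1997:2016,
  stringsAsFactors = FALSE
)
df_sketch$gdp <- NA_real_

nrow(df_sketch)   # 3 states x 20 years

# The axis label function from the plot: keep the last two digits of the year.
short_year <- function(x) paste0("'", substr(x, 3, 4))
short_year(c(1997, 2016))
```

Shortening the labels this way keeps the x axis readable inside the small facet panels.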

Ultimately, no map will communicate every possible piece of information. Our time-series tile map does not accurately represent geography, but provides many other insightful bits of information. Our original map is geographically accurate, but leaves room for improvement in many analytical aspects. When selecting a chart, select the one which will most accurately communicate your intended message.

* The complete R code from this post can be found here