Data Collection and Analysis Made Easy with R: tidycensus to Advanced Methods

Riyanshi Bohra
6 min read · Jul 16, 2024

Introduction:

Welcome to an in-depth guide on using R for innovative data collection and analysis. This tutorial will walk you through how to leverage the ‘tidycensus’ package to access U.S. Census data and integrate it with other powerful tools in R. You will learn detailed steps, practical examples, and how to generate comprehensive insights through data visualization and analysis.

1. Introduction to ‘tidycensus’

The tidycensus package in R allows data scientists to easily access and manipulate U.S. Census data. This package simplifies retrieving demographic, social, and economic data, enabling detailed analysis and visualization.

Installation and Setup:

To get started, you’ll need to install tidycensus and obtain a Census API key.

install.packages("tidycensus")
install.packages("tidyverse")

library(tidycensus)
library(tidyverse)

# Set your Census API key
census_api_key("YOUR_API_KEY") # Copy your API key here

Obtaining a Census API Key:

  1. Visit the Census Bureau’s API page.
  2. Fill in the required information to sign up for an API key.
  3. Check your email for the API key and copy it.
  4. Use the census_api_key function in R to set your key.
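A key typed directly into a script is easy to leak when the script is shared. As a minimal sketch of one alternative, the check below assumes the key lives in the CENSUS_API_KEY environment variable, which is where tidycensus looks for it and where census_api_key(..., install = TRUE) stores it via .Renviron. The helper name has_census_key is hypothetical and not part of tidycensus.

```r
# Hypothetical helper: TRUE when a non-empty key is available in the environment
has_census_key <- function() {
  nzchar(Sys.getenv("CENSUS_API_KEY"))
}

# Remind the user to install a key once instead of hard-coding it in every script
if (!has_census_key()) {
  message("No Census API key found; run census_api_key(\"YOUR_API_KEY\", install = TRUE) once.")
}
```

After installing the key this way, a new R session should pick it up without any call to census_api_key() in the script itself.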

2. Fetching Census Data

Once you have set up tidycensus, you can start fetching data. For example, let’s retrieve population data for counties in California.

Example: Retrieving Population Data

# Get population data for counties in California
ca_population <- get_decennial(
  geography = "county",
  variables = "P001001", # Total population variable
  state = "CA",
  year = 2010,
  geometry = TRUE
)

# View the first few rows of the data
head(ca_population)

Explanation: This code fetches the total population data for each county in California for the year 2010. The geometry = TRUE parameter ensures that spatial data is included, which is useful for mapping.

Output:

Simple feature collection with 6 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -124.4096 ymin: 32.53416 xmax: -114.1312 ymax: 41.9983
Geodetic CRS:  NAD83
  GEOID NAME             variable value   geometry
1 06001 Alameda County   P001001  1510271 MULTIPOLYGON (((-122.3483...
2 06003 Alpine County    P001001  1175    MULTIPOLYGON (((-119.944...
3 06005 Amador County    P001001  38091   MULTIPOLYGON (((-121.085...
4 06007 Butte County     P001001  220000  MULTIPOLYGON (((-121.312...
5 06009 Calaveras County P001001  45578   MULTIPOLYGON (((-120.562...
6 06011 Colusa County    P001001  21419   MULTIPOLYGON (((-122.3538...
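
Because the result is a regular data frame with one row per county, the usual dplyr verbs apply to it. The sketch below re-types the six rows printed above into a small tibble (without the geometry column, so it runs offline) and ranks them by population. The object names ca_head and ca_ranked are made up for this illustration, and the share column is each county's share of these six rows only, not of the whole state.

```r
library(dplyr)

# The six rows shown in the output, re-typed by hand (geometry omitted)
ca_head <- tibble(
  GEOID = c("06001", "06003", "06005", "06007", "06009", "06011"),
  NAME  = c("Alameda County", "Alpine County", "Amador County",
            "Butte County", "Calaveras County", "Colusa County"),
  value = c(1510271, 1175, 38091, 220000, 45578, 21419)
)

# Rank counties from most to least populous and add each one's share
# of the six-county total
ca_ranked <- ca_head %>%
  arrange(desc(value)) %>%
  mutate(share = value / sum(value))

ca_ranked
```

The same pipeline should work on ca_population itself, since sf objects keep their geometry through arrange() and mutate().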

3. Visualizing Data with ggplot2

Visualization is key to understanding data. We’ll use ggplot2 to create insightful plots of the data retrieved.

Example: Plotting Population Data

library(ggplot2)

# Plot population by county
ggplot(data = ca_population) +
  geom_sf(aes(fill = value), color = NA) +
  scale_fill_viridis_c(option = "plasma") +
  theme_minimal() +
  labs(title = "Population by County in California (2010)",
       fill = "Population")

Explanation: This script creates a choropleth map showing how population is distributed across California counties. geom_sf() draws the county polygons, and scale_fill_viridis_c() maps each county's total population (a raw count, not a density) to a continuous color scale.

Output: The plot is displayed when you run the code chunk in an R environment. If you assign the plot to a variable instead, call print() on it.
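
County populations are heavily skewed: in the output above, Alameda County has more than a thousand times the population of Alpine County, so a linear fill scale leaves most small counties in nearly the same color. One possible remedy, sketched here under the assumption that a log scale suits your audience, is to fill by log10 of the population. The example uses three made-up square "counties" (toy_counties, built with the hypothetical helper unit_square) so it runs without an API key; with real data you would pass ca_population instead.

```r
library(sf)
library(ggplot2)

# Helper that builds a unit square whose lower-left corner is at (x0, 0)
unit_square <- function(x0) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}

# Three made-up "counties" standing in for ca_population
toy_counties <- st_sf(
  NAME     = c("A", "B", "C"),
  value    = c(1175, 45578, 1510271),
  geometry = st_sfc(unit_square(0), unit_square(1), unit_square(2))
)

# Fill by log10(population) so small and large counties are both distinguishable
p <- ggplot(toy_counties) +
  geom_sf(aes(fill = log10(value)), color = NA) +
  scale_fill_viridis_c(option = "plasma") +
  theme_minimal() +
  labs(fill = "log10(population)")

p
```

The trade-off is that the legend now reads in powers of ten, which may need a sentence of explanation for non-technical readers.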

4. Advanced Data Manipulation with dplyr

Enhance your data analysis by combining Census data with other datasets. We’ll use dplyr to join and manipulate datasets.

Example: Combining Population Data with Income Data

First, retrieve median income data:

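ca_income <- get_acs(
  geography = "county",
  variables = "B19013_001", # Median household income variable
  state = "CA",
  year = 2019
)

# Combine population and income data
# (only GEOID, variable and estimate are taken from the income table,
# so the join does not create a duplicate NAME column)
ca_combined <- ca_population %>%
  left_join(select(ca_income, GEOID, variable, estimate), by = "GEOID") %>%
  rename(population = value, median_income = estimate)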

# View the first few rows of the combined data
head(ca_combined)

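Explanation: Here, we fetch median household income data and merge it with the previously obtained population data. left_join() combines the datasets based on the GEOID field. Note that the two sources cover different periods: the population counts come from the 2010 decennial census, while get_acs() with year = 2019 returns the 2015-2019 ACS 5-year estimates by default.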

Output:

Simple feature collection with 6 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -124.4096 ymin: 32.53416 xmax: -114.1312 ymax: 41.9983
Geodetic CRS:  NAD83
  GEOID NAME             variable.x population variable.y median_income geometry
1 06001 Alameda County   P001001    1510271    B19013_001 94440         MULTIPOLYGON (((-122.3483...
2 06003 Alpine County    P001001    1175       B19013_001 63500         MULTIPOLYGON (((-119.944...
3 06005 Amador County    P001001    38091      B19013_001 61568         MULTIPOLYGON (((-121.085...
4 06007 Butte County     P001001    220000     B19013_001 49605         MULTIPOLYGON (((-121.312...
5 06009 Calaveras County P001001    45578      B19013_001 61461         MULTIPOLYGON (((-120.562...
6 06011 Colusa County    P001001    21419      B19013_001 51821         MULTIPOLYGON (((-122.3538...
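
A join by GEOID is only complete if every county appears in both tables, and left_join() will not warn you when one is missing. As a small sketch of how to check, the example below uses two toy tables (pop and inc, hypothetical names) with values copied from the outputs above; the income table deliberately lacks Amador County (06005) to show what a gap looks like.

```r
library(dplyr)

# Toy stand-ins for the two tables; inc is missing GEOID 06005 on purpose
pop <- tibble(GEOID = c("06001", "06003", "06005"),
              population = c(1510271, 1175, 38091))
inc <- tibble(GEOID = c("06001", "06003"),
              median_income = c(94440, 63500))

# Rows of pop with no partner in inc; an empty result means the join is complete
unmatched <- anti_join(pop, inc, by = "GEOID")
unmatched

# left_join() keeps the unmatched county and fills its income with NA
joined <- left_join(pop, inc, by = "GEOID")
joined
```

Running anti_join() before the real join is a cheap way to catch mismatched identifiers before they turn into unexplained NA values on a map.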

5. Spatial Analysis with sf

Using sf (simple features), we can perform spatial analysis and create more complex visualizations.

Example: Mapping Population and Income

library(sf)

# left_join() on an sf object keeps the geometry column, so ca_combined is
# already spatial; st_as_sf() is only a safeguard here
ca_combined <- st_as_sf(ca_combined)

# Plot population as the fill and median income as a text label on the same map
ggplot(data = ca_combined) +
  geom_sf(aes(fill = population), color = NA) +
  geom_sf_text(aes(label = round(median_income, 0)), size = 3, color = "white") +
  scale_fill_viridis_c(option = "plasma") +
  theme_minimal() +
  labs(title = "Population (2010) and Median Income (2015-2019 ACS) by County in California",
       fill = "Population")

Explanation: This plot visualizes both population and median income on a single map. Labels for median income are added to each county to provide additional context.
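
Beyond plotting, sf can compute geometric quantities such as area, which is what you need to turn population counts into population density. The sketch below is an illustration on two made-up square counties, assuming a projected coordinate system measured in metres (EPSG:3310, California Albers); the names square and toy are hypothetical. With the real data, which is in geographic NAD83 coordinates, you would typically reproject first, for example with st_transform(ca_combined, 3310).

```r
library(sf)

# Helper: square with side `side` metres, lower-left corner at (x0, 0)
square <- function(x0, side) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + side, 0), c(x0 + side, side), c(x0, side), c(x0, 0)
  )))
}

# Two made-up counties in California Albers (EPSG:3310), whose units are metres
toy <- st_sf(
  NAME       = c("A", "B"),
  population = c(500, 8000),
  geometry   = st_sfc(square(0, 1000), square(5000, 2000), crs = 3310)
)

# st_area() returns square metres here; convert to square kilometres
toy$area_km2 <- as.numeric(st_area(toy)) / 1e6

# People per square kilometre
toy$density <- toy$population / toy$area_km2

st_drop_geometry(toy)
```

Mapping density rather than raw counts often gives a fairer picture, because large rural counties no longer look more populous simply by covering more of the map.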

6. Automating Data Collection with purrr

Automate repetitive tasks with the purrr package, which simplifies looping operations and enhances workflow efficiency.

Example: Automating Data Retrieval for Multiple States

library(purrr)

# Define a function to get population data for a given state
get_population_data <- function(state_abbr) {
  get_decennial(
    geography = "county",
    variables = "P001001",
    state = state_abbr,
    year = 2010,
    geometry = TRUE
  )
}

# List of state abbreviations
states <- c("CA", "TX", "NY")

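# Retrieve data for multiple states and row-bind the results
# (map_df() is superseded as of purrr 1.0.0; rbind() has a method for sf objects)
all_states_population <- map(states, get_population_data) %>%
  reduce(rbind)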

# View the first few rows of the combined data
head(all_states_population)

Explanation: This code uses the purrr package to automate the process of retrieving population data for multiple states, streamlining the workflow and saving time.
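
When a loop makes one network request per state, a single failed request would normally abort the whole run. A minimal sketch of one way to guard against that uses purrr::possibly(), which wraps a function so that it returns a fallback value instead of raising an error. To keep the example runnable offline, fake_fetch is a hypothetical stand-in for get_population_data() that fails for one made-up state code.

```r
library(purrr)

# Stand-in for get_population_data(): fails for one state to mimic a bad request
fake_fetch <- function(state_abbr) {
  if (state_abbr == "XX") stop("unknown state: ", state_abbr)
  data.frame(state = state_abbr, value = nchar(state_abbr))
}

# possibly() returns a version of the function that yields NULL instead of erroring
safe_fetch <- possibly(fake_fetch, otherwise = NULL)

states  <- c("CA", "XX", "NY")
results <- map(states, safe_fetch) %>% set_names(states)

# Names of the states whose request failed
failed <- names(keep(results, is.null))

# Drop the failures and bind the successful results into one data frame
combined <- do.call(rbind, compact(results))

failed
combined
```

In real use you would wrap get_population_data the same way and inspect failed afterwards to decide which states to retry.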

7. Exporting and Saving Data

To ensure your hard work is preserved and can be used for future analysis, it’s crucial to export and save your dataset properly. Here are a few ways to do that in R.

Exporting to CSV:

Exporting data to a CSV file is straightforward and highly useful for sharing and further analysis.

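# Export combined data to a CSV file
# (drop the geometry column first; polygons do not fit in a flat CSV)
write.csv(st_drop_geometry(ca_combined), "ca_combined_data.csv", row.names = FALSE)

Explanation: This command writes the attribute columns of ca_combined to a CSV file named "ca_combined_data.csv" in your working directory. st_drop_geometry() removes the spatial column before export, and row.names = FALSE omits the row-number column.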
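
One pitfall when reading the CSV back: GEOIDs start with a zero, and read.csv() parses columns that look numeric as numbers by default, which drops that zero and breaks later joins. The sketch below round-trips a small toy table (hypothetical, with GEOIDs taken from the examples above) through a temporary file to show the difference that the colClasses argument makes.

```r
# Toy table with GEOIDs from the examples above
toy <- data.frame(GEOID = c("06001", "06003"),
                  population = c(1510271, 1175))

# Write to a temporary file so the example leaves nothing behind
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Default parsing treats GEOID as a number, so the leading zero is lost
naive <- read.csv(path)
naive$GEOID

# Declaring the column as character keeps the identifier intact
safe <- read.csv(path, colClasses = c(GEOID = "character"))
safe$GEOID
```

Saving as RDS avoids this problem altogether, because column types are stored with the object.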

Saving as RDS:

For preserving R objects with their structure and attributes, saving as an RDS file is a good choice.

# Save combined data as an RDS file
saveRDS(ca_combined, "ca_combined_data.rds")

# To load the data back in the future
ca_combined <- readRDS("ca_combined_data.rds")

Explanation: The saveRDS function saves the R object to a file, and readRDS can be used to load it back into R. This method is efficient for preserving all the attributes and structure of the data.

Further Reading:

  1. tidycensus documentation: The official documentation for the tidycensus package, offering detailed examples and use cases.
  2. R for Data Science by Hadley Wickham and Garrett Grolemund: A comprehensive guide to data science using R, covering everything from data import to visualization and modeling.

Conclusion:

By leveraging the tidycensus package in R, combined with powerful tools like ggplot2, dplyr, sf, and purrr, you can efficiently collect, analyze, and visualize complex datasets. These skills are essential for any aspiring data scientist looking to expand their horizons and tackle real-world problems.

Personal Note:

Thank you for reading my blog on innovative data collection and analysis with R! I hope you found it informative and engaging. I’d love to hear your thoughts, so please leave a comment below. If you enjoyed this article and want to stay updated on my latest work, feel free to connect with me on LinkedIn and check out my projects on GitHub.

LinkedIn | GitHub

Looking forward to connecting with you and hearing your feedback!
