R4DS Week 4: An EDA state of mind

We’re closing out week 4 and the section on Exploratory Data Analysis, and if you’ve been logging into Slack, reading people’s posts, asking questions, or catching up on “R for Data Science”, hey, guess what?

You’re doing the thing!

The takeaway from Week 4

There’s no set of code snippets to quickly cover the week’s material — instead this week summarizes everything we’ve covered so far. This post goes through the major highlights and guideposts, but the best way to get good at EDA is to get out there and do EDA.

EDA mindset

The big topic this week was exploratory data analysis (EDA for short), and it’s probably one of my favorite things to do when working with data. This is the part of analysis where you’re rewarded for tidying up your data and developing your ggplot2 skills, because now you get to start playing with the data by asking preliminary questions and exploring the answers.

The goal of EDA is to really dig in and understand your data. This is a point in your analysis where you don’t have to worry about creating publication-quality visuals, and you definitely don’t have to share every last question you’ve asked, let alone the results of everything you learn. So go ahead, jump in there and make mistakes!

The authors put it best when they offer up two guiding questions for performing EDA:

“What type of variation occurs within my variables?”
“What type of covariation occurs between my variables?”

In other words, we’re going to look at our data and ask “what does my data look like?” and “what’s related to what?”

My goal? That everyone feels this way about exploratory data analysis

What does my data look like?

One of the first graphs you’ll want to make is either a bar chart (for categorical variables) or a histogram (for continuous variables). Your goal here is to see how your data is distributed, and look for common values, patterns, and rare values.

# refresher on bar charts in ggplot2, using the diamonds dataset
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

Quick extension exercises:

  • Observe the graph you’ve created — what additional questions can you come up with related to the data you see? How could you go about answering those questions?
  • How would you figure out how many diamonds are in each category of cut without graphing?
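
One way to approach that last question is to count the rows in each category directly. This is a minimal sketch, assuming dplyr is installed alongside ggplot2; the name `cut_counts` is only illustrative.

```r
library(dplyr)
library(ggplot2)  # loaded here because it provides the diamonds dataset

# One row per level of cut, with the number of diamonds in column n
cut_counts <- diamonds %>%
  count(cut)

cut_counts
```

The values in `n` should match the bar heights in the chart above.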

# refresher on histograms in ggplot2, using the diamonds dataset
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

Quick extension exercises:

  • Change the binwidth value in the above histogram — what happens?
  • Which binwidth most accurately summarizes your data?
  • Filter your data so that you’re only looking at diamonds with a size of less than three carats. Which binwidth most accurately summarizes your data?
  • Observe the graphs you’ve created — what additional questions can you come up with related to the data you see? How could you go about answering those questions?
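
The filtering exercise could be sketched as follows, assuming dplyr is loaded. The name `smaller` matches the data frame the later plots in this post use, and the binwidth of 0.1 is just one value worth trying, not the single right answer.

```r
library(dplyr)
library(ggplot2)

# Keep only the diamonds under three carats
smaller <- diamonds %>%
  filter(carat < 3)

# A narrower binwidth shows detail that binwidth = 0.5 smooths over
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

# Counts per 0.1-carat interval as a table, for reading values without the plot
smaller %>%
  count(cut_width(carat, 0.1))
```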

What’s related to what?

Continuous by categorical

It’s helpful to know what our data looks like, but it’s also helpful to see how our variables may (or may not!) be related to one another. To do that, let’s use boxplots to examine what our continuous data looks like when we break it down by specific categories.

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()

If you’re not sure how to read a boxplot, look at this video, read this article, and/or use this image:

From Section 7.5 in the R for Data Science text
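
If it helps to see numbers next to the picture, the sketch below computes roughly what each box draws: the 25th percentile, the median, and the 75th percentile of price for every cut, plus the interquartile range (IQR) that determines how far the whiskers can reach. It assumes dplyr is installed, the name `box_stats` is illustrative, and the values may differ slightly from the plotted ones because ggplot2 calculates its box edges a little differently.

```r
library(dplyr)
library(ggplot2)  # loaded here because it provides the diamonds dataset

box_stats <- diamonds %>%
  group_by(cut) %>%
  summarise(
    q1     = quantile(price, 0.25),  # bottom edge of the box
    median = median(price),          # line inside the box
    q3     = quantile(price, 0.75),  # top edge of the box
    iqr    = IQR(price)              # box height; whiskers reach at most 1.5 * iqr past the box
  )

box_stats
```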

Categorical by categorical

Two approaches are outlined in the text, and shared below:

ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))

Quick extension questions:

  • What’s easy to understand about the graph you just created?
  • What would make this graph easier to understand?

# count() and %>% come from dplyr, so load it first with library(dplyr)
diamonds %>%
  count(color, cut) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))

Quick extension questions:

  • Which of these two graphs do you prefer? Why?
  • When might one graph be preferable to use?

Continuous by continuous

We’ve done scatterplots before:

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))

Quick extension exercise:

We can also use two different types of bins in our graphs!

# smaller holds only the diamonds under three carats
smaller <- diamonds %>% filter(carat < 3)

ggplot(data = smaller) +
  geom_bin2d(mapping = aes(x = carat, y = price))

# install the hexbin package if you don't have it already:
# install.packages("hexbin")
ggplot(data = smaller) +
  geom_hex(mapping = aes(x = carat, y = price))

Quick extension exercise:

Now we can get really fancy and treat one of our continuous variables as if it were categorical!

ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

Pretty cool, right? What if we do this:

ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))

Quick extension question:

  • What changed between the two graphs?
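
One way to investigate is to look at the groups themselves rather than the plots. The sketch below assumes dplyr is loaded and rebuilds `smaller` so that it runs on its own; `by_width` and `by_number` are illustrative names. `cut_width()` splits carat into intervals of equal width, while `cut_number()` aims for intervals holding roughly the same number of diamonds.

```r
library(dplyr)
library(ggplot2)

smaller <- diamonds %>%
  filter(carat < 3)

# Equal-width groups: every interval spans 0.1 carats, but the counts vary a lot
by_width <- cut_width(smaller$carat, 0.1)
table(by_width)

# Roughly equal-count groups: 20 intervals whose widths vary instead
by_number <- cut_number(smaller$carat, 20)
table(by_number)
```

Comparing the two tables should make it easier to explain why the boxes in the second graph have such different widths.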

OK, now it’s your turn!

Additional datasets for EDA practice

Use one, some, or all of the datasets below to practice exploring your data!

Remember to look at the data in your dataset before you get started by using the following:

# you can look at the included datasets by using the following:
# some of the specific datasets in R that you could use are:
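
A minimal sketch of that first look, using only base R. The `mtcars` dataset is a stand-in chosen for illustration, not necessarily one of the datasets suggested here, and `available` is an illustrative name.

```r
# List the datasets that ship with base R
# (typing data() on its own in the console opens the full list)
available <- data(package = "datasets")$results[, c("Item", "Title")]
head(available)

# First look at one dataset before plotting anything
head(mtcars)     # first six rows
str(mtcars)      # column types and a preview of the values
summary(mtcars)  # quartiles and mean for each column
```

Running `?mtcars` in the console also opens the help page, which describes what each column means.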

You can find more datasets in this list here! Most, if not all, should exist in R without you having to download the dataset.

Be sure to share your thoughts, initial plots, and questions in our Week 04 Slack channel — there are so many people waiting to help out!

Like what you read? Give Jesse Maegan a round of applause.
