R4DS Week 5: it’s OK to feel lost sometimes

Introduction

I want to start this week off by acknowledging that if you’re brand new to R, the whole section on tibbles may have seemed a bit… abstract. It’s OK if you read the section on tibbles and went “so what?” or even “I’m lost.”

You’re not alone — even Ryan Reynolds isn’t sure about tibbles yet

Because these weekly round-ups are meant to help those who are new to R, we’re going to gloss over a lot of information that more experienced R users will argue should be in here. And that’s OK, because our goal is to tour the tidyverse landscape so that you know where the major landmarks are, rather than spending an inordinate amount of time on any one feature.


What in the heck is a tibble?

People generally have one of two reactions when they first hear the word ‘tibble’:

Option 1
Option 2: did you mean tribble?

We very much mean ‘tibble’, which is how one might pronounce the first half of tbl_df. You can read more about tibbles here.

So why would using a tibble matter? Enter the following code into R and run it line by line:

library(tidyverse)
iris
# Notice what happens when you run iris
# How easy/difficult is it to look at the data in this format?

as_tibble(iris)
# We've "wrapped" the iris dataset with as_tibble()
# This converts the iris data frame into a tibble
# How easy/difficult is it to look at the data in this format?
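If you're curious what as_tibble() actually changed, one quick check is to compare the classes of the two objects. This is a minimal sketch using only the built-in iris data; the name iris_tbl is just an illustration, not something from the book.

```r
library(tibble)  # also loaded by library(tidyverse)

# The built-in iris data set is a plain data frame
class(iris)

# as_tibble() returns the same data with extra tibble classes on top
iris_tbl <- as_tibble(iris)
class(iris_tbl)

# Both objects hold the same number of rows and columns
dim(iris)
dim(iris_tbl)
```

The classes of iris_tbl should include tbl_df, which is where the word ‘tibble’ comes from.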

Tibbles do a lot of really cool things that will ultimately make your life in R easier, but we’re not going to get into those things right now. Instead we’re going to acknowledge that tibbles exist, tibbles are awesome, and tibbles are an integral part of the tidyverse.

Alec Baldwin has some serious feelings about tibbles

Getting your data into R

All of the data sets that we’ve worked with so far have existed as a part of R. And while these data sets have been helpful for learning the basics of data science, you’re most likely going to need to analyze your own data sets, or data sets that someone else has given you.

To do this, we’re going to use the readr package, which allows us to bring data files into RStudio, where we can then work with them using tools like ggplot2 and dplyr. You can see the list of file types readr works with here, but know that we’re going to focus exclusively on reading in .csv files in this weekly round-up.

Probably the best Excel gif I’ve ever found

Get some data

For these next steps, feel free to use data that you already have. If you don’t have any data and don’t want to generate some dummy data, you can download the Demographic Statistics By Zip Code .csv file.

If you have data in an .xls or .xlsx format (standard Excel worksheets) and aren’t sure how to get it into a .csv format, you can follow the instructions here.

Note: you can also read .xls and .xlsx files directly into R using the readxl package.
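If you'd like to try readxl without converting anything, here is a minimal sketch. It assumes readxl is installed (it comes along when you install the tidyverse, but has to be loaded on its own) and it uses an example workbook that ships with the package rather than your own file. The names xlsx_path and sheet_data are made up for this example.

```r
library(readxl)  # not loaded by library(tidyverse), so load it separately

# readxl ships with a few example workbooks; this returns the path to one
xlsx_path <- readxl_example("datasets.xlsx")

# List the worksheets inside the workbook
excel_sheets(xlsx_path)

# Read the first worksheet into a tibble
sheet_data <- read_excel(xlsx_path)
sheet_data
```

To read your own workbook, you would swap xlsx_path for the name of your .xlsx file.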

Accessing your data

There are two approaches to this — the hard way and the easy way. I want to encourage you to take the easy way now and forever.

The easy way

Have you been using Projects in RStudio? No? Let’s start.

  1. Open up RStudio
  2. Go to File and select “New Project”
  3. Choose “New Directory” from the pop-up menu
  4. Choose “Empty Project”
  5. Give your directory (another way of saying folder) a name, such as test_data
  6. Browse to somewhere on your computer that you can find the folder easily, like the Desktop
  7. Click “Open in new session”
  8. Click “Create Project”
  9. Press Cmd/Ctrl + Shift + N to create a new R script in the top left corner of RStudio

OK. If you’ve followed along, you should have a set-up that looks something like this:

If your setup doesn’t look similar to this, try going through the steps again. If it’s still not working properly, reach out in the Slack group!

Now that you’ve got your Project set up, go to wherever you saved your data, and move a copy of the .csv file (or the .csv file itself) into the Project folder you just created.
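If you're not sure the file landed in the right place, you can ask R where it is looking and what it can see. This is a small sketch, not a required step; name_of_data_file.csv is a placeholder for whatever your file is called.

```r
# Where is R currently looking for files?
# Inside a Project this should be the Project folder.
getwd()

# List any .csv files in that folder
csv_files <- list.files(pattern = "\\.csv$")
csv_files

# Check for one specific file (placeholder name, replace with your own)
file.exists("name_of_data_file.csv")
```

If your file doesn't show up in the list, it probably isn't in the Project folder yet.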

Go to the R script in RStudio (the large block in the top left corner) and run the following:

library(tidyverse)
read_csv("name_of_data_file.csv")
# What happened?
data <- read_csv("name_of_data_file.csv")
data
# How is this different from the previous line of code?
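If you don't have a .csv file handy, the same two steps can be tried on a throwaway file. This is a sketch under a few assumptions: the data is made up purely for illustration, and it is written to a temporary file (rather than your Project folder) so nothing of yours gets touched. The names dummy, csv_path and dummy_back are hypothetical.

```r
library(readr)   # both packages are loaded by library(tidyverse)
library(tibble)

# A tiny made-up data set, purely for illustration
dummy <- tibble(
  id = c(1, 2, 3),
  group = c("a", "b", "a")
)

# Write it out as a .csv file in a temporary location
csv_path <- tempfile(fileext = ".csv")
write_csv(dummy, csv_path)

# Read the file and print it straight to the console
read_csv(csv_path)

# Read the file and store the result under a name
dummy_back <- read_csv(csv_path)
dummy_back
```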

Stretch exercises

  • Create additional .csv datasets (or use ones you already have), put them in the Project folder, and read them into R.
  • Create a new R script within the same project (Cmd/Ctrl + Shift + N). Do not read any data files into this script. Instead, type summary(data). What happened? How might this be useful? When might this be useful?

The hard way

The hard way involves setting the working directory with setwd() and working from there. I’m intentionally not going into depth on this, but if you would like more information, please reach out on Slack!
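For the curious, here is a bare-bones sketch of what that looks like. It uses tempdir() as a stand-in for the folder where your data lives, and old_dir is just a name chosen for this example.

```r
# Remember the current working directory so it can be restored later
old_dir <- getwd()

# Point R at a different folder; tempdir() stands in for your data folder
setwd(tempdir())
getwd()

# Relative file names passed to read_csv() would now be looked up here

# Switch back to where we started
setwd(old_dir)
getwd()
```

With a Project, RStudio handles this bookkeeping for you, which is a big part of why it's the easy way.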


Stretch exercises

  • Find a data set — one of your own, or one you find online — and read it into an RStudio Project
  • Look at the summary statistics for your data set (part of EDA)
  • Create a handful of graphs from your data set (part of EDA)
  • Develop a list of questions you’d like to answer based on what you’ve learned working with this new data set (part of EDA)
  • Reach out in the Slack group to share something you’ve learned or created, or to get assistance on any and all of the above steps!
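If you'd like a starting point for the summary statistics and graphs, here is one possible sketch. It uses the built-in iris data so that it runs as-is; you would replace iris with your own data set and pick your own columns. The name petal_plot is made up for this example.

```r
library(ggplot2)  # loaded by library(tidyverse)

# Summary statistics for every column
summary(iris)

# One graph: how are petal length and width related, by species?
petal_plot <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point()
petal_plot
```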

Still catching up? Here’s what we’ve done so far:

Week 1: Setting up your RStudio environment

Week 2: Data visualization

Week 3: Data transformations

Week 4: Exploratory data analysis

Like what you read? Give Jesse Maegan a round of applause.
