Selecting columns and renaming are so easy with dplyr

Kan Nishida
learn data science
Published in
6 min readMar 10, 2016

--

Talking about just selecting columns sounds boring, except it’s not with dplyr. I’m not going to try to convince you why it’s not, rather let’s start taking a look by doing.

We’ll use the same flight data we have imported last time. If you missed the post you might want to check that one here. There are 27 columns like below.

column names for flight data

Select columns by typing their names

First, let’s select columns by typing their names like you would normally guess.

flight %>% 
select(FL_DATE, CARRIER, ORIGIN, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME)
flight data with the selected columns

Note that the ‘%>%’ (pipe) passes data from the command before to the one after.

Now, what if you just want to remove some columns that you know you don’t need, instead of typing each columns to include? You can use “-” (minus) to drop columns.

flight %>% select(-CARRIER_DELAY, -WEATHER_DELAY, -`NA`)
flight data with a few columns dropped

All you need to do is to add ‘-’ (minus) right before the columns you want to drop. It’s that simple.

Notice that the last column name inside the ‘select()’ function where I’m using “`” (back-tick) to surround “NA” characters. This is because there are certain rules on which letters can be used or not for the column names. In this case, “NA” is not a valid name for the column so I had to use the back-ticks. In other scenarios where there is a space(s) in the column names you can use the back-ticks as well.

Select or Drop columns with Github data

Dropping columns with “-” is pretty cool, but there are more. To demonstrate better, let’s switch to this Github issue data that has all the issues for a project called ’Wu Tang — 36 chambers’ at Github. (This is a fictional project.)

When you import Github issues data using Github APIs, this is what you would get. I’m going to have a separate post about how to get data from Github through their APIs in a near future, so stay tuned. Anyway, we can’t really see what the heck of this data just by looking at the result returned by typing the data frame name. Let’s see the list of the columns first with ‘colnames()’ function.

Column names for Github project issues

As you see there are 86 columns, and there is no way I need all those columns for my analysis this time. When you look closer there are bunch of column names that start with the same text like ‘user.xxx’, ‘assignee.xxx’, etc.

The column names that start with ‘user.’ hold all the information about the person who entered the issues. And the same way for ‘milestone.’ and ‘pull_request.’ these all have the detail information about them. Since I don’t need these information for my immediate analysis I want to remove all these columns. With dplyr I can do such operation very quickly and easily. One of the convenient functions dplyr provides is called ‘starts_with()’, which would find the columns whose names start with given characters and return those columns. So I can use ‘starts_with()’ function inside ‘select()’ function to get the matching columns and then use ‘-’ (minus) to drop them all together like below.

github_issues %>% 
select(-starts_with("user."), -starts_with("milestone."), -starts_with("pull_"))
Github data with less columns

Now I have only 34 columns as opposed to the original 86 columns. But still there are more than I need. I find that there are bunch of columns whose names end with ‘url’, and I want to drop them all. I can do it with ‘ends_with()’ function this time.

github_issues %>% 
select(-starts_with("user."), -starts_with("milestone."),
-starts_with("pull_"), -ends_with("url"))
Github data with even less columns

Now I see five columns whose names start with ‘assignee.’. These are about the persons who are assigned to each issue so I want to keep this information for my analysis. But, I don’t think I need all of the columns, instead I just need a column that holds the name (or userid). Let’s look at these columns for a second by adding another step after the pipe with ‘select()’ function like below.

github_issues %>% 
select(-starts_with("user."), -starts_with("milestone."),
-starts_with("pull_"), -ends_with("url")) %>%
select(starts_with(“assignee”))
Github data with only assignee data

Looks like ‘assignee.login’ is the column that holds the assignee name information so I want to keep only this column. I can rename this ‘assignee.login’ column before removing all the columns that start with ‘assignee’ together.

dplyr provides ‘rename()’ function to, ah — , rename columns, so let’s insert a step before the second select step like below.

github_issues %>% 
select(-starts_with("user."), -starts_with("milestone."),
-starts_with("pull_"), -ends_with("url")) %>%
rename(developer = assignee.login) %>%
select(starts_with("assignee"))
Github data with assignee columns

Now the final result doesn’t include the ‘assignee.login’ column like before. That means, I can safely remove all these four columns whose names start with ‘assignee’ after the ‘rename’ step. To do that, I simply add ‘-’ (minus) right before the ‘starts_with()’ function inside the ‘select()’ function like below.

github_issues %>% 
select(-starts_with("user."), -starts_with("milestone."),
-starts_with("pull_"), -ends_with("url")) %>%
rename(developer = assignee.login) %>%
select(-starts_with("assignee"))

I can keep removing other unnecessary columns until getting a lot nicer view of data.

github_issues %>% 
select(-starts_with("user."), -starts_with("milestone."),
-starts_with("pull_"), -ends_with("url")) %>%
rename(developer = assignee.login) %>%
select(-starts_with("assignee"), -title, -comments, -locked, -labels, -id, -body)
Github data with only columns I need

Now, if I add count() clause at the end of the steps I can quickly see how many issues are assigned to who.

github_issues %>% 
select(-starts_with("user."), -starts_with("milestone."),
-starts_with("pull_"), -ends_with("url")) %>%
rename(developer = assignee.login) %>%
select(-starts_with("assignee"), -title, -comments, -locked, -labels, -id, -body) %>%
count(developer)
Count by developer

Yes, this count() function is super amazing. It groups the data by specified columns and count number of rows for each group. So if you’re interested in separating the issues between ‘close’ and ‘open’ state you can simply add ‘state’ into the ‘count()’ function like below.

github_issues %>% 
select(-starts_with("user."), -starts_with("milestone."),
-starts_with("pull_"), -ends_with("url")) %>%
rename(developer = assignee.login) %>%
select(-starts_with("assignee"), -title, -comments, -locked, -labels, -id, -body) %>%
count(state, developer) %>%
arrange(n)
Count by state and developer

Oh, Raekwon is definitely cooking! ;)

--

--

CEO / Founder at Exploratory(https://exploratory.io/). Having fun analyzing interesting data and learning something new everyday.