dplyr: Grammar of Data Manipulation in R

When you begin your journey in R (one of the big names in data analysis these days, along with Python), the first step after learning to load data sets is getting comfortable with the dplyr package.

Why dplyr?

Managing data frames is important. Although R already provides functions such as subset() to deal with them, it still takes a whole bunch of commands to achieve basic operations.

  1. dplyr has functions that correspond to the most common operations (group_by, summarise, mutate, filter, select and arrange).
  2. The key pieces are written in C++ to provide blazing fast performance.
  3. Tabular data is tabular data regardless of where it lives, so you should be able to use the same functions to work with it. With dplyr, anything you can do to a local data frame you can also do to a remote database table. PostgreSQL, MySQL, SQLite and Google BigQuery are supported (in current releases, the database backends live in the companion dbplyr package).

dplyr Function Properties:

The main functions in the package share a few common properties:

  1. The first argument is the data frame to operate on.
  2. The subsequent arguments describe what to do with the data frame. You can refer to columns in the data frame directly, without using the $ operator.
  3. The function returns a new data frame.

Example:

  subset <- select(IndiaBirthRate, city:population)

Here IndiaBirthRate is the name of the data frame, city:population is the range of columns selected (column names are used without the $ operator), and the returned result is a new data frame named subset.
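
To make the three properties concrete, here is a minimal sketch you can run as is. The contents of IndiaBirthRate are not shown in this post, so the data frame below (its columns and values) is invented purely for illustration, and the result name city_to_pop is a hypothetical choice as well.

```r
library(dplyr)

# Hypothetical data: column names and values are made up for illustration
IndiaBirthRate <- data.frame(
  city       = c("CityA", "CityB", "CityC"),
  state      = c("StateX", "StateX", "StateY"),
  population = c(100, 250, 400),
  birth_rate = c(18.5, 20.1, 16.3)
)

# Property 1: the first argument is the data frame.
# Property 2: columns are named directly (no IndiaBirthRate$city).
city_to_pop <- select(IndiaBirthRate, city:population)

# Property 3: the result is a new data frame; the original is unchanged.
names(city_to_pop)          # "city" "state" "population"
is.data.frame(city_to_pop)  # TRUE
ncol(IndiaBirthRate)        # 4
```

Note that city:population selects every column from city through population in the order they appear in the data frame, which is why state is included.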

Installing dplyr from CRAN:

  install.packages("dplyr")

Loading the package:

  library(dplyr)

Functions:

  1. rename(): rename columns in a data frame
  2. summarise(): generate summary statistics of different columns in the data frame
  3. mutate(): add new columns or transform existing columns
  4. filter(): extract a subset of rows from a data frame
  5. arrange(): reorder rows of a data frame
  6. select(): return a subset of the columns of a data frame
  7. %>%: the pipe operator, used to connect multiple actions together into a pipeline
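
As a quick preview, the sketch below chains several of these functions with the pipe. The city_births data frame and every column in it are hypothetical, invented only to have something to run the verbs on.

```r
library(dplyr)

# Hypothetical data, invented for illustration only
city_births <- data.frame(
  city       = c("CityA", "CityB", "CityC", "CityD"),
  state      = c("StateX", "StateX", "StateY", "StateY"),
  population = c(100, 250, 400, 50),
  births     = c(2, 10, 6, 1)
)

# Each step receives the data frame produced by the previous step
result <- city_births %>%
  rename(total_births = births) %>%                          # rename a column
  filter(population >= 100) %>%                              # keep some rows
  mutate(births_per_100 = total_births / population * 100) %>%  # add a column
  arrange(desc(births_per_100)) %>%                          # sort rows, highest first
  select(city, state, births_per_100)                        # keep some columns

# group_by() plus summarise() gives one row per group
by_state <- city_births %>%
  group_by(state) %>%
  summarise(total_population = sum(population))

result$city                # "CityB" "CityA" "CityC"
by_state$total_population  # 350 450
```

Without the pipe, the same steps would have to be written as nested calls or with a temporary variable for each intermediate result.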

I will cover the use of each of these functions in detail, with examples, in the next post.