dplyr: Grammar of Data Manipulation in R
When you begin your journey in R (the big boss of data analysis these days, along with Python), the first step after learning to load data sets is getting equipped with the dplyr package.
Why dplyr?
Managing data frames is important. Although R already provides functions like subset() to deal with them, it still takes a whole bunch of commands to achieve basic operations.
- dplyr has functions that correspond to the most common data manipulation operations.
- The key pieces are written in C++ to provide blazing fast performance.
- Tabular data is tabular data regardless of where it lives, so you should use the same functions to work with it. With dplyr, anything you can do to a local data frame you can also do to a remote database table. PostgreSQL, MySQL, SQLite and Google BigQuery are supported (in current releases, through the companion dbplyr package).
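To make the last point concrete, here is a minimal sketch of running the usual verbs against a database table. It assumes the DBI, RSQLite and dbplyr packages are installed alongside dplyr, and it uses an in-memory SQLite database filled with R's built-in mtcars data purely for illustration; the names con, remote_cars and local_result are made up for this example.

```r
library(dplyr)

# Assumption: DBI, RSQLite and dbplyr are installed in addition to dplyr.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a built-in data frame into the database so there is a table to query.
copy_to(con, mtcars, "mtcars")

# tbl() returns a reference to the remote table; no rows are pulled yet.
remote_cars <- tbl(con, "mtcars")

# The same verbs as for a local data frame; collect() runs the query
# in the database and brings the result back into R.
local_result <- remote_cars %>%
  filter(cyl == 4) %>%
  select(mpg, hp) %>%
  collect()

DBI::dbDisconnect(con)
```

The pipeline itself would look the same if remote_cars were an ordinary data frame; only the connection setup and the final collect() are specific to the database case.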
dplyr Function Properties:
The core dplyr functions share a few common properties:
- The first argument is the data frame to operate on.
- The subsequent arguments describe what to do with the data frame, and they can refer to its columns directly, without using the $ operator.
- The functions return a new data frame.
subset <- select(IndiaBirthRate, city:population)
Here IndiaBirthRate is the data frame, city:population selects the range of columns from city through population (the column names are used directly, without the $ operator), and the result is a new data frame named subset.
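Since IndiaBirthRate is not a data set that ships with R, the following is a small runnable sketch of the same call. The columns and values are invented stand-ins, and the result is stored as city_to_population rather than subset, only to avoid confusion with base R's subset() function.

```r
library(dplyr)

# Hypothetical stand-in for IndiaBirthRate; every value here is made up.
IndiaBirthRate <- data.frame(
  year       = c(2015, 2015, 2015),
  city       = c("A", "B", "C"),
  state      = c("X", "Y", "Z"),
  population = c(1000, 2000, 3000),
  birth_rate = c(18.5, 20.1, 17.3)
)

# First argument: the data frame. Second: a column range, written without $.
city_to_population <- select(IndiaBirthRate, city:population)

names(city_to_population)          # "city" "state" "population"
is.data.frame(city_to_population)  # TRUE: the result is a new data frame
ncol(IndiaBirthRate)               # 5: the original is left unchanged
```

All three properties show up here: the data frame comes first, the columns are named directly, and the result is a new data frame while the input stays as it was.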
Installing dplyr from CRAN:
install.packages("dplyr")
Importing the package:
library(dplyr)
Key dplyr functions:
- rename(): rename columns in a data frame
- summarise(): generate summary statistics for columns in the data frame
- mutate(): add new columns or transform existing columns
- filter(): extract a subset of rows from a data frame
- arrange(): reorder rows of a data frame
- select(): return a subset of the columns of a data frame
- %>%: the pipe operator, used to connect multiple actions together into a pipeline
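As a quick taste of how these fit together, here is a minimal sketch that chains the functions above with the pipe. It uses mtcars, a data set built into R, so nothing needs to be downloaded; the names result, weight and hp_per_wt are chosen only for this illustration.

```r
library(dplyr)

# mtcars ships with R, so no external data is needed.
result <- mtcars %>%
  filter(cyl == 4) %>%                 # keep only four-cylinder cars
  select(mpg, hp, wt) %>%              # keep three columns
  rename(weight = wt) %>%              # syntax is new_name = old_name
  mutate(hp_per_wt = hp / weight) %>%  # add a derived column
  arrange(desc(mpg))                   # sort by mpg, highest first

head(result, 3)

# summarise() collapses the data frame into a single row of statistics.
result %>%
  summarise(mean_mpg = mean(mpg), max_hp = max(hp))
```

Each step takes the data frame produced by the previous one as its first argument, which is why the pipe reads top to bottom like a recipe.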
I will cover the use of each of these functions, with examples, in the next post.