R libraries to get you started as a Data Scientist
If you are new to Data Science and want to learn how to use R to solve data science problems, this article lists the must-know R libraries that will help you get started. R is one of the most popular programming languages used in Data Science, along with Python, Java and Scala. First, I would recommend installing RStudio, the IDE (Integrated Development Environment) for R programming. RStudio gives you all the required features, such as a console, code editor, debugger and tools for plotting datasets and viewing history.
If you are one of the many people trying to learn Data Science on your own, you need access to datasets. Kaggle has tons of interesting datasets that you can work on. Now that you have installed RStudio and chosen a dataset, let’s talk about some of the important R libraries to get you started.
ggplot2 is an implementation of the Grammar of Graphics and the main data visualization library in R. Visualizations are an essential part of data science and analytics. With the help of ggplot2, you can easily plot static graphs of a single variable or of multiple variables, for both numerical and categorical data. You can also distinguish groups in the data by size, color and symbol. To install and load ggplot2, run install.packages("ggplot2") and then library(ggplot2) in the R console.
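As a minimal sketch of the grouping described above, the following plots two numerical variables and maps a categorical variable to color and another numerical variable to point size. It assumes the mtcars dataset that ships with R; the choice of columns and axis labels is only for illustration.

```r
# install.packages("ggplot2")  # run once if the package is not installed yet
library(ggplot2)

# Scatter plot of weight against fuel efficiency.
# Color encodes the number of cylinders (treated as a category),
# point size encodes horsepower.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl), size = hp)) +
  geom_point() +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    colour = "Cylinders",
    size = "Horsepower"
  )

print(p)
```

Swapping geom_point() for another geom, such as geom_boxplot() or geom_histogram() with a suitable aes() mapping, changes the chart type while the rest of the grammar stays the same.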
Data frames are the primary structure for storing tabular data in R. data.table is an R package that provides an enhanced version of data frames. Two of its main enhancements are speed and a cleaner syntax. A data table can process joins, indexes, assignments and groupings faster than a data frame. One reason is that many data frame operations copy the data, whereas data.table can modify columns in place (by reference). It is therefore worth considering data.table when the dataset is big, for example 10 GB or more.
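The sketch below illustrates assignment, grouping and a join in data.table syntax. It assumes the built-in mtcars dataset; the lookup table and the column names kpl, mean_mpg and label are hypothetical and exist only for this example.

```r
# install.packages("data.table")  # run once if the package is not installed yet
library(data.table)

# Convert a regular data frame to a data table, keeping row names as a column
dt <- as.data.table(mtcars, keep.rownames = "model")

# Assignment by reference with := (adds a column without copying the table)
dt[, kpl := mpg * 0.425144]  # miles per gallon to kilometres per litre

# Grouping: mean mpg and row count per number of cylinders
by_cyl <- dt[, .(mean_mpg = mean(mpg), n = .N), by = cyl]

# Join: attach a label to every row from a small lookup table
lookup <- data.table(cyl = c(4, 6, 8), label = c("four", "six", "eight"))
joined <- lookup[dt, on = "cyl"]

print(by_cyl)
```

The general form is dt[i, j, by]: filter rows in i, compute or assign in j, and group with by.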
dplyr is the essential data-manipulation package in R. It provides functions named as verbs, which are often combined with the group_by() function to perform the following types of data manipulation:
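mutate(): adds new columns or modifies existing ones
select(): picks columns by name
filter(): keeps the rows that match a condition
summarize(): collapses rows into summary values
arrange(): sorts rows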
In my personal experience, dplyr is one of the most used packages for data manipulation. You can install it in the same way, by running install.packages("dplyr") and then loading it with library(dplyr).
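To show how these verbs chain together, here is a small sketch on the built-in mtcars dataset. The threshold of 100 horsepower and the derived column wt_kg are arbitrary choices made for illustration.

```r
# install.packages("dplyr")  # run once if the package is not installed yet
library(dplyr)

summary_tbl <- mtcars %>%
  filter(hp > 100) %>%                 # keep cars with more than 100 hp
  select(mpg, cyl, wt) %>%             # keep only the columns we need
  mutate(wt_kg = wt * 453.592) %>%     # wt is in 1000 lbs; convert to kg
  group_by(cyl) %>%                    # one group per cylinder count
  summarize(
    mean_mpg = mean(mpg),
    mean_wt_kg = mean(wt_kg),
    n = n()
  ) %>%
  arrange(desc(mean_mpg))              # most efficient group first

print(summary_tbl)
```

Each verb takes a data frame and returns a data frame, which is what makes the pipeline readable from top to bottom.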
As the name suggests, tidyr is meant for tidying and cleaning data. It works best with data where each row represents an observation and each column is a feature/variable. tidyr comes in very handy for data cleaning, with functions like fill(), which populates missing cells from neighboring values, and replace_na(), which replaces missing values with a value of your choice. Some of its most important functions are gather(), separate() and spread(). Note that recent versions of tidyr recommend pivot_longer() and pivot_wider() as the successors of gather() and spread(), although the older functions still work.
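The following sketch runs the functions mentioned above on a tiny, made-up data frame. The scores table and all of its values are hypothetical and serve only to make the effect of each call visible.

```r
# install.packages("tidyr")  # run once if the package is not installed yet
library(tidyr)

# Made-up example data with missing student names and one missing score
scores <- data.frame(
  student = c("Ana", NA, "Ben", NA),
  term = c("t1", "t2", "t1", "t2"),
  score = c(90, NA, 75, 80),
  stringsAsFactors = FALSE
)

# fill() carries the last non-missing value downwards
filled <- fill(scores, student)

# replace_na() substitutes a chosen value for the remaining NAs
cleaned <- replace_na(filled, list(score = 0))

# spread() turns the term values into columns (newer code: pivot_wider())
wide <- spread(cleaned, key = term, value = score)

# gather() reverses that and returns to long form (newer code: pivot_longer())
long <- gather(wide, key = "term", value = "score", t1, t2)

# separate() splits one column into several at a separator
ids <- data.frame(id = c("a_1", "b_2"), stringsAsFactors = FALSE)
parts <- separate(ids, id, into = c("letter", "number"), sep = "_")

print(wide)
```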
Caret is short for Classification and Regression Training. Caret provides functions for training a machine learning model in both classification and regression problems. Caret helps you streamline the following steps:
Data Pre-processing: the preProcess() function helps with transformations such as centering, scaling and imputing missing data
Data Splitting: splitting data into training and testing sets, for example with the createDataPartition() function
Training the model: caret provides a common interface, the train() function, to a large variety of machine learning algorithms. You can have a look at the list here: http://topepo.github.io/caret/available-models.html
Caret also supports feature selection, parameter tuning and variable importance estimation, and it lets you plug in your own model.
To install caret, you can run install.packages("caret") in the R console and then load it with library(caret).
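The steps above can be sketched end to end as follows. This is only one possible workflow, assuming the built-in iris dataset, an 80/20 split, a k-nearest-neighbors model and 5-fold cross-validation; none of these choices come from a particular recommendation, and the seed is arbitrary.

```r
# install.packages("caret")  # run once if the package is not installed yet
library(caret)

set.seed(42)  # arbitrary seed so the split is reproducible

# Data splitting: stratified 80/20 split on the outcome
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[idx, ]
test_set <- iris[-idx, ]

# Pre-processing: learn centering and scaling from the training predictors,
# then apply the same transformation to both sets
pp <- preProcess(train_set[, 1:4], method = c("center", "scale"))
train_x <- predict(pp, train_set[, 1:4])
test_x <- predict(pp, test_set[, 1:4])

# Training: k-nearest neighbors, tuning k with 5-fold cross-validation
fit <- train(
  x = train_x,
  y = train_set$Species,
  method = "knn",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = data.frame(k = c(3, 5, 7))
)

# Evaluation on the held-out test set
pred <- predict(fit, newdata = test_x)
accuracy <- mean(pred == test_set$Species)
print(accuracy)
```

Changing the method argument of train() switches to a different algorithm from the list of available models, although some methods need an additional package to be installed.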
There are several other important R packages to know but that’s a story for another time.