R for Business Managers - Simple Data Manipulations for Large Datasets
Unable to use Excel because of large data size? Usage of data.table in R
You are having hard time in Excel doing a simple VLOOKUP because of large data? Or, your data has exceeded the row limit of Excel Sheet? Or, you want to automate steps of data manipulations that you often repeat? You also tried to learn Python and R but couldn’t afford time to go in detail. Language is not very intuitive? Then this post is for you, continue reading. If you don’t want to spend too much time reading introductions then just skip to section 4 onward.
I will use a library named data.table available in R. This is the fastest method available in R to perform calculations. data.table handles large scale data in jiffy. Based on this benchmarking study data manipulation with data.table is way faster than dplyr or python pandas.
1.0 Prepare
I assume you have installed R on your machine/laptop already. Since we are going to use data.table, so you will need to install this only ONE time. Run the following to install it.
install.packages("data.table")
2.0 Input, Output and Basic Data Manipulation
As a business user, you will have a csv or text file and you need to perform some data manipulation on that. If you have an Excel file, then better Save As csv file (reading csv or text file is faster in R). Let’s say you have following data (left table) in your csv file. You want to filter data only for item desktop. After filtering for desktop you will have just 1 row remaining
To do this small data operations in R, you will need to follow the following steps
- Load data.table library that you installed in section 1 in your R environment
- Read your csv file from your machine folder location
- Filter for the value desktop
- and, finally output or publish data back to your machine folder location.
Step 1: Load “data.table” library
Run following command in R. This will load the data.table library that you installed in section 1.
require(data.table)options("scipen"=100, "digits"=4)
Use Options just to make sure that numbers do not start showing up in scientific format. It looks ugly
Step 2: Read your data into R
Create a dummy CSV file with following data. Open an Excel sheet and manually enter data as following.
Save As your Excel file in CSV format in one of your machine folder location with the name dummy_data.csv. I have saved it in the following location “C:\Users\zishan\Downloads\Learning”
Now read this dummy_data.csv file by running the following in R.
my_rdata <- fread("C:/Users/zishan/Downloads/Learning/dummy_data.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
In this step we used fread to read data from corresponding folder location and we assigned this data to variable named my_rdata. Now your data is available in my_rdata within R environment. You too can view this data within R as well. Run type the following and run
#Simply type your data variable name and run
my_rdata#Or, you can also type the following and run to see first 5 rows of datahead(my_rdata)
Step 3: Filter the data
my_rdata[Items == "desktop"]
Step 4: Finally output or publish data back to your machine folder location
write.csv(my_rdata, "C:/Users/zishan/Downloads/Learning/dummy_data_output.csv", row.names = FALSE)
After running above code, you will see a new CSV file named dummy_data_output is created in the corresponding folder location. If you open that file, it will have a just one row.
Please note that we have run each code one by one. You can also run all of the above steps/code in one go too as following:
# Step 1 - Load "data.table" library
require(data.table)# Step 2 - Read your data into R
my_rdata <- fread("C:/Users/zishan/Downloads/Learning/dummy_data.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)# Step 3 - Sum the column sale_jan and sale_feb and create a new column named sale_total
my_rdata[Items == "desktop"]#Step 4 - Finally output or publish data back to your machine folder location with a new name "dummy_data_output.csv"
write.csv(my_rdata, "C:/Users/zishan/Downloads/Learning/dummy_data_output.csv", row.names = FALSE)
3.0 Create data on fly
Though in general we need to read a csv file in R to start working on it, but i will create data on fly in this tutorial to showcase how operations works. I am going to create the same CSV data as we created in section 2. Running the following code will generate the same data and assign it to variable my_fly_rdata.
#Step 1: Create data table on fly and let's give it a name my_fly_rdatamy_fly_rdata <- data.table(Items = c("laptop","desktop", "keyboards", "mouse"),
sale_jan = c(20, 87, 74, 38),
sale_feb = c(50, 100, 93, 28))#Display/View data
my_fly_rdata
Now one can perform the same operations as we did above:
# Step 2: Sum up values of two columnsmy_fly_rdata[, sale_total := sale_jan + sale_feb]#Step 3: Output the data on your machine just like beforewrite.csv(my_fly_rdata, "C:/Users/zishan/Downloads/Learning/dummy_fly_data_output.csv", row.names = FALSE)
I will continue creating data on fly for rest of the following tutorials. But in all practical purpose at your office, you will most likely be reading a csv or text file and performing all sort of data manipulations.
4.0 Filter or Subset data
library(data.table)
# Create sample data
animal_data <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 87, 74, 38),
count_feb = c(50, 100, 93, 28))#Display/View data
animal_data
5.1 Single Filter
# Filter the animal - Filter only for cat
cat_filter_data <- animal_data[animals == "cat"]#Display/View data
cat_filter_data
5.2 Multiple value Filters
# Filter the animal - Filter for cat and dog both
animal_filter_data <- animal_data[animals %in% c("cat","dog")]#Display/View data
animal_filter_data
5.0 Select and Concatenate Columns
5.1 Select specific set of Columns
library(data.table)
#Create sample data table
animal_report <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 40, 60, 80),
count_feb = c(10, 20, 30, 40))#Display/View data
animal_report#Select column animals and count_jan
selected_animal_report <- animal_report[, .(animals, count_jan)]#Display/View data
selected_animal_report
5.2 Filter/Subset and Select specific set of Columns
library(data.table)
# Create sample data
animal_data <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 87, 74, 38),
count_feb = c(50, 100, 93, 28))#Display/View data
animal_data# Filter only for animal "cat" and select column "count_feb"
cat_select_filter_data <- animal_data[animals == "cat", .(animals, count_feb)]#Display/View data
cat_select_filter_data
5.3 Concatenate columns
library(data.table)
#Define first table
demand_data <- data.table(years = c(2020,2020, 2020, 2021, 2021),
month = c(5, 5, 7, 12, 4),
date = c(20, 25, 23, 2, 30),
demand = c(100, 200, 300, 400, 100))#Concatenate the column years and month
demand_data[, index := paste0(years, "-", month)]#Display/View data
demand_data
6.0 Column Manipulations
For column manipulation we will be using data.table operator :=. If this is operator is used then we don’t need to use the assignment operator <-
6.1 Multiply values of a Column
Multiply 2nd column by 3
library(data.table)
#Define a data table
animal_report <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 40, 60, 80),
count_feb = c(10, 20, 30, 40))#Display/View data
animal_report
#Multiply column count_feb by 3
animal_report[, count_feb := count_feb*3]#Display/View data
animal_report
6.2 Multiply the values and Create a new Column
Multiply 2nd column by 3 and create a new column
#Define data table
animal_report <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 40, 60, 80),
count_feb = c(10, 20, 30, 40))#Display/View data
animal_report#Multiply column count_feb by 3 and get the data in a new_column
animal_report[, new_column := count_feb*3]#Display/View data
animal_report
6.3 Multiply, Divide and Subtract values in columns
Subtract half of count_feb from count_jan in a new column
library(data.table)
#Define a data table
animal_report <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 40, 60, 80),
count_feb = c(10, 20, 30, 40))
#Display/View data
animal_report#Multiply column count_feb by 3 and get the data in a new_column
animal_report[, new_column := count_jan - count_feb/2]#Display/View data
animal_report
6.4 Add a new column with certain value
Add reporting_year column
library(data.table)
#Create sample data table
animal_report <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 40, 60, 80),
count_feb = c(10, 20, 30, 40))#Display/View data
animal_report#Add a new column reporting_year with value 2019 value
animal_report[, reporting_year := 2019]#Display/View data
animal_report
6.5 Add a new column with value in specific row
Add reporting_year column and populate value only in for the animal cat
library(data.table)
#Create sample data table
animal_report <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 40, 60, 80),
count_feb = c(10, 20, 30, 40))#Display/View data
animal_report#Add a new column reporting_year with value 2019 value
animal_report[animals == "cat", reporting_year := 2019]#Display/View data
animal_report
6.6 Apply formula or update only selected row
Update rows where animals are not cats
library(data.table)
#Create sample data table
animal_report <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 40, 60, 80),
count_feb = c(10, 20, 30, 40))#Display/View data
animal_report#Update count_feb value where animals are not cat
animal_report[animals != "cat", count_feb := count_feb + count_jan]# View Data
animal_report
6.7 Perform Case when type of Formula using IFELSE
Perform case when cat then 2019; when dog then 2020; when elephant then 2018
library(data.table)
#Create sample data table
animal_report <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 40, 60, 80),
count_feb = c(10, 20, 30, 40))#Display/View data
animal_report#Perform case when formula
animal_report[, reporting_year := ifelse(animals == "cat", 2019, ifelse(animals == "dog", 2020, 2018))]#Display/View data
animal_report
7.0 Aggregate or Group by Columns — Type I
Aggregation means like sum, max, min etc. In Type 1 aggregation, we will sum up the data and append it in a new colum in the base data. Base data other columns and structure will remain the same
7.1 General aggregation grouped by a column
Sum up the data by Years
# Create sample data
demand_data <- data.table(years = c(2020,2020, 2020, 2021, 2021),
month = c(5, 5, 7, 12, 4),
date = c(20, 25, 23, 2, 30),
demand = c(100, 200, 300, 400, 100))# View Data
demand_data#Sum up the demand by Years and add back in the base data
demand_data[, tot_demand_by_years := sum(demand), by = .(years)]# View Data
demand_data
7.2 General aggregation grouped by a column applied to specific rows
# Create sample data
demand_data <- data.table(years = c(2020,2020, 2020, 2021, 2021),
month = c(5, 5, 7, 12, 4),
date = c(20, 25, 23, 2, 30),
demand = c(100, 200, 300, 400, 100))# View Data
demand_data#Sum up the demand excluding row where date is 25
demand_data[date != 25, tot_demand_by_years := sum(demand), by = .(years)]# View Data
demand_data
8. Aggregate or Group by Columns — II
In Type II aggregation, we will sum up the data and will not append it in the base data. Only summed up value column and group by column will show. In such case, you would need to drop off the reference operator := and replace it with dot
8.1 Single Column Grouping
library(data.table)
# Create sample data
demand_data <- data.table(years = c(2020,2020, 2020, 2021, 2021),
month = c(5, 5, 7, 12, 4),
date = c(20, 25, 23, 2, 30),
demand = c(100, 200, 300, 400, 100))#Group by Years column
year_group_data <- demand_data[, .(total_demand = sum(demand, na.rm = TRUE)), by = .(years)]#Display/View data
year_group_data
8.2 Multiple Columns Grouping
#Group by Years and Month column
multi_group_data <- demand_data[, .(total_demand = sum(demand, na.rm = TRUE)), by = .(years, month)]#Display/View data
multi_group_data
In place of sum, we can can also use other aggregate function like max, min in above code snippet
9.0 Append Data with different columns
library(data.table)
#Define first table
demand_data <- data.table(years = c(2020,2020, 2020, 2021, 2021),
month = c(5, 5, 7, 12, 4),
date = c(20, 25, 23, 2, 30),
demand = c(100, 200, 300, 400, 100))#Define second table
country_data <- data.table(years = c(2020,2020, 2021, 2023),
countries = c("IN", "JP", "CN", "DE"))#Append one data over another
appended_data <- rbind(demand_data, country_data, fill = TRUE)#Display/View data
appended_data
10.0 VLOOKUP
library(data.table)
#Define first table
demand_data <- data.table(years = c(2020,2020, 2020, 2021, 2021),
month = c(5, 5, 7, 12, 4),
date = c(20, 25, 23, 2, 30),
demand = c(100, 200, 300, 400, 100))#Define second table
country_data <- data.table(country_year = c(2020,2020, 2021, 2023),
countries = c("IN", "JP", "CN", "DE"))#Perform VLOOKUP
demand_data[, related_country := country_data[match(years, country_year), countries]]