R for Business Managers - Simple Data Manipulations for Large Datasets

Unable to use Excel because of large data size? Usage of data.table in R

Zishan Yusuf
Analytics Vidhya
Published in
11 min readJun 20, 2020

--

You are having hard time in Excel doing a simple VLOOKUP because of large data? Or, your data has exceeded the row limit of Excel Sheet? Or, you want to automate steps of data manipulations that you often repeat? You also tried to learn Python and R but couldn’t afford time to go in detail. Language is not very intuitive? Then this post is for you, continue reading. If you don’t want to spend too much time reading introductions then just skip to section 4 onward.

I will use a library named data.table available in R. This is the fastest method available in R to perform calculations. data.table handles large scale data in jiffy. Based on this benchmarking study data manipulation with data.table is way faster than dplyr or python pandas.

1.0 Prepare

I assume you have installed R on your machine/laptop already. Since we are going to use data.table, so you will need to install this only ONE time. Run the following to install it.

install.packages("data.table")

2.0 Input, Output and Basic Data Manipulation

As a business user, you will have a csv or text file and you need to perform some data manipulation on that. If you have an Excel file, then better Save As csv file (reading csv or text file is faster in R). Let’s say you have following data (left table) in your csv file. You want to filter data only for item desktop. After filtering for desktop you will have just 1 row remaining

To do this small data operations in R, you will need to follow the following steps

  1. Load data.table library that you installed in section 1 in your R environment
  2. Read your csv file from your machine folder location
  3. Filter for the value desktop
  4. and, finally output or publish data back to your machine folder location.

Step 1: Load “data.table” library

Run following command in R. This will load the data.table library that you installed in section 1.

require(data.table)options("scipen"=100, "digits"=4)

Use Options just to make sure that numbers do not start showing up in scientific format. It looks ugly

Step 2: Read your data into R

Create a dummy CSV file with following data. Open an Excel sheet and manually enter data as following.

Data to create CSV file

Save As your Excel file in CSV format in one of your machine folder location with the name dummy_data.csv. I have saved it in the following location “C:\Users\zishan\Downloads\Learning”

Now read this dummy_data.csv file by running the following in R.

my_rdata <- fread("C:/Users/zishan/Downloads/Learning/dummy_data.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

In this step we used fread to read data from corresponding folder location and we assigned this data to variable named my_rdata. Now your data is available in my_rdata within R environment. You too can view this data within R as well. Run type the following and run

#Simply type your data variable name and run
my_rdata
#Or, you can also type the following and run to see first 5 rows of datahead(my_rdata)
Output data

Step 3: Filter the data

my_rdata[Items == "desktop"]
Output

Step 4: Finally output or publish data back to your machine folder location

write.csv(my_rdata, "C:/Users/zishan/Downloads/Learning/dummy_data_output.csv", row.names = FALSE)

After running above code, you will see a new CSV file named dummy_data_output is created in the corresponding folder location. If you open that file, it will have a just one row.

Please note that we have run each code one by one. You can also run all of the above steps/code in one go too as following:

# Step 1 - Load "data.table" library
require(data.table)
# Step 2 - Read your data into R
my_rdata <- fread("C:/Users/zishan/Downloads/Learning/dummy_data.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
# Step 3 - Sum the column sale_jan and sale_feb and create a new column named sale_total
my_rdata[Items == "desktop"]
#Step 4 - Finally output or publish data back to your machine folder location with a new name "dummy_data_output.csv"
write.csv(my_rdata, "C:/Users/zishan/Downloads/Learning/dummy_data_output.csv", row.names = FALSE)

3.0 Create data on fly

Though in general we need to read a csv file in R to start working on it, but i will create data on fly in this tutorial to showcase how operations works. I am going to create the same CSV data as we created in section 2. Running the following code will generate the same data and assign it to variable my_fly_rdata.

#Step 1: Create data table on fly and let's give it a name my_fly_rdatamy_fly_rdata <- data.table(Items = c("laptop","desktop", "keyboards", "mouse"),
sale_jan = c(20, 87, 74, 38),
sale_feb = c(50, 100, 93, 28))
#Display/View data
my_fly_rdata
Output

Now one can perform the same operations as we did above:

# Step 2: Sum up values of two columnsmy_fly_rdata[, sale_total := sale_jan + sale_feb]#Step 3: Output the data on your machine just like beforewrite.csv(my_fly_rdata, "C:/Users/zishan/Downloads/Learning/dummy_fly_data_output.csv", row.names = FALSE)

I will continue creating data on fly for rest of the following tutorials. But in all practical purpose at your office, you will most likely be reading a csv or text file and performing all sort of data manipulations.

4.0 Filter or Subset data

library(data.table)
# Create sample data
animal_data <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 87, 74, 38),
count_feb = c(50, 100, 93, 28))
#Display/View data
animal_data
Output data

5.1 Single Filter

# Filter the animal - Filter only for cat
cat_filter_data <- animal_data[animals == "cat"]
#Display/View data
cat_filter_data
Output

5.2 Multiple value Filters

# Filter the animal - Filter for cat and dog both
animal_filter_data <- animal_data[animals %in% c("cat","dog")]
#Display/View data
animal_filter_data
Output

5.0 Select and Concatenate Columns

5.1 Select specific set of Columns

library(data.table)
#Create sample data table
animal_report <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 40, 60, 80),
count_feb = c(10, 20, 30, 40))
#Display/View data
animal_report
#Select column animals and count_jan
selected_animal_report <- animal_report[, .(animals, count_jan)]
#Display/View data
selected_animal_report
Output

5.2 Filter/Subset and Select specific set of Columns

library(data.table)
# Create sample data
animal_data <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 87, 74, 38),
count_feb = c(50, 100, 93, 28))
#Display/View data
animal_data
# Filter only for animal "cat" and select column "count_feb"
cat_select_filter_data <- animal_data[animals == "cat", .(animals, count_feb)]
#Display/View data
cat_select_filter_data
Output

5.3 Concatenate columns

library(data.table)
#Define first table
demand_data <- data.table(years = c(2020,2020, 2020, 2021, 2021),
month = c(5, 5, 7, 12, 4),
date = c(20, 25, 23, 2, 30),
demand = c(100, 200, 300, 400, 100))
#Concatenate the column years and month
demand_data[, index := paste0(years, "-", month)]
#Display/View data
demand_data
Output

6.0 Column Manipulations

For column manipulation we will be using data.table operator :=. If this is operator is used then we don’t need to use the assignment operator <-

6.1 Multiply values of a Column

Multiply 2nd column by 3

library(data.table)
#Define a data table
animal_report <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 40, 60, 80),
count_feb = c(10, 20, 30, 40))
#Display/View data
animal_report

#Multiply column count_feb by 3
animal_report[, count_feb := count_feb*3]
#Display/View data
animal_report
Output

6.2 Multiply the values and Create a new Column

Multiply 2nd column by 3 and create a new column

#Define data table
animal_report <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 40, 60, 80),
count_feb = c(10, 20, 30, 40))
#Display/View data
animal_report
#Multiply column count_feb by 3 and get the data in a new_column
animal_report[, new_column := count_feb*3]
#Display/View data
animal_report
Output

6.3 Multiply, Divide and Subtract values in columns

Subtract half of count_feb from count_jan in a new column

library(data.table)
#Define a data table
animal_report <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 40, 60, 80),
count_feb = c(10, 20, 30, 40))
#Display/View data
animal_report
#Multiply column count_feb by 3 and get the data in a new_column
animal_report[, new_column := count_jan - count_feb/2]
#Display/View data
animal_report
Output

6.4 Add a new column with certain value

Add reporting_year column

library(data.table)
#Create sample data table
animal_report <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 40, 60, 80),
count_feb = c(10, 20, 30, 40))
#Display/View data
animal_report
#Add a new column reporting_year with value 2019 value
animal_report[, reporting_year := 2019]
#Display/View data
animal_report

6.5 Add a new column with value in specific row

Add reporting_year column and populate value only in for the animal cat

library(data.table)
#Create sample data table
animal_report <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 40, 60, 80),
count_feb = c(10, 20, 30, 40))
#Display/View data
animal_report
#Add a new column reporting_year with value 2019 value
animal_report[animals == "cat", reporting_year := 2019]
#Display/View data
animal_report

6.6 Apply formula or update only selected row

Update rows where animals are not cats

library(data.table)
#Create sample data table
animal_report <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 40, 60, 80),
count_feb = c(10, 20, 30, 40))
#Display/View data
animal_report
#Update count_feb value where animals are not cat
animal_report[animals != "cat", count_feb := count_feb + count_jan]
# View Data
animal_report

6.7 Perform Case when type of Formula using IFELSE

Perform case when cat then 2019; when dog then 2020; when elephant then 2018

library(data.table)
#Create sample data table
animal_report <- data.table(animals = c("cat","dog", "elephants", "cat"),
count_jan = c(20, 40, 60, 80),
count_feb = c(10, 20, 30, 40))
#Display/View data
animal_report
#Perform case when formula
animal_report[, reporting_year := ifelse(animals == "cat", 2019, ifelse(animals == "dog", 2020, 2018))]
#Display/View data
animal_report
Output

7.0 Aggregate or Group by Columns — Type I

Aggregation means like sum, max, min etc. In Type 1 aggregation, we will sum up the data and append it in a new colum in the base data. Base data other columns and structure will remain the same

7.1 General aggregation grouped by a column

Sum up the data by Years

# Create sample data
demand_data <- data.table(years = c(2020,2020, 2020, 2021, 2021),
month = c(5, 5, 7, 12, 4),
date = c(20, 25, 23, 2, 30),
demand = c(100, 200, 300, 400, 100))
# View Data
demand_data
#Sum up the demand by Years and add back in the base data
demand_data[, tot_demand_by_years := sum(demand), by = .(years)]
# View Data
demand_data

7.2 General aggregation grouped by a column applied to specific rows

# Create sample data
demand_data <- data.table(years = c(2020,2020, 2020, 2021, 2021),
month = c(5, 5, 7, 12, 4),
date = c(20, 25, 23, 2, 30),
demand = c(100, 200, 300, 400, 100))
# View Data
demand_data
#Sum up the demand excluding row where date is 25
demand_data[date != 25, tot_demand_by_years := sum(demand), by = .(years)]
# View Data
demand_data

8. Aggregate or Group by Columns — II

In Type II aggregation, we will sum up the data and will not append it in the base data. Only summed up value column and group by column will show. In such case, you would need to drop off the reference operator := and replace it with dot

8.1 Single Column Grouping

library(data.table)
# Create sample data
demand_data <- data.table(years = c(2020,2020, 2020, 2021, 2021),
month = c(5, 5, 7, 12, 4),
date = c(20, 25, 23, 2, 30),
demand = c(100, 200, 300, 400, 100))
#Group by Years column
year_group_data <- demand_data[, .(total_demand = sum(demand, na.rm = TRUE)), by = .(years)]
#Display/View data
year_group_data
Output

8.2 Multiple Columns Grouping

#Group by Years and Month column
multi_group_data <- demand_data[, .(total_demand = sum(demand, na.rm = TRUE)), by = .(years, month)]
#Display/View data
multi_group_data

In place of sum, we can can also use other aggregate function like max, min in above code snippet

9.0 Append Data with different columns

library(data.table)
#Define first table
demand_data <- data.table(years = c(2020,2020, 2020, 2021, 2021),
month = c(5, 5, 7, 12, 4),
date = c(20, 25, 23, 2, 30),
demand = c(100, 200, 300, 400, 100))
#Define second table
country_data <- data.table(years = c(2020,2020, 2021, 2023),
countries = c("IN", "JP", "CN", "DE"))
#Append one data over another
appended_data <- rbind(demand_data, country_data, fill = TRUE)
#Display/View data
appended_data

10.0 VLOOKUP

library(data.table)
#Define first table
demand_data <- data.table(years = c(2020,2020, 2020, 2021, 2021),
month = c(5, 5, 7, 12, 4),
date = c(20, 25, 23, 2, 30),
demand = c(100, 200, 300, 400, 100))
#Define second table
country_data <- data.table(country_year = c(2020,2020, 2021, 2023),
countries = c("IN", "JP", "CN", "DE"))
#Perform VLOOKUP
demand_data[, related_country := country_data[match(years, country_year), countries]]

--

--