How do I transform a dataset from an Excel spreadsheet using R?

Recently, we were working on a group project in my Data Science Primer course. A dataset scraped from iFlix.com was given to us as an Excel spreadsheet, with one sheet per category, each tab named after the category. My task was to transform all these sheets into a single master dataset and export it as a .csv file.

We were later given a revised dataset in a master format, mapped with IMDB metadata; however, the original iFlix category was removed. Therefore, I decided to re-process the original dataset to recover the iFlix category and merge it with the revised version.

While I'm well versed in PHP, I found it tedious to process this in native PHP. I foresaw plenty of back-and-forth trial and error, and who knows, we might need to do some data cleaning or remove unnecessary columns along the way once we start the analysis at a later stage.

I found that R (read more about R) fits this role perfectly because:

  • I want something that can process and manipulate the dataset easily
  • I can do some fundamental (or advanced) programming on the dataset
  • I can experiment with the dataset and dump it if I find it irrelevant
  • I can export the result with ease
iFlix.com, titles under the category TV - Drama
(iflix_contents.xlsx) Excel spreadsheet of the scraped data, with title and description, one sheet per category
(iflix_contents_extra.xlsx) Excel spreadsheet of all iFlix titles mapped with IMDB metadata in a single sheet, but with the original iFlix category removed

Step 1. Read the Excel spreadsheet

First of all, we need to read the Excel file into the R environment. Note that the dataset has many tabs, each named with the respective category name. After some googling and reading, I decided to use the readxl package:

# install the required package
install.packages("readxl")
library(readxl)

# define a helper that reads every sheet into a named list of data frames
read_excel_allsheets <- function(filename) {
  sheets <- readxl::excel_sheets(filename)
  x <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X))
  names(x) <- sheets
  x
}

# read the Excel file
mysheets <- read_excel_allsheets("iflix_contents.xlsx")
console output of mysheets

Step 2. Load it into R data frame

Now that we have each sheet's dataset in an R data frame, we want to shape them into the following columns:

  • title (string)
  • desc (string)
  • type (string = TV | Movie | Studio)
  • ctgy (string): category of the title as stated in the sheet

With my minimal knowledge of R and limited time, I just wrote a for loop to process it (you will notice it is pretty much a PHP approach; share with me if there is a better way of doing this in R).

Note: I skipped the first tab, "information", which contains nothing I need.
# initialize the master data frame
df <- data.frame(title = character(), desc = character(), type = character())
# loop through the Excel sheets
for (i in seq_along(mysheets)) {
  if (names(mysheets[i]) != 'information') {
    # if the sheet name is prefixed with 'TV - ', then type = 'TV';
    # with 'Movie - ', type = 'Movie'; otherwise, 'Studio'
    if (grepl('Movie - ', names(mysheets[i]))) {
      mysheets[i][[1]]$type <- 'Movie'
      mysheets[i][[1]]$ctgy <- sub('Movie - ', '', names(mysheets[i]))
    } else if (grepl('TV - ', names(mysheets[i]))) {
      mysheets[i][[1]]$type <- 'TV'
      mysheets[i][[1]]$ctgy <- sub('TV - ', '', names(mysheets[i]))
    } else {
      mysheets[i][[1]]$type <- 'Studio'
      mysheets[i][[1]]$ctgy <- names(mysheets[i])
    }
    df <- rbind(df, mysheets[i][[1]])
  }
}
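For what it's worth, here is a sketch of a more idiomatic R alternative to the loop above, using lapply() and a single do.call(rbind, ...) instead of growing the data frame row by row. It assumes the same sheet-name conventions; tag_sheet and demo_sheets are made-up names, with a tiny stand-in list in place of the real mysheets so the snippet runs on its own:

```r
# tag one sheet's data frame with its type and category, based on the name
tag_sheet <- function(sheet, name) {
  if (grepl("^Movie - ", name)) {
    sheet$type <- "Movie"
    sheet$ctgy <- sub("^Movie - ", "", name)
  } else if (grepl("^TV - ", name)) {
    sheet$type <- "TV"
    sheet$ctgy <- sub("^TV - ", "", name)
  } else {
    sheet$type <- "Studio"
    sheet$ctgy <- name
  }
  sheet
}

# tiny stand-in for mysheets, just to show the shape
demo_sheets <- list(
  "information"    = data.frame(title = "x", desc = "y"),
  "TV - Drama"     = data.frame(title = "Descendants of the Sun", desc = "..."),
  "Movie - Horror" = data.frame(title = "The Conjuring", desc = "...")
)

# skip the "information" tab, tag each sheet, and bind everything at once
keep <- setdiff(names(demo_sheets), "information")
df_alt <- do.call(rbind, lapply(keep, function(n) tag_sheet(demo_sheets[[n]], n)))
```

Binding once at the end avoids the repeated copying that rbind-in-a-loop causes on larger datasets.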
preview of first transformed dataframe

This is what it looks like after consolidating 1,995 rows of data.

Step 3. Label the Top 200

When you take a closer look at the data, you will see plenty of duplicates: the same title falls under various categories, which is fine from a database point of view, but titles are also duplicated into the categories named "TV - Top 200" and "Movie - Top 200". It makes more sense to create a new column on the data frame and number the titles accordingly.

We could do that using PHP, but doing it in R is surprisingly easier than I thought. What we need to do is extract the Top 200 lists, loop through them, and then remove all the rows categorized under them.

# initialize the new rank column first (assigning into a
# non-existent column row by row can fail)
df$top200 <- NA
# define a df for the TV Top 200 - filter it out of the master df
tvtop200 <- df[df$type == 'TV' & df$ctgy == 'Top 200', ]
# loop through the Top 200 and assign each title its rank
for (i in 1:nrow(tvtop200)) {
  df$top200[df$title == tvtop200$title[i]] <- i
}
# repeat for the Movie Top 200
mvtop200 <- df[df$type == 'Movie' & df$ctgy == 'Top 200', ]
for (i in 1:nrow(mvtop200)) {
  df$top200[df$title == mvtop200$title[i]] <- i
}
# remove rows where ctgy == 'Top 200'
df <- subset(df, ctgy != 'Top 200')
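As an aside, each ranking loop can be replaced with a single vectorized match() call: the row order of the Top 200 slice already encodes the rank, and match() returns each title's first position in that slice (NA when the title is not in the Top 200). A sketch on toy data, where demo_df and demo_top200 are made-up stand-ins for df and tvtop200:

```r
# made-up master data frame: four rows, one of them a Top 200 entry
demo_df <- data.frame(
  title = c("A", "B", "C", "B"),
  type  = c("TV", "TV", "Movie", "TV"),
  ctgy  = c("Drama", "Comedy", "Horror", "Top 200"),
  stringsAsFactors = FALSE
)

# slice out the TV Top 200 (row order = rank)
demo_top200 <- demo_df[demo_df$type == "TV" & demo_df$ctgy == "Top 200", ]

# match() assigns the rank in one shot; non-Top-200 titles get NA
demo_df$top200 <- match(demo_df$title, demo_top200$title)

# drop the Top 200 rows themselves
demo_df <- subset(demo_df, ctgy != "Top 200")
```

The Movie slice would be handled the same way with a second match() call.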
preview of second transformed dataframe

Step 4. Merge with IMDB Metadata

Lastly, we read the IMDB-mapped spreadsheet and merge it with our data frame. Again, this is done easily with R, and with less code!

# read iflix_content_extra and drop duplicate rows
iflix_xtra <- read.csv('iflix_content_extra.csv')
iflix_xtra <- unique(iflix_xtra)
# merge the data on the title
df3 <- merge(df, iflix_xtra, by.x = 'title', by.y = 'Title_In_Iflix')
preview of final transformed dataframe
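One thing worth knowing about merge(): by default it performs an inner join, so any iFlix title without a match in the metadata is silently dropped. Passing all.x = TRUE keeps every row from the left data frame and fills the missing metadata with NA. A minimal sketch with made-up data:

```r
# made-up iFlix titles and metadata: "B" has no metadata row
demo_titles <- data.frame(title = c("A", "B"), ctgy = c("Drama", "Horror"))
demo_xtra   <- data.frame(Title_In_Iflix = "A", Rating = 8.1)

# default inner join: unmatched titles disappear
inner <- merge(demo_titles, demo_xtra, by.x = "title", by.y = "Title_In_Iflix")

# left join: every title kept, Rating is NA where there was no match
left <- merge(demo_titles, demo_xtra, by.x = "title", by.y = "Title_In_Iflix",
              all.x = TRUE)
```

Which behavior you want depends on whether titles missing from the metadata are worth keeping for the later analysis.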

To export to CSV:

# row.names = FALSE skips the extra row-number column in the output
write.csv(df3, file = "iflix_bryan.csv", row.names = FALSE)

You may now import the CSV into your data visualization tool, such as Tableau or Neo4j, for further analysis.

Bonus

When you look into the Genre column, you will notice it may contain more than one genre, delimited by a comma (,). So, if you would like to normalize this into several rows from a database point of view, you may use the following snippet:

# copy the master df, then split multi-genre rows into one row per genre
df4 <- df3
for (i in 1:nrow(df3)) {
  genre <- strsplit(as.character(df3$Genre[i]), ', ')
  # keep the first genre on the original row
  df4$Genre[i] <- genre[[1]][1]
  if (length(genre[[1]]) > 1) {
    # append a duplicate row for each additional genre
    for (j in 2:length(genre[[1]])) {
      df5 <- df3[i, ]
      df5$Genre <- genre[[1]][j]
      df4 <- rbind(df4, df5)
    }
  }
}
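The same normalization can also be done without the nested loops, using strsplit() together with rep() to repeat each row once per genre. A sketch on made-up data, where demo is a stand-in for df3:

```r
# made-up data frame: "A" carries two genres, "B" carries one
demo <- data.frame(title = c("A", "B"),
                   Genre = c("Action, Comedy", "Drama"),
                   stringsAsFactors = FALSE)

# split each Genre cell into a character vector of genres
genres <- strsplit(demo$Genre, ", ", fixed = TRUE)

# repeat each row once per genre, then overwrite Genre with the flat list
out <- demo[rep(seq_len(nrow(demo)), lengths(genres)), ]
out$Genre <- unlist(genres)
```

If you don't mind pulling in a package, tidyr::separate_rows() does this in a single call.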

Full Source file:

Anything to add? Do share with me in the comments :)