How do I transform dataset from Excel spreadsheet using R?

Bryan Soong
Dec 8, 2015 · 5 min read

Recently, we were working on a group project in my Data Science Primer Course. A dataset scraped from iFlix.com is given in an Excel spreadsheet, tabbed with the category name in each sheet. My task is to transform all these sheets into a single master dataset and export as .csv file.

We were later given a revised dataset in a master format and mapped with IMDB metadata, however the original iFlix category is removed. Therefore, I decided to re-process the dataset with iFlix category and merge with the revised version.

While I’m well versed in PHP, I find it is tedious to process this using native PHP coding and I foresee there will be plenty of back-and-forth trial-and-error works, and who knows we might need to do some data cleaning or remove unnecessary columns along the way when we do data analysis at later stage.

I find R (read more about R) just fit into this role perfectly because:

  • I want something that can process & manipulate the dataset easily
  • I can do some fundamental (or advanced) programming on the dataset
  • I can experiment the dataset and dump it if I find it irrelevant
  • I can export the resultant with ease
Image for post
Image for post
Image for post
Image for post
Image for post
Image for post

Step 1. Read the Excel spreadsheet

First of all, we need to read the Excel into R environment. Note that the dataset has many tabs and named with the respective category name. After some googling and reading, I decided to use readxl method:

# install req lib
install.packages(“readxl”)
library(readxl)
library(plyr)
# def fx
read_excel_allsheets <- function(filename) {
sheets <- readxl::excel_sheets(filename)
x <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X))
names(x) <- sheets
x
}
# read excel
mysheets <- read_excel_allsheets(“iflix_contents.xlsx”)
Image for post
Image for post

Step 2. Load it into R data frame

Now we already have the datasets of each sheet in a R data frame, we now want to shape it into following manner (column):

  • title (string)
  • desc (string)
  • type (string = TV | Movie | Studio)
  • ctgy (string) : category of title stated in the sheet

With my minimal knowledge of R and limited time, I just made a for loop to process it (if you notice it is pretty much a PHP approach, share with me if there is a better way in doing this with R)

Note: I skipped the first tab “information”, which something I don’t need

# init the master df
df <- data.frame(title = character(), desc = character(), type = character())
# loop through the xls sheets
for (i in 1:length(mysheets)) {
if (names(mysheets[i]) != ‘information’) {
# if ctgy prefix with ‘TV — ‘, then type = ‘TV’; otherwise, ‘Movie’
if (length(grep(‘Movie — ‘, names(mysheets[i]))) == 1) {
mysheets[i][[1]]$type = ‘Movie’
mysheets[i][[1]]$ctgy = sub(‘Movie — ‘, ‘’, names(mysheets[i]))
} else if (length(grep(‘TV — ‘, names(mysheets[i]))) == 1) {
mysheets[i][[1]]$type = ‘TV’;
mysheets[i][[1]]$ctgy = sub(‘TV — ‘, ‘’, names(mysheets[i]))
} else {
mysheets[i][[1]]$type = ‘Studio’
mysheets[i][[1]]$ctgy = names(mysheets[i])
}

df <- rbind(df, mysheets[i][[1]])
}
}
Image for post
Image for post

This is how it looks like after consolidating 1995 rows of data.

Step 3. Label the Top 200

When you take a closer look at the data, you will see there are plenty of duplicates, either the same title falls under various category which is fine from database point of view, but it also duplicated into a category named “TV - Top 200” and “Movie - Top 200”. It makes more sense to create a new column on the data frame and number it accordingly.

We can do that using PHP, but doing it in R is surprisingly easier than I thought. What we need to do is extract the top 200 list and loop through it, and then remove all the rows categorized within that.

# def a df for top200 - filter $type == ‘TV — Top200’ from the master df
tvtop200 <- df[df$type == ‘TV’ & df$ctgy ==’Top 200',]
# for loop the top200 & assign them as top200
for (i in 1:nrow(tvtop200)) {
df$top200[df$title==tvtop200$title[i]] <- i
}
mvtop200 <- df[df$type == ‘Movie’ & df$ctgy ==’Top 200',]
# for loop the top200 & assign them as top200
for (i in 1:nrow(mvtop200)) {
df$top200[df$title==mvtop200$title[i]] <- i
}
# remove rows where type == ‘Top 200’
df <- subset(df, ctgy!=’Top 200')
Image for post
Image for post

Step 4. Merge with IMDB Metadata

Lastly, we read the IMDB mapped spreadsheet and merge it with our data frame. Again, it is done easily using R, with less codes!

# iflix_content_extra
iflix_xtra <- read.csv(‘iflix_content_extra.csv’)
iflix_xtra <- unique(iflix_xtra)
# merge the data
df3 <- merge(df, iflix_xtra, by.x = “title”, by.y = ‘Title_In_Iflix’)
Image for post
Image for post

to export to csv:

write.csv(df3, file=”iflix_bryan.csv”)

You may now import the CSV to your data visualization tool such as Tableau or Neo4J for further analysis.

Bonus

When you look into Genre column, you will notice it may contains more than one genre delimited with comma (,). So, if you would like to normalize this into several rows from database point of view, you may take the following snippet:

df4 <- df3
for (i in 1:nrow(df3)) {
genre = strsplit(as.character(df3$Genre[i]), ‘, ‘)
df4$Genre[i] = genre[[1]][1]
if (length(genre[[1]]) > 1) {
for (j in 2:length(genre[[1]])) {
df5 <- df3[i,]
df5$Genre <- genre[[1]][j]
df4 <- rbind(df4, df5)
}
}
}

Full Source file:

Anything to add? Do share with me in the comments :)

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch

Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore

Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store