Big Data with Expedia Dataset

Shaula Andreinna A
Jul 17, 2020

Hello, future data scientists! ^^ It’s good to be back!

This time, I want to talk about Big Data!

What is Big Data?

Big data refers to the large, diverse sets of information that grow at ever-increasing rates. It encompasses the volume of information, the velocity or speed at which it is created and collected, and the variety or scope of the data points being covered. Big data often comes from multiple sources and arrives in multiple formats.

Big data can be categorized as unstructured or structured. Structured data consists of information already managed by the organization in databases and spreadsheets; it is frequently numeric in nature. Unstructured data is information that is unorganized and does not fall into a pre-determined model or format. It includes data gathered from social media sources, which help institutions gather information on customer needs.

Expedia Dataset

Expedia has provided logs of customer behavior. These include what customers searched for, how they interacted with search results (click/book), and whether or not the search result was a travel package. The data in this competition is a random selection from Expedia and is not representative of the overall statistics.

The train and test datasets are split based on time: the training data are from 2013 and 2014, while the test data are from 2015. The public/private leaderboard data are split based on time as well. The training data include all the users in the logs, with both click events and booking events. The test data include booking events only.

In this article, I will use the test data. You can download the data here.

Ok, that’s enough for the theory, let’s get to the code!

First, open RStudio, then load the dataset using the following code.

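# Load the test set (replace the path with the location of your own test.csv)
expedia<-read.csv("E://SHAULA DOCUMENT//SEM 6//DATA MINING//MEDIUM//Big Data//test.csv")
# Open the data in the RStudio viewer
View(expedia)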

This will take a little while. Just be patient :D

The output should look like this.

We can check how many rows and columns the data has using the following code.
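If the loading feels too slow, one thing that may help is telling read.csv the column types up front, so it doesn't have to guess them while reading every row. Here is a minimal sketch in base R. It writes a tiny stand-in CSV to a temporary file (the three columns and their values are made up for illustration), so swap in the path to your own test.csv when you try it.

```r
# Tiny stand-in for test.csv so the sketch runs anywhere (made-up values)
path <- tempfile(fileext = ".csv")
writeLines(c("id,srch_ci,srch_adults_cnt",
             "0,2015-09-03,2",
             "1,2015-09-24,2",
             "2,2015-06-26,4"), path)

# Step 1: read only the first rows to learn the column types
peek <- read.csv(path, nrows = 2, stringsAsFactors = FALSE)
types <- sapply(peek, class)

# Step 2: read the whole file with those types fixed in advance
expedia_small <- read.csv(path, colClasses = types)
dim(expedia_small)
```

One caveat: the types are guessed from the peeked rows only, so on the real file it is safer to peek at a few thousand rows rather than two.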

dim(expedia)

There are 2,528,243 rows and 22 columns.

We can look at the variables in the expedia data in detail.

str(expedia)

We can also see a summary of the data.

summary(expedia)

We can see the number of adults specified for the hotel room in each row.

expedia$srch_adults_cnt

From the output we can see that the first customer specified 2 adults. Let’s take a look at the first customer in detail.

expedia[1,]

The output should look like this.

Then, we can make a histogram of the number of adults specified for the hotel room.
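Printing a whole column prints far more values than anyone can read when the data has millions of rows. A small sketch of how head() can show just the first few values or rows instead; expedia_demo is a made-up stand-in for the real data frame, so the values are for illustration only.

```r
# Made-up stand-in for the expedia data frame (3 columns, 8 rows)
expedia_demo <- data.frame(id = 0:7,
                           srch_adults_cnt = c(2, 2, 4, 2, 2, 1, 2, 3),
                           srch_children_cnt = c(0, 0, 0, 0, 1, 0, 2, 0))

# First five values of one column instead of all of them
first_adults <- head(expedia_demo$srch_adults_cnt, 5)
first_adults

# First three rows of the whole data frame
head(expedia_demo, 3)
```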

hist(expedia$srch_adults_cnt)

It turns out that most hotel bookings are for 2 adults.

We can cross-tabulate the number of adults against the number of children using the following code.
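The histogram shows this by eye, but the same conclusion can be checked with exact counts. A minimal sketch, assuming a short made-up vector in place of expedia$srch_adults_cnt:

```r
# Made-up adult counts standing in for expedia$srch_adults_cnt
adults <- c(2, 1, 2, 4, 2, 3, 2, 1)

# Count how often each value occurs
counts <- table(adults)
counts

# Share of each value, rounded to 2 decimals
round(prop.table(counts), 2)

# The most frequent number of adults
names(counts)[which.max(counts)]
```

On the real data, replacing adults with expedia$srch_adults_cnt should give the exact count behind each bar of the histogram.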

table(expedia$srch_adults_cnt,expedia$srch_children_cnt)

We can also make a plot of the table like this.

plot(table(expedia$srch_adults_cnt,expedia$srch_children_cnt))
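Raw counts can be hard to compare when one group is much bigger than the others, so it may help to turn each row of the table into shares. A small sketch with made-up adults and children vectors standing in for the two expedia columns:

```r
# Made-up values standing in for srch_adults_cnt and srch_children_cnt
adults   <- c(2, 2, 1, 2, 1, 2)
children <- c(0, 1, 0, 0, 0, 2)

# Cross-tabulate adults (rows) against children (columns)
tab <- table(adults, children)
tab

# Share of each children count within each adults group (rows sum to 1)
shares <- prop.table(tab, margin = 1)
shares
```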

Let’s take a look at another variable. We can see how long the customer stays at the hotel using the following code.

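# Convert the check-in and check-out columns to dates
checkin<-as.Date(expedia$srch_ci)
checkout<-as.Date(expedia$srch_co)
# Length of stay = check-out minus check-in (in days)
hotel<-checkout-checkin
hotel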

The output should look like this. The first customer stays at the hotel for 4 days.

Next, we can build a data frame and count, for each combination of adults and children, how many customers stay at the hotel for 1 day.
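Real logs can contain empty or malformed dates, and printing millions of stay lengths is not very readable. Here is a hedged sketch of one way to handle both, using made-up date strings (one check-out is left empty on purpose) in place of the srch_ci and srch_co columns:

```r
# Made-up check-in/check-out strings; one check-out is empty on purpose
srch_ci <- c("2015-09-03", "2015-09-24", "2015-06-26")
srch_co <- c("2015-09-07", "2015-09-25", "")

# With an explicit format, values that don't match become NA
checkin  <- as.Date(srch_ci, format = "%Y-%m-%d")
checkout <- as.Date(srch_co, format = "%Y-%m-%d")

# Length of stay in days as plain numbers
nights <- as.numeric(checkout - checkin)
nights

# Summarise instead of printing every value
summary(nights)
```

The summary also reports how many stays are NA, which is a quick way to see how many dates could not be parsed.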

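# Combine adults, children, and length of stay into one data frame
input<-data.frame(adults=expedia$srch_adults_cnt,children=expedia$srch_children_cnt,hotel)
class(input)
# Count the 1-day stays for each combination of adults and children
xtabs(hotel==1~adults+children,data=input)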

The output should look like this. For example, there are 555,571 bookings in which 2 adults without children stay at the hotel for 1 day, and so on.

That’s all for today, hope it’s useful! ^^
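The xtabs call works because hotel==1 is TRUE or FALSE for every row, and xtabs adds up the TRUE values within each group. To make that concrete, here is a small sketch that gets the same counts a second way, by filtering first and then tabulating. The data frame and its column names (adults, children, stay) are made up for illustration.

```r
# Made-up stand-in data: adults, children, and length of stay in days
input <- data.frame(adults   = c(2, 2, 1, 2, 2),
                    children = c(0, 0, 0, 1, 0),
                    stay     = c(1, 4, 1, 1, 1))

# xtabs sums the TRUE values of stay == 1 within each group
by_xtabs <- xtabs(stay == 1 ~ adults + children, data = input)

# The same counts from filtering first, then tabulating
one_day <- input[input$stay == 1, ]
by_table <- table(one_day$adults, one_day$children)

# Both give the number of 1-day stays for 2 adults and 0 children
by_xtabs["2", "0"]
by_table["2", "0"]
```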
