Exploratory Data Analysis (EDA) using R

Neel Save
5 min read · Jul 31, 2019

Any dataset we work with contains a variety of information, often spread across thousands of rows and columns. Before applying a statistical or machine learning model, or using the data to test a hypothesis, it is necessary to understand the data. This is where Exploratory Data Analysis comes into the picture. EDA allows us to examine the data, find null or missing values, and look for trends and possible predictions that can be drawn from the data.

For performing Exploratory Data Analysis I have used RStudio. Here is a guide on how to install and set up RStudio.

For this tutorial I have selected the New York City Current Job Postings data, which can be obtained from here.

Once RStudio is set up, create a working directory on your computer. My directory is D:\Projects\R\NYC Job postings.

Step 1: Setting up RStudio for work

  1. Open RStudio, then go to File -> New File -> R Script.
  2. Set the working directory to the folder where you are going to store the New York City Current Job Postings data file.
  3. The working directory can be set by two methods:

a. Go to Session -> Set Working Directory -> Choose Directory and select the folder (for me it is D:\Projects\R\NYC Job postings)

or

b. Type the following command in the console (R expects straight quotes, and forward slashes or doubled backslashes in Windows paths)

setwd("D:/Projects/R/NYC Job postings")
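As a minimal sketch of what setting the working directory does, the snippet below switches to a folder, checks the result with getwd(), and switches back. The folder name nyc-job-postings under tempdir() is a hypothetical stand-in so the example runs on any machine.

```r
# Remember where we started so we can restore it afterwards.
old_dir <- getwd()

# Hypothetical project folder inside R's temporary directory.
project_dir <- file.path(tempdir(), "nyc-job-postings")
dir.create(project_dir, showWarnings = FALSE)

setwd(project_dir)   # relative file names are now resolved against this folder
getwd()              # confirm the current working directory
list.files()         # list the files R can see there

setwd(old_dir)       # go back to the original directory
```

file.path() builds the path with forward slashes, which avoids the backslash escaping problem on Windows.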

Step 2: Reading the data

Since the data is in .csv format, we use the read.csv function to read it into a variable called jobdata, which we will use throughout the tutorial.

jobdata <- read.csv("nyc-jobs.csv")
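One thing worth knowing about read.csv is that it rewrites column headers into valid R names, which is where a name like X..Of.Positions (used later in this tutorial) comes from. The sketch below assumes the original header was something like "# Of Positions"; the tiny CSV and its values are made up for illustration.

```r
# Write a small CSV to a temporary file so the example is self-contained.
# The header "# Of Positions" is an assumed example of a non-syntactic name.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Agency,# Of Positions",
  "DEPT OF PARKS,3",
  "POLICE DEPARTMENT,1"
), csv_path)

# stringsAsFactors = FALSE keeps text columns as character vectors.
demo <- read.csv(csv_path, stringsAsFactors = FALSE)

names(demo)   # headers after R has cleaned them up
head(demo)    # first few rows
```

Passing check.names = FALSE to read.csv would keep the original headers, at the cost of needing backticks to refer to them.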

The following command gives the number of rows and columns in the dataset:

dim(jobdata)

To get the number of rows:

nrow(jobdata)

To get details such as column names, the data type of each column and sample data:

str(jobdata)

To get the names of all the columns:

names(jobdata)
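To see what these inspection functions return without loading the full file, here is a small sketch on a made-up data frame. The name toy and its values are illustrative only.

```r
# A made-up data frame standing in for jobdata.
toy <- data.frame(
  Agency = c("A", "B", "A"),
  Salary.Range.From = c(50000, 62000, 48000),
  stringsAsFactors = FALSE
)

dim(toy)     # number of rows, then number of columns
nrow(toy)    # number of rows only
ncol(toy)    # number of columns only
str(toy)     # column names, types and the first few values
names(toy)   # column names as a character vector
```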
  1. Counting values in a column:

This counts how many times each value appears in the column.

table(jobdata$Agency)

From this we can see how many times each agency is mentioned.
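With many agencies the table is easier to read when sorted. A small sketch on made-up agency names:

```r
# Made-up agency values; in the tutorial this would be jobdata$Agency.
agency <- c("PARKS", "POLICE", "PARKS", "HEALTH", "PARKS", "POLICE")

counts <- table(agency)                           # frequency of each value
sorted_counts <- sort(counts, decreasing = TRUE)  # most frequent first

sorted_counts
head(sorted_counts, 2)                            # the two most frequent values
```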
  1. Finding NA values in the data frame

any(is.na(jobdata))

Output:
[1] TRUE

2. Determining the number of NA values

sum(is.na(jobdata))

Output:
[1] 961

3. Determining the number of NA values in a column (in this case Agency)

colSums(is.na(jobdata))

This shows the number of NA values in every column. For a single column:

sum(is.na(jobdata$Agency))

Output:
[1] 0
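The three checks above can be tried on a small made-up data frame where the missing values are easy to count by eye. The data here is illustrative, not taken from the job postings file.

```r
# Made-up data with three missing values in total.
toy <- data.frame(
  Agency = c("PARKS", "POLICE", NA),
  Salary = c(50000, NA, NA),
  stringsAsFactors = FALSE
)

any(is.na(toy))                       # is anything missing at all?
sum(is.na(toy))                       # total number of missing values
na_per_column <- colSums(is.na(toy))  # missing values per column
na_per_column
mean(is.na(toy$Salary))               # share of missing values in one column
```

Note that read.csv reads empty text cells as "" rather than NA unless na.strings is set, so is.na() may undercount blanks in text columns.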

Step 3: Working with Data

We are going to use the following libraries:

  1. tidyr
  2. dplyr
  3. lubridate
  4. ggplot2

This website is a great place to learn more about these libraries.

To install these libraries, run the following commands in the console.

install.packages("tidyr")
install.packages("dplyr")
install.packages("lubridate")
install.packages("ggplot2")

In order to use these libraries, we need to add the following code to our script:

library(tidyr)
library(dplyr)
library(lubridate)
library(ggplot2)

Let’s take a look at some common functions (the ones listed here all come from dplyr):

  • select(): Used for selecting specific columns
  • filter(): Used for selecting rows that satisfy a criterion (e.g. age > 50)
  • mutate(): Used for creating new columns from the information in existing columns
  • group_by(): Used to group the data before creating a summary
  • summarize(): Used for creating a statistical summary (functions like mean are used)
  • arrange(): Used to sort the rows, often after summarize()
  • count(): Used for counting discrete values
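As a rough sketch of how the functions listed above fit together, the example below chains most of them on a made-up data frame. The column names mirror ones used later in this tutorial, but the values are invented.

```r
library(dplyr)

# Made-up postings; values are for illustration only.
toy <- data.frame(
  Agency = c("PARKS", "POLICE", "PARKS", "HEALTH"),
  Salary.Range.From = c(40000, 55000, 60000, 45000),
  Salary.Range.To   = c(50000, 70000, 90000, 45000),
  stringsAsFactors = FALSE
)

summary_by_agency <- toy %>%
  filter(Salary.Range.From >= 45000) %>%                        # keep some rows
  mutate(range_width = Salary.Range.To - Salary.Range.From) %>% # add a column
  group_by(Agency) %>%                                          # one group per agency
  summarize(
    postings   = n(),                     # rows in each group
    mean_start = mean(Salary.Range.From), # average starting salary
    mean_width = mean(range_width)        # average width of the salary range
  ) %>%
  arrange(desc(mean_start))                                     # highest first

summary_by_agency

count(toy, Agency)   # number of rows per agency in the unfiltered data
```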

First, there are three ways to write queries with select() and filter():

  1. Intermediate steps:

jobdata2 <- filter(jobdata, salary < 50000)
jobdata_sml <- select(jobdata2, job_id, salary)

2. Nested functions:

jobdata_sml <- select(filter(jobdata, salary < 50000), job_id, salary)

3. Using pipes, %>% (this is the most widely used method and we are going to stick to it):

jobdata %>%
  filter(salary < 50000) %>%
  select(job_id, salary)
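A quick way to convince yourself that the three styles are equivalent is to run them on the same small data frame and compare the results. The data frame toy and its values are made up for this check.

```r
library(dplyr)

# Made-up data with the same column names as the queries above.
toy <- data.frame(job_id = 1:4, salary = c(42000, 58000, 49000, 75000))

# 1. Intermediate steps
step1 <- filter(toy, salary < 50000)
via_steps <- select(step1, job_id, salary)

# 2. Nested functions
via_nesting <- select(filter(toy, salary < 50000), job_id, salary)

# 3. Pipes
via_pipe <- toy %>%
  filter(salary < 50000) %>%
  select(job_id, salary)

identical(via_steps, via_nesting)   # same result?
identical(via_steps, via_pipe)      # same result?
```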

Here is how to perform analysis using dplyr

Let’s find postings with more than 50 openings

# X..Of.Positions is the column with the number of openings
jobdata %>%
  filter(X..Of.Positions > 50) %>%
  group_by(Agency)

Notice that we get all the columns in the output, which is not required. Here is how we select only the columns Agency, X..Of.Positions and Posting.Type:

jobdata %>%
  select(Agency, X..Of.Positions, Posting.Type) %>%
  filter(X..Of.Positions > 50) %>%
  group_by(Agency)

We can also find jobs that were posted on or after 2017-01-05

jobdata %>%
  select(Agency, Posting.Type, Posting.Date, Salary.Range.From) %>%
  filter(Posting.Date >= as.Date("2017-01-05")) %>%
  group_by(Agency)
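read.csv does not parse dates, so a date column arrives as text. A safer pattern is to convert it explicitly before comparing. The sketch below assumes the dates are stored as YYYY-MM-DD text; the data is made up, and the real file may need a different format string.

```r
library(dplyr)

# Made-up postings with the date stored as text, as read.csv would leave it.
toy <- data.frame(
  Agency = c("PARKS", "POLICE", "HEALTH"),
  Posting.Date = c("2016-12-31", "2017-01-05", "2018-03-14"),
  stringsAsFactors = FALSE
)

recent <- toy %>%
  # Assumed format: year-month-day. Adjust format = to match the real data.
  mutate(Posting.Date = as.Date(Posting.Date, format = "%Y-%m-%d")) %>%
  filter(Posting.Date >= as.Date("2017-01-05"))

recent
class(recent$Posting.Date)   # the column is now a Date
```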

To find job postings between two dates, the filter condition can be written as

filter(Posting.Date >= as.Date("2014-01-05") & Posting.Date <= as.Date("2014-01-10"))

The better way is to use between():

filter(between(Posting.Date, as.Date("2014-01-05"), as.Date("2014-01-10")))

This is what the full query looks like:

jobdata %>%
  select(Agency, Posting.Type, Posting.Date, Salary.Range.From) %>%
  filter(between(Posting.Date, as.Date("2014-01-05"), as.Date("2014-01-10"))) %>%
  group_by(Agency)
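between() includes both end points, so it matches the >= and <= version exactly. A small check on a made-up vector of dates:

```r
library(dplyr)

# Made-up dates around the range used above.
dates <- as.Date(c("2014-01-04", "2014-01-05", "2014-01-08",
                   "2014-01-10", "2014-01-11"))

lower <- as.Date("2014-01-05")
upper <- as.Date("2014-01-10")

in_range  <- between(dates, lower, upper)      # dplyr shortcut
long_form <- dates >= lower & dates <= upper   # explicit comparison

in_range
identical(in_range, long_form)   # both forms select the same dates
```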

Now, let’s look at openings in the last 30 days.

jobdata %>%
  select(Agency, Posting.Type, Posting.Date, Salary.Range.From) %>%
  filter(Posting.Date >= today() - days(30)) %>%
  group_by(Agency)

today() and days() are functions from the lubridate library.
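To see what the cutoff in that filter evaluates to, the date arithmetic can be run on its own. The fixed reference date below is only there to make the example reproducible.

```r
library(lubridate)

# The moving cutoff used in the query: 30 days before the current date.
cutoff <- today() - days(30)
cutoff

# The same arithmetic on a fixed, illustrative date.
reference <- as.Date("2019-07-31")
fixed_cutoff <- reference - days(30)
fixed_cutoff
```

Because today() changes every day, the query returns different rows depending on when it is run, and it returns nothing if the downloaded file has no postings from the last 30 days.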

Step 4: Plotting with ggplot2

Now let’s look at how to plot the data using a library called ggplot2.

I am using this particular library because it offers a variety of graphs as well as greater control over the appearance of the graph.

ggplot2 follows this syntax:

ggplot(data, aes(x = , y = )) +  # data = the data frame we use, in this case jobdata; x = and y = give the variables for the x and y axes
  geom_bar() +                   # the geom_ function decides the type of graph (point, bar, boxplot)
  theme()                        # theme() is used for adding customization to the plot
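Here is a minimal, runnable instance of that template on a made-up data frame, so the three parts (data and aesthetics, a geom, a theme tweak) can be seen working together. The data and the object name p are illustrative.

```r
library(ggplot2)

# Made-up postings: one row per posting.
toy <- data.frame(
  Agency = c("PARKS", "POLICE", "PARKS", "HEALTH", "PARKS"),
  stringsAsFactors = FALSE
)

p <- ggplot(toy, aes(x = Agency)) +                 # data and axis mapping
  geom_bar() +                                      # bar height = number of rows per agency
  theme(axis.text.x = element_text(angle = 90))     # rotate the x axis labels

# Printing the object draws the plot (RStudio does this for you in the console).
print(p)
```

With only x mapped, geom_bar() counts the rows for each value, which is why no y variable is needed for a count plot.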

I found this guide really helpful in understanding ggplot.

Here are some of the examples:

  1. Number of postings in an agency:

ggplot(jobdata, aes(x = Agency)) +
  geom_bar(fill = "red") +
  theme(axis.text.x = element_text(angle = 90, size = 5))

2. Plotting salaries offered by an agency.

ggplot(jobdata, aes(x = Agency, y = Salary.Range.To - Salary.Range.From)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, size = 5))

ggplot(jobdata, aes(x = Agency, y = Salary.Range.From)) +
  geom_line() +
  theme(axis.text.x = element_text(angle = 90, size = 5))

The first plot shows the width of each salary range (Salary.Range.To minus Salary.Range.From) as points. The second plot puts the starting salary (Salary.Range.From) on the y axis and uses geom_line() instead of geom_point(). The second graph makes it easier to see the spread of the starting salaries an agency offers across the jobs it has posted.
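The spread of starting salaries can also be checked numerically, which is a useful cross-check on what the plot suggests. This is a sketch using the dplyr functions from Step 3 on made-up data; the summary column names are my own.

```r
library(dplyr)

# Made-up starting salaries for two agencies.
toy <- data.frame(
  Agency = c("PARKS", "PARKS", "PARKS", "POLICE", "POLICE"),
  Salary.Range.From = c(40000, 52000, 61000, 55000, 58000),
  stringsAsFactors = FALSE
)

salary_spread <- toy %>%
  group_by(Agency) %>%
  summarize(
    lowest  = min(Salary.Range.From),
    middle  = median(Salary.Range.From),
    highest = max(Salary.Range.From),
    spread  = highest - lowest   # distance between highest and lowest start
  )

salary_spread
```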

What did you learn?

1. How to set up the RStudio environment

2. How to read data and perform a preliminary investigation of the data

3. How to find missing/NA values

4. How to use tidyr, dplyr and lubridate for data analysis

5. How to use ggplot2 for visualizing data
