Day 11: Explore data with R’s base plotting system

SaiGayatri Vadali
Feb 18, 2018

This article is the eleventh one in the series “Getting started with data science in 30 days using R programming”. You can find all other articles here.

Why exploratory data analysis?

Our understanding of the data is often incomplete even after cleaning it, transforming it and reducing its dimensions. Exploration helps us find out whether the data we obtained is useful for solving our use case, and it helps us design a workflow as we get more familiar with the data. This step might begin right after cleaning the data and putting it into a tidy format, or it might come after applying some transformations and dimensionality reduction techniques. In fact, a data scientist cannot know the right order of these tasks, or whether they are needed at all, without a deeper understanding of the data. Hence the importance of exploratory data analysis.

What is exploratory data analysis?

Exploratory data analysis involves more than graphs, plots and visualisations. No doubt graphs provide instant and clear information at a glance. But before starting to plot, there are certain things to keep in mind to make the exploration of data simpler and more organised.

  1. Run str() on the dataset. It tells you the class of each column.
  2. Then run View(), head() and tail() to view the whole table, the first records and the last records.
  3. Then run summary(). I already explained the importance of all these functions here. (I recommend going through that article again at this point.)
  4. Use the dplyr package at this stage to check whether your data meets the conditions you expect. For example, select() lets you pick out columns of your hospital patient data set, and filter() shows the rows whose values fall outside the correct range.
  5. Now come up with a simple, good question that is specific to your use case, and start plotting so that your plots answer that question.
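
The checklist above can be sketched on the built-in iris data set. This is a minimal sketch, not a fixed recipe: it uses base R's subset() as a stand-in for dplyr's filter() so that it runs without extra packages, View() is left out because it needs an interactive session, and the bounds 0 and 10 are an assumed valid range chosen only for illustration.

```r
# Steps 1-3: column classes, first/last records, per-column summary
str(iris)
head(iris, 3)
tail(iris, 3)
summary(iris)

# Step 4: range check on one column
# (0 to 10 cm is an assumed valid range, used only for illustration)
out_of_range <- subset(iris, Sepal.Length <= 0 | Sepal.Length > 10)
nrow(out_of_range)  # number of rows that violate the assumed range
```

If the count is not zero, those rows are worth a closer look before any plotting.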

BASIC PLOTTING SYSTEM:

Today, let’s see some of the basic plots which help us in our initial exploration.

1. Box plot:

The box plot is one of the most common plots. It can be thought of as a graphic representation of the summary() function. The line inside the box marks the median. The edges of the box mark the first and third quartiles, and the whiskers extend to the highest and lowest values, excluding outliers.

>boxplot(iris$Sepal.Length,col="Blue")
Sepal length of the flowers lies between about 4.3 and 7.9

Now let's create a plot with one box plot per species.

>boxplot(iris$Sepal.Length ~ iris$Species)
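
To see the link between a box plot and summary(), the numbers behind the plot can be inspected directly. A small sketch: calling boxplot() with plot = FALSE returns the statistics it would draw instead of drawing them. The variable names x and bp are only for illustration.

```r
x <- iris$Sepal.Length

# Numeric summary: min, quartiles, median, mean, max
summary(x)

# The statistics the box plot is built from, without drawing anything
bp <- boxplot(x, plot = FALSE)
bp$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out    # values that would be drawn as individual outlier points
```

The hinges are usually very close to the quartiles that summary() reports, but they are computed slightly differently, so small differences are possible.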

2. Violin plot:

Violin plots, almost a sister plot of the box plot, allow us to visualize the distribution of a numeric variable for one or more groups. They belong to the ggplot system rather than to base plotting. We will learn more about them in coming articles.

3. Histogram:

A histogram gives the frequency distribution of a quantitative variable: it splits the range of values into bins and shows how many observations fall into each one.

>hist(iris$Sepal.Length,col="green")

We can add a label to the x axis with the xlab argument and control the number of bins with the breaks argument. Now let us add more bins to the same plot.

>hist(iris$Sepal.Length,col="red",breaks=50)
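
As a hedged illustration of these arguments, the sketch below labels the axis and then inspects the bins that hist() actually chose. The label text and the value breaks = 20 are arbitrary choices for the example. Note that a single number passed to breaks is treated only as a suggestion, so the real number of bins can differ.

```r
x <- iris$Sepal.Length

# Histogram with an axis label, a title and a suggested number of bins
hist(x, col = "green", breaks = 20,
     xlab = "Sepal length (cm)",
     main = "Distribution of sepal length")

# The same computation without drawing, to inspect the bins
h <- hist(x, breaks = 20, plot = FALSE)
h$breaks  # bin boundaries that were actually used
h$counts  # number of observations in each bin
```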

There are other plots which show a distribution, like kernel density plots.

4. Kernel density plot:

The following code snippet gives a density plot:

> plot(density(iris$Sepal.Length))

The density plot shown above has a similar shape to the histogram. Both describe how the values are distributed: the histogram with one bar per bin, the density plot with a smoothed curve.
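
One way to compare the two directly is to draw the density curve on top of the histogram. This is a sketch under one assumption worth stating: the histogram must be drawn with freq = FALSE so that its bars are on the density scale, otherwise the curve and the bars would not be comparable. Colours and labels are arbitrary.

```r
x <- iris$Sepal.Length

# Histogram on the density scale (bar areas sum to 1)
hist(x, freq = FALSE, col = "lightgrey",
     xlab = "Sepal length (cm)",
     main = "Histogram with density curve")

# Kernel density estimate drawn over the bars
d <- density(x)
lines(d, col = "blue", lwd = 2)
```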

5. Scatter plot:

A scatter plot shows our data as points scattered across a two or three dimensional space. In base R, a two dimensional scatter plot can be drawn with just the plot() function.

>plot(iris$Sepal.Length,iris$Sepal.Width)

We can also place two plots side by side in the same graphics window. par(mfrow=c(1,2)) sets up a layout of one row and two columns:

>par(mfrow=c(1,2))
> plot(iris$Sepal.Length, iris$Petal.Length, # x variable, y variable
+ col = iris$Species, # colour by species
+ main = "Sepal vs Petal Length in Iris") # plot title
>
> plot(iris$Sepal.Length, iris$Sepal.Width, # x variable, y variable
+ col = iris$Species, # colour by species
+ main = "Sepal Length vs Width in Iris") # plot title
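
A slightly extended sketch of the same idea is given below. It adds two things that are not in the snippet above and are only suggestions: a legend, so the colours can be read, and a reset of the layout afterwards, so later plots are not squeezed into half the window. When a factor such as iris$Species is passed to col, its integer codes 1, 2, 3 are used as colour numbers, which is why the legend uses the same codes.

```r
# Remember the current layout, then switch to 1 row x 2 columns
old_par <- par(mfrow = c(1, 2))

plot(iris$Sepal.Length, iris$Petal.Length,
     col = iris$Species,
     xlab = "Sepal length", ylab = "Petal length",
     main = "Sepal vs Petal Length")
legend("topleft",
       legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)),  # same codes as col = iris$Species
       pch = 1)

plot(iris$Sepal.Length, iris$Sepal.Width,
     col = iris$Species,
     xlab = "Sepal length", ylab = "Sepal width",
     main = "Sepal Length vs Width")

# Restore the previous layout
par(old_par)
```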

6. Viewing data fitted with a model:

Now that we can represent our data in a better way, why not apply a machine learning model to it and see how we can plot the result? Let’s plot a linear regression model. The fitted line can be added with a single line of code. We will learn about linear regression very soon.

> plot(cars,xlab="speed",ylab="distance travelled",col="red")
>
> linearregression <- lm(dist~speed,data=cars)

The model is fit. Let’s have a glance at it using our plot function.

> plot(cars,xlab="speed",ylab="distance travelled",col="red")
> abline(linearregression,lty="dashed",col="blue")
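
Beyond drawing the line, the fitted model can be inspected and used for prediction. The following is a minimal sketch, not the only way to do it: the names fit and new_speeds are made up for the example, and the speeds 10 and 20 are arbitrary. In the cars data set, speed is on the x axis and dist (the stopping distance) is on the y axis.

```r
# Fit stopping distance as a linear function of speed
fit <- lm(dist ~ speed, data = cars)

# Intercept and slope of the fitted line
coef(fit)

# Scatter plot of the data with the fitted line on top
plot(cars, xlab = "speed (mph)", ylab = "stopping distance (ft)", col = "red")
abline(fit, lty = "dashed", col = "blue")

# Predicted stopping distance for two example speeds
new_speeds <- data.frame(speed = c(10, 20))
predictions <- predict(fit, newdata = new_speeds)
predictions
```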

Thus, we can view the models which we will be training in the coming days, drawn over the scatter plot of the data points they were fitted to.

7. Saving the image to device:

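Once we are done with plotting, we might want to save the plot to a file from the R command line itself. The png() function helps us do that: it opens an image file as the graphics device, and everything plotted afterwards is written to that file. Don’t forget to add dev.off() at the end, which closes the newly created image file.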

> png("iris.png", width = 500, height = 500, res = 72)
>
> plot(iris$Sepal.Length, iris$Petal.Length,
+ col = iris$Species,
+ main = "Sepal vs Petal Length in Iris")
>
> dev.off()
RStudioGD
2
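
The lines printed after dev.off() are its return value: the name and number of the graphics device that becomes current once the file is closed. A hedged variant of the same steps is sketched below. Writing to tempdir() and the name out_file are choices made only for this example, and invisible() is used simply to hide that return value.

```r
# Write to a temporary folder so the example does not clutter the working directory
out_file <- file.path(tempdir(), "iris.png")

# Open the PNG device, draw, then close the device to finish the file
png(out_file, width = 500, height = 500, res = 72)
plot(iris$Sepal.Length, iris$Petal.Length,
     col = iris$Species,
     main = "Sepal vs Petal Length in Iris")
invisible(dev.off())

# Check that the file was actually written
file.exists(out_file)
```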

In the coming articles, let’s have a glance at other plotting systems like lattice and ggplot2. They also help us in adding more value to our insights from the data.

Did I add something to your knowledge base of R and data analysis? Then why stop? Just clap as much as you want so that it reaches more people!


SaiGayatri Vadali

An inquisitive Machine Learning Engineer, yoga trainer, fitness freak and a passionate writer!