Data visualization (Pic credit : Google Images )

“R” you analytical?

Shivee Gupta
Jun 28, 2015

Data visualization using R (the statistical programming environment).

This article is for those souls searching for the meaning of life. I’m kidding. This article gives a brief hands-on introduction to data visualization for statistical analysis using R.

I’ll assume you already have R up and running on your system and know the basic building blocks of R (how convenient :P ). If not, Google is your friend. The data I’m using comes from the UC Irvine Machine Learning Repository. It is the “Individual household electric power consumption Data Set”. OK, so let’s get down and play around with this data.

Load the data :

You must have noticed that the data set is a text file with columns separated by a ‘;’ (semicolon). In other words, it’s in a tabular format. Also note the header row, which we want to retain. This is how we read a table from a file …

tab <- read.table("household_power_consumption.txt", sep = ";", na.strings = "?", header = TRUE)

The function read.table reads the file as a table and creates a data frame from it. Its arguments tell it how to interpret the file. The call above is mostly self-explanatory, but I’ll state the obvious anyway.

The first argument is the name of the file that holds our data set. The argument sep gives the character that separates the columns in the file. Next is na.strings, which says that a “?” in a column should be treated as a missing value (NA). Finally, header = TRUE ensures that the column names are read from the first line of the file.
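To see what these arguments do without downloading the full file, here is a minimal sketch that parses a few made-up rows in the same layout. The sample values and the names sample_text and demo are illustrative assumptions, not taken from the real data set.

```r
# A few made-up rows in the same semicolon-separated layout (illustrative only)
sample_text <- "Date;Time;Voltage;Sub_metering_1
16/12/2006;17:24:00;234.84;0
16/12/2006;17:25:00;?;1
16/12/2006;17:26:00;233.29;?"

# text = lets read.table parse a string instead of a file on disk
demo <- read.table(text = sample_text, sep = ";", na.strings = "?",
                   header = TRUE)

print(demo)
# Because "?" became NA, the measurement columns are still read as numbers
print(sapply(demo, class))
```

Without na.strings = "?", a column containing “?” would be read as text rather than numbers, which is why that argument matters for this file.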

Ask a question :

Now you can view summary statistics for the loaded table with the command

summary(tab)
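As a rough illustration of what to look for in that output, the sketch below runs summary on a tiny made-up data frame (the name toy and its values are assumptions for demonstration) and also counts the missing values per column.

```r
# Tiny made-up data frame with one missing value per column (illustrative only)
toy <- data.frame(Voltage = c(234.84, NA, 233.29),
                  Sub_metering_1 = c(0, 1, NA))

# Per-column minimum, quartiles, mean, maximum and number of NAs
print(summary(toy))

# Missing values per column, as a plain named vector
na_counts <- colSums(is.na(toy))
print(na_counts)
```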

The data set contains variables such as date, time, voltage and sub-metering values, ready for us to analyze. So let’s say we want to compare the Sub_metering_1, Sub_metering_2 and Sub_metering_3 values. Even better? Let’s pick the values for particular days, say the 1st and 2nd of Feb 2007. For a comparison, what’s better than plotting all those values together? Shall we?

First, let’s break the above question down into smaller sub-problems.

  1. Convert the date & time columns in the table into a format suitable for subsetting.
  2. Subset the data for the two specific days (2007-02-01 and 2007-02-02).
  3. Plot the sub-metering values and annotate the graph.

Handle data formats :

We want to subset the data based on time. But the columns read from the file are strings, not date-time values. So let’s combine the date and time columns into one date-time column in the POSIX format.

tab$dateTime <- paste(tab$Date, tab$Time)
tab$dateTime <- as.POSIXct(strptime(tab$dateTime, "%d/%m/%Y %H:%M:%S"))

The first instruction combines the two columns into one string column, and the second converts that string into a proper date-time value that can be compared and used for subsetting.
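Here is the same two-step conversion on a toy data frame, as a minimal sketch. The rows are made up, and tz = "UTC" is an assumption added here so the result does not depend on your machine’s time zone; the commands above use the local time zone instead.

```r
# Two made-up readings in the data set's day/month/year layout (illustrative)
toy <- data.frame(Date = c("1/2/2007", "16/12/2006"),
                  Time = c("00:00:00", "17:24:00"),
                  stringsAsFactors = FALSE)

# Step 1: glue date and time into one string per row
stamp <- paste(toy$Date, toy$Time)

# Step 2: parse the string with an explicit format.
# tz = "UTC" is an assumption made for reproducibility.
toy$dateTime <- as.POSIXct(strptime(stamp, "%d/%m/%Y %H:%M:%S", tz = "UTC"))

print(toy$dateTime)
print(class(toy$dateTime))
```

Note that the format string starts with %d/%m: the file stores the day first, so “1/2/2007” is the 1st of February, not the 2nd of January.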

Subset the data :

Now all we need to do is pick the readings for the two dates we care about, the 1st and 2nd of Feb 2007. Here it goes …

mysub <- subset(tab, dateTime >= as.POSIXct("2007-02-01 00:00:00") & dateTime < as.POSIXct("2007-02-03 00:00:00"))

The above command keeps the rows of ‘tab’ from midnight at the start of Feb 1 up to, but not including, midnight at the start of Feb 3, and stores them in ‘mysub’.
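The boundary behaviour is easy to check on a toy example. In this sketch the four timestamps, the names from, to and feb_days, and tz = "UTC" are all assumptions chosen for illustration.

```r
# Four made-up readings straddling both boundaries (illustrative only)
stamps <- as.POSIXct(c("2007-01-31 23:59:00", "2007-02-01 00:00:00",
                       "2007-02-02 23:59:00", "2007-02-03 00:00:00"),
                     tz = "UTC")
toy <- data.frame(dateTime = stamps, Sub_metering_1 = c(5, 6, 7, 8))

from <- as.POSIXct("2007-02-01 00:00:00", tz = "UTC")
to   <- as.POSIXct("2007-02-03 00:00:00", tz = "UTC")

# >= on the lower bound and < on the upper bound: a half-open interval
feb_days <- subset(toy, dateTime >= from & dateTime < to)
print(feb_days)
```

Only the two middle rows survive: the reading at exactly midnight on Feb 1 is kept, while the one at exactly midnight on Feb 3 is dropped.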

Plot the graph :

We are using the ggplot2 library to plot the graph. It has several advantages over the base plotting system: you can store the plot as an object and keep adding layers to it, add smoothing overlays such as loess, and get faceting and legends with little effort. The list goes on.

So let’s save our graph as a png file. For this we open a ‘png’ plotting device and set its size and background with the following command.

png("plot.png", height = 520, width = 520, bg = "transparent")

With the above command our plot will be saved as ‘plot.png’ with dimensions of 520*520 px and a transparent background.
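If the device mechanism is new to you, this separate sketch shows the whole open, draw, close cycle with a base plot. It writes to a temporary file (out_file is a name chosen here for illustration) and checks capabilities("png") first, since some headless R installations cannot open a png device.

```r
# Write to a temporary location so nothing in the working directory is touched
out_file <- file.path(tempdir(), "demo_plot.png")

if (capabilities("png")) {
  png(out_file, height = 520, width = 520, bg = "transparent")  # open device
  plot(1:10, (1:10)^2, type = "l")   # everything drawn now goes to the file
  dev.off()                          # close device; the file is written
  print(file.exists(out_file))
}
```

Anything drawn between png() and dev.off() ends up in the file rather than on screen, which is the pattern the rest of this article follows.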

Now let’s proceed with plotting the graph. Don’t forget to load the ggplot2 library first with library(ggplot2).

  • Create the ggplot2 object that defines the plot frame. In the command below we create the object over the ‘mysub’ data set, with ‘dateTime’ on the x-axis and ‘Sub_metering_1’ on the y-axis. The y mapping here only acts as a default: each line layer added below supplies its own y values, and ggplot2 sizes the axis to cover all of them. Colours passed directly to ggplot() are ignored, so the line colours are set later with scale_color_manual.
obj <- ggplot(mysub, aes(x = dateTime, y = Sub_metering_1))
  • Plot the Sub_metering_1 values. The following command adds a line layer to the ggplot object ‘obj’. geom_line is the ggplot2 function that draws a line graph; it takes the data set and an aes() mapping for both axes. The color = "Sub_metering_1" inside aes() is not a colour but a label: it names this line’s entry in the legend.
obj <- obj + geom_line(data = mysub, aes(x = dateTime, y = Sub_metering_1, color = "Sub_metering_1"))

The same goes for sub-metering values 2 & 3.

obj <- obj +
  geom_line(data = mysub, aes(x = dateTime, y = Sub_metering_2, color = "Sub_metering_2")) +
  geom_line(data = mysub, aes(x = dateTime, y = Sub_metering_3, color = "Sub_metering_3"))
  • Annotate the legend. To set the colours of the lines (and of their legend entries) we use scale_color_manual; the colours are matched to the legend labels in alphabetical order. The legend title is set with labs().
obj <- obj + scale_color_manual(values = c("black", "red", "blue")) + labs(colour = "Sub metering values")
  • Define the axis labels.
obj <- obj + xlab("Date & Time") + ylab("Energy submetering")
  • So far we have only built and annotated the ggplot object; nothing has been drawn yet. Typing ‘obj’ at the console implicitly calls print(), which draws the graph on the png device we opened earlier. (Inside a function or a loop, call print(obj) explicitly.)
obj

And then we close the plotting device with the following function.

dev.off()
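As an aside, a common alternative to adding one geom_line per column is to reshape the data into long format and let a single layer colour the lines by meter. This is only a sketch under stated assumptions, not the method used above: the toy data frame stands in for ‘mysub’ with made-up values, the names toy, meters, long and p are chosen here for illustration, and the plotting part runs only if ggplot2 is installed.

```r
# Toy stand-in for 'mysub' (made-up values, illustrative only)
toy <- data.frame(
  dateTime = as.POSIXct("2007-02-01 00:00:00", tz = "UTC") + 60 * (0:3),
  Sub_metering_1 = c(0, 1, 38, 37),
  Sub_metering_2 = c(1, 2, 1, 0),
  Sub_metering_3 = c(17, 18, 17, 0)
)

# Stack the three meter columns into long format: one row per reading
meters <- c("Sub_metering_1", "Sub_metering_2", "Sub_metering_3")
long <- data.frame(
  dateTime = rep(toy$dateTime, times = length(meters)),
  meter    = rep(meters, each = nrow(toy)),
  value    = unlist(toy[meters], use.names = FALSE)
)
print(long)

# Plot only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = dateTime, y = value, colour = meter)) +
    geom_line() +
    scale_colour_manual(values = c("black", "red", "blue")) +
    labs(x = "Date & Time", y = "Energy submetering",
         colour = "Sub metering values")
  print(p)  # explicit print(), as needed inside loops, functions and source()
}
```

With the data in long format the legend comes from the ‘meter’ column itself, so adding a fourth meter would mean adding a column name to ‘meters’ rather than another geom_line call.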

Finally :

Now you can see the following graph plotting the sub-metering values for the required dates, saved as “plot.png” in the current working directory.

Fig : Sub metering values on Feb 1 & 2 of 2007.

PS : There is a huge world of data visualization waiting for you to explore. If you found this useful and interesting please click the recommend button.
If not, click it any way ;)

Shivee Gupta

Custom Solutions Engineer @ Google. Transforming unique client requirements to scalable solutions. https://in.linkedin.com/in/shiveegupta