LEARNING DATA VISUALISATION

This article walks through several common data visualisations in R, showing the code for each one and describing the output it produces. Before starting, it helps to be clear about what data visualisation means.

With the ever-increasing volume of data, it is hard to tell stories without visualisations. Data visualisation is the art of turning numbers into useful knowledge. R lets you practise this art through built-in plotting functions and add-on packages for building charts and presenting data. Most of the examples below use base R graphics; ggplot2, corrplot and a host of other packages extend what is possible.

Data Visualisation in R: this article covers the following visualisations: bar chart, histogram, heat map, scatter plot, box plot, correlogram and area chart. The illustrations use two datasets that ship with R, mtcars and airquality.

Bar chart: Bar plots are suitable for comparing counts or values across the levels of a categorical variable. For example, we can plot the number of cars in mtcars for each combination of cylinder count and number of gears.

Rcode:

data(mtcars)

View(mtcars)

data(airquality)

View(airquality)

# bar chart of cylinders and gears in mtcars

barza <- table(mtcars$cyl, mtcars$gear)

View(barza)

barplot(barza, main = 'Automobile Cylinders by Number of Gears', col = c('steelblue', 'orange', 'black'), legend = rownames(barza), xlab = 'Number of Gears', ylab = 'Frequency')
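To see what barplot() is drawing, it can help to print the contingency table first. The sketch below is one possible variation rather than part of the walkthrough itself: it names the table dimensions and uses beside = TRUE to draw grouped bars instead of stacked ones. The object name cyl_by_gear is made up for illustration.

```r
data(mtcars)

# Rows are cylinder counts, columns are gear counts (dimension names added for readability)
cyl_by_gear <- table(cyl = mtcars$cyl, gear = mtcars$gear)
print(cyl_by_gear)

# beside = TRUE draws one bar per cylinder group instead of stacking them
barplot(cyl_by_gear,
        beside = TRUE,
        col = c("steelblue", "orange", "black"),
        legend.text = rownames(cyl_by_gear),
        xlab = "Number of Gears",
        ylab = "Frequency",
        main = "Cylinders by number of gears (grouped bars)")
```

Grouped bars make it easier to compare cylinder counts within one gear category, while stacked bars emphasise the total per gear category.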

Histogram: A histogram shows the distribution of a continuous variable by counting how many values fall into each bin. For example, to see the frequency distribution of the Temp values in the airquality dataset:

Rcode:

hist(airquality$Temp, col = 'red', main = 'Maximum Daily Temperature', xlab = 'Temperature (degrees Fahrenheit)')
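The bins behind a histogram can be inspected without drawing anything. The following is a minimal sketch, assuming you want to check the default bins before asking for finer ones; the name temp_bins is hypothetical.

```r
data(airquality)

# plot = FALSE returns the bin boundaries and counts without drawing anything
temp_bins <- hist(airquality$Temp, plot = FALSE)
print(temp_bins$breaks)
print(temp_bins$counts)

# A larger 'breaks' value asks for finer bins; R treats the number as a suggestion only
hist(airquality$Temp,
     breaks = 20,
     col = "red",
     main = "Maximum Daily Temperature (finer bins)",
     xlab = "Temperature (degrees Fahrenheit)")
```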

Heat map: A heat map uses colour intensity to display the values in a matrix, which makes relationships among two or more variables visible at a glance. Let's simulate a small dataset and plot it as a heat map.

Rcode:

# simulate a data set of 20 points; rnorm() generates normally distributed random numbers

# set the seed first so the simulated values are reproducible

set.seed(143)

eex <- rnorm(20, mean = rep(1:5, each = 4), sd = 0.7)

eey <- rnorm(20, mean = rep(c(1, 19), each = 10), sd = 0.1)

dataFrame <- data.frame(eex = eex, eey = eey)

# a data frame must be converted to a matrix before heatmap() can plot it; sample(1:20) shuffles the rows

dataMatrix <- as.matrix(dataFrame)[sample(1:20), ]

heatmap(dataMatrix)
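One detail worth knowing: for a non-symmetric matrix, heatmap() standardises each row by default (scale = "row"). With only two columns, every standardised row ends up with the same two magnitudes, so the colours mostly show which column is larger. The sketch below, which uses a hypothetical demo_matrix built the same way as the example above, illustrates that effect and tries scale = "column" as one alternative.

```r
set.seed(143)  # seed first, so the simulated values are reproducible

# Two columns on different scales, similar in spirit to the example above
demo_matrix <- cbind(eex = rnorm(20, mean = rep(1:5, each = 4), sd = 0.7),
                     eey = rnorm(20, mean = rep(c(1, 19), each = 10), sd = 0.1))

# What row scaling does: each row is centred and divided by its own standard deviation
row_scaled <- t(scale(t(demo_matrix)))
print(round(head(row_scaled), 3))  # with two columns, every entry has the same magnitude

# Scaling by column keeps the differences between rows visible
heatmap(demo_matrix, scale = "column")
```

Whether to scale by row, by column, or not at all depends on what comparison the heat map is meant to support.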

Scatter plot: A scatter plot shows the relationship between two continuous variables. For example, we can visualise the ozone and temperature measurements for the month of August in the airquality data.

Rcode:

with(subset(airquality, Month == 8), plot(Temp, Ozone, col = 'steelblue', pch = 20, cex = 1.5))
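A scatter plot is often paired with a number that summarises the relationship. As a rough sketch, assuming a straight-line trend is a reasonable first look, the code below computes the correlation for August and overlays a fitted line. The names august, ozone_temp_cor and trend are illustrative only.

```r
data(airquality)
august <- subset(airquality, Month == 8)

# Ozone has missing values; cor() needs to be told to skip incomplete pairs
ozone_temp_cor <- cor(august$Temp, august$Ozone, use = "complete.obs")
print(ozone_temp_cor)

plot(august$Temp, august$Ozone,
     col = "steelblue", pch = 20, cex = 1.5,
     xlab = "Temperature (degrees Fahrenheit)", ylab = "Ozone (ppb)")

# lm() drops rows with missing values by default; abline() draws the fitted line
trend <- lm(Ozone ~ Temp, data = august)
abline(trend, col = "orange", lwd = 2)
```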

Box plot: A box plot shows a continuous variable split by the levels of a categorical variable. It is useful for visualising the spread of the data and detecting outliers. To illustrate this, let's draw a box plot of mpg by cyl from the mtcars dataset.

Rcode:

# convert cyl to a factor so it is treated as a categorical variable

mtcars <- transform(mtcars, cyl = factor(cyl))

class(mtcars$cyl)

boxplot(mpg ~ cyl, data = mtcars, xlab = 'Number of Cylinders', ylab = 'Miles per Gallon', main = 'Miles per Gallon by Number of Cylinders', cex.main = 1.2)
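The numbers behind a box plot, including any outliers, can be read off directly. The sketch below relies on the list that boxplot() returns; box_summary is a hypothetical name, and factor(cyl) is applied inside the formula so the snippet does not depend on the conversion above.

```r
data(mtcars)

# boxplot() returns the numbers it would draw; plot = FALSE skips the drawing
box_summary <- boxplot(mpg ~ factor(cyl), data = mtcars, plot = FALSE)

# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker
colnames(box_summary$stats) <- box_summary$names
print(box_summary$stats)

# Points beyond the whiskers, and the cylinder group each one belongs to
print(data.frame(mpg = box_summary$out,
                 cyl = box_summary$names[box_summary$group]))
```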

Correlogram: A correlogram shows the level of correlation among the variables in a dataset. The cells of the matrix are shaded or coloured according to the correlation value: the darker the colour, the stronger the correlation. With the default corrplot palette, red indicates negative correlation and blue indicates positive correlation. A correlation matrix must be computed before the correlogram can be drawn. Unlike the plots above, which need only base R graphics, the correlogram requires a package called corrplot.

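Rcode:

library(corrplot)

data(mtcars)

# cor() needs numeric columns; as.numeric() cannot convert a whole data frame, so check column by column

sapply(mtcars, is.numeric)

class(mtcars)

corr_matrix <- cor(mtcars)

corrplot(corr_matrix, method = 'circle', type = 'lower')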
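It can also help to look at the numbers a correlogram encodes. The sketch below uses base R only, so it runs without corrplot; mpg_cor is an illustrative name.

```r
data(mtcars)
corr_matrix <- cor(mtcars)

# Correlation of every variable with mpg, ordered from most negative to most positive
mpg_cor <- sort(corr_matrix[, "mpg"])
print(round(mpg_cor, 2))

# A correlation matrix is symmetric, so plotting only the lower triangle loses nothing
print(isSymmetric(corr_matrix))
```

The symmetry is why an option such as type = 'lower' is enough to show every pairwise correlation once.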

Area chart: An area chart shows how a continuous variable changes across an ordered variable, for example a trend over time. Here we plot the mean wind speed for each day of the month in the airquality data.

Rcode:

library(dplyr)

library(ggplot2)

data(airquality)

# average the wind speed for each day of the month, then draw the area chart

airquality %>%

group_by(Day) %>%

summarise(mean_wind = mean(Wind)) %>%

ggplot() +

geom_area(aes(x = Day, y = mean_wind)) +

labs(title = "Area chart of Average Wind per Day", subtitle = "using airquality data", y = "Mean Wind")
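If dplyr and ggplot2 are not installed, a similar chart can be approximated in base R. This is a swapped-in technique rather than the approach used above: aggregate() computes the daily means and polygon() fills the area under the line. The name daily_wind is hypothetical.

```r
data(airquality)

# Mean wind speed for each day of the month, averaged across the months in the data
daily_wind <- aggregate(Wind ~ Day, data = airquality, FUN = mean)

# Draw an empty frame, then fill the region between the line and zero
plot(daily_wind$Day, daily_wind$Wind, type = "n",
     ylim = c(0, max(daily_wind$Wind)),
     xlab = "Day", ylab = "Mean Wind",
     main = "Area chart of Average Wind per Day")
polygon(x = c(daily_wind$Day, rev(range(daily_wind$Day))),
        y = c(daily_wind$Wind, 0, 0),
        col = "steelblue", border = NA)
```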