I finally discovered the trick to using GGPLOT2!

Davis Anunda
7 min read · Jan 17, 2023

Data visualization in R

Base R comes with built-in graphics functions that can produce anything from simple pie charts and bar graphs to more complex visuals such as choropleths. Beyond base R, a wide variety of packages exist for making visualizations. These include, among others:

  • plotly
  • lattice
  • dygraphs
  • highcharter
  • ggridges
  • gganimate
  • patchwork

About ggplot2

ggplot2 is arguably the most popular visualization package in R. It was created by Hadley Wickham in 2005, inspired by Leland Wilkinson's 1999 book “The Grammar of Graphics”. The “gg” in the name stands for “grammar of graphics”.

Why ggplot2

  • Ggplot2 is flexible and can be used to create very powerful graphics with just a little bit of code.
  • Ggplot2 can create all different types of plots, from simple scatter plots to maps and networks.
  • Ggplot2 can be customized easily to change the look and feel of the plots.
  • Ggplot2 can be combined with the pipe operator to manipulate, arrange and summarize data before visualization.
  • Building a ggplot2 plot is incremental: you start from a ggplot() call and add layers to it one at a time.

ggplot2 basics

ggplot2 has a set of building blocks. Once you understand them, plotting becomes straightforward. The blocks are:

  • Aesthetics: These are the visual properties of the plot. For instance, in a scatter plot, the aesthetics could be the shape, colour and size of the data points. Aesthetics also map your variables to the visual features of your plot.
  • Geom: This is the geometric object used to represent data. Geom objects determine the type of plot you are building. For example, the geom object for the scatter plot is geom_point, while that of the histogram is geom_histogram.
  • Facets: This allows you to split your data into subgroups along a certain variable.
  • Label/annotate: These allow you to add titles, subtitles, label axes, highlight or annotate specific points etc.
  • Others: scales, coordinate systems etc.
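As a rough sketch of how these blocks fit together, the snippet below stacks one of each onto a single plot of the built-in iris dataset. The choice of variables, the viridis colour scale and the label text are illustrative assumptions, not part of the walkthrough that follows.

```r
library(ggplot2)

blocks_plot <- ggplot(data = iris) +
  # Aesthetics + geom: map variables to x, y and colour, draw them as points
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  # Facets: one panel per species
  facet_wrap(~Species) +
  # Labels: title and axis labels
  labs(title = "Sepal length vs sepal width", x = "Sepal length (cm)", y = "Sepal width (cm)") +
  # Others: a discrete colour scale (an arbitrary choice for illustration)
  scale_colour_viridis_d()

blocks_plot
```

Each block is a separate layer joined with +, so any of them can be dropped or swapped without touching the rest.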

Installing the required packages

The tidyverse is a collection of packages that can be installed together with a single command.

install.packages('tidyverse')
library(tidyverse)

The packages in the tidyverse share a common design and together cover importing, tidying, transforming, visualizing, modelling and communicating data. These packages include:

  • ggplot2 for data visualization
  • dplyr for data manipulation tasks
  • tidyr for tidying data
  • readr for importing data
  • purrr for working with functions and vectors
  • tibble for working with data frames
  • stringr for string manipulation
  • forcats for working with factors

When tidyverse is loaded, all the packages above including ggplot2 are automatically loaded and ready for use.
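If you only need plotting, a lighter option is to attach ggplot2 on its own. The sketch below does that and then checks that the package is attached, using base R's search(). Note that the pipe and the dplyr verbs used later in this article would then need their own library() calls.

```r
library(ggplot2)  # attaches only ggplot2, not the rest of the tidyverse

# search() lists everything attached in the current session
ggplot2_attached <- "package:ggplot2" %in% search()
ggplot2_attached
```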

We will load the iris dataset from R’s built-in datasets for demonstration. The dataset comprises five columns (four measurements plus Species) and 150 rows.

library(datasets)
data(iris)
head(iris)
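Before plotting, it helps to check the shape and column types of the data, since they determine which aesthetics make sense. A minimal sketch using base R only:

```r
data(iris)

dim(iris)    # rows and columns: 150 5
names(iris)  # column names
str(iris)    # four numeric measurements and one factor, Species
```

The numeric columns suit continuous aesthetics such as x, y and size, while the Species factor suits colour, shape and faceting.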

We will also use a second dataset from Kaggle about the best movies and TV shows on Netflix. It comprises seven columns and 387 rows. The country column was derived from the main_production column by replacing every country other than US, IN and GB with “Others”.

netflix <- read_csv("datasets/best_movies_netflix.csv")
head(netflix)
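The code that derived the country column is not shown. One way it could be done is sketched here on a small made-up data frame. The rows are invented for illustration; only the column names main_production and country come from the text.

```r
library(dplyr)

# Hypothetical stand-in rows, not the real Netflix data
toy <- data.frame(
  title = c("A", "B", "C", "D"),
  main_production = c("US", "JP", "IN", "FR"),
  stringsAsFactors = FALSE
)

# Keep US, IN and GB as they are; collapse everything else into "Others"
toy <- toy %>%
  mutate(country = if_else(main_production %in% c("US", "IN", "GB"),
                           main_production, "Others"))

toy$country
```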

Visualization

To build a plot, we start with the ggplot() function and then add layers to it with +. For example, to build a scatter plot of sepal length against sepal width, we add geom_point. The variables Sepal.Length and Sepal.Width are mapped to x and y as aesthetics.

ggplot(data = iris) + geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width))

To create other types of plots, the geom object changes:

  • Histogram: geom_histogram()
  • Line plot: geom_line()
  • Bar chart: geom_bar()
  • Area plot: geom_area(), etc.

This pattern is the building block for most ggplot2 plots: the dataset is specified in the ggplot() function, and aes() maps the variables to the x and y axes inside the geom. Changing the chart type only requires changing the geom and the variables. For example, a bar chart needs only one variable, so we add geom_bar and map just x:

ggplot(data = netflix) + geom_bar(mapping = aes(x = country))
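The same swap works for the other geoms. As a small sketch on the iris data, here is a histogram, which also needs only an x variable. The bins = 20 value is an arbitrary choice for illustration.

```r
library(ggplot2)

hist_plot <- ggplot(data = iris) +
  # Only x is mapped; the counts on the y axis are computed by the geom
  geom_histogram(mapping = aes(x = Sepal.Length), bins = 20)

hist_plot
```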

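To draw a line plot of the average duration per release year, summarize the data first and pipe the result into ggplot():

netflix %>%
  group_by(release_year) %>%
  summarize(average_duration = mean(duration)) %>%
  ggplot() +
  geom_line(mapping = aes(x = release_year, y = average_duration))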

Another way of specifying the aesthetics is inside the ggplot() function itself, so that every layer inherits them. Which style to use is a matter of preference: pick whichever you find easier to use or remember.

ggplot(data = netflix, mapping = aes(x = duration, y = score)) + geom_point()
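One practical difference shows up when a plot has more than one layer: aesthetics given in ggplot() are shared by every geom. A minimal sketch on the iris data, where the trend line layer is an addition for illustration and not part of the original example:

```r
library(ggplot2)

shared <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +                              # inherits x and y
  geom_smooth(method = "lm", formula = y ~ x) # inherits the same x and y

shared
```

With the per-geom style, the same mapping would have to be repeated inside both geom_point() and geom_smooth().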

Customizations

Change the colour and shape of the scatter plot points according to the variable Species by adding color and shape to the aesthetics.

ggplot(data = iris) + geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species, shape = Species))

Make the size of the data points depend on a variable by adding the size aesthetic. Alpha, set here outside aes() so that it applies to every point, changes the transparency.

ggplot(data = iris) + geom_point(mapping = aes(x = Petal.Length, y = Petal.Width, color = Species, size = Sepal.Length), alpha = 0.5)
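The position of an argument matters: inside aes() a property is mapped to a variable, while outside aes() it is set to one fixed value for the whole layer. A short sketch of the difference, where the colour "steelblue" is an arbitrary choice:

```r
library(ggplot2)

# Mapped: colour varies with Species and a legend is drawn
mapped <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Petal.Length, y = Petal.Width, color = Species))

# Set: every point gets the same colour and transparency, no legend
fixed <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Petal.Length, y = Petal.Width),
             color = "steelblue", alpha = 0.5)

mapped
fixed
```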

Use faceting to split the data into subsets and plot each one in its own panel. To facet by a single variable, use facet_wrap; for two variables, use facet_grid.

ggplot(data = netflix) + geom_bar(mapping = aes(y = main_genre)) + facet_wrap(~country)

Let’s create a variable movie_length, where movies or TV shows below 1 hour are classified as short, medium length is between 1 and 2 hours, and more than two hours is long.

netflix$movie_length <- as.factor(ifelse(netflix$duration < 60, 'short',
                                  ifelse(netflix$duration > 120, 'long', 'medium')))
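With as.factor, the levels are sorted alphabetically (long, medium, short), and that is the order in which they appear in legends and facets. If a short-to-long order is preferred, one alternative is dplyr's case_when with explicit levels. The sketch below uses a made-up vector of durations in minutes rather than the real data.

```r
library(dplyr)

durations <- c(45, 60, 90, 120, 150)  # hypothetical durations in minutes

movie_length <- factor(
  case_when(
    durations < 60  ~ "short",
    durations > 120 ~ "long",
    TRUE            ~ "medium"   # 60 to 120 minutes inclusive
  ),
  levels = c("short", "medium", "long")  # explicit, non-alphabetical order
)

movie_length
levels(movie_length)
```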

We will also filter the data to keep only the top three genres for easy demonstration. Let’s visualize the data and split it along the two variables movie_length and country, using facet_grid since two variables are involved.

netflix %>%
  filter(main_genre %in% c('drama', 'thriller', 'comedy')) %>%
  ggplot() +
  geom_bar(mapping = aes(y = main_genre)) +
  facet_grid(movie_length ~ country)
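Both faceting functions accept a two-variable formula, but they lay the panels out differently: facet_wrap draws one panel per combination and wraps them into rows, while facet_grid puts one variable on the rows and the other on the columns. A small sketch of the contrast, using the mpg dataset that ships with ggplot2 (a stand-in chosen here because it has several categorical columns):

```r
library(ggplot2)

base <- ggplot(data = mpg) + geom_bar(mapping = aes(y = class))

wrapped <- base + facet_wrap(drv ~ year)  # one wrapped panel per drv/year pair
gridded <- base + facet_grid(drv ~ year)  # drv as rows, year as columns

wrapped
gridded
```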

Add labels and titles

Add the labs layer to the plot to edit the title and axis labels.

ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species, shape = Species)) +
  labs(title = "Sepal length vs Sepal width", x = "Sepal length (cm)", y = "Sepal width (cm)")
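labs() can also set a subtitle, a caption and the legend titles. The sketch below extends the same plot; the subtitle, caption and legend wording are made up for illustration. Giving color and shape the same title keeps them merged into a single legend.

```r
library(ggplot2)

labelled <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species, shape = Species)) +
  labs(
    title = "Sepal length vs Sepal width",
    subtitle = "iris dataset, 150 flowers",
    caption = "Source: R datasets package",
    color = "Iris species",   # legend title for the colour aesthetic
    shape = "Iris species"    # same title, so the two legends stay merged
  )

labelled
```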

Adding annotations

Add text to the plot with the annotate function, which takes the coordinates where the text should appear and the label itself.

ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species, shape = Species)) +
  labs(title = "Sepal length vs Sepal width", x = "Sepal length (cm)", y = "Sepal width (cm)") +
  annotate("text", x = 4.5, y = 2.5, label = "Very short", angle = 25)
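annotate is not limited to text: its first argument names a geom, so it can draw shapes as well. As a sketch, the snippet below shades a region with a "rect" annotation and labels it. The coordinates and the label are arbitrary values chosen for illustration.

```r
library(ggplot2)

highlighted <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  # Shaded rectangle given by its corner coordinates
  annotate("rect", xmin = 4.2, xmax = 6, ymin = 2.8, ymax = 4.5,
           alpha = 0.1, fill = "blue") +
  # Text label placed just above the rectangle
  annotate("text", x = 5.1, y = 4.6, label = "Region of interest")

highlighted
```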

Splitting the code

As the code gets longer, it becomes harder to fit on one line. We can assign the plot to an object and keep adding layers to it. Note that + returns a new plot rather than modifying p in place, so each addition has to be assigned back to p for it to persist.

p <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species))

p <- p + labs(title = "Petal length vs Petal width", x = "Petal length (cm)", y = "Petal width (cm)")

p <- p + annotate("text", x = 2.5, y = 0.5, label = "Quite short", angle = 25)

p <- p + theme_light()

p
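A quick way to see what adding a layer does to a stored plot object is to count its layers before and after. The sketch below uses a trimmed-down version of the plot; the name p_annotated is introduced here for illustration.

```r
library(ggplot2)

p <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Petal.Length, y = Petal.Width))

# Adding a layer returns a new object; p itself is left untouched
p_annotated <- p + annotate("text", x = 2.5, y = 0.5, label = "Quite short")

length(p$layers)            # 1
length(p_annotated$layers)  # 2
```

So a line such as p + theme_light() on its own only prints a themed copy; the stored p keeps its old appearance unless the result is assigned.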

Drawing other types of graphs

To draw a boxplot, add geom_boxplot and specify the x and y aesthetics. You can customize it further by mapping color for the outline or fill for the inside of the boxes. To center the title, add a theme layer and set hjust = 0.5 on plot.title. As before, each addition is assigned back to p so that the final plot includes it.

p <- ggplot(data = netflix) + geom_boxplot(mapping = aes(x = country, y = score, color = country))
p <- p + labs(title = "Movie score per country", x = "Country", y = "Score")
p <- p + theme(plot.title = element_text(hjust = 0.5))
p
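For the fill option mentioned above, here is a minimal sketch on the iris data, chosen so that it runs without the Netflix file. The title text is made up for illustration.

```r
library(ggplot2)

box <- ggplot(data = iris) +
  # fill colours the inside of each box; color would change only the outline
  geom_boxplot(mapping = aes(x = Species, y = Sepal.Length, fill = Species)) +
  labs(title = "Sepal length per species", x = "Species", y = "Sepal length (cm)") +
  # hjust = 0.5 centers the title
  theme(plot.title = element_text(hjust = 0.5))

box
```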

Conclusion

In essence, ggplot2 can feel difficult or easy depending on how you approach it. But once you’ve mastered the fundamentals, plotting more complex graphs requires only a little tweaking of the geoms, aesthetics and other building blocks. Try it out, and be the judge.

Davis Anunda

Microbiologist | Data Analyst | Chess addict | Poetry lover | Fitness procrastinator