How to create elegant violin and box plots in R with minimal code using the “ggstatsplot” package

12 min readNov 19, 2023

“Don’t give up seeing the exhaustive lines of code 😅. It’s just copy and paste then Run!!”. Stay with me, and I will show you how to generate the plot below in simple steps.

Comment what R tutorial should I do next, and I will try to do within 2 days and follow me to know latest updates.
For bioinformatic data analysis kindly contact : ALLAN VICTOR

Data visualization is a critical component of statistical analysis and presentation. Choosing the right type of plot allows researchers to gain key insights into distribution, variability, and summary statistics. Two common choices for visualizing the distribution of numeric data are violin plots and boxplots. However, creating beautiful, customized publication-quality plots requires using the right tools.

The ggstatsplot package for R provides enhanced graphics for statistical analysis using the popular ggplot2 plotting system. In this article, I’ll demonstrate how to use ggstatsplot to create customized, high-quality violin and boxplots for data analysis and communication.

Specific topics will include:

Data format
Basic syntax and options for ggstatsplot violin and boxplots.
Customizing colors, themes, labels, titles, and other aesthetic elements.
Adding statistical details like sample size, mean/median annotations.
Creating grouped comparisons and faceted plots.
Using ggstatsplot visualization functions for exploratory data analysis.

Whether you’re creating plots for reports, publications, or just exploring your data, ggstatsplot provides the tools to make both violin plots and boxplots shine. Let’s dive into the code and see what this package can do!

Data format

The data should be in the long format and they prefer that the data is in CSV file format.

Get the data ready.
Set the directory to the data containing folder.
Read the data in R.
Validate or check the data for errors.

Long format of the data —

Get the data ready in the above format and save as .csv file. You can download the data here.

Download or visit mega for free cloud storage from here if you like it.

Reading the data into R —

Set the working directory of the R session to the folder that contains the data. Follow the steps in the picture below and choose the directory.

Select the folder after clicking on choose directory

Now you can run the below code and read the data into R.

#Code for reading data - Copy paste run!!
data=read.csv("yourdataname.csv")
#Replace the filename of your data instead of "yourdataname" in the code
#above and run the line

Validate or check the data for compatibility —

Once the data is read inside R, we can see the data using the command,

# Code for checking the data - Copy paste run!!
head(data)

After running the line, you should see the data like this.

Then we can check the type of the data by using the code,

# Code for seeing the structure of the data - Copy paste run!!
str(data)

The output for the code. — This is the structure of the data that appears in the console

The species column in the data frame has the type “character”, denoted by “chr”. Although the boxplot function can handle character type, it is better to convert it to “factor” type. We can do this by using the code below.

# Code for changing the type of 1st column - Copy paste run!!
data[,1]=as.factor(data[,1])

What this code does is that it will change the 1st column (the syntax is data[row,column]), here its data [“empty space means all columns,1 (1 st row)]. After this when you view the structure of the data it will be as shown below,

Now we have our data ready to create a boxplot!!!

Basic syntax and options for ggstatsplot violin and boxplots.

# Don't Copy paste run ...

ggbetweenstats(
  data,
  x,
  y,
  type = "parametric",
  pairwise.display = "significant",
  p.adjust.method = "holm",
  effsize.type = "unbiased",
  bf.prior = 0.707,
  bf.message = TRUE,
  results.subtitle = TRUE,
  xlab = NULL,
  ylab = NULL,
  caption = NULL,
  title = NULL,
  subtitle = NULL,
  k = 2L,
  var.equal = FALSE,
  conf.level = 0.95,
  nboot = 100L,
  tr = 0.2,
  centrality.plotting = TRUE,
  centrality.type = type,
  centrality.point.args = list(size = 5, color = "darkred"),
  centrality.label.args = list(size = 3, nudge_x = 0.4, segment.linetype = 4,
    min.segment.length = 0),
  point.args = list(position = ggplot2::position_jitterdodge(dodge.width = 0.6), alpha =
    0.4, size = 3, stroke = 0, na.rm = TRUE),
  boxplot.args = list(width = 0.3, alpha = 0.2, na.rm = TRUE),
  violin.args = list(width = 0.5, alpha = 0.2, na.rm = TRUE),
  ggsignif.args = list(textsize = 3, tip_length = 0.01, na.rm = TRUE),
  ggtheme = ggstatsplot::theme_ggstatsplot(),
  package = "RColorBrewer",
  palette = "Dark2",
  ggplot.component = NULL,
  ...
)

The ggbetweenstats function has many parameters as you can see, that can be adjusted, but the default settings usually work well. As I create the plots, I will demonstrate some of the key parameters in practice.

Plotting the boxplot

Basic plot

To install the package for the first time, use the code below. If you already have the package, you can skip this step.

#Installing the packages - Copy paste run!!
install.packages("ggstatsplot")

# For plotting - Copy paste run!!
library(ggstatsplot) # Loading the package

ggbetweenstats(
  data  = data, 
  x     = Species,
  y     = Sepal.Length,
  title = "Distribution of sepal length across Iris species"
)

Data format and the corresponding plot we are going to generate.

The above function can create a basic plot with the data = parameter as the name of your data. In this example, we read the data into R with the name “data”. The X = parameter is the “Species” column of our data, which has three unique values, so we get three separate boxplots. The Y = parameter is the data values. We use the “Sepal.Length” column, which is plotted on the Y axis for each boxplot. The “title =” parameter allows us to give any title we want.

Statistical details in the plot

Let’s create a plot for the R’s default “iris” dataset and explore the details in the plot to understand further.

# For plotting - Copy paste run!!
ggbetweenstats(
  data  = iris,
  x     = Species,
  y     = Sepal.Length,
  title = "Distribution of sepal length across Iris species"
)

The Generated plot by running the above code

F Welch(2,92.21) = 138.91: This iris the result of the F-test. The F-value is 138.91. The numbers in the brackets are the degrees of freedom for the test, which are parameters that allow the test to account for certain factors in the data. In this case, the degrees of freedom are 2 and 92.21. This could be related to the number of Iris species (3 species — 1 = 2) and the total sample size minus the number of species.
p = 1.51e-28: This is the p-value of the test. The p-value is a measure of the probability that the observed data would occur if the null hypothesis of the test were true. In this case, the p-value is extremely low (1.51e-28), which means it’s very unlikely that the observed differences in sepal lengths between the Iris species happened by chance. This provides strong evidence that the sepal lengths are significantly different between the species.
ω²_p = 0.74: This is the effect size of the test, which measures the magnitude of the difference or relationship. In this case, the effect size is 0.74, indicating a large effect. This means that the species of Iris (setosa, versicolor, or virginica) accounts for about 74% of the variation in sepal length.
CI_95% [0.67, 1.00]: This is the 95% confidence interval for the effect size. This means we can be 95% confident that the true value of the effect size is between 0.67 and 1.00. In other words, if we were to repeat the experiment many times, we would expect the species to account for between 67% and 100% of the variation in sepal length 95% of the time.
n_obs = 150: This is the number of observations in the data. In this case, there were 150 observations, which could represent 150 individual Iris flowers that were measured.
log(BF 01) = -65.10: This is the logarithm of the Bayes Factor (BF01). The Bayes Factor is a measure used in Bayesian statistics to quantify the evidence in favor of one statistical model over another. A negative value here suggests that the data strongly supports one model over the other. This suggests that there is strong evidence in our data (the sepal lengths of the different Iris species) to support one model over another. For example, it could be evidence that the sepal lengths are different between the species.
R² posterior Bayesian = 0.61: This is the Bayesian posterior R-squared value. It’s a measure of how well our model fits the data. A value of 0.61 means that our model explains 61% of the variability in the data. This tells us that our model (which could be a model predicting sepal length based on species) explains 61% of the variability in sepal lengths. So, if we know the species of an Iris flower, we could potentially predict about 61% of the variation in its sepal length.
CIHDI [95%] [0.54, 0.67]: This is the 95% Highest Density Interval (HDI). It’s a range of values that likely contains the true value 95% of the time. In this case, we can be 95% confident that the true value lies between 0.54 and 0.67. This range tells us that, if we were to repeat our experiment many times, we can be 95% confident that the true effect size (the difference in sepal lengths between species) would fall within this range (0.54 to 0.67).
rJZS Cauchy = 0.71: The statistics is unclear. This seems to represent another statistical measure with a value of 0.71 associated with the Cauchy distribution.

You can remove the statistical tests by setting the parameters to “none”

#For plotting - Copy paste run!!
ggbetweenstats(
  data  = data,
  x     = Species,
  y     = Sepal.Length,
  type = "none", 
  pairwise.display="none",
  p.adjust.method="none",
  title = "Distribution of sepal length across Iris species"
)

Plotting for multiple traits or columns

The following function can combine multiple boxplots for different traits. Further information about the function can be found here.

# Don't Copy paste run!! 
combine_plots(
  plotlist,
  plotgrid.args = list(),
  annotation.args = list(),
  guides = "collect",
  ...
)

The above code may be look exhausting to beginners. But here is the simpler and wholesome version of the code.

Generate your plots for each trait and save in R with corresponding names. Here I am saving the plots as a, b, c and d as shown in the code below.

# The simplest form of code to plot multiple plots - Copy paste run!!

data=iris #Renaming the iris data as "data". (Here you have to read the data instead of this)
# data = read.csv ("yourdataname.csv")
head(data) # Reviewing the data
str(data)
library(ggstatsplot) # Loading the library
data[,5]=as.factor(data[,5]) #The fifth column of the data is the "Species" column. So we are changing the type of the column to factor. 
colnames(data) # To see the column names of the data,  so that we can plot for corresponding traits.

a=ggbetweenstats(
  data  = data,
  x     = Species,
  y     = Sepal.Length,
  title = "Distribution of sepal length across Iris species"
)

b=ggbetweenstats(
  data  = data,
  x     = Species,
  y     = Sepal.Length,
  title = "Distribution of sepal length across Iris species"
)
c=ggbetweenstats(
  data  = data,
  x     = Species,
  y     = Sepal.Length,
  title = "Distribution of sepal length across Iris species"
)
d=ggbetweenstats(
  data  = data,
  x     = Species,
  y     = Sepal.Length,
  title = "Distribution of sepal length across Iris species"
)

2. Now your plots will be saved as shown below,

3. You can visualize each single plot by running plot (Name of the plot),

plot(a)

4. Now we will combine the plot as below,

svg("plots.svg", width = 10, height = 10)
combine_plots(
  list(a, b, c, d), #Name of the plots
)
dev.off()

You can change the width and height of the plot by changing width = and height = parameters if needed. Here we are saving the plot as .svg. The plot will be saved in your directory.

Changing parameters

Plotting for multiple traits include creating individual boxplots for each trait and combining all the boxplots into a single diagram. I also tried to explain different parameters by changing them in each plot. You can get an idea by looking through the comments within the code below,

# Just to learn the different parameters - Can be Copied and pasted and run 

data=iris #Renaming the iris data as "data".
head(data) # Reviewing the data
str(data)
library(ggstatsplot) # Loading the library
data[,5]=as.factor(data[,5]) #The fifth column of the data is the "Species" column. So we are changing the type of the column to factor. 

colnames(data) # To see the column names of the data,  so that we can plot for corresponding traits.

#Plotting the first plot for Sepal length
p1 <- ggbetweenstats(
  data = data,
  x = Species,
  y = Sepal.Length,
  xlab = "Species",
  ylab = "Sepal Length",
  violin.args = list(width = 0),# to remove violin plot we are setting the width = 0 to make it invisible. 
  type = "p", # p = "parametric tests
  conf.level = 0.99,
  title = "Parametric test", # The title of the plot we are generating
  package = "ggsci", 
  palette = "nrc_npg"
)

#Plotting the second plot for Sepal width
p2 <- ggbetweenstats(
  data = data,
  x = Species,
  y = Sepal.Width,
  xlab = "Species", # The x axis label
  ylab = "Sepal width", # The Y axis label
  boxplot.args = list(width = 0),# to remove the boxplot and keep the violin plot we are setting the width of boxplot to 0 to make it invisible. 
  type = "np", # Non-parametric tests
  conf.level = 0.99,
  title = "Non-parametric Test",
  package = "ggsci",#Scientific Journal and Sci-Fi Themed Color Palettes for 'ggplot2'
  palette = "uniform_startrek" # A colour palette from the ggplot - ggsci
)

#Plotting the third plot for Petal length
p3 <- ggbetweenstats(
  data = data,
  x = Species,
  y = Petal.Length,
  xlab = "Species",
  ylab = "Petal length",
  type = "r",# Robust test
  conf.level = 0.99,
  title = "Robust Test",
  tr = 0.005,
  package = "wesanderson", # A package for colour palette
  palette = "Royal2", # Colour palette
  k = 3 #This is the number of colors to be used from the palette. In your case, it’s 3.
)


#Plotting the fourth plot for Petal width
## Bayes Factor for parametric t-test and boxviolin plot
p4 <- ggbetweenstats(
  data = data,
  x = Species,
  y = Petal.Width,
  xlab = "Species",
  ylab = "Petal width",
  type = "bayes",
  violin.args = list(width = 0), # removing both the voilin and boxplot
  boxplot.args = list(width = 0),
  point.args = list(alpha = 0),
  title = "Bayesian Test",
  package = "ggsci",
  palette = "nrc_npg"
)


## combining the individual plots into a single plot
combine_plots(
  list(p1, p2, p3, p4), #Name of the plots
  plotgrid.args = list(nrow = 2), #number of rows
  annotation.args = list(
    title = "Comparison of life expectancy between 1957 and 2007",
    caption = "Note: This is the combined boxplot"
  ) # Title and the caption of the combined plot
)

# Some times while running the above code error may occur due to larger size of the plots. 
# In that case save the plot as .pdf or .svg or some other desired format as shown below.
# The plot will be saved in the directory you set at first.


pdf ("plots.pdf", width = 10, height = 10)
## combining the individual plots into a single plot
combine_plots(
  list(p1, p2, p3, p4), #Name of the plots
  plotgrid.args = list(nrow = 2), #number of rows
  annotation.args = list(
    title = "Comparison of life expectancy between 1957 and 2007",
    caption = "Note: This is the combined boxplot"
  ) # Title and the caption of the combined plot
)
dev.off()

Take away

In this article, we learned you how to use ggstatsplot to create elegant violin and box plots in R with minimal code. We learned how to format the data, choose the type of test, customize the appearance, and interpret the results. We also saw some examples of plots for different data and tests. I hope you found this article useful and interesting. If you did, please give it a clap and a follow. Happy plotting! 😁