Displaying continuous data using R

Data analysis using R in Six Sigma style — Part 1

Rafal Burzynski
May 8, 2023

This post is the first part of a series on data analysis in R that follows the approach used in the Six Sigma methodology.

In many cases we gather data, for numerous reasons, in order to analyze it later on. To do this efficiently we need a structured approach, which the Six Sigma methodology provides.

In this article, as a first step, we will learn how to load a dataset and display the data using a boxplot and a histogram. Finally, we will touch on the normality test to see whether the data follow a normal distribution.

For the series of articles titled “Data Analysis using R in Six Sigma style” I will be using Visual Studio Code to run the scripts. For those who want to work the same way, I have prepared an article on how to set up Visual Studio Code for running R scripts.

All script files are available in this GitHub repository:

Required packages and loading data

For the purposes of this article we will mostly use functions that ship with base R. To display the charts properly in VS Code, however, we need one additional package called httpgd. It can be installed and loaded as follows:

# loading required modules
# if you don't have them installed uncomment the next line
# install.packages("httpgd")
library("httpgd")

Now a little more about the data. In many cases it will be gathered and most probably stored in Excel. From there it is possible to export a comma-separated values (CSV) file, which can easily be loaded into R:

# loading the dataset
df <- read.csv("data/sample_data.csv", sep = ";", dec = ",")

CSV files can use different field separators and decimal marks. These are controlled by the sep and dec parameters, respectively. In the call above the fields are separated by semicolons and the decimal mark is a comma.
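To see the effect of sep and dec without touching the real data, here is a minimal sketch that writes a tiny made-up file to a temporary location and reads it back. The file contents are invented for illustration only. It also shows read.csv2, the base R variant whose defaults are already sep = ";" and dec = ",".

```r
# Write a small semicolon-separated file with decimal commas to a temp location.
# The values below are made up for illustration, they are not the sample data.
tmp <- tempfile(fileext = ".csv")
writeLines(c("First;Second", "66,4;73,9", "69,3;78,6"), tmp)

# Explicit separator and decimal mark, as in the call above
a <- read.csv(tmp, sep = ";", dec = ",")

# read.csv2 uses sep = ";" and dec = "," as its defaults
b <- read.csv2(tmp)

# Both columns should be read as numbers, not as text
str(a)

# Both calls should give the same data frame
print(identical(a, b))
```

If the columns came back as character instead of num, that would be a hint that the separator or the decimal mark does not match the file.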

When loading data from CSV files, it is important to check the working directory, because relative paths such as the one above are resolved against it. The current working directory is returned by the getwd() function and, if necessary, it can be changed with the setwd("path to wd") function.
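A small sketch of that check is shown below. It assumes the same relative path as in the loading step and only reads the file when it is actually found; the message text is my own wording.

```r
# Show the current working directory
wd <- getwd()
print(wd)

# Build the relative path used earlier and check it before reading
data_path <- file.path("data", "sample_data.csv")

if (file.exists(data_path)) {
  df <- read.csv(data_path, sep = ";", dec = ",")
} else {
  # Nothing is read here; the message only tells us where R was looking
  message("File not found: ", file.path(wd, data_path),
          ". Change the working directory with setwd() or fix the path.")
}
```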

In our example the sample data file contains length measurements of three samples. After loading the dataset it is always good to look at the first few rows, just to see whether the data was loaded properly. This can be done with the head function.

# display the head of data
# data includes measurement of the length of three samples
head(df)
> head(df)
  First Second Third
1  66.4   73.9 73.18
2  69.3   78.6 76.66
3  66.5   75.8 71.47
4  67.2   78.4 76.14
5  67.3   74.8 72.00
6  66.3   78.0 75.39

The data looks as expected, so we can inspect its structure and summary:

> str(df)
'data.frame': 50 obs. of 3 variables:
$ First : num 66.4 69.3 66.5 67.2 67.3 66.3 68.4 64.8 70.9 69.6 ...
$ Second: num 73.9 78.6 75.8 78.4 74.8 ...
$ Third : num 73.2 76.7 71.5 76.1 72 ...

From the structure of the data we can see that there are three variables with 50 observations each, and that all the variables are numeric.

The summary function is a handy way to see a few more statistics describing each variable.

> summary(df)
     First           Second          Third
 Min.   :64.20   Min.   :73.19   Min.   :70.55
 1st Qu.:68.47   1st Qu.:75.17   1st Qu.:72.37
 Median :69.70   Median :76.81   Median :74.38
 Mean   :69.49   Mean   :76.80   Mean   :74.22
 3rd Qu.:70.88   3rd Qu.:78.08   3rd Qu.:75.76
 Max.   :73.30   Max.   :80.49   Max.   :78.45

For each variable we get the minimum and maximum values, the mean and the median, as well as the first and third quartiles.
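The same numbers can also be computed one at a time, which is useful when they are needed later in a script. The sketch below uses simulated values as a stand-in for one of the samples (the seed, mean and standard deviation are assumptions chosen only to resemble the data).

```r
# Simulated stand-in for one sample; not the real measurements
set.seed(1)
x <- rnorm(50, mean = 69.5, sd = 2)

# First quartile, median and third quartile
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
print(q)

# Interquartile range: the distance between the third and first quartile
iqr <- IQR(x)
print(iqr)

# Minimum and maximum in one call
rng <- range(x)
print(rng)
```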

Boxplot

A boxplot is a good visual representation of the data, and it can be created with a built-in function. Below is the boxplot of the first sample.

# Box plot
boxplot(df$First)

It is also possible to plot boxplots for all three samples on one plot.

# adding more factors
boxplot(df$First, df$Second, df$Third)

This boxplot would look better if it had some additional information on it.

# adding sample names, axis labels and a title
boxplot(df$First, df$Second, df$Third,
        names = c("First sample", "Second sample", "Third sample"),
        xlab = "Sample order", ylab = "Length [mm]",
        main = "Boxplots of lengths of samples")
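The numbers behind a boxplot can be inspected without drawing anything by calling boxplot.stats, which computes the statistics that boxplot draws. The sketch below runs on simulated values with one artificial outlier added on purpose (both are assumptions for illustration, not the sample data).

```r
# Simulated stand-in for one sample, plus one artificial outlier (90)
set.seed(1)
x <- c(rnorm(50, mean = 69.5, sd = 2), 90)

# By default, points further than 1.5 times the box length
# from the box are reported as outliers
st <- boxplot.stats(x)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(st$stats)

# Points that would be drawn individually beyond the whiskers
print(st$out)
```

Looking at st$out is a quick way to list the points that appear as separate circles on the plot.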

Histogram

A histogram can be created easily with the hist function.

# histogram of data
hist(df$First)

By adding a few parameters we can control the bins, the fill color, the axis description and the title.

# controlling bins, color and labels
hist(df$First, breaks = 10, xlab = "Length [mm]",
     main = "Histogram of length measurements of the first sample",
     col = "lightblue")

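One parameter requires further clarification. It is called breaks, and it controls how the data is split into bins. When a single number is given, R treats it as a suggested number of bins and picks round break points, so the actual count may differ slightly. It is also good practice to check how a histogram looks with different numbers of bins.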

Histogram with 10 bins
Histogram with 20 bins
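To compare binning choices without redrawing the chart each time, hist can be called with plot = FALSE, which returns the break points and counts instead of a plot. The sketch below uses simulated values as a stand-in for the first sample (an assumption), and also shows how to pass explicit bin edges when an exact bin width is needed.

```r
# Simulated stand-in for one sample; not the real measurements
set.seed(42)
x <- rnorm(50, mean = 69.5, sd = 2)

# A single number is only a suggestion for the number of bins
h10 <- hist(x, breaks = 10, plot = FALSE)
h20 <- hist(x, breaks = 20, plot = FALSE)
cat("bins for breaks = 10:", length(h10$counts), "\n")
cat("bins for breaks = 20:", length(h20$counts), "\n")

# A vector of break points is used exactly as given: here 0.5 mm wide bins
edges <- seq(floor(min(x)), ceiling(max(x)), by = 0.5)
h_fixed <- hist(x, breaks = edges, plot = FALSE)
print(h_fixed$breaks)
print(h_fixed$counts)
```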

Normality test

After checking what the data looks like using a boxplot and/or a histogram, the next valuable piece of information can be obtained from a normality test. This is a statistical test that checks whether the data follow the normal distribution. Knowing that the collected data can be described by a normal distribution is helpful for some other statistical tests that will be described in later articles in this series.

In R we will use a built-in function, shapiro.test, which performs the Shapiro-Wilk test. Our particular interest lies in the p-value calculated by this statistical test. If a test returns a p-value, it means it involves hypothesis testing (more information on hypothesis testing will be provided in another article). For the Shapiro-Wilk test the null hypothesis is that the data come from a normal distribution. However, the p-value alone does not tell us much; we need to compare it to a certain threshold. In all of the statistical tests I have used so far, this threshold has been set at 0.05 (but it could also be lower or higher). So let’s look at the results:

> shapiro.test(df$First)

Shapiro-Wilk normality test

data: df$First
W = 0.96865, p-value = 0.2039

In this case the p-value is 0.2039, which is greater than 0.05. For the Shapiro-Wilk test this means we have no evidence to reject normality, so the collected data can reasonably be described with the normal distribution. Thus it makes sense to calculate the mean and standard deviation, which are the parameters of a normal distribution:

> # mean value
> m <- mean(df$First)
> print(m)
[1] 69.492
> # standard deviation
> s <- sd(df$First)
> print(s)
[1] 2.065444
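The normality decision and the two parameter estimates can be combined in a short script. The sketch below runs on simulated values standing in for the first sample, and the names alpha and looks_normal are my own for illustration. The object returned by shapiro.test is a list, so the p-value can be read from its p.value element.

```r
# Simulated stand-in for one sample; not the real measurements
set.seed(123)
x <- rnorm(50, mean = 69.5, sd = 2)

# Threshold for the p-value (0.05 as used in this article)
alpha <- 0.05

# Run the Shapiro-Wilk test and pull out the p-value
res <- shapiro.test(x)
looks_normal <- res$p.value > alpha

if (looks_normal) {
  # Normality is not rejected: estimate the parameters of the distribution
  cat("p-value:", res$p.value, "- normality not rejected\n")
  cat("mean:", mean(x), " sd:", sd(x), "\n")
} else {
  cat("p-value:", res$p.value, "- normality rejected at alpha =", alpha, "\n")
}
```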

With the mean and standard deviation, and a few slightly more involved lines of code, a normal fit can be plotted.

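# plotting normal fit: density of the fitted normal at each measured value
norm_fit <- dnorm(df$First, mean = m, sd = s)
plot(df$First, norm_fit, xlab = "Length [mm]", ylab = "Density")
# sort the points by length so the line is drawn from left to right
ord <- order(df$First)
lines(df$First[ord], norm_fit[ord], col = "orange")
grid()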
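A possible variation, not taken from the script above, is to draw the fitted curve over a histogram scaled to density, so that the bars and the curve share the same vertical axis. The sketch below again uses simulated values as a stand-in for the first sample, and evaluates the curve on an evenly spaced grid (grid_x is a name introduced here) to keep the line smooth.

```r
# Simulated stand-in for one sample; not the real measurements
set.seed(123)
x <- rnorm(50, mean = 69.5, sd = 2)
m <- mean(x)
s <- sd(x)

# Evenly spaced points across the data range for a smooth curve
grid_x <- seq(min(x), max(x), length.out = 200)
fit <- dnorm(grid_x, mean = m, sd = s)

# freq = FALSE scales the bars to density instead of counts
hist(x, freq = FALSE, col = "lightblue",
     xlab = "Length [mm]", main = "Histogram with normal fit")
lines(grid_x, fit, col = "orange", lwd = 2)
```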

Conclusion

The intention of this article has been to show the analysis of continuous data using basic R functions. The ggplot2 package has been omitted intentionally. The workflow starts with loading the dataset, continues with inspecting the types of data and plotting it using a histogram and a boxplot, and ends with checking for normality.

Here is the full file as a GitHub Gist

This concludes the first part of the data analysis workflow based on the methodology used in Six Sigma. The next part will focus on capability analysis of the data.

Rafal Burzynski

I like to learn stats, data science and coding in Python and R. Then I publish what I have learnt. See my other work at https://rafburzy.github.io