Statistical analysis in R: for a starter palate

Shweta Dixit
8 min read · Jul 22, 2023


Statistical analysis is a crucial aspect of data science and data analysis. R, being a powerful statistical programming language, provides a wide range of functions and packages for statistical computation, many of which I have already covered in previous stories.

This write-up offers a friendly perspective on statistical analysis in R for anyone new to the field. The guide is aimed at beginners with some familiarity with R and statistics. I will first review some basic statistical measures, central tendency and dispersion, and then move on to regression analysis, using built-in datasets throughout to build understanding step by step. Most importantly, the content is approachable and suitable for those new to both statistics and R programming.

Let’s get started!

Univariate analysis is a statistical method that focuses on examining and understanding individual variables in a dataset. In this type of analysis, only one variable is considered at a time, and various descriptive statistics and visualization techniques are applied to summarize and explore its characteristics. The main goal of univariate analysis is to gain insights into the distribution, central tendency, dispersion, and other key attributes of a single variable. Common methods used in univariate analysis include:

Summary statistics: Measures such as mean, median, mode, variance, standard deviation, and percentiles are used to describe the distribution of the variable.

Histogram: A graphical representation showing the frequency distribution of values within the variable’s range.

Box plot: A visual summary of the distribution that displays the median, quartiles, and potential outliers.

Probability density function (PDF): A graphical representation of the probability distribution of continuous variables.

Cumulative distribution function (CDF): A function that shows the probability that a random variable takes a value less than or equal to a given value.

Kernel Density Estimation (KDE): A non-parametric way to estimate the probability density function of a continuous variable.
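Two of these, the KDE and the empirical CDF, take only a line each in base R. Here is a minimal sketch using the mpg column of the built-in mtcars dataset (the same dataset used in the examples below):

# Load the 'mtcars' dataset
data("mtcars")

# Kernel density estimate (KDE) of 'mpg'
plot(density(mtcars$mpg), main = "Kernel Density of mpg", xlab = "Miles per Gallon (mpg)")

# Empirical cumulative distribution function (CDF) of 'mpg'
plot(ecdf(mtcars$mpg), main = "Empirical CDF of mpg", xlab = "Miles per Gallon (mpg)")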

Bivariate analysis is a statistical method that involves analyzing the relationship between two variables simultaneously. In bivariate analysis, the focus is on how one variable (the independent variable) affects or influences another variable (the dependent variable). The objective is to understand the association, correlation, or dependence between the two variables. Common methods used in bivariate analysis include:

Scatter plot: A graphical representation that displays the relationship between two continuous variables, with each data point represented by a point on the plot.

Correlation: A numerical measure that quantifies the strength and direction of the linear relationship between two continuous variables.

Cross-tabulation (Contingency table): A summary table that displays the joint distribution of two categorical variables.

Chi-square test: A statistical test used to assess the association between two categorical variables.

T-tests or ANOVA: Statistical tests used to compare means between two or more groups in the context of a dependent variable and an independent categorical variable.

Univariate and bivariate analysis are fundamental techniques in statistical analysis, allowing researchers to gain a deeper understanding of the dataset and uncover relationships between variables that can inform further investigations or modeling.

# Load the 'mtcars' dataset
data("mtcars")

# Univariate Analysis
# Summary statistics of the 'mpg' variable
summary(mtcars$mpg)

# Histogram of 'mpg'
hist(mtcars$mpg, main = "Histogram of mpg", xlab = "Miles per Gallon (mpg)")

# Box plot of 'mpg'
boxplot(mtcars$mpg, main = "Box Plot of mpg", ylab = "Miles per Gallon (mpg)")

# Bivariate Analysis
# Scatter plot between 'mpg' and 'hp'
plot(mtcars$hp, mtcars$mpg, main = "Scatter Plot: mpg vs. hp", xlab = "Horsepower (hp)", ylab = "Miles per Gallon (mpg)")

# Correlation between 'mpg' and 'hp'
correlation <- cor(mtcars$mpg, mtcars$hp)
cat("\nCorrelation between 'mpg' and 'hp':", correlation, "\n")

From here, we can go deeper into statistical analysis in the sections below:

Section 1: Central Tendency and Dispersion

1.1 Central Tendency: Central tendency measures help us understand the typical or central value of a dataset. In R, we can calculate the mean, median, and mode of a dataset using functions like mean(), median(), and table().

1.2 Dispersion: Dispersion measures provide insights into the spread or variability of data points. Functions like range(), var(), sd(), and IQR() can be used to calculate the range, variance, standard deviation, and interquartile range, respectively.

Below is an R code example demonstrating how to calculate these measures for the built-in iris dataset.

# Load the 'iris' dataset
data("iris")

# Central Tendency Measures
mean_value <- apply(iris[, 1:4], 2, mean) # Calculate the mean for each column (attribute)
median_value <- apply(iris[, 1:4], 2, median) # Calculate the median for each column (attribute)
mode_value <- apply(iris[, 1:4], 2, function(x) { as.numeric(names(which.max(table(x)))) }) # Calculate the mode (most frequent value) for each column (attribute)

# Dispersion Measures
range_value <- apply(iris[, 1:4], 2, range) # Calculate the range for each column (attribute)
variance_value <- apply(iris[, 1:4], 2, var) # Calculate the variance for each column (attribute)
sd_value <- apply(iris[, 1:4], 2, sd) # Calculate the standard deviation for each column (attribute)
iqr_value <- apply(iris[, 1:4], 2, IQR) # Calculate the interquartile range for each column (attribute)

# Print the results
cat("Central Tendency Measures:\n")
cat("Mean:\n")
print(mean_value)
cat("\nMedian:\n")
print(median_value)
cat("\nMode:\n")
print(mode_value)

cat("\n\nDispersion Measures:\n")
cat("Range:\n")
print(range_value)
cat("\nVariance:\n")
print(variance_value)
cat("\nStandard Deviation:\n")
print(sd_value)
cat("\nInterquartile Range:\n")
print(iqr_value)

Section 2: Introduction to Regression Analysis

2.1 Understanding Regression Analysis: Regression analysis is a fundamental statistical method used to examine the relationship between one or more independent variables (also known as predictor or explanatory variables) and a dependent variable (the outcome or response variable). Its primary objective is to model and quantify the association between the variables, enabling us to make predictions and draw insights from the data. Regression analysis is a powerful statistical method used to establish relationships between variables.

Regression analysis finds applications in various fields, including economics, finance, social sciences, healthcare, marketing, and engineering. It aids in making data-driven decisions, forecasting trends, and identifying significant factors affecting an outcome.

2.2 Exploring the Dataset: We’ll use R’s built-in dataset mtcars for our regression analysis. This dataset contains information about various car models and their performance characteristics.

# Load the 'mtcars' dataset
data("mtcars")

# Display the structure of the 'mtcars' dataset
str(mtcars)

# Summary statistics of the 'mtcars' dataset
summary(mtcars)

# Correlation matrix of the variables in the 'mtcars' dataset
cor_matrix <- cor(mtcars)
print(cor_matrix)

# Scatter plot between 'mpg' and 'hp'
plot(mtcars$hp, mtcars$mpg, main = "Scatter Plot: mpg vs. hp", xlab = "Horsepower (hp)", ylab = "Miles per Gallon (mpg)")

2.3 Data Preparation: Before performing regression analysis, we need to preprocess the data, handling missing values and ensuring the relevant variables are of the correct data types.

# Load the 'mtcars' dataset
data("mtcars")

# Data Preparation for Simple Linear Regression
# Step 1: Check for Missing Values
missing_values <- sum(is.na(mtcars))
cat("Missing Values in 'mtcars' dataset:", missing_values, "\n")

# Step 2: Ensure 'mpg' and 'hp' are numeric
mtcars$mpg <- as.numeric(mtcars$mpg)
mtcars$hp <- as.numeric(mtcars$hp)

# Step 3: Split the data into Dependent and Independent Variables
dependent_var <- mtcars$mpg
independent_var <- mtcars$hp

# Step 4: Check the data type of the dependent and independent variables
cat("Data Type of 'dependent_var':", class(dependent_var), "\n")
cat("Data Type of 'independent_var':", class(independent_var), "\n")

2.4 Simple Linear Regression: The simplest form of regression analysis is linear regression, where a straight line is fitted to the data to represent the relationship between the variables. The equation of the line helps estimate the impact of the independent variables on the dependent variable.

In this section, we’ll conduct simple linear regression using the lm() function to analyze how the miles per gallon (mpg) of a car is affected by its horsepower (hp). From the “mtcars” dataset, “mpg” (miles per gallon) serves as the dependent variable and “hp” (horsepower) as the independent variable.

# Load the 'mtcars' dataset
data("mtcars")

# Simple Linear Regression
model <- lm(mpg ~ hp, data = mtcars)

# Summary of the Regression Model
summary(model)
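Once the model is fitted, we can overlay the regression line on the scatter plot and use the model to predict new values. A short sketch (the value hp = 150 below is just an illustrative input):

# Overlay the fitted regression line on the scatter plot
plot(mtcars$hp, mtcars$mpg, main = "mpg vs. hp with Fitted Line", xlab = "Horsepower (hp)", ylab = "Miles per Gallon (mpg)")
abline(model, col = "red")

# Predict 'mpg' for a hypothetical car with 150 horsepower
predict(model, newdata = data.frame(hp = 150))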

Section 3: Intermediate Regression Analysis

3.1 Multiple Linear Regression: When multiple independent variables are involved, we use multiple regression to assess their collective influence on the dependent variable. This enables us to consider the joint effect of several factors. Multiple linear regression allows us to model the relationship between a dependent variable and multiple independent variables.

Here, we’ll perform multiple linear regression using the mtcars dataset. Multiple linear regression involves predicting a dependent variable using two or more independent variables. In this example, we’ll use “mpg” (miles per gallon) as the dependent variable and “hp” (horsepower) and “wt” (weight) as the independent variables from the mtcars dataset.

# Load the 'mtcars' dataset
data("mtcars")

# Multiple Linear Regression
model <- lm(mpg ~ hp + wt, data = mtcars)

# Summary of the Regression Model
summary(model)
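To check whether adding wt actually improves on the horsepower-only model from Section 2.4, we can compare the two nested models with an F-test via anova(). A minimal sketch:

# Fit the simpler (hp only) and fuller (hp + wt) models
model_simple <- lm(mpg ~ hp, data = mtcars)
model_full <- lm(mpg ~ hp + wt, data = mtcars)

# F-test comparing the nested models; a small p-value favours the fuller model
anova(model_simple, model_full)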

3.2 Model Evaluation: Model evaluation is essential in regression analysis to determine the goodness of fit and the accuracy of predictions. Common evaluation metrics (more can be found in this earlier story) include R-squared, adjusted R-squared, and the standard error of the estimate. Below, we use R-squared, adjusted R-squared, and residual analysis to assess the model’s goodness of fit.

# Model Evaluation
# Residual Analysis
residuals <- resid(model)
plot(fitted(model), residuals, main = "Residuals vs. Fitted Values", xlab = "Fitted Values", ylab = "Residuals")

# R-squared and Adjusted R-squared
r_squared <- summary(model)$r.squared
adjusted_r_squared <- summary(model)$adj.r.squared
cat("\nR-squared:", r_squared, "\n")
cat("Adjusted R-squared:", adjusted_r_squared, "\n")

3.3 Polynomial Regression: Sometimes, the relationship between variables is not linear. Polynomial regression helps capture non-linear relationships. We’ll demonstrate how to fit polynomial regression models in R.

# Load the 'mtcars' dataset
data("mtcars")

# Polynomial Regression
# Fit a polynomial regression model with quadratic term (degree = 2)
model <- lm(mpg ~ hp + I(hp^2), data = mtcars)

# Summary of the Polynomial Regression Model
summary(model)

In this example, we fit a polynomial regression model with a quadratic term (degree = 2) for the “hp” (horsepower) variable. The formula used for the regression is mpg ~ hp + I(hp^2), which means we are regressing "mpg" (dependent variable) on "hp" and the quadratic term of "hp" (i.e., "hp^2"). The summary will provide insights into how well the polynomial model fits the data and the significance of the coefficients for both the linear and quadratic terms.
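To visualize the curvature that the quadratic term captures, we can overlay the fitted curve on the scatter plot. A minimal sketch:

# Plot the data and overlay the fitted quadratic curve
plot(mtcars$hp, mtcars$mpg, main = "Quadratic Fit: mpg vs. hp", xlab = "Horsepower (hp)", ylab = "Miles per Gallon (mpg)")
hp_grid <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 100)
lines(hp_grid, predict(model, newdata = data.frame(hp = hp_grid)), col = "red")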

When dealing with regression, we can’t ignore the regression assumptions: regression analysis relies on linearity, independence of errors, constant variance of errors (homoscedasticity), and normally distributed errors. Violations of these assumptions can affect the accuracy of the results.
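Two quick checks for these assumptions (a sketch, not a full diagnostic workflow; here model is the most recently fitted model):

# Normality of errors: Shapiro-Wilk test on the residuals
# (a large p-value gives no strong evidence against normality)
shapiro.test(residuals(model))

# Homoscedasticity: residuals vs. fitted values should show no clear pattern
plot(fitted(model), residuals(model), xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, lty = 2)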

Interpreting the regression coefficients is crucial to understanding the direction and magnitude of the relationships between variables. Positive coefficients indicate a positive relationship, negative coefficients imply a negative relationship, and coefficients close to zero suggest a weak or no relationship. On its own, though, this isn’t enough for a thorough interpretation; we’ll come back to that later.
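In R, the estimated coefficients and their 95% confidence intervals are easy to pull from the fitted model:

# Estimated regression coefficients
coef(model)

# 95% confidence intervals for the coefficients
confint(model)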

Statistical analysis is a crucial skill for anyone involved in data analysis and data science. In this guide, we explored the fundamental concepts of central tendency and dispersion before diving into regression analysis using R’s built-in datasets. By following these steps, beginners and intermediate users can gain a solid foundation in statistical analysis using R, enabling them to derive meaningful insights from data.

Remember, learning R along with statistics is an empowering skill that can open doors to diverse career opportunities and enable you to make data-driven decisions. Embrace the journey, enjoy the learning process, and soon you’ll be amazed by the incredible things you can achieve with R.

Happy coding!

Happy learning!
