Beginner to Advanced Exercises — R for Statistical Computing.
RSeries#2
Level 01
Data set :
https://github.com/TheCodingTrio/RStudio/raw/main/DATA%203.xlsx
The data set describes the accommodation of a set of students studying at a university college.
X1 = Age of Student
X2 = Gender
1 = Male
2 = Female
X3 = Accommodation
1 = Stays at Home
2 = Boarded Students
3 = Lodging
1. Identify the variables and import the given data set into R.
2. Analyze the data one variable at a time (univariate analysis).
3. Describe gender and accommodation together (bivariate analysis — Analyzing 2 variables at a time).
4. Describe age with gender/accommodation.
5. Find the mean age for all the combinations of gender and accommodation.
Hint: Use the following format.
#1. Identify the variables and import the given data set into R.
setwd("C:\\Users\\kodit\\Desktop\\Y2S2\\Probability and Statistics - IT2110\\Week 03\\Lab 03-20221202")
getwd()
#Import the data set (here the .xlsx file is assumed to have been saved as "DATA 3.csv" first)
data<-read.csv("DATA 3.csv", header = TRUE)
data
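#Optional alternative (a sketch, assuming the readxl package is installed): the source file is an
#.xlsx, so it can also be read directly without converting it to CSV first
#library(readxl)
#data<-as.data.frame(read_excel("DATA 3.xlsx"))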
#get data in editor mode
fix(data)
#rename the columns
#For renaming we use the names() function, and we pass the new column names as a character vector
names(data)<-c("Age","Gender","Accomodation")
fix(data)
#rename the categorical data
#we use the factor() function to replace the numeric codes with category labels
data$Gender<-factor(data$Gender,c(1,2),c("Male","Female"))
data$Accomodation<-factor(data$Accomodation,c(1,2,3),c("Home","Boarded","Lodging"))
fix(data)
#We have made these modifications in the editor mode
#We now want to refer to the columns (Age, Gender, Accomodation) directly by name, without typing data$ each time
#For that we use attach()
attach(data)
#IMPORTANT
#attach() takes a copy of the data frame at the moment it is called, so if you modify the data
#AFTER attaching, run detach(data) and then attach(data) again so the attached copy picks up the changes.
#(Attaching a second time without detaching also appears to work, because the new copy masks the old one,
#which is where the advice to "run attach(data) twice" comes from.)
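#A simpler habit (optional sketch) is to skip attach() entirely and refer to the columns with '$':
#table(data$Gender)
#table(data$Accomodation)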
################################################################################
#2. Analyze the data one variable at a time (univariate analysis).
#The categorical variables here are Gender and Accommodation; Age is numerical.
#The categorical variables are summarised with frequency tables and displayed with pie charts and bar charts;
#the numerical variable (Age) is displayed with a box plot.
#First we need to create the frequency tables.
#SYNTAX : table.name <- table(column name)
gender.freq<-table(Gender)
acc.freq<-table(Accomodation)
gender.freq
acc.freq
#Creating pie charts
#SYNTAX - pie(table name, main="title for pie chart")
#(note: the title goes in the main argument; the second positional argument of pie() is labels, not the title)
pie(gender.freq, main="Pie chart for Gender")
pie(acc.freq, main="Pie chart for Accomodation")
#You can see the pie chart in the Plots pane
#Creating bar plots
#SYNTAX - barplot(table name, main="title for bar plot", ylab="name for y axis")
#abline(h=0) draws a horizontal line at y = 0 so the bars sit on a visible baseline
barplot(gender.freq, main = "Bar plot for Gender", ylab = "Frequency")
abline(h=0)
barplot(acc.freq, main = "Bar plot for Accomodation", ylab = "Frequency")
abline(h=0)
#Note on abline (paraphrased from a StackOverflow answer): a line drawn with abline() cannot be deleted afterwards.
#Base R graphics use a "pen on paper" model - once anything is drawn on the device it stays there,
#so to "remove" a line you have to redraw the entire plot without it. Painting over the line in the
#background colour only fakes deletion; as far as R is concerned there are then two lines on top of each other.
#Creating box plots
#SYNTAX - boxplot(column name, main="title for box plot", ylab="name for y axis", outpch=n)
#outpch = n sets the plotting symbol (pch) used for the outliers; different values of n give different shapes
boxplot(Age,main="Boxplot for Age",ylab="Age",outpch=8)
###########################################################################
#3. Describe gender and accommodation together (bivariate analysis).
#We can use the 2-way frequency table
#Using 2 way frequency table we can generate Stack bar charts and Clustered bar charts.
#Lets first create 2-way frequency table
#SYNTAX - table name<-table(relevant columns)
gender_acc.freq<-table(Gender,Accomodation)
gender_acc.freq
#Creating a Stacked Bar Chart
#SYNTAX - barplot(table name, main="title for bar plot", legend=rownames(table name))
#abline(h=0) draws a horizontal line at y = 0 so the bars sit on a visible baseline
barplot(gender_acc.freq,main = "Gender & Accomodation", legend=rownames(gender_acc.freq))
abline(h=0)
#legend = rownames(...) is used because the rows of the table (Gender) become the legend entries
#Creating a Clustered Bar Chart
#SYNTAX - barplot(table name, beside=TRUE, main="title for bar plot", legend=rownames(table name))
#abline(h=0) draws the baseline at y = 0
barplot(gender_acc.freq,beside=TRUE,main = "Gender & Accomodation", legend=rownames(gender_acc.freq))
abline(h=0)
#beside = TRUE places the bars for each group side by side instead of stacking them
###########################################################################
#4. Describe age with gender/accommodation.
#For this we create a Side by Side Boxplot
#SYNTAX - boxplot(column1 name~column2 name,main="name for boxplot",xlab="x axis name",ylab="y axis name")
boxplot(Age~Gender,main="Boxplot for Age by Gender", xlab = "Gender", ylab = "Age")
#According to the two boxplots, the Age distribution for both Male and Female is positively skewed
boxplot(Age~Accomodation,main="Boxplot for Age by Accomodation", xlab = "Accomodation", ylab = "Age")
#Home has a roughly symmetric distribution
#Boarded and Lodging have positively skewed distributions
#If we want a different outlier symbol, set outpch = n (change n to see the different outlier shapes)
boxplot(Age~Accomodation,main="Boxplot for Age by Accomodation", xlab = "Accomodation", ylab = "Age", outpch=6)
###########################################################################
#5. Find the mean age for all the combinations of gender and accommodation
#Here we have to summarise 1 numerical variable (Age) over 2 categorical variables (Gender, Accommodation)
#In such a case we can use the xtabs() function
#xtabs(Age~Gender+Accomodation) gives the SUM of Age in every Gender x Accommodation cell;
#dividing by the two-way frequency table (the cell counts) turns those sums into means
#SYNTAX - xtabs(numeric column ~ factor1 + factor2) / frequency table
xtabs(Age~Gender+Accomodation) / gender_acc.freq
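#An equivalent cross-check (a sketch using the same attached columns): tapply() applies mean()
#to Age within every Gender x Accommodation combination and returns the same table of means
tapply(Age, list(Gender, Accomodation), mean)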
Level 02
Data Set :
https://github.com/TheCodingTrio/RStudio/blob/main/DATA%204.txt
Major League Baseball is known as “America’s pastime.” The role of Major League Baseball has been ingrained into American culture. The heroic figures and memorable moments of Major League Baseball reflect the type of attitude that American culture is built on. Given below are some measurements observed in this sport during the 1998 season.
X1 = Team Attendance
(Average number of spectators per game played by the team)
X2 = Team Salary
(Earnings of the team)
X3 = Years
(Years since the team has owned a stadium)
1. Identify the variables and enter the given data set into R.
2. Obtain the following for each variable:
a. Box-Plot, Histogram and Stem-Leaf Plot.
b. Mean, Median and Standard Deviation.
c. First and Third Quartile.
d. Interquartile Range.
3. Write a function to find the modes of a given set of values. Check the function by finding the mode of the variable “Years”.
4. Write a function that would produce the outliers when the values are given. Check the function with the 3 variables in the dataset.
#Set the working directory
getwd()
setwd("C:\\Users\\kodit\\Desktop\\Y2S2\\Probability and Statistics - IT2110\\Week 04\\Lab 04-20221202")
#check the newly set wd
getwd()
#Importing data set
data<- read.table("DATA 4.txt", header = TRUE, sep = "")
data
#to go into data editor view
fix(data)
#rename the column names
names(data)<- c("Team","Attendance","Salary","Years")
data
fix(data)
#attach() lets us access the columns directly by name without mentioning the entire data set
attach(data)
#If you don't use attach(data), you can't access the columns directly;
#you then have to use '$' to access them
#Ex : boxplot(data$Attendance)
#If you want to detach --> detach(data)
#Q2
#a. Box-Plot, Histogram and Stem-Leaf Plot.
#Box-Plot
boxplot(Attendance, main="Boxplot for Attendance", outline= TRUE, xlab="Attendance", horizontal = TRUE)
#horizontal = TRUE draws the boxplot horizontally
boxplot(Salary, main="Boxplot for Salary", outline= TRUE, xlab="Salary", horizontal = TRUE)
boxplot(Years, main="Boxplot for Years", outline= TRUE, xlab="Years", horizontal = TRUE)
#Histogram
hist(Attendance, main="Histogram for Attendance", ylab="Frequency")
abline(h=0)
hist(Salary, main="Histogram for Salary", ylab="Frequency")
abline(h=0)
hist(Years, main="Histogram for Years", ylab="Frequency")
abline(h=0)
#Stem-Leaf Plot
stem(Attendance)
stem(Salary)
stem(Years)
#b. Mean, Median and Standard Deviation.
#Mean
mean(Attendance)
mean(Salary)
mean(Years)
#Median
median(Attendance)
median(Salary)
median(Years)
#Standard Deviation
sd(Attendance)
sd(Salary)
sd(Years)
#c. First and Third Quartile
#the summary() function gives all of these values (min, Q1, median, mean, Q3, max)
summary(Attendance)
summary(Salary)
summary(Years)
#quantile() gives the same values expressed as percentages: the 0%, 25%, 50%, 75% and 100% quantiles
#note the spelling: the function is 'quantile', NOT 'quartile'
#quantile values
quantile(Attendance)
quantile(Salary)
quantile(Years)
#if we want a specific quartile, index into the result (R indexing starts from 1)
#quantile() returns the 0%, 25%, 50%, 75% and 100% values, so element [2] is Q1 and element [4] is Q3
#Let's say we need the 1st quartile of Attendance
quantile(Attendance)[2]
#3rd quartile
quantile(Attendance)[4]
#d. Interquartile Range.
IQR(Attendance)
IQR(Salary)
IQR(Years)
#3. Write a function to find the modes of a given set of values.
#Check the function by finding the mode of the variable “Years”.
get.mode<-function(y){
  counts<-table(y)
  names(counts)[counts==max(counts)]
}
#OR
get.mode<-function(y){
  counts<-table(y)
  names(counts[counts==max(counts)])
}
#get.mode is the name we give the function
#counts is a local variable holding the frequency table of y
#names(counts) extracts the category labels (the distinct values) from that table
#y is the argument we pass in; here it is the Years column
#max(counts) is the highest frequency, so the expression returns every value whose count equals it
get.mode(Years)
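#Quick check with a small made-up vector: both 2 and 3 appear twice, so both are returned as modes
get.mode(c(1,2,2,3,3))
#note that the result comes back as character strings (the names of the table);
#wrap it in as.numeric() if a numeric answer is needed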
#4. Write a function that would produce the outliers when the values are given.
#Check the function with the 3 variables in the dataset.
get.outliers<-function(y){
  q1<-quantile(y)[2]
  q3<-quantile(y)[4]
  iqr<-q3-q1
  ub<-q3 + 1.5*iqr
  lb<-q1 - 1.5*iqr
  print(paste("Upper Bound", ub))
  print(paste("Lower Bound", lb))
  print(paste("Outliers",paste(sort(y[y<lb|y>ub]),collapse=",")))
}
#LOGIC
#1) Get Q1
#2) Get Q3
#3) Compute the IQR = Q3 - Q1
#4) Compute the upper bound = Q3 + 1.5*IQR
#5) Compute the lower bound = Q1 - 1.5*IQR
#y[y<lb | y>ub] keeps the values that are below the lower bound OR above the upper bound,
#and sort() puts them in order
#collapse = "," glues the outliers into one string separated by commas (any separator symbol can be used)
#paste() concatenates strings and numbers into one character string, separating the pieces with a space by default
#paste0() does the same but with no separator at all (sep = "")
get.outliers(Years)
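#The question asks to check the function with all 3 variables, so run it on the other two as well
get.outliers(Attendance)
get.outliers(Salary)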
Level 03
Data Set :
https://github.com/TheCodingTrio/RStudio/blob/main/Data.txt
The number of shareholders is to be organized into a frequency distribution.
1) Draw a histogram for the above data.
2) Draw a histogram using seven classes where the lower limit is 130 and an upper limit of 270.
3) Construct the frequency distribution for the above specification.
4) Portray the distribution in the form of a frequency polygon.
5) Portray the distribution in a less-than cumulative frequency polygon.
6) Based on the polygon, three out of four (75%) of the companies have how many shareholders or less?
#set the correct wd
getwd()
setwd("C:\\Users\\kodit\\Desktop\\Y2S2\\Probability and Statistics - IT2110\\Week 05\\Lab 05-20221202")
getwd()
#read the text file
data1<-read.table("Data.txt",header = TRUE,sep = ",")
data1
fix(data1)
#let's rename the column headers to x1 and x2 for convenience
names(data1)<-c("x1","x2")
attach(data1)
#1) Draw a histogram for the above data.
hist(x2, main="Histogram for Number of Shareholders", ylab="Frequency")
abline(h=0)
#the values such as 150, 200, 250 shown on the histogram's x axis are class limits (breaks)
#each vertical bar is called a class
#here the default histogram has 8 classes
#2) Draw a histogram using seven classes where the lower limit is 130 and an upper limit of 270.
histogram<-hist(x2, main="Histogram for Number of Shareholders", ylab="Frequency",breaks = seq(130,270,length=8), right = FALSE)
#seq() generates the break points: seq(lower limit, upper limit, length = number of break points)
#for n classes we need n+1 break points, so the length argument is set to n+1
#here we need 7 classes, therefore length = 8, giving breaks at 130, 150, ..., 270
#the sequence is passed to the breaks argument of hist(), which uses those n+1 break points as class limits
#right = FALSE makes each class include its lower limit and exclude its upper limit, i.e. [130,150), [150,170), ...
#the axis labels may not show 130 and 270 explicitly, but they are the outer limits of the first and last classes
#3) Construct the frequency distribution for the above specification.
#since we need the histogram object for this, we assigned it to a variable above (histogram)
#to construct the frequency distribution, we have to identify some key points
#i) Identify the break points ; this is how we do it
#histogram$breaks
#let's round these values and assign them to a variable
breaks<-round(histogram$breaks)
breaks
#ii)identify frequencies of each class
freq<-histogram$counts
freq
#iii)Identify the mid points of each class
mids<-histogram$mids
mids
#With these 3 pieces of information we can build the frequency distribution table
#first of all let's create an empty vector to hold the class labels
classes<-c()
#then we use a for loop that runs over every class (one fewer than the number of break points):
#each pass builds a label from two consecutive break points and stores it in the classes vector
for(i in 1:(length(breaks)-1)){
  classes[i]<-paste0("[",breaks[i],"-",breaks[i+1],"]")
}
#the brackets around (length(breaks)-1) matter: 1:length(breaks)-1 would make the loop start at 0
#the loop above has created the class labels
#to display the frequency table we use cbind()
#cbind() binds the two vectors together as the columns of a single table
cbind(Classes=classes,Frequency=freq)
#Classes and Frequency are the column names
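#A shortcut worth knowing (same idea, shown only as a sketch): cut() bins the raw values into the
#same classes and table() counts them, giving the frequency distribution in one step
table(cut(x2, breaks = breaks, right = FALSE, include.lowest = TRUE))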
#4) Portray the distribution in the form of a frequency polygon.
#we use lines() to draw the frequency polygon on top of the existing histogram
lines(mids,freq)
#If we want to draw the freq polygon in a new plot
plot(mids,freq,type = "l", main = "Frequency polygon for number of shareholders",xlab = "Shareholders",
ylab = "Frequency",ylim = c(0,max(freq)))
#OR
plot(mids,freq,type = "o", main = "Frequency polygon for number of shareholders",xlab = "Shareholders",
ylab = "Frequency",ylim = c(0,max(freq)))
#OR
plot(mids,freq,type = "p", main = "Frequency polygon for number of shareholders",xlab = "Shareholders",
ylab = "Frequency",ylim = c(0,max(freq)))
#the type argument controls what gets drawn; the three values used here are:
#l - lines only
#o - lines with the points overplotted
#p - points only, no lines
#ylim = c(0,max(freq)) sets the range of the y axis (from 0 up to the largest frequency)
#5) Portray the distribution in a less-than cumulative frequency polygon.
#cumulative freq is the running total of the freq
#to find the cumulative freq we use the below cmd
#cumsum()
cum.freq<-cumsum(freq)
cum.freq
#to draw the cumulative frequency polygon, we first build a vector of cumulative frequencies aligned with the break points (starting at 0)
new<-c()
for (i in 1:length(breaks)){
  if(i==1){
    new[i]<-0
  }else{
    new[i]<-cum.freq[i-1]
  }
}
plot(breaks,new,type = "o", main = "Frequency polygon for number of shareholders",xlab = "Shareholders",
ylab = "Cumulative Frequency",ylim = c(0,max(cum.freq)))
#c(0,max(cum.freq)) keeps the y axis between 0 and the maximum cumulative frequency
cbind(Upper=breaks,CumulativeFrequency = new)
#6) Based on the polygon, three out of four (75%) of the companies have how many shareholders or less?
#Read this off the less-than cumulative frequency polygon: find 75% of the total frequency
#(0.75 * sum(freq)) on the y axis, move across to the polygon and read down to the x axis;
#that x value is the number of shareholders that 75% of the companies do not exceed.
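#A rough numeric check of that reading (a sketch): interpolate the cumulative polygon at 75% of the
#total frequency, and compare with the 75th percentile of the raw data
approx(new, breaks, xout = 0.75*sum(freq))$y
quantile(x2, 0.75)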
Level 04
Data Set :
https://github.com/TheCodingTrio/RStudio/blob/main/Forest.txt
The data set contains forest fire data. It comes from a data mining approach to predict forest fires using meteorological data. The variable descriptions are given below.
- X — x-axis spatial coordinate within the Montesinho park map: 1 to 9
- Y — y-axis spatial coordinate within the Montesinho park map: 2 to 9
- Month — month of the year: “jan” to “dec”
- Day — day of the week: “mon” to “sun”
- FFMC — FFMC index from the FWI system: 18.7 to 96.20
- DMC — DMC index from the FWI system: 1.1 to 291.3
- DC — DC index from the FWI system: 7.9 to 860.6
- ISI — ISI index from the FWI system: 0.0 to 56.10
- Temp — temperature in Celsius degrees: 2.2 to 33.30
- RH — relative humidity in %: 15.0 to 100
- Wind — wind speed in km/h: 0.40 to 9.40
- Rain — outside rain in mm/m2 : 0.0 to 6.4
- Area — the burned area of the forest (in ha): 0.00 to 1090.84 (this output variable is very skewed towards 0.0, thus it may make sense to model with the logarithm transform)
1) Identify the variables and import the given data set into R.
2) Get the summary of the data set
3) How many observations are there?
4) What is the maximum and minimum wind speed of this data set?
5) Get five number summary of temperature
6) How many outliers are there in the wind variable?
7) According to the boxplot of wind, what kind of distribution does it have?
8) What is the median of temperature?
9) What are the mean and standard deviation of the wind variable?
10) What is the interquartile range of the wind variable?
11) How many observations were measured on Fridays in August?
12) What is the average temperature during September?
13) On which day were the most observations measured during July?
#setting working directory
getwd()
setwd("C:\\Users\\kodit\\Desktop\\Y2S2\\Probability and Statistics - IT2110\\Week 06\\Lab 06-20221202")
getwd()
#1) Identify the variables and import the given data set into R.
#we can store the data to a variable called data1
data1<-read.table("Forest.txt",header=TRUE,sep=",")
data1
fix(data1)
#attach the data so the columns can be used directly by name (detach and re-attach if the data is modified later)
attach(data1)
#2) Get the summary of the data set
#str() shows the structure of the data set: the variable types and the number of observations
str(data1)
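#summary() gives the numerical summary (min, quartiles, mean, max) of every variable,
#which is usually what is meant by "the summary of the data set"
summary(data1)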
#3) How many observations are there?
#From the str() output you can see that there are 517 observations
#4) What is the maximum and minimum wind speed of this data set?
max(wind)
min(wind)
#5) Get five number summary of temperature
#five-number summary: min, Q1, median (Q2), Q3, max (summary() also reports the mean)
summary(temp)
#here the question asks for the five-number summary of temp (a single variable)
#for a summary of the whole data set we use str() / summary() as above
#6) How many outliers are there in the wind variable?
#1st argument - the variable you want to draw the boxplot for
#2nd argument - horizontal = TRUE draws the boxplot horizontally
#3rd argument - outline = TRUE displays the outliers, if there are any
#4th argument - pch sets the plotting symbol used for the outlier points
#with pch = 16 the outliers appear as filled circles; pch = 4 gives crosses
#pch takes values up to 25, each denoting a different shape
boxplot(wind, horizontal = TRUE, outline = TRUE, pch =16)
#the plot shows 3 points beyond the whiskers, so we have 3 outliers
#horizontal = FALSE would give a vertical boxplot
#outline = FALSE would hide the outliers
#7) According to the boxplot of wind what kind of a distribution it has?
#negatively skewed distribution
#8) What is the median of temperature?
median(temp)
#9) What are the mean and standard deviation of the wind variable?
mean(wind)
sd(wind)
#sd() gives the standard deviation; squaring it gives the variance
sv<-sd(wind)^2
sv
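#A quick cross-check (optional): var() computes the variance directly, so it should match sv above
var(wind)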
#10) What is the interquartile range of wind variable?
IQR(wind)
#11) How many observations were measured on Fridays in August?
#To find this we create the two-way frequency table of day and month
freq<-table(day,month)
freq
#21 observations
#12) What is the average temperature during September?
mean(temp[month=="sep"])
#13) On which day were the most observations measured during July?
count<-table(day[month=="jul"])
names(count[count==max(count)])
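#A shorter equivalent (just a sketch): which.max() picks the (first) largest count,
#so if two days tie it returns only one of them, unlike the version above
names(which.max(table(day[month=="jul"])))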
Level 05
Data Set :
https://github.com/TheCodingTrio/RStudio/blob/main/Data%20-%20Lab%207.txt
The nicotine contents, in milligrams, of 40 cigarettes of a certain brand (the population) were recorded.
1. Calculate population mean and variance of the dataset.
2. Get 30 random samples of size 5, with replacement and calculate sample mean and sample variance for each sample.
3. Calculate mean and variance of the Sample Means.
4. Compare and state relationship (if any) Population Mean and the Mean of Sample Means.
5. Compare and state relationship (if any) Population Variance and the Variance of Sample Means.
Use the Following Format.
getwd()
setwd("C:\\Users\\kodit\\Desktop\\Y2S2\\Probability and Statistics - IT2110\\Week 09\\Lab 07-20221202")
getwd()
#lets import the txt file
nicotine<-read.table("Data - Lab 7.txt", header = TRUE)
fix(nicotine)
nicotine
#When we print nicotine, the data comes back as a one-column data frame (a vertical column)
#nicotine[[1]] extracts that first column as a plain numeric vector
#The result can be assigned back to the same variable or to a new one
nicotine<-nicotine[[1]]
#Now you can see that the structure has changed from a data frame to a vector
nicotine
#1. Calculate population mean and variance of the dataset.
#population means we have to consider the whole dataset
mean(nicotine)
var(nicotine)
sd(nicotine) #std deviation
#2. Get 30 random samples of size 5, with replacement and calculate sample mean and sample variance for each sample.
#to get a random sample of size 5 WITH replacement, use sample(nicotine, 5, replace = TRUE)
#(replace = TRUE is needed because the question asks for sampling with replacement;
#the default of sample() is sampling without replacement)
#Save it to a variable
s<-sample(nicotine,5,replace = TRUE)
#you can see a random sample of size 5
s
#we need 30 samples.
#So to do that let's create an empty vector called samples and
#another empty vector called n
samples<-c()
n<-c()
#then using a for loop we will draw 30 random samples (each of size 5, with replacement)
for (i in 1:30) {
  s<-sample(nicotine,5,replace = TRUE)
  #combine the random samples column by column into 'samples'
  samples<-cbind(samples,s)
  #build the column names S1, S2, ..., S30 and collect them in n
  #paste0('S', i) glues 'S' and the sample number i together with no space in between
  n<-c(n,paste0('S',i))
}
#to set the column names, assign the vector n (which holds S1, S2, ..., S30)
#to colnames(samples)
colnames(samples)<-n
#now you can see 30 random samples of size 5
#now let's find the mean and variance of each sample
#colMeans(samples) computes the mean of every column (i.e. of every sample)
#let's assign the result to a variable called s.means
s.means<-colMeans(samples)
s.means
#to find the sample variance of each of the 30 samples use apply(samples,2,var)
#samples is the matrix of samples
#the 2 means apply the function over the COLUMNS, since each sample is stored as a column;
#if the samples were stored row-wise you would pass 1 instead
#var is the function applied to every column
#let's save the result in a variable
s.vars<-apply(samples,2,var)
s.vars
#3. Calculate mean and variance of the Sample Means.
#Asking to calculate the mean and the variance of the "Sample Means"
#s.means is the variable name you have to use
mean(s.means)
#to find variance
var(s.means)
#4. Compare and state relationship (if any) Population Mean and the Mean of Sample Means.
mean(nicotine)
mean(s.means)
#Population mean = 1.77425
#Mean of the Sample Means = 1.7454 (this value varies slightly from run to run, since the samples are random)
#the two values are approximately equal
#5. Compare and state relationship (if any) Population Variance and the Variance of Sample Means.
var(nicotine)
var(s.means)
#Population variance = 0.1524558
#Variance of the Sample Means = 0.02157383
#the two values are NOT equal, and they should not be:
#by the Central Limit Theorem the variance of the sample means is approximately the population
#variance divided by the sample size (sigma^2 / n = 0.1525 / 5 = 0.0305), which is of the same
#order as the value observed above; with many more samples the observed value settles close to sigma^2 / n
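#Quick check of that relationship (a sketch): the theoretical variance of the mean of a size-5 sample
var(nicotine)/5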
Level 06
Use R to find the probabilities in the following questions.
1) A company claims that their drug treatment cures 92% of cases of hookworm in children. Suppose that 44 children suffering from hookworm are to be treated with this drug and that the children are regarded as a simple random sample taken from a large population of children suffering from hookworm. Let X denote the number of children cured from a sample of 44 children.
i. What is the distribution of X?
ii. What is the probability that 40 children are cured?
iii. What is the probability that less than or equal to 35 children are cured?
iv. What is the probability that at least 38 children are cured?
v. What is the probability that between 40 and 42 (both inclusive) children are cured?
2) Data from the maternity ward in a certain hospital shows that there is a historical average of 4.5 babies born in this hospital every day.
i. What is the probability that 6 babies will be born in this hospital tomorrow?
ii. What about the probability of more than 6 babies being born?
3) The time (in hours) required to repair a machine is an exponentially distributed random variable with parameter λ=1/2.
i. Find the probability that a repair time takes at most 3 hours.
ii. Find the probability that a repair time exceeds 4 hours.
iii. Find the probability that a repair time takes between 2 to 4 hours
4) Assume that human body temperatures are normally distributed with a mean of 36.8 °C and a standard deviation of 0.4 °C.
i. A hospital uses 37.9 °C as the lowest temperature considered to be a fever. What is the probability that a randomly selected person would have a fever?
ii. What is the probability that a randomly selected person would have a temperature between 36.4 °C and 36.9 °C?
iii. Physicians want to select a maximum temperature for requiring further medical tests. What should that temperature be, if we want only 1.2% of the people to fall below it?
iv. Physicians want to select a minimum temperature for requiring further medical tests. What should that temperature be, if we want only 1.0% of the people to fall above it?
#1) A company claims that their drug treatment cures 92% of cases of hookworm in
#children. Suppose that 44 children suffering from hookworm are to be treated with
#this drug and that the children are regarded as a simple random sample taken
#from a large population of children suffering from hookworm.
#Let X denote the number of children cured from a sample of 44 children.
#i. What is the distribution of X?
#Binomial Distribution
#X ~ Bin(n = 44, p = 0.92)
#ii. What is the probability that 40 children are cured?
?dbinom
dbinom(40,44,0.92)
#Here the question asks about one exact count, so we use the binomial probability mass
#(density) function dbinom(x, size, prob), which gives P(X = x)
#iii. What is the probability that less than or equal to 35 children are
#cured?
#for "less than or equal to" / "greater than or equal to" type questions we use the
#cumulative distribution function pbinom(q, size, prob), which gives P(X <= q)
pbinom(35,44,0.92)
#iv. What is the probability that at least 38 children are cured?
#P(X >= 38) = 1 - P(X <= 37)
1 - pbinom(37,44,0.92)
#we use pbinom() for these situations as well
#pbinom(37,44,0.92) is P(X <= 37), the probability that 37 or fewer children are cured
#subtracting it from 1 (the total probability) gives P(X >= 38)
#since the question says at least 38, we evaluate the CDF at the value just below, which is 37
#v. What is the probability that between 40 and 42 (both inclusive)
#children are cured?
#we use pbinom() for these situations as well
#P(40 <= X <= 42) = P(X <= 42) - P(X <= 39)
pbinom(42,44,0.92) - pbinom(39,44,0.92)
#although the lower limit in the question is 40, it must be included,
#so we subtract the CDF at 39 (the value just below 40), just as in the previous part
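#Sanity check (optional): summing the exact probabilities for 40, 41 and 42 gives the same answer
sum(dbinom(40:42, 44, 0.92))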
#2) Data from the maternity ward in a certain hospital shows that there is a
#historical average of 4.5 babies born in this hospital every day.
#X ~ Poisson(lambda = 4.5)
?dpois
#i. What is the probability that 6 babies will be born in this hospital
#tomorrow?
dpois(6,4.5)
#Here again an exact count is asked for, so we use the Poisson probability mass function dpois(x, lambda)
#ii. What about the probability of more than 6 babies being born?
ppois(6,4.5, lower.tail = FALSE)
#since the question says more than 6, we use the cumulative function ppois()
#lower.tail = FALSE gives the upper tail P(X > 6) instead of the default P(X <= 6),
#because only the "more than 6" part is relevant here
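#Equivalent way of writing it (a quick check): P(X > 6) = 1 - P(X <= 6)
1 - ppois(6, 4.5)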
#3) The time (in hours) required to repair a machine is an exponential distributed
#random variable with parameter λ=1/2.
#X ~ Exp(rate = 1/2)
#i. Find the probability that a repair time takes at most 3 hours.
pexp(3, rate = 1/2)
?pexp
#"at most 3" means P(X <= 3); for a cumulative probability we use pexp(q, rate)
#ii. Find the probability that a repair time exceeds 4 hours.
pexp(4, rate = 1/2, lower.tail = FALSE)
#the question says exceeds 4 (more than 4), so we need the upper tail P(X > 4);
#that's why lower.tail = FALSE is used
#iii. Find the probability that a repair time takes between 2 to 4 hours
#P(2<X<4)
pexp(4,rate = 1/2) - pexp(2, rate = 1/2)
#Important: unlike the binomial and Poisson (discrete) cases, the exponential distribution is
#continuous, so P(X = 2) = 0 and the lower value is used as it is, without reducing it by one.
#That's why 2 is taken as 2.
#4) Assume that human body temperatures are normally distributed with a mean of 36.8 °C and a
#standard deviation of 0.4 °C.
#i. A hospital uses 37.9 °C as the lowest temperature considered to be a fever.
#What is the probability that randomly selected person would have a fever?
#P(X >= 37.9) = 1 - P(X <= 37.9)
1- pnorm(37.9, 36.8, 0.4)
?pnorm
#ii. What is the probability that a random selected person would have a temperature
#between 36.4 °C and 36.9 °C?
#P(36.4 <= X <= 36.9) = P(X <= 36.9) - P(X <= 36.4)
pnorm(36.9, 36.8, 0.4) - pnorm(36.4,36.8, 0.4)
#The normal distribution is also continuous, so the lower value is again used as it is.
#iii. Physicians want to select a maximum temperature for requiring further medical
#tests. What should that temperature be, if want only 1.2% of the people to
#fall below it?
#When a percentage (probability) is given and the corresponding temperature is asked for, we use the quantile function qnorm()
qnorm(0.012, 36.8, 0.4)
?qnorm
#iv. Physicians want to select a minimum temperature for requiring further
#medical tests. What should that temperature be,
#if we want only 1.0% of the people to fall above it?
#here 1% of people should fall ABOVE the cutoff, so we need the upper-tail quantile
qnorm(0.01, 36.8, 0.4, lower.tail = FALSE)
#Extra Note
#equivalently, if 1% fall above the cutoff then 99% fall below it, so we can use 0.99 with the lower tail
qnorm(0.99, 36.8, 0.4)
#note that qnorm(1, 36.8, 0.4) gives infinity, because 1 means 100% of the distribution lies below it
qnorm(1, 36.8, 0.4)
Level 07
Data Set :
https://github.com/TheCodingTrio/RStudio/blob/main/Data%20-%20Lab%2008.txt
1. Draw the scatterplot for the above observations and comment on the plot.
2. State the sample correlation coefficient.
3. Test the hypothesis that there is no correlation between the two variables and interpret the result.
4. Find the fitted regression model.
5. Use the output of the ANOVA test to find whether the slope of the regression is zero.
getwd()
setwd("C:\\Users\\kodit\\Desktop\\Y2S2\\Probability and Statistics - IT2110\\Week 12\\Lab 08-20221202")
getwd()
#1. Draw the scatterplot for the above observations and comment on the plot.
data1<-read.table("Data - Lab 08.txt",header = TRUE, sep = "")
data1
fix(data1)
# Create a scatter plot
plot(data1, main = "Scatter Plot of X vs Y", xlab = "X", ylab = "Y")
#2. State the sample correlation coefficient.
#In R Studio, you can find the sample correlation coefficient using the cor() function.
cor(data1)
#You can also find the correlation matrix for multiple variables at once by passing a data frame to the cor() function:
#data <- data.frame(x = c(2, 4, 5, 7, 8), y = c(3, 6, 5, 8, 9), z = c(1, 3, 2, 5, 6))
#cor(data)
#3. Test the hypothesis that there is no correlation between the two variables
#and interpret the result.
data1
cor.test(data1$x,data1$y, method = "pearson")
#we can see that the correlation coefficient (r) is 0.797025 and the
#p-value is 2.588e-05. Since the p-value is less than 0.05,
#we can conclude that the correlation between the two variables is
#statistically significant and we reject the null hypothesis
#that there is no correlation between the two variables.
#4. Find the fitted regression model.
#To find the fitted regression model in R, you can use the lm() function, which
#stands for "linear model"; here we regress y on x
fit<-lm(y~x, data = data1)
fit
# Print the summary of the fitted model
summary(fit)
#5. Use the output of the ANOVA test to
#find whether the slope of the regression is zero.
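#A minimal sketch for this data set, reusing the model fitted in step 4 (stored in 'fit'):
#anova() prints the ANOVA table; the p-value in the x row tests H0: slope = 0
anova(fit)
#If that p-value is below 0.05, reject H0 and conclude that the slope is not zero
#(see the worked mtcars example in point (5) below for the same idea written out in full)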
Important points regarding questions 3, 4 & 5 above
(3)
To test the hypothesis that there is no correlation between two variables in R Studio, you can use the function cor.test().
Here is an example of how to do it:
# Create two variables
x <- c(2, 3, 5, 7, 9)
y <- c(5, 7, 8, 10, 12)
# Calculate the correlation coefficient and perform the test
cor.test(x, y, method = "pearson")
The cor.test() function calculates the correlation coefficient between the two variables and performs a hypothesis test to determine whether the correlation is statistically significant or not. The argument method = "pearson" specifies that we want to use the Pearson correlation coefficient.
The output of the cor.test() function includes the correlation coefficient (r) and the p-value of the test. The p-value is used to determine whether the correlation is statistically significant or not. If the p-value is less than the significance level (usually 0.05), we reject the null hypothesis that there is no correlation between the two variables.
For example, if the output of the cor.test() function is:
Pearson's product-moment correlation
data: x and y
t = 4.1012, df = 3, p-value = 0.02015
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3691605 0.9939565
sample estimates:
cor
0.9285714
we can see that the correlation coefficient (r) is 0.9285714 and the p-value is 0.02015. Since the p-value is less than 0.05, we can conclude that the correlation between the two variables is statistically significant and we reject the null hypothesis that there is no correlation between the two variables.
(4)
To find the fitted regression model in R, you can use the lm() function which stands for "linear model". Here is an example of how to do it:
Assuming you have two variables x and y, and you want to find the fitted regression model:
# Generate some data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
# Fit the regression model
fit <- lm(y ~ x)
# Print the summary of the fitted model
summary(fit)
The output will show you the coefficients of the model and other statistics. The most important values to look at are the Coefficients table, which shows the estimated values of the intercept and slope, and the residual standard error and R-squared, which indicate how well the model fits the data.
You can also plot the fitted regression line along with the scatter plot of the data by using the plot() function and passing the fit object as an argument:
# Plot the scatter plot of the data
plot(x, y)
# Add the fitted regression line
abline(fit)
This will plot the scatter plot of the data and add the fitted regression line.
(5)
To test whether the slope of the regression is zero using the ANOVA test output in R Studio, you can perform an F-test by comparing the sum of squares for the regression to the sum of squares for the error. Here is an example:
# Fit a linear regression model
model <- lm(mpg ~ wt, data = mtcars)
# Perform ANOVA test and get the F-test statistic and p-value
anova_result <- anova(model)
f_statistic <- anova_result$`F value`[1]
p_value <- anova_result$`Pr(>F)`[1]
# Test if the slope of the regression is zero using the F-test
alpha <- 0.05
if (p_value < alpha) {
  cat("Reject the null hypothesis that the slope of the regression is zero.")
} else {
  cat("Fail to reject the null hypothesis that the slope of the regression is zero.")
}
In this example, mpg is the response variable and wt is the predictor variable. The ANOVA test is performed using the anova() function on the linear regression model model. The F-test statistic and p-value are extracted from the ANOVA test results using the $ operator.
The null hypothesis is that the slope of the regression is zero, which means there is no linear relationship between the response and predictor variables. If the p-value is less than the significance level alpha (typically 0.05), then the null hypothesis is rejected and there is evidence to suggest that there is a linear relationship between the variables. If the p-value is greater than alpha, then the null hypothesis is not rejected and there is insufficient evidence to suggest a linear relationship. The output will indicate whether the null hypothesis is rejected or not.
Congratulations on completing the R Beginner to Advanced Exercises! Your dedication and effort in mastering R programming from the basics to advanced concepts is commendable. By successfully navigating through these questions, you have demonstrated a solid understanding of R and its application in various scenarios. This achievement is a testament to your perseverance and commitment to expanding your skills in statistical computing and data analysis. Well done on this significant milestone, and may your newfound expertise in R open doors to exciting opportunities in the world of data science and analytics. Keep up the great work! 😉