Data Analytics Using Python (Part_6)

Teena Mary
Budding Data Scientist
Jun 6, 2020 · 11 min read

This is the sixth post in a series of 12 posts in which we will learn about Data Analytics using Python. In this post, we will look into the concepts of randomized block design, two-way ANOVA and linear regression.

Index

  1. Randomized Block Design
  2. Two Way ANOVA
  3. Linear Regression

Randomized Block Design (RBD)

A completely randomized design (CRD) is useful when the experimental units are homogeneous. If the experimental units are heterogeneous, then blocking is often used to form homogeneous groups.

Why do we use RBD?

We use RBD because differences due to extraneous factors (ones that are not considered in the experiment) can inflate the MSE term in the F ratio. When that happens, the F value = MSTR/MSE can become small, signaling no difference among treatment means when in fact such a difference exists. Experimental studies in business often involve experimental units that are highly heterogeneous; as a result, randomized block designs are often employed. Blocking in experimental design is similar to stratification in sampling.

The purpose of RBD is to control some of the extraneous sources of variation by removing that variation from the MSE term. This design tends to provide a better estimate of the true error variance and leads to a more powerful hypothesis test in terms of the ability to detect differences among treatment means.

Let us examine this with a use case. A study measuring the fatigue and stress of air traffic controllers resulted in proposals for modification and redesign of the controller’s work station. After consideration of several designs for the work station, three specific alternatives were selected as having the best potential for reducing controller stress. The key question is: to what extent do the three alternatives differ in terms of their effect on controller stress?

This is an RBD problem because, in a completely randomized design, a random sample of controllers would be assigned to each work station alternative. However, controllers are believed to differ substantially in their ability to handle stressful situations: what is high stress to one controller might be only moderate or even low stress to another. Hence, when considering the within-group source of variation (MSE), we must realize that this variation includes both random error and error due to individual controller differences. In fact, managers expected controller variability to be a major contributor to the MSE term.

Fig 1: Summary of Stress Data for Air Traffic Controller Stress Test

Solving this example using ANOVA in Python:
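
The code image from the original post is not reproduced here; a minimal sketch using `scipy.stats.f_oneway` on the stress scores from Fig 1 (transcribed from the summary table, so treat the raw values as an assumption if your copy differs) would look like this:

```python
from scipy import stats

# Stress scores for the three work station alternatives (Fig 1),
# one value per controller (6 controllers per system)
system_a = [15, 14, 10, 13, 16, 13]
system_b = [15, 14, 11, 12, 13, 13]
system_c = [18, 14, 15, 17, 16, 13]

# One-way (completely randomized) ANOVA, ignoring the controller blocks
f_stat, p_value = stats.f_oneway(system_a, system_b, system_c)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")  # p > 0.05: fail to reject H0
```

Because the controller-to-controller variation is left inside MSE, the test fails to detect a difference at the 5% level.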

So, from the ANOVA test we see that the null hypothesis is not rejected, i.e., there isn’t any significant difference between the means of the three alternatives in terms of their effect on controller stress. But we suspect that controller-to-controller variability is masking a real difference. Hence we apply the RBD procedure to make the experimental units homogeneous and then do ANOVA on it again.

From the Fig 1. Summary of Stress Data for Air Traffic Controller Stress Test, we can get the mean of treatments as:

In the next ANOVA table, we include another row called the ‘Blocks’.

The terms in the RBD problem are defined by:

The computational procedure for a RBD problem is given by:
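
The formula images from the original post are not reproduced here; for an RBD with k treatments and b blocks (so nT = kb observations), the standard computational procedure is:

```latex
% Partition of the total sum of squares in an RBD
\mathrm{SST} = \mathrm{SSTR} + \mathrm{SSBL} + \mathrm{SSE},
\qquad (n_T - 1) = (k-1) + (b-1) + (k-1)(b-1)

% Treatment, block and error sums of squares and mean squares
\mathrm{SSTR} = b \sum_{j=1}^{k} (\bar{x}_{\cdot j} - \bar{\bar{x}})^2, \qquad
\mathrm{MSTR} = \frac{\mathrm{SSTR}}{k-1}

\mathrm{SSBL} = k \sum_{i=1}^{b} (\bar{x}_{i \cdot} - \bar{\bar{x}})^2, \qquad
\mathrm{MSBL} = \frac{\mathrm{SSBL}}{b-1}

\mathrm{SSE} = \mathrm{SST} - \mathrm{SSTR} - \mathrm{SSBL}, \qquad
\mathrm{MSE} = \frac{\mathrm{SSE}}{(k-1)(b-1)}, \qquad
F = \frac{\mathrm{MSTR}}{\mathrm{MSE}}
```

where x̄·j is the mean for treatment j, x̄i· is the mean for block i, and x̄̄ is the overall mean.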

Now, the ANOVA table for the air traffic controller stress test is:

Here, the calculated F-value is 5.53, with a p-value of 0.024. Since p = 0.024 < 0.05 (indeed, the calculated F even exceeds the α = 0.025 critical value of 5.46, so p < 0.025), we reject the null hypothesis at the 5% level of significance. Hence, there is a significant difference among the three systems in their effect on controller stress.

We can see that initially we had failed to reject the null hypothesis, but once we applied the RBD procedure, the null hypothesis was rejected.

Solving RBD example using ANOVA in Python:

Two Way ANOVA

A factorial experiment is an experimental design that allows simultaneous conclusions about two or more factors. The term factorial is used because the experimental conditions include all possible combinations of the factors. The effect of a factor is defined as the change in response produced by a change in the level of the factor. It is called a main effect because it refers to the primary factors in the study. For example, for a levels of factor A and b levels of factor B, the experiment will involve collecting data on ab treatment combinations. Factorial experiments are the only way to discover interactions between variables.

(a)Factorial Experiment, no interaction, (b) Factorial Experiment, with interaction

The simplest type of factorial experiment involves only two factors, say, A and B. There are a levels of factor A and b levels of factor B. This two-factor factorial is shown in the next table. The experiment has n replicates, and each replicate contains all ab treatment combinations.

Here, the observation in the (i, j)th cell for the kth replicate is denoted by y_ijk. In performing the experiment, the abn observations would be run in random order. Thus, like the single-factor experiment, the two-factor factorial is a completely randomized design.

Example: Three CAT preparation programs

As an illustration of a two-factor factorial experiment, we will consider a study involving the Common Admission Test (CAT), a standardized test used by graduate schools of business to evaluate an applicant’s ability to pursue a graduate program in that field. Scores on the CAT range from 200 to 800, with higher scores implying higher aptitude. In an attempt to improve students’ performance on the CAT, a major university is considering offering the following three CAT preparation programs:

  1. A three-hour review session covering the types of questions generally asked on the CAT.
  2. A one-day program covering relevant exam material, along with the taking and grading of a sample exam.
  3. An intensive 10-week course involving the identification of each student’s weaknesses and the setting up of individualized programs for improvement.

Factor 1 with three treatments: One factor in this study is the CAT preparation program, which has three treatments: Three-hour review, One-day program, and 10-week course. Before selecting the preparation program to adopt, further study will be conducted to determine how the proposed programs affect CAT scores.

Factor 2 with three treatments: The CAT is usually taken by students from three colleges: the College of Business, the College of Engineering, and the College of Arts and Sciences. Therefore, a second factor of interest in the experiment is whether a student’s undergraduate college affects the CAT score. This second factor, undergraduate college, also has three treatments: Business, Engineering, and Arts and sciences.

In experimental design terminology, a sample size of two for each treatment combination indicates that we have two replications. The term interaction between two factors refers to a new effect: if the interaction effect has a significant impact on the CAT scores, we can conclude that the effect of the type of preparation program depends on the undergraduate college.

ANOVA Table for the Two-factor Factorial Experiment with r Replications

Abbreviations:

ANOVA Procedure:

The ANOVA procedure for the two-factor factorial experiment requires us to partition the sum of squares total (SST) into four groups:

  • sum of squares for factor A (SSA)
  • sum of squares for factor B (SSB)
  • sum of squares for interaction (SSAB)
  • sum of squares due to error (SSE)

The formula for this partitioning follows: SST = SSA + SSB + SSAB + SSE
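
With a levels of factor A, b levels of factor B, and r replications, the degrees of freedom partition the same way, and each F statistic divides the corresponding mean square by MSE (a sketch of the standard formulas):

```latex
% Degrees of freedom for the two-factor factorial with r replications
(abr - 1) = (a-1) + (b-1) + (a-1)(b-1) + ab(r-1)

% Mean squares and F statistics
\mathrm{MSA} = \frac{\mathrm{SSA}}{a-1}, \quad
\mathrm{MSB} = \frac{\mathrm{SSB}}{b-1}, \quad
\mathrm{MSAB} = \frac{\mathrm{SSAB}}{(a-1)(b-1)}, \quad
\mathrm{MSE} = \frac{\mathrm{SSE}}{ab(r-1)}

F_A = \frac{\mathrm{MSA}}{\mathrm{MSE}}, \qquad
F_B = \frac{\mathrm{MSB}}{\mathrm{MSE}}, \qquad
F_{AB} = \frac{\mathrm{MSAB}}{\mathrm{MSE}}
```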

Computations and Conclusions:

CAT Summary Data for The Two-factor Experiment

Procedure to calculate ANOVA Table

Finally, we will get the table as follows:

From the above table, we can see that Factor A (preparation program) and the interaction have no significant effect on CAT scores, since their p-values are greater than 0.05, so the null hypothesis is not rejected for them. But Factor B (undergraduate college) does have an effect, since 0.005 < 0.05, and hence its null hypothesis is rejected.

Python Implementation

The above example can be implemented in Python using the following codes:

Linear Regression

Many problems in engineering and science involve exploring the relationships between two or more variables. Regression analysis is a statistical technique that is very useful for these types of problems. This model can also be used for process optimization, such as finding the level of temperature that maximizes yield, or for process control purposes.

Simple Linear Regression Model

The equation that describes how y is related to x and an error term is called the regression model. The simple linear regression model is: y = βo + β1.x + ε, where βo and β1 are called parameters of the model and ε is a random variable called the error term.

The simple linear regression equation is: E(y) = βo+ β1.x

  • The graph of the regression equation is a straight line.
  • E(y) is the expected value of y for a given x value.
  • β1 is the slope of the regression line.
  • βo is the y intercept of the regression line.

Positive Linear Relationship, Negative Linear Relationship, No Relationship

Estimated Simple Linear Regression Equation:

The estimated simple linear regression equation is ŷ = bo + b1.x. The graph is called the estimated regression line.

  • ŷ is the estimated value of y for a given x value.
  • b1 is the slope of the line.
  • bo is the y intercept of the line.

Estimation Process

There are various methods to estimate the values of the parameters. The most widely used is the method of least squares.

Least Squares Method

The least squares criterion is to choose bo and b1 to minimize the sum of squared deviations, Σ(yi − ŷi)²,

where:

yi = observed value of the dependent variable for the ith observation

ŷi = estimated value of the dependent variable for the ith observation

Slope for the estimated regression equation:

b1 = Σ(xi − x̅)(yi − ȳ) / Σ(xi − x̅)²

i.e., the sum of cross-products of the deviations of x and y from their means, divided by the sum of squared deviations of x.

The y-intercept for the estimated regression equation: bo = ȳ − b1.x̅

where:

  • xi = value of independent variable for ith observation
  • yi = value of dependent variable for ith observation
  • n = total number of observations
  • ȳ = mean value for dependent variable
  • x̅ = mean value for independent variable

Simple Linear Regression Formulation

Once the parameters are found, the regression model is computed and the regression line is found. For a simple linear regression, there will be one independent variable and one dependent variable. The linear regression model will give a straight line. The line will be drawn in such a way that the deviation of the data points from the regression line will be minimized.

Deviation from the estimated regression model

Let us look at an example. An auto company periodically has a special week-long sale. As part of the advertising campaign, the company runs one or more television commercials during the weekend preceding the sale. Data from a sample of 5 previous sales are taken.

The estimated regression equation is computed in the following way:

Hence, the estimated regression model is obtained, and a scatter plot is drawn to visualize the linearity of the data.

Python Implementation of Regression
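
The code image is missing from the post; a sketch that applies the least squares formulas above, with ad/sales data chosen to be consistent with the coefficients and R² quoted in the text (treat the raw values as an assumption):

```python
import numpy as np

# Sample of 5 previous sales: x = TV ads run, y = cars sold
# (values assumed; they reproduce the b0 = 10, b1 = 5 quoted in the text)
x = np.array([1, 3, 2, 1, 3])
y = np.array([14, 24, 18, 17, 27])

# Least squares estimates from the formulas above
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"b0 = {b0}, b1 = {b1}")  # b0 = 10.0, b1 = 5.0

# Coefficient of determination R^2 = SSR/SST
y_hat = b0 + b1 * x
ssr = np.sum((y_hat - y.mean()) ** 2)
sst = np.sum((y - y.mean()) ** 2)
print(f"R^2 = {ssr / sst:.3f}")  # about 0.877
```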

So, after the above steps, we get the regression coefficients bo = 10 and b1 = 5, exactly the values that were computed through manual calculation.

Coefficient of Determination

An important concept to understand is the relationship among the total sum of squares (SST), the sum of squares due to regression (SSR) and the sum of squares due to error (SSE): SST = SSR + SSE.

The coefficient of determination, denoted R² or r² and pronounced “R squared”, is the proportion of the variance in the dependent variable that is predictable from the independent variable, computed as R² = SSR/SST. R² evaluates the scatter of the data points around the fitted regression line: for the same data set, higher R-squared values represent smaller differences between the observed data and the fitted values. The higher the value of R², the better the regression model.

In the previous example, the value of R² is 0.877, implying that the regression relationship is very strong: about 88% of the variability in the number of cars sold can be explained by the linear relationship between the number of TV ads and the number of cars sold.

Assumptions About the Error Term ε:

  1. The error ε is a random variable with mean of zero.
  2. The variance of ε, denoted by σ², is the same for all values of the independent variable.
  3. The values of ε are independent.
  4. The error ε is a normally distributed random variable.


I’m a post graduate student doing Data Science from Christ (Deemed to be University) in Bengaluru.