BASIC STATISTICS using SAS
To extract meaningful information from data, SAS provides statistical functions and tests for analysis. This article walks through some basic statistical procedures and tests in SAS, including means, t-tests, chi-square, correlation, regression, and analysis of variance.
SAS (Statistical Analysis System) :
SAS is an integrated collection of programs used to read data from different sources (for example text, Excel, and CSV files), to perform reporting and analysis, and to manipulate data with a powerful programming language. We can access the Base SAS, SAS/STAT, and SAS/GRAPH software. SAS runs on several computing platforms, including PCs, UNIX and Linux operating systems, and mainframe computers.
SAS Datasets :
A SAS dataset consists of two parts: a descriptor portion (metadata) and the data values. The descriptor portion gives information about the dataset, such as variable names and data types.
When you have raw data in a text file, you can use a DATA step to read the file and create a SAS dataset. When your data is in an Excel sheet, you can import it into the SAS environment (converting it to a SAS dataset) with the IMPORT procedure.
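The two routes described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the file paths, the output dataset names, and the variable names (prdcode, sale, stock) are assumptions, and the text file is assumed to be space-delimited with one header row.

```sas
/* Read a space-delimited text file with a DATA step.
   The path and the variable list are hypothetical. */
data work.sales_txt;
    infile "/path/to/sales.txt" firstobs=2;   /* firstobs=2 skips the header row */
    input prdcode $ sale stock;
run;

/* Import an Excel workbook with PROC IMPORT (hypothetical path).
   DBMS=XLSX assumes SAS/ACCESS Interface to PC Files is available. */
proc import datafile="/path/to/sales.xlsx"
    out=work.sales_xlsx
    dbms=xlsx
    replace;
    getnames=yes;   /* take variable names from the first row */
run;
```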
SAS Statistics :
SAS/STAT includes exact techniques for small data sets, high-performance statistical modeling tools for large data tasks and modern methods for analyzing data with missing values.
SAS/STAT software provides statistical techniques for applications that span every industry. Some of the statistical measures and tests used in SAS are described below :
1. Mean : For a dataset, the arithmetic mean (average) is the central value of a discrete set of numbers: the sum of the values divided by the number of values.
Mean = (sum of values) / (number of values)
SAS PROC MEANS Procedure
We use PROC MEANS to find the arithmetic mean of our data. PROC MEANS produces descriptive statistics (mean, standard deviation, minimum, maximum, etc.) for the numeric variables in a dataset.
SAS PROC MEANS syntax is:
PROC MEANS <options>; <statements>;
The statistical options that may be requested are listed below (the default statistics are N, MEAN, STD, MIN, and MAX):
- N — Number of observations
- NMISS — Number of missing observations
- MEAN — Arithmetic average
- STD — Standard Deviation
- MIN — Minimum (smallest)
- MAX — Maximum (largest)
- RANGE — Range
- SUM — Sum of observations
- VAR — Variance
Other generally used options available in SAS PROC MEANS include:
- DATA= Specify dataset to use
- NOPRINT Do not print output
- MAXDEC=n Use n decimal places to print output.
Frequently used statements with SAS PROC MEANS include:
- BY variable list — Statistics are reported for groups in separate tables
- CLASS variable list — Statistics reported by groups in a single table
- VAR variable list — specifies which numeric variables to use
- OUTPUT OUT = dataset name — statistics will be output to a SAS data file
- FREQ variable — specifies a variable that represents a count of observations.
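The options and statements listed above can be combined in one call. The sketch below uses the SASHelp.cars dataset and its Type variable; the output dataset name work.car_stats is an assumption chosen for illustration.

```sas
/* Request specific statistics, printed with one decimal place,
   grouped by vehicle type in a single table */
proc means data=sashelp.cars n nmiss mean std range maxdec=1;
    class Type;
    var MSRP Invoice;
    /* Save the mean and standard deviation of each VAR variable.
       work.car_stats is a hypothetical name; AUTONAME generates
       variable names such as MSRP_Mean and MSRP_StdDev. */
    output out=work.car_stats mean= std= / autoname;
run;
```

CLASS is used here rather than BY because BY requires the input dataset to be sorted by the grouping variable first.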
Performing Arithmetic Mean on an Entire Dataset
This is the simplest form of PROC MEANS. We only need to specify the name of the dataset, not the variables.
Example-
Proc Means Data=SASHelp.cars;
Run;
PROC MEANS computes a set of descriptive statistics for all the numeric variables in the dataset.
By default, SAS calculates the statistics N, Mean, Standard Deviation, Minimum, and Maximum.
a. Display Different Decimal Places
You can also specify the number of decimal places you want to display using the MAXDEC= option.
Proc Means Data=SASHelp.cars maxdec=0;
Var MSRP Invoice;
Run;
proc Means Data=SASHelp.cars maxdec=2;
Var MSRP Invoice;
Run;
SAS Arithmetic Mean for Specific Variables
Sometimes you might be interested in only a few selected variables for your analysis. You can add the VAR statement to limit the analysis to only the variables you are interested in. The VAR statement below limits the analysis to only the MSRP and INVOICE variables. No results are computed for any other variables in the dataset.
Proc Means Data=SASHelp.cars;
Var MSRP Invoice;
Run;
SAS Arithmetic Mean by Class
Values often differ between groups. In the cars dataset, for example, the price of an Audi is likely to be very different from that of an Acura, so it makes more sense to analyze each car maker separately. A CLASS statement can be added to the MEANS procedure to group the analysis. With the variable MAKE specified as the classification variable, a separate analysis is produced for each car maker.
Proc Means Data=SASHelp.cars;
Class Make;
Var MSRP Invoice;
Run;
2. Median : After the values are arranged in ascending order, the value that lies in the middle is known as the median.
3. Mode : The value that appears most often is known as the mode.
Generate Histograms : A histogram displays the distribution of a numeric variable as bars; the area of each bar is proportional to the frequency of the corresponding category.
Probability Plots : Data analysts use the quantile-quantile plot (Q-Q plot) to assess graphically whether data can be modeled by a probability distribution such as the normal distribution. You can use the QQPLOT statement in PROC UNIVARIATE to create a Q-Q plot. A variant of the Q-Q plot, called the probability plot, can also be useful.
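The median, mode, histogram, and probability plots described above can all be obtained from PROC UNIVARIATE. The following is a minimal sketch that assumes the SASHelp.cars dataset and the MSRP variable used earlier; the choice of a normal reference distribution is an assumption for illustration.

```sas
/* The "Basic Statistical Measures" table of PROC UNIVARIATE
   reports the mean, median, and mode of each VAR variable */
proc univariate data=sashelp.cars;
    var MSRP;
    /* Histogram with a fitted normal density curve overlaid */
    histogram MSRP / normal;
    /* Q-Q plot against a normal distribution whose mean and
       standard deviation are estimated from the data */
    qqplot MSRP / normal(mu=est sigma=est);
    /* Probability plot variant of the same comparison */
    probplot MSRP / normal(mu=est sigma=est);
run;
```

Points that fall close to the reference line in the Q-Q plot suggest that the chosen distribution is a reasonable model for the data.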
T-Test : The t-test tells you whether the difference between means is statistically significant. There are three types of t-tests :
1 . One sample T-test
2 . Two sample T-test
3 . Paired T-test
The TTEST procedure automatically tests the assumption of homogeneity of variance for two sample designs and computes t- and p-values under the assumptions of both equal and unequal variances.
One sample T-test : Tests whether the sample mean differs significantly from a given number.
Practical usage : The purpose of the one sample t-test is to determine whether the null hypothesis should be rejected, given the sample data. The alternative hypothesis can take one of three forms, depending on the question being asked. If the goal is to measure any difference, regardless of direction, a two-tailed hypothesis is used. If the direction of the difference between the sample mean and the comparison value matters, either an upper-tailed or lower-tailed hypothesis is used.
proc ttest data=nutan.salesdata H0=180;
var sale;
run;
When we execute the above code, we get the following output :
Formula of One sample T-test :
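For reference, the usual textbook form of the one sample t statistic is given below. The symbols are chosen here for illustration: \bar{x} is the sample mean, \mu_0 is the hypothesized value (the H0= value, 180 in the example above), s is the sample standard deviation, and n is the sample size.

```latex
\[
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}},
\qquad \text{degrees of freedom} = n - 1
\]
```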
Two sample T-test : Tests whether there is a significant difference between the means of two independent samples, that is, whether both samples could come from populations with the same mean.
Practical Usage : If the goal is to measure any difference between the two group means, regardless of direction, a two-tailed hypothesis is used. If the direction of the difference matters, either an upper-tailed or lower-tailed hypothesis is used.
proc ttest data=nutan.salesdata h0=120;
class prdcode;
var sale;
run;
When we execute the above code, we get the following output :
Formula of Two sample T-test :
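For reference, the usual textbook forms of the two sample t statistic are given below, with symbols chosen here for illustration: \bar{x}_1, \bar{x}_2 are the group means, s_1^2, s_2^2 the group variances, n_1, n_2 the group sizes, and \delta_0 the hypothesized difference between the means (the H0= value). The first form assumes equal variances and uses the pooled variance s_p^2; the second is the unequal-variance form.

```latex
\[
t = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}
\]

\[
t' = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
\]
```

The pooled form has n_1 + n_2 - 2 degrees of freedom.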
Paired Sample T-test : Tests whether there is a significant difference between two paired samples.
Practical usage : The paired sample t-test is used to determine whether the mean difference between two sets of observations is zero. Each subject or entity is measured twice, resulting in pairs of observations. Common applications of the paired sample t-test include case-control studies and repeated-measures designs.
proc ttest data=nutan.salesdata alpha=0.05;
paired stock*sale;
run;
When we execute the above code, we get the following output :
Formula of Paired T-test :
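For reference, the usual textbook form of the paired t statistic is given below. The symbols are chosen here for illustration: d_i is the difference within pair i (stock minus sale in the example above), \bar{d} is the mean of those differences, s_d is their standard deviation, and n is the number of pairs.

```latex
\[
t = \frac{\bar{d}}{s_d / \sqrt{n}},
\qquad
\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i,
\qquad \text{degrees of freedom} = n - 1
\]
```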
Rule for tests : First, obtain the tabulated t value (Ttab) from the statistical tables using the degrees of freedom and the level of significance. For a two-tailed test, if the absolute value of the calculated t value (Tcal) is less than Ttab, accept (do not reject) the null hypothesis (H0); if it is greater than Ttab, reject the null hypothesis (H0). Equivalently, compare the probability of the t value (the p-value) with the level of significance, and reject H0 when the p-value is smaller. The default level of significance is 0.05.
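Instead of looking up printed tables, the tabulated value can be computed in a DATA step. The following is a sketch with assumed inputs: the degrees of freedom (24) and the calculated t value (1.85) are hypothetical numbers, not results from the datasets used in this article.

```sas
data _null_;
    length decision $ 16;
    alpha = 0.05;    /* level of significance */
    df    = 24;      /* hypothetical degrees of freedom */
    t_cal = 1.85;    /* hypothetical calculated t value */

    /* Two-tailed critical value from the t distribution */
    t_tab = quantile('T', 1 - alpha/2, df);

    /* Two-tailed p-value of the calculated t value */
    p_value = 2 * (1 - probt(abs(t_cal), df));

    if abs(t_cal) > t_tab then decision = 'Reject H0';
    else decision = 'Do not reject H0';

    /* Write the results to the SAS log */
    put t_tab= 8.4 p_value= 8.4 decision=;
run;
```

Both comparisons lead to the same decision: the absolute value of Tcal exceeds Ttab exactly when the p-value is below alpha.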
ANOVA (Analysis of Variance) : ANOVA is a technique used to test for differences among the means of several normal populations. It is a collection of statistical models and their associated estimation procedures. PROC ANOVA is designed for balanced data, meaning that each group contains an equal number of observations. If the data is unbalanced, use PROC GLM (General Linear Models) instead.
PROC GLM is more general than PROC ANOVA, but it takes more processing time.
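For unbalanced data, the same kind of model can be fitted with PROC GLM. The following sketch assumes the nutan.customers dataset with the variables prdcode and goods used in the examples of this article; the Tukey comparison is an optional addition chosen for illustration.

```sas
/* One-way analysis of variance with PROC GLM, which
   also handles groups of unequal size */
proc glm data=nutan.customers;
    class prdcode;
    model goods = prdcode;
    /* Group means with Tukey pairwise comparisons */
    means prdcode / tukey;
run;
quit;
```

PROC GLM is an interactive procedure, so QUIT is placed after RUN to end it.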
There are two types of ANOVA covered here. They are :
1 . One way ANOVA
2 . Two way ANOVA
1 . One way ANOVA : One-Way ANOVA compares the means of two or more independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. One-Way ANOVA is a parametric test.
Basic Syntax for One way ANOVA :
proc anova data=<dataset>;
class <class var>;
model <dependent var>=<independent var>;
means <variables>;
run;
quit;
Example :
For example, suppose a dataset named Customers contains the classification variable product code (prdcode) and the analysis variable goods. The code below compares the mean of goods across the levels of prdcode.
proc anova data=nutan.customers;
class prdcode;
model goods=prdcode;
run;
quit;
When we execute the above code, we get the following output :
From the above output, collect the R-square value, the F value, and the probability of the F value.
R-square = 0.01870
R-square represents the proportion of variation in the dependent variable that is explained by the independent variable.
F value = 0.19
Probability = 0.8282
Analyze :
In the above output, the F value is 0.19, which is below 1, and its probability is 0.8282, which is greater than the α value of 0.05. There is therefore no significant difference between the population means, so we accept (do not reject) the null hypothesis (H0).
2 . Two way ANOVA :
A two-way ANOVA tests the effect of two independent variables on a dependent variable. It analyzes the effect of each independent variable on the outcome as well as the interaction between the two.
Basic syntax for two way ANOVA :
proc anova data=<dataset>;
class <class var> <class var>;
model <dependent var>=<independent var> | <independent var>;
means <variables>;
run;
quit;
Example :
For example, suppose a dataset named Customers contains two classification variables, product code (prdcode) and customer number (custno), and the analysis variable goods. The code below compares the mean of goods across the levels of both factors and their interaction.
proc sort data=nutan.customers out=cust_out;
by prdcode custno;
run;
proc anova data=cust_out;
class prdcode custno;
model goods=prdcode|custno;
run;
quit;
This tests for a significant difference in goods between prdcode levels, by custno.
When we execute the above code, we get the following output :
From the above output, collect the R-square value, the F value, and the probability of the F value :
R-square = 0.292539
R-square as a percentage = 29%
F value = 0.54
Probability of F = 0.7102
Analyze :
In the above output, the F value is 0.54, which is below 1, and its probability is 0.7102, which is greater than the α value of 0.05. There is therefore no significant difference between the population means, so we accept (do not reject) the null hypothesis (H0).
Non-Parametric tests :
Chi-Square test : A chi-squared test is a statistical hypothesis test used when the test statistic is chi-squared distributed under the null hypothesis. Pearson’s chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies.
Basic syntax for Chi- square test :
proc freq data = <dataset>;
tables <variables> / <options> chisq;
run;
Example :
For example, suppose a dataset named Customers contains the classification variable prdcode. Using this variable, we can compute the chi-square value :
proc freq data=nutan.customers;
tables prdcode / nocum nopercent chisq;
run;
When we execute the above code, we get the following output :
From the above output, collect the probability of the chi-square value = 0.7376.
This is greater than our level of significance (alpha) of 0.05, so we accept (do not reject) the null hypothesis (H0).
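The example above applies the chi-square test to a single variable. The same CHISQ option can also test whether two categorical variables are independent. The following sketch assumes the SASHelp.cars dataset and its Origin and DriveTrain variables, chosen only for illustration.

```sas
/* Chi-square test of independence on a two-way table */
proc freq data=sashelp.cars;
    /* EXPECTED prints the expected frequency of each cell;
       NOROW, NOCOL, and NOPERCENT suppress the percentages */
    tables Origin*DriveTrain / chisq expected norow nocol nopercent;
run;
```

As before, a probability below the level of significance (0.05) would lead to rejecting the null hypothesis, which here states that the two variables are independent.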