Can we get SAS Proc Freq with Python?
In this article, we will discuss the SAS Proc Freq procedure and how we can achieve similar results using Python libraries.
Introduction to Proc Freq
Frequency distributions of categorical variables are a staple of descriptive statistics. SAS provides the freq procedure to produce them in a simple way. The two most common reasons you may need to know how to get a frequency distribution in Python are:
- Python is one of the most commonly used programming languages in data science.
- Your team is migrating the code from SAS to Python.
Tools used
- SAS OnDemand (earlier University Edition): for writing SAS code
- Kaggle: for writing Python code
- Pandas: for dataframe operations
1. Simple Proc Freq
- Importing the CSV
- Frequencies sorted by labels
- The value_counts() method returns a pandas Series containing the counts of unique values.
- sort_index() sorts the result by its labels.
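The two steps above can be sketched as follows. The CSV content and the column names (name, grade) are made up for illustration; in practice you would pass the path of your own file to read_csv().

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real data file.
csv_text = """name,grade
Ann,B
Bob,A
Cid,B
Dee,C
Eve,B
"""

# read_csv() accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Count each unique value, then order the result by label.
freq = df["grade"].value_counts().sort_index()
print(freq)
```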
- Converting the resulting Series to a DataFrame
- We will use the pandas DataFrame constructor for this purpose.
- Dividing each count by the sum of all counts and multiplying by 100 gives the percentage for each value. We use sum() for the total and cumsum() for the cumulative frequency and cumulative percentage.
- The percentages are rounded to two decimal places to match the SAS output.
Note: The index can be dropped when exporting the DataFrame to Excel/CSV, for example by passing index=False.
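A minimal sketch of this conversion, using made-up data. The column labels are chosen here to resemble the SAS output and are an assumption, not something pandas produces on its own.

```python
import pandas as pd

# Made-up categorical data for illustration.
grades = pd.Series(["B", "A", "B", "C", "B", "A", "B", "C"], name="grade")
counts = grades.value_counts().sort_index()

# Share of each value in the total, as a percentage.
percent = counts / counts.sum() * 100

table = pd.DataFrame({
    "Frequency": counts,
    "Percent": percent.round(2),
    "Cumulative Frequency": counts.cumsum(),
    "Cumulative Percent": percent.cumsum().round(2),
})
print(table)
```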
2. Include the missing values
SAS: By default, missing values are left out of the frequency table. Use the missing option to include them as a group.
Python: By default, missing values are dropped as well. To keep them in the frequency table, pass dropna=False to value_counts().
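For example, with a small made-up Series that contains two missing entries:

```python
import pandas as pd

# Made-up data; None is stored as a missing value.
grades = pd.Series(["B", "A", None, "B", None, "C"], name="grade")

# Default behaviour: missing values are ignored.
without_missing = grades.value_counts().sort_index()

# dropna=False keeps the missing values as their own group.
with_missing = grades.value_counts(dropna=False)

print(without_missing)
print(with_missing)
```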
3. Sort the rows from most frequent to least frequent
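SAS: By default, rows are ordered by the variable's internal (unformatted) values. Specify the option order=freq to sort them by descending frequency.
Python: Just drop the sort_index() call, since value_counts() already sorts by descending frequency.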
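A short illustration with made-up data:

```python
import pandas as pd

# Made-up data for illustration.
grades = pd.Series(["B", "A", "B", "C", "B", "A"], name="grade")

# Without sort_index(), the most frequent value comes first.
by_freq = grades.value_counts()
print(by_freq)
```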
4. Additional options
SAS: Specify the nopercent and nocum options to suppress the percentages and the cumulative frequencies and percentages, respectively.
Python: Just omit the percentage and cumulative columns while converting to a DataFrame.
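A sketch of a counts-only table, again with made-up data and an assumed column label:

```python
import pandas as pd

# Made-up data for illustration.
grades = pd.Series(["B", "A", "B", "C", "B", "A"], name="grade")
counts = grades.value_counts().sort_index()

# Keep only the frequency column: no percentages, no cumulative values.
table = pd.DataFrame({"Frequency": counts})
print(table)
```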
5. Creating a Frequency Cross Tabulation
SAS: Request var1*var2 in the tables statement, suppressing the additional details to keep the output simple.
Python: Use the pandas crosstab() function.
- Set margins=True to add row and column totals.
- By default, the label for these totals is "All". We change it to "Total" with the margins_name parameter to match the SAS output.
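A minimal sketch of such a cross tabulation. The columns gender and grade are hypothetical stand-ins for var1 and var2.

```python
import pandas as pd

# Made-up data; "gender" and "grade" stand in for var1 and var2.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "grade":  ["A", "B", "B", "B", "A", "A"],
})

# margins=True adds the totals; margins_name renames them from "All".
ct = pd.crosstab(df["gender"], df["grade"], margins=True, margins_name="Total")
print(ct)
```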
6. Frequency Procedure — Multiple Variables
SAS: List several variables in the tables statement to apply the freq procedure to each of them.
Python: We can loop through the variables in a list to get the frequency distribution for each one. The function takes 2 arguments:
A. a DataFrame
B. a list of columns
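One way such a function could look is sketched below. The name freq_tables, the assumption that the first argument is the DataFrame, and the sample data are all illustrative rather than taken from the original code.

```python
import pandas as pd

def freq_tables(df, columns):
    """Return a dict mapping each column name to its frequency table."""
    tables = {}
    for col in columns:
        counts = df[col].value_counts().sort_index()
        percent = counts / counts.sum() * 100
        tables[col] = pd.DataFrame({
            "Frequency": counts,
            "Percent": percent.round(2),
        })
    return tables

# Made-up data for illustration.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "grade":  ["A", "B", "B", "B", "A", "A"],
})

result = freq_tables(df, ["gender", "grade"])
for name, table in result.items():
    print(name)
    print(table)
```

Returning a dict keeps each table separate, so any one of them can be exported or inspected on its own.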
That's all for SAS Freq in Python. One thing to notice is that a simple proc freq with full details takes less effort in SAS than in Python. As we go deeper, however, the pandas library does most of the work for us, and we just need to use its ready-made methods.
Let me know your thoughts in the comment section.