Can we get SAS Proc Freq with Python?
This article will explore the SAS Proc Freq procedure and how we can achieve similar results using Python libraries.
Introduction to Proc Freq
The frequency distribution of categorical variables is an essential part of descriptive statistics. SAS offers the freq procedure to easily achieve these stats. However, there are two reasons why one should know how to get frequency distribution in Python:
- Python is widely used in data science and offers a range of statistical libraries.
- Migrating code from SAS to Python requires learning how to perform frequency distribution using Python libraries.
- Python is a popular choice for cloud-based data processing and analysis, and knowing how to perform frequency distribution in Python can help teams leverage cloud-based analytics solutions.
Applications/Resources Used:
- SAS OnDemand (earlier University Edition) / — For writing SAS Code
- Kaggle — For writing python code
- Pandas — For dataframe operations
1. Basics
While SAS code can be concise, Python’s Pandas library offers simple and powerful methods to perform similar tasks
1.1 Single Var
The SAS PROC FREQ
procedure is used to obtain the frequency distribution of categorical variables.
SAS
- The
TABLES
statement specifies the variable for which the frequency distribution is to be obtained. TheDATA
statement specifies the input data set.
/* Import the CSV */
FILENAME Gov_C "/folders/myfolders/Assignments/governors_county.csv";
PROC IMPORT DATAFILE=Gov_C DBMS=CSV OUT=WORK.Gov_C_SAS;
GETNAMES=YES;
RUN;
/* freq procedure */
proc freq data=Gov_C_SAS;
tables state;
run;
Python
- Importing the CSV
- To import a CSV file, we can use Pandas’
read_csv()
method.
2. Sorted Frequencies by Labels
- To obtain frequency distribution sorted by labels, we can use Pandas library’s
value_counts()
method. This returns a pandas series object containing counts of unique values. - We can sort the data based on labels using
sort_index()
method.
3. Resultant Series to a DataFrame
- To convert the resulting series object to a DataFrame, we can use the
DataFrame()
method. - To obtain percentages for each value, we can divide the data values by their sum and multiply by 100. We can use the
sum()
andcumsum()
functions to get the sum and cumulative sums of the variables. - To match the SAS output, we can round the percentages up to two decimal places.
# Import Pandas
import pandas as pd
# Import CSV
data = pd.read_csv("../input/us-election-2020/governors_county.csv");
# Frequencies in Actual Order
datax = data['state'].value_counts().sort_index()
# Create a dataframe
datay = pd.DataFrame({
'state': datax.index,
'Frequency': datax.values,
'Percent': ((datax.values/datax.values.sum())*100).round(2),
'Cumulative Frequency': datax.values.cumsum(),
'Cumulative Percent': ((datax.values.cumsum()/datax.values.sum())*100)\
.round(2)
})
Output
Note: The Index can be dropped while exporting the DF to excel/CSV.
1.2 Multiple Variables
SAS
Just add another variable in the tables. order=freq
is used to sort the distribution in descending order.
proc freq data=Gov_C_SAS order=freq;
tables GENRE DISTRIBUTOR;
run;
Python
The multi_freq
function takes a list of variables and returns a list of frequency tables for each variable using the value_counts()
method of the pandas DataFrame. The tables are formatted using the tabulate
library, which converts the tables to PostgreSQL-style ASCII art.
The freq_tables
variable is assigned the result of calling the multi_freq
function with the list ['GENRE', 'DISTRIBUTOR']
.
The _
variable is used to capture the None
value returned by the print
function when it is called for each table in freq_tables
. This prevents the None
value from being displayed on the console.
from tabulate import tabulate
def multi_freq(variables):
tables = []
for var in variables:
table = df[var].value_counts().reset_index().\
rename(columns={'index': var, var: 'count'})
table_str = tabulate(table, headers='keys', tablefmt='psql')
tables.append(table_str)
return tables
freq_tables = multi_freq(['GENRE', 'DISTRIBUTOR'])
_ = [print(table) for table in freq_tables]
1.3 Number of Distinct Values
proc freq data=highest_grossers nlevels;
tables GENRE;
run;
To obtain the number of distinct values in a variable using Python, there are built-in functions such as nunique()
and value_counts()
. However, if you want to obtain an output similar to SAS, you can use the below custom function.
The following code creates a function called freq_table() that takes as input the pandas dataframe and the variable of interest. The function then calculates the number of unique values, the frequency distribution of distinct categories, and the number of missing levels. Finally, it outputs a table that summarizes the results.
def freq_table(df, var):
unique = df[var].nunique()
value_counts = df[var].value_counts(dropna=False).count()
missing_levels = value_counts - unique
table = pd.DataFrame({
'Variable': [var],
'Levels': [value_counts],
'Missing Levels': [missing_levels],
'Nonmissing Levels': [unique]
})
print('Number of Variable Levels')
return table
freq_table(df, 'GENRE')
2. Include the missing values
SAS: By default, missing values are dropped, use the missing
option to include them as a group.
Python: By default, the missing values are dropped, to keep missing values in the frequency table, add the dropna
parameter and set it to False.
3. Sorting Frequencies in Descending Order
SAS: By default, there is no order, we can specify the option order=freq to make it descending.
Python: Just drop the sort_index() method.
4. Remove Unwanted Statistics
SAS: Specify nopercent
and nocum
options for not printing the percentage and cumulative frequency and percentages, respectively.
Python: Just drop the last two columns while converting to dataframe.
5. Creating a Frequency Cross Tabulation
5.1 Tabular Reports
SAS: var1*var2 and dropping additional details to keep it simple.
The below code snippet calculates frequency tables for the variables county
and state
in the dataset Gov_C_SAS
. The options norow
, nocol
, and nopercent
are used to suppress the display of row and column percentages.
Python: Using pandas crosstab() method.
- Specifying margins equal to true for adding row and column totals.
- By default, the column name for these subtotals is “ALL”, we will change it to “Total” using the margins_name method to match the SAS output.
5.2 List Reports
The below SAS code generates a frequency table for the variables DISTRIBUTOR
and GENRE
in the dataset h_grosser
using the proc freq procedure.
The list
option in the ‘tables’ statement tells SAS to generate the frequency table in list format.
proc freq data=h_grosser;
tables DISTRIBUTOR * GENRE / list nocum missing;
run;
The below python code generates a similar frequency table for the variables DISTRIBUTOR
and GENRE
in a DataFrame called ‘df’.
# Group data by 'DISTRIBUTOR' and 'GENRE'
grouped_df = df.groupby(['DISTRIBUTOR', 'GENRE'], dropna=False)
# Calculate frequency and percentage for each group
result = grouped_df.size().reset_index(name='Frequency')
result['Percent'] = (result['Frequency'] / result['Frequency'].sum() * 100)\
.round(2)
print(result)
That’s all guys for SAS Freq in Python, the one thing we can notice here is that a simple proc freq with more details is simpler in SAS as compared to python. As we go deeper, the Pandas library does most of the work for us and we need to utilise the ready-made methods.
Let me know your thoughts in the comment section.