Can we get SAS Proc Freq with Python?

This article will explore the SAS Proc Freq procedure and how we can achieve similar results using Python libraries.

Pradeep Singh
Eduonline 24
6 min readNov 21, 2020

--

Introduction to Proc Freq

The frequency distribution of categorical variables is an essential part of descriptive statistics. SAS offers the freq procedure to easily achieve these stats. However, there are two reasons why one should know how to get frequency distribution in Python:

  • Python is widely used in data science and offers a range of statistical libraries.
  • Migrating code from SAS to Python requires learning how to perform frequency distribution using Python libraries.
  • Python is a popular choice for cloud-based data processing and analysis, and knowing how to perform frequency distribution in Python can help teams leverage cloud-based analytics solutions.

Applications/Resources Used:

  • SAS OnDemand (earlier University Edition) / — For writing SAS Code
  • Kaggle — For writing python code
  • Pandas — For dataframe operations

1. Basics

While SAS code can be concise, Python’s Pandas library offers simple and powerful methods to perform similar tasks

1.1 Single Var

The SAS PROC FREQ procedure is used to obtain the frequency distribution of categorical variables.

SAS

  • The TABLES statement specifies the variable for which the frequency distribution is to be obtained. The DATA statement specifies the input data set.
/* Import the CSV */

FILENAME Gov_C "/folders/myfolders/Assignments/governors_county.csv";

PROC IMPORT DATAFILE=Gov_C DBMS=CSV OUT=WORK.Gov_C_SAS;
GETNAMES=YES;
RUN;

/* freq procedure */
proc freq data=Gov_C_SAS;
tables state;
run;

Python

  1. Importing the CSV
  • To import a CSV file, we can use Pandas’ read_csv() method.

2. Sorted Frequencies by Labels

  • To obtain frequency distribution sorted by labels, we can use Pandas library’s value_counts() method. This returns a pandas series object containing counts of unique values.
  • We can sort the data based on labels using sort_index() method.

3. Resultant Series to a DataFrame

  • To convert the resulting series object to a DataFrame, we can use the DataFrame() method.
  • To obtain percentages for each value, we can divide the data values by their sum and multiply by 100. We can use the sum() and cumsum() functions to get the sum and cumulative sums of the variables.
  • To match the SAS output, we can round the percentages up to two decimal places.
# Import Pandas
import pandas as pd

# Import CSV
data = pd.read_csv("../input/us-election-2020/governors_county.csv");

# Frequencies in Actual Order
datax = data['state'].value_counts().sort_index()

# Create a dataframe
datay = pd.DataFrame({
'state': datax.index,
'Frequency': datax.values,
'Percent': ((datax.values/datax.values.sum())*100).round(2),
'Cumulative Frequency': datax.values.cumsum(),
'Cumulative Percent': ((datax.values.cumsum()/datax.values.sum())*100)\
.round(2)
})

Output

Note: The Index can be dropped while exporting the DF to excel/CSV.

1.2 Multiple Variables

SAS

Just add another variable in the tables. order=freq is used to sort the distribution in descending order.

proc freq data=Gov_C_SAS order=freq;
tables GENRE DISTRIBUTOR;
run;

Python

The multi_freq function takes a list of variables and returns a list of frequency tables for each variable using the value_counts() method of the pandas DataFrame. The tables are formatted using the tabulate library, which converts the tables to PostgreSQL-style ASCII art.

The freq_tables variable is assigned the result of calling the multi_freq function with the list ['GENRE', 'DISTRIBUTOR'].

The _ variable is used to capture the None value returned by the print function when it is called for each table in freq_tables. This prevents the None value from being displayed on the console.

from tabulate import tabulate

def multi_freq(variables):
tables = []
for var in variables:
table = df[var].value_counts().reset_index().\
rename(columns={'index': var, var: 'count'})
table_str = tabulate(table, headers='keys', tablefmt='psql')
tables.append(table_str)
return tables

freq_tables = multi_freq(['GENRE', 'DISTRIBUTOR'])
_ = [print(table) for table in freq_tables]

1.3 Number of Distinct Values

proc freq data=highest_grossers nlevels;
tables GENRE;
run;

To obtain the number of distinct values in a variable using Python, there are built-in functions such as nunique() and value_counts(). However, if you want to obtain an output similar to SAS, you can use the below custom function.

The following code creates a function called freq_table() that takes as input the pandas dataframe and the variable of interest. The function then calculates the number of unique values, the frequency distribution of distinct categories, and the number of missing levels. Finally, it outputs a table that summarizes the results.

def freq_table(df, var):
unique = df[var].nunique()
value_counts = df[var].value_counts(dropna=False).count()
missing_levels = value_counts - unique

table = pd.DataFrame({
'Variable': [var],
'Levels': [value_counts],
'Missing Levels': [missing_levels],
'Nonmissing Levels': [unique]
})

print('Number of Variable Levels')

return table

freq_table(df, 'GENRE')

2. Include the missing values

SAS: By default, missing values are dropped, use the missing option to include them as a group.

Python: By default, the missing values are dropped, to keep missing values in the frequency table, add the dropna parameter and set it to False.

3. Sorting Frequencies in Descending Order

SAS: By default, there is no order, we can specify the option order=freq to make it descending.

Proc freq with descending order
SAS Proc Freq in Descending Order

Python: Just drop the sort_index() method.

Python Equivalent of SAS Proc Freq in Descending Order

4. Remove Unwanted Statistics

SAS: Specify nopercent and nocum options for not printing the percentage and cumulative frequency and percentages, respectively.

SAS Proc Freq with Only Frequency

Python: Just drop the last two columns while converting to dataframe.

Python Equivalent of SAS Proc Freq with Only Frequency

5. Creating a Frequency Cross Tabulation

5.1 Tabular Reports

SAS: var1*var2 and dropping additional details to keep it simple.

The below code snippet calculates frequency tables for the variables county and state in the dataset Gov_C_SAS. The options norow, nocol, and nopercent are used to suppress the display of row and column percentages.

Cross Tab of Two Vars in SAS

Python: Using pandas crosstab() method.

  • Specifying margins equal to true for adding row and column totals.
  • By default, the column name for these subtotals is “ALL”, we will change it to “Total” using the margins_name method to match the SAS output.
Python Cross Tab of Two Vars Equivalent To SAS

5.2 List Reports

The below SAS code generates a frequency table for the variables DISTRIBUTOR and GENRE in the dataset h_grosser using the proc freq procedure.

The list option in the ‘tables’ statement tells SAS to generate the frequency table in list format.

proc freq data=h_grosser;
tables DISTRIBUTOR * GENRE / list nocum missing;
run;

The below python code generates a similar frequency table for the variables DISTRIBUTOR and GENRE in a DataFrame called ‘df’.

# Group data by 'DISTRIBUTOR' and 'GENRE'
grouped_df = df.groupby(['DISTRIBUTOR', 'GENRE'], dropna=False)

# Calculate frequency and percentage for each group
result = grouped_df.size().reset_index(name='Frequency')
result['Percent'] = (result['Frequency'] / result['Frequency'].sum() * 100)\
.round(2)

print(result)
List Report — SAS vs Python
List Report — SAS vs Python

That’s all guys for SAS Freq in Python, the one thing we can notice here is that a simple proc freq with more details is simpler in SAS as compared to python. As we go deeper, the Pandas library does most of the work for us and we need to utilise the ready-made methods.

Let me know your thoughts in the comment section.

--

--

Pradeep Singh
Eduonline 24

MLOps Engineer @ Genpact / psrajput.com / Running (10k in 59.12, 5k in 26.15) / Cricket / Trekking / Chess