# Can we get SAS Proc Freq with Python?

## This article will explore the SAS Proc Freq procedure and how we can achieve similar results using Python libraries.

# Introduction to Proc Freq

The frequency distribution of categorical variables is an essential part of descriptive statistics. SAS offers the freq procedure to easily achieve these stats. However, there are two reasons why one should know how to get frequency distribution in Python:

- Python is widely used in data science and offers a range of statistical libraries.
- Migrating code from SAS to Python requires learning how to perform frequency distribution using Python libraries.
- Python is a popular choice for cloud-based data processing and analysis, and knowing how to perform frequency distribution in Python can help teams leverage cloud-based analytics solutions.

## Applications/Resources Used:

- SAS OnDemand (earlier University Edition) / — For writing SAS Code
- Kaggle — For writing python code
- Pandas — For dataframe operations

# 1. Basics

While SAS code can be concise, Python’s Pandas library offers simple and powerful methods to perform similar tasks

## 1.1 Single Var

The SAS `PROC FREQ`

procedure is used to obtain the frequency distribution of categorical variables.

**SAS**

- The
`TABLES`

statement specifies the variable for which the frequency distribution is to be obtained. The`DATA`

statement specifies the input data set.

`/* Import the CSV */`

FILENAME Gov_C "/folders/myfolders/Assignments/governors_county.csv";

PROC IMPORT DATAFILE=Gov_C DBMS=CSV OUT=WORK.Gov_C_SAS;

GETNAMES=YES;

RUN;

/* freq procedure */

proc freq data=Gov_C_SAS;

tables state;

run;

**Python**

*Importing the CSV*

- To import a CSV file, we can use Pandas’

method.*read_csv()*

*2. Sorted Frequencies by Labels*

- To obtain frequency distribution sorted by labels, we can use Pandas library’s

method. This returns a pandas series object containing counts of unique values.*value_counts()* - We can sort the data based on labels using

method.*sort_index()*

*3. Resultant Series to a DataFrame*

- To convert the resulting series object to a DataFrame, we can use the

method.*DataFrame()* - To obtain percentages for each value, we can divide the data values by their sum and multiply by 100. We can use the

and*sum()*

functions to get the sum and cumulative sums of the variables.*cumsum()* - To match the SAS output, we can round the percentages up to two decimal places.

`# Import Pandas`

import pandas as pd

# Import CSV

data = pd.read_csv("../input/us-election-2020/governors_county.csv");

# Frequencies in Actual Order

datax = data['state'].value_counts().sort_index()

# Create a dataframe

datay = pd.DataFrame({

'state': datax.index,

'Frequency': datax.values,

'Percent': ((datax.values/datax.values.sum())*100).round(2),

'Cumulative Frequency': datax.values.cumsum(),

'Cumulative Percent': ((datax.values.cumsum()/datax.values.sum())*100)\

.round(2)

})

**Output**

Note: The Index can be dropped while exporting the DF to excel/CSV.

## 1.2 Multiple Variables

**SAS**

Just add another variable in the tables. `order=freq`

is used to sort the distribution in descending order.

`proc freq data=Gov_C_SAS order=freq;`

tables GENRE DISTRIBUTOR;

run;

**Python**

The `multi_freq`

function takes a list of variables and returns a list of frequency tables for each variable using the `value_counts()`

method of the pandas DataFrame. The tables are formatted using the `tabulate`

library, which converts the tables to PostgreSQL-style ASCII art.

The `freq_tables`

variable is assigned the result of calling the `multi_freq`

function with the list `['GENRE', 'DISTRIBUTOR']`

.

The `_`

variable is used to capture the `None`

value returned by the `print`

function when it is called for each table in `freq_tables`

. This prevents the `None`

value from being displayed on the console.

`from tabulate import tabulate`

def multi_freq(variables):

tables = []

for var in variables:

table = df[var].value_counts().reset_index().\

rename(columns={'index': var, var: 'count'})

table_str = tabulate(table, headers='keys', tablefmt='psql')

tables.append(table_str)

return tables

freq_tables = multi_freq(['GENRE', 'DISTRIBUTOR'])

_ = [print(table) for table in freq_tables]

## 1.3 Number of Distinct Values

`proc freq data=highest_grossers nlevels;`

tables GENRE;

run;

To obtain the number of distinct values in a variable using Python, there are built-in functions such as `nunique()`

and `value_counts()`

. However, if you want to obtain an output similar to SAS, you can use the below custom function.

The following code creates a function called *freq_table()* that takes as input the pandas dataframe and the variable of interest. The function then calculates the number of unique values, the frequency distribution of distinct categories, and the number of missing levels. Finally, it outputs a table that summarizes the results.

`def freq_table(df, var):`

unique = df[var].nunique()

value_counts = df[var].value_counts(dropna=False).count()

missing_levels = value_counts - unique

table = pd.DataFrame({

'Variable': [var],

'Levels': [value_counts],

'Missing Levels': [missing_levels],

'Nonmissing Levels': [unique]

})

print('Number of Variable Levels')

return table

freq_table(df, 'GENRE')

# 2. Include the missing values

**SAS**: By default, missing values are dropped, use the `missing`

option to include them as a group.

**Python**: By default, the missing values are dropped, to keep missing values in the frequency table, add the `dropna`

parameter and set it to False.

# 3. Sorting Frequencies in Descending Order

**SAS: **By default, there is no order, we can specify the option *order=freq* to make it descending.

**Python: **Just drop the *sort_index()* method.

# 4. Remove Unwanted Statistics

**SAS: **Specify

and *nopercent*

options for not printing the percentage and cumulative frequency and percentages, respectively.*nocum*

**Python: **Just drop the last two columns while converting to dataframe.

# 5. Creating a Frequency Cross Tabulation

## 5.1 Tabular Reports

**SAS**: var1*var2 and dropping additional details to keep it simple.

The below code snippet calculates frequency tables for the variables `county`

and `state`

in the dataset `Gov_C_SAS`

. The options `norow`

, `nocol`

, and `nopercent`

are used to suppress the display of row and column percentages.

**Python: **Using pandas *crosstab()* method.

- Specifying margins equal to true for adding row and column totals.
- By default, the column name for these subtotals is “
*ALL*”, we will change it to “*Total*” using the*margins_name*method to match the SAS output.

## 5.2 List Reports

The below SAS code generates a frequency table for the variables `DISTRIBUTOR`

and `GENRE`

in the dataset `h_grosser`

using the proc freq procedure.

The

option in the ‘tables’ statement tells SAS to generate the frequency table in list format.**list**

`proc freq data=h_grosser;`

tables DISTRIBUTOR * GENRE / list nocum missing;

run;

The below python code generates a similar frequency table for the variables `DISTRIBUTOR`

and `GENRE`

in a DataFrame called ‘df’.

`# Group data by 'DISTRIBUTOR' and 'GENRE'`

grouped_df = df.groupby(['DISTRIBUTOR', 'GENRE'], dropna=False)

# Calculate frequency and percentage for each group

result = grouped_df.size().reset_index(name='Frequency')

result['Percent'] = (result['Frequency'] / result['Frequency'].sum() * 100)\

.round(2)

print(result)

That’s all guys for SAS Freq in Python, the one thing we can notice here is that a simple proc freq with more details is simpler in SAS as compared to python. As we go deeper, the Pandas library does most of the work for us and we need to utilise the ready-made methods.

Let me know your thoughts in the comment section.