--

# Introduction to Proc Freq

The frequency distribution of categorical variables is an essential part of descriptive statistics. SAS offers the freq procedure to easily achieve these stats. However, there are two reasons why one should know how to get frequency distribution in Python:

• Python is widely used in data science and offers a range of statistical libraries.
• Migrating code from SAS to Python requires learning how to perform frequency distribution using Python libraries.
• Python is a popular choice for cloud-based data processing and analysis, and knowing how to perform frequency distribution in Python can help teams leverage cloud-based analytics solutions.

## Applications/Resources Used:

• SAS OnDemand (earlier University Edition) / — For writing SAS Code
• Kaggle — For writing python code
• Pandas — For dataframe operations

# 1. Basics

While SAS code can be concise, Python’s Pandas library offers simple and powerful methods to perform similar tasks

## 1.1 Single Var

The SAS `PROC FREQ` procedure is used to obtain the frequency distribution of categorical variables.

SAS

• The `TABLES` statement specifies the variable for which the frequency distribution is to be obtained. The `DATA` statement specifies the input data set.
`/* Import the CSV */FILENAME Gov_C "/folders/myfolders/Assignments/governors_county.csv";PROC IMPORT DATAFILE=Gov_C DBMS=CSV OUT=WORK.Gov_C_SAS; GETNAMES=YES;RUN;/* freq procedure */proc freq data=Gov_C_SAS; tables state;run;`

Python

1. Importing the CSV
• To import a CSV file, we can use Pandas’ `read_csv()` method.

2. Sorted Frequencies by Labels

• To obtain frequency distribution sorted by labels, we can use Pandas library’s `value_counts()` method. This returns a pandas series object containing counts of unique values.
• We can sort the data based on labels using `sort_index()` method.

3. Resultant Series to a DataFrame

• To convert the resulting series object to a DataFrame, we can use the `DataFrame()` method.
• To obtain percentages for each value, we can divide the data values by their sum and multiply by 100. We can use the `sum()` and `cumsum()` functions to get the sum and cumulative sums of the variables.
• To match the SAS output, we can round the percentages up to two decimal places.
`# Import Pandasimport pandas as pd# Import CSVdata = pd.read_csv("../input/us-election-2020/governors_county.csv");# Frequencies in Actual Orderdatax = data['state'].value_counts().sort_index()# Create a dataframedatay = pd.DataFrame({  'state': datax.index,  'Frequency': datax.values,  'Percent': ((datax.values/datax.values.sum())*100).round(2),  'Cumulative Frequency': datax.values.cumsum(),  'Cumulative Percent': ((datax.values.cumsum()/datax.values.sum())*100)\.round(2)})`

Output

Note: The Index can be dropped while exporting the DF to excel/CSV.

## 1.2 Multiple Variables

SAS

Just add another variable in the tables. `order=freq` is used to sort the distribution in descending order.

`proc freq data=Gov_C_SAS order=freq; tables GENRE DISTRIBUTOR;run;`

Python

The `multi_freq` function takes a list of variables and returns a list of frequency tables for each variable using the `value_counts()` method of the pandas DataFrame. The tables are formatted using the `tabulate` library, which converts the tables to PostgreSQL-style ASCII art.

The `freq_tables` variable is assigned the result of calling the `multi_freq` function with the list `['GENRE', 'DISTRIBUTOR']`.

The `_` variable is used to capture the `None` value returned by the `print` function when it is called for each table in `freq_tables`. This prevents the `None` value from being displayed on the console.

`from tabulate import tabulatedef multi_freq(variables):    tables = []    for var in variables:        table = df[var].value_counts().reset_index().\                rename(columns={'index': var, var: 'count'})        table_str = tabulate(table, headers='keys', tablefmt='psql')        tables.append(table_str)    return tablesfreq_tables = multi_freq(['GENRE', 'DISTRIBUTOR'])_ = [print(table) for table in freq_tables]`

## 1.3 Number of Distinct Values

`proc freq data=highest_grossers nlevels;tables GENRE;run;`

To obtain the number of distinct values in a variable using Python, there are built-in functions such as `nunique()` and `value_counts()`. However, if you want to obtain an output similar to SAS, you can use the below custom function.

The following code creates a function called freq_table() that takes as input the pandas dataframe and the variable of interest. The function then calculates the number of unique values, the frequency distribution of distinct categories, and the number of missing levels. Finally, it outputs a table that summarizes the results.

`def freq_table(df, var):    unique = df[var].nunique()    value_counts = df[var].value_counts(dropna=False).count()    missing_levels = value_counts - unique    table = pd.DataFrame({        'Variable': [var],        'Levels': [value_counts],        'Missing Levels': [missing_levels],        'Nonmissing Levels': [unique]    })    print('Number of Variable Levels')    return tablefreq_table(df, 'GENRE')`

# 2. Include the missing values

SAS: By default, missing values are dropped, use the `missing` option to include them as a group.

Python: By default, the missing values are dropped, to keep missing values in the frequency table, add the `dropna` parameter and set it to False.

# 3. Sorting Frequencies in Descending Order

SAS: By default, there is no order, we can specify the option order=freq to make it descending.

Python: Just drop the sort_index() method.

# 4. Remove Unwanted Statistics

SAS: Specify `nopercent` and `nocum` options for not printing the percentage and cumulative frequency and percentages, respectively.

Python: Just drop the last two columns while converting to dataframe.

# 5. Creating a Frequency Cross Tabulation

## 5.1 Tabular Reports

SAS: var1*var2 and dropping additional details to keep it simple.

The below code snippet calculates frequency tables for the variables `county` and `state` in the dataset `Gov_C_SAS`. The options `norow`, `nocol`, and `nopercent` are used to suppress the display of row and column percentages.

Python: Using pandas crosstab() method.

• Specifying margins equal to true for adding row and column totals.
• By default, the column name for these subtotals is “ALL”, we will change it to “Total” using the margins_name method to match the SAS output.

## 5.2 List Reports

The below SAS code generates a frequency table for the variables `DISTRIBUTOR` and `GENRE` in the dataset `h_grosser` using the proc freq procedure.

The `list` option in the ‘tables’ statement tells SAS to generate the frequency table in list format.

`proc freq data=h_grosser;tables DISTRIBUTOR * GENRE / list nocum missing;run;`

The below python code generates a similar frequency table for the variables `DISTRIBUTOR` and `GENRE` in a DataFrame called ‘df’.

`# Group data by 'DISTRIBUTOR' and 'GENRE'grouped_df = df.groupby(['DISTRIBUTOR', 'GENRE'], dropna=False)# Calculate frequency and percentage for each groupresult = grouped_df.size().reset_index(name='Frequency')result['Percent'] = (result['Frequency'] / result['Frequency'].sum() * 100)\                    .round(2)print(result)`

That’s all guys for SAS Freq in Python, the one thing we can notice here is that a simple proc freq with more details is simpler in SAS as compared to python. As we go deeper, the Pandas library does most of the work for us and we need to utilise the ready-made methods.

Let me know your thoughts in the comment section.

--

--

MLOps Engineer @ Genpact / psrajput.com / Running (10k in 59.12, 5k in 26.15) / Cricket / Trekking / Chess