5 Categorical Feature Encoding Techniques in SAS (Classic Encoder)

By Suraj Saini (Amar)
Published in Analytics Vidhya
4 min read · Feb 18, 2021

What is Categorical Feature Encoding?

Categorical variables are usually stored as strings drawn from a limited set of values. Categorical feature encoding is the process of converting those values into a numeric format that machine learning models can work with.

The performance of machine learning models depends on several factors, and one of them is how the data is processed before it is fed to the model. Encoding is a crucial step because it turns categorical variables into numbers that the model can use. A well-chosen encoding can improve model quality and is a core part of feature engineering.

In this blog, we explore the classic encoding methods, along with a SAS macro showing how each one works.

1. Label Encoding

Label Encoding assigns an integer from 1 to N to each of the N classes of a categorical feature. For instance, if there is a variable “Hair colour” with values of Black, Brown, and Red, Label Encoding will replace these values with 1, 2, and 3. However, one problem with Label Encoding is that the integers imply an order (Black < Brown < Red) that does not exist between the class levels. Machine learning algorithms may still treat the codes as ordered quantities, which can lead to inaccurate results.

SAS Macro for Label Encoding

Here is an example macro to perform Label Encoding in SAS:

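%macro label_encode(dataset,var);
  /* Store each distinct level in val1, val2, ... and the level count in mx.
     TRIMMED (SAS 9.3 and later) strips the leading blanks from the count. */
  proc sql noprint;
    select distinct &var
      into :val1-
      from &dataset;
    select count(distinct &var)
      into :mx trimmed
      from &dataset;
  quit;
  /* Rows whose value matches the i-th level get the integer i.
     &&val&i resolves to &val1, &val2, ... and then to the level itself. */
  data new;
    set &dataset;
    %do i=1 %to &mx;
      if &var="&&val&i" then new=&i;
    %end;
  run;
%mend;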
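
For readers who want to check the macro's output outside SAS, the same idea can be sketched in a few lines of Python. This is an illustration rather than a translation of the macro: the function name `label_encode` and the sample values are made up here, and it assumes the levels are numbered in sorted order.

```python
def label_encode(values):
    """Map each distinct value to an integer 1..N.

    Assumption: levels are numbered in sorted order.
    """
    levels = sorted(set(values))
    mapping = {level: i for i, level in enumerate(levels, start=1)}
    return [mapping[v] for v in values], mapping


hair = ["Brown", "Black", "Red", "Black"]
codes, mapping = label_encode(hair)
print(mapping)  # {'Black': 1, 'Brown': 2, 'Red': 3}
print(codes)    # [2, 1, 3, 1]
```

The codes 1, 2, 3 carry no real ordering between the colours, which is the weakness described above.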

2. Binary Encoding

Binary Encoding converts class values into numeric values as Label Encoding does. However, Binary Encoding takes it a step further and converts the numeric values into binary numbers where each digit will have its own separate column.

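“If there are n unique categories, then binary encoding results in only log (base 2) n features”. In practice the number of columns is a whole number: with labels running from 1 to n, floor(log2(n)) + 1 binary digits are needed.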

For more information, visit here.

SAS Macro for Binary Encoding

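%macro binary_encoding(dataset,var);
  /* Store each distinct level in val1, val2, ... and the level count in mx */
  proc sql noprint;
    select distinct &var
      into :val1-
      from &dataset;
    select count(distinct &var)
      into :mx trimmed
      from &dataset;
  quit;
  data new;
    set &dataset;
    /* Label-encode first: the i-th distinct level gets the integer i */
    %do i=1 %to &mx;
      if &var="&&val&i" then new=&i;
    %end;
    /* Display the integer as binary digits (BINARYw. defaults to 8 digits) */
    format new binary.;
  run;
%mend;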

This macro creates a single variable that is displayed with a binary format; the stored value is still the integer label. To split the binary digits into separate columns, you could create a Split Column Macro.

SAS Macro for Splitting Column

Here is an example macro for splitting columns in SAS:

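%macro split_column(data,var);
  /* Convert the numeric label in &var to its binary string
     (BINARYw. defaults to 8 digits) */
  data try;
    set &data;
    cha=put(&var, binary.);
  run;
  /* The string length is the number of digit columns to create */
  proc sql noprint;
    select max(length(cha))
      into :ln trimmed
      from try;
  quit;
  /* Write digit i of the string to column c_i.
     Note that this overwrites the input dataset &data. */
  data &data;
    set try;
    %do i=1 %to &ln;
      c_&i=substr(cha,&i,1);
    %end;
  run;
%mend;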
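
The two steps above (label first, then split the binary digits into columns) can be sketched in Python as a quick sanity check. The function name `binary_encode` and the sample data are hypothetical. Unlike the SAS macros, which use the fixed 8-digit width of the binary format, this sketch assumes only as many digits as the largest label needs.

```python
def binary_encode(values):
    """Label-encode (1..N, sorted order assumed), then split the binary digits."""
    levels = sorted(set(values))
    mapping = {level: i for i, level in enumerate(levels, start=1)}
    # Number of binary digits needed for the largest label
    width = max(mapping.values()).bit_length()
    rows = [[int(bit) for bit in format(mapping[v], f"0{width}b")] for v in values]
    return rows, width


colours = ["Black", "Brown", "Red", "White", "Grey"]
rows, width = binary_encode(colours)
print(width)  # 3 columns for 5 categories
for colour, row in zip(colours, rows):
    print(colour, row)
```

With five categories, three columns are enough, compared with five columns under One-Hot Encoding.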

3. One-Hot Encoding

One-Hot Encoding converts a categorical variable into a set of 0/1 columns, one per class level: a row gets 1 in the column for its own level and 0 in all the others. These indicator columns can then be fed into machine learning, deep learning, and statistical algorithms without implying any order between the levels.

SAS Macro for One-Hot Encoding

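%macro hot_encoding(data,var);
  /* Store each distinct level in val1, val2, ... and the level count in len */
  proc sql noprint;
    select distinct &var
      into :val1-
      from &data;
    select count(distinct &var)
      into :len trimmed
      from &data;
  quit;
  /* Create one 0/1 column per level. The column name is the level with
     '$', blanks, '-' and '/' removed, so it must form a valid SAS name. */
  data encoded_data;
    set &data;
    %do i=1 %to &len;
      if &var="&&val&i" then %sysfunc(compress(&&val&i,'$ - /'))=1;
      else %sysfunc(compress(&&val&i,'$ - /'))=0;
    %end;
  run;
%mend;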
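
As a rough cross-check of the output shape, here is a minimal Python sketch of One-Hot Encoding. The function name `one_hot_encode` and the sample values are assumptions made for illustration, not part of the SAS code.

```python
def one_hot_encode(values):
    """Return one {level: 0/1} dict per row, plus the list of levels."""
    levels = sorted(set(values))
    rows = [{level: int(v == level) for level in levels} for v in values]
    return rows, levels


hair = ["Brown", "Black", "Red"]
rows, levels = one_hot_encode(hair)
print(levels)   # ['Black', 'Brown', 'Red']
print(rows[0])  # {'Black': 0, 'Brown': 1, 'Red': 0}
```

Each row contains exactly one 1, in the column that matches its own level.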

4. Count/Frequency Encoding

As the name suggests, Frequency Encoding counts how many rows fall into each class value, then divides that count by the total number of rows. Each category is replaced by its relative frequency, so the model can weight a category directly or inversely according to how common it is.

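SAS Macro for Count/Frequency Encoding

%macro frequency_encoding(dataset, var);
  proc sql noprint;
    /* Count the rows in each level of &var */
    create table freq as
      select &var as values, count(&var) as number
      from &dataset
      group by &var;
    /* Divide each row's level count by the total number of non-missing rows.
       count(&var) without GROUP BY is remerged with the detail rows. */
    create table new as
      select *, round(freq.number/count(&var), 0.01) as freq_encode
      from &dataset left join freq
      on &var=freq.values;
  quit;
  /* Remove the helper columns brought in by the join */
  data new(drop=values number);
    set new;
  run;
%mend;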
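
The arithmetic is simple enough to verify by hand. The Python sketch below (hypothetical function name and sample data, and assuming there are no missing values) computes the same ratio, rounded to two decimals as in the macro.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each value with its share of the rows, rounded to 2 decimals."""
    counts = Counter(values)
    total = len(values)  # assumption: no missing values in the input
    return [round(counts[v] / total, 2) for v in values]


hair = ["Black", "Black", "Brown", "Red"]
encoded = frequency_encode(hair)
print(encoded)  # [0.5, 0.5, 0.25, 0.25]
```

Note that two categories with the same frequency (Brown and Red here) receive the same encoded value, so the model cannot tell them apart from this feature alone.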

5. Effect/Sum/Deviation Encoding

This technique goes by several names: some analysts call it Effect Encoding, some Deviation Encoding, and some Sum Encoding, but the definition is the same. It starts from One-Hot Encoding, with one difference: one level is treated as the reference and its own column is dropped, and the rows of that level, which would otherwise be 0 in all the remaining columns, are set to -1 in those columns. For example:

[Image: One-Hot Encoding table]

[Image: Effect/Sum/Deviation Encoding table]

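SAS Macro for Effect/Sum/Deviation Encoding

%macro sum_encoding(data,var);
  /* Store each distinct level in val1, val2, ... and the level count in len.
     TRIMMED is needed so that &&val&len resolves to the last level. */
  proc sql noprint;
    select distinct &var
      into :val1-
      from &data;
    select count(distinct &var)
      into :len trimmed
      from &data;
  quit;
  /* Step 1: one-hot encode, one 0/1 column per level */
  data encoded_data;
    set &data;
    %do i=1 %to &len;
      if &var="&&val&i" then %sysfunc(compress(&&val&i,'$ - /'))=1;
      else %sysfunc(compress(&&val&i,'$ - /'))=0;
    %end;
  run;
  /* Step 2: the last level is the reference. Its rows get -1 in every
     other column, and its own column is dropped. */
  data sum_encode;
    set encoded_data;
    if %sysfunc(compress(&&val&len,'$ - /'))=1 then do;
      %do x=1 %to %eval(&len-1);
        %sysfunc(compress(&&val&x,'$ - /'))=-1;
      %end;
    end;
    drop %sysfunc(compress(&&val&len,'$ - /'));
  run;
%mend;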
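
To make the difference from One-Hot Encoding concrete, here is a small Python sketch of the same logic. The function name `effect_encode` and the sample values are made up for illustration, and it assumes, like the macro, that the last level in sorted order is the reference.

```python
def effect_encode(values):
    """Effect (sum/deviation) coding with the last sorted level as reference."""
    levels = sorted(set(values))
    reference = levels[-1]   # assumption: last level is the reference
    kept = levels[:-1]       # the reference level gets no column of its own
    rows = []
    for v in values:
        if v == reference:
            # Reference rows are -1 in every remaining column
            rows.append({level: -1 for level in kept})
        else:
            rows.append({level: int(v == level) for level in kept})
    return rows, kept


hair = ["Black", "Brown", "Red"]
rows, kept = effect_encode(hair)
for value, row in zip(hair, rows):
    print(value, row)
# Black {'Black': 1, 'Brown': 0}
# Brown {'Black': 0, 'Brown': 1}
# Red {'Black': -1, 'Brown': -1}
```

Three levels produce two columns, and the reference level (Red) is identified by the row of -1 values.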

Wrapping Up

It is often said that data scientists spend 70–80% of their time cleaning and preparing data, and encoding categorical data is a significant part of that work. Choosing the right encoding technique affects the quality of the data the model sees, which is why it is important to understand how the different encoding methods behave.

For more detail on the SAS macro definitions, you can check out the code on my GitHub page here.

Originally published at https://seleritysas.com on February 18, 2021.

Suraj Saini (Amar) is a SAS Certified Programming Specialist, passionate about Machine Learning, Feature Engineering and Data Science.