Pandas — One Hot Encoding (OHE)
Pandas Dataframe Examples: AI Secrets— #PySeries#Episode 22
Hi, this post deals with making categorical data numerical in a dataset so that machine learning algorithms can be applied to it. (Colab File link:)
In machine learning one-hot encoding is a frequently used method to deal with categorical data.
Because many machine learning models need their input variables to be numeric, categorical variables need to be transformed in the pre-processing step. (Wikipedia)
Pandas has a function that turns a categorical variable into a set of columns of zeros and ones, one column per category, which makes the variable a lot easier to quantify and compare.
That is:
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
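To make the signature above concrete, here is a minimal sketch on a toy Series (the values are made up and are not part of the salary dataset). Passing dtype=int is an optional choice made here to get 0/1 columns, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy data for illustration only (not from the salary dataset)
colors = pd.Series(["red", "green", "red", "blue"], name="color")

# One indicator column per category, named after the category;
# dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(colors, dtype=int)
print(encoded)
```

Each row ends up with exactly one 1, in the column that matches its original category.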
Preparing your dataset:
Download this dataset for this post:
import pandas as pd
# The file is whitespace-delimited, so split on runs of whitespace
df = pd.read_table("https://data.princeton.edu/wws509/datasets/salary.dat", sep=r"\s+")
df.head()
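If the URL is not reachable, the same parsing can be tried offline. The sketch below assumes a file laid out like salary.dat (a header row, then whitespace-separated fields); the two data rows are invented for illustration and are not real observations.

```python
import io
import pandas as pd

# Hypothetical rows in the same whitespace-delimited layout (values are made up)
sample = io.StringIO(
    "sx rk yr dg yd sl\n"
    "male full 10 doctorate 15 30000\n"
    "female assistant 2 masters 3 16000\n"
)

# sep=r"\s+" splits on any run of spaces or tabs
toy = pd.read_table(sample, sep=r"\s+")
print(toy.dtypes)
```

The text columns come back as strings and the numeric ones as integers, which is why sx, rk and dg still need encoding.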
Get the first impressions of the data we are going to work on … the university professor salary dataset:
Discrimination in Salaries
These are the salary data used in Weisberg’s book, consisting of observations on six variables for 52 tenure-track professors in a small college. The 6 variables are:
sx = Sex, coded 1 for female and 0 for male
rk = Rank, coded
1 for assistant professor,
2 for associate professor, and
3 for full professor
yr = Number of years in current rank
dg = Highest degree, coded 1 if doctorate, 0 if masters
yd = Number of years since highest degree was earned
sl = Academic year salary, in dollars.
The file is available in the usual plain text formats as salary.dat using character codes and salary.raw using numeric codes, and in Stata format as salary.dta. Here’s the link to the file: salary.dat
Let’s get down to practice!
01#PyEx — Python —One Hot Encoding (OHE) — Transforms categories into Numbers — Sex
In order to know all the options of a categorical data set, let’s use Pandas’ unique method, first in the sx column:
df['sx'].unique()
array(['male', 'female'], dtype=object)
Let’s do OHE, first in the sex column:
# Turn the column into a dummy variable
dummy1 = pd.get_dummies(df['sx'], drop_first=True)
# Take a look
dummy1.head()
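By dropping the first column we did not lose any information, right? With only two categories, a 0 in the male column can only mean female, so the dropped female column was redundant.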
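A quick way to convince yourself is to rebuild the dropped column from the one that was kept. This is a small sketch on made-up values, not on the salary data itself:

```python
import pandas as pd

sx = pd.Series(["male", "female", "female", "male"], name="sx")  # toy values

full = pd.get_dummies(sx, dtype=int)                      # columns: female, male
reduced = pd.get_dummies(sx, drop_first=True, dtype=int)  # column: male only

# The dropped column is fully determined by the remaining one
recovered_female = 1 - reduced["male"]
print((recovered_female == full["female"]).all())
```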
Now we need to use either Pandas’ merge or concat function to attach the dummy column to our DataFrame so we can work with it properly. We will inject it as a column (axis=1):
df = pd.concat([df, dummy1], axis=1).drop('sx', axis=1)
df.head()
Fine, now let’s rearrange the columns with a Python slicing trick so that the male column appears on the left (it’s up to you to decide which arrangement is best):
cols = df.columns.tolist()
# Indexing & Slicing Techniques: move the last column name to the front
cols = cols[-1:] + cols[:-1]
cols
['male', 'rk', 'yr', 'dg', 'yd', 'sl']
Now put the table into this custom column order:
df = df[cols]
df.head()
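The list-slicing approach above is one option. As an alternative sketch (not what this post uses), pandas can also move a column in place with DataFrame.pop and DataFrame.insert. The one-row frame here is made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({"rk": ["full"], "yr": [10], "male": [1]})  # made-up row

# pop removes the column and returns it; insert puts it back at position 0
toy.insert(0, "male", toy.pop("male"))
print(toy.columns.tolist())
```

This avoids building the full list of column names, which can be handy when only one column needs to move.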
02#PyEx — Python — One Hot Encoding (OHE) — Transforms categories into Numbers — Rank
Now let’s deal with rank (rk):
df['rk'].unique()
array(['full', 'associate', 'assistant'], dtype=object)
Creating another dummy set:
# Turn the column into dummy variables
dummy2 = pd.get_dummies(df['rk'])
# Take a look
dummy2.head()
Let’s concatenate this second dummy set:
df = pd.concat([df, dummy2], axis=1).drop('rk', axis=1)
df.head()
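With three categories the new columns are named assistant, associate and full, which can be hard to trace back to rk once they sit in a wide table. As an optional sketch, the prefix parameter from the signature shown earlier labels them; the values below are toy data:

```python
import pandas as pd

rk = pd.Series(["full", "associate", "assistant", "full"], name="rk")  # toy values

# prefix (joined with prefix_sep, '_' by default) is prepended to each category name
dummies = pd.get_dummies(rk, prefix="rk", dtype=int)
print(dummies.columns.tolist())
```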
03#PyEx — Python — One Hot Encoding (OHE) — Transforms categories into Numbers — Degree
Now the last one (dg = Highest degree, coded 1 if doctorate, 0 if masters):
df['dg'].unique()
array(['doctorate', 'masters'], dtype=object)
Let’s see:
# Turn the column into dummy variables
dummy3 = pd.get_dummies(df['dg'])
# Take a look
dummy3.head()
# Let’s simplify it once more: doctorate alone carries the information
dummy3 = dummy3.drop('masters', axis=1)
dummy3.head()
We are almost there; concatenating now:
# Concatenating now...
# And finally let's go learn machine learning :)
df = pd.concat([df, dummy3], axis=1).drop('dg', axis=1)
df.head()
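The three encode, concat and drop rounds above can also be collapsed into a single call using the columns parameter of get_dummies. This is an alternative sketch on a made-up frame, not the exact procedure followed in this post: it keeps every category column (including female and masters) and prefixes each one with its source column name.

```python
import pandas as pd

# Made-up rows using the same column names as the salary data
toy = pd.DataFrame({
    "sx": ["male", "female", "female"],
    "rk": ["full", "assistant", "associate"],
    "dg": ["doctorate", "masters", "doctorate"],
    "sl": [30000, 16000, 22000],
})

# Encode the listed columns and drop the originals in one step
encoded = pd.get_dummies(toy, columns=["sx", "rk", "dg"], dtype=int)
print(encoded.columns.tolist())
```

Adding drop_first=True would drop the first category of every listed column, rank included, so the result would no longer match the table built step by step above.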
Now the dataset is ready for an AI algorithm :)
That’s All for this lecture!
See you in the next Python Episode!
Bye!!!!
Colab File link:)
Google Drive link:)
Credits & References
Datasets by data.princeton.edu
One-Hot Encoding in Python — Implementation using Sklearn by morioh.com
The Dummy’s Guide to Creating Dummy Variables by Rowan Langford
Posts Related:
00Episode#PySeries — Python — Jupiter Notebook Quick Start with VSCode — How to Set your Win10 Environment to use Jupiter Notebook
01Episode#PySeries — Python — Python 4 Engineers — Exercises! An overview of the Opportunities Offered by Python in Engineering!
02Episode#PySeries — Python — Geogebra Plus Linear Programming- We’ll Create a Geogebra program to help us with our linear programming
03Episode#PySeries — Python — Python 4 Engineers — More Exercises! — Another Round to Make Sure that Python is Really Amazing!
04Episode#PySeries — Python — Linear Regressions — The Basics — How to Understand Linear Regression Once and For All!
05Episode#PySeries — Python — NumPy Init & Python Review — A Crash Python Review & Initialization at NumPy lib.
06Episode#PySeries — Python — NumPy Arrays & Jupyter Notebook — Arithmetic Operations, Indexing & Slicing, and Conditional Selection w/ np arrays.
07Episode#PySeries — Python — Pandas — Intro & Series — What it is? How to use it?
08Episode#PySeries — Python — Pandas DataFrames — The primary Pandas data structure! It is a dict-like container for Series objects
09Episode#PySeries — Python — Python 4 Engineers — Even More Exercises! — More Practicing Coding Questions in Python!
10Episode#PySeries — Python — Pandas — Hierarchical Index & Cross-section — Open your Colab notebook and here are the follow-up exercises!
11Episode#PySeries — Python — Pandas — Missing Data — Let’s Continue the Python Exercises — Filling & Dropping Missing Data
12Episode#PySeries — Python — Pandas — Group By — Grouping large amounts of data and compute operations on these groups
13Episode#PySeries — Python — Pandas — Merging, Joining & Concatenations — Facilities For Easily Combining Together Series or DataFrame
14Episode#PySeries — Python — Pandas — Pandas Data frame Examples: Column Operations
15Episode#PySeries — Python — Python 4 Engineers — Keeping It In The Short-Term Memory — Test Yourself! Coding in Python, Again!
16Episode#PySeries — NumPy — NumPy Review, Again;) — Python Review Free Exercises
17Episode#PySeries — Generators in Python — Python Review Free Hints
18Episode#PySeries — Pandas Review…Again;) — Python Review Free Exercise
19Episode#PySeries — MatlibPlot & Seaborn Python Libs — Reviewing theses Plotting & Statistics Packs
20Episode#PySeries — Seaborn Python Review — Reviewing theses Plotting & Statistics Packs
21Episode#PySeries — Pandas — One Hot Encoding (OHE) — Pandas Dataframe Examples: AI Secrets (this one)
Only love and art make existence feasible.
Available at: <https://dicionariocriativo.com.br/citacoes/leniente/citacoes/toler%C3%A2ncia>. Accessed on: 03/30/2021.