Pandas — One Hot Encoding (OHE)

Pandas Dataframe Examples: AI Secrets— #PySeries#Episode 22

J3
Jungletronics
Mar 29, 2021

Hi, this post deals with making the categorical data in a dataset numerical, so that machine learning algorithms can be applied to it. (Colab File link:)

In machine learning one-hot encoding is a frequently used method to deal with categorical data.

Because many machine learning models need their input variables to be numeric, categorical variables need to be transformed in the pre-processing part. (Wikipedia)

Pandas has a function, get_dummies, which turns a categorical variable into a set of columns of zeros and ones, one column per category, which makes the data a lot easier to quantify and compare.

That is:
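As a minimal sketch of the idea, here is get_dummies applied to a toy column. The `colors` data is hypothetical and is not part of the salary dataset; `dtype=int` is passed because recent pandas versions return True/False by default instead of 1/0.

```python
import pandas as pd

# Hypothetical toy column, used only to illustrate the idea.
colors = pd.Series(["red", "green", "red", "blue"], name="color")

# One new column per category; each row has a 1 in exactly one column.
encoded = pd.get_dummies(colors, dtype=int)
print(encoded)
```

Each row ends up with a single 1 (the "hot" position) in the column that matches its original category.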

Preparing your dataset:

Download this dataset for this post:

Get the first impressions of the data we are going to work on … the university professor salary dataset:

Fig 1. Source: Discrimination in Salaries — salary.dat — From: https://data.princeton.edu/wws509/datasets/#salary

Discrimination in Salaries

These are the salary data used in Weisberg’s book, consisting of observations on six variables for 52 tenure-track professors in a small college. The 6 variables are:

Let’s get down to practice!

01#PyEx — Python —One Hot Encoding (OHE) — Transforms categories into Numbers — Sex

To see all the categories in a categorical column, let’s use Pandas’ unique method, first on the sx column:
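A small sketch of that inspection step follows. The rows below are made up, and they assume the categories are stored as text labels such as "male" and "female"; the `salary` column name is also just a placeholder.

```python
import pandas as pd

# Hypothetical rows shaped like the salary data; values are invented.
df = pd.DataFrame({
    "sx": ["male", "female", "male", "female"],
    "rk": ["full", "assistant", "associate", "full"],
    "dg": ["doctorate", "masters", "doctorate", "doctorate"],
    "salary": [36350, 16094, 24750, 28200],
})

# unique() lists each distinct value once, in order of first appearance.
sex_values = df["sx"].unique()
print(sex_values)

# value_counts() additionally shows how many rows fall in each category.
sex_counts = df["sx"].value_counts()
print(sex_counts)
```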

Let’s do OHE, first in the sex column:
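One way this step might look, assuming the sx column holds the labels "male" and "female" (the sample rows are hypothetical):

```python
import pandas as pd

# Hypothetical sample of the sx column.
df = pd.DataFrame({"sx": ["male", "female", "male", "female"]})

# Full one-hot encoding: one 0/1 column per category.
sex_dummies = pd.get_dummies(df["sx"], dtype=int)
print(sex_dummies)
```

The resulting columns are named after the categories and appear in sorted order.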

Fig 2. The full set of dummy columns
Fig 3. Dropping the first column, with no data lost :)

By dropping the first column we did not lose any information, right? With only two categories, a 0 in the male column can only mean female.
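The claim can be checked with a short sketch. It uses the `drop_first` parameter of get_dummies on hypothetical sample rows and rebuilds the dropped column from the one that was kept.

```python
import pandas as pd

# Hypothetical sample of the sx column.
df = pd.DataFrame({"sx": ["male", "female", "male", "female"]})

full = pd.get_dummies(df["sx"], dtype=int)                      # female, male
reduced = pd.get_dummies(df["sx"], drop_first=True, dtype=int)  # male only

# The dropped column is fully determined by the remaining one.
recovered_female = 1 - reduced["male"]
print((recovered_female == full["female"]).all())
```

In general, k categories need only k - 1 dummy columns; a row of all zeros stands for the dropped category.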

Now we need either Pandas’ merge or concat method to attach the dummy columns to our DataFrame so we can work with them properly. We will add them as columns (axis=1):
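A sketch of the concatenation step, using pd.concat on hypothetical rows. Dropping the original sx column afterwards is an assumption about the desired result, not something the post states.

```python
import pandas as pd

# Hypothetical rows; "salary" is a placeholder column name.
df = pd.DataFrame({
    "sx": ["male", "female", "male"],
    "salary": [36350, 16094, 24750],
})
sex_dummies = pd.get_dummies(df["sx"], drop_first=True, dtype=int)

# axis=1 places the dummy columns side by side with the existing ones;
# rows are matched on the index, which both objects share here.
df = pd.concat([df, sex_dummies], axis=1)

# Optional: remove the text column now that it is encoded.
df = df.drop(columns="sx")
print(df)
```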

Fig 4. Let’s change the order of the columns…

Fine, now let’s rearrange the columns with a small Python trick so that the male column appears on the left (it’s up to you to decide which arrangement is best :)):

Now display the table in the customized column order:
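The exact trick used in the notebook is not shown here, so the following is only one common way to do it: treat the column names as a plain Python list, move the last name to the front with slicing, and index the DataFrame with the new order. The data is hypothetical.

```python
import pandas as pd

# Hypothetical frame where the new "male" column sits at the far right.
df = pd.DataFrame({
    "rk": ["full", "assistant", "associate"],
    "salary": [36350, 16094, 24750],
    "male": [1, 0, 1],
})

cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]  # last column name first, the rest unchanged
df = df[cols]
print(df)
```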

Fig 5. Much, much better… We are almost there…

02#PyEx — Python — One Hot Encoding (OHE) — Transforms categories into Numbers — Rank

Now let’s deal with rank (rk):

Creating another dummy set:
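For a column with three categories the same call applies. This sketch assumes rk holds the labels "assistant", "associate" and "full"; the rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample of the rk column.
df = pd.DataFrame({"rk": ["full", "assistant", "associate", "full"]})

# Three categories give three 0/1 columns.
rank_dummies = pd.get_dummies(df["rk"], dtype=int)
print(rank_dummies)

# With drop_first=True only two columns remain; a row with 0 in both
# stands for the dropped category ("assistant", first in sorted order).
rank_reduced = pd.get_dummies(df["rk"], drop_first=True, dtype=int)
print(rank_reduced)
```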

Fig 6. Dummy 2 from rank, valid codes: 0 or 1 (Awesome, right? That’s a simplification :)

Let’s concatenate this second dummy DataFrame:

Fig 7. Dummy 2 concatenated with df Dataset

03#PyEx — Python — One Hot Encoding (OHE) — Transforms categories into Numbers — Degree

Now the last one (dg = Highest degree, coded 1 if doctorate, 0 if masters):

Let’s see:

Fig 8. Dummy 3; let’s simplify it once more…
Fig 9. That’s better! One means a doctorate, zero means a masters…
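One detail is worth sketching here. get_dummies orders the categories alphabetically, so `drop_first=True` would remove "doctorate" and keep "masters". To get a column where 1 means doctorate, the other column can be dropped by name instead. The rows are hypothetical and assume the labels "doctorate" and "masters".

```python
import pandas as pd

# Hypothetical sample of the dg column.
df = pd.DataFrame({"dg": ["doctorate", "masters", "doctorate"]})

degree_dummies = pd.get_dummies(df["dg"], dtype=int)  # doctorate, masters

# Keep "doctorate" so that 1 = doctorate and 0 = masters.
degree = degree_dummies.drop(columns="masters")
print(degree)

# For comparison: drop_first keeps the opposite column.
by_drop_first = pd.get_dummies(df["dg"], drop_first=True, dtype=int)
print(by_drop_first.columns.tolist())
```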

We are almost there; concatenating now:

Fig 10. There you have it! \o/

Now the dataset is ready for an AI algorithm :)
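As a compact alternative to the step-by-step route, get_dummies can encode several columns in one call through its `columns` parameter. This is a sketch on hypothetical rows, not the notebook's code. Note that the new columns get a prefix from the source column, and that `drop_first=True` keeps "masters" rather than "doctorate" for dg.

```python
import pandas as pd

# Hypothetical rows shaped like the salary data; values are invented.
df = pd.DataFrame({
    "sx": ["male", "female", "male", "female"],
    "rk": ["full", "assistant", "associate", "full"],
    "dg": ["doctorate", "masters", "doctorate", "doctorate"],
    "salary": [36350, 16094, 24750, 28200],
})

# Encode all three categorical columns at once; the originals are replaced.
encoded = pd.get_dummies(
    df, columns=["sx", "rk", "dg"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())

# Every remaining column should now be numeric.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in encoded.dtypes)
print(all_numeric)
```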

That’s All for this lecture!

See you in the next Python Episode!

Bye!!!!

Colab File link:)

Google Drive link:)

Credits & References

Datasets by data.princeton.edu

One-Hot Encoding in Python — Implementation using Sklearn by morioh.com

The Dummy’s Guide to Creating Dummy Variables by Rowan Langford

Posts Related:

00Episode#PySeries — Python — Jupiter Notebook Quick Start with VSCode — How to Set your Win10 Environment to use Jupiter Notebook

01Episode#PySeries — Python — Python 4 Engineers — Exercises! An overview of the Opportunities Offered by Python in Engineering!

02Episode#PySeries — Python — Geogebra Plus Linear Programming- We’ll Create a Geogebra program to help us with our linear programming

03Episode#PySeries — Python — Python 4 Engineers — More Exercises! — Another Round to Make Sure that Python is Really Amazing!

04Episode#PySeries — Python — Linear Regressions — The Basics — How to Understand Linear Regression Once and For All!

05Episode#PySeries — Python — NumPy Init & Python Review — A Crash Python Review & Initialization at NumPy lib.

06Episode#PySeries — Python — NumPy Arrays & Jupyter Notebook — Arithmetic Operations, Indexing & Slicing, and Conditional Selection w/ np arrays.

07Episode#PySeries — Python — Pandas — Intro & Series — What it is? How to use it?

08Episode#PySeries — Python — Pandas DataFrames — The primary Pandas data structure! It is a dict-like container for Series objects

09Episode#PySeries — Python — Python 4 Engineers — Even More Exercises! — More Practicing Coding Questions in Python!

10Episode#PySeries — Python — Pandas — Hierarchical Index & Cross-section — Open your Colab notebook and here are the follow-up exercises!

11Episode#PySeries — Python — Pandas — Missing Data — Let’s Continue the Python Exercises — Filling & Dropping Missing Data

12Episode#PySeries — Python — Pandas — Group By — Grouping large amounts of data and compute operations on these groups

13Episode#PySeries — Python — Pandas — Merging, Joining & Concatenations — Facilities For Easily Combining Together Series or DataFrame

14Episode#PySeries — Python — Pandas — Pandas Data frame Examples: Column Operations

15Episode#PySeries — Python — Python 4 Engineers — Keeping It In The Short-Term Memory — Test Yourself! Coding in Python, Again!

16Episode#PySeries — NumPy — NumPy Review, Again;) — Python Review Free Exercises

17Episode#PySeries — Generators in Python — Python Review Free Hints

18Episode#PySeries — Pandas Review…Again;) — Python Review Free Exercise

19Episode#PySeries — Matplotlib & Seaborn Python Libs — Reviewing these Plotting & Statistics Packs

20Episode#PySeries — Seaborn Python Review — Reviewing these Plotting & Statistics Packs

21Episode#PySeries — Pandas— Pandas — One Hot Encoding (OHE) — Pandas Dataframe Examples: AI Secrets (this one)

Image from https://morioh.com/p/3c7873d7be5e

Only love and art make existence feasible.

Available at: <https://dicionariocriativo.com.br/citacoes/leniente/citacoes/toler%C3%A2ncia>. Accessed on: 03/30/2021.


😎 Gilberto Oliveira Jr | 🖥️ Computer Engineer | 🐍 Python | 🧩 C | 💎 Rails | 🤖 AI & IoT | ✍️