Python — Categorical Data with Pandas

alpha2phi
CodeX
5 min read · Jan 16, 2021

Overview

Numerical, categorical, time series, text, and geolocation data are common data types that data scientists and analysts deal with daily. I covered time series data with Pandas previously. In this article, let’s go through categorical data.

The Basics

Normally a categorical variable takes on a limited, and usually fixed, number of possible values, with or without an order.
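
As a minimal sketch of that definition, pandas can hold both ordered and unordered categories through `pd.Categorical`. The size and shade labels below are made up for illustration and do not come from the article.

```python
import pandas as pd

# Hypothetical example data: T-shirt sizes, which have a natural order.
sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)

# Hypothetical example data: shades, which have no natural order.
shades = pd.Categorical(["red", "blue", "red"])

# The fixed set of allowed values.
print(sizes.categories.tolist())

# Whether each categorical carries an order.
print(sizes.ordered, shades.ordered)

# min() and max() are only meaningful when the categories are ordered.
print(sizes.min(), sizes.max())
```

Passing `categories` explicitly fixes both the allowed values and, with `ordered=True`, their ranking; without it pandas infers the categories from the data.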

Represent Categories as Numbers

Let’s create a Pandas series with a range of different colors.

import pandas as pd

colors = pd.Series(['green', 'yellow', 'black', 'blue', 'green', 'red', 'yellow'])
print(colors)
pd.unique(colors)

As you can see, the default data type is object.

0     green
1    yellow
2     black
3      blue
4     green
5       red
6    yellow
dtype: object
array(['green', 'yellow', 'black', 'blue', 'red'], dtype=object)

For efficiency and better performance, in analytics we normally represent the values as integers.

# black = 0, blue = 1, green = 2, red = 3, yellow = 4
values = pd.Series([0, 0, 4, 3, 2, 1, 1, 0, 4] * 2)
colors = pd.Series(['black', 'blue', 'green', 'red'…
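
Pandas can also produce this kind of integer encoding itself. The following is a minimal sketch rather than the author's code: it assumes the seven-color series from the first example and converts it with `astype('category')`, after which `.cat.categories` holds the lookup table of labels and `.cat.codes` holds one integer per row.

```python
import pandas as pd

# Assumption: the same seven colors as in the first example.
colors = pd.Series(['green', 'yellow', 'black', 'blue', 'green', 'red', 'yellow'])

# Convert to the category dtype; pandas sorts the distinct labels
# and stores one small integer code per row.
cat_colors = colors.astype('category')

# Lookup table of labels.
print(cat_colors.cat.categories.tolist())

# One integer code per row, indexing into the lookup table.
print(cat_colors.cat.codes.tolist())

# Decode by indexing the lookup table with the codes.
decoded = pd.Series(cat_colors.cat.categories.take(cat_colors.cat.codes.to_numpy()))
print(decoded.tolist() == colors.tolist())
```

Because the inferred categories are sorted alphabetically, the codes here follow the same mapping as the comment in the listing above: black = 0, blue = 1, green = 2, red = 3, yellow = 4.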

alpha2phi, writer for CodeX. Software engineer, Data Science and ML practitioner.