CODEX
Overview
Numerical, categorical, time series, text, and geolocation data are the common data types that data scientists and analysts deal with daily. I talked about time series data with Pandas previously. In this article, let’s go through categorical data.
The Basics
Normally a categorical variable takes on a limited, and usually fixed, number of possible values, with or without an order.
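To make the "with or without an order" distinction concrete, here is a minimal sketch using `pd.Categorical`. The color and size values are made up for illustration and are not part of the article's dataset.

```python
import pandas as pd

# Unordered: colors have no natural ranking
color = pd.Categorical(['red', 'green', 'red'])

# Ordered: sizes have a meaningful ranking, which we state explicitly
size = pd.Categorical(
    ['small', 'large', 'medium'],
    categories=['small', 'medium', 'large'],
    ordered=True,
)

print(color.ordered)           # False
print(size.ordered)            # True
print(size.min(), size.max())  # min/max only make sense for ordered categories
```

For an unordered categorical, calling `min()` or `max()` raises a `TypeError`, since there is no ranking to compare by.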
Represent Categories by Numbers
Let’s create a Pandas series with a range of different colors.
import pandas as pd
colors = pd.Series(['green', 'yellow', 'black', 'blue', 'green', 'red', 'yellow'])
print(colors)
pd.unique(colors)
As you can see, the default data type is object.
0 green
1 yellow
2 black
3 blue
4 green
5 red
6 yellow
array(['green', 'yellow', 'black', 'blue', 'red'], dtype=object)
For efficiency and better performance, normally in analytics, we represent the values as integers.
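As a rough sketch of that idea, `pd.factorize` can derive integer codes from the color strings created earlier. Passing `sort=True` is a choice made here so that the codes follow alphabetical order of the labels; the variable names `codes` and `uniques` are just for illustration.

```python
import pandas as pd

colors = pd.Series(['green', 'yellow', 'black', 'blue', 'green', 'red', 'yellow'])

# codes: one integer per row; uniques: the distinct labels, sorted
codes, uniques = pd.factorize(colors, sort=True)

print(list(uniques))  # ['black', 'blue', 'green', 'red', 'yellow']
print(list(codes))    # [2, 4, 0, 1, 2, 3, 4]

# The original labels can be recovered by indexing uniques with the codes
print(list(uniques[codes]))
```

Each string is stored once in `uniques`, and the rows only hold small integers that point back to it.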
# black = 0, blue = 1, green = 2, red = 3, yellow = 4
values = pd.Series([0,0,4,3, 2,1,1, 0, 4] * 2)
colors = pd.Series(['black', 'blue', 'green', 'red'…