Handling categorical data for neural networks

vc, vclab, Jun 16, 2016

This is part of an experiment studying the applicability of neural networks.

Categorical data

Categorical data contain no intrinsic ordering among their values. Simply mapping the categories to numbers along a single dimension would impose an artificial ordering, and artificial distances, on them and mislead classifiers. Hence, each category should be treated as an independent dimension.
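To make the ordering problem concrete, here is a minimal sketch using NumPy. The three category names are hypothetical and only serve as an illustration; the point is how distances between categories differ under the two mappings.

```python
import numpy as np

# Hypothetical categorical variable with three unordered values.
categories = ["red", "green", "blue"]

# Integer mapping: each category becomes one number on a single axis.
integer_codes = np.arange(len(categories), dtype=float).reshape(-1, 1)

# One-hot mapping: each category gets its own dimension.
one_hot = np.eye(len(categories))

def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

int_dist = pairwise_distances(integer_codes)
hot_dist = pairwise_distances(one_hot)

# With integer codes, "red" is twice as far from "blue" as from "green".
print(int_dist[0, 1], int_dist[0, 2])
# With one-hot vectors, every pair of distinct categories is equally far apart.
print(hot_dist[0, 1], hot_dist[0, 2])
```

The integer mapping tells the classifier that some categories are closer to each other than others, even though nothing in the data says so. The one-hot mapping carries no such hint.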

Curse of dimensionality

Expanding categorical data into a number of new dimensions can make the dataset too complex to handle. In that case, the complexity of the dataset can be reduced with dimensionality reduction tools, such as principal component analysis (PCA) and independent component analysis (ICA), at the cost of some classification accuracy because information is lost in the reduction.
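As a rough sketch of what such a reduction might look like, the snippet below applies scikit-learn's `PCA` and `FastICA` to a one-hot encoded matrix and keeps half of the dimensions, as in the experiment reported later in this post. The data is randomly generated and hypothetical, and the choice of scikit-learn, the randomized SVD solver, and the parameter values are assumptions rather than the setup actually used.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 4 categorical variables with 5 values each,
# one-hot encoded into 4 * 5 = 20 columns.
codes = rng.integers(0, 5, size=(200, 4))
X = np.eye(5)[codes].reshape(200, -1)

# Keep half of the original dimensions.
k = X.shape[1] // 2

# PCA: project onto the k directions of largest variance.
pca = PCA(n_components=k, svd_solver="randomized", random_state=0)
X_pca = pca.fit_transform(X)

# ICA: estimate k statistically independent components.
ica = FastICA(n_components=k, max_iter=1000, random_state=0)
X_ica = ica.fit_transform(X)

print(X.shape, X_pca.shape, X_ica.shape)

# Fraction of the total variance that the kept principal components retain.
retained = pca.explained_variance_ratio_.sum()
print(retained)
```

Either reduced matrix can then be fed to the classifier in place of `X`. The retained variance gives a first idea of how much information PCA discards.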

Experiment results

We convert categorical variables into dummy variables.

import pandas as pd
df = pd.concat( [ pd.get_dummies( df[ x ] ) for x in cols ], axis=1 ).assign( target=df.target )
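Here is the same conversion on a tiny, hypothetical DataFrame, to show what the result looks like. The column names `p1` and `p2` and the values are made up. The `prefix=x` argument is an addition that is not in the line above; it keeps the dummy column names distinct when two variables share a value.

```python
import pandas as pd

# Hypothetical toy data: two categorical columns and a binary target.
df = pd.DataFrame({
    "p1": ["A", "C", "A"],
    "p2": ["C", "C", "G"],
    "target": [0, 1, 0],
})
cols = ["p1", "p2"]

# One block of dummy columns per categorical variable, joined side by side.
# prefix=x keeps the names distinct when two variables share a value ("C" here).
dummies = pd.concat(
    [pd.get_dummies(df[x], prefix=x) for x in cols], axis=1
).assign(target=df.target)

print(list(dummies.columns))
# ['p1_A', 'p1_C', 'p2_C', 'p2_G', 'target']
```

Each row has exactly one active column per original variable, so two variables with two values each become four input dimensions.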

A neural network with one hidden layer and a dropout layer is trained with categorical cross-entropy as the objective and Adam as the optimizer.

from keras.models import Sequential
from keras.layers import Dense, Dropout

m = Sequential()
m.add( Dense( 16, input_dim=X.shape[ 1 ], kernel_initializer='glorot_normal', activation='relu' ) )
m.add( Dropout( .5 ) )
m.add( Dense( 2, kernel_initializer='glorot_normal', activation='softmax' ) )
m.compile( loss='categorical_crossentropy', optimizer='adam' )
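For readers unfamiliar with the objective, the following NumPy sketch writes out softmax and categorical cross-entropy by hand. It is an illustration of the formula, not Keras's implementation, and the labels and raw outputs are hypothetical. It also shows that the targets need to be one-hot encoded to match the two softmax outputs.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -sum(y_true * log(y_prob)) over the samples."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true * np.log(y_prob)).sum(axis=1).mean())

# Hypothetical binary labels, one-hot encoded to match the two softmax outputs.
labels = np.array([0, 1, 1])
y_true = np.eye(2)[labels]

# Hypothetical raw outputs of the final layer before softmax.
logits = np.array([[2.0, 0.0], [0.0, 2.0], [1.5, 0.5]])
y_prob = softmax(logits)

loss = categorical_crossentropy(y_true, y_prob)
accuracy = float((y_prob.argmax(axis=1) == labels).mean())
print(loss, accuracy)
```

The third sample is misclassified, so it contributes most of the loss, while the accuracy only records it as one error out of three.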

The network classifies 99.29% of the data correctly.

In contrast, if each categorical variable is mapped to a single dimension, a network with the same configuration classifies only 81.29% of the data correctly.
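The post does not show how the single-dimension mapping was done. One plausible way, sketched here with pandas on hypothetical toy data, is to replace each category with an integer code, so that every variable stays a single input column.

```python
import pandas as pd

# Hypothetical toy data: two categorical columns and a binary target.
df = pd.DataFrame({
    "p1": ["A", "C", "G"],
    "p2": ["G", "C", "C"],
    "target": [0, 1, 0],
})
cols = ["p1", "p2"]

# Replace each category by an integer code: one input dimension per variable.
# Codes follow the sorted order of the category labels within each column.
coded = df[cols].apply(lambda s: s.astype("category").cat.codes).assign(target=df.target)

print(coded)
```

The codes depend only on the alphabetical order of the labels, which is exactly the kind of arbitrary ordering the network then has to work around.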

Moreover, if PCA or ICA is applied to the dataset with dummy variables to halve its number of dimensions, a network with the same configuration classifies 97.30% and 95.66% of the data correctly, respectively, while the running time is reduced by 48.65% and 39.19%, respectively.

References

Halko, Nathan, Per-Gunnar Martinsson, and Joel A. Tropp. “Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions.” Technical report No. 2009-05 (2009).

Hyvärinen, Aapo, and Erkki Oja. “Independent component analysis: algorithms and applications.” Neural networks 13.4 (2000): 411–430.

Rögnvaldsson, Thorsteinn, Liwen You, and Daniel Garwicz. “State of the art prediction of HIV-1 protease cleavage sites.” Bioinformatics (2014): btu810.
