# Predict gender with voice and speech data

A beginner’s guide to implementing classification algorithms in Python

## Audience

The intended audience for this short blog post is readers who understand machine learning basics and are interested in implementing supervised learning in Python. We will briefly introduce a number of popular classification methods available in the scikit-learn library: decision trees, random forests, gradient boosting, support vector machines (SVM), and multilayer perceptrons (neural networks).

## Objective

To predict a speaker's gender from the corresponding voice and speech features.

## Dataset

The dataset can be downloaded from Kaggle. It consists of 3,168 observations of 21 variables, as listed below (a quick load-and-inspect snippet follows the list).

1 target variable:

- label (male or female)

20 independent variables:

- meanfreq: mean frequency (in kHz)
- sd: standard deviation of frequency
- median: median frequency (in kHz)
- Q25: first quartile (in kHz)
- Q75: third quartile (in kHz)
- IQR: interquartile range (in kHz)
- skew: skewness (see note in specprop description)
- kurt: kurtosis (see note in specprop description)
- sp.ent: spectral entropy
- sfm: spectral flatness
- mode: mode frequency
- centroid: frequency centroid (see specprop)
- meanfun: mean fundamental frequency measured across the acoustic signal
- minfun: minimum fundamental frequency measured across the acoustic signal
- maxfun: maximum fundamental frequency measured across the acoustic signal
- meandom: mean of dominant frequency measured across the acoustic signal
- mindom: minimum of dominant frequency measured across the acoustic signal
- maxdom: maximum of dominant frequency measured across the acoustic signal
- dfrange: range of dominant frequency measured across the acoustic signal
- modindx: modulation index
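
Before modeling, it helps to confirm the shape and class balance. Here is a minimal load-and-inspect sketch, assuming the CSV has been downloaded to voice/voice.csv (the same path used in the Python Code section below):

```python
import pandas as pd

# Load the Kaggle voice dataset (the path is an assumption; adjust to your download)
mydata = pd.read_csv("voice/voice.csv")

print(mydata.shape)                    # expected: (3168, 21)
print(mydata['label'].value_counts())  # male/female counts; the classes should be evenly balanced
```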

## Exploration

Let us first look at histograms of each independent variable, split by the target label. As the plots show, variables such as sd, Q25, IQR, sp.ent, sfm, mode, and meanfun may help us separate male voices from female voices.
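
To back up the visual impression with numbers, here is a small sketch (reusing mydata from the loading snippet above) that compares per-class means for those variables; a large gap between the male and female rows suggests a useful feature:

```python
# Compare per-class means for the features that look most separable
# in the histograms; 'sp.ent' contains a dot, hence the string access
separable = ['sd', 'Q25', 'IQR', 'sp.ent', 'sfm', 'mode', 'meanfun']
print(mydata.groupby('label')[separable].mean().T)
```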

## Preparation

Here we preprocess the data, since algorithms such as neural networks and SVM tend to perform better on scaled features. We also split the full dataset into training and test sets (an 80/20 split in the code below).
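
The full preparation code appears in the Python Code section at the end. As a variation worth knowing (not used in the original code), scikit-learn's Pipeline can bundle the scaler with a classifier so that scaling statistics are computed on the training data only. A minimal sketch, with SVC standing in as the model:

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Features are the first 20 columns; encode the label as 0 (male) / 1 (female)
X = mydata.iloc[:, 0:20].values
y = (mydata['label'] == 'female').astype(int).values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, test_size=0.2)

# The pipeline fits StandardScaler on the training split only,
# so no test-set information leaks into the scaling
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_train, y_train)
print("Test accuracy: {:.3f}".format(pipe.score(X_test, y_test)))
```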

## Models

### 1. Decision Tree

```
Accuracy on training set: 1.000
Accuracy on test set: 0.961
```

The perfect training accuracy suggests the unpruned tree overfits; the test score is the more informative number.

### 2. Random Forests

```
Accuracy on training set: 0.998
Accuracy on test set: 0.976
```

### 3. Gradient Boosting

```
Accuracy on training set: 0.996
Accuracy on test set: 0.975
```

### 4. Support Vector Machine

```
Accuracy on training set: 0.985
Accuracy on test set: 0.984
```

### 5. Multilayer Perceptron

```
Accuracy on training set: 0.995
Accuracy on test set: 0.981
```
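
One caveat before diving into the code: every score above comes from a single 80/20 split, so the third decimal place should not be taken too seriously. A sketch of 5-fold cross-validation (reusing X and y from the pipeline sketch in the Preparation section) gives a more stable estimate:

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Score the scaled SVM on 5 different train/test folds
pipe = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(pipe, X, y, cv=5)
print("CV accuracy: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))
```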
## Python Code
```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Read the voice dataset
mydata = pd.read_csv("voice/voice.csv")

# Preview the voice dataset
print(mydata.head())
print(mydata.shape)

# Plot the histograms, one per feature, split by gender
male = mydata.loc[mydata['label'] == 'male']
female = mydata.loc[mydata['label'] == 'female']

fig, axes = plt.subplots(10, 2, figsize=(10, 20))
ax = axes.ravel()

for i in range(20):
    ax[i].hist(male.iloc[:, i], bins=20, color=mglearn.cm3(0), alpha=.5)
    ax[i].hist(female.iloc[:, i], bins=20, color=mglearn.cm3(2), alpha=.5)
    ax[i].set_title(list(male)[i])
    ax[i].set_yticks(())
ax[0].set_xlabel("Feature magnitude")
ax[0].set_ylabel("Frequency")
ax[0].legend(["male", "female"], loc="best")
fig.tight_layout()

# Prepare data for modeling: encode the label as 0 (male) / 1 (female)
# (assigning via .loc with a row mask avoids chained-assignment warnings)
mydata.loc[mydata['label'] == 'male', 'label'] = 0
mydata.loc[mydata['label'] == 'female', 'label'] = 1

# 80/20 train/test split, then scale features using training-set statistics
mydata_train, mydata_test = train_test_split(mydata, random_state=0, test_size=.2)
scaler = StandardScaler()
scaler.fit(mydata_train.iloc[:, 0:20])
X_train = scaler.transform(mydata_train.iloc[:, 0:20])
X_test = scaler.transform(mydata_test.iloc[:, 0:20])
y_train = list(mydata_train['label'].values)
y_test = list(mydata_test['label'].values)

# Train decision tree model
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Decision Tree")
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

# Train random forest model
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_train, y_train)
print("Random Forests")
print("Accuracy on training set: {:.3f}".format(forest.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(forest.score(X_test, y_test)))

# Train gradient boosting model
gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("Gradient Boosting")
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))

# Train support vector machine model
svm = SVC().fit(X_train, y_train)
print("Support Vector Machine")
print("Accuracy on training set: {:.3f}".format(svm.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(svm.score(X_test, y_test)))

# Train neural network model
mlp = MLPClassifier(random_state=0).fit(X_train, y_train)
print("Multilayer Perceptron")
print("Accuracy on training set: {:.3f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test, y_test)))

# Plot the variable importance for the tree-based models
def plot_feature_importances_mydata(model):
    n_features = X_train.shape[1]
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), list(mydata)[:n_features])  # drop 'label'
    plt.xlabel("Variable importance")
    plt.ylabel("Independent Variable")

plot_feature_importances_mydata(tree)
plot_feature_importances_mydata(forest)
plot_feature_importances_mydata(gbrt)

# Plot a heatmap of the first-layer weights of the neural network
plt.figure(figsize=(100, 20))
plt.imshow(mlp.coefs_[0], interpolation='none', cmap='viridis')
plt.yticks(range(20), list(mydata)[:20], fontsize=50)
plt.xlabel("Columns in weight matrix", fontsize=50)
plt.ylabel("Input feature", fontsize=50)
plt.colorbar().set_label('Importance', size=50)
plt.show()
```
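
Accuracy alone does not show which gender is misclassified more often. As a small extension beyond the original code, a confusion matrix for the SVM (reusing svm, X_test, and y_test from above) breaks the errors down by class:

```python
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predictions (0 = male, 1 = female);
# the off-diagonal entries count the misclassified voices per class
y_pred = svm.predict(X_test)
print(confusion_matrix(y_test, y_pred))
```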