Predict gender with voice and speech data

A beginner’s guide to implementing classification algorithms in Python

Image: Information Age

This short blog post is intended for readers who understand machine learning basics and are interested in implementing supervised learning in Python. We briefly introduce several popular classification methods available in the scikit-learn library: decision trees, random forests, gradient boosting, support vector machines (SVM), and neural networks.


The objective is to predict gender from the corresponding voice and speech features.


The dataset can be downloaded here on Kaggle. It consists of 3,168 observations with 21 variables, as listed below.

1 target variable:

label (male or female)

20 independent variables:

meanfreq: mean frequency (in kHz)
sd: standard deviation of frequency
median: median frequency (in kHz)
Q25: first quartile (in kHz)
Q75: third quartile (in kHz)
IQR: interquartile range (in kHz)
skew: skewness (see note in specprop description)
kurt: kurtosis (see note in specprop description)
sp.ent: spectral entropy
sfm: spectral flatness
mode: mode frequency
centroid: frequency centroid (see specprop)
meanfun: mean fundamental frequency measured across acoustic signal
minfun: minimum fundamental frequency measured across acoustic signal
maxfun: maximum fundamental frequency measured across acoustic signal
meandom: mean of dominant frequency measured across acoustic signal
mindom: minimum of dominant frequency measured across acoustic signal
maxdom: maximum of dominant frequency measured across acoustic signal
dfrange: range of dominant frequency measured across acoustic signal
modindx: modulation index
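
As a quick illustration of how some of these variables relate to each other, the sketch below builds a tiny, hypothetical stand-in for a few rows of voice.csv (the numbers are made up, not taken from the dataset), derives IQR from Q25 and Q75, and counts the observations per class.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of voice.csv; the values are made up.
sample = pd.DataFrame({
    "Q25": [0.015, 0.110, 0.120],
    "Q75": [0.090, 0.200, 0.225],
    "label": ["male", "female", "female"],
})

# The interquartile range is the gap between the third and first quartiles.
sample["IQR"] = sample["Q75"] - sample["Q25"]

# Count how many observations fall in each class.
class_counts = sample["label"].value_counts()

print(sample)
print(class_counts)
```

The same two checks (derived columns and class counts) can be run on the real data frame once it is loaded with pandas.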


Let us first look at the histogram of each independent variable, split by the target variable. As we can see below, variables such as sd, Q25, IQR, sp.ent, sfm, mode, and meanfun may help us separate male voices from female voices, because their distributions differ noticeably between the two classes.

Histograms of 20 independent variables against the target variable
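
A minimal sketch of how such overlaid histograms can be drawn with plain matplotlib is shown here. It assumes a data frame with a label column and uses two synthetic, made-up features in place of the real voice data; the default matplotlib colors are used instead of the mglearn color map.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for voice.csv: two made-up features and a label column.
demo = pd.DataFrame({
    "meanfun": np.concatenate([rng.normal(0.12, 0.02, n),
                               rng.normal(0.17, 0.02, n)]),
    "modindx": rng.normal(0.17, 0.05, 2 * n),
    "label": ["male"] * n + ["female"] * n,
})

features = [c for c in demo.columns if c != "label"]
fig, axes = plt.subplots(len(features), 1, figsize=(6, 3 * len(features)))
axes_flat = np.ravel(axes)

# One panel per feature, with the two classes overlaid semi-transparently.
for ax, name in zip(axes_flat, features):
    for cls in ("male", "female"):
        ax.hist(demo.loc[demo["label"] == cls, name],
                bins=20, alpha=0.5, label=cls)
    ax.set_title(name)

axes_flat[0].legend(loc="best")
fig.tight_layout()
```

In this synthetic example, the first feature shows two shifted distributions while the second overlaps almost completely, which is the visual cue we look for when judging whether a variable may help separate the classes.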

Next we preprocess the data by standardizing the features, as some algorithms such as neural networks and SVM tend to perform better with scaled data. We also split the full dataset into training (80%) and test (20%) datasets.
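
The split-then-scale step can be sketched as follows. This is an illustration on synthetic data generated with make_classification (an assumption standing in for the voice dataset); the point is that the scaler is fitted on the training features only and then applied to both sets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 20 numeric features and a binary target.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold out 20% of the observations as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training data only, then transform both sets,
# so that no information from the test set leaks into preprocessing.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

After scaling, each training feature has mean 0 and standard deviation 1; the test features are close to that but not exactly, since they are transformed with the training statistics.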


1…Decision Tree

Accuracy on training set: 1.000
Accuracy on test set: 0.961

2…Random Forests

Accuracy on training set: 0.998
Accuracy on test set: 0.976

3…Gradient Boosting

Accuracy on training set: 0.996
Accuracy on test set: 0.975

4…Support Vector Machine

Accuracy on training set: 0.985
Accuracy on test set: 0.984

5…Multilayer Perceptron

Accuracy on training set: 0.995
Accuracy on test set: 0.981
Heat map of the first layer weights in the trained neural network
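
To make the heat map easier to read, the sketch below shows what the first-layer weight matrix looks like in scikit-learn. It uses synthetic data as a stand-in for the voice features and sets hidden_layer_sizes and max_iter explicitly for illustration; each row of the matrix corresponds to one input feature and each column to one hidden unit.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 20 scaled input features and a binary target.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)

# One hidden layer with 100 units (the scikit-learn default layer size).
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500,
                    random_state=0).fit(X, y)

# coefs_[0] holds the weights between the input layer and the first hidden layer.
first_layer = mlp.coefs_[0]
print(first_layer.shape)  # (20, 100): rows are input features, columns are hidden units
```

Rows whose weights are all close to zero correspond to features the network relies on less, which is what the heat map is meant to reveal.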
Python Code
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
#Read the voice dataset
mydata = pd.read_csv("voice/voice.csv")
#Preview voice dataset
#Plot the histograms
male = mydata.loc[mydata['label']=='male']
female = mydata.loc[mydata['label']=='female']
fig, axes = plt.subplots(10, 2, figsize=(10,20))
ax = axes.ravel()
for i in range(20):
    #.ix has been removed from pandas; use positional indexing with .iloc
    ax[i].hist(male.iloc[:, i], bins=20, color=mglearn.cm3(0), alpha=.5)
    ax[i].hist(female.iloc[:, i], bins=20, color=mglearn.cm3(2), alpha=.5)

ax[0].set_xlabel("Feature magnitude")
ax[0].legend(["male", "female"], loc="best")
#Prepare data for modeling
#Encode the target: male = 0, female = 1
mydata['label'] = mydata['label'].map({'male': 0, 'female': 1})
mydata_train, mydata_test = train_test_split(mydata, random_state=0, test_size=.2)
#Fit the scaler on the training features only, then apply it to both sets
scaler = StandardScaler().fit(mydata_train.iloc[:,0:20])
X_train = scaler.transform(mydata_train.iloc[:,0:20])
X_test = scaler.transform(mydata_test.iloc[:,0:20])
y_train = list(mydata_train['label'].values)
y_test = list(mydata_test['label'].values)
#Train decision tree model
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Decision Tree")
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
#Train random forest model
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_train, y_train)
print("Random Forests")
print("Accuracy on training set: {:.3f}".format(forest.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(forest.score(X_test, y_test)))
#Train gradient boosting model
gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("Gradient Boosting")
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))
#Train support vector machine model
svm = SVC().fit(X_train, y_train)
print("Support Vector Machine")
print("Accuracy on training set: {:.3f}".format(svm.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(svm.score(X_test, y_test)))
#Train neural network model
mlp = MLPClassifier(random_state=0).fit(X_train, y_train)
print("Multilayer Perceptron")
print("Accuracy on training set: {:.3f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test, y_test)))
#Plot the variable importance
def plot_feature_importances_mydata(model):
    n_features = X_train.shape[1]
    plt.barh(range(n_features), model.feature_importances_, align='center')
    #Use only the feature names (the last column of mydata is the label)
    plt.yticks(np.arange(n_features), list(mydata)[:n_features])
    plt.xlabel("Variable importance")
    plt.ylabel("Independent Variable")
#Plot the heatmap on first layer weights for neural network
plt.figure(figsize=(100, 20))
plt.imshow(mlp.coefs_[0], interpolation='none', cmap='viridis')
plt.yticks(range(20), list(mydata)[:20], fontsize = 50)
plt.xlabel("Columns in weight matrix", fontsize = 50)
plt.ylabel("Input feature", fontsize = 50)
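
The listing above defines a helper for plotting variable importance but never calls it. As a hedged usage sketch, the variant below takes the feature names as an argument (the name plot_feature_importances, the placeholder feature names, and the synthetic data are assumptions for illustration) and is applied to a small random forest.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def plot_feature_importances(model, feature_names):
    """Draw a horizontal bar chart of a fitted tree-based model's importances."""
    n_features = len(feature_names)
    fig, ax = plt.subplots()
    ax.barh(range(n_features), model.feature_importances_, align="center")
    ax.set_yticks(np.arange(n_features))
    ax.set_yticklabels(feature_names)
    ax.set_xlabel("Variable importance")
    ax.set_ylabel("Independent Variable")
    return ax

# Synthetic stand-in for the 20 voice features, with placeholder names.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
names = ["f{}".format(i) for i in range(20)]

forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
ax = plot_feature_importances(forest, names)
```

The same call works for the decision tree and gradient boosting models, since they also expose feature_importances_; SVC and MLPClassifier do not, which is why the heat map of first-layer weights is used for the neural network instead.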

Questions, comments, or concerns?