Data Play: Balance Scale

Andrew Xia
Published in BuzzRobot · 8 min read · Jan 8, 2018

I am a computer scientist with plenty of experience building websites, apps, and the like; however, my skills in machine learning and working with data leave room for improvement. So, by exploring various datasets and documenting my methods, I hope to improve my data science skills and help others who may be in a similar position! :)

The dataset we will tackle today is a small toy set called the Balance Scale dataset. It’s a classification set describing the weights and distances on a scale, and the task is to predict whether the scale tips left, tips right, or stays balanced. I selected it because its small size and simple nature make it a good starting point for practicing analysis and reasoning. Furthermore, because the class depends directly on the features, it should be possible to reach 100% accuracy (or close to it) fairly easily.

Dataset: https://archive.ics.uci.edu/ml/datasets/Balance+Scale

The tools we’ll be using are:

  • Python 3.4 (Anaconda)
  • Pandas
  • NumPy
  • scikit-learn
  • seaborn
  • Matplotlib (pyplot)

Preparing

The data is available in CSV format with five columns. I’ve added a title for each column in the first row of the file so Pandas plays a little nicer. The first few rows look like this:

balance,left_weight,left_distance,right_weight,right_distance
B,1,1,1,1
R,1,1,1,2
R,1,1,1,3

The only other thing we have to do is encode the class labels as integers. Here is the code to do that:

import pandas as pd
from sklearn.preprocessing import LabelEncoder


def load_data():
    """
    Loads the csv into a pandas dataframe.

    :return: dataframe
    """
    return pd.read_csv('data/data.txt', sep=',', header=0)


def prepare_data(data):
    """
    Encodes the balance class as integers.

    :return: data with encoded classes
    """
    data['balance'] = LabelEncoder().fit_transform(data['balance'])
    return data


if __name__ == '__main__':
    raw_data = load_data()
    prepared_data = prepare_data(raw_data)

Note: an alternative approach is to one-hot encode the class and train a one-vs-all classifier; similar performance would be expected.
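
A minimal sketch of that alternative, assuming pandas’ get_dummies is applied to the class column (the function name here is illustrative, not part of the project code):

import pandas as pd

def prepare_data_one_hot(data):
    # turn the single class column into one binary column per class
    # (balance_B, balance_L, balance_R), suitable for one-vs-all training
    return pd.get_dummies(data, columns=['balance'])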

Exploring

Although this is a tiny dataset with a simple enough problem that we could skip straight to modelling without much worry, we should still apply best practices and explore the dataset first.

Right off the bat, we notice a few obvious things:

Number of rows: 625
Number of class 0: 288 | 46.08%
Number of class 1: 49 | 7.84%
Number of class 2: 288 | 46.08%

The features are all integers ranging from 1 to 5; a quick check of their value counts shows their makeup.
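
For instance, with an illustrative snippet like the following (reusing prepared_data from the preparation step; not part of the original exploration code):

# count how often each value occurs in every feature column
for col in ['left_weight', 'left_distance', 'right_weight', 'right_distance']:
    print(prepared_data[col].value_counts().sort_index())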

So they are all evenly distributed throughout their values. Let’s see how they interact with the class.

This tells us that the features are decent predictors of the class, which affirms our earlier assumption. Next, let’s check whether there are any underlying structures in the data by looking at the t-SNE and PCA scatter plots.

Well, this isn’t the best. The data doesn’t seem very easily separable and definitely messier than a perfectly separable dataset should be. Let’s see if we can improve it.

P.S. the code to do all of this exploration is as follows:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def explore_data(data):
    """
    Performs various forms of data visualisation and analysis.

    :param data: dataframe to explore
    :return: None
    """
    [m, n] = data.shape
    attributes = data.columns.values

    def data_summary():
        """
        Prints a summary of the data.

        :return: None
        """
        print('Number of rows: {}'.format(m))
        for i in range(0, 3):
            [m_class, _] = data[data['balance'] == i].shape
            print('Number of class {}: {} | {}%'.format(i, m_class, m_class / m * 100))
        data.info()  # info() prints its summary directly
        print(data.describe())

    def occurrence_histogram():
        """
        Displays histograms of feature value occurrences.

        :return: None
        """
        nrows = n // 2
        fig, ax = plt.subplots(ncols=2, nrows=nrows)
        for i in range(1, len(attributes)):
            row, col = (i - 1) // 2, 1 if i % 2 == 0 else 0
            ax_ref = ax[col] if nrows <= 1 else ax[row, col]
            sns.distplot(data[attributes[i]], kde=False, vertical=True, ax=ax_ref)
            ax_ref.set(xlabel='# of occurrences', ylabel='value', title=attributes[i])
        fig.subplots_adjust(hspace=0.75, wspace=0.75)
        plt.show()

    def correlation_matrix():
        """
        Shows a matrix of correlations between attributes.

        :return: None
        """
        plt.subplots(figsize=(10, 10))
        sns.heatmap(data[attributes].corr(), annot=True, fmt=".2f", cmap="coolwarm")
        plt.yticks(rotation=0)
        plt.xticks(rotation=90)
        plt.title('Correlation Matrix')
        plt.show()

    def covariance_matrix():
        """
        Shows a matrix of covariances between attributes.

        :return: None
        """
        plt.subplots(figsize=(10, 10))
        sns.heatmap(data[attributes].cov(), annot=True, fmt=".2f", cmap="coolwarm")
        plt.yticks(rotation=0)
        plt.xticks(rotation=90)
        plt.title('Covariance Matrix')
        plt.show()

    def t_sne_scatter():
        """
        Shows a scatter plot of the t-SNE embedding.

        :return: None
        """
        features = data.drop(['balance'], axis=1)
        labels = data['balance']

        t_sne = TSNE(n_components=2)
        embed = t_sne.fit_transform(features)
        x1, y1, x2, y2, x3, y3 = [], [], [], [], [], []
        for i, e in enumerate(embed):
            if labels[i] == 1:
                x1 += [e[0]]
                y1 += [e[1]]
            elif labels[i] == 2:
                x2 += [e[0]]
                y2 += [e[1]]
            else:
                x3 += [e[0]]
                y3 += [e[1]]
        plt.scatter(x1, y1, c='r')
        plt.scatter(x2, y2, c='b')
        plt.scatter(x3, y3, c='g')
        plt.title('T-SNE Embedding')
        plt.show()

    def pca_scatter():
        """
        Shows a scatter plot of the PCA decomposition.

        :return: None
        """
        features = data.drop(['balance'], axis=1)
        labels = data['balance']

        pca = PCA(n_components=2)
        decomposition = pca.fit_transform(features)
        x1, y1, x2, y2, x3, y3 = [], [], [], [], [], []
        for i, e in enumerate(decomposition):
            if labels[i] == 1:
                x1 += [e[0]]
                y1 += [e[1]]
            elif labels[i] == 2:
                x2 += [e[0]]
                y2 += [e[1]]
            else:
                x3 += [e[0]]
                y3 += [e[1]]
        plt.scatter(x1, y1, c='r')
        plt.scatter(x2, y2, c='b')
        plt.scatter(x3, y3, c='g')
        plt.title('PCA Decomposition')
        plt.show()

    data_summary()
    occurrence_histogram()
    correlation_matrix()
    covariance_matrix()
    t_sne_scatter()
    pca_scatter()

Feature Engineering

Let’s think about the problem for a second. We have a scale with two weights; to determine whether the scale balances, we have to take both the weight and its distance from the centre into account. So let’s create two new features by multiplying the weight and distance on each side, with the following code:

def engineer_data(data):
    """
    Returns modified version of data with left and right aggregate features while dropping weight and distance features

    :param data: data to work with
    :return: modified dataframe
    """
    data['left'] = data['left_weight'] * data['left_distance']
    data['right'] = data['right_weight'] * data['right_distance']
    data = data.drop(['left_weight', 'left_distance', 'right_weight', 'right_distance'], axis=1)
    return data

We drop the old features as they won’t provide any more information than our new features do.

(Edit: I realized afterwards that multiplication was already mentioned in the dataset description. Lesson learned: read descriptions carefully).

Now let’s take another look at the t-SNE embedding and PCA decomposition.
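
One way to regenerate the plots is simply to rerun the exploration on the engineered frame (a sketch reusing the functions defined earlier):

engineered_data = engineer_data(prepared_data)
explore_data(engineered_data)  # reruns the summary, histograms, and the t-SNE/PCA scatters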

Much better!

Modelling

We start our modelling process by setting aside a test set that we don’t touch until the very end. This ensures we are not tuning our parameters to the test set and preserves the generalizability of the model. We then split the remaining working set into a train set and a validation set: we use the train set to train our models and the validation set to measure performance on held-out data and fine-tune them.
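
For illustration, a minimal sketch of that two-stage split (assuming features and labels hold the engineered feature columns and the class, as in the code further below; the stratify option is my own addition):

from sklearn.model_selection import train_test_split

# hold out a final test set first, then carve a validation set out of the working set
X_work, X_test, y_work, y_test = train_test_split(features, labels, stratify=labels)
X_train, X_val, y_train, y_val = train_test_split(X_work, y_work, stratify=y_work)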

Afterwards, it’s a bit more free-form: a process of experimenting with different models and finding which ones work best. My approach is to first pick models that would logically make sense for this set (e.g. SVM, KNN) and apply them in their vanilla, untuned forms. If a model performs reasonably well, I keep it; otherwise I drop it. In the end I’m left with a few well-performing models that I tune using grid search. It turns out an SVM (SVC in scikit-learn) performs perfectly. Furthermore, combining a set of weaker tuned models (MLP, KNN, and random forest) in a voting ensemble also gives pretty good performance.
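
A rough sketch of that first screening pass, continuing the hypothetical split above and using default hyperparameters:

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# fit untuned models and compare validation accuracy to decide which ones to keep
for name, model in [('svc', SVC()), ('knn', KNeighborsClassifier()),
                    ('random_forest', RandomForestClassifier()),
                    ('mlp', MLPClassifier())]:
    model.fit(X_train, y_train)
    print('{}: {:.3f}'.format(name, model.score(X_val, y_val)))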

Here’s the code to do all of this:

import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC


def model_data(data):
    features = data.drop(['balance'], axis=1)
    labels = data['balance']
    X_work, X_test, y_work, y_test = train_test_split(features, labels)
    # we don't touch the test set until the end.

    def grid_search(estimator, grid, X, y):
        gs = GridSearchCV(estimator, cv=5, n_jobs=-1, param_grid=grid)
        gs.fit(X, y)
        print(gs.best_params_)
        return gs.best_estimator_

    def support_vector(X, y):
        svc = SVC()
        grid = {
            'kernel': ['linear', 'poly', 'rbf', 'sigmoid']
        }
        return grid_search(svc, grid, X, y)

    def random_forest(X, y):
        rfc = RandomForestClassifier(n_jobs=-1)
        grid = {
            'n_estimators': np.arange(5, 15),
            'criterion': ['gini', 'entropy']
        }
        return grid_search(rfc, grid, X, y)

    def knn(X, y):
        knc = KNeighborsClassifier()
        grid = {
            'n_neighbors': np.arange(1, 10)
        }
        return grid_search(knc, grid, X, y)

    def perceptron(X, y):
        mlp = MLPClassifier()
        grid = {
            'activation': ['identity', 'logistic', 'tanh', 'relu']
        }
        return grid_search(mlp, grid, X, y)

    def vote(X, y):
        estimators = [
            ('random_forest', random_forest(X, y)),
            ('knn', knn(X, y)),
            ('perceptron', perceptron(X, y))
        ]
        vc = VotingClassifier(estimators, n_jobs=-1)
        vc.fit(X, y)
        return vc

    avg_train, avg_validate = 0, 0
    skf = StratifiedKFold(n_splits=5)
    for train_idx, val_idx in skf.split(X_work, y_work):
        # name each fold's held-out part X_val/y_val so it doesn't shadow the final test set above
        X_train, X_val, y_train, y_val = \
            X_work.iloc[train_idx], X_work.iloc[val_idx], y_work.iloc[train_idx], y_work.iloc[val_idx]
        model = vote(X_train, y_train)
        avg_train += accuracy_score(y_train, model.predict(X_train))
        avg_validate += accuracy_score(y_val, model.predict(X_val))
    print('train: {}'.format(avg_train / 5))
    print('validate: {}'.format(avg_validate / 5))

Evaluation

Now let’s see how our model does on the test set that we have purposely ignored until now. We gauge performance by measuring accuracy on both the training and the test set in order to detect overfitting should it occur: if the training accuracy is very high while the test set performs poorly, that is a strong indicator of overfitting.
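
The final check itself isn’t shown in the code above; a minimal sketch of it, assuming the fitted model and the held-out split from model_data are available (names are illustrative):

from sklearn.metrics import accuracy_score

def evaluate(model, X_train, y_train, X_test, y_test):
    # compare accuracy on data the model was trained on vs. the untouched test set
    print('train: {}'.format(accuracy_score(y_train, model.predict(X_train))))
    print('test: {}'.format(accuracy_score(y_test, model.predict(X_test))))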

In the end we find the following results for the voting classifier:

  • train: 1.0
  • test: 0.9872611464968153

and for the support vector machine classifier:

  • train: 1.0
  • test: 1.0

In this case, there really isn’t much need for a precision-recall curve, as it would be fairly uninteresting; in other circumstances, though, plotting the curve may yield additional information about the data and the model used.
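
For reference, a sketch of how per-class precision-recall curves could be plotted for a classifier that exposes predict_proba (illustrative only, not part of the original code):

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.preprocessing import label_binarize

def plot_precision_recall(X_train, y_train, X_test, y_test, classes=(0, 1, 2)):
    clf = RandomForestClassifier().fit(X_train, y_train)
    scores = clf.predict_proba(X_test)  # one probability column per class
    y_bin = label_binarize(y_test, classes=list(classes))
    for i, c in enumerate(classes):
        precision, recall, _ = precision_recall_curve(y_bin[:, i], scores[:, i])
        plt.plot(recall, precision, label='class {}'.format(c))
    plt.xlabel('recall')
    plt.ylabel('precision')
    plt.legend()
    plt.show()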

Summary

With this dataset, we learned the basics of exploring a dataset: we went through various visualizations, performed some simple feature engineering based on them, and finally played with various models and evaluated them against unseen data. It was a tiny toy dataset, but it was useful for reinforcing standard practices and establishing a foundation for more complex problems in the future.

Code: https://github.com/andrew-x/DataPlay/tree/master/BalanceScale

Dataset: https://archive.ics.uci.edu/ml/datasets/Balance+Scale
