# Naive approaches to solving CIFAR10 dataset image classification problem — overview and results

When solving a general image classification problems you can use variety of tools at hand. As I wrote in the previous article, using so-called *naive* classifiers is an interesting preprocessing step. If we’d be able to separate even a single class using such simple tools, it’d make the final model (which is a convoluted neural network) smaller — thus easier and faster to learn.

I’ve tried to apply two tools here — one is the k-means algorithm, used for solving *grouping* problem. The second one is SGD classifier.

#### Grouping versus Classifying

There is a distinction between problems of *grouping (or clustering) *and *classifying* things. In the problem of classification, resulting classes are given upfront and they’re given as a part of your training set. This information is used in the algorithm to train a given classifier. Then, such trained classifier gets applied to the unknown samples.

On the other hand, *grouping *is a problem of partitioning the given data set into distinct groups. You don’t know the *meaning* of such groups — but you know they’re statistically intentional.

In fact, those problems are not connected in a simple way. But you can use a grouping tools to analyse the set to get some knowledge about the data set which can be helpful in building the classifier.

#### K-means algorithm — an overview

One of the most popular algorithms available for grouping problem is k-means (also called a centroid algorithm). The general idea behind the algorithm is very simple, yet it produces nice results when data tends to form circle-shaped clusters.

Given a set of *n *data samples the centroid algorithm computes *centroids *— which are data samples being “centers” of a group. Such *k* centroids are used to assign each data sample into one of *k *groups. The criterion is very simple — each data sample is getting assigned to the group which is represented by the centroid which is closest to it (in terms of Euclidean distance). Then new centroids are computed by taking a mean of data sample attributes within each group. The process ends after certain number of iterations or when there is no change at all after computing next iteration.

I decided to perform such grouping on the CIFAR10 dataset by perform grouping of images into 10 separate groups. But what was my motivation?

- After performing a grouping, I’d like to compute a quantity which tells me how diverse those groups are in terms of labels. Let’s assume that k-means generated a group which has 1000 images. If it is
*perfectly diverse*in terms of labels (so within those 1000 images there are 100 images of cars, 100 images of trucks and so on) there is a little value from such group for further classification problem. There are quantities that can tell me how*diverse*is such group. - Then I’d like to see what is
*mode*inside each group (the most represented label inside the group). If I’d have a high value of previous quantity and the resulting*mode*is like 1, 2, 3, …, 10 (so each two groups resulting mode is different) it’d classify such information as useful for further classification problem.

You’re curious what quantity I’ve used to check for diversity of resulting groups. It is called *Gini index* and is widely used in economics. First I compute counts of labels within a given group. This is achievable with a Python code like this (where *group_labels* is labeling made by k-means and *train_labels* are real labels from the training set):

Then you need to divide it by amount of items within a group. The equation for Gini index is like below:

Where *i*-th *p *is the count of data sample labeled by *i*-th class within this group divided by the total count of data samples within the group.

This index is close to 0 if a group is diverse and close to 1 if the group is dominated by one particular label. This is mainly used in Decision Tree classifiers (for example in CART algorithm), but I found it useful here.

Running the grouping and printing out Gini indexes shown me the following results (I ran k-means 5 times):

I was quite surprised. That meant that I have quite homogenous groups computed! I only needed to check what representatives in terms of labels those groups have. Unfortunately, the results was quite bad:

So there was no representatives of some labels (like 7th label) and there are repeats inside the representatives list (like 0th label represented two times). Ouch. Fortunately k-means computes quite fast (a couple of minutes) on my PC so it was not waste.

The code responsible for those steps is presented below:

That concluded my adventure with k-means. Of course many other steps can be done here and if I find some time I’ll do it — like preprocessing the data with kernel trick or performing LLE/PCA/kPCA dimension reduction algorithms on this dataset.

#### Constructing SGD classifier

The next try I did was implementing the SVM classifier. Since they tend to compute quite longer than, for example, k-means. I stumbled upon the SGD method which can be paralellized and works with the same error function as linear SVM method. Unfortunately without prior preprocessing this method achieves like 72% of errors.

The code is available below (`squared_hinge` tends to get a little better results than default `hinge` one):

#### Summary

Using naive classification and analysis mechanisms and achieving poor results was a proof that CIFAR10 tends to be a little harder dataset to grasp on. I didn’t do much, frankly, but such easy steps can be easily used by you for simpler datasets like Iris.

My next step would be to create a neural network based on the idea of convolutional neural networks. I’ll use Theano and Lasagne for that. Stay tuned!

BTW. You can see my code here, on GitHub.