Classification Algorithms: How to Approach Real-World Data Sets

Anurag Mukherjee
Published in SIGMA XI VIT
Nov 29, 2020 · 7 min read
Network Illustration by Alexey Brin from istockphoto.com

If you take a look at trending technologies these days, you can’t miss the buzzwords of AI and ML. Indeed, I’m pretty sure you would have heard of them even if you didn’t go out of your way to search for them. So hyped are these emerging concepts that almost every second college student is gushing over how amazing they are and how they will change our foreseeable future. And almost everyone who has ever started to learn machine learning will have come across the holy grail of beginner’s examples… the hallowed ‘Iris Dataset’. However, applying classification and regression techniques to real-world data is usually not so simple. This blog will take you through a real-world scenario to answer the ‘when’, ‘how’ and ‘which’ of machine learning classification algorithms.

The When

The first step to using a classification algorithm is to identify a situation or problem where it can be used. Classification algorithms, as their name suggests, are used to classify entities into categories, i.e. assign them a ‘target label’, based on already-classified entities whose labels the algorithm assumes to be correct. So here comes the first hurdle: to use a classification algorithm, you need a situation where you have a bunch of correctly pre-labelled data and an entity belonging to that dataset for which a label must be identified. Let’s take a look at this with an example.

What you are looking at is a table of instantaneous frequencies of a sinusoid sampled at 1 s intervals. Each row represents a unique wave with a fundamental frequency F (first column). Here, we can treat each column of inst. freq. as an independent feature and the fundamental freq. as the target label. So a classification algorithm can be used here to identify the fund. freq. of a wave whose inst. freqs. are known. A large part of dealing with such data is identifying what can be used as the target label and what can be assigned as the features.

The How

Now we have established a scenario where a classification algorithm is needed. But before we can begin with the coding, data processing is a very important aspect we must look at. This dataset, part of which you saw, has a total of 750 rows (the more the data, the more accurate the result; 750 rows is far too few for a production model, but enough for a prototype) and was generated by us as part of a project. Real-world data may have a lot of inconsistencies, such as missing and overlapping values. Thus, though not needed in this case, always process and ensure the quality of your dataset before continuing.
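Though our generated dataset did not need it, a quick missing-value check in pandas might look like this (a tiny illustrative frame with hypothetical values, not the project data):

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with one missing reading (hypothetical values).
df = pd.DataFrame({"F": [2.0, 4.0], "t0": [2.01, np.nan], "t1": [1.98, 4.02]})

print(df.isna().sum())   # count missing values per column
df_clean = df.dropna()   # drop incomplete rows (or impute with df.fillna)
```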

We will be using the scikit-learn library and Python 3 to build our classifiers.

We import our libraries and dataset, slice it, and divide it into parts which contain only the inst. freq. columns. The NumPy and Pandas libraries help with setting up the dataset as a data frame (a pandas data structure) which can be easily modified and worked with. Think of it as the Excel analogue of Python.
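Since the original code was shown as a screenshot, here is a minimal sketch of that setup, using a small synthetic table in place of the project’s dataset (all values and column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the project's table: each row is one wave, the first
# column is its fundamental frequency F, and the remaining columns are
# instantaneous frequencies sampled at 1 s intervals.
rng = np.random.default_rng(0)
fund = np.array([2.0, 4.0, 6.0, 8.0])                     # kHz, hypothetical
inst = fund[:, None] + rng.normal(0, 0.05, size=(4, 10))  # 10 samples per wave
df = pd.DataFrame(inst, columns=[f"t{i}" for i in range(10)])
df.insert(0, "F", fund)

# Slice the frame: features = instantaneous frequencies,
# target label = fundamental frequency.
X = df.iloc[:, 1:]
y = df.iloc[:, 0]
```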

After our data frame is ready, we further divide our data into training and testing sets and use these to build our very first classifier, a random forest classifier. The training set is used to fit the model, while the testing set is used to validate it. Here, we split the two in an 8:2 ratio. You can also use a 7:3 or 9:1 ratio, but 8:2 is what is generally used. The LabelEncoder function is used here to work around some dimension problems of the data frame; you can think of it as part of the pre-processing. Such steps are unique to a particular dataset. The parameter ‘n_estimators’ decides the number of decision trees in the classifier. The higher its value, the more accurate but slower the result.

Next, we create a sample wave of 10 instantaneous frequencies and use the model we just created to get a target label for this sample. Here, the classification algorithm returns a target label of ‘6.76’ (it’s in kilohertz) and tells us that this sample is closest to the 88th wave in our original dataset.
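The whole pipeline described above can be sketched as follows; the fundamental frequencies, sample values and random seeds are all hypothetical stand-ins for the project data, not the original code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy stand-in: 100 waves x 10 instantaneous frequencies (hypothetical).
rng = np.random.default_rng(42)
fund = rng.choice([2.0, 4.0, 6.0, 8.0], size=100)   # fundamental freqs (kHz)
X = fund[:, None] + rng.normal(0, 0.05, size=(100, 10))

# LabelEncoder maps the float labels to integer class indices,
# sidestepping the dimension issues mentioned above.
le = LabelEncoder()
y = le.fit_transform(fund)

# 8:2 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# n_estimators = number of decision trees: more trees are slower but
# generally more accurate.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Classify a new sample wave of 10 instantaneous frequencies.
sample = np.full((1, 10), 6.02)
predicted = le.inverse_transform(clf.predict(sample))[0]
```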

Congratulations, we have now set up our first model.

The Which

Here, we used the random forest classifier, but other models can be used too. From k-nearest neighbors (KNN) to support vector machines (SVM), each model has its own advantages and disadvantages.

Keeping the rest of the code intact, a few changes will establish the SVM and KNN models respectively. So now we must identify exactly which of these models we will actually use.
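Assuming a pipeline like the one sketched earlier, the swap might look like this (again on toy, hypothetical data; everything else, including the split and label encoding, stays the same):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Tiny toy data standing in for the wave features used earlier.
rng = np.random.default_rng(0)
y = rng.choice([0, 1, 2], size=60)
X = y[:, None] * 2.0 + rng.normal(0, 0.1, size=(60, 10))

# Drop-in replacements for RandomForestClassifier.
svm_clf = SVC(kernel="rbf").fit(X, y)
knn_clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
```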

KNN is the simplest and most robust of the models. It requires no training time and is usually the fastest to set up. Random forest is usually the most accurate, but proportionately slower. SVM is memory efficient but usually requires a clear distinction between classes. Let’s focus on SVM for now.

What you are looking at are heatmaps for rows 1–10 and 31–40 of the original dataset. As can be seen, rows 1–10 are sufficiently distinct, whereas rows 31–40 have almost indistinguishable values. So SVM, which requires clear class boundaries, will give ambiguous results here even though it seemed perfectly viable at first. Similarly, in choosing between KNN and random forest, you will have to prioritize between accuracy and speed. Thus, when applying a model to real-world data, it is imperative that you carefully analyze the data to find possible pitfalls, such as the near-identical rows in this case. Visualizing the data in the form of charts and graphs is often a great way to do so. Here, the heatmaps were generated using the matplotlib library in Python.
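Heatmaps of that kind can be reproduced along these lines; a toy random matrix stands in for the original dataset, and `imshow` is one simple way matplotlib can draw a heatmap:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Toy matrix in place of the original 750-row dataset (hypothetical values).
rng = np.random.default_rng(1)
data = rng.normal(size=(40, 10))

# Side-by-side heatmaps of two row ranges, as in the article's figures.
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].imshow(data[0:10], aspect="auto", cmap="viridis")
axes[0].set_title("Rows 1-10")
axes[1].imshow(data[30:40], aspect="auto", cmap="viridis")
axes[1].set_title("Rows 31-40")
fig.savefig("heatmaps.png")
```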

You must also keep in mind the end purpose of the algorithm: ‘what is it trying to achieve’, ‘under what conditions is it doing so’ and ‘why is it doing so’. Questions like these will help you identify the distinct advantages of one model over another for your specific dataset. And here I have talked about only three models. Depending on whether you need a discrete target label at all, even regression is viable. So you must study all possible models carefully before deciding on one.

The Approach

Now that we have gone over an example, let’s take a look at what we have learned so far and define the steps we took:

  1. Identify the situation: not every problem requires an ML classification algorithm.
  2. Obtain data: get yourself authentic, pre-labelled data.
  3. Analyze: each dataset has its own strengths and weaknesses. Find them and you will make your life much easier.
  4. Pre-process: ready your dataset for the model and ensure that minimum quality standards are met.
  5. Select a model: go through all possible models and select one based on your conclusions from steps 1 and 3.
  6. Implement the model: usually the easiest part of the process.
  7. Celebrate.

What Next

Here you have seen an example of how ML classification algorithms are applied to real-world data. This was actually part of a signals project on identifying vehicles using their characteristic Doppler-shift frequencies. The rest of the project involves MATLAB and Simulink code to simulate waves and generate the dataset we worked with.

Use the link below if you are interested in reading the full project report or if you want the dataset shown here.



Just another IT, electronics, research and anime enthusiast… weird combination, isn’t it?