Introduction to Analytics Modeling: Week 1 — Classification

The first week's notes from Intro to Analytics Modeling! Check out the course on edX.

Udesh Habaraduwa
Udesh’s Data Science Notes
12 min read · Sep 27, 2019


Analytics

What does that mean? At its core, analytics is about asking questions:

Descriptive: What’s going on?

Predictive : What might happen?

Prescriptive: How can we influence what will happen?

Models

Models are mathematical ways of describing phenomena. They can be thought of as equations and formulas that are fit to observations of interacting variables.

This includes machine learning, regression, optimization, and more.

Cross-cutting

Cross-cutting can be thought of as the manipulation of data before and after being used in a model.

This includes data prep/cleaning, output quality checking, handling missing data, and more.

Classification

You may want to put data points into different classes. Say, for example, should a loan application be approved or not? Yes or no is a binary classification, because there are only two options.

You could also put the data points into one of many classes, generally three or more, like categorizing a song by genre; this is multi-class classification.

Fig 1

Let's assume that there is a relationship between the distance of a home from a group of fast food restaurants and the body weight of the town's inhabitants: the further people live from the restaurants, the lower their body weight. Let's say the orange points represent customers who buy more than twice a week and shouldn't be offered any discounts, and the green points represent customers who buy less than twice a week and so might be tempted with a discount. These would be the classifications, or classes, of the points, and the purple line is the classifier (decision boundary) that separates the two groups. As you can see, there is some error in this classification, as some customers have been misclassified.

Choosing a classifier

Now let's think about a different case: the loan application problem again.

Fig 2

Let's define our classes: the orange points are applications that are not approved and the green points are approved applications. In this example, we can see that as credit score and income increase, applications are more likely to be approved.

Let's say that the cost of approving a loan for someone who's really close to the margin may be greater than the benefit of gaining a great client, so we may want to be more conservative. The classifier can be nudged to reduce the risk of granting a loan to a very risky applicant.

On the other hand, if the cost and benefit are the same, it won't matter as much, and the line can be placed between the groups with as good a fit as possible.

In the examples above, the groups are not perfectly separable, which will be the case more often than not in the real world. The line makes a couple of mistakes but fits the data pretty well.

If a line is really close to some of the points, say the orange and green ones in this case, then the line is close to making mistakes on them. This means it is sensitive to small deviations in the data. In the example above, if someone earns just a little bit more, it might be enough to flip them into the green, approved class, even though in reality this might be a bad idea.

Fig 3

If the classifications looked like the image above, we can see that the only feature that actually matters is credit score, since moving up along the y-axis doesn't influence the classification of points.

Data Types

Structured Data and Unstructured Data

Fig 4

Structured data often looks like the image above when stored, for example, in a data frame in R.

Structured data is easily described and stored. It can be categorical — like genre of movie or country of origin — or quantitative — like body weight or temperature. As the name suggests, the information is stored in an easily accessible and organized way. The data here is already "cleaned", as in most publicly available datasets; in practice, cleaning data is a big part of preparing it for analysis.

Unstructured data is messy — like the real world! Information needs to be parsed out and stored in a way that a computer can work through.

Common types of data

Quantitative

The numbers themselves have meaning: sales, age, number of points, etc.

Categorical

The information has been placed in different groups based on attributes. Categorical data can be numeric (zip code, group number, age group, etc.) or non-numeric (hair color, sex, country, etc.).

Binary

The data can only be one of two categories (male/female, on/off, etc.).

Unrelated Data

The points are not connected to each other (loan applicants, customers, etc.).

Time Series Data

The same data recorded over a period of time (sales of a product over the year, stock prices, etc.).

Support Vector Machines

Fig 5

The support vector machine tries to maximize the margin between the classes of data.

Visit the link below for a really great explanation by Andrew Ng of why the SVM does what it does (check out his channel for everything AI education).

Fig 6

Let's define the following:

Then the solid line in the image above, our classifier, would be defined as follows, where a0 is the intercept (constant) term of the line.

fig 7

We can move the line by using different a0 values, for example if we wanted to be more conservative about giving out loans.

We'll define the labels (also known as responses) of the points as 1 and -1: -1 being orange (not approved) and 1 being green (approved).

fig 8

We are simply assigning one class the value 1 and the other the value -1. Imagine a data frame with one row per application and a column holding this binary label for each row: we label it 1 if the application was approved and -1 if it wasn't.
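
In symbols, a sketch using standard SVM notation, with x_i the attributes, a_i their coefficients, and a_0 the intercept (the exact symbols in figs 7 and 8 may differ slightly):

\[
\sum_{i=1}^{m} a_i x_i + a_0 = 0,
\qquad
y_j = \begin{cases} +1 & \text{green (approved)} \\ -1 & \text{orange (not approved)} \end{cases}
\]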

The distance between the margins is:

fig 9

The SVM tries to maximize the margin subject to the following constraint:

fig 10

The equation above is simply saying that we (as in the algorithm) have to choose a classifier (i.e., a set of coefficients for the features) that correctly classifies all the points while maximizing the margin (the equation in fig 9).

Remember: we are trying to find the best line that can classify this data. That means we are looking for the set of coefficients for all the factors (features, attributes) that gives us this line.
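
Putting it together, the hard-margin problem can be sketched in the same notation as above, with x_ij the value of attribute i for data point j:

\[
\max_{a_0, a_1, \ldots, a_m} \; \frac{2}{\sqrt{\sum_{i=1}^{m} a_i^2}}
\quad \text{subject to} \quad
\left(\sum_{i=1}^{m} a_i x_{ij} + a_0\right) y_j \ge 1 \;\; \text{for every data point } j.
\]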

Sum of Errors

What if the data is not perfectly classifiable?

Then we'll have to build a soft classifier, where some errors are allowed.

For each misclassified point, we can measure how far it is on the wrong side.

If a point is on the correct side, the following holds:

fig 11

If the point is on the wrong side, the following holds:

fig 12

How far the point is from the margin, calculated by the left-hand side of the equation, tells us how large an error we have made.

The error for data point j is:

fig 13

We get either 0 or the distance the point is from the margin, whichever is the larger of the two.

So the total error over all the points is:

fig 14

This is the sum of the errors over all data points j, from j = 1 up to n (the inner sum over the attributes i sits inside each error term).
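
Written out in the notation above, the error for point j and the total error can be sketched as:

\[
\text{error}_j = \max\left\{0,\; 1 - \left(\sum_{i=1}^{m} a_i x_{ij} + a_0\right) y_j\right\},
\qquad
\text{total error} = \sum_{j=1}^{n} \text{error}_j.
\]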

Balancing Error & Margin

So, now we know the following.

Error of each data point j :

fig 15

Total error:

fig 16

Margin size:

fig 17

So what we want to do is minimize a combination of the total error and the sum of the squared coefficients (minimizing the latter is the same as maximizing the margin).

We can control the balance between the amount of misclassification error and the size of the margin by using λ.

fig 18
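
In the notation above, the combined objective (fig 18) can be sketched as:

\[
\min_{a_0, a_1, \ldots, a_m} \;\; \sum_{j=1}^{n} \max\left\{0,\; 1 - \left(\sum_{i=1}^{m} a_i x_{ij} + a_0\right) y_j\right\} \;+\; \lambda \sum_{i=1}^{m} a_i^2
\]

A large λ emphasizes a wide margin; a small λ emphasizes making fewer classification errors.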

What Exactly Is a Support Vector?

Fig 19

Each data point is a vector, and the points that lie closest to the decision boundary are the ones that "support" it; these support vectors determine where the boundary sits.

For example if we take the shape below, it is supported by the points on the lines.

Fig 20

Advanced SVM

Recall the difference between :

Soft Classifiers — When it is impossible to separate between two classes perfectly. We allow some data points to be misclassified.

Hard Classifiers — Perfect separation between the classes, with no misclassified points.

In the case of perfect separation, it will look something like this:

Fig 21

Let the margins be -1 and 1. This will be important in a case where the costs of misclassification are not equal.

For example, if the cost of approving a loan for a bad candidate is much greater than the cost of missing out on a good one, we would want to shift the decision boundary so that it becomes harder to be classified as a good candidate.

If the decision boundary is defined as :

Fig 22

and giving out a bad loan is twice as costly as missing out on a good one, we can move the decision boundary like this:

Fig 23

And since we have defined the labels of our data points as the following :

fig 24

moving the line by -1/3 makes it harder for points to reach the ≥ 1 (approved) side of the threshold.

In the case of a soft classifier, we can introduce a multiplier M_j for each error:

Fig 25

Since the first part of the equation calculates the size of the error for a misclassified point, we can control the penalty it receives based on which side it has been misclassified onto. In the loan example, if a point was wrongly classified as a green point (good applicant), which is twice as costly, we can penalize a model that makes this mistake by amplifying that error.
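
Written out in the same notation (the 2-to-1 cost values are just this example's assumption), the cost-weighted total error would look something like:

\[
\sum_{j=1}^{n} M_j \max\left\{0,\; 1 - \left(\sum_{i=1}^{m} a_i x_{ij} + a_0\right) y_j\right\},
\qquad \text{with } M_j = 2 \text{ if } y_j = -1 \text{ (bad applicant)}, \; M_j = 1 \text{ otherwise.}
\]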

Scaling

Data needs to be brought onto the same scale across variables so that the size of one variable doesn’t dominate the others.

For example, income is measured in thousands to tens of thousands ($1,000 and up), whereas credit score is measured in hundreds (around 500, say). A small change in income could overshadow a change in credit score.

Which attributes matter?

Once the data has been scaled, we can look at the coefficients in our model to get an idea of which attributes actually matter.

Fig 26

As the value of a coefficient approaches 0, the effect of that variable diminishes.

Kernelized SVM and Logistic Regression

For example, in the graph that looked like this, we saw that income didn't really have any effect on the classifier:

Fig 27

Another benefit of SVMs is that they work just fine with a large number of dimensions (attributes).

There are some problems that may not be solvable using linear classifiers.

For these problems, we can use a kernelized SVM. As the name implies, we use a "kernel", a function that is better able to model patterns in the data when they are more complicated than a simple linear model.

We can also get probability answers to questions by using other methods, like logistic regression.
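
A minimal sketch in R of both ideas, assuming a hypothetical loans data frame with columns credit_score, income, and approved (a two-level factor); these names are illustrative, not from the course data:

```r
library(e1071)  # provides svm()

# Kernelized (non-linear) SVM with a radial basis function kernel.
svm_fit <- svm(approved ~ credit_score + income,
               data   = loans,
               kernel = "radial",  # swap for "linear" to get a linear classifier
               cost   = 1,         # error/margin trade-off knob, similar in spirit to lambda
               scale  = TRUE)      # scale the attributes before fitting
svm_class <- predict(svm_fit, newdata = loans)

# Logistic regression gives a probability of approval instead of a hard label.
logit_fit  <- glm(approved ~ credit_score + income, data = loans, family = binomial)
logit_prob <- predict(logit_fit, newdata = loans, type = "response")
```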

Scaling & Standardization

Let’s consider our example of housing loans.

Fig 28

First predictive factor: household income — on the order of 10^5

Second predictive factor: credit score — on the order of 10^2

In which case, the decision boundary would be defined as:

Fig 29

Let’s say for example the values are as follows:

Fig 30

Now let's assume that the values change a little bit.

Fig 31

We can see that a small change in the credit score made a significant change in the outcome. To have the same impact, the income value would have to change drastically.

When the data is not scaled, the model may be more sensitive to changes in one variable than in the others.
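
As a purely hypothetical numeric illustration (the actual values are in figs 29 through 31 and are not reproduced here), suppose the boundary were

\[
0.0001 \cdot \text{income} + 1 \cdot \text{credit score} = 700.
\]

A 10-point change in credit score moves the left-hand side by 10, while income would have to change by $100,000 to have the same effect.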

Adjusting the data — scaling

Scaling by factor

Scaling by factor means bringing all the data into the same interval, usually between 0 and 1.

Fig 32

We can define the following:

Fig 33
Fig 44

Here, we are defining the maximum and minimum values the factor can take.

Next, we can scale each attribute using these max and min values.

Fig 45

General scaling to an interval [a, b] is defined as:

Fig 46
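
A minimal base-R sketch of both versions; the helper names and the example values are just illustrative:

```r
# Scale a numeric vector to [0, 1] (min-max scaling).
scale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# General scaling to an arbitrary interval [a, b].
rescale_ab <- function(x, a = 0, b = 1) {
  a + (x - min(x)) * (b - a) / (max(x) - min(x))
}

incomes <- c(45000, 52000, 120000, 38000)   # hypothetical values
scale01(incomes)                            # 0.0854, 0.1707, 1.0000, 0.0000
rescale_ab(incomes, a = -1, b = 1)          # same data rescaled to [-1, 1]
```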

Standardizing

Standardizing scales the data so you can see how far a data point is from the mean in units of standard deviations. This is commonly done by scaling to mean = 0 and standard deviation = 1.

So if factor j has a mean μj,

Fig 47

The mean of factor j is the sum of all its values divided by the number of data points.

If σj is the standard deviation of factor j, then for each data point i,

Fig 48
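
A minimal sketch of standardization in base R; the column names and values are hypothetical:

```r
# Standardize one vector by hand: subtract the mean, divide by the standard deviation.
credit_score <- c(580, 640, 720, 800)
credit_std   <- (credit_score - mean(credit_score)) / sd(credit_score)

# Or let base R's scale() standardize every column of a data frame at once.
loans_std <- scale(data.frame(credit_score = credit_score,
                              income       = c(45000, 52000, 120000, 38000)))
colMeans(loans_std)        # roughly 0 for every column
apply(loans_std, 2, sd)    # 1 for every column
```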

Adjusting the data — which method to use?

Scaling is a good idea when you need data in a bounded range. Sometimes the data itself might already be within a bounded range (SAT scores, RGB colors, etc.).

Standardization is better for other cases, such as principal component analysis and clustering.

We will most likely have to try both and test which one works best.

Either way, scaling the data is VERY IMPORTANT AND YOU SHOULD ALWAYS DO IT!

KNN — K Nearest Neighbor Classification

Let’s consider our running example for bank loan approvals.

Let's say we want to classify a new data point as green or orange based on its 5 nearest neighbors.

The KNN classifier would look at the 5 points closest to the new one and assign it the majority class: if, say, 3 out of 5 neighbors are green, the new point is classified as green.

This approach works even when you have more than two classes.
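
A minimal sketch in R using the class package's knn(); the data frame, its columns, and the values are all hypothetical, and the attributes are standardized first because KNN is distance-based:

```r
library(class)  # provides knn()

# Hypothetical training data: two attributes and a class label per applicant.
train_x <- scale(data.frame(credit_score = c(580, 640, 700, 720, 800, 610),
                            income       = c(45000, 52000, 90000, 110000, 150000, 40000)))
train_y <- factor(c("orange", "orange", "green", "green", "green", "orange"))

# A new applicant, scaled with the training set's center and spread.
new_x <- scale(data.frame(credit_score = 690, income = 85000),
               center = attr(train_x, "scaled:center"),
               scale  = attr(train_x, "scaled:scale"))

# Classify by majority vote among the 5 nearest neighbors.
knn(train = train_x, test = new_x, cl = train_y, k = 5)
```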

Some things to keep in mind about KNN:

Distance metrics: Generally, the straight-line (Euclidean) distance between points is used: the square root of the sum of squared differences across the attributes.

Fig 49

Relative importance of attributes: Some attributes may be more important than others, so an unweighted distance between points might not be good enough.

One way to deal with this is to weight each dimension's contribution to the distance: as an attribute's weight increases, so does its influence on the distance.

Unimportant attributes: Some attributes might not matter at all (weight = 0)

Choosing a good k-value: We'll have to try different values of k to find a good one. And for some problems, KNN might not be a good solution at all.
