ogi_on_ds - Medium

Introduction to Machine Learning 1: What is Machine Learning?

Ogulcan Ertunc — Tue, 09 Mar 2021 15:52:01 GMT

https://carlolepelaars.nl/2018/10/15/100daysofmlcode-summary/

When most people hear “Machine Learning”, they picture a robot: a deadly Terminator, or a dangerous computer that wants to destroy humanity, just like in The Matrix.

But welcome to reality: Machine Learning is not just a futuristic fantasy; it’s already here. In fact, it has been around for decades in some specialized applications, such as the recommendation systems of online movie and series platforms.

But in my opinion, the first ML application that really became indispensable was the spam filter. It’s not exactly Skynet or an Agent from The Matrix, but it does technically qualify as Machine Learning. It has actually learned so well that you seldom need to flag an email as spam anymore. So grab a coffee and let’s get started!

What is Machine Learning?

Photo by Andy Kelly on Unsplash

Machine Learning is the science (and art) of programming computers so they can learn from data.

Field of study that gives computers the ability to learn without being explicitly programmed.

— Arthur Samuel, 1959

In the spam filter example, the Machine Learning program learns to flag spam from two kinds of examples: spam emails that users have already flagged, and regular emails that users have not flagged (non-spam, also called “ham”). The examples that the system uses to learn are called the training set.

Each training example is called a training instance (or sample). In this case, the task is to flag spam among new emails, the experience is the training data, and the performance measure still needs to be defined: for example, you can use the proportion of correctly classified emails. This particular performance measure is called accuracy, and it is often used in classification tasks. However, accuracy can be misleading, especially when one class is much rarer than the other: if only a few emails are spam, a filter that never flags anything still gets a high accuracy. In such cases it is better to also look at precision, the share of emails flagged as spam that really are spam. I will explain how to measure the success of these techniques in more detail later.
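To make the difference between accuracy and precision concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (100 emails, 5 of them spam), and the two "filters" are hypothetical prediction lists, not real models.

```python
def accuracy(y_true, y_pred):
    """Share of emails whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Share of emails flagged as spam that really are spam."""
    flagged = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not flagged:
        return 0.0  # convention used here: nothing flagged means precision 0
    return sum(t == positive for t in flagged) / len(flagged)


# Made-up data: 1 = spam, 0 = ham. Only 5 of 100 emails are spam.
y_true = [1] * 5 + [0] * 95

# Hypothetical filter 1: never flags anything.
lazy_filter = [0] * 100

# Hypothetical filter 2: flags 4 emails, 2 of which really are spam.
real_filter = [1, 1, 0, 0, 0] + [1, 1] + [0] * 93

print(accuracy(y_true, lazy_filter), precision(y_true, lazy_filter))  # 0.95 0.0
print(accuracy(y_true, real_filter), precision(y_true, real_filter))  # 0.95 0.5
```

Both filters reach the same accuracy, even though the first one never catches a single spam email. Precision is one of the measures that tells them apart.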

So Why Use Machine Learning?

Sometimes a problem is difficult, complicated, and repetitive, and you don’t want to spend your time solving it by hand. Instead, you can build a Machine Learning program and, if you train it on enough data, it can solve the problem for you. Easy, right?

So Machine Learning is great for:

Problems where existing solutions require a lot of tweaking or long lists of rules: a Machine Learning algorithm can often simplify the code and outperform the traditional approach. However, training may take a long time or require a lot of computing power.
Complex problems where the traditional approach does not provide a good solution: the best Machine Learning techniques can perhaps find one.
Volatile environments: a Machine Learning system can adapt to new data.
Gaining insights into complex problems and large amounts of data.

Best and Basic Examples of Applications

Let’s look at some concrete examples of Machine Learning tasks, along with the techniques that can tackle them:

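Detecting Covid-19 from a chest X-ray
Deciding whether the whole image shows a Covid patient is image classification; marking the affected lung regions pixel by pixel is semantic segmentation, where each pixel in the image is classified. Both are typically tackled with convolutional neural networks (CNNs).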
Automatically classifying Medium articles
This is natural language processing (NLP), and more specifically text classification, which can be tackled using recurrent neural networks (RNNs), CNNs, or Transformers.
Forecasting a company’s revenue for next year, based on many performance metrics
This is a regression task that may be tackled using any regression model, such as Random Forest Regression, Polynomial Regression, or an artificial neural network. If you want to take sequences of past metrics into account, you can also try Transformers.
Detecting credit card fraud
This is anomaly detection.
Customer segmentation
This is clustering: segmenting clients based on their purchases so that you can design a different marketing strategy for each segment.

Wow, almost every example uses a different algorithm. What are the types of Machine Learning?

Actually, there are so many different ML algorithms that it is useful to group them into broad categories. Basically, there are four main types.

Supervised Learning
In supervised learning, the dataset is a collection of labeled examples. The purpose of a supervised learning algorithm is to use the dataset to produce a model that takes a feature vector x as input and outputs information that allows the label for that feature vector to be deduced.
Unsupervised Learning
In unsupervised learning, the dataset is a collection of unlabeled examples. Again, x is a feature vector, and the purpose of an unsupervised learning algorithm is to create a model that takes the feature vector x as input and converts it into either another vector or a value that can be used to solve a practical problem.
Semi-supervised Learning
In semi-supervised learning, the dataset contains both labeled and unlabeled examples.
Usually, the number of unlabeled examples is much higher than the number of labeled examples. The goal of a semi-supervised learning algorithm is the same as the goal of a supervised learning algorithm. The hope here is that using many unlabeled examples can help the learning algorithm to find (we might say “produce” or “compute”) a better model.
Reinforcement Learning
Reinforcement learning is a subfield of Machine Learning where the machine “lives” in an environment and can perceive the state of that environment as a feature vector. The machine can execute actions in every state. Different actions bring different rewards and can also move the machine to another state of the environment. The purpose of a reinforcement learning algorithm is to learn a policy: a function that takes the feature vector of a state as input and outputs the best action to execute in that state.
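The difference between the first two types can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the feature vectors and labels are made up, the supervised model is a nearest-centroid classifier, and the unsupervised model is a tiny two-cluster k-means. Neither is meant as a production algorithm.

```python
from statistics import mean


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def centroid(vectors):
    """Mean feature vector of a group of vectors."""
    return [mean(col) for col in zip(*vectors)]


# --- Supervised: every feature vector x comes with a label y (made-up data).
labeled = [([1.0, 1.2], "ham"), ([0.8, 1.0], "ham"),
           ([4.0, 3.8], "spam"), ([4.2, 4.1], "spam")]


def fit_centroids(examples):
    """Learn one centroid per label from the labeled examples."""
    by_label = {}
    for x, y in examples:
        by_label.setdefault(y, []).append(x)
    return {y: centroid(xs) for y, xs in by_label.items()}


def predict(model, x):
    """The model maps a feature vector x to the label of the closest centroid."""
    return min(model, key=lambda y: sq_dist(model[y], x))


model = fit_centroids(labeled)
print(predict(model, [3.9, 4.0]))  # spam


# --- Unsupervised: the same vectors without labels; the output is a cluster id.
def two_means(points, steps=10):
    centers = [min(points), max(points)]  # deterministic starting centers
    for _ in range(steps):
        groups = [[], []]
        for p in points:
            nearest = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[nearest].append(p)
        centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
    return [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            for p in points]


unlabeled = [x for x, _ in labeled]
print(two_means(unlabeled))  # [0, 0, 1, 1]
```

The supervised model answers with a label it has seen during training, while the unsupervised one only returns a cluster number that we still have to interpret ourselves.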

References

The Hundred-Page Machine Learning Book, Andriy Burkov
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, Aurélien Géron
https://www.expert.ai/blog/machine-learning-definition/

Introduction to Machine Learning 1: What is Machine Learning? was originally published in ogi_on_ds on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Logic Behind A/B Testing with Sample Python Code

Ogulcan Ertunc — Sat, 06 Feb 2021 22:56:59 GMT

In this article, we will evaluate the advertising methods of a large company.

The data we have covers the company’s new advertisement proposal method and its old one. With this data, the company wants to compare the old (current) method with the new method. We will look for a conclusion about which method is more successful, so that the company can continue with the method we choose and earn more profit or get more impressions and clicks.

Now we want to compare the performance of the two methods.

Should we just “observe” the performance of these two methods and conclude?

Of course not. We should choose a more methodical, statistical approach to compare the performance of the two methods. So what is this approach?

As the title suggests, the statistical method we will use is the A/B test.

So, what is an A/B test?

Today, an A/B test is “a randomized online experiment that consists of two variants, A and B”. The test quantitatively compares two samples on a single chosen metric to determine whether there is a statistically significant difference between them.

In essence, it is a modern, online adaptation of the classical statistical framework known as hypothesis testing.

So how should we apply the test?

0. Importing libraries and necessary data

# Data handling, plotting, and statistical tests
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statistics
from scipy import stats
from scipy.stats import shapiro
import statsmodels.stats.api as sms
import warnings
warnings.filterwarnings("ignore")

# Each group is stored on its own sheet of the same Excel workbook
control_data = pd.read_excel("Lectures/Week 5/Dosyalar/ab_testing_data.xlsx", sheet_name="Control Group")
test_data = pd.read_excel("Lectures/Week 5/Dosyalar/ab_testing_data.xlsx", sheet_name="Test Group")

# Tag every row with its group so the two tables can be stacked later
control_data["Group"] = "A"
test_data["Group"] = "B"

# is_any_outlier is a helper from the accompanying notebook; its definition is not shown in this article
is_any_outlier(control_data, "Purchase")
is_any_outlier(test_data, "Purchase")
control_data.head()
test_data.head()
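The is_any_outlier helper is not defined in the article. The sketch below is one hypothetical way such a check could look, assuming it uses the common interquartile range (IQR) rule; the notebook’s real implementation may differ, and the toy values are made up.

```python
import pandas as pd


def is_any_outlier(dataframe, column, q1=0.25, q3=0.75, k=1.5):
    """Hypothetical stand-in: True if `column` has values outside the IQR fences."""
    quartile1 = dataframe[column].quantile(q1)
    quartile3 = dataframe[column].quantile(q3)
    iqr = quartile3 - quartile1
    low_limit = quartile1 - k * iqr
    up_limit = quartile3 + k * iqr
    outside = (dataframe[column] < low_limit) | (dataframe[column] > up_limit)
    return bool(outside.any())


# Made-up numbers, only to show the call
toy = pd.DataFrame({"Purchase": [500, 520, 540, 560, 5000]})
clean = pd.DataFrame({"Purchase": [500, 520, 540, 560, 580]})
print(is_any_outlier(toy, "Purchase"))    # True
print(is_any_outlier(clean, "Purchase"))  # False
```

Checking for outliers first matters here because the t-test compares means, and a few extreme Purchase values can shift a mean noticeably.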

1. Preferred Metrics

Normally we should look at the metric through which an advertising method is expected to bring us conversions, i.e. the “Click-Through Rate”. We do not have a column that presents it directly, so we need to compute Click / Impression. Before that, we will analyze the Purchase column.

If we define the click-through rate (CTR) as phat, then phat = Click / Impression.

We’ll also set a significance level, α, which is the threshold we will compare p-values against when deciding between hypotheses. I prefer the generally accepted value of 5%: α = 0.05.

2. Applying the Hypothesis Tests

Before applying the hypothesis test, we first have to check its assumptions. Here are the ones we will consider:

Assumptions of normality
Homogeneity of variance

2.1 Let’s prepare our table and check the assumption of normality

# Stack both groups into one table (DataFrame.append was removed in pandas 2.0)
AB_test = pd.concat([control_data, test_data], ignore_index=True)
# Shapiro-Wilk normality test on Purchase, run separately for each group
test_statistics, pvalue = shapiro(AB_test.loc[AB_test["Group"] == "A", "Purchase"])
print('Test Statistics is  %.4f, p-value = %.4f' % (test_statistics, pvalue))
test_statistics, pvalue = shapiro(AB_test.loc[AB_test["Group"] == "B", "Purchase"])
print('Test Statistics is  %.4f, p-value = %.4f' % (test_statistics, pvalue))

The Shapiro-Wilk test uses these hypotheses: H0 says the data follow a normal distribution, and H1 says they do not. We decide as follows: if the p-value is < 0.05, H0 is rejected; otherwise, H0 cannot be rejected.

When we apply this test to our dataset, we cannot reject the H0 hypothesis because the p-value is not less than 0.05 in either group A or group B. Therefore, the normality assumption is satisfied.

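2.2 Assumption of homogeneity of variance

First, let’s state our hypotheses about the homogeneity of variance. H0: the variances of the two groups are homogeneous. H1: the variances of the two groups are not homogeneous.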

stats.levene(AB_test.loc[AB_test["Group"] == "A", "Purchase"],
             AB_test.loc[AB_test["Group"] == "B", "Purchase"])

Here, we applied Levene’s test with the levene function to groups A and B over the Purchase variable. The test gave a p-value of 0.10. Since this value is not less than 0.05, H0 cannot be rejected, so we treat the variances as homogeneous.

Since both assumptions are satisfied, we will conduct an independent two-sample t-test (a parametric test).
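The assumption checks decide which comparison test is appropriate. The sketch below shows one common way to wire this up with SciPy. The fallback choices (Welch’s t-test when variances differ, Mann-Whitney U when normality fails) are standard options but are my assumption, not something the article’s dataset needed. The function name compare_groups and the synthetic data are made up for illustration.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05


def compare_groups(a, b, alpha=ALPHA):
    """Pick a two-sample test based on the assumption checks, then run it."""
    # Normality: Shapiro-Wilk on each group (H0: data are normally distributed)
    normal = all(stats.shapiro(g)[1] >= alpha for g in (a, b))
    if not normal:
        # Non-parametric alternative when normality is rejected
        name = "Mann-Whitney U"
        _, pvalue = stats.mannwhitneyu(a, b, alternative="two-sided")
    else:
        # Homogeneity of variance: Levene (H0: variances are equal)
        equal_var = stats.levene(a, b)[1] >= alpha
        name = "Student t-test" if equal_var else "Welch t-test"
        _, pvalue = stats.ttest_ind(a, b, equal_var=equal_var)
    return name, pvalue


# Synthetic stand-ins for the Purchase columns (not the article's data)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=550, scale=130, size=40)
group_b = rng.normal(loc=580, scale=160, size=40)

name, pvalue = compare_groups(group_a, group_b)
print(name, round(float(pvalue), 4))
```

In the article’s case both checks pass, so this logic would end up at the same place: the Student t-test with equal_var=True.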

3. (AB Test) Independent samples t-test

Our hypotheses for the t-test are: H0, there is no statistically significant difference between the average Purchase values of groups A and B; H1, there is such a difference.

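# Independent two-sample t-test on Purchase, assuming equal variances (see the Levene result)
test_statistics, pvalue = stats.ttest_ind(AB_test.loc[AB_test["Group"] == "A", "Purchase"],
                                           AB_test.loc[AB_test["Group"] == "B", "Purchase"],
                                           equal_var=True)
print('Test Statistics is  %.4f, p-value = %.4f' % (test_statistics, pvalue))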

As a result of the t-test, our p-value is 0.34 (p-value > 0.05), so we cannot reject the H0 hypothesis.

Long story short, when we examine the values of the two methods, we can say that there is no need to switch to the new method, since there is no statistically significant difference between the two group averages. However, since our data set is not large, it would not be right to make this decision so soon. With more observations, the test could detect smaller differences and the result would be more reliable.
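How many observations would be “enough” depends on how small a difference we want to be able to detect. A rough sketch, using the normal approximation for a two-sided two-sample test: the function name n_per_group, the target power of 80%, and the example effect sizes are my assumptions, not values from the article.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate observations needed per group (normal approximation).

    effect_size is the standardized difference between the group means
    (difference divided by the standard deviation, i.e. Cohen's d).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


print(n_per_group(0.5))  # 63
print(n_per_group(0.2))  # 393
```

The smaller the difference we care about, the more observations each group needs, which is why a non-significant result on a small sample is not a final verdict.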

Bonus: Now let’s take a quick look at the click-through rate.

Click-Through Rate (CTR)
It is the ratio of users who click the ad to the users who see it on the website.
CTR = Clicks / Impressions

control_data['ctr'] = control_data["Click"]/control_data["Impression"]
test_data['ctr'] = test_data["Click"]/test_data["Impression"]

# ogi_AB is a helper from the accompanying notebook; its definition is not shown in this article
ogi_AB(control_data, test_data, "ctr")

Group A has the higher click-through rate in the figures we see so far. At first glance, the click-through rate favors the control group: while the ad is showing, site visitors seem to click more often under the current system.

But this is only a first look. To know whether the difference is statistically significant, we need to run our A/B test on it.

I would like to express my sincere thanks to Vahit Keskin and my mentor Atilla Yardimci, who helped me complete this project and taught me along the way.
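One common way to test a difference between two rates is a two-proportion z-test on the total clicks and impressions of each group. Note that this is a different technique from the per-row t-test used above, and the counts below are made up for illustration, not taken from the dataset.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(clicks_a, impressions_a, clicks_b, impressions_b):
    """Two-sided z-test for H0: both groups have the same click-through rate."""
    ctr_a = clicks_a / impressions_a
    ctr_b = clicks_b / impressions_b
    # Pooled CTR under H0 (both groups share one rate)
    pooled = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / impressions_a + 1 / impressions_b))
    z = (ctr_a - ctr_b) / std_error
    pvalue = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, pvalue


# Hypothetical totals: group A gets 80 clicks, group B 50, on 1000 impressions each
z, pvalue = two_proportion_ztest(80, 1000, 50, 1000)
print(round(z, 3), round(pvalue, 4))
print("Reject H0" if pvalue < 0.05 else "Cannot reject H0")
```

With real data, the inputs would be the summed Click and Impression columns of each group, and the decision rule is the same as before: reject H0 only if the p-value is below 0.05.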

The notebook associated with this article is on Github if you want to follow along.

The Logic Behind A/B Testing with Sample Python Code was originally published in ogi_on_ds on Medium, where people are continuing the conversation by highlighting and responding to this story.