An Introduction to Machine Learning Notes-1

Samet Girgin
Published in PursuitOfData · Jun 14, 2019

These notes on the fundamentals of machine learning and data science are the first in a series, and I hope to publish new ones regularly. You can reach all my blog articles about ML and DS from this link and the code I share in this GitHub link.

In fact, data science is mostly about turning business problems into data problems, then collecting, understanding, cleaning, and formatting data; after all of that, the machine learning itself is almost an afterthought.

What is a Model?:

1. Business model: a model of a business whose inputs are quantities like "number of users", "ad revenue per user", and "number of employees", and whose output is annual profit for the coming years. The business model is probably based on simple mathematical relationships: profit is revenue minus expenses, and revenue is units sold times average price.

2. Cookbook recipe: a model that relates inputs like "number of eaters" and "how hungry they are" to the quantities of ingredients needed. The recipe model is probably based on trial and error: someone went into a kitchen and tried different combinations of ingredients until they found one they liked.

3. Poker game: the players estimate each player's "win probability" in real time using a model that takes into account the cards revealed so far and the distribution of cards in the deck. The poker model is based on probability theory.
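The mathematical relationships in the business-model example translate directly into a tiny function. The function name and the numbers below are illustrative assumptions, not from the article:

```python
# A toy "business model" built from the relationships above:
# profit = revenue - expenses, revenue = units sold * average price.
def annual_profit(units_sold, avg_price, expenses):
    revenue = units_sold * avg_price  # revenue = units sold * average price
    return revenue - expenses         # profit = revenue - expenses

print(annual_profit(1000, 5.0, 3000.0))  # 2000.0
```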

What is Machine Learning?:

Machine learning can be defined briefly as learning from data; it is also known as predictive modeling or data mining.

1. Supervised models (in which there is a set of data labeled with the correct answers to learn from)

2. Unsupervised models (in which there are no such labels)

3. Semi-supervised models (in which only some of the data are labeled) and online models (in which the model needs to continuously adjust to newly arriving data)

Overfitting and Underfitting:

Overfitting produces a model that performs well on the training dataset but poorly on new data (for instance, test data); it often comes from fitting the noise in the data. Underfitting produces a model that doesn't perform well even on the training data.

Note: in the book's polynomial-fit example, a degree-0 fit underfits, a degree-9 fit overfits, and a degree-1 fit is a good regression.
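That comparison can be reproduced with a small experiment. This is an illustrative sketch, not the book's exact example: the data-generating process (noisy degree-1 data) is an assumption. We fit polynomials of degree 0, 1, and 9 on training points and measure error on held-out points:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 2 * x + 1 + rng.normal(0, 1, 30)  # underlying truth is degree 1, plus noise

x_train, y_train = x[:20], y[:20]     # fit on the first 20 points
x_test, y_test = x[20:], y[20:]       # evaluate on the held-out 10

def errors(degree):
    """Mean squared error on the train and test sets for a polynomial fit."""
    predict = np.poly1d(np.polyfit(x_train, y_train, degree))
    return (np.mean((predict(x_train) - y_train) ** 2),
            np.mean((predict(x_test) - y_test) ** 2))

for d in (0, 1, 9):
    train_err, test_err = errors(d)
    print(f"degree {d}: train={train_err:.2f} test={test_err:.2f}")
```

Degree 0 shows high error on both sets (underfitting); degree 9 drives the training error down but typically does worse than degree 1 on the test set (overfitting).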

First, split the dataset; for instance, two-thirds of it is used to train the model, after which we test the model's performance on the remaining third:

There is a matrix X of input variables and a vector y of output variables.

1- Define a function that splits the data randomly (e.g. split_data)

2- Define a train_test_split function that uses the split_data function
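A minimal from-scratch sketch of these two steps might look like the following. The names match the steps above, but the implementation details are assumptions, not necessarily the book's exact code:

```python
import random

def split_data(data, prob):
    """Split data into fractions [prob, 1 - prob], in random order."""
    data = data[:]                 # shallow copy so the caller's list is untouched
    random.shuffle(data)
    cut = int(len(data) * prob)    # the first `cut` items form the first split
    return data[:cut], data[cut:]

def train_test_split(xs, ys, test_pct):
    """Split paired inputs/outputs into train and test sets, keeping pairs aligned."""
    idxs = list(range(len(xs)))                        # split indices, not values,
    train_idxs, test_idxs = split_data(idxs, 1 - test_pct)  # so pairs stay together
    return ([xs[i] for i in train_idxs],   # x_train
            [xs[i] for i in test_idxs],    # x_test
            [ys[i] for i in train_idxs],   # y_train
            [ys[i] for i in test_idxs])    # y_test
```

For example, `train_test_split(xs, ys, 0.25)` puts 75% of the pairs in the training set and 25% in the test set.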

Alternatively, the scikit-learn function train_test_split splits the dataset automatically. The two lines below show how to split a dataset into training and test sets, where X is the input and Y is the output. In this example, 20% of the dataset will be test data and the remainder training data.

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
random_state: int, RandomState instance, or None (default None). If an int, random_state is the seed used by the random number generator; if a RandomState instance, it is used as the random number generator; if None, the generator is the RandomState instance used by np.random.

It is often recommended to divide the dataset into three parts: a training set to build models, a validation set to compare the trained models, and a test set to evaluate the final model.
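One simple way to produce the three parts is to shuffle once and slice; the 60/20/20 split below is an illustrative assumption, as is the function name:

```python
import random

def train_validation_test_split(data, val_pct=0.2, test_pct=0.2, seed=0):
    """Shuffle the data once, then slice it into train/validation/test parts."""
    data = data[:]                        # copy so the caller's list is untouched
    random.Random(seed).shuffle(data)     # seeded for reproducibility
    n = len(data)
    n_test = int(n * test_pct)
    n_val = int(n * val_pct)
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]         # the remainder is the training set
    return train, val, test

train, val, test = train_validation_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```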

Correctness:

Suppose our model makes a binary judgment: for instance, is this email spam or not? Given a set of labeled data and such a predictive model, every data point lies in one of four categories:

True positive: “This message is spam, and we correctly predicted spam.”

False positive (Type 1 Error): “This message is not spam, but we predicted spam.”

False negative (Type 2 Error): “This message is spam, but we predicted not spam.”

True negative: “This message is not spam, and we correctly predicted not spam.”

These four categories are laid out in what is called the confusion matrix:

Correct = tp + tn

Accuracy is defined as the fraction of correct predictions: (tp + tn)/(tp + fp + fn + tn)

Precision measures how accurate our positive predictions were: tp / (tp + fp)

Recall measures what fraction of the positives our model identified: tp / (tp + fn)

Sometimes precision and recall are combined into the F1 score (p:precision; r: recall): F1_score= 2 * p * r / (p + r)
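The four formulas above translate directly into code. The counts in the example below are made up for illustration:

```python
# Metrics computed from the raw counts of a confusion matrix.
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)  # harmonic mean of precision and recall

# Made-up example counts:
tp, fp, fn, tn = 70, 30, 30, 870
print(accuracy(tp, fp, fn, tn))       # 0.94
print(precision(tp, fp))              # 0.7
print(recall(tp, fn))                 # 0.7
print(round(f1_score(tp, fp, fn), 3))  # 0.7
```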

Usually, the choice of a model involves a trade-off between precision and recall, which is also a trade-off between false positives and false negatives: saying "yes" too often will give you lots of false positives; saying "no" too often will give you lots of false negatives.

The Bias-Variance Trade-off

High bias (which means it performs poorly even on your training data) and low variance typically correspond to underfitting.

Very low bias but very high variance (since any two training sets would likely give rise to very different models) corresponds to overfitting.

Again, recall the degree-0, degree-9, and degree-1 comparisons above regarding underfitting and overfitting.

If our model has high bias, we can try adding more features. Conversely, if it has high variance, we can remove features, though an even better solution is to obtain more data if possible.

Feature Extraction and Selection:

Features are the inputs we provide to our model. Too few features may result in underfitting, and too many may cause overfitting.

The book offers a good exercise for understanding feature extraction. Consider a spam filter: it needs to extract features that help it identify spam emails. For instance:

  • Does the email contain the word “Viagra”?
  • How many times does the letter d appear?
  • What was the domain of the sender?

The first question has a yes/no answer, the second question's answer is a number, and the third one's answer is a choice from a set of options. Naive Bayes is suited to yes/no features like the first, regression models require numeric inputs like the second, and decision trees can deal with numeric or categorical data, so they suit the third.
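A sketch of extracting the three features above from a raw email might look like this; the function name, field names, and the sample email are made up for illustration:

```python
def extract_features(sender, subject, body):
    """Turn one email into the three kinds of features described above."""
    text = subject + " " + body
    return {
        "contains_viagra": "viagra" in text.lower(),  # yes/no feature
        "count_d": text.lower().count("d"),           # numeric feature
        "sender_domain": sender.split("@")[-1],       # categorical feature
    }

features = extract_features("promo@example.com", "Cheap Viagra", "Deals today")
print(features)  # {'contains_viagra': True, 'count_d': 2, 'sender_domain': 'example.com'}
```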

How do we choose features? That’s where a combination of experience and domain expertise comes into play.

Reference:

Data Science from Scratch by Joel Grus Copyright © 2015 O’Reilly Media

Samet Girgin: Data Analyst, Petroleum & Natural Gas Engineer, PMP®