Intro to Machine Learning and the Different Types of Data Processing for Models

Sayed Ahmed
Let’s Deploy Data.
12 min read · Jul 20, 2020

This article introduces you to machine learning, gives you a detailed idea of how different types of data (for instance, images and text) are processed before being fed into machine learning models, explains how a machine learning model learns, and highlights different approaches to machine learning.

[This is my note from the Microsoft Azure Machine Learning Scholarship with Udacity 2020 Program’s Lesson 2: Introduction To Machine Learning.]

DEFINITION:

Machine learning is a

  • data science technique
  • used to extract patterns from data, allowing computers to identify related data,
  • and forecast future outcomes, behaviors, and trends.

ML vs TRADITIONAL PROGRAMMING:

In ML,

  • Rules are generated by the algorithm
  • Historical data and answers are used as input to train the algorithm
  • The trained model then gives answers for unseen data

In Traditional Programming,

  • Rules are explicitly programmed

AI vs ML vs DEEP LEARNING:

  • Artificial Intelligence — Human Intelligence Exhibited by Machines
  • Machine Learning — An Approach to Achieve Artificial Intelligence
  • Deep Learning — A Technique for Implementing Machine Learning

DEEP LEARNING VS ML:

  • All deep learning algorithms are machine learning algorithms but not all machine learning algorithms are deep learning algorithms.
  • Deep learning algorithms are based on neural networks
  • Classical ML algorithms are based on classical mathematical techniques, such as linear regression, logistic regression, decision trees, SVM, and so on

Deep learning advantages:

  • Suitable for high complexity problems
  • Better accuracy, compared to classical ML
  • Better support for big data
  • Complex features can be learned

Deep learning disadvantages:

  • Difficult to explain or interpret the trained model
  • Require significant computational power

Classical ML advantages:

  • More suitable for small data
  • Easier to interpret outcomes
  • Cheaper to perform
  • Can run on low-end machines
  • Does not require large computational power

Classical ML disadvantages:

  • Difficult to learn from large datasets
  • Require feature engineering
  • Difficult to learn complex functions

DATA SCIENCE PROCESS:

To train a machine learning model, data has to go through several phases: it may need preprocessing, scaling, encoding, cleansing, and so on. The sections below walk through this data preparation for the different types of data.

MODEL VS ALGORITHMS

Algorithm

  • An algorithm is used to train a model
  • An algorithm is either a pre-built or customized code stack, in the form of a function or package, which is used to learn from data so as to create a model/equation
  • Algorithms are the processes of learning
  • We can think of the algorithm as a function: we give the algorithm data and it produces a model:

Model = Algorithm(Data)

Model

  • Models are the specific representations learned from data
  • A model is the result of applying an algorithm to data
  • When you train an “algorithm” with data, it becomes a “model”
  • For example, y = Wx + b is an algorithm; we can calculate y from x if the values of W and b are known

We can plug the data into the algorithm and calculate, for example, W = 1 and b = 0.

Once the values of the weights (or coefficients) are learned, the model is able to predict y without requiring us to input the coefficients/weights.

  • So, an algorithm becomes a model when it learns the values of the coefficients/weights and can give you y when x is given, without you needing to input W and b
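To make this concrete, here is a minimal sketch using scikit-learn; the toy data points (which happen to produce W = 1 and b = 0) are an illustrative assumption, not from the original lesson:

```python
# Algorithm + data -> model: the algorithm (linear regression) learns W and b.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])    # inputs x
y = np.array([1, 2, 3, 4])            # outputs y = 1*x + 0

algorithm = LinearRegression()        # the algorithm: y = Wx + b, W and b unknown
model = algorithm.fit(X, y)           # training learns W and b from the data

print(model.coef_, model.intercept_)  # ~[1.] and ~0.0, i.e. W = 1, b = 0
print(model.predict([[5]]))           # predicts y for unseen x: ~[5.]
```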

TYPES OF DATA FOR MODEL TRAINING

Data can be divided into 6 categories.

6 TYPES OF DATA:

  • Numeric
  • Time series
  • Categorical
  • Image
  • Text
  • Tabular-data

NOTE: In machine learning, all data (images, text, etc.) eventually has to be converted into numerical form.

Example:

Suppose we want to use gender information in a dataset to predict whether an individual has heart disease.

Before we can use this information with a machine learning algorithm, we need to translate male vs. female into numbers, for instance, 1 = male and 2 = female, so it can be processed.
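A minimal sketch of that mapping with pandas (the column name and values are hypothetical):

```python
# Convert a text category into numbers so an algorithm can process it.
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})
df["gender_num"] = df["gender"].map({"male": 1, "female": 2})
print(df)
```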

Conversion of non-numeric data to vector

  • All non-numerical data types (such as images, text, and categories) must eventually be represented as numbers
  • In machine learning, the numerical representation will be in the form of an array of numbers — that is, a vector

WHAT IS A VECTOR ?

A vector is simply an array of numbers, such as (1, 2, 3) — or a nested array that contains other arrays of numbers, such as (1, 2, (1, 2, 3))

Tabular-data Example

In tabular data, typically each row describes a single item, while each column describes different properties of the item

Each row represents an instance / entity / input vector.

Each column represents an attribute / feature.

Time series data Example:

Time-series data consists of data points given across an ordered series of times; a graph of values plotted in time order is time-series data.

Categorical Data Example:

Categorical data takes on a limited number of distinct values, or categories; before training, all categories have to be represented in numeric form.

SCALING DATA

Scaling data means

  • transforming the data so that the values fit within some range or scale, such as 0–100 or 0–1. It is standard practice to scale data before feeding it into a machine learning algorithm.

Why do we need it?

The scaling process does not affect the algorithm's output, since every value is scaled in the same way. But it can speed up the training process, as the algorithm only needs to handle small numbers (e.g., values <= 1 when scaling to the 0–1 range).

Approaches to scaling data:

There are 2 commonly used approaches to scaling data:

  • standardization and
  • normalization.

Standardization:

Standardization rescales data so that it has a mean of 0 and a standard deviation of 1.

  • target: mean =0 , std = 1
  • Formula: (𝑥 − 𝜇)/𝜎

Example:

Standardize the following dataset:

50 ,100 ,150

here, mean = 100, and standard deviation = 50.

Let’s try standardizing each of these data points. The calculations are:

(50 − 100)/50 = -50/50 = -1
(100 − 100)/50 = 0/50 = 0
(150 − 100)/50 = 50/50 = 1

Thus, our transformed/scaled data points are: -1 ,0 , 1

Again, the result of the standardization is that our data distribution now has a mean of 0 and a standard deviation of 1.
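The same calculation as a short Python sketch (note that the example uses the sample standard deviation, hence ddof=1):

```python
# Standardize the data points from the example above.
import numpy as np

x = np.array([50.0, 100.0, 150.0])
z = (x - x.mean()) / x.std(ddof=1)   # ddof=1: sample standard deviation = 50
print(z)                             # [-1.  0.  1.]
```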

Normalization:

  • Normalization rescales the data into the range [0, 1].
  • Target: 0–1 range
  • The formula for this is: (𝑥 −𝑥𝑚𝑖𝑛)/(𝑥𝑚𝑎𝑥 −𝑥𝑚𝑖𝑛)

Example:

Let’s try working through an example with those same three data points:

50, 100 ,150

here, 𝑥𝑚𝑖𝑛 = 50, 𝑥𝑚𝑎𝑥 = 150 and 𝑥𝑚𝑎𝑥 −𝑥𝑚𝑖𝑛 = 150 − 50 = 100.

Plugging everything into the formula, we get:

(50 − 50)/100 = 0/100 = 0
(100 − 50)/100 = 50/100 = 0.5
(150 − 50)/100 = 100/100 = 1

Thus, our transformed data points are:

0, 0.5 ,1
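And the same calculation in Python:

```python
# Min-max normalize the same data points into the 0-1 range.
import numpy as np

x = np.array([50.0, 100.0, 150.0])
scaled = (x - x.min()) / (x.max() - x.min())
print(scaled)   # [0.  0.5 1. ]
```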

CATEGORICAL DATA ENCODING:

Encoding is required for categorical data preprocessing. Remember, a machine learning model needs numeric data; through encoding, categorical data can be represented in numeric form.

Why do we do it?

Machine learning algorithms need to have data in numerical form

Data can be encoded in 2 ways

  1. Ordinal Encoding
  2. One hot encoding

First, let's see how each of these encodings works, starting with one-hot encoding.

One hot encoding:

  • For each category, a new field (column) is added to the table, e.g., red, green, blue, etc.
  • Advantage: it prevents imposing an unnecessary ordering on the categories
  • Disadvantage: a new field has to be added to the table for every category, which can make the table very wide
  • When to use? If your data has no natural order; for colors, for instance, it's not suitable to order them (see the sketch below)
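A minimal sketch of one-hot encoding with pandas (the color column is a hypothetical example):

```python
# One-hot encoding: one 0/1 column per category, no implied order.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
one_hot = pd.get_dummies(df["color"], dtype=int)
print(one_hot)   # columns: blue, green, red; a single 1 per row
```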

Now, moving to the Ordinal encoding,

Ordinal encoding:

  • Each category is represented with a numeric value starting from 0
  • A drawback of this approach is that it implicitly assumes an order across the categories
  • When to use? If the data has a natural order. For instance, categories like beginner, advanced, pro have a natural order, so ordinal encoding may suit this particular case (see the sketch below).
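A minimal sketch of ordinal encoding with pandas (the skill-level column is a hypothetical example):

```python
# Ordinal encoding: one numeric value per category, preserving the natural order.
import pandas as pd

df = pd.DataFrame({"level": ["beginner", "pro", "advanced", "beginner"]})
order = {"beginner": 0, "advanced": 1, "pro": 2}
df["level_num"] = df["level"].map(order)
print(df)
```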

PREPARING IMAGE DATA FOR MODEL TRAINING:

To train a ML model with image data

  • Color images are converted into 3D arrays (depth 3), while grayscale images have a depth of 1
  • The size of the vector required for any given image is the height × width × depth of that image (the depth is the number of color channels: 3 for RGB, 1 for grayscale)
  • Each pixel is represented by numeric values indicating color intensity, which range from 0–255. For example:

For grayscale, 0 is black, while 255 is bright white.

Purple might be represented as 128, 0, 128 (a mix of moderately intense red and blue, with no green)

Color depth: The number of channels required to represent the color is known as the color depth or simply depth.

  • In an RGB image, depth = 3, because there are three channels (Red, Green, and Blue). In contrast, a grayscale image has depth = 1.

Preprocessing:

Apart from encoding an image numerically, we may also need to do some other preprocessing steps.

  • such as ensuring input images have a uniform aspect ratio, e.g., by making sure all of the input images are square in shape
  • Other preprocessing operations that clean the input images include rotation, cropping, resizing, and centering the image

EXAMPLE: RGB IMAGE REPRESENTATION IN VECTOR

Suppose the image has

  • width = 100 pixels, height = 100 pixels
  • It's a color image with Red, Green, and Blue channels, each channel represented with its own array of values
  • so the vector size is 100 × 100 × 3
  • each of the red, green, and blue channels holds 100 × 100 values

Can you guess how many numeric values are required to represent this image?

Answer: for a 100 × 100 pixel image, each pixel contains 3 channels, so 10,000 pixels × 3 = 30,000 numerical values.
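A minimal sketch of this representation with Pillow and NumPy ("photo.jpg" is a hypothetical file name):

```python
# Load an image and inspect its numeric representation.
import numpy as np
from PIL import Image

img = Image.open("photo.jpg").convert("RGB").resize((100, 100))
arr = np.asarray(img)        # shape (100, 100, 3), values 0-255
print(arr.shape, arr.size)   # (100, 100, 3) 30000
```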

PREPARING TEXT DATA FOR MACHINE LEARNING MODEL

As machine learning models cannot process data in text form, we also need to represent text in numeric form.

First, the text is changed to a base form and unnecessary words (a, the, if) are removed; then the text data is converted to numeric form to feed into the machine learning model.

TEXT-Normalization

  • Text normalization is the process of transforming a piece of text into a canonical (official) form

Why normalization ?

  • The “to be” verb may have different forms, e.g., is, am, are
  • A document may contain alternative spellings of a word, such as behavior vs. behaviour

So we need to have a base format for each text.

Normalization can be done in the following ways:

  • Lemmatization
  • Removing Stop words

lemmatization

  • Lemmatization is the process of reducing multiple inflections of a word to its single dictionary form (the lemma).

For example, applying this to the is, am, are example we mentioned above reduces all three forms to the lemma be.
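A minimal sketch with NLTK's WordNet lemmatizer (assumes nltk is installed; the wordnet corpus is downloaded on first run):

```python
# Reduce inflected verb forms to their dictionary form ("be").
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # one-time corpus download
lemmatizer = WordNetLemmatizer()
for word in ["is", "am", "are"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))   # all reduce to "be"
```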

Removing Stop words:

  • Stop words are high-frequency words that are unnecessary (or unwanted) during the analysis.

Example:

“which cookbook has the best pancake recipe” — here, which and the are less relevant, while cookbook, pancake, and recipe are the more significant words.

TOKENIZATION:

  • Tokenization splits each string of text into a list of smaller parts, or tokens.

Example of Tokenization and stop word removal:

  • Here, the text is split into tokens,
  • stop words (the) are removed, and the spelling is standardized (changing lazzy to lazy).
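A minimal sketch using a plain whitespace tokenizer and the cookbook query from above (real pipelines often use NLTK or spaCy instead):

```python
# Tokenize a sentence, then drop the stop words.
sentence = "which cookbook has the best pancake recipe"
stop_words = {"which", "the", "has", "a", "an", "if"}

tokens = sentence.split()
filtered = [t for t in tokens if t not in stop_words]
print(tokens)     # ['which', 'cookbook', 'has', 'the', 'best', 'pancake', 'recipe']
print(filtered)   # ['cookbook', 'best', 'pancake', 'recipe']
```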

Vectorization:

After normalization, we encode the text in numerical form; turning a piece of text into a vector is called vectorization.

  • The goal here is to identify the particular features of the text that will be relevant for the task at hand, and then extract them in a numerical form that is accessible to the machine learning algorithm.

There are different ways of vectorizing text, below TF-IDF method is explained.

Term Frequency-Inverse Document Frequency (TF-IDF) vectorization

  • TF-IDF was invented for document search and information retrieval.
  • The numeric value is determined, roughly, by this formula: (how many times the word appears in this document) / (how often the word appears across all documents)
  • The higher the score, the more relevant the word is to that particular document.
  • Common words that appear in every document (what, if, the, a, an, etc.) get a lower score: they appear many times in this document and in others as well, so they don't mean much to any document in particular.
  • If the word “rabid” appears many times in one document while rarely appearing in others (unlike the common words the, an, what, etc.), it is very relevant to that document and will have a higher score.
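A minimal sketch with scikit-learn's TfidfVectorizer (the three toy documents are made up for illustration):

```python
# Vectorize a few documents; "rabid" scores high only where it is frequent.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the rabid dog barks at the rabid fox",
    "the cat sleeps on the mat",
    "the dog chases the cat",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # one row per document, one column per word
print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(tfidf.toarray().round(2))             # per-document TF-IDF scores
```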

LEARNING FUNCTION

  • Machine learning algorithms aim to learn a target function (f)
  • that describes the mapping between input variables (X) and an output variable (Y): Y = f(X)

Irreducible error:

  • Since the process extrapolates from a limited set of values, there will always be an error e
  • e is independent of the input data (X): Y = f(X) + e
  • No matter how good we get at estimating the target function (f), we cannot reduce this error.

Irreducible error VS model error

  • Irreducible error is caused by the data collection process — such as when we don’t have enough data or don’t have enough data features.
  • In contrast, the model error measures how much the prediction made by the model is different from the true output.
  • The model error is generated by the model and can be reduced during the model learning process, whereas the irreducible error is constant.

PARAMETRIC MACHINE LEARNING ALGORITHM

  • The form of the mapping function is assumed (selected) in advance
  • The function has a constant number of parameters

The parametric algorithms involve two steps:

  • select the form of the function and
  • learn the coefficients using the training data

Example

An example is the linear regression algorithm, where the simplified functional form can be something like: Y = B0 + B1*X1 + B2*X2

After selecting the initial function, the remaining problem is simply to estimate the coefficients B0, B1, and B2 from the training data, using the input variables X1 and X2.
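A minimal sketch of these two steps with NumPy (the synthetic data and the "true" coefficient values are assumptions for illustration):

```python
# Step 1: the form Y = B0 + B1*X1 + B2*X2 is fixed in advance.
# Step 2: learn the coefficients from training data via least squares.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 2))                    # columns: X1, X2
y = 3 + 2 * X[:, 0] - 1 * X[:, 1]           # true B0=3, B1=2, B2=-1

A = np.column_stack([np.ones(len(X)), X])   # prepend a column of 1s for B0
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)                               # ~[ 3.  2. -1.]
```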

Benefits:

  • Easy to understand and interpret
  • Parametric models are very fast and take less time to train.

Limitations:

  • Have limited complexity and are suitable for simpler problems

NON PARAMETRIC MACHINE LEARNING ALGORITHM

  • Non-parametric methods are good when you have a lot of data and no prior knowledge, and when you don’t want to worry about choosing just the right features.
  • Non-parametric algorithms do not make assumptions regarding the form of the mapping function
  • They are free to learn any functional form from the training data.

Benefits:

  • Flexible and can solve complex problems
  • Can result in higher performance models for prediction.

Limitations:

  • Require more data and more time to train the model.
  • Prone to overfitting.

APPROACHES TO MACHINE LEARNING

Machine learning has 3 main approaches.

  1. Supervised learning
  • Learns from data that contains both the inputs and expected outputs (i.e., labeled data); a short classification sketch follows this list. Common types are:
  • Classification: Outputs are categorical.
  • Regression: Outputs are continuous and numerical.
  • Similarity learning: Learns from examples using a similarity function that measures how similar two objects are.
  • Feature learning: Learns to automatically discover the representations or features from raw data.
  • Anomaly detection: A special form of classification, which learns from data labeled as normal/abnormal.
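As a minimal sketch of supervised classification, here the features, labels, and model choice are all hypothetical (loosely echoing the earlier heart-disease example):

```python
# Learn from labeled inputs/outputs, then predict a label for an unseen input.
from sklearn.tree import DecisionTreeClassifier

X = [[25, 0], [60, 1], [45, 1], [30, 0]]   # e.g., [age, smoker]
y = [0, 1, 1, 0]                            # label: heart disease (1) or not (0)

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[50, 1]]))             # predicted label for a new individual
```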

2. Unsupervised learning

  • Learns from input data only; finds hidden structure in input data.

Example:

Based on customer purchase data, an unsupervised ML algorithm can cluster/group customers with common behavior: if some customers are regular and buy expensive products, it can cluster them together; on the other hand, it can also cluster customers who are looking for less expensive products. A minimal sketch of this idea follows below.

Note that here the inputs are not labeled; the algorithm figures out the similar patterns in the given data on its own.
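Here is that customer example as a short sketch with scikit-learn's KMeans (the spend/frequency numbers are made up):

```python
# Group unlabeled "customers" by spend amount and purchase frequency.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[900, 12], [850, 10], [1000, 15],   # big, frequent spenders
              [50, 2], [80, 3], [60, 1]])          # small, occasional spenders

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # two clusters found without any labels, e.g. [1 1 1 0 0 0]
```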

Types of Unsupervised learning algorithm

  • Clustering: Assigns entities to clusters or groups.
  • Feature learning: Features are learned from unlabeled data.
  • Anomaly detection: Learns from unlabeled data, using the assumption that the majority of entities are normal.

3. Reinforcement learning

Learns how an agent should take action in an environment in order to maximize a reward function.

The main difference between reinforcement learning and other machine learning approaches is that

  • reinforcement learning is an active process where the actions of the agent influence the data observed in the future, hence influencing its own potential future states.
  • In contrast, supervised and unsupervised learning approaches are passive processes where learning is performed without any actions that could influence the data.
