Machine Learning 101

Part 2: Data Patterns, Types of Variables, and Algorithms

4 min read · Nov 13, 2022

In the previous part, Part 1: Introduction, we went through the basics of Machine Learning. In this part, we will dive into data patterns (distributions), types of variables, and types of algorithms.

So, What are Data Patterns or Distributions?

As stated in the previous part, a Machine Learning algorithm looks for patterns in the data and makes predictions based on them. Hence, Data Patterns are one of the most important aspects of Machine Learning.

Data Patterns, also referred to as Data Distributions, describe how the values in a dataset are spread out: which values occur and how often. Plotting a distribution on a graph (for example, as a histogram) makes the data more meaningful and simpler to understand.

Let’s go through some of the basic types of Data Distribution that are most commonly used in Machine Learning: Normal Distribution, Skewed Distribution, Uniform Distribution, and Exponential Distribution.

1) Normal Distribution

Suppose the average height of the 14-year-old female population in India is 1.5 meters. If we plot the heights of this population on a graph and get a bell-shaped curve that is symmetric around the average, with most heights close to 1.5 meters and fewer heights further away on either side, we can say that the data is Normally Distributed.

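Note: Some Machine Learning algorithms work best when certain quantities are approximately Normally Distributed. For example, Linear Regression assumes that its prediction errors (residuals) are Normally Distributed when it is used for statistical inference; the input data itself does not have to be.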
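To make the bell shape concrete, here is a minimal sketch that simulates heights and checks one well-known property of Normally Distributed data: roughly 68% of values fall within one standard deviation of the mean. The mean of 1.5 meters comes from the example above; the standard deviation of 0.07 meters and the sample size are assumptions made up for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Mean 1.5 m is from the example; the 0.07 m standard deviation is an assumption.
heights = [rng.gauss(1.5, 0.07) for _ in range(10_000)]

mean_height = statistics.mean(heights)
std_height = statistics.stdev(heights)

# For Normally Distributed data, roughly 68% of values lie within
# one standard deviation of the mean.
within_one_std = sum(
    1 for h in heights if abs(h - mean_height) <= std_height
) / len(heights)

print(f"mean = {mean_height:.3f} m, std = {std_height:.3f} m")
print(f"share within one std of the mean = {within_one_std:.2%}")
```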

2) Skewed Distribution

In a Skewed Distribution, the curve is not symmetric: its peak sits to the left or right of the center of the graph, and a long tail stretches out on the other side. Skewed data can often be brought closer to a Normal Distribution using various transformations (e.g. a Log Transformation for data with a long right tail).
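The effect of a Log Transformation can be sketched as follows. The data here is hypothetical (simulated incomes drawn from a log-normal distribution), and `skewness` is a small helper written for this example, not a library function. A skewness near 0 suggests a symmetric shape, while a clearly positive value indicates a long right tail.

```python
import math
import random
import statistics

def skewness(values):
    """Skewness: about 0 for symmetric data, positive for a long right tail."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed data, e.g. incomes (parameters are made up).
incomes = [rng.lognormvariate(10, 0.8) for _ in range(10_000)]

# The Log Transformation: replace every value with its natural logarithm.
log_incomes = [math.log(v) for v in incomes]

print(f"skewness before log transform: {skewness(incomes):.2f}")
print(f"skewness after log transform:  {skewness(log_incomes):.2f}")
```

Note that a Log Transformation only applies to strictly positive values, and it helps with right-skewed data; it is not a universal fix.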

3) Uniform Distribution

Let’s assume that snakes of different lengths are found in a forest. If we plot the lengths of the snake population on a graph and see that every length within the observed range is equally likely to be found, then we can say that the data is Uniformly Distributed.
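A quick way to see what "equal possibility" means is to simulate lengths and count how many fall into equally wide bins. In this sketch the range of 0.5 to 3.5 meters and the sample size are assumptions chosen purely for illustration; with Uniformly Distributed data, every bin should hold roughly the same number of snakes.

```python
import random
from collections import Counter

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Assumed range: snake lengths between 0.5 m and 3.5 m (illustrative numbers).
lengths = [rng.uniform(0.5, 3.5) for _ in range(12_000)]

# Count snakes in six bins, each 0.5 m wide. min(..., 5) keeps a value of
# exactly 3.5 m inside the last bin.
bin_counts = Counter(min(int((length - 0.5) / 0.5), 5) for length in lengths)

for b in range(6):
    low = 0.5 + 0.5 * b
    print(f"{low:.1f}-{low + 0.5:.1f} m: {bin_counts[b]} snakes")
```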

4) Exponential Distribution

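Suppose an earthquake occurs once every 500 years in a certain region, on average, and earthquakes occur independently of one another. The waiting time until the next earthquake then follows an Exponential Distribution: when we plot it on a graph, short waits are the most likely, and the likelihood of longer waits drops off exponentially.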
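As a rough sketch of the earthquake example, the waiting time between events can be simulated with the standard library. The 500-year average is from the text; the 100-year horizon and the sample size are assumptions for illustration. The formula used is the Exponential Distribution's cumulative probability, P(T <= t) = 1 - exp(-rate * t).

```python
import math
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

mean_gap_years = 500        # from the example: one earthquake per 500 years on average
rate = 1 / mean_gap_years   # expected number of earthquakes per year

# Simulated waiting times (in years) between consecutive earthquakes.
gaps = [rng.expovariate(rate) for _ in range(20_000)]

# Probability that the next earthquake arrives within the next 100 years.
p_theory = 1 - math.exp(-rate * 100)
p_simulated = sum(1 for g in gaps if g <= 100) / len(gaps)

print(f"average simulated gap: {statistics.mean(gaps):.0f} years")
print(f"P(next quake within 100 years): theory {p_theory:.3f}, simulated {p_simulated:.3f}")
```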

The next important aspect of a Machine Learning algorithm is variables and their types. The columns or attributes of the data are called Variables.

Before we go through the types of variables, we first need to understand what Correlation is.

Let’s say we plot a graph with the Ice-cream sales column on the Y axis and the Temperature column on the X axis. We see that when the temperature increases, Ice-cream sales also increase. Here, Temperature and Ice-cream sales are Correlated.

If the values of one column increase as the values of the other column increase (and decrease as they decrease), the relation is called a Positive Correlation, as in the example above.

Similarly, if the values of one column increase as the values of the other column decrease, or vice versa, the relation is called a Negative Correlation.
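Correlation is usually summarized with a single number. The sketch below computes the Pearson correlation coefficient, which ranges from -1 (perfect Negative Correlation) to +1 (perfect Positive Correlation). All the numbers are made up for illustration, and the hot-drink column is a hypothetical addition to show the negative case.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Illustrative data: daily temperature (Celsius) and units sold.
temperature_c = [18, 21, 24, 27, 30, 33, 36]
ice_cream_sales = [120, 135, 160, 190, 210, 240, 260]
hot_drink_sales = [200, 180, 170, 140, 120, 100, 90]

r_ice_cream = pearson(temperature_c, ice_cream_sales)
r_hot_drinks = pearson(temperature_c, hot_drink_sales)

print(f"temperature vs ice-cream sales: {r_ice_cream:+.3f}")   # close to +1
print(f"temperature vs hot-drink sales: {r_hot_drinks:+.3f}")  # close to -1
```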

Types of Variables

Let’s take an example of loan default prediction, where we predict whether the customer is likely to pay back the loan or not. Suppose the loan data related to the customer is stored in a CSV file.

The columns/features like gender, occupation, salary, and bank transactions that are used to predict the target column “pay or not” are called Independent variables.

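Note: These columns/features are called independent variables because they are the inputs to the model: their values are taken as given, while the value of the target is assumed to depend on them. The name does not mean that they are uncorrelated with each other; in real data, features such as occupation and salary are often correlated.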

The target column “pay or not” that is to be predicted is called the Dependent variable.
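In code, this split usually means separating the table into the input columns and the target column. Here is a minimal sketch using only the standard library; the column names, the values, and the in-memory text standing in for the CSV file are all hypothetical.

```python
import csv
import io

# Hypothetical loan data; an in-memory string stands in for the CSV file.
loan_csv = """gender,occupation,salary,pay_or_not
F,engineer,85000,yes
M,teacher,42000,yes
F,freelancer,18000,no
M,clerk,30000,yes
"""

rows = list(csv.DictReader(io.StringIO(loan_csv)))

target_column = "pay_or_not"  # the Dependent variable
feature_columns = [name for name in rows[0] if name != target_column]

# X holds the Independent variables, y holds the Dependent variable.
X = [[row[name] for name in feature_columns] for row in rows]
y = [row[target_column] for row in rows]

print("independent variables:", feature_columns)
print("first customer:", X[0], "->", y[0])
```

The `csv` module reads every value as text, so a numeric column such as salary would still need converting before being passed to a learning algorithm.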

Types of Machine Learning Algorithms

Let’s continue with the Ice-cream sales example. Suppose we are willing to assume the form of the mathematical function that relates the inputs to the output (e.g. Ice-cream sales rise along a straight line with Temperature), possibly along with the Distribution of the data (e.g. the “Sales” column is Normally Distributed). In that case, Parametric ML algorithms (e.g. Linear Regression) can be used. They learn a fixed number of parameters for the assumed function.

Similarly, if we do not know the distribution of Ice-cream sales or the mathematical function that relates the inputs to the output, then Non-Parametric ML algorithms (e.g. Decision Trees such as CART) can be used. They make few assumptions up front and learn the shape of the relationship from the data itself.
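The difference can be sketched on the Ice-cream sales example. The parametric side below is a simple least-squares straight line, which boils the data down to two numbers (slope and intercept). For the non-parametric side, this sketch swaps in k-nearest neighbors rather than a decision tree, purely because it fits in a few lines; it keeps the training data around and predicts from the closest examples. The data and the function names `fit_line` and `knn_predict` are made up for illustration.

```python
def fit_line(xs, ys):
    """Parametric: assume y = slope * x + intercept and estimate two numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def knn_predict(xs, ys, query, k=2):
    """Non-parametric: no assumed formula; average the k closest training points."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / k

# Illustrative data: daily temperature (Celsius) and ice-cream units sold.
temperature_c = [18, 21, 24, 27, 30, 33, 36]
ice_cream_sales = [120, 135, 160, 190, 210, 240, 260]

slope, intercept = fit_line(temperature_c, ice_cream_sales)
line_prediction = slope * 28 + intercept
knn_prediction = knn_predict(temperature_c, ice_cream_sales, query=28)

print(f"parametric (line):    sales at 28 C = {line_prediction:.1f}")
print(f"non-parametric (kNN): sales at 28 C = {knn_prediction:.1f}")
```

Once the line is fitted, the training data can be thrown away and only the two parameters kept, whereas the nearest-neighbor approach needs the data at prediction time. That trade-off is typical of the two families.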

Stay tuned! In the next part, we will take a look at the types of Machine Learning models. Please share your views, thoughts, and comments below. Feel free to ask any questions.


Written by Bzubeda