Data Types in Statistics Used for Machine Learning.

Jagadish Bolla
7 min read · Aug 30, 2020


Introduction to Statistics

The field of statistics is the science of learning from data. Statistical knowledge helps you use the proper methods to collect data, employ the correct analyses, and effectively present the results. Statistics is a crucial process behind how we make discoveries in science, make decisions based on data, and make predictions. It allows you to understand a subject much more deeply.

To become a successful Data Scientist, you must know the basics. Math and Stats are the building blocks of Machine Learning algorithms. It is important to know the techniques behind various Machine Learning algorithms to know how and when to use them. Now the question arises: what exactly is Statistics?

“Statistics is a Mathematical Science of data collection, analysis, interpretation and presentation”.

Statistical analysis

Why Learn Statistics?

One of the central concepts of data science is gaining insights from data. Statistics is an excellent tool for unlocking such insights in data. Statistics is a form of math, and it involves formulas, but it doesn’t have to be that scary even if you’ve never encountered it before.

Machine learning grew largely out of statistics. Many of the algorithms and models used in machine learning come from what’s called statistical learning. Knowing some basic statistics is extremely helpful whether you are deep into machine learning algorithms or just staying up-to-date on the latest machine learning research.

Introduction to Data Types

Having a good understanding of the different data types, also called measurement scales, is a crucial prerequisite for doing Exploratory Data Analysis (EDA) since you can use certain statistical measurements only for specific data types.

You also need to know which data type you are dealing with to choose the right visualization method. Think of data types as a way to categorize different types of variables. We will discuss the main types of data and look at an example for each.

Types of Data

Qualitative versus Quantitative Data

The distinction between qualitative and quantitative data is the most fundamental way to divide types of data. Is the characteristic something you can objectively measure with numbers or not?

1) Qualitative

The information represents characteristics that you do not measure with numbers. Instead, the observations fall within a countable number of groups. This type of variable can capture information that isn’t easily measured and can be subjective. Taste, the colour of a car, architectural style, and marital status are all types of qualitative data. Analysts also refer to this as categorical data.

i) Nominal Data

Nominal values represent discrete units and are used to label variables that have no quantitative value. Just think of them as labels. Note that nominal data has no order, so if you changed the order of its values, the meaning would not change. You can see two examples of nominal features below:

Nominal data example

Visualization Methods: To visualize nominal data you can use a pie chart or a bar chart.

For Nominal Visualization

In Data Science, you can use one-hot encoding to transform nominal data into numeric features.
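As a rough illustration of one-hot encoding, here is a minimal sketch in plain Python. The `one_hot_encode` helper and the car colour values are made up for this example; in practice you would more likely reach for a library routine such as pandas `get_dummies` or scikit-learn's `OneHotEncoder`.

```python
def one_hot_encode(values):
    """Return (categories, rows), where each row is a list of 0/1 indicators."""
    # Sorting the distinct labels only gives a stable column order;
    # the order itself carries no meaning for nominal data.
    categories = sorted(set(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


# Hypothetical nominal feature: car colours.
colours = ["red", "blue", "green", "blue"]
categories, encoded = one_hot_encode(colours)

print(categories)  # ['blue', 'green', 'red']
for colour, row in zip(colours, encoded):
    print(colour, row)
```

Each row contains exactly one 1, in the column of the category that observation belongs to, so no artificial order is introduced between the labels.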

ii) Ordinal Data

Ordinal data sits between categorical and numerical data. The data fall into categories, but the categories have a meaningful order, and any numbers placed on them reflect that order. For example, rating a restaurant on a scale from 0 (lowest) to 4 (highest) stars gives ordinal data. Ordinal data are often treated as categorical, where the groups are ordered when graphs and charts are made. However, unlike with nominal data, the numbers do carry meaning: a higher number means a higher rank, although the distance between two adjacent categories is not necessarily equal. Ordinal data is therefore nearly the same as nominal data, except that its ordering matters. You can see an example below:

Customer ratings for a service: in this example, the order matters

Ordinal scales are usually used to measure non-numeric features like happiness, customer satisfaction, the rank of students in a class, education level, etc.

Therefore, you can summarize your ordinal data with frequencies, proportions, and percentages, and you can visualize it with pie and bar charts. Additionally, you can summarize your data using percentiles, median, mode, and interquartile range.
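To make these summaries concrete, here is a small sketch using only the Python standard library. The four-level scale and the ratings are invented for illustration. The idea is to map each label to its rank so that order-based statistics such as the median become available.

```python
import statistics
from collections import Counter

# Hypothetical ordered scale, from lowest to highest.
SCALE = ["poor", "fair", "good", "excellent"]
RANK = {label: position for position, label in enumerate(SCALE)}

ratings = ["good", "poor", "excellent", "good", "fair", "good", "excellent"]

# Frequencies and percentages work exactly as they do for nominal data.
counts = Counter(ratings)
percentages = {label: 100 * counts[label] / len(ratings) for label in SCALE}

# Order-based statistics need the ranks, not the raw labels.
ranks = [RANK[rating] for rating in ratings]
median_label = SCALE[statistics.median_low(ranks)]
mode_label = statistics.mode(ratings)

print(counts)
print(median_label)  # good
print(mode_label)    # good
```

`median_low` is used here so that the result is always an actual category rather than a value halfway between two ranks. A mean of the ranks is deliberately left out, since it would assume equal spacing between categories.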

In addition to ordinal and nominal values, there is a special type of categorical data called binary.

Binary data types only have two values, such as yes or no. This can be represented in different ways such as “True” and “False” or 1 and 0. Binary data is used heavily for classification machine learning models. Examples of binary variables include whether a person has cancelled their subscription service or not, or whether a person bought a car or not.
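As a quick sketch, a binary variable can be coded as 1 and 0 with a single list comprehension. The subscription answers below are hypothetical.

```python
# Hypothetical answers to "has this customer cancelled their subscription?"
cancelled = ["yes", "no", "no", "yes", "no"]

# Map the two categories onto 1 and 0.
encoded = [1 if answer == "yes" else 0 for answer in cancelled]

# With 0/1 coding, the mean is simply the proportion of "yes" values.
cancel_rate = sum(encoded) / len(encoded)

print(encoded)      # [1, 0, 0, 1, 0]
print(cancel_rate)  # 0.4
```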

Binary data types

2) Quantitative

The information is recorded as numbers and represents an objective measurement or a count. Temperature, weight, and a count of transactions are all quantitative data. Analysts also refer to this type as numerical data.

i) Discrete Data

Discrete quantitative data are a count of the presence of a characteristic, result, item, or activity. These measures cannot be meaningfully divided into smaller increments. For example, a single household can have 1 or 2 cars, but it cannot have 1.6. The possible values for an observation are separate and countable, typically whole numbers.

With discrete variables, you can calculate and assess a rate of occurrence or a summary of the count, such as the mean, sum, and standard deviation. For example, U.S. households had an average of 2.11 vehicles in 2014.

Bar charts are a standard way to graph discrete variables. Each bar represents a distinct value, and the height represents its proportion in the entire sample.
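The summaries and the bar chart described above can be sketched with the standard library. The household car counts are invented for this example, and the bars are drawn as text so that no plotting library is needed.

```python
import statistics
from collections import Counter

# Hypothetical number of cars in each of ten households.
cars = [1, 2, 0, 1, 3, 2, 1, 2, 1, 2]

total = sum(cars)                # 15
mean = statistics.mean(cars)     # 1.5
spread = statistics.stdev(cars)  # sample standard deviation

# One bar per distinct value; the length reflects how often it occurs.
counts = Counter(cars)
for value in sorted(counts):
    proportion = counts[value] / len(cars)
    print(f"{value} cars: {'#' * counts[value]} ({proportion:.0%})")
```

Note that the mean of 1.5 is not itself a value any single household can have, which is also why an average such as 2.11 vehicles is a valid summary of discrete data.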

Bar chart for numbers of cars in household

ii) Continuous Data

Continuous variables can take on almost any numeric value and can be meaningfully divided into smaller increments, including fractional and decimal values. You often measure a continuous variable on a scale. For example, when you measure height, weight, and temperature, you have continuous data.

For example, the mean adult height in the United States is approximately 5 feet 9 inches for men and 5 feet 4 inches for women.

Continuous data comes in two types:

a) Interval Data

Interval values represent ordered units that have the same difference. Therefore we speak of interval data when we have a variable that contains numeric values that are ordered and where we know the exact differences between the values. An example would be a feature that contains the temperature of a given place as you can see below:

Positive and negative intervals

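The problem with interval data is that it doesn’t have a “true zero”. Zero on the scale is an arbitrary point rather than the absence of the quantity, so you can add and subtract interval values, but ratios between them are not meaningful: 20 °C is not twice as hot as 10 °C.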

b) Ratio Data

Ratio values are also ordered units that have the same difference. Ratio values are the same as interval values, with the difference that they do have an absolute zero, which makes statements such as “twice as long” meaningful. Good examples are height, weight, and length.
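The difference between the two scales can be illustrated with temperature. Celsius is an interval scale, while Kelvin starts at absolute zero and behaves as a ratio scale. The readings below are made up, and the snippet is only a sketch of the idea.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin."""
    return celsius + 273.15


# Hypothetical morning and afternoon readings in degrees Celsius.
morning_c, afternoon_c = 10.0, 20.0

# Differences are meaningful on an interval scale and survive the conversion.
difference_c = afternoon_c - morning_c
difference_k = celsius_to_kelvin(afternoon_c) - celsius_to_kelvin(morning_c)

# Ratios are not: dividing Celsius values gives 2.0, but on the Kelvin
# scale the afternoon is only a few percent warmer than the morning.
naive_ratio = afternoon_c / morning_c
kelvin_ratio = celsius_to_kelvin(afternoon_c) / celsius_to_kelvin(morning_c)

print(difference_c, round(difference_k, 2))
print(naive_ratio, round(kelvin_ratio, 3))
```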

Length of a table

When you are dealing with continuous data, you have the widest choice of methods to describe your data. You can summarize your data using percentiles, median, interquartile range, mean, mode, standard deviation, and range.
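Most of these summaries are available in Python's built-in `statistics` module. The heights below are made-up sample values, so treat this as a sketch rather than a full analysis.

```python
import statistics

# Hypothetical sample of heights in centimetres.
heights_cm = [158.2, 162.5, 165.0, 167.3, 170.1, 171.8, 174.4, 180.9]

mean = statistics.mean(heights_cm)
median = statistics.median(heights_cm)
stdev = statistics.stdev(heights_cm)  # sample standard deviation
value_range = max(heights_cm) - min(heights_cm)

# Quartiles split the sorted data into four parts; the IQR is Q3 - Q1.
q1, q2, q3 = statistics.quantiles(heights_cm, n=4)
iqr = q3 - q1

print(f"mean={mean:.2f} median={median:.2f} stdev={stdev:.2f}")
print(f"range={value_range:.2f} IQR={iqr:.2f}")
```

The mode is left out here because measured continuous values rarely repeat exactly; it becomes more useful once the values are grouped into bins.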

Visualization Methods:

To visualize continuous data, you can use a histogram or a box plot. With a histogram, you can check the central tendency, variability, modality, and kurtosis of a distribution. Note that a histogram does not explicitly flag outliers, and whether they are visible depends on the bin width you choose. This is why we also use box plots, which mark outliers as individual points.
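The numbers behind both plots can be sketched without a plotting library. The example below assumes the common box plot convention of flagging points more than 1.5 × IQR beyond the quartiles. The helper names, the weights, and the bin width of 10 are chosen purely for illustration.

```python
import statistics


def iqr_outliers(values):
    """Return the values outside the 1.5 * IQR fences used by many box plots."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [value for value in values if value < lower or value > upper]


def histogram_counts(values, bin_width):
    """Count how many values fall into each fixed-width bin."""
    counts = {}
    for value in values:
        bin_start = (value // bin_width) * bin_width
        counts[bin_start] = counts.get(bin_start, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical weights in kilograms, with one unusually large value.
weights_kg = [61.0, 64.5, 66.2, 68.0, 69.4, 70.1, 72.3, 74.8, 120.0]

print(histogram_counts(weights_kg, 10))  # {60.0: 5, 70.0: 3, 120.0: 1}
print(iqr_outliers(weights_kg))          # [120.0]
```

The bin counts show the overall shape of the distribution, while the fence rule names the unusual observation directly.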

These plots and graphs are for continuous data analysis

Summary

In this post, you discovered the different data types that are used throughout statistics. You learned the difference between discrete and continuous data, and what the nominal, ordinal, binary, interval, and ratio measurement scales are. Furthermore, you now know which statistical measurements you can use with which data type and which visualization methods are appropriate. You also learned which methods can transform categorical variables into numeric variables. This enables you to carry out a large part of an exploratory analysis on a given dataset.

