Chapter 2: Data and Its Different Types

Ashish Patel
Published in ML Research Lab
8 min read · Jun 6, 2018

Machine Learning Series!

Hello folks, I hope you liked my first article, an introduction to machine learning. I am writing this one with the help of my friend Sanket Patel, a fellow data scientist, and I owe him special thanks for helping to put it together. Today I discuss data and its types, along with some questions you have faced, or will face, while trying to understand the nature of your data. This chapter covers the following points.

  1. Qualitative and Quantitative Data
  2. What is a variable, and what are its data types?
  3. What are parametric and non-parametric tests?
  4. What is population and sample?
  5. Quick guide to data

Definition of Data

The singular form is “datum”, so we say “that datum is very high”.

“Data” is the plural so we say “the data are available”, but data is also a collection of facts, so “the data is available” is fine too.

Data are collections of facts or measurements gathered from different data sources.

A data unit is one entity (such as a person or business) in the population being studied, about which data are collected. A data unit is also referred to as a unit record or record.

A data item is a characteristic (or attribute) of a data unit which is measured or counted, such as height, country of birth, or income. A data item is also referred to as a variable, because the characteristic may vary between data units and may vary over time.

Source: ABS, Statistical Language

An observation is an occurrence of a specific data item that is recorded about a data unit. It may also be referred to as datum, which is the singular form of data. An observation may be numeric or non-numeric (categorical). For example, 173 is a numeric observation of the data item ‘height (cm)’, whereas ‘Australia’ is a non-numeric (categorical) observation of the data item ‘country of birth’.

A dataset is a complete collection of all observations.
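To make these terms concrete, here is a minimal sketch of how they might map onto a small Python structure. The records are made up for illustration; only the height and country-of-birth data items come from the text above.

```python
# A hypothetical dataset: each dict is one data unit (a person),
# each key is a data item (variable), each value is an observation.
dataset = [
    {"person_id": 1, "height_cm": 173, "country_of_birth": "Australia"},
    {"person_id": 2, "height_cm": 165, "country_of_birth": "India"},
    {"person_id": 3, "height_cm": 180, "country_of_birth": "Australia"},
]

data_unit = dataset[0]                  # one unit record
data_items = list(data_unit.keys())     # the variables recorded for each unit
observation = data_unit["height_cm"]    # a single numeric observation (datum)

# All observations of one data item across the whole dataset
heights = [unit["height_cm"] for unit in dataset]

print(data_items)
print(observation, heights)
```

A list of dictionaries is only one possible layout; a table with one row per data unit and one column per data item expresses the same idea.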

1. Quantitative and Qualitative Data

Quantitative = Quantity

Qualitative = Quality

Qualitative and Quantitative Data

Quantitative Data

Quantitative data are measures of values or counts and are expressed as numbers. Quantitative data can be Discrete or Continuous:

  • Discrete data can only take certain values (like whole numbers)
  • Continuous data can take any value (within a range)

Qualitative Data

Qualitative data are measures of ‘types’ and may be represented by a name, symbol, or a number code. They are data about categorical variables (e.g. what type).

Data collected about a numeric variable will always be quantitative and data collected about a categorical variable will always be qualitative.

For example, if data are collected on annual income (quantitative), occupation data (qualitative) could also be gathered to get more detail on the average annual income for each type of occupation.
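The income-by-occupation idea can be sketched in a few lines. The records below are invented for illustration, and the variable names are my own; the point is only that the qualitative item (occupation) groups the quantitative item (income).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey records: occupation is qualitative, income is quantitative.
records = [
    ("teacher", 52000),
    ("nurse", 61000),
    ("teacher", 48000),
    ("engineer", 75000),
    ("nurse", 59000),
]

# Collect the incomes observed for each occupation
incomes_by_occupation = defaultdict(list)
for occupation, income in records:
    incomes_by_occupation[occupation].append(income)

# Average annual income per occupation
average_income = {occ: mean(vals) for occ, vals in incomes_by_occupation.items()}
print(average_income)
```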

Source: ABS, Statistical Language

1.1 How can you use quantitative and qualitative data?

Frequency counts:

The number of times an observation occurs (frequency) for a data item (variable) can be shown for both quantitative and qualitative data.

The graphs below arrange the quantitative and qualitative data to show the frequency distribution of the data.

Quantitative Data

Qualitative Data

As absolute frequencies can be calculated on quantitative and qualitative data, relative frequencies can also be produced, such as percentages, proportions, rates and ratios. For example, the graphs above show 4 people (20%) worked less than 30 hours per week, and 6 people (30%) are teachers.
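Absolute and relative frequencies are easy to compute by hand. The sketch below uses a made-up list of 20 occupations, chosen so that 6 respondents are teachers as in the example above; the other categories and counts are assumptions.

```python
from collections import Counter

# Hypothetical occupations for 20 survey respondents (6 of them teachers).
occupations = (["teacher"] * 6 + ["nurse"] * 5 + ["engineer"] * 4
               + ["farmer"] * 3 + ["driver"] * 2)

absolute = Counter(occupations)          # absolute frequency of each category
total = sum(absolute.values())           # number of respondents
relative = {k: v / total for k, v in absolute.items()}   # proportions

# Print each category with its count and its percentage
for occupation, count in absolute.most_common():
    print(f"{occupation:10s} {count:2d} {relative[occupation]:.0%}")
```

The same counting works for quantitative data once the values are binned, for example hours worked grouped into "less than 30" and "30 or more".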

2. What is a variable?

A variable is any characteristic, number, or quantity that can be measured or counted.

Types of Data
Source: ABS, Statistical Language

A) Numeric variables have values that describe a measurable quantity as a number, like ‘how many’ or ‘how much’. Therefore numeric variables are quantitative variables.

  • A continuous variable is a numeric variable. Observations can take any value between a certain set of real numbers. The value given to an observation for a continuous variable can include values as small as the instrument of measurement allows. Examples of continuous variables include height, time, age, and temperature.
  • A discrete variable is a numeric variable. Observations can take a value based on a count from a set of distinct whole values. A discrete variable cannot take the value of a fraction between one value and the next closest value. Examples of discrete variables include the number of registered cars, number of business locations, and number of children in a family, all of which are measured as whole units (e.g. 1, 2, 3 cars).

B) Categorical variables have values that describe a ‘quality’ or ‘characteristic’ of a data unit, like ‘what type’ or ‘which category’. Categorical variables fall into mutually exclusive (in one category or in another) and exhaustive (include all possible options) categories. Therefore, categorical variables are qualitative variables and tend to be represented by a non-numeric value.

Categorical variables may be further described as ordinal or nominal:

  • An ordinal variable is a categorical variable. Observations can take a value that can be logically ordered or ranked. The categories associated with ordinal variables can be ranked higher or lower than one another, but do not necessarily establish a numeric difference between each category. Examples of ordinal categorical variables include academic grades (e.g. A, B, C), clothing size (e.g. small, medium, large, extra large) and attitudes (e.g. strongly agree, agree, disagree, strongly disagree).
  • A nominal variable is a categorical variable. Observations can take a value that is not able to be organised in a logical sequence. Examples of nominal categorical variables include sex, business type, eye colour, religion and brand.
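The practical difference between ordinal and nominal variables shows up in what you are allowed to do with them. Here is a minimal sketch, using the clothing-size and eye-colour examples from the list above; the helper names are my own.

```python
# Ordinal: the category order is part of the data, so define it explicitly.
SIZE_ORDER = ["small", "medium", "large", "extra large"]
size_rank = {size: rank for rank, size in enumerate(SIZE_ORDER)}

sizes = ["large", "small", "extra large", "medium"]
sorted_sizes = sorted(sizes, key=size_rank.__getitem__)   # ranking is meaningful
is_bigger = size_rank["large"] > size_rank["medium"]      # so is comparison

# Nominal: only equality and counting are meaningful, not ranking.
eye_colours = ["brown", "blue", "brown", "green"]
distinct_colours = set(eye_colours)

print(sorted_sizes, is_bigger, sorted(distinct_colours))
```

Note that the ranks 0 to 3 only encode order. They do not claim that the gap between small and medium equals the gap between large and extra large.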

3. Parametric and Non-Parametric Tests

Parametric vs Non Parametric

A) Parametric Test

A parametric test is a hypothesis test that makes generalizations about a parameter of the parent population, typically its mean. The t-test, based on Student’s t-statistic, is often used for this purpose.

The t-statistic rests on the assumption that the variable is normally distributed and that the mean is known or assumed to be known. The population variance is estimated from the sample. It is also assumed that the variables of interest in the population are measured on an interval scale.
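As a rough illustration, the one-sample t-statistic can be computed as t = (sample mean - mu0) / (s / sqrt(n)), where s is the sample standard deviation and mu0 is the hypothesized population mean. The sketch below uses only the standard library; the function name and the sample values are assumptions made for this example.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """Return the t-statistic for H0: population mean == mu0."""
    n = len(sample)
    # stdev() uses the n-1 denominator, i.e. it estimates the
    # population standard deviation from the sample.
    standard_error = stdev(sample) / sqrt(n)
    return (mean(sample) - mu0) / standard_error

sample = [12, 14, 15, 13, 16]    # made-up measurements
t_stat = one_sample_t(sample, mu0=12)
print(round(t_stat, 3))
```

To turn the statistic into a decision you would compare it with the t-distribution with n - 1 degrees of freedom, which is left out here to keep the sketch dependency-free.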

B) Non-parametric Test

A non-parametric test is a hypothesis test that does not rely on assumptions about the underlying distribution, i.e. it does not require the population’s distribution to be described by specific parameters. Hence, it is also known as a distribution-free test.

Such tests are mainly based on differences in medians. They assume that the variables are measured on a nominal or ordinal level, and they are used when the independent variables are non-metric.
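One simple median-based example is the sign test, which I am using here purely as an illustration; the article does not single it out. It asks whether the population median equals a hypothesized value by counting how many observations fall above and below it. The function name and data are assumptions.

```python
from math import comb

def sign_test(sample, median0):
    """Two-sided sign test for H0: population median == median0.

    Returns (count above, count below, p-value).
    """
    above = sum(1 for x in sample if x > median0)
    below = sum(1 for x in sample if x < median0)
    n = above + below              # observations equal to median0 are dropped
    k = min(above, below)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return above, below, min(1.0, 2 * tail)

sample = [12, 14, 15, 13, 16, 11, 17, 18]    # made-up measurements
above, below, p_value = sign_test(sample, median0=12)
print(above, below, p_value)
```

Only the signs of the differences are used, so no assumption about the shape of the distribution is needed.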

C) Hypothesis Test Taxonomy

Hypothesis Test Taxonomy
Equivalent Tests

4. What is population and sample?

Population vs Sample

1. Population

A population is any complete group whose elements have at least one characteristic in common. Populations are not just people: they may consist of, but are not limited to, people, animals, businesses, buildings, motor vehicles, farms, objects or events.

  1. Finite Population: When the number of elements in the population is fixed, so that it can be enumerated in full, the population is said to be finite.
  2. Infinite Population: When the number of units in a population is uncountable, so that it is impossible to observe every item, the population is considered infinite.
  3. Existent Population: A population made up of objects that exist in reality is called an existent population.
  4. Hypothetical Population: A hypothetical, or imaginary, population is one that exists only hypothetically.

Examples

  • The population of all workers in a sugar factory.
  • The population of motorcycles produced by a particular company.
  • The population of mosquitoes in a town.
  • The population of taxpayers in India.

2. Sample

A sample is a subset of units in a population, selected to represent all units in a population of interest. It is a partial enumeration because it is a count from part of the population.

In other words, the respondents selected out of the population constitute a ‘sample’, and the process of selecting respondents is known as ‘sampling’. The units under study are called sampling units, and the number of units in a sample is called the sample size.
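Here is a small sketch of drawing a sample from a finite population. The population (ages of 1,000 factory workers) is generated artificially, and the seed and sample size are arbitrary choices for illustration.

```python
import random
from statistics import mean

# Hypothetical finite population: ages of 1,000 workers in a factory.
rng = random.Random(42)                  # fixed seed so the run is repeatable
population = [rng.randint(18, 60) for _ in range(1000)]

sample_size = 50
# Simple random sample: each worker is picked at most once
sample = rng.sample(population, sample_size)

print("population mean:", mean(population))
print("sample mean:    ", mean(sample))
```

The sample mean will usually be close to, but not equal to, the population mean; that gap is what sampling theory tries to quantify.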

Population vs Sample

5. Quick Guide to Data

5.1 Types of data
1. Qualitative (categorical) data
2. Quantitative (numeric) data

5.2 Important Questions about Data

Q: How do you handle missing values in qualitative and quantitative data?
Q: Which techniques are used to impute missing values in qualitative and quantitative data?
Q: What are parametric and non-parametric data?
Q: What are the ideal conditions under which both types of data can be used in data science?

5.3 Data with examples

1. Qualitative data come in two forms, string (object) and numeric; normally all string data are converted to numeric in the processing phase. Numeric qualitative data fall into three classes:

  1. Nominal: e.g. categories of people, men and women (string), coded 0 and 1 (numeric nominal), with 0 = men and 1 = women. There is no order between nominal categories, so you can swap the 0/1 codes between them.
  2. Ordinal: e.g. blood pressure, low, normal, high (string), coded (0, 1, 2), with 0 = low, 1 = normal and 2 = high. There is an order between ordinal categories, so you cannot swap the 0/1/2 codes between them.
  3. Binary: e.g. a bank asking ‘are you a loan defaulter?’, yes/no (string), coded (0, 1), with 0 = no and 1 = yes.

Q: How do you convert string categorical data to numeric?
Q: Does categorical imbalance affect machine learning?
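As one possible answer to the first question, here is a minimal sketch of the string-to-numeric conversions described above. The blood pressure and loan defaulter codes follow the text; the one-hot step for a nominal variable with more than two categories is my own addition, and all sample values are made up.

```python
# Ordinal: map each category to its rank; the order must be preserved.
bp_order = {"low": 0, "normal": 1, "high": 2}
blood_pressure = ["normal", "high", "low", "normal"]
bp_encoded = [bp_order[v] for v in blood_pressure]

# Binary: a yes/no answer becomes 1/0.
defaulter = ["no", "yes", "no"]
defaulter_encoded = [1 if v == "yes" else 0 for v in defaulter]

# Nominal with more than two categories: one column per category
# (one-hot), so the codes do not imply an order.
eye_colours = ["brown", "blue", "green", "brown"]
categories = sorted(set(eye_colours))    # column order is arbitrary but fixed
one_hot = [[int(c == cat) for cat in categories] for c in eye_colours]

print(bp_encoded, defaulter_encoded)
print(categories, one_hot)
```

Libraries such as pandas and scikit-learn offer ready-made encoders for this, but the underlying idea is no more than these mappings.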

2. Quantitative data generally have no fixed boundary; depending on the context (the data science project), their range can be unbounded. For example, if Apple gives us its advertising budgets and previous units sold and asks us to predict the units sold in the current year, both are quantitative data.

Continuous vs Discrete Data

Q: Which type of quantitative data is good for machine learning?
Q: Which tests are used to check quantitative data?

3. Why do we need both types of data for machine learning?

1. Quantitative data: help machine learning models capture logical relationships in contextual data.
2. Qualitative data: methods for quantitative data always rest on assumptions, and real data very often do not satisfy them, so it is worthwhile to use qualitative data as well.

Note: Continuous data give more insight than categorical (discrete) data.

References:

  1. www.abs.gov.au/websitedbs/a3121120.nsf/home/statistical+language
  2. https://keydifferences.com/
  3. https://www.mathsisfun.com/data/data.html
  4. https://towardsdatascience.com/data-types-in-statistics-347e152e8bee
