Missing Data & Its Types

Ganesh Dhasade
Published in Analytics Vidhya
3 min read · Aug 28, 2020
photo by: NEW DATA SERVICES | Unsplash

In the life cycle of a data science project, data is collected from various sources such as internal databases, third-party APIs, or surveys. Data engineers usually take care of loading the collected data into databases. But by the time the data reaches data scientists or analysts, it usually contains missing values or invalid characters. It is commonly estimated that around 80 percent of overall project time goes into data preparation, that is, making the data ready for analysis.

First, understand why there are missing values.

  • Non-Response: Subjects leave information blank. For example, people often don’t like to reveal their salary, age, mobile number, etc.
  • Human Error: Data is collected improperly, or mistakes are made during data entry.

The causes mentioned above are the most common reasons for missing data. Now, let's look at the different types of missing data.

Types of missing data:

  1. Missing Completely At Random (MCAR):

The fact that data is missing is independent of both the observed and the unobserved data, i.e. there is NO RELATIONSHIP between the missingness and the values of any variable/column/feature in the dataset, including the missing values themselves.

For example, when a random sample is taken from a population in which each member has the same chance of being included, the (unobserved) data of the members who were not included in the sample is MCAR.
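To make the MCAR idea concrete, here is a minimal simulation sketch using only Python's standard library. The salary distribution, the 30% missing rate, and all variable names are assumptions chosen for illustration; they do not come from a real dataset.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic "salary" column (illustrative numbers only).
salaries = [rng.gauss(50_000, 10_000) for _ in range(10_000)]

# MCAR: every value has the same 30% chance of being missing,
# regardless of its own value or of any other column.
observed = [s for s in salaries if rng.random() >= 0.3]

full_mean = statistics.mean(salaries)
observed_mean = statistics.mean(observed)
missing_fraction = 1 - len(observed) / len(salaries)

print("missing fraction:", round(missing_fraction, 3))
print("mean of all values:", round(full_mean))
print("mean of observed values:", round(observed_mean))
```

Because the missingness ignores the data entirely, the mean of the observed values should land close to the mean of all values. The only cost is a smaller sample.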

2. Missing Not At Random (MNAR):

The fact that data is missing is systematically related to the unobserved data, i.e. there is a RELATIONSHIP between the missingness and the missing values themselves (or other factors that were not recorded), even after accounting for the observed variables/columns/features in the dataset.

For example, MNAR occurs in public opinion surveys if those with weaker opinions respond less often.

MNAR is the most complex case. Strategies for handling it include finding more data about the causes of the missingness, or performing what-if analyses to see how sensitive the results are under various scenarios.
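A what-if analysis of this kind can be sketched as follows. The code first simulates MNAR data (higher salaries are more likely to be withheld), then recomputes the overall mean under several assumed gaps between missing and observed values. The thresholds, probabilities, and the `adjusted_mean` helper are hypothetical and exist only for this illustration.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Synthetic "salary" column (illustrative numbers only).
salaries = [rng.gauss(50_000, 10_000) for _ in range(10_000)]

def is_missing(salary):
    # MNAR: the chance of being missing depends on the value itself.
    # Assumed rule: salaries above 55,000 are withheld 60% of the time.
    p = 0.6 if salary > 55_000 else 0.1
    return rng.random() < p

observed = [s for s in salaries if not is_missing(s)]
n_missing = len(salaries) - len(observed)
observed_mean = statistics.mean(observed)

def adjusted_mean(delta):
    """Overall mean if the missing values averaged observed_mean + delta."""
    total = sum(observed) + n_missing * (observed_mean + delta)
    return total / len(salaries)

# What-if scenarios: how much higher might the missing salaries be?
for delta in (0, 5_000, 10_000, 15_000):
    print("delta =", delta, "-> adjusted mean:", round(adjusted_mean(delta)))

print("true mean (unknown in practice):", round(statistics.mean(salaries)))
```

In a real project the true mean is not available, so the spread of the adjusted means across plausible `delta` values is what indicates how sensitive the conclusion is to the MNAR assumption.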

3. Missing At Random (MAR):

The missingness is systematically related to the observed data but not to the unobserved data, i.e. within each category defined by the observed data, the probability of a value being missing is the same. MAR is more general and more realistic than MCAR. Modern missing data methods generally start from the MAR assumption.

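For example, people often avoid sharing personal data in surveys: suppose men are less likely to share their salary and women are less likely to share their age. If gender is recorded for everyone and, within each gender, non-response does not depend on the actual salary or age, the missing values are MAR.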
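The MAR mechanism can be simulated in a few lines. In this sketch, `group` stands in for any fully observed column (such as gender in the survey example), and the group means and missing rates are arbitrary assumptions, not real figures.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the sketch is reproducible

# Each person has an always-observed group label and a salary.
people = []
for _ in range(10_000):
    group = rng.choice(["A", "B"])
    salary = rng.gauss(55_000 if group == "A" else 50_000, 8_000)
    people.append((group, salary))

# MAR: the chance that salary is missing depends only on the observed
# group label, not on the salary value itself.
P_MISSING = {"A": 0.6, "B": 0.1}
observed = [(g, s) for g, s in people if rng.random() >= P_MISSING[g]]

def group_mean(records, group):
    """Mean salary of the records belonging to one group."""
    return statistics.mean(s for g, s in records if g == group)

for group in ("A", "B"):
    n_all = sum(1 for g, _ in people if g == group)
    n_obs = sum(1 for g, _ in observed if g == group)
    print(
        group,
        "missing rate:", round(1 - n_obs / n_all, 2),
        "| mean of all:", round(group_mean(people, group)),
        "| mean of observed:", round(group_mean(observed, group)),
    )
```

The missing rates differ sharply between the two groups, yet inside each group the observed mean should stay close to the mean of all values. That is the sense in which the probability of missingness is "the same within a category".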

Conclusion:

It’s important to understand why data is missing and to try out some assumptions about the missing values, as missing data can strongly affect the overall project result.

Rubin’s theory states the conditions under which a missing data method can provide valid statistical inferences. Most simple fixes only work under the restrictive and often unrealistic MCAR assumption. If MCAR is implausible, such methods can provide biased estimates.
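The bias of a simple fix can be seen in a small experiment. The sketch below generates MAR data, then compares dropping incomplete rows (a common simple fix) with a group-weighted mean that uses the observed column. The weighting scheme is just one illustrative MAR-aware estimate, not a method prescribed by Rubin's theory, and all numbers and names are assumptions.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed so the sketch is reproducible

# Each record: (group, score as seen by the analyst, true score).
# The group is always observed; the score is sometimes missing (None).
records = []
for _ in range(20_000):
    group = rng.choice(["A", "B"])
    score = rng.gauss(70 if group == "A" else 50, 10)
    # MAR: missingness depends only on the observed group.
    missing = rng.random() < (0.7 if group == "A" else 0.1)
    records.append((group, None if missing else score, score))

true_mean = statistics.mean(true for _, _, true in records)

# Simple fix: drop every row with a missing score.
complete = [(g, s) for g, s, _ in records if s is not None]
complete_case_mean = statistics.mean(s for _, s in complete)

# MAR-aware estimate: average within each group, then weight each
# group by its share of ALL rows (the group is never missing).
weighted_mean = 0.0
for g in ("A", "B"):
    share = sum(1 for grp, _, _ in records if grp == g) / len(records)
    group_mean = statistics.mean(s for grp, s in complete if grp == g)
    weighted_mean += share * group_mean

print("true mean:", round(true_mean, 1))
print("complete-case mean:", round(complete_case_mean, 1))
print("group-weighted mean:", round(weighted_mean, 1))
```

Dropping rows over-represents the group that responds more often, so the complete-case mean drifts away from the true mean, while the group-weighted estimate should stay close to it.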

References:

  1. u.osu.edu. (n.d.). Brief History of Missing Data Theory | How to Deal with Missing Data. [online] Available at: https://u.osu.edu/missingdata/a-very-brief-history-of-missing-data-theory/.
  2. Wikipedia. (2020). Missing data. [online] Available at: https://en.wikipedia.org/wiki/Missing_data#:~:text=Sometimes%20missing%20values%20are%20caused.
