Missingno: Find Visual Clues to Data Completion 👁️

Manoj Das
6 min read · Jul 27, 2023


What is the Missingno package in Python? How do you use it to visualize and analyze missing values, and where does it fit into EDA?

Photo by Ehimetalor Akhere Unuabona on Unsplash

Missingno is a Python library for visualizing missing data in datasets. It gives data analysts and scientists a quick, intuitive view of where values are missing and how much is missing, which helps them decide how to handle those gaps during data preprocessing.

Missingno offers several visualization techniques, each covered in the examples later in this article:

  1. Matrix visualization
  2. Bar chart visualization
  3. Heatmap of nullity correlation
  4. Dendrogram

The library can be useful in exploratory data analysis, data cleaning, and data preparation stages of a data science or machine learning project.

History

The Missingno package was created by Aleksey Bilogur. It was first released in 2015 and gained popularity within the data science and data visualization communities due to its usefulness in handling missing data in datasets.

The motivation behind creating Missingno was to provide data analysts and scientists with a simple yet effective way to visualize and understand the patterns of missing data in their datasets. Missing data is a common issue in real-world datasets, and dealing with it appropriately is crucial for accurate analysis and modeling.

Missingno has been particularly valuable in exploratory data analysis (EDA) and data preprocessing tasks, as it allows users to identify which columns have significant amounts of missing data and make informed choices on how to handle them.

The library is open-source and available on the Python Package Index (PyPI), making it easily accessible to the Python community. Its user-friendly approach to visualizing missing data has made it a popular tool for data cleaning and preparation processes.

Installation

You can install Missingno using pip:

pip install missingno

Advantages

The Missingno package in Python offers several advantages for data analysts, scientists, and machine learning practitioners when dealing with missing data in datasets.

Easy Visualization of Missing Data: Missingno provides simple and intuitive visualizations to explore missing data patterns. With just a few lines of code, you can generate matrix plots and bar charts that effectively display the distribution of missing values across features in the dataset.

Identifying Missing Data Patterns: By using Missingno visualizations, you can quickly identify patterns and clusters of missing data. This helps you understand if the missingness is random or if there are specific relationships between missing values in different features.

Effective Data Cleaning: Understanding the extent and distribution of missing data is crucial for data cleaning and preparation. Missingno allows you to make informed decisions on how to handle missing values, such as imputation or removal, based on the insights gained from the visualizations.

Time-Saving: Missingno enables rapid exploration of missing data patterns, reducing the time required for data analysis and data preprocessing tasks.

Complements Pandas: Missingno integrates seamlessly with the Pandas library, which is widely used for data manipulation and analysis in Python. It works with Pandas DataFrames, making it convenient for users already familiar with Pandas.

Flexibility: Missingno works with both small and large datasets. For large ones, plotting a random sample of rows (as the examples in this article do) keeps the figures quick to render and easy to read.

Open-Source and Active Community: As an open-source package, Missingno benefits from an active community of contributors and users who help improve and maintain the library. This means you can expect updates, bug fixes, and new features as the Python ecosystem evolves.

Data Imputation Insights: Missingno’s visualizations can aid in deciding appropriate data imputation strategies. By understanding the relationships between missing values in different columns, you can make more informed choices on how to impute missing data.

Enhances Data Exploration: Missingno can be a valuable tool during exploratory data analysis (EDA), as it allows you to gain insights into the quality of the dataset and identify potential data issues.

Suitable for Data Scientists and Non-Experts: The library’s simplicity makes it accessible to both data science experts and non-experts who might be less familiar with advanced data visualization techniques.
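
The cleaning and imputation decisions mentioned above happen outside Missingno. As a rough sketch of what that follow-up step might look like with pandas alone, the example below measures missingness per column, drops columns that are mostly empty, and fills the remaining numeric gaps. The toy DataFrame, the 50% threshold, and the choice of median imputation are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): three columns with different amounts of missingness.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan, 52],
    "income": [np.nan, np.nan, np.nan, 58000, np.nan, 61000],
    "city": ["NYC", "LA", None, "NYC", "SF", "LA"],
})

# Fraction of missing values in each column.
missing_fraction = df.isnull().mean()

# Assumed rule of thumb: drop columns that are more than 50% missing.
threshold = 0.5
kept = df.loc[:, missing_fraction <= threshold]

# Fill the remaining numeric gaps with each column's median.
numeric_cols = kept.select_dtypes(include="number").columns
cleaned = kept.copy()
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())

print(missing_fraction.round(2))
print(cleaned)
```

Whether to drop, impute, or leave a column alone depends on the data and the analysis; the threshold here is only a placeholder.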

Examples

For this example we will use a sample of the NYPD Motor Vehicle Collisions Dataset.

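import pandas as pd
import missingno as msno
# IPython/Jupyter magic; remove this line when running as a plain script.
%matplotlib inline

collisions = pd.read_csv("nyc_collision_factors.csv")
# Plot a random sample of 250 rows so the matrix stays readable.
msno.matrix(collisions.sample(250))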
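
If you do not have the CSV at hand, the information the matrix draws can be approximated with pandas: a boolean mask of which cells are filled, plus a count of filled cells per row, which is what the sparkline on the right side of the plot summarizes. The small DataFrame below is made up for illustration and is not the collisions data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of a collisions-like table.
df = pd.DataFrame({
    "borough": ["BRONX", None, "QUEENS", None],
    "zip_code": [10451.0, np.nan, 11101.0, np.nan],
    "on_street": ["A ST", "B AVE", None, "C RD"],
})

# True where a value is present, False where it is missing.
filled = df.notnull()

# Number of filled cells in each row (row completeness).
row_completeness = filled.sum(axis=1)

print(filled)
print(row_completeness.tolist())
```

Rows with a low count are the ones that appear mostly white in the matrix plot.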

Time series data

If you are working with time-series data, you can specify a periodicity using the freq keyword parameter:

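import numpy as np
# Build a 50 x 20 frame in which roughly half of the cells are missing.
null_pattern = (np.random.random(1000).reshape((50, 20)) > 0.5).astype(bool)
null_pattern = pd.DataFrame(null_pattern).replace({False: None})
# Index the rows by 50 monthly periods; freq='BQ' sets the periodicity of the y-axis labels.
msno.matrix(null_pattern.set_index(pd.period_range('1/1/2011', '2/1/2015', freq='M')), freq='BQ')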

Bar chart

msno.bar is a simple visualization of nullity by column:

msno.bar(collisions.sample(1000))

Heatmap

The Missingno correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another:

msno.heatmap(collisions)
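
To make "nullity correlation" concrete, it can be approximated as the ordinary correlation between the columns of the missingness mask. The sketch below uses pandas only and a made-up DataFrame; it is meant to illustrate the idea, not to reproduce Missingno's internals exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "b" is missing exactly where "a" is missing,
# and "c" is missing exactly where "a" is present.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": ["x", None, "y", None, "z", "w"],
    "c": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],
})

# Turn each column into a 0/1 missingness indicator (1 = missing),
# then correlate the indicators column by column.
nullity_corr = df.isnull().astype(int).corr()

print(nullity_corr.round(2))
```

Values near 1 mean two columns tend to be missing together; values near -1 mean one tends to be missing when the other is present; values near 0 mean the two missingness patterns are unrelated.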

Dendrogram

The dendrogram allows us to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap:

msno.dendrogram(collisions)
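
The grouping behind a dendrogram like this can be sketched with SciPy's hierarchical clustering applied to the missingness indicators. This is an illustration under assumptions: the DataFrame is made up, and average linkage with the default Euclidean distance is a choice made for the example, not a statement about how Missingno is configured.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

# Hypothetical data: "a" and "b" share one missingness pattern,
# "c" and "d" share a different, similar one.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "c": [np.nan, 2.0, np.nan, 4.0, 5.0, 6.0],
    "d": [np.nan, 2.0, np.nan, 4.0, 5.0, np.nan],
})

# One row per column of df: a 0/1 vector marking where that column is missing.
nullity = df.isnull().astype(int).to_numpy().T

# Agglomerative clustering of the columns by their missingness vectors.
links = hierarchy.linkage(nullity, method="average")

# Cut the tree into at most two groups and show which columns ended up together.
groups = hierarchy.fcluster(links, t=2, criterion="maxclust")
print(dict(zip(df.columns, groups)))
```

Columns that land in the same group have similar missingness patterns, which is what short branches joining two leaves indicate in the plotted dendrogram.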

Limitations

While the Missingno package in Python offers many advantages, it also has some limitations that users should be aware of:

No Imputation Functionality: While Missingno provides valuable insights into missing data patterns, it does not offer built-in data imputation methods. After identifying missing data, users need to apply separate imputation techniques using other libraries or methods.

Only Visualizes Missing Data: The Missingno library is specialized in visualizing missing data. It does not provide extensive support for other data visualization tasks or statistical analysis commonly found in comprehensive data visualization libraries like Matplotlib or Seaborn.

Limited to Pandas DataFrames: Missingno is tightly integrated with Pandas DataFrames, which means that users working with other data structures might need to convert their data to Pandas format before utilizing the library.

Slow on Very Large Datasets: Rendering every row of a very large dataset can be slow and produces plots that are hard to read. Plotting a random sample of rows, as the examples above do, is usually the practical workaround.

Binary View of Missingness: The library treats each cell as either present or missing. Placeholder values such as empty strings or sentinel numbers count as present unless you convert them to NaN first, and the plots show where values are missing, not why.

Dependency on Matplotlib: Missingno relies on Matplotlib for generating its visualizations. If you encounter compatibility issues or conflicts with Matplotlib or other visualization libraries, it might affect Missingno’s functionality.

Limited Use in Non-Graphical Environments: Missingno is primarily designed for graphical environments such as notebooks. On a headless server or in a command-line script, plots cannot be displayed interactively; you need a non-interactive Matplotlib backend and have to save the figures to files instead.

Not a Complete Data Analysis Solution: While Missingno is excellent for visualizing and understanding missing data, it is just one piece of the data analysis puzzle. To perform a comprehensive data analysis, you will need to use other libraries and techniques for data exploration, feature engineering, modeling, and evaluation.

Despite these limitations, the Missingno package remains a valuable tool for any data analyst or scientist who needs to quickly gain insights into missing data patterns in their datasets.

— — —

Why did the statistician love using Missingno?

It was the “missing” piece to their data puzzle!

🙂🙂🙂
