New Python Packages for Data Cleaning

Jake from Mito
Published in trymito · 3 min read · May 31, 2022
1. Mito

Mito lets you clean data in a spreadsheet interface inside Python. You call Mito from your Python environment, and each edit you make in the Mito spreadsheet generates the equivalent Python in the code cell below.

Here is a demo video:

Install commands for Mito:

python -m pip install mitoinstaller
python -m mitoinstaller install

Then open JupyterLab and call the Mitosheet:

import mitosheet
mitosheet.sheet()

The full instructions can be found on the Mito website under “docs.”

In terms of data cleaning functionality, Mito has everything a user would expect from a spreadsheet. These features range from trimming strings, to summary stats, to deduplicating rows and filling null values. Each of these operations generates the equivalent Python for you.
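
To make those operations concrete, here is a small hand-written pandas sketch of the kind of cleaning steps described above. This is not Mito's actual generated output, just an approximation of what equivalent Python can look like; the DataFrame and the column names (name, score) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with stray whitespace, a duplicate row, and nulls
df = pd.DataFrame({
    "name": ["  Alice ", "Bob", "Bob", None],
    "score": [90.0, 85.0, 85.0, None],
})

# Trim leading and trailing whitespace from the string column
df["name"] = df["name"].str.strip()

# Fill null values: a placeholder for names, the column mean for scores
df["name"] = df["name"].fillna("unknown")
df["score"] = df["score"].fillna(df["score"].mean())

# Drop exact duplicate rows and renumber the index
df = df.drop_duplicates().reset_index(drop=True)

# Summary stats for the numeric column
summary = df["score"].describe()
print(df)
print(summary)
```

Each step is a single line of pandas, which is roughly the granularity you would expect from one edit in a spreadsheet.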

Here is what it looks like to use filters, pivot tables, and graphs in Mito.


Mito is focused on making data cleaning processes faster and more accessible. While Mito provides a visual interface for your cleaning, it still generates fully documented Python that can be used to automate or share your analysis.

2. Arrow

Arrow is a Python package all about helping the user handle dates, times, and timestamps. These can be some of the most annoying pieces of data to handle and clean, and Arrow is built to make them easier to work with.

The full list of features is in their documentation:

https://arrow.readthedocs.io/en/latest/

To install arrow:

pip install -U arrow

Here is a demo video of the package:

Parsing and cleaning time data is an important step in the data science workflow. Packages like Arrow abstract these processes behind a much friendlier syntax that makes them easier to complete. Not only is the syntax shorter, but it is also easier to remember, so the user does not have to go to Google or Stack Overflow as often.
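
As a rough illustration of that syntax, here is a minimal sketch using a few of Arrow's documented calls (arrow.get, shift, to, format, humanize). The timestamp and the target timezone are made-up sample values, not taken from any real dataset.

```python
import arrow

# Parse an ISO 8601 timestamp (illustrative sample value)
ts = arrow.get("2022-05-31T14:30:00+00:00")

# Shift forward by three hours
later = ts.shift(hours=3)

# Convert to another timezone
pacific = ts.to("America/Los_Angeles")

# Format with Arrow's token syntax
label = ts.format("YYYY-MM-DD HH:mm")

# Describe one timestamp relative to another in plain English
relative = later.humanize(ts)

print(label)
print(pacific)
print(relative)
```

Doing the same with only the standard library usually means juggling datetime, timedelta, and a timezone module, which is the overhead Arrow is trying to remove.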

3. Scrubadub

This package is all about removing personally identifiable data from datasets. As privacy laws become more and more stringent (as they should!), Python users and businesses at large are increasingly focused on removing personal data.

Here are examples of the PII that this package removes (taken from the package documentation):

https://scrubadub.readthedocs.io/en/stable/

To install:

pip install scrubadub

Here is an overview of how the package works:

https://scrubadub.readthedocs.io/en/stable/usage.html

I hope you find these packages helpful :)
