Python for data analysis 101: Cheat sheet for data preparation using Python

TLDR: Quick reference guide for data prep using Python + Jupyter Notebooks

While most teaching materials on data analysis cover statistical modeling and machine learning using mock data or clean data sets, data analysts in real life usually spend most of their time preparing the data for analysis. Python’s Pandas library is a great tool for getting data in shape, particularly when working with Jupyter Notebooks.

This cheat sheet for data preparation in Python introduces some of the most important Pandas features plus some useful Jupyter widgets and magics for pre-processing. It’s a quick ref guide for data analysts who are getting started with Python (based on Python 3). The following points are covered:

  1. Data import
  2. Styling
  3. Metadata and data types
  4. Duplicate detection
  5. Dataframe manipulation
  6. Outlier detection
  7. Variable generation and manipulation
  8. Preparation of data for modeling

The cheat sheet is only a high-level reference guide; for more details, refer to the documentation of the libraries (Pandas, Numpy). Those are also the major sources for the cheat sheet. Other great sources are a compact presentation on data pre-processing by April Chen and a blog by Chris Albon. Thanks to Jonas Kemper for his contributions. In case you are not familiar with Python yet, I suggest completing some basic tutorials on Python/Pandas/Numpy first (find an overview here).

Debugging: If you get errors while running your code, it is usually best to search for the error message on StackOverflow to get hints on where you need to dig deeper into the documentation. One important point when googling error messages: always include the library you are working with in your search request (e.g. search for “pandas dataframe group by” instead of “python group by”). Plain Python behaves differently from its libraries, so a search without the library name can return misleading results.

Questions, feedback and remarks more than welcome —