Minimally Sufficient Pandas Cheat Sheet

Ted Petrou
Jan 31, 2019 · 4 min read

This article summarizes the very detailed guide presented in Minimally Sufficient Pandas.

Begin Mastering Data Science Now for Free!

Take my free Intro to Pandas course to begin your journey mastering data analysis with Python.

What is Minimally Sufficient Pandas?

  • It is a small subset of the library that is sufficient to accomplish nearly everything that it has to offer.
  • It allows you to focus on doing data analysis and not the syntax
  • All common data analysis tasks will use the same syntax
  • Fewer commands will be easier to commit to memory
  • Your code will be easier to understand by others and by you
  • It will be easier to put Pandas code in production
  • It reduces the chance of landing on a Pandas bug.

Specific Guidance

Use the brackets and not dot notation to select a single column of data because the dot notation cannot column names with spaces, those that collide with DataFrame methods and when the column name is a variable.

>>> df[‘colname’] # do this>>> df.colname    # not that

The ix indexer is ambiguous and confusing (and now deprecated) as it allows selection by both label and integer location. Every trace of ix should be removed and replaced with the explicitlocor iloc indexers.

The at and iat indexers give a small increase in performance when selecting a single DataFrame cell. Use NumPy arrays if your application relies on performance for selecting a single cell of data and not at or iat.

The only difference between these two functions is the default delimiter. Use read_csv for all cases as read_table is deprecated.

isna is an alias of isnull and notna is an alias of notnull. Use isna and notna as they end with ‘na’ like the other missing value methods fillna and dropna.

Use the operators( +, *, >, <=, etc..) and not their corresponding methods ( add, mul, gt, le, etc…) in all cases except when absolutely necessary such as when you need to change the direction of the alignment.

Use the Pandas method over any built-in Python function with the same name.

There are a few different syntaxes available to do a groupby aggregation. Use df.groupby('grouping column').agg({'aggregating column': 'aggregating function'}) as it can handle more complex cases.

A DataFrame with a MultiIndex offers little benefit over one with a single-level index. I advise against using them. Instead, flatten them after a call to groupbyby renaming columns and resetting the index.

A groupby aggregation and a pivot_table produce the same exact data with a different shape. Use gropuby when you want to continue an analysis and pivot_table when you want to compare groups.

The pivot_table method and the crosstab function are very similar. Only use crosstab when finding the relative frequency.

The pivot method pivots data without aggregating. It is possible to duplicate its functionality with pivot_table by selecting an aggregation function. Consider using only pivot_table and not pivot.

Both the melt and stack methods reshape the data in a very similar manner. Use melt over stack because it allows you to rename columns and it avoids a MultiIndex.

Both pivot and unstack work reshape data similarly but from above, pivot_table can handle all cases that pivot can, so I suggest using it over both of the others.

The above examples are the most common areas of Pandas where multiple options are available to its users. There are many other attributes and methods that are not discussed. Below, I provide a categorized list of the minimum amount of DataFrame attributes and methods that can accomplish nearly all of your data analysis tasks. It reduces the number from over 240 to less than 80.

  • columns
  • dtypes
  • index
  • shape
  • T
  • values

These result in a single value for each column

  • all
  • any
  • count
  • describe
  • idxmax
  • idxmin
  • max
  • mean
  • median
  • min
  • mode
  • nunique
  • sum
  • std
  • var
  • abs
  • clip
  • corr
  • cov
  • cummax
  • cummin
  • cumprod
  • cumsum
  • diff
  • nlargest
  • nsmallest
  • pct_change
  • prod
  • quantile
  • rank
  • round
  • head
  • iloc
  • loc
  • tail

Missing Value Handling

  • dropna
  • fillna
  • interpolate
  • isna
  • notna
  • expanding
  • groupby
  • pivot_table
  • resample
  • rolling
  • append
  • merge
  • asfreq
  • astype
  • copy
  • drop
  • drop_duplicates
  • equals
  • isin
  • melt
  • plot
  • rename
  • replace
  • reset_index
  • sample
  • select_dtypes
  • shift
  • sort_index
  • sort_values
  • to_csv
  • to_json
  • to_sql
  • pd.concat
  • pd.crosstab
  • pd.cut
  • pd.qcut
  • pd.read_csv
  • pd.read_json
  • pd.read_sql
  • pd.to_datetime
  • pd.to_timedelta

Master Python, Data Science and Machine Learning

Immerse yourself in my comprehensive path for mastering data science and machine learning with Python. Purchase the All Access Pass to get lifetime access to all current and future courses. Some of the courses it contains:

Get the All Access Pass now!

Dunder Data

Expert python data science training — take a course at dunderdata.com

Ted Petrou

Written by

Author of Master Data Analysis with Python and Founder of Dunder Data

Dunder Data

Expert python data science training — take a course at dunderdata.com

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade