40 Useful Pandas Snippets
Pandas snippets that come in handy in data analysis work
Pandas is a versatile and powerful library for data science. It’s like a Swiss Army knife: it provides useful functions for almost every task that involves working with data.
To be effective with this tool, you need to know some tricks of the trade. In this article, I detail 40 useful pandas snippets that I use regularly.
For those with an understanding of the Pandas library, the following snippets might be useful.
For those who are unfamiliar with Pandas, the following might help you better understand the library by working through some examples.
The dataset used throughout this article is available on Kaggle.
Code for this article → Deepnote
Table of contents
· Reading data
∘ 1. Filter columns
∘ 2. Parse dates on read
∘ 3. Specify Data Types
∘ 4. Set index
∘ 5. No. of rows to read
∘ 6. Skip rows
∘ 7. Specify NA values
∘ 8. Setting boolean values
∘ 9. Read from multiple files
∘ 10. Copy and Paste into Data Frames
∘ 11. Read tables from PDF files
· Exploratory Data Analysis (EDA)
∘ 12. EDA cheat
· Data Types (dtypes)
∘ 13. Filter columns by dtype
∘ 14. Infer dtype
∘ 15. Downcasting
∘ 16. Manual conversion
∘ 17. Convert all at once
· Column operations
∘ 18. Renaming columns
∘ 19. Add suffix and prefix
∘ 20. Create new columns (Mutate in dplyr terms)
∘ 21. Insert columns at specific positions
∘ 22. if-then-else
∘ 23. Dropping columns
· String operations
∘ 24. Column names
∘ 24. Contains
∘ 25. findall
· Missing values
∘ 26. Checking
∘ 27. Dealing with missing values
· Date operations
∘ 28. Get X hours/days/weeks from today / ago
∘ 29. Filter between two dates
∘ 30. Filter by day/month/year
· Styling data frames
∘ 31. Number format
∘ 32. Let there be colors
· Misc
∘ 33. Get the id of max and min in a column
∘ 34. Apply function to data frame
∘ 35. Randomly shuffle data
∘ 36. Percent change
∘ 37. Assign rank
∘ 38. Check memory usage of data frame
∘ 39. Explode list values to multiple rows
∘ 40. Convert smaller categories to “Others”
Reading data
read_csv can do much more than just read in your data.
Here’s a taste of it. (More in the docs)
1. Filter columns
Only need a couple of columns from the dataset? Use usecols.
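Here is a minimal sketch of usecols. The inline CSV and its column names are made up for illustration and stand in for the article's dataset.

```python
import io
import pandas as pd

# Hypothetical sample data (not the article's Kaggle dataset).
csv_text = """name,year,price,sold
A,2019,10.5,Yes
B,2020,20.0,No
C,2021,15.25,Yes
"""

# usecols keeps only the listed columns while parsing.
df = pd.read_csv(io.StringIO(csv_text), usecols=["name", "price"])
print(df)
```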
2. Parse dates on read
No need to call pd.to_datetime afterwards anymore. Parse dates on read!
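A small sketch of parsing dates on read, assuming a hypothetical date column in ISO format:

```python
import io
import pandas as pd

# Hypothetical sample data; the column names are assumptions.
csv_text = """date,price
2021-01-01,100
2021-01-02,110
"""

# parse_dates converts the listed columns to datetime64 during the read.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df.dtypes)
```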
3. Specify Data Types
Setting category data types at read can save a ton of memory for data frames!
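As a rough illustration, the dtype argument takes a mapping from column name to type. The columns below are invented for the example.

```python
import io
import pandas as pd

# Hypothetical sample data with a repetitive text column.
csv_text = "city,price\nTokyo,1\nOsaka,2\nTokyo,3\n"

# Declare dtypes up front instead of converting after the read.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"city": "category", "price": "float32"},
)
print(df.dtypes)
```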
4. Set index
Setting an index is especially useful for time series data.
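One way this might look for a time series, assuming a hypothetical date column that should become the index:

```python
import io
import pandas as pd

# Hypothetical sample data.
csv_text = """date,price
2021-01-01,100
2021-01-02,110
"""

# index_col picks the index column; parse_dates=True parses the index as dates.
df = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=True)
print(df.index)
```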
5. No. of rows to read
Don’t want to read in a dataset with millions of rows before having a peek at it? Use nrows!
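A quick sketch with nrows, using a tiny made-up CSV in place of a huge file:

```python
import io
import pandas as pd

# Hypothetical sample data; imagine millions of rows instead of four.
csv_text = "id,value\n1,10\n2,20\n3,30\n4,40\n"

# nrows stops parsing after the given number of data rows.
df = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(df)
```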
6. Skip rows
Does your data set have rows with faulty data? Skip them!
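For illustration, skiprows accepts a list of 0-based line numbers, where line 0 is the header. The faulty row below is invented.

```python
import io
import pandas as pd

# Hypothetical sample data; line 2 (0-based, header is line 0) is faulty.
csv_text = "id,value\n1,10\nBAD,ROW\n3,30\n"

# Skip the faulty line by its line number in the file.
df = pd.read_csv(io.StringIO(csv_text), skiprows=[2])
print(df)
```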
7. Specify NA values
If your data uses placeholder values that are supposed to be NA, such as ?, set them at read so you won’t have to convert them later.
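A minimal sketch, assuming a hypothetical score column that uses ? as its missing-value placeholder:

```python
import io
import pandas as pd

# Hypothetical sample data with "?" standing in for missing values.
csv_text = "id,score\n1,?\n2,5\n"

# na_values lists extra strings that should be read as NaN.
df = pd.read_csv(io.StringIO(csv_text), na_values=["?"])
print(df)
```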
8. Setting boolean values
Have a boolean column that’s in the form of Yes and No? Tell pandas about it!
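Something along these lines should work with the true_values and false_values arguments. The sold column is a made-up example.

```python
import io
import pandas as pd

# Hypothetical sample data with a Yes/No column.
csv_text = "id,sold\n1,Yes\n2,No\n"

# Map the text labels to real booleans while reading.
df = pd.read_csv(io.StringIO(csv_text), true_values=["Yes"], false_values=["No"])
print(df.dtypes)
```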
9. Read from multiple files
Is your data in multiple files? Read them all in with glob!
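A sketch of the glob approach. To stay self-contained it first writes three small, hypothetical CSV files to a temporary directory, then reads them back and concatenates them.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Create hypothetical part files so the example can run anywhere.
    for i in range(3):
        part = pd.DataFrame({"id": [i], "value": [i * 10]})
        part.to_csv(os.path.join(tmp, f"part_{i}.csv"), index=False)

    # Collect matching files and stack them into one data frame.
    files = sorted(glob.glob(os.path.join(tmp, "part_*.csv")))
    df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

print(df)
```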
10. Copy and Paste into Data Frames
Looking at some data in Excel but don’t want to download it? Copy it! Pandas can read from your clipboard.
11. Read tables from PDF files
Need to read in tables from PDF files? tabula-py has your back!
Exploratory Data Analysis (EDA)
12. EDA cheat
Want to visualize your dataset but don’t want to write code for plots? With pandas-profiling, you can do it with just one line of code.
Data Types (dtypes)
Here’s a list of dtypes for pandas
13. Filter columns by dtype
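The snippet for this one isn't shown here, so this is a sketch using select_dtypes on a made-up data frame:

```python
import pandas as pd

# Hypothetical sample data mixing text and numeric columns.
df = pd.DataFrame({"name": ["a", "b"], "age": [1, 2], "height": [1.5, 1.7]})

# Keep only numeric columns, or everything except them.
numeric = df.select_dtypes(include="number")
non_numeric = df.select_dtypes(exclude="number")
print(numeric.columns.tolist(), non_numeric.columns.tolist())
```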
14. Infer dtype
Are your numeric columns read in as objects? Let pandas do the work in converting them!
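One option is infer_objects, which is an assumption about what the original snippet used. The sketch below forces object columns on purpose to mimic a bad read.

```python
import pandas as pd

# Hypothetical data stored as object dtype even though it is numeric.
df = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]}, dtype="object")

# infer_objects looks at the actual values and picks better dtypes.
converted = df.infer_objects()
print(converted.dtypes)
```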
15. Downcasting
Pandas’ to_numeric has a nifty feature to downcast the type, allowing you to reduce the data frame’s size.
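A short sketch of downcasting with made-up values:

```python
import pandas as pd

# Hypothetical small integers and floats stored in 64-bit types.
ints = pd.Series([1, 2, 3])
floats = pd.Series([1.5, 2.5, 3.5])

# downcast picks the smallest type that can hold the values.
small_ints = pd.to_numeric(ints, downcast="integer")
small_floats = pd.to_numeric(floats, downcast="float")
print(small_ints.dtype, small_floats.dtype)
```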
16. Manual conversion
If a column contains values that can’t be parsed as numbers, errors="coerce" turns them into NaN instead of raising those nasty errors. You can then fill the resulting NA values with reasonable values using .fillna.
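A minimal sketch of that combination. The fill value of 0 is an arbitrary choice for the example.

```python
import pandas as pd

# Hypothetical column with one value that is not a number.
raw = pd.Series(["1", "2", "oops"])

# Unparseable entries become NaN, which fillna then replaces.
clean = pd.to_numeric(raw, errors="coerce").fillna(0)
print(clean.tolist())
```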
17. Convert all at once
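One way to convert several columns in a single call is astype with a dictionary. This is a sketch on made-up columns, not necessarily the original snippet.

```python
import pandas as pd

# Hypothetical data where numbers were read in as strings.
df = pd.DataFrame({"a": ["1", "2"], "b": ["1.5", "2.5"]})

# Map each column to its target dtype in one go.
converted = df.astype({"a": "int64", "b": "float64"})
print(converted.dtypes)
```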
Column operations
18. Renaming columns
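A sketch with rename, using invented column names:

```python
import pandas as pd

# Hypothetical data frame with terse column names.
df = pd.DataFrame({"nm": ["a"], "prc": [1.0]})

# rename takes a mapping from old names to new names.
renamed = df.rename(columns={"nm": "name", "prc": "price"})
print(renamed.columns.tolist())
```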
19. Add suffix and prefix
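For illustration, add_prefix and add_suffix label every column at once. The labels here are arbitrary.

```python
import pandas as pd

# Hypothetical data frame.
df = pd.DataFrame({"a": [1], "b": [2]})

# Useful before merging frames that share column names.
prefixed = df.add_prefix("x_")
suffixed = df.add_suffix("_2021")
print(prefixed.columns.tolist(), suffixed.columns.tolist())
```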
20. Create new columns (Mutate in dplyr terms)
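A small sketch with assign, which is the closest pandas counterpart to mutate. The price and qty columns are made up.

```python
import pandas as pd

# Hypothetical order data.
df = pd.DataFrame({"price": [2.0, 3.0], "qty": [5, 10]})

# assign returns a new frame; the lambda receives the frame itself.
with_total = df.assign(total=lambda d: d["price"] * d["qty"])
print(with_total)
```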
21. Insert columns at specific positions
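A sketch using DataFrame.insert on hypothetical columns. Note that insert modifies the frame in place.

```python
import pandas as pd

# Hypothetical data frame with columns a and c.
df = pd.DataFrame({"a": [1, 2], "c": [5, 6]})

# Insert column b at position 1 (0-based), between a and c.
df.insert(1, "b", [3, 4])
print(df.columns.tolist())
```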
22. if-then-else
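One common way to express if-then-else is NumPy's where. The threshold and labels below are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical exam scores.
df = pd.DataFrame({"score": [40, 50, 90]})

# np.where(condition, value_if_true, value_if_false)
df["result"] = np.where(df["score"] >= 50, "pass", "fail")
print(df)
```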
23. Dropping columns
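A minimal sketch of dropping columns by name, on made-up data:

```python
import pandas as pd

# Hypothetical data frame.
df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# drop returns a new frame without the listed columns.
trimmed = df.drop(columns=["b", "c"])
print(trimmed.columns.tolist())
```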
String operations
24. Column names
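The .str accessor also works on column names. As a sketch, this tidies some deliberately messy, invented headers:

```python
import pandas as pd

# Hypothetical data frame with untidy column names.
df = pd.DataFrame({" First Name": ["a"], "Last Name ": ["b"]})

# Strip whitespace, lowercase, and replace spaces with underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print(df.columns.tolist())
```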
24. Contains
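A sketch of filtering rows with str.contains, using made-up names:

```python
import pandas as pd

# Hypothetical names.
df = pd.DataFrame({"name": ["Anna", "Brandon", "Chloe"]})

# Case-insensitive substring match; na=False treats missing values as no match.
mask = df["name"].str.contains("an", case=False, na=False)
matches = df[mask]
print(matches)
```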
25. findall
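For illustration, str.findall returns every regex match per row as a list. The strings and pattern are invented.

```python
import pandas as pd

# Hypothetical text values.
s = pd.Series(["a1b22", "no digits", "333"])

# Find all runs of digits in each string.
digits = s.str.findall(r"\d+")
print(digits.tolist())
```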
Missing values
26. Checking
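A quick sketch for counting missing values per column, on made-up data:

```python
import pandas as pd

# Hypothetical data with gaps.
df = pd.DataFrame({"a": [1, None, 3], "b": ["x", None, None]})

# Count missing values per column, and check whether any exist at all.
missing_per_column = df.isna().sum()
has_missing = df.isna().any().any()
print(missing_per_column)
```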
27. Dealing with missing values
More in the docs
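Two common options are dropping or filling. This sketch shows both on made-up data; the fill values (column mean and "unknown") are assumptions, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})

# Option 1: drop every row that has a missing value.
dropped = df.dropna()

# Option 2: fill each column with its own replacement value.
filled = df.fillna({"a": df["a"].mean(), "b": "unknown"})
print(filled)
```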
Date operations
28. Get X hours/days/weeks from today / ago
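A sketch with Timestamp and Timedelta. The offsets are arbitrary examples.

```python
import pandas as pd

now = pd.Timestamp.today()

# Add or subtract a Timedelta to move forward or back in time.
three_hours_ago = now - pd.Timedelta(hours=3)
in_seven_days = now + pd.Timedelta(days=7)
two_weeks_ago = now - pd.Timedelta(weeks=2)
print(three_hours_ago, in_seven_days, two_weeks_ago)
```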
29. Filter between two dates
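One way to do this is Series.between, which includes both endpoints by default. The dates below are made up.

```python
import pandas as pd

# Hypothetical daily data.
df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=5),
    "v": range(5),
})

# Keep rows whose date falls inside the range.
mask = df["date"].between(pd.Timestamp("2021-01-02"), pd.Timestamp("2021-01-04"))
in_range = df[mask]
print(in_range)
```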
30. Filter by day/month/year
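A sketch using the .dt accessor on a hypothetical datetime column:

```python
import pandas as pd

# Hypothetical data with a datetime column.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-12-31", "2021-01-15", "2021-02-01"]),
    "v": [1, 2, 3],
})

# Each date part is available as an attribute of .dt
by_year = df[df["date"].dt.year == 2021]
by_month = df[df["date"].dt.month == 1]
by_day = df[df["date"].dt.day == 31]
print(by_year)
```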
Styling data frames
31. Number format
32. Let there be colors
More styling options in the docs
Misc
33. Get the id of max and min in a column
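A minimal sketch with idxmax and idxmin, which return index labels rather than values. The data is made up.

```python
import pandas as pd

# Hypothetical prices with labeled rows.
df = pd.DataFrame({"price": [5, 9, 2]}, index=["a", "b", "c"])

# Labels of the rows holding the largest and smallest price.
max_id = df["price"].idxmax()
min_id = df["price"].idxmin()
print(max_id, min_id)
```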
34. Apply function to data frame
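As a sketch, apply runs a function over each column by default, or over each row with axis=1. The functions here are arbitrary examples.

```python
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Column-wise: range (max - min) of each column.
col_range = df.apply(lambda col: col.max() - col.min())

# Row-wise: sum across the columns of each row.
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)
print(col_range, row_sum)
```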
35. Randomly shuffle data
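A common approach is to sample the whole frame. The random_state below is only there to make the example repeatable.

```python
import pandas as pd

# Hypothetical data frame.
df = pd.DataFrame({"v": [1, 2, 3, 4, 5]})

# frac=1 samples every row in random order; reset_index tidies the index.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```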
36. Percent change
Useful for time series data
Example: the price of BTC over 3 days, [30000, 33000, 31000] -> [NaN, 0.1, -0.06]
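The BTC example above can be reproduced with pct_change. The series is built by hand from the same three prices.

```python
import pandas as pd

# The three prices from the example above.
prices = pd.Series([30000, 33000, 31000])

# Fractional change relative to the previous row; the first row has no predecessor.
changes = prices.pct_change()
print(changes.round(2).tolist())
```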
37. Assign rank
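A sketch with rank on made-up scores. The dense method and descending order are choices for this example; other tie-breaking methods exist.

```python
import pandas as pd

# Hypothetical scores with a tie.
df = pd.DataFrame({"score": [90, 80, 90, 70]})

# Highest score gets rank 1; ties share a rank with no gaps after them.
df["rank"] = df["score"].rank(ascending=False, method="dense")
print(df)
```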
38. Check memory usage of data frame
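A minimal sketch with memory_usage. Passing deep=True also counts the contents of text columns; the data is made up.

```python
import pandas as pd

# Hypothetical data frame with a text column.
df = pd.DataFrame({"city": ["Tokyo", "Osaka"] * 500, "v": range(1000)})

# Bytes used per column (plus the index), and the total.
per_column = df.memory_usage(deep=True)
total_bytes = per_column.sum()
print(per_column, total_bytes)
```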
39. Explode list values to multiple rows
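A sketch of explode on a hypothetical tags column that holds lists:

```python
import pandas as pd

# Hypothetical data where one cell holds several values.
df = pd.DataFrame({"id": [1, 2], "tags": [["a", "b"], ["c"]]})

# Each list element gets its own row; the other columns are repeated.
exploded = df.explode("tags", ignore_index=True)
print(exploded)
```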
40. Convert smaller categories to “Others”
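One possible approach is sketched below: find categories under a share threshold and replace them. The 10% cutoff and the data are assumptions for the example.

```python
import pandas as pd

# Hypothetical category column: A and B dominate, C and D are rare.
s = pd.Series(["A"] * 10 + ["B"] * 8 + ["C", "D"])

# Share of each category, then the ones below the (assumed) 10% cutoff.
shares = s.value_counts(normalize=True)
small = shares[shares < 0.10].index

# Keep values that are not in the small set; replace the rest.
grouped = s.where(~s.isin(small), "Others")
print(grouped.value_counts())
```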
Hope you found these code snippets useful in your own data work!
If you want more, check out these resources below
- Pandas Documentation in PDF
- Pandas Cheatsheet (Official)
- Master Python’s Pandas library with these 100 tricks
- Pandas cheat sheet for data science
Thanks for reading!