Advanced Pandas: 21 Powerful Tips for Efficient Data Manipulation

Maximize Your Pandas Skills: Essential Tips and Tricks for Mastering Data Manipulation

Fares Sayah
4 min readJul 6, 2024
Photo by Pascal Müller on Unsplash

Pandas, a powerful open-source data manipulation library for Python, is an indispensable tool in any data scientist’s toolkit. Known for its fast performance and intuitive syntax, Pandas excels at handling structured data, making it crucial for tasks ranging from data cleaning to complex analysis. In this article, we’ll explore 21 essential Pandas tips that will boost your productivity, streamline your code, and unlock new capabilities in your data workflows.

1: Print DataFrame in Markdown-friendly format

|    |   a |   b |
|---:|----:|----:|
| 0 | 1 | 5 |
| 1 | 2 | 6 |
| 2 | 3 | 7 |
| 3 | 4 | 8 |
+----+-----+-----+
| | a | b |
+====+=====+=====+
| 0 | 1 | 5 |
+----+-----+-----+
| 1 | 2 | 6 |
+----+-----+-----+
| 2 | 3 | 7 |
+----+-----+-----+
| 3 | 4 | 8 |
+----+-----+-----+

2: Group rows into a list

col1    col3  col2              
a 1.5 [d, e]
b 3.5 [f, g]
c 3.0 [h]

3: DataFrame.explode()

        a   b
0 [1, 2] 11
1 [4, 5] 13
   a   b
0 1 11
0 2 11
1 4 13
1 5 13

4: DataFrame.copy()

    col1  col2
0 7 4
1 8 5
2 9 6
    col1  col2
0 1 4
1 2 5
2 3 6

5: Groupby().count vs Groupby( ).size

  col2 col1      
a 1
b 2
c 2
d 1
      col1  col2
a S 1
b M 1
S 1
c L 2
d L 1
dtype: int64

6: Correlation

a    0.94388
b 0.68313
dtype: float64

7: Cross-Tabulation

person2  Ben  Jone  Patrick  Smith  Warren
person1
Ben 0 0 1 1 0
Jone 0 0 0 0 1
Patrick 1 0 0 1 0
Smith 1 0 1 0 1
Warren 0 1 0 1 0

8: DataFrame.query()

    fruit  price
2 grape 6
3 grape 7
    fruit  price
2 grape 6
3 grape 7

9: Unpivot DataFrame

    fruit  Aldi  Walmart  Costco
0 apple 4 6 1
1 orange 5 7 2
    fruit    store  value
0 apple Aldi 4
1 orange Aldi 5
2 apple Walmart 6
3 orange Walmart 7
4 apple Costco 1
5 orange Costco 2

10: Rename aggregated column

size   price        
L 19.000
M 10.000
S 36.995
size  mean_price          
L 19.000
M 10.000
S 36.995

11: Normalized Value Counts

S     4
M 2
L 1
XL 1
dtype: int64
S     0.500
M 0.250
L 0.125
XL 0.125
dtype: float64

12: df.transform() instead of df.count()

  type  value
0 A 5
1 A 3
2 O 2
4 O 7
5 A 3

13: Fill in Null Values

    orange  apple
0 31.0 4.0
1 5.0 71.0
2 9.0 12.0

14: Value Counts Missing Values

15: Filter Columns in DataFrame

    Temp  Degree
0 Hot 35
1 Cold 3
2 Warm 15
3 Cold 2
    Degree  Temp_Cold  Temp_Hot  Temp_Warm
0 35 0 1 0
1 3 1 0 0
2 15 0 0 1
3 2 1 0 0
    Temp_Cold  Temp_Hot  Temp_Warm
0 0 1 0
1 1 0 0
2 0 0 1
3 1 0 0

16: Convert Data Types Automatically

17: Assign new columns to a DataFrame

                                        text  text_len  word_count  weekend
0 saturday: weekend (not working day) 35 5 True
1 sunday: weekend (not working day) 33 5 True
2 monday: doctor appointment at 2:45pm. 37 5 False
3 tuesday: dentist appointment at 11:30 am. 41 6 False
4 wednesday: basketball game at 7:00pm 36 5 False
5 thursday: back home by 11:15 pm. 32 6 False
6 friday: take the train at 08:10 am. 35 7 False

18: Read HTML Tables

    Year     Office    GOP    DFL Others
0 2020 President 45.3% 52.4% 2.3%
1 2020 Senator 43.5% 48.8% 7.7%
2 2018 Governor 42.4% 53.9% 3.7%
3 2018 Senator 36.2% 60.3% 3.4%
4 2018 Senator 42.4% 53.0% 4.6%
      Location July (°F) July (°C) January (°F) January (°C)
0 Minneapolis 83/64 28/18 23/7 −4/−13
1 Saint Paul 83/63 28/17 23/6 −5/−14
2 Rochester 82/63 28/17 23/3 −5/−16
3 Duluth 76/55 24/13 19/1 −7/−17
4 St. Cloud 81/58 27/14 18/−1 −7/−18

19: ‘nlargest’ and ‘nsmallest ‘

Data Link: IMDB Rating

20: Create a Rank Column

  Students  Marks
0 John 80
1 Smith 56
2 Patrick 95
3 Bob 75
4 Jose 45
  Students  Marks  Rank
0 John 80 2.0
1 Smith 56 4.0
2 Patrick 95 1.0
3 Bob 75 3.0
4 Jose 45 5.0

21: Color Values in DataFrame

--

--

Fares Sayah

Data Scientist | Kaggle Master. Writing on data Science, Machine Learning, and Natural Language Processing