10 Pandas Mistakes You Should NEVER Make

6 min readMar 1, 2023

This article will review the 10 most common mistakes in Pandas that beginners make when they start learning Pandas in Python. Pandas is a great tool but some errors can be very expensive and that we should avoid.

1. Save CSV File

import pandas as pd
df = pd.read_csv('datatset.csv')
df.to_csv('dataset_clear')

Finally, when we finish the dataset cleaning process, we want to save it, in this case, it is a CSV file, we use the to_csv() method, and we managed to save the data, but when you open the new file you will realize that it will have an additional variable that is like a kind of index.

import pandas as pd
df = pd.read_csv('datatset.csv')
df.to_csv('dataset_clear',index = False)

We use the same method with the difference of setting the parameter index = False, to eliminate this annoying column.

learn the proper way of grouping values

2. Sintaxis

infiniti.groupby('model')['price'].mean().sort_values(ascending = False).head(5)

Normally when we want to perform a query in Pandas we do it sequentially, however, it can cause it to be unfriendly to read.

(
infiniti.groupby('model') \
['price'].mean()  \
.sort_values(ascending = False) \
.head(5)

)

Relying on the symbols “()” and “\”, we can make the query in parts making it more friendly to read.

3. Group Values correctly

import pandas as pd
infiniti = pd.read_csv('infiniti.csv')

sedan = infiniti.query('type=="sedan"')['price'].max()
suv = infiniti.query('type=="SUV"')['price'].max()
coupe = infiniti.query('type=="coupe"')['price'].max()

I remember when he started learning Pandas to know a statistical measure, he made the horrible mistake of creating several subsets and later performing the operation of interest.

import pandas as pd
infiniti = pd.read_csv('infiniti.csv')

infiniti.groupby('type')['price'].max()

Relying on this great groupby() method, allows us to calculate statistical measures based on a category, we can even use it for a customized Python function.

4. Filter Data

import pandas as pd
infiniti = pd.read_csv('infiniti.csv')

infiniti[(infiniti.year>=2016) & (infiniti.price>32000)]

We commonly use this syntax to filter data, it can be a bit cumbersome to read but luckily there is a method to filter data in a better way.

import pandas as pd
infiniti = pd.read_csv('infiniti.csv')

infiniti.query('year>=2016 and price>=32000')

Thanks to the query method it allows a more understandable syntax and is easier to code than using the stupid square brackets.

5. Query Strings

import pandas as pd
infiniti = pd.read_csv('infiniti.csv')

price_min = 32000
price_max = 42000

infiniti.query(f'price>={price} and price<={price_max} and cylinders==6')

You can use the query method with another way of Python called formatted string literals, however, it is not the best way, there is another method to do the same with cleaner code.

import pandas as pd
infiniti = pd.read_csv('infiniti.csv')

price_min,price_max = 32_000,42_000
infiniti.query('price>=@price_min and price<=@price_max and cylinders==6')

Create two variables and filter them using the “@” symbol to make it easier and more elegant to use the variables.

import pandas as pd
infiniti = pd.read_csv('infiniti.csv')

(
infiniti.query('price>=@price_min') \
.query('price<=@price_max') \
.query('cylinders==6')
)

We can choose to make queries in parts, managing to reduce the length of the query, making it more comfortable to read.

6. Vectorization is better than using apply and Lambda Functions

We commonly use the apply function to modify data, believing that it is the most efficient way to do the tasks, however, there is another more efficient and faster way to write.

By carrying out this simple line of code we obtain the same results with the advantage that it is easier to understand and read, as well as being faster to preprocess. Automatically the cases that do not meet this logical condition will be assigned a False, otherwise a True.

7. Use np.select() instead of nested conditions.

import pandas as pd
import numpy as np

infiniti = pd.read_csv('infiniti.csv')
q50 = infiniti.query('model=="Q50"')

def engine_size(year,cyl):
  if year<=2015:
    engine = 3.7

  if year>=2016 and year<=2018 and cyl==4:
    engine = 2.0
  
  if year>=2016 and cyl==6:
    engine = 3.0

  return engine  

q50['engine_size'] = q50.apply(lambda x: engine_size(x.year,x.cylinders), axis = 1)

We are commonly used to using nested if statements to establish multiple conditions. We create a function to assign the corresponding engine for the Infiniti Q50 model, we use the apply method together with the lambda functions to pass the corresponding parameters to the function. It does its job, but in programming, there are different ways to reach the same result. This method has the great disadvantage that it can be more complicated to read, as well as presenting a higher computational cost compared to the following method.

import pandas as pd
import numpy as np

infiniti = pd.read_csv('infiniti.csv')
q50 = infiniti.query('model=="Q50"')

filters = [
    (q50.year<=2015),
    (q50.year>=2016) & (q50.year<=2018) & (q50.cylinders==4),
    (q50.year>=2016) & (q50.cylinders==6)
    
]
engine = [3.7,2.0,3.0]
q50['engine_size'] = np.select(filters,engine)

We rely on the numpy library, we use the np. select() method where we will assign two lists, the first will have the data filters and the second list will contain the labels, in this case, the displacement of the motor, as you can see it is easier to read the code as well as being faster computationally.

8. Inefficient data labeling

import pandas as pd
infiniti = pd.read_csv('infiniti.csv')

def horsepower_grade(hp):
  
  if hp<=150:
    hp_grade = "low"
  
  if hp>150 and hp<=200:
    hp_grade = "med"

  if hp>200 and hp<=280:
    hp_grade = "high"

  if hp>280:
    hp_grade = "very_high"

  return hp_grade

infiniti['hp_grade'] = infiniti.hp.apply(horsepower_grade)

We made the same mistake similar in the previous case, we realize that they have a hierarchical order for this we can label the data by chance there is a method that makes life easier for us.

infiniti['hp_grade'] = pd.cut(infiniti.hp,
                             bins = 4,
                             labels = ['low','med',
                                     'high','very_high'])

Thanks to the pd.cut() method, it allows labeling the data in a simpler and faster way.

9. Not graphing using Pandas

When we start learning Data Science we believe that using libraries like Matplotlib or similar is the only way to create graphs, however, they can also be created using pandas.

As you can see we can make exactly the same graph using only pandas in addition to presenting a more understandable code.

10. Saving the dataset in an incorrect format

import pandas as pd
df = pd.read_csv('longdatset.csv')
df.to_csv('longdaset.csv',index = False)

Many times we are faced with extremely long datasets if we save the data in CSV or XLSX format, it can be very computationally expensive and would generate a rather large file, fortunately, there are other options for saving formats.

import pandas as pd
df = pd.read_csv('longdatset.csv')
df.to_parquet('longdataset.parquet')

If we save the data in parquet format, we can drastically reduce the computational cost in addition to generating a new file with little weight, we can reduce a file larger than 1GB to just a few KB.

Thank you very much for reading this article, I hope you have learned a lot and that you put it into practice : )

It may interest you….

Eliminate Underfitting and Overfitting with these tricks

Underfitting and Overfitting are problems that we face in our day, in this article you will learn the most powerful…

medium.com

Improve Your ML Models With This Trick

Feature engineering is the most important process you must master. You will learn a series of very powerful techniques…

becominghuman.ai

Create a Beautiful App Using Streamlit

If you want to take your ML projects to another level, where different users can use your ML algorithm in a web…