Published in CodeX

We all make these common mistakes while using Pandas.

Trust me, they're easy to fix…

We all make mistakes, and the data world is no exception: plenty of data scientists slip up when it comes to Pandas. The good news is that these mistakes are easy to avoid, and fixing them will make your code more readable and understandable.

Mistake 1: Making Pandas find Data Types

When you import your data into a DataFrame and don't tell Pandas the data types of your columns, Pandas will read the entire dataset into memory just to infer them.

Suppose you have a column full of text. Pandas will read every value to check whether they're all strings, and only after confirming that will it set the column's data type (in practice, the generic "object" type). It then repeats this process for every other column.

Generally, we use df.info() to see how much memory a DataFrame uses, and that's roughly the amount Pandas will consume just to figure out each column's data type.

To solve this, just add the dtype parameter with a dictionary mapping your column names to their data types as strings. For example:

pd.read_csv('Student_profiles.csv', dtype={
    'Name': 'str',
    'Class': 'str',
    'Grade': 'str'
})
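To see the declared types actually take effect, here's a minimal, self-contained sketch. The data is made up for illustration, and using the "category" dtype for low-cardinality columns is my own addition to show a measurable difference, not something the article prescribes:

```python
import io

import pandas as pd

# Hypothetical stand-in for Student_profiles.csv
csv_data = "Name,Class,Grade\nAlice,10A,A\nBob,10B,B\nCara,10A,A\n"

# Letting Pandas infer the types: text columns come back as generic 'object'
df_inferred = pd.read_csv(io.StringIO(csv_data))

# Declaring the types up front instead
df_typed = pd.read_csv(
    io.StringIO(csv_data),
    dtype={"Name": "str", "Class": "category", "Grade": "category"},
)

print(df_inferred.dtypes)  # all 'object'
print(df_typed.dtypes)     # 'Class' and 'Grade' are 'category'
print(df_typed.memory_usage(deep=True))
```

On a real dataset with many rows, declaring types up front skips the inference pass and, with categoricals, can shrink the memory footprint considerably.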

Mistake 2: Leftover DataFrames

The best quality of DataFrames in Pandas is how easy they are to create and change. But the unfortunate side effect of this is most people end up with code like this:

# Change DataFrame 1 and save it into a new DataFrame
df1 = pd.read_csv('file.csv')
df2 = df1.dropna()
df3 = df2.groupby('thing')

By this point you've moved on to df3 and started using it, but df1 and df2 are still sitting in memory. Don't leave extra DataFrames roaming around: if you're on a laptop, they will hurt the performance of almost every other task you do, and if you're on a server, they will hurt the performance of everyone else working on it.

Instead, keep your memory clean:

  • Chain together multiple DataFrame modifications in one line:

df = df.apply(thing1).dropna()

  • As Roberto Bruno Martins pointed out, another way to keep memory clean is to perform operations within functions. You can still unintentionally abuse memory this way, and explaining variable scope is beyond the scope of this article, but if you aren't familiar with it, I'd encourage you to read this:

https://www.datacamp.com/community/tutorials/scope-of-variables-python
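Here's a small sketch of the function-scope approach applied to the earlier snippet. The CSV content and the 'thing' column are hypothetical stand-ins for file.csv:

```python
import io

import pandas as pd

# Hypothetical stand-in for file.csv
CSV = "thing,value\na,1\nb,2\na,\nb,4\n"

def load_clean_counts(csv_buffer) -> pd.Series:
    # df1 and df2 exist only inside this function's scope,
    # so they become eligible for garbage collection as soon
    # as the function returns. Only the final result escapes.
    df1 = pd.read_csv(csv_buffer)
    df2 = df1.dropna()
    return df2.groupby("thing").size()

counts = load_clean_counts(io.StringIO(CSV))
print(counts)
```

The caller ends up holding just the result it needs, instead of three full DataFrames.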

Mistake 3: Manually Configuring Matplotlib

This might be the most common mistake, though it's also the least impactful.

Pandas ships with Matplotlib integration built in, and it sets up sensible chart defaults for every DataFrame. So there's no need to import and configure Matplotlib for every chart when it's already wired into Pandas for you.

Here's an example of doing it the wrong way. Even though this is a basic chart, it's still a waste of code:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(x=df['x'])
ax.set_xlabel('label for column X')
plt.show()

The right way:

df['x'].plot()

Easier, isn’t it? You can do anything on these DataFrame plot objects that you can do to any other Matplotlib plot object. For example:

df['x'].plot.hist(title='Chart title')
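Since the object Pandas hands back is an ordinary Matplotlib Axes, you can keep customizing it with any Axes method. A minimal sketch with made-up data (the non-interactive "Agg" backend is just so this runs anywhere):

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, safe for scripts/CI
import pandas as pd

# Toy data standing in for the article's df['x']
df = pd.DataFrame({"x": [1, 2, 2, 3, 3, 3]})

# Pandas returns a regular Matplotlib Axes object...
ax = df["x"].plot.hist(title="Chart title")

# ...so any Axes method works on it afterwards.
ax.set_xlabel("label for column X")
print(ax.get_title())
```

No manual figure setup needed; Pandas creates the figure and axes for you, and you tweak the result only where the defaults aren't enough.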

Conclusion

I'll be frank: I've made all of these mistakes myself, and I'm surely making other silly ones I don't know about yet. But hopefully sharing these known ones with you and the rest of the data community will help put your hardware to better use, let us write less code, and get more done!

Don't forget to leave your responses. ✌

To get my stories in your mailbox, kindly subscribe to my newsletter.

Thank you for reading! Give your claps and share this with a friend!
