Be a More Efficient Data Scientist by Using Pandas Tricks

In this article, I will present some useful Pandas tricks that I use in my day-to-day job. So let’s get started

Georgios Drakos
Jan 21 · 4 min read

✏️ Table of Contents

  • Fix messy dataframe column names
  • Remove Outliers
  • Inspect CSV files
  • Calculate Percentage Change
  • Date Ranges
  • Left merge with indicator
  • Compress & Save Pandas Dataframe
  • Conclusion

✅ Fix messy dataframe column names

Sometimes dataframe columns have extra spaces or are just plain odd, even if they look normal. One simple way to fix them is:

df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')

✂️ Remove Outliers

Sometimes you want to remove records where a column has extreme values. This can be done easily by running the following command:

train.loc[train[col_name] < np.percentile(train[col_name],95)]

🔍 Inspect CSV files

Sometimes it is useful to inspect a CSV file before loading it using Pandas in case you need to skip some rows. This can be achieved by the following command:

! head -10 'file_name.csv'
df = pd.read_csv(‘file_name.csv’,skiprows=10)

🛠 Calculate Percentage Change

You can read the full documentation of the above pandas function using the link below:

df["perc_change"] = df[col_name].pct_change()

🗓Date Ranges

You can use pd.date_range to fill the missing dates of a Pandas dataframe.

rng = pd.date_range('2015-02-24', periods=15, freq='W')
df = pd.DataFrame({ 'date': rng, 'val' : np.random.randn(len(rng))})
display(df.head())idx = pd.date_range(df.date.min(),df.date.max())
df = df.set_index('date')
df_fill = df.reindex(idx, fill_value=0)
plt.figure(figsize=(15,10))
plt.xlabel('Dates')
plt.ylabel('Number of tweets')
plt.plot(df.index,df.val,label='unfilled')
plt.plot(df_fill.index,df_fill.val,label='filled')
plt.legend()
plt.grid()
plt.show()

🔗 Left merge with indicator

Two useful arguments of the pd.merge function are:

  • indicator: You pass a boolean value (True or False). If True it adds a column to output DataFrame called “_merge” with information on the source of each row. Possible values are: “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.

💾 Compress & Save Pandas Dataframe

We usually save a pandas Dataframe as a CSV file by running the following command.

df.to_csv('dataset.csv', index=False)
df.to_csv('dataset.gz', compression='gzip', index=False)
df = pd.read_csv('dataset.gz')

🤖 Conclusion

This brings us to the end of this article. Hope you learn some new useful pandas commands. As Roman Orac has highlighted in one of his posts:

💪💪As always keep studying, keep creating 🔥🔥

Towards AI

Towards AI, is the world’s fastest-growing AI community for learning, programming, building and implementing AI.

Georgios Drakos

Written by

📊Data Scientist, 🎓 Imperial MSc Graduate, 📌 London

Towards AI

Towards AI, is the world’s fastest-growing AI community for learning, programming, building and implementing AI.

More From Medium

More from Towards AI

More from Towards AI

More from Towards AI

Image Filtering

More from Towards AI

___
Mar 29 · 8 min read

103

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade