Nikhil Ravindra Akki
GreyAtom
Published in
3 min readJun 3, 2017

--

Before we go to the problem statement lets just agree on a few things — All of us have used MS Excel at some point or the other. It is relatively easy to use and it packs a lot of features. Excel can be used to do a lot of things, it is a great tool and has been developed for decades. To this day many people (including myself), businesses big or small use it quite extensively.

However, there are a few things which I have found to be easier to do with Pandas when compared with Excel. These tasks could have been achieved with Excel, but I choose Pandas as an experiment.

Reasons for choosing Pandas over Excel -

1. Love for Python in general :-)

2. Automation with a just few lines of code.

3. To learn Pandas framework

Problem statement & Solutions –

1. Multiple files with Student information (Over 1,00,000 records / rows per excel sheet ~20mb file or ~28mb CSV file)

a. Excel has an excellent graphical user interface but this becomes a pain even with a small 20–28mb file. I tried to work on a Core i5 Desktop with 8 GB ram and on a Core i5 MacBook, after a few minutes, Excel would simply crash.

2. Sorting data based on various criteria (ex — score/percentile, pin code etc.)

a. This was one of the key reasons why I choose Pandas -I do not know how would one separate records based on column values in Excel (maybe vlookup or hlookup, if you know how to do this, please let me know in the comments section below).

def pincode_split(pincodes, df):
frames = []
for pin in pincodes:
frames.append(df[df['PINCODE']==pin])
new_df = pd.concat(frames)
return new_df

def score_cet(df, lower, upper):
newdf = df[df['PERCENTILE']<=cf1]
newdf = newdf[newdf['PERCENTILE']>=cf2]
return newdf​
pincode_split and score_cet are two simple functions which takes multiple arguments, they are used to check certain
conditions on specific column fields. Both functions return a Data-frame at the end.

3. Removing duplicate data

a. This is one of the easiest tasks to do in Pandas, just one line of code! and no, I’m not kidding it is true

df.drop_duplicates(subset=[column names])

This command will remove all the duplicates from the dataset by checking duplicate values in column fields passed in the subset argument.

4. Realign order of columns along with row data

a. Again, this is achieved quite effortlessly thanks to pandas framework as it does all the heavy lifting for you.

New_df = df[df[‘Name’,’Email’,’Score’,’Contactno’]]

This code creates a new_df with column fields selected in block [ ] quotes.

5. Saving / creating sorted CSV file –

a. Saving excel file with so many rows and columns has be very slow for me. However, with Pandas it took 1/10th of the time taken by Excel to save the same file on same hardware configuration.

df.to_csv('new.csv')

I have been learning Python for a quite a few months now and I keep learning new things every day. I highly recommend to those who want to get into Data Science, try using Pandas for Data Munging & Analysis as much as possible. It has been a fun yet intense ride learning Python & Pandas.

--

--