Published in letsprep

Top must-know functions in Pandas

Not this one of course :)

When I began coding in Python, I knew it had a versatile set of libraries that can be imported at any time to get the job done. When I started with data science, one of the most useful libraries I found was Pandas.

Hello everyone! Today I will be writing about the Pandas library. I will try to cover the following :)

1. Introduction

2. Installation

3. Working with Pandas

Introduction :

Yes this one :)

Pandas stands for “Python Data Analysis Library”. According to the Wikipedia page on Pandas, “the name is derived from the term ‘panel data’, an econometrics term for multidimensional structured data sets.” Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is a NumFOCUS sponsored project.

Installation :

To install Pandas you need Python on your system, since it is a Python library. Current Pandas releases require Python 3; Python 2 is only supported by old releases (the 0.24 series was the last). Pandas can be installed either with the Python package installer (pip install pandas) or, if you are using Anaconda, with Conda (conda install pandas). The Pandas installation documentation has more details.

Working with Pandas:

The Pandas library is very useful in data science, where everything starts and ends with data.

Pandas is built around the DataFrame, a two-dimensional labelled table, so let's learn how to create one.

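import numpy as np
import pandas as pd

# The first row holds the column names, the first column holds the row labels
data = np.array([['', 'Col1', 'Col2'],
                 ['Row1', 1, 2],
                 ['Row2', 3, 4]])

print(pd.DataFrame(data=data[1:, 1:],     # the values
                   index=data[1:, 0],     # the row labels
                   columns=data[0, 1:]))  # the column names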
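One detail of the NumPy approach is that an array mixing text and numbers stores every element as a string, so the values in the resulting DataFrame are strings too. As a minimal alternative sketch (the name df_small is only illustrative), the same table can be built from a dictionary, which keeps the numbers as integers:

```python
import pandas as pd

# Each dictionary key becomes a column; the index argument supplies the row labels.
df_small = pd.DataFrame(
    {"Col1": [1, 3], "Col2": [2, 4]},
    index=["Row1", "Row2"],
)

print(df_small)
print(df_small.dtypes)  # both columns keep an integer dtype
```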

Reading files:

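import pandas as pd
import pandas_profiling  # used further down for ProfileReport

# Jupyter/IPython magic: show plots inside the notebook (remove it in a plain .py script)
%matplotlib inline

df = pd.read_csv('data.csv')  # load the CSV file into a DataFrame
df.head(10)                   # show the first 10 rows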

JSON data:

import pandas as pd

# Avoid calling the result `json`, which would shadow the standard-library module
json_df = pd.read_json('https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.json')

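HTML data:

# crypto_url is assumed to be a requests.Response; read_html returns a list of DataFrames, one per <table>
crypto_data = pd.read_html(crypto_url.text)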

Pickle data:

import pickle
import pandas as pd

# crypto_final is assumed to be an existing DataFrame
with open('crypto_final.pickle', 'wb') as sub_data:
    pickle.dump(crypto_final, sub_data, protocol=pickle.HIGHEST_PROTOCOL)

crypto_final = pd.read_pickle('crypto_final.pickle')  # load it back from disk
crypto_final.head()
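Pandas also has its own shortcut for the writing half, DataFrame.to_pickle(). The sketch below is a self-contained round trip under assumed data: df_orig and its columns are made up for illustration, and a temporary directory stands in for a real path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data standing in for crypto_final
df_orig = pd.DataFrame({"coin": ["BTC", "ETH"], "price": [1.0, 2.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pickle")
    df_orig.to_pickle(path)         # write the DataFrame to disk
    df_back = pd.read_pickle(path)  # read it back into a new DataFrame

print(df_back.equals(df_orig))  # True when the round trip preserved the data
```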

Reading a file with certain conditions:

df = pd.read_csv('data.csv', sep=';', encoding='latin-1', nrows=1000, skiprows=[2,5])
df.head(10)
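To see what these arguments do without needing a file on disk, here is a small sketch that reads from an in-memory string instead. The contents of raw and the name df_part are made up for illustration; encoding is left out because it only matters when Pandas has to decode bytes from a file.

```python
import io

import pandas as pd

# Hypothetical file contents: semicolon-separated, one header line, six data rows
raw = "name;score\nA;1\nB;2\nC;3\nD;4\nE;5\nF;6\n"

df_part = pd.read_csv(
    io.StringIO(raw),  # stands in for a path such as 'data.csv'
    sep=";",           # fields are separated by semicolons, not commas
    skiprows=[2, 5],   # skip file lines 2 and 5 (0-based, line 0 is the header)
    nrows=3,           # read at most three data rows
)
print(df_part)
```

Here lines 2 and 5 are the rows for B and E, so they are skipped, and nrows=3 then stops the read after three of the remaining rows.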

Inspecting Data :

Now that you’ve loaded your data, it’s time to take a look. Running the name of the data frame shows the whole table, but you can also get the first n rows with df.head(n) or the last n rows with df.tail(n). df.shape gives you the number of rows and columns. df.info() gives you the index, datatype and memory information. s.value_counts(dropna=False) lets you view the unique values and their counts for a series (a single column, for example). A very useful command is df.describe(), which outputs summary statistics for the numerical columns. It is also possible to get statistics on the entire data frame or on a series:

  • df.mean(): returns the mean of each column (in Pandas 2.0 and later, pass numeric_only=True if the data frame also has non-numeric columns)
  • df.corr(): returns the correlation between the columns of a data frame (the same numeric_only note applies)
  • df.count(): returns the number of non-null values in each data frame column
  • df.max(): returns the highest value in each column
  • df.min(): returns the lowest value in each column
  • df.median(): returns the median of each column
  • df.std(): returns the standard deviation of each column
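The inspection calls above can be tried on a tiny frame. This is only a sketch: df_demo and its values are invented for illustration, with one missing Age and one text column so that the null handling and numeric_only show up.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names are illustrative only
df_demo = pd.DataFrame(
    {
        "Age": [22.0, 38.0, np.nan, 35.0],
        "Fare": [7.25, 71.28, 7.92, 53.10],
        "Sex": ["male", "female", "female", "female"],
    }
)

print(df_demo.shape)       # (number of rows, number of columns)
print(df_demo.head(2))     # first two rows
df_demo.info()             # index, dtypes, non-null counts, memory usage
print(df_demo.describe())  # summary statistics for the numeric columns
print(df_demo["Sex"].value_counts(dropna=False))  # count per unique value
print(df_demo.mean(numeric_only=True))  # column means, skipping the text column
print(df_demo.count())     # non-null values per column
```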

Selecting particular data

df.loc[8]

Out[81]:

PassengerId                                                    9
Survived                                                       1
Pclass                                                         3
Name           Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
Sex                                                       female
Age                                                           27
SibSp                                                          0
Parch                                                          2
Ticket                                                    347742
Fare                                                     11.1333
Cabin                                                        NaN
Embarked                                                       S
Name: 8, dtype: object

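Getting all the details and a brief report at once using ProfileReport. Note that the pandas_profiling package has since been renamed ydata-profiling, so on a recent install the import may need to be ydata_profiling instead.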

pandas_profiling.ProfileReport(df)

[Figure: profile report for the dataset]

Plotting data with pandas :

df['Age'].plot()  # line plot of Age against the row index

df['Age'].hist()  # histogram of the Age values

Manipulating data:

  • .loc[] works on the labels of your index. This means that df.loc[2] looks for the rows of your DataFrame whose index label is 2.
  • .iloc[] works on the positions in your index. This means that df.iloc[2] looks for the row of your DataFrame at position 2 (the third row, since positions start at 0).
  • .ix[] was a more complex case: when the index was integer-based, you passed a label to .ix[], so ix[2] looked for values with an index labeled 2, just like .loc[]. If the index was not solely integer-based, ix worked with positions, just like .iloc[]. Because of this ambiguity, .ix[] was deprecated in Pandas 0.20 and removed in Pandas 1.0; use .loc[] or .iloc[] instead.
  • You can remove duplicate rows from your DataFrame by executing df.drop_duplicates().
  • To get rid of (a selection of) columns from your DataFrame, you can use the drop() method, for example df.drop(columns=['Cabin']). It returns a new DataFrame and leaves the original unchanged.
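The difference between labels and positions is easiest to see when the two disagree. The following sketch uses a made-up frame (df_sel, with the index deliberately out of order) to show .loc[], .iloc[], drop_duplicates() and drop() side by side:

```python
import pandas as pd

# Hypothetical frame whose integer labels deliberately differ from the positions
df_sel = pd.DataFrame(
    {"name": ["a", "b", "b", "c"], "value": [10, 20, 20, 30]},
    index=[2, 0, 1, 3],
)

by_label = df_sel.loc[2]      # row whose index label is 2, which is the first row
by_position = df_sel.iloc[2]  # row at position 2, which is the third row

deduped = df_sel.drop_duplicates()              # drops the repeated ("b", 20) row
without_value = df_sel.drop(columns=["value"])  # new frame without the column

print(by_label["name"], by_position["name"])
print(deduped)
print(without_value.columns.tolist())
```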

Reshaping your Dataframe :

You can use the pivot() function to create a new derived table out of your original one. When you use the function, you can pass three arguments:

  1. values: the column of your original DataFrame whose values fill the cells of the pivot table.
  2. columns: the unique values of the column you pass here become the column labels of the resulting table.
  3. index: the unique values of the column you pass here become the index (row labels) of the resulting table.
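As a small sketch of the three arguments working together, the made-up long_df below has one row per date and city, and pivot() turns it into one row per date with one column per city. All names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, city) pair
long_df = pd.DataFrame(
    {
        "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
        "city": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
        "temp": [15, 25, 16, 26],
    }
)

# One row per date, one column per city, cells filled from the temp column
wide_df = long_df.pivot(index="date", columns="city", values="temp")
print(wide_df)
```

pivot() raises an error if the same index/columns pair occurs more than once; in that case pivot_table(), which aggregates the duplicates, is the usual alternative.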

Iterating Over a Pandas DataFrame:

You can iterate over the rows of your DataFrame with a for loop in combination with an iterrows() call on your DataFrame. Each iteration gives you the index label and the row itself as a Series.
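A minimal sketch of such a loop, using a made-up two-row frame (df_iter) and summing one column by hand:

```python
import pandas as pd

df_iter = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})  # hypothetical data

total = 0
for label, row in df_iter.iterrows():
    # label is the index label; row is a Series holding that row's values
    print(label, row["name"], row["value"])
    total += row["value"]

print(total)
```

For large frames a column operation such as df_iter["value"].sum() is usually much faster than looping, so iterrows() is best kept for cases where you really need row-by-row logic.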

Write a Pandas DataFrame to a File

To write a DataFrame as a CSV file, you can use to_csv():

import pandas as pd

# df is assumed to be an existing DataFrame
df.to_csv('myDataFrame.csv')

To use a different delimiter or a specific character encoding, you can use the sep and encoding arguments:

df.to_csv('myDataFrame.csv', sep='\t', encoding='utf-8')

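Similarly to writing your DataFrame to CSV, you can use to_excel() to write your table to Excel. It is a bit more involved, because it goes through an ExcelWriter and needs an Excel engine such as openpyxl installed:

import pandas as pd

# The with block saves and closes the file; writer.save() was removed in Pandas 2.0
with pd.ExcelWriter('myDataFrame.xlsx') as writer:
    df.to_excel(writer, sheet_name='DataFrame')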

Apply Function:

apply() is one of the most commonly used functions for manipulating data and creating new variables. It passes each row or column of a data frame to a function and returns the results. The function can be either built-in or user-defined.

def num_missing(x):
    # x is one column (axis=0) or one row (axis=1), passed in as a Series
    return sum(x.isnull())

# Applying per column:
print("\n\nMissing values per column:\n\n")
print(df.apply(num_missing, axis=0))
# axis=0 means the function is applied to each column

# Applying per row:
print("\n\nMissing values per row:\n\n")
print(df.apply(num_missing, axis=1).head())
# axis=1 means the function is applied to each row; head() shows the first five results
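To make the per-column and per-row results concrete, here is a self-contained sketch on a made-up frame (df_missing) with a few missing values. It uses x.isnull().sum(), which counts the same thing as the sum(...) form above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values
df_missing = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]}
)

def num_missing(x):
    """Count the missing values in a Series (one column or one row)."""
    return x.isnull().sum()

per_column = df_missing.apply(num_missing, axis=0)  # one result per column
per_row = df_missing.apply(num_missing, axis=1)     # one result per row

print(per_column)
print(per_row)
```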

For all the code and the dataset, you can refer to my GitHub repo.

Follow me on GitHub, as I keep contributing to open source and helping the community from which I learned a lot.

In case you are an engineering student looking for videos, do refer to letsprep.in. It uses AI to pick the best videos from YouTube for you, and it covers all the engineering subjects.

Do follow me on LinkedIn for more awesome content.

If you liked the blog, do clap. And follow for more awesome content coming right to you :)

Raja Mishra

Python Developer @HipaaS Inc , Data scientist and Full stack developer