Few Things You Should Be Able to Do With the Pandas Library.

Promise Shittu · Published in The Startup · 5 min read · Sep 7, 2020

What is Pandas?
Pandas is an open-source data analysis library that provides easy-to-use data structures and data analysis tools. It depends on other libraries such as NumPy (read about Few Things You Should Be Able to Do With the Numpy Library) and has a number of optional dependencies.

What does Pandas have to offer?
- A fast and efficient DataFrame object.
- Efficient data wrangling.
- High performance in data cleaning and preparation.
- Efficient handling of missing data.
- Grouping of data for aggregations and transformations.
- Powerful merging and joining of data sets.
- Time-series functionality.

How to install Pandas?
The easiest way to get Pandas set up is to install it through a distribution such as Anaconda, a cross-platform distribution for data analysis and scientific computing, or with pip as shown below.

pip install pandas

Data structures of Pandas
Some of the data structures available with Pandas are:

Series: A series is a 1-dimensional labelled array that can hold data of any type. Its axis labels are collectively called the index.

Data frames: A data frame is a 2-dimensional labelled data structure with columns of potentially different types.

Panels: A panel is a 3-dimensional container of data. (Note that Panel was deprecated and has been removed in modern versions of pandas.)
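A minimal sketch of the first two structures (the labels and values here are made up for illustration):

```python
import pandas as pd

# A Series: a 1-dimensional labelled array; the labels form the index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# A DataFrame: a 2-dimensional labelled structure; each column is a Series
df = pd.DataFrame({'x': [1, 2], 'y': [3.0, 4.0]})
```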

Using Pandas
To use pandas, it has to be imported into our coding environment. This is conventionally done with the following command:

import pandas as pd

What are DataFrames?
A DataFrame is a table: it has rows and columns. Each column of a DataFrame is a Series object, and each row consists of the elements of those Series. A DataFrame can be constructed from a dictionary of keys and values.

It is illustrated below.

df = pd.DataFrame({'hobbies': ['Playing piano', 'singing', 'movies'], 'likes': ['reading', 'writing', 'learning'], 'dislikes': ['laziness', 'lateness', 'dishonesty']})

To create a series, use:

data = pd.Series(['promise', 'ogooluwa', 'daniel'])

To read CSV files, use:

df = pd.read_csv('filename.csv')

Dataframe operations

Some DataFrame operations are illustrated in the code snippets below.

#Importing pandas
import pandas as pd
#Creating a dataframe with a dictionary
df = pd.DataFrame({'hobbies': ['Playing piano', 'coding', 'movies'], 'likes': ['reading', 'writing', 'learning'], 'dislikes': ['laziness', 'lateness', 'cheating']})

#To check the data types of a dataframe's columns, use the below command:
df.dtypes
#To check the type of the object itself, use:
type(df)
#To see the columns of a dataframe, use:
df.columns
#To create a dataframe specifying the columns and index, you can use, as illustrated below:
data = [['Playing piano', 'singing', 'movies'], ['reading', 'writing', 'learning'], ['laziness', 'lateness', 'dishonesty']]
df1 = pd.DataFrame(data, columns=['Promise', 'Michael', 'Gloria'], index=['hobbies', 'likes', 'dislikes'])
#Indexing
#To locate an element explicitly, i.e. by label, use:
df1.loc[['likes']]
#To locate an element implicitly, i.e. by integer position, use:
df1.iloc[[1]]
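To make the loc/iloc distinction concrete, here is a small sketch built on the same example frame; both calls pick out the same row, one by label and one by integer position:

```python
import pandas as pd

data = [['Playing piano', 'singing', 'movies'],
        ['reading', 'writing', 'learning'],
        ['laziness', 'lateness', 'dishonesty']]
df1 = pd.DataFrame(data, columns=['Promise', 'Michael', 'Gloria'],
                   index=['hobbies', 'likes', 'dislikes'])

# loc selects by label, iloc by position; here they return the same row
by_label = df1.loc['likes']
by_position = df1.iloc[1]
```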

Filtering Data frames
Filtering DataFrames is illustrated in the code snippet below:

#Importing pandas
import pandas as pd
#Data frame creation
data = [['Playing piano', 'singing', 'movies'], ['reading', 'writing', 'learning'], ['laziness', 'lateness', 'dishonesty']]
df1 = pd.DataFrame(data, columns=['Promise', 'Michael', 'Gloria'], index=['hobbies', 'likes', 'dislikes'])
#To transpose, use:
df1.T
#To filter with conditions, use this command:
#df1[condition]
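As a concrete illustration of the df1[condition] pattern, the sketch below filters a small frame (names and ages are made up) with a boolean mask:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Promise', 'Michael', 'Gloria'],
                   'age': [21, 25, 19]})

# df[condition] keeps only the rows where the boolean mask is True
adults = df[df['age'] >= 21]
```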

Arithmetic operation
Arithmetic operations on DataFrames are aligned on the row and column labels of the frames being operated on.
Other operations are shown in the snippet below:

#Importing pandas
import pandas as pd
import numpy as np
#Data frame creation
data = [['Playing piano', 'singing', 'movies'], ['reading', 'writing', 'learning'], ['laziness', 'lateness', 'dishonesty']]
df1 = pd.DataFrame(data, columns=['Promise', 'Michael', 'Gloria'], index=['hobbies', 'likes', 'dislikes'])
#To give an overview of what is in the dataframe, use:
df1.describe()
#To sum what is in a dataframe, use:
df1.sum()
#To represent a missing value using NumPy, use:
np.nan
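The label alignment mentioned above is easier to see with numeric frames; this sketch (with made-up row labels) adds two frames that share only one row label:

```python
import pandas as pd
import numpy as np

a = pd.DataFrame({'x': [1, 2]}, index=['r1', 'r2'])
b = pd.DataFrame({'x': [10, 20]}, index=['r2', 'r3'])

# Addition aligns on row and column labels; unmatched labels become NaN
total = a + b
```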

Data cleaning
In Pandas, missing values are referred to as NA (not available); the NaN sentinel is used for simplicity and performance reasons. Data cleaning is the process of preparing data for analysis.
This can be illustrated in the following code snippets:

# Importing pandas
import pandas as pd
import numpy as np
# Data series creation (with a missing value so the cleaning calls below have something to do)
data = pd.Series(['singing', 'eating', np.nan])
#Data frame creation
df1 = pd.DataFrame([['Playing piano', np.nan, 'movies'], [np.nan, np.nan, np.nan], ['laziness', 'lateness', np.nan]], columns=['Promise', 'Michael', 'Gloria'], index=['hobbies', 'likes', 'dislikes'])
# To remove NaN values, use:
data.dropna()
# To get True/False for non-null values, use:
data.notnull()
data[data.notnull()]
# The above is the same as
data.dropna()
#To drop only the rows where every value is NaN (you could also pass axis), use:
df1.dropna(how="all")
#To create random dataframe with 4 rows and 2 columns
pd.DataFrame(np.random.rand(4, 2))
#To keep only the rows that have at least 2 non-NaN values in df1, use:
df1.dropna(thresh=2)
#Forward fill, i.e. fill each NaN with the last value seen before it. To forward fill, use:
df1.ffill()
#To fill with a limit, use:
df1.ffill(limit=2)
#Won't fill more than 2 consecutive elements
#It is often better to fill with the mean than with a neighbouring value
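Filling with the mean, as suggested above, can be sketched like this on a small made-up column:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'score': [1.0, np.nan, 3.0]})

# Replace each missing value with the mean of the non-missing values
filled = df.fillna(df.mean())
```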

Data Wrangling
To manipulate data, we could:
- Group by
- Join
- Combine
- Pivot
- Melt
- Reshape

This is illustrated below:

# Importing pandas
import pandas as pd

# To melt, i.e. unpivot columns into rows keyed by 'key' (dataframe here is a placeholder for your own frame), use:
melted = pd.melt(dataframe, ['key'])
# Pivot - to reshape the melted frame back, use:
melted.pivot(index='key', columns='variable', values='value')
# To group by data, use:
grouped = dataframe['data'].groupby(dataframe['key'])
#To view what you've grouped and evaluated in a table, use:
grouped.unstack()
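A concrete group-by might look like this; the 'key' and 'data' column names follow the snippet above, but the values are made up:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'data': [1, 2, 3, 4]})

# Group the data column by key and sum each group
sums = df['data'].groupby(df['key']).sum()
```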

Joins and Unions

# Importing pandas
import pandas as pd
df1 = pd.DataFrame({'hobbies': ['Playing piano', 'singing', 'movies'], 'likes': ['reading', 'writing', 'learning'], 'dislikes': ['laziness', 'lateness', 'dishonesty']})
data = [['Playing piano', 'singing', 'movies'], ['reading', 'writing', 'learning'], ['laziness', 'lateness', 'dishonesty']]
df2 = pd.DataFrame(data, columns=['Promise', 'Michael', 'Gloria'], index=['hobbies', 'likes', 'dislikes'])
# To join, i.e. intersect two dataframes on their common columns, use:
pd.merge(df1, df2)
#Note: make sure the join keys in df2 are unique before joining
#To join on a particular key, use:
pd.merge(df1, df2, on="key")
#To specify the join key on each side explicitly, use:
pd.merge(df1, df2, left_on="name", right_on="name")
#To outer join, i.e. to join and put NaN wherever a key is not matched, use:
pd.merge(df1, df2, how="outer")
#To concat, where s1, s2, s3 are series, use:
pd.concat([s1, s2, s3])
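A small sketch of inner versus outer joins, using made-up frames that share a 'key' column:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'l': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'r': [3, 4]})

# Inner join (the default) keeps only keys present in both frames
inner = pd.merge(left, right, on='key')

# Outer join keeps all keys, filling the gaps with NaN
outer = pd.merge(left, right, on='key', how='outer')
```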

Date and Time
To work with dates and times, we import:

from datetime import datetime, date, time

Date and time operations are shown below:

# importing required libraries
from datetime import datetime, date, time
import pandas as pd
import numpy as np
#Datetime object
dt = datetime(2019, 11, 25, 11, 36, 00, 00)
dt.day
#To format a datetime object as a string in a specified format
dt.strftime('%d/%m/%Y %H:%M')
#To compute the difference between two datetimes (note: subtraction, not addition)
difference = datetime(2019, 1, 7) - datetime(2018, 6, 24, 8, 15)
difference.days
stamp = datetime(2019, 1, 3)
print (str(stamp))
print(stamp.strftime('%Y-%m-%d'))
#To parse a string in a known format into a datetime object, use:
value = '19-January-03'
datetime.strptime(value, '%y-%B-%d')
#dateutil's parse gives a best-guess parse of almost any date string:
from dateutil.parser import parse
parse('2011-January-03')
#To create a random time series, use:
ts = pd.Series(np.random.randn(5), index=pd.date_range('1/1/2000', periods=5, freq='Q'))
#Shifts values two periods later (the first two slots become NaN):
ts.shift(2)
#Shifts values two periods earlier:
ts.shift(-2)
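A concrete shift on a short daily series shows the NaN padding at the shifted-in end:

```python
import pandas as pd
import numpy as np

ts = pd.Series([1.0, 2.0, 3.0],
               index=pd.date_range('2020-01-01', periods=3, freq='D'))

# Shifting by 1 moves each value one period later; the first slot becomes NaN
shifted = ts.shift(1)
```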
