Pandas: Combine, Analyse, Save & Load

Ja X
6 min read · Dec 29, 2023


1. Combining DataFrames

Combining DataFrames in Pandas, done through methods like merge, join and concat, is crucial for merging datasets based on common columns or indices. By specifying key columns or indices, users can efficiently consolidate information from multiple DataFrames, enabling comprehensive analysis and insights.

The concat function is used to concatenate or stack DataFrames along a specified axis, either vertically or horizontally. pd.concat()

import pandas as pd

cust_data = pd.DataFrame({'customerID': ['101','102','103','104'],
                          'category': ['Medium','Medium','High','Low'],
                          'first_visit': ['yes','no','yes','yes'],
                          'sales': [123,52,214,663]}, index=[0,1,2,3])

cust_data_new = pd.DataFrame({'customerID': ['101','103','104','105'],
                              'distance': [12,9,44,21],
                              'sales': [123,214,663,331]}, index=[4,5,6,7])

pd.concat([cust_data,cust_data_new],axis=0) # concatenate by row, i.e. axis=0

pd.concat([cust_data,cust_data_new],axis=1) # concatenate by column, i.e. axis=1
# axis=1 aligns on the index; the two frames share no index labels here,
# so non-overlapping positions are filled with NaN
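One wrinkle worth knowing: row-wise concat keeps each frame's original index labels, so duplicate labels can appear in the result; passing ignore_index=True renumbers the rows. A minimal sketch on toy frames (not the customer data above):

```python
import pandas as pd

# toy frames with overlapping index labels, purely for illustration
df_a = pd.DataFrame({'customerID': ['101', '102'], 'sales': [123, 52]}, index=[0, 1])
df_b = pd.DataFrame({'customerID': ['103', '104'], 'sales': [214, 663]}, index=[0, 1])

# without ignore_index the stacked frame would keep labels 0, 1, 0, 1;
# ignore_index=True renumbers the result 0..n-1
stacked = pd.concat([df_a, df_b], axis=0, ignore_index=True)
print(stacked.index.tolist())  # [0, 1, 2, 3]
```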

Merge and Join

The merge operation in Pandas combines DataFrames based on the values of a specified column to identify common entries. On the other hand, the join operation utilizes the index to merge DataFrames and identify common entries.

Pandas’ merge function provides flexibility with outer, inner, left, and right joins, allowing users to customize the combination of DataFrames based on common column values. pd.merge()

# outer merge is the union of the 'on' keys
pd.merge(cust_data,cust_data_new,how='outer',on='customerID')

# inner merge is the intersection of the 'on' keys
pd.merge(cust_data,cust_data_new,how='inner',on='customerID')

The join method in Pandas works similarly to merge, but instead of using column values to combine DataFrames, it utilizes index labels for the merging process. DataFrame.join()

data_quar = pd.DataFrame({'Q1': [101,102,103],
                          'Q2': [201,202,203]},
                         index=['I0','I1','I2'])

data_quar_new = pd.DataFrame({'Q3': [301,302,303],
                              'Q4': [401,402,403]},
                             index=['I0','I2','I3'])

data_quar.join(data_quar_new,how='right')
# outer, inner, left, and right work the same as in merge

A right join keeps every index label from the right (new) DataFrame. Rows from the left DataFrame are included only where their index labels match the right DataFrame's; right-side labels with no match on the left get NaN in the left DataFrame's columns.
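The join types are easy to compare side by side with merge's indicator=True flag, which adds a _merge column recording where each row came from. A small sketch on made-up frames (column names are assumptions, not the article's dataset):

```python
import pandas as pd

left = pd.DataFrame({'customerID': ['101', '102'], 'sales': [123, 52]})
right = pd.DataFrame({'customerID': ['101', '105'], 'distance': [12, 21]})

# outer join keeps everything; _merge shows 'both', 'left_only' or 'right_only'
both = pd.merge(left, right, how='outer', on='customerID', indicator=True)
print(both['_merge'].tolist())
```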

2. Saving and Loading DataFrames

In practical situations, datasets span thousands of rows and many columns, so building DataFrames from hand-written lists quickly becomes impractical. A more efficient approach is needed to handle data at both the column and row levels at once. In Python, we can import datasets directly from local files, links, or databases, eliminating the need to create them manually.

Loading a CSV file in Python involves utilizing libraries such as Pandas to read the file into a DataFrame, providing a structured and accessible format for further analysis. pd.read_csv(), pd.read_excel(), data.head(), data.tail(), data.info(), data.shape

data = pd.read_csv('/path/to/data/file/Data.csv')
data_excel = pd.read_excel('/path/to/data/file/Data.xlsx')
# pd.read_csv()/pd.read_excel() work with just the filename
# if the notebook and dataset are in the same folder

data.head()
# head() shows the first 5 rows of the data by default

data.tail()
# to check the last 5 rows of the dataset

data.info()
# concise summary of the DataFrame's structure, including data types, non-null counts, and memory usage

data.shape
# to check the number of rows and columns in the dataset

Let’s identify null values in the dataset. data.isnull()

data.isnull().sum()
# data.isnull() marks each value as True (null) or False (not null)
# .sum() totals the null values in each column
Scorecard       0
Team 1          0
Team 2          0
Winner          0
Margin        179
Ground          0
Match Date      0
dtype: int64

# only the Margin column has 179 null values
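Once nulls are located, the usual next step is dropna() or fillna(). A minimal sketch on a toy frame standing in for the cricket data (the column names mirror the output above; the values are invented):

```python
import pandas as pd
import numpy as np

matches = pd.DataFrame({'Winner': ['Australia', 'India', 'England'],
                        'Margin': [5.0, np.nan, 3.0]})

dropped = matches.dropna(subset=['Margin'])   # drop rows whose Margin is null
filled = matches.fillna({'Margin': 0})        # or replace null Margins with 0
print(len(dropped), filled['Margin'].isnull().sum())  # 2 0
```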

The dataset we loaded contains information about cricket games. Now, let’s find the games where Australia secured a win.

data[data['Winner'] == 'Australia']

Now, let’s determine the number of games where the visiting team won the match.

data.Winner.eq(data['Team 2']).sum()

# data: This is the DataFrame containing the dataset.
# data.Winner: This extracts the "Winner" column from the DataFrame.
# .eq(): This is a Pandas method used for element-wise comparison. It compares each element in the "Winner" column with the values provided.
# data['Team 2']: This refers to the values in the "Team 2" column of the DataFrame
1637
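The same pattern is easy to verify on a tiny hand-made frame (team names invented for illustration):

```python
import pandas as pd

games = pd.DataFrame({'Team 1': ['AUS', 'IND', 'ENG'],
                      'Team 2': ['IND', 'AUS', 'AUS'],
                      'Winner': ['AUS', 'AUS', 'AUS']})

# .eq() compares 'Winner' to 'Team 2' element-wise; summing the booleans
# counts the rows where the visiting team won
visitor_wins = games['Winner'].eq(games['Team 2']).sum()
print(visitor_wins)  # 2
```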

Saving the Dataset data.to_csv(), data.to_excel()

data.to_csv('/path/to/data/file/Saved_Data_cs.csv',index=False)

data.to_excel('/path/to/data/file/Saved_Data_excel.xlsx',index=False)

3. Statistical Functions

Pandas provides a rich set of statistical functions that enable comprehensive data analysis. From basic measures such as mean, median, and standard deviation to advanced functionalities like correlation and covariance calculations, these functions empower users to gain valuable insights into their datasets. Additionally, Pandas offers efficient aggregation tools, allowing users to summarize and analyze data across different dimensions, making it a versatile tool for statistical exploration in data science.
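As a quick sketch of the correlation and covariance calls mentioned above (toy numbers, chosen to be perfectly linear):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4],
                   'y': [2, 4, 6, 8]})   # y = 2x, so correlation is 1

corr = df['x'].corr(df['y'])   # Pearson correlation by default
cov = df['x'].cov(df['y'])     # sample covariance
print(round(corr, 6))  # 1.0
```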

In this instance, we’ll load a stock dataset comprising stock names, prices, and dates. Initially, we’ll perform operations to identify the minimum and maximum values of prices, along with extracting the unique stock names. data['column'].min(), data['column'].max()

data_stock = pd.read_csv('/path/to/data/file/StockData.csv')

data_stock.head()
data_stock['price'].min()

data_stock['price'].max()

The min function returns the minimum price, 28.4, while the max function returns the maximum price, 179.26, for the stocks in the dataset.

To extract the listed stock names, we can use data['column'].unique()

data_stock['stock'].unique()
array(['AAPL', 'SNI', 'TJX', 'ZTS'], dtype=object)

As observed, our dataset encompasses price data for four distinct stocks.

In statistics, achieving balance in the data is crucial, emphasizing the need to ensure that the dataset is well-balanced. This can be verified by employing value_counts() to inspect the count of each unique quantity in a column and value_counts(normalize=True) to obtain the relative frequencies of these unique values through normalisation.

data_stock['stock'].value_counts()
AAPL    1259
SNI     1259
TJX     1259
ZTS     1259
Name: stock, dtype: int64

data_stock['stock'].value_counts(normalize=True)
AAPL    0.25
SNI     0.25
TJX     0.25
ZTS     0.25
Name: stock, dtype: float64

Pandas further facilitates statistical analysis by providing functions for mean, median, and mode computations. data['column'].mean(), data['column'].median(), data['column'].mode()

data_stock['price'].mean()
# average of price

data_stock['price'].median()
# middle value of price

data_stock['price'].mode()
# the most frequent value(s) of price

mean: 73.05702966640192

median: 69.08500000000001

mode: 74.59
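These summary statistics (and more) can also be pulled in one call with describe(), which is not covered in the article's function list but pairs naturally with it; a sketch on a tiny invented price series:

```python
import pandas as pd

prices = pd.Series([28.4, 69.0, 74.59, 74.59, 179.26])

# count, mean, std, min, quartiles and max in one pass
summary = prices.describe()
print(summary['min'], summary['max'])  # 28.4 179.26
```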

Group By function

The Pandas dataframe.groupby() function is employed to categorize data into groups based on specified criteria.

# lets group mean price of each stock
data_stock.groupby(['stock'])['price'].mean()
stock
AAPL    109.066698
SNI      71.319206
TJX      66.743566
ZTS      45.098648
Name: price, dtype: float64

In this context, the groupby function is utilized to partition the data into the four stocks present in the dataset, followed by the calculation of the mean price for each of the four stocks.
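groupby() also combines naturally with agg() to compute several statistics per group at once; a minimal sketch on invented prices:

```python
import pandas as pd

stocks = pd.DataFrame({'stock': ['AAPL', 'AAPL', 'ZTS', 'ZTS'],
                       'price': [100.0, 110.0, 40.0, 50.0]})

# one pass over the groups, three statistics per stock
stats = stocks.groupby('stock')['price'].agg(['mean', 'min', 'max'])
print(stats.loc['AAPL', 'mean'])  # 105.0
```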

The Pandas apply() function empowers you to manipulate both columns and rows within a DataFrame. In the context of our task, let's formulate a function using apply() to increment the stock price by 10%. data['column'].apply()

def profit(s):
    return s + s*0.10 # increase of 10%

data_stock['new_price'] = data_stock['price'].apply(profit)
data_stock.head()
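For a simple arithmetic transformation like this, apply() is not strictly needed; column arithmetic in Pandas is vectorized and usually faster. A standalone sketch on a toy frame:

```python
import pandas as pd

prices = pd.DataFrame({'price': [100.0, 50.0]})

# same 10% increase, expressed as vectorized column arithmetic
prices['new_price'] = prices['price'] * 1.10
print(prices['new_price'].round(2).tolist())  # [110.0, 55.0]
```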

The Pandas sort_values() function facilitates the sorting of a DataFrame in either ascending or descending order based on a specified column.

data_stock.sort_values(by='new_price',ascending=False) 
# by default ascending is set to True
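sort_values() also accepts a list of columns, with a per-column ascending flag; a sketch on invented rows:

```python
import pandas as pd

df = pd.DataFrame({'stock': ['ZTS', 'AAPL', 'AAPL'],
                   'price': [45.0, 110.0, 100.0]})

# stock ascending, then price descending within each stock
ordered = df.sort_values(by=['stock', 'price'], ascending=[True, False])
print(ordered['price'].tolist())  # [110.0, 100.0, 45.0]
```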

Key Functions →

Combining DataFrames: pd.concat(), pd.merge(), DataFrame.join()

Saving & Loading: pd.read_csv(), pd.read_excel(), data.head(), data.tail(), data.info(), data.shape, data.isnull(), data.to_csv(), data.to_excel()

Statistical Functions: data['column'].min(), data['column'].max(), data['column'].unique(), .value_counts(normalize=True), data['column'].mean(), data['column'].median(), data['column'].mode(), dataframe.groupby(), data['column'].apply(), .sort_values()
