Pandas -operations (functions), handling missing data for Data Science

Nikitajain Jain
5 min readJan 18, 2024

--

1. df.assign()

Pandas’ .assign() function is used to add new columns to a DataFrame, based on the computation of existing columns. It allows you to add new columns to a DataFrame without modifying the original dataframe. The function returns a new DataFrame with the added columns.

Here is an example of how you can use it:

df_new = df.assign(count_plus_5=df['Count'] + 5)
df_new.head()

In the above example, the first time df.assign() is used to create a new column named ‘count_plus_5’ with the value of count + 5.

It’s important to note that the original DataFrame df remains unchanged and the new DataFrame df_new is returned with the new columns added.

The .assign() method can be used multiple times in a chain, allowing you to add multiple new columns to a DataFrame in one line of code.

2. df.sort_values()

The df.sort_values() function is used to sort a DataFrame by one or multiple columns.

Output:

df_sorted = df.sort_values(by='Count')
print(df_sorted.head())
Year of Birth Gender Ethnicity Child's First Name  Count  Rank
19 2011 FEMALE ASIAN AND PACIFIC ISLANDER EVA 67 20
26 2011 FEMALE ASIAN AND PACIFIC ISLANDER ANGELA 60 27
30 2011 FEMALE ASIAN AND PACIFIC ISLANDER HANNAH 55 31
44 2011 FEMALE ASIAN AND PACIFIC ISLANDER ARIA 50 45
36 2011 FEMALE ASIAN AND PACIFIC ISLANDER VIVIAN 48 37

3. df.sample()

The df.sample() function randomly selects rows from a DataFrame.

Output:

df_sample = df.sample(n=2, replace=False, random_state=1)
print(df_sample)
Year of Birth Gender Ethnicity Child's First Name  Count  Rank
35 2011 FEMALE ASIAN AND PACIFIC ISLANDER ARIA 50 36
33 2011 FEMALE ASIAN AND PACIFIC ISLANDER ELAINE 52

4. df.rename()

The df.rename() function changes the name of columns or rows in a DataFrame.

df_rename_multiple = df.rename(columns={'Count': 'count', 'Rank':'rank'})
print(df_rename_multiple.head())
df_rename_index = df.rename(index={0:'first',1:'second',2:'third'})
print(df_rename_index.head())

5. df.merge()

df.merge() is used to combine two DataFrames based on one or more common columns, similar to SQL JOINs.

Example:

# Create two DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})
# Merge the two DataFrames on the 'key' column
merged_df = df1.merge(df2, on='key')
print(merged_df)

Output

key  value_x  value_y
0 B 2 5
1 D 4 6

6. pd.get_dummies()

pd.get_dummies() is used to convert categorical variable(s) into dummy/indicator variables. This is useful when you have categorical data, and you want to convert it into a format that can be provided to ML algorithms to improve predictions.

Example:

# Creating a DataFrame with categorical data
data = {'Category': ['A', 'B', 'A', 'C', 'B']}
df_categorical = pd.DataFrame(data)
# Applying pd.get_dummies() to convert categorical variable into dummy/indicator variables
df_dummies = pd.get_dummies(df_categorical, columns=['Category'], prefix='Category')
# Displaying the resulting DataFrame with dummy variables
print(df_dummies)

Output:

Category_A  Category_B  Category_C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0

In this example, the Category column has been converted into three dummy columns (Category_A, Category_B, Category_C) with binary values representing the presence or absence of each category.

You can then concatenate these dummy columns with your original DataFrame if needed.

# Concatenating the dummy columns with the original DataFrame
df_concatenated = pd.concat([df_categorical, df_dummies], axis=1)
# Displaying the concatenated DataFrame
print(df_concatenated)

Output:

Category  Category_A  Category_B  Category_C
0 A 1 0 0
1 B 0 1 0
2 A 1 0 0
3 C 0 0 1
4 B 0 1 0

We can fix this using the method .reset_index().

import pandas as pd
df = pd.DataFrame([
['January', 100, 100, 23, 100],
['February', 51, 45, 145, 45],
['March', 81, 96, 65, 96],
['April', 80, 80, 54, 180],
['May', 51, 54, 54, 154],
['June', 112, 109, 79, 129]],
columns=['month', 'clinic_east',
'clinic_north', 'clinic_south',
'clinic_west']
)
df2 = df.loc[[1, 3, 5]]
print(df2)
df3=df2.reset_index(inplace=True,drop=True)
print(df3)

Handling missing data with Pandas

1. df.isnull()

The isnull() method in Pandas returns a DataFrame of the same shape as the original DataFrame, but with True or False values indicating whether each value in the original DataFrame is missing or not. Missing values, such as NaN or None, will be True in the resulting DataFrame, while non-missing values will be False.

# Check for missing values
print(df.isnull())

# Count missing values in each column
print(df.isnull().sum())

2. df.fillna()

The fillna() method in Pandas is used to fill in missing values in a DataFrame with a specified value or method. By default, it replaces missing values with NaN, but you can specify a different value to use instead as shown below:

  • value: Specifies the value to use to fill in the missing values. Can be a scalar value or a dict of values for different columns.
  • method: Specifies the method to use for filling in missing values. Can be ‘ffill’ (forward-fill) or ‘bfill’ (backward-fill) or ‘interpolate’(interpolate values) or ‘pad’ or ‘backfill’
  • axis: Specifies the axis along which to fill in missing values. It can be 0 (rows) or 1 (columns).
  • inplace: Whether to fill in the missing values in place (modifying the original DataFrame) or to return a new DataFrame with the missing values filled in.
  • limit: Specifies the maximum number of consecutive missing values to fill.
  • downcast: Specifies a dictionary of values to use to downcast the data types of columns.
# Fill missing values with a constant
df_filled_constant = df.fillna(0)

# Fill missing values with the mean of each column
df_filled_mean = df.fillna(df.mean())

# Fill missing values with the median of each column
df_filled_median = df.fillna(df.median())
# forward-fill missing values (propagates last valid observation forward to next)
df.fillna(method='ffill')
# backward-fill missing values (propagates next valid observation backward to last)
df.fillna(method='bfill')
# fill missing values using interpolation
df.interpolate()
# fill missing values in place
df.fillna(0, inplace=True)

3. df.dropna()

df.dropna() is a method used in the Pandas library to remove missing or null values from a DataFrame. It removes rows or columns from the DataFrame where at least one element is missing.

You can remove all rows containing at least one missing value by calling df.dropna().

In [29]:

df_drop_na = df.dropna()

If you want to remove only the columns that contain at least one missing value you can use df.dropna(axis=1)

df_drop_na = df.dropna(axis=1)

You can also set thresh parameter to keep only the rows/columns that have at least thresh non-NA/null values.

df_drop_na = df.dropna(thresh=2)

4. df.drop()

df.drop() is a method used in the Pandas library to remove rows or columns from a DataFrame by specifying the corresponding labels. It can be used to drop one or multiple rows or columns based on their labels.

You can remove a specific row by calling df.drop() and passing the index label of the row you want to remove, and the axis parameter set to 0 (default is 0)

df_drop = df.drop(0)

This would remove the first row of the DataFrame.

You can also drop multiple rows by passing a list of index labels:

df_drop = df.drop([0,1])

This would remove the first and second rows of the DataFrame.

Similarly, you can drop columns by passing the labels of the columns you want to remove and setting the axis parameter to 1:

df_drop = df.drop(['Count', 'Rank'], axis=1)

5. Handling Missing Values in Categorical Data:

  • For categorical data, you can create a new category for missing values using df.fillna('Unknown').
# Fill missing values in a categorical column
df['Category'].fillna('Unknown', inplace=True)

--

--

Nikitajain Jain

To be data scientist ,love creating value out of numbers . Pursing MSc DS from Banasthli Vidhyapith. Interested in AI, ML, NLP, spirituality , society