Data Science from a Nonexpert — The Art of Manipulating Data

Naz Eylem
Learning Data
6 min read · Dec 14, 2023


Data Manipulation

Data manipulation is the most crucial step of data analysis. You manipulate your data to handle missing values, normalize or standardize features, encode categorical variables so they fit machine learning algorithms, engineer new features, convert data types, and even prepare data for visualization.

Image source: Pixabay (courtesy of Innovalabs)

There is no one-size-fits-all solution, but you will encounter many cases where you need to encode categorical variables or create new features before you can visualize the data or feed it to an algorithm (see the sketch below). Here I will share what I use most often for manipulating data.
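As a quick illustration of the encoding case, here is a minimal sketch with a hypothetical 'color' column (not from the dataset used below):

import pandas as pd

# One-hot encode a hypothetical categorical column into 0/1 indicator columns
df_example = pd.DataFrame({'color': ['red', 'blue', 'red']})
encoded = pd.get_dummies(df_example['color'], prefix='color')
df_example = pd.concat([df_example, encoded], axis=1)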

Replace empty string with null

This replaces empty strings with NaN values in the string column.

df["string"] = df["string"].replace("", np.nan)

Set a string to a binary value and create a new column on a data frame

This code writes 0 for NaN values and 1 for non-empty values in the string column, storing the result in a new column on the data frame.

I use this to analyze whether a field is set or not across a collection that depends on third parties or integrations.

df["new_binary_stringtoint"] = df["string"].notnull().astype(int)

Labeling

I have applied labeling on multiple occasions, such as feature engineering, data understanding, and visualization.

The snippet below does the following:

The label_result function takes a row of a data frame as input (row) and checks two conditions:

  • If the value in the ‘exampleInt’ column of that row is greater than or equal to 10, it assigns the label ‘Label_1’.
  • Otherwise, if the value in the ‘exampleInt’ column of that row is greater than or equal to 5, it assigns the label ‘Label_2’.
  • If neither condition is met, it assigns the label ‘Label_3’.
def label_result(row):
    if row['exampleInt'] >= 10:
        return 'Label_1'
    elif row['exampleInt'] >= 5:
        return 'Label_2'
    else:
        return 'Label_3'

Here, the apply function is used to apply the label_result function to each row of the DataFrame along the specified axis (axis=1 indicates that the function should be applied to rows). The result is stored in a new column called 'label' in the DataFrame.

df['label'] = df.apply(label_result, axis=1)
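As a side note, on large frames the same labeling can be done without a row-wise apply. This is a vectorized alternative using numpy's np.select (my own sketch, not part of the original workflow):

import numpy as np

conditions = [df['exampleInt'] >= 10, df['exampleInt'] >= 5]
choices = ['Label_1', 'Label_2']
# np.select checks the conditions in order, like the if/elif chain above
df['label'] = np.select(conditions, choices, default='Label_3')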

DateTime

Most of the time, we encounter data that contains dates, so we convert the ‘EventDate’ column to pandas datetime objects with this code. This makes datetime-related operations and analysis possible.

df['EventDate'] = pd.to_datetime(df['EventDate'])
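If some entries might be malformed (an assumption about your data, not a required step), a defensive variant coerces them to NaT instead of raising an error:

df['EventDate'] = pd.to_datetime(df['EventDate'], errors='coerce')
# rows that failed to parse now hold NaT and can be inspected or dropped
bad_rows = df[df['EventDate'].isnull()]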

Extracting Components:

Extracting year, month, day, hour, minute, second, etc., for further analysis.

df['Year'] = df['EventDate'].dt.year
df['Month'] = df['EventDate'].dt.month
df['Day'] = df['EventDate'].dt.day
df['Hour'] = df['EventDate'].dt.hour
df['date'] = df['EventDate'].dt.date
df['time'] = df['EventDate'].dt.time
df['month_period'] = df['EventDate'].dt.to_period('M')

Remember to convert the column to a datetime value before doing calculations, and to convert it back to a string before aggregating and plotting.

df['EventDate'] = pd.to_datetime(df['EventDate'])
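And a minimal sketch of the string conversion for plotting (the '%Y-%m' format is my assumption; pick whatever granularity your plot needs):

# format the datetime as a 'YYYY-MM' string so group labels render cleanly on a plot axis
df['month_str'] = df['EventDate'].dt.strftime('%Y-%m')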

Filtering by Date Range:

Selecting rows within a specific date range and creating a new dataframe from it.

start_date = pd.to_datetime('2020-09-01')
end_date = pd.to_datetime('2020-09-30')
df_filtered = df[(df['EventDate'] >= start_date) & (df['EventDate'] <= end_date)]
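An equivalent filter, if you prefer it, uses Series.between, which is inclusive on both ends by default:

df_filtered = df[df['EventDate'].between(start_date, end_date)]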

Calculating Time Differences:

Computing time differences between dates.

df['TimeSinceEvent'] = pd.Timestamp.now() - df['EventDate']
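The result is a Timedelta column; a common follow-up (a small sketch) extracts whole days as a plain integer, which is easier to aggregate and plot:

df['DaysSinceEvent'] = df['TimeSinceEvent'].dt.days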

Grouping and Aggregating by Time:

I need to group and aggregate data most of the time before labeling and before visualizing with Folium or Matplotlib.

Grouping data by time intervals and performing aggregations.

  • df['EventDate'].dt.month extracts the month component from the 'EventDate' column.
  • df.groupby(...) groups the dataframe by the extracted month.
  • ['exampleInt'].mean() calculates the mean of the 'exampleInt' column for each group.
df.groupby(df['EventDate'].dt.month)['exampleInt'].mean()
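An equivalent grouping that keeps the year and month together uses pd.Grouper (freq='M' means month-end; newer pandas versions may prefer 'ME'):

df.groupby(pd.Grouper(key='EventDate', freq='M'))['exampleInt'].mean()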

Creating New Features by Using Time:

You can also add a time filter to labeling algorithms.

def label_result(row):
    if row['EventDate'].year >= 2023 and row['exampleInt'] >= 10:
        return 'Label_1'
    elif row['EventDate'].year >= 2023 and row['exampleInt'] >= 5:
        return 'Label_2'
    else:
        return 'Label_3'

df['label'] = df.apply(label_result, axis=1)
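A quick sanity check on the result (a small sketch using the 'label' column created above):

print(df['label'].value_counts())  # how many rows fell into each label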

Step 1: Function Definition

This line defines a function named label_result that takes a single argument row. In the context of pandas, when you use apply with axis=1, this function will be applied to each row of the DataFrame.

def label_result(row):

Step 2: Conditions for Labeling

If the ‘EventDate’ year is greater than or equal to 2023 and the ‘exampleInt’ is greater than or equal to 10, the function returns the string ‘Label_1’.

if row['EventDate'].year >= 2023 and row['exampleInt'] >= 10:
    return 'Label_1'

If the first condition is not met but the ‘EventDate’ year is still greater than or equal to 2023 and the ‘exampleInt’ is greater than or equal to 5, the function returns the string ‘Label_2’.

elif row['EventDate'].year >= 2023 and row['exampleInt'] >= 5:
    return 'Label_2'

If none of the above conditions are met, the function returns the string ‘Label_3’.

else:
    return 'Label_3'

Step 3: Applying the Function to DataFrame

This line applies the label_result function to each row of the data frame (df) using the apply method with axis=1. It creates a new column 'label' in the DataFrame, and the values in this column are determined by the conditions specified in the label_result function.

df['label'] = df.apply(label_result, axis=1)

Splitting lists as columns

While working with H3 (a spatial indexing system), indexes are stored as a list of resolutions within each document, so I had to split them into columns. I use the code below to achieve this.

indexes_column = df['Indexes']

max_length = max(len(index_list) if index_list else 0 for index_list in indexes_column)

indexes_padded = [
    index_list + [float('nan')] * (max_length - len(index_list))
    if index_list
    else [float('nan')] * max_length
    for index_list in indexes_column
]

indexes_df = pd.DataFrame(indexes_padded, columns=[f'index-{i+1}' for i in range(max_length)])

df.reset_index(drop=True, inplace=True)
indexes_df.reset_index(drop=True, inplace=True)

result_df = pd.concat([df, indexes_df], axis=1)

Step 1:

Here, you are extracting the ‘Indexes’ column from the DataFrame df and storing it in the variable indexes_column.

indexes_column = df['Indexes']

Step 2:

This line calculates the maximum length among all the lists in the ‘Indexes’ column. It uses a generator expression to iterate over each list in the indexes_column, and the max function finds the maximum length. If a list is empty or None, it considers its length as 0.

max_length = max(len(index_list) if index_list else 0 for index_list in indexes_column)

Step 3:

Here, you are padding each list in the ‘Indexes’ column with NaN values to make them all of equal length. If the original list is not empty (index_list), it is extended with NaN values to reach the maximum length (max_length). If the original list is empty, a list of NaN values of length max_length is created.

indexes_padded = [
    index_list + [float('nan')] * (max_length - len(index_list))
    if index_list
    else [float('nan')] * max_length
    for index_list in indexes_column
]

Step 4:

This line creates a new DataFrame indexes_df from the padded lists. The columns of this new DataFrame are named 'index-1', 'index-2', ..., 'index-max_length', based on the maximum length of the lists.

indexes_df = pd.DataFrame(indexes_padded, columns=[f'index-{i+1}' for i in range(max_length)])

Step 5:

Here, you are resetting the indices of both df and indexes_df to make sure they start from 0 and are continuous. This is important for proper concatenation.

df.reset_index(drop=True, inplace=True)
indexes_df.reset_index(drop=True, inplace=True)

Step 6:

Finally, you concatenate the original DataFrame df and the new DataFrame indexes_df along the columns (axis=1) to create a new DataFrame named result_df.

result_df = pd.concat([df, indexes_df], axis=1)
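For completeness, a shorter alternative for the same expansion (my own sketch, assuming each entry in 'Indexes' is a list; missing entries are replaced with empty lists first). pandas pads ragged rows with NaN automatically:

lists = [lst if isinstance(lst, list) else [] for lst in df['Indexes']]
indexes_df = pd.DataFrame(lists)  # ragged rows are padded with NaN
indexes_df.columns = [f'index-{i+1}' for i in range(indexes_df.shape[1])]
result_df = pd.concat([df.reset_index(drop=True), indexes_df], axis=1)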

Conclusion

I hope this guide helps you navigate the big world of data manipulation using Python and pandas. I tried to provide easy-to-follow explanations and code snippets to make it approachable for anyone diving into data science.

We covered the basics, from cleaning up your data to feature engineering, where you create binary columns and give your data some personality. This guide covers fundamental must-know skills.

It takes you on a journey through datetime operations, showing you how to play with time, from simple conversions to more advanced tricks like time-based filtering and calculations.

Because the examples come from real-life work, you will see that this isn't just theory: these tricks are ready to use in your actual data projects. I hope this guide serves as your friendly GPS, offering clear directions and tips as you navigate the world of data science.

Happy coding!

