Customer Lifetime Value Prediction: Dataset Exploration (Part 2)

Learn to categorize your customers using cohort analysis to understand their behavior and retention.

Asish Biswas
AnalyticSoul
4 min readJun 18, 2024

--

Welcome back! In this lesson we will continue our data exploration of the online transaction dataset. We will transform the dataset for cohort analysis by grouping customers who made their first purchase in the same month.

Aggregate monthly transactions

Let’s prepare the base and aggregate month-over-month total sales for each customer using the groupby() method.

# aggregate monthly revenue for each customer
df_monthly_sales = df_cdnow_tr.groupby(['customer_id', 'year_month']).agg({'revenue': 'sum'})

df_monthly_sales.reset_index(inplace=True)
print(df_monthly_sales.head())

Customer transactions over time

Not every customer makes a purchase every month. Due to these gaps in transactions, it becomes difficult to handle the data. In order to maintain consistency, we will fill inactive months for each customer with zero-dollar transactions. This way we will have data for each customer for each month.

Step 1: Creating a customer-month matrix

Prepare dummy data by combining all months for each customer.

  • First, we create a dataframe with all the available year_month in our dataset.
  • Then, we create another dataframe with all the distinct customer_id.
  • Lastly, we cross-join those two datasets and create a dummy dataframe with each customer and each year-month name.
# create a dataframe with only the year-month values
df_monthyear = pd.DataFrame({'year_month': df_cdnow_tr['year_month'].unique()})
# create a dataframe with only the customer-ids
df_customers = pd.DataFrame({'customer_id': df_cdnow_tr['customer_id'].unique()})
# cross join customer-ids with year-month
df_customer_yearmonth = df_customers.merge(df_monthyear, how='cross')
df_customer_yearmonth.head()

Step 2: Handling missing sales

We want to show a continuous revenue line for each customer. That’s why we need to fill the monthly sales dataframe with zero-dollar transactions for those months where customers were inactive.

  • We merge the dummy transactions from the previous step with the monthly sales dataframe.
  • Then, we replace the invalid values in the revenue column with zero (0).
# fill absent months for each customer with zero transactions
df_monthly_sales = df_customer_yearmonth.merge(
df_monthly_sales,
on=['customer_id', 'year_month'],
how='outer'
)
df_monthly_sales = df_monthly_sales.sort_values(by=['customer_id', 'year_month'])
df_monthly_sales['revenue'].fillna(0, inplace=True)

# show sample data
df_monthly_sales[df_monthly_sales['customer_id'] == 3]
Sample data for customer 3

Step 3: Rolling aggregate of sales

Aggregate all historical sales on a monthly basis for each customer.

  • First, we utilize the groupby() function with a window function and continue aggregating all the historical sales for each month for each customer. We continue rolling for 18 periods because we have 18 month's transactions in our dataset (1997-01 to 1998-06)
  • Then, we sort the dataframe by customer and month. This will order the observations for data visualization.
# rolling aggregate
df_monthly_sales_agg = (
df_monthly_sales
.groupby('customer_id')
.rolling(18, on='year_month', min_periods=0)
.revenue
.sum()
)
df_monthly_sales_agg = df_monthly_sales_agg.reset_index()
df_monthly_sales_agg = df_monthly_sales_agg.sort_values(by=['customer_id', 'year_month'])

# show sample data
df_monthly_sales_agg[df_monthly_sales_agg['customer_id'] == 3]
Rolling aggregated sample data for customer 3

Customer revenue over time

Let’s visualize the historical aggregated revenue for each customer.

  • First, we convert year_month column because the line plot in the next step expects a string in the x-axis.
  • We draw aggregated revenue for each customer month over month. To keep the diagram clean, we display only 20 customers’ revenue in the plot. According to the aggregated sales lines, most of the observed customers started purchasing in February 1997, and as they continued to purchase the aggregated sales lines moved upwards.
df_monthly_sales_agg['year_month'] = df_monthly_sales_agg['year_month'].astype(str)

# plot aggregated customer sales
plt.figure(figsize=(10, 6))
s = sns.lineplot(
data=df_monthly_sales_agg.iloc[-360:],
x='year_month',
y='revenue',
hue='customer_id'
)
s.set_xticklabels(s.get_xticklabels(), rotation=45)
plt.ylabel('Sale')
plt.legend().remove()
plt.show()

Now it’s your turn! Practice along with the code notebook.

What’s next?

Join the community

Join our vibrant learning community on Discord! You’ll find a supportive space to ask questions, share insights, and collaborate with fellow learners. Dive in, collaborate, and let’s grow together! We can’t wait to see you there!

Thanks for reading! If you like this tutorial series make sure to clap (up to 50!) and let’s connect on LinkedIn and follow me on Medium to stay updated with my new articles.

Support me at no extra cost by joining Medium via this referral link.

--

--