Understanding Consumer Profiles with Cluster Analysis: A Python Walkthrough (Part 1)

Published in

Finformation: When Data Science Meets Finance (當資料科學遇上財務金融)

23 min read · Jul 23, 2019

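Introduction

In real-world retail, we usually need to understand our TA (target audience) before setting a sales strategy. A company's resources are limited, so sending the right promotions through the channels that actually work helps raise revenue, improve the customer experience, and increase retention. Now that the easy traffic growth has dried up, retaining customers and raising revenue per customer matter a great deal, whether you run a platform or a shop. Cluster analysis is a technique that measures the distance between data points in order to estimate how similar they are. For background, see the explainer "What Is Cluster Analysis in Machine Learning" (什麼是機器學習中的集群分析).

Applied to a business setting, this lets us separate consumers into distinct groups, understand how our customer base is distributed, and then connect customers to the "sweet spot" of product value. If you have customer data at hand, why not try segmenting your own customers with Python?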

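Here I use a retailer's dataset: https://www.kaggle.com/c/instacart-market-basket-analysis/data

This dataset contains sales data released by the retail marketplace Instacart. The original task is to predict what a customer is likely to buy next (Target: predict which previously purchased products will be in a user's next order) and so improve the shopping experience, which makes it a kind of recommender system. Using the same data for segmentation, though, gives quite interesting results. At work I have recently been analyzing the audience distribution of a live-streaming platform and how much the streams benefit the platform. There I use natural language processing; here I will use clustering.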

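At this point someone might ask: wouldn't deep learning (natural language processing) perform better than machine learning (clustering)? In fact no algorithm is best in general; there is only the algorithm that best fits the problem. I usually decide which model to apply, and which techniques to use to dig out the value behind the data, based on the data I have and how easy it is to obtain.

Next I'll walk you through a consumer segmentation in Python. The article comes in two parts: this first part uses handy visualization tools such as Plotly and seaborn to understand the data and find its stories, and the next part gets into clustering proper.

First, import the packages: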

import os

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import pandas_profiling as pf
# plotly.plotly moved to the separate chart_studio package in plotly 4;
# it is not used below, so that import is dropped here.
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
import cufflinks

# Render Plotly/cufflinks charts inside the notebook.
cufflinks.go_offline(True)
init_notebook_mode(True)

# Input data files are available in the "../input/" directory.
print(os.listdir("../input"))

These are the packages we need. Plotly and cufflinks are visualization libraries I personally like a lot, and they are easy to work with.

# Load the tables we need
order_products_prior_df = pd.read_csv('../input/order_products__prior.csv')
order_products_train_df = pd.read_csv('../input/order_products__train.csv')
orders_df = pd.read_csv('../input/orders.csv')
products_df = pd.read_csv('../input/products.csv')
aisles_df = pd.read_csv('../input/aisles.csv')
departments_df = pd.read_csv('../input/departments.csv')

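We can use pandas' read_csv to load each file into a DataFrame, the structure best suited to data analysis. This dataset is a little more complex than most, with several tables to handle, so remember to change the file paths to wherever your own copies live!

EDA (Exploratory Data Analysis)

As usual, when I get a dataset I first try to understand it: I draw some plots, look at the distributions, and get familiar with what each column means. For example, the first thing I want to know is how the tables relate to each other (how are they joined?).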

%matplotlib inline
cnt_sts = orders_df.eval_set.value_counts()
color = sns.color_palette()

plt.figure(figsize=(12, 8))
# Newer seaborn versions expect x and y as keyword arguments.
sns.barplot(x=cnt_sts.index, y=cnt_sts.values, alpha=0.8, color=color[1])
plt.ylabel('Number of occurrences', fontsize=12)
plt.xlabel('Eval set type', fontsize=12)
plt.title('Count of rows in each dataset', fontsize=15)
plt.xticks(rotation='vertical')
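One way to see how the tables connect is to join a few rows on their shared keys. The sketch below uses made-up toy rows rather than the real Instacart files, and assumes the column names used in the Kaggle CSVs: order_id links orders to the order-product tables, product_id links to products, and department_id links to the department lookup.

```python
import pandas as pd

# Toy rows that mimic the Instacart schema (all values are made up).
orders = pd.DataFrame({
    "order_id": [1, 2],
    "user_id": [10, 11],
    "eval_set": ["prior", "train"],
})
order_products = pd.DataFrame({
    "order_id": [1, 1, 2],
    "product_id": [100, 101, 100],
    "reordered": [0, 1, 1],
})
products = pd.DataFrame({
    "product_id": [100, 101],
    "product_name": ["Banana", "Whole Milk"],
    "department_id": [4, 16],
})
departments = pd.DataFrame({
    "department_id": [4, 16],
    "department": ["produce", "dairy eggs"],
})

# The order-product table has one row per item; every other table
# attaches to it through a key column.
joined = (
    order_products
    .merge(orders, on="order_id", how="left")
    .merge(products, on="product_id", how="left")
    .merge(departments, on="department_id", how="left")
)
print(joined[["order_id", "user_id", "product_name", "department"]])
```

Left joins keep every order line even if a lookup row is missing, which makes gaps in the reference tables easy to spot as NaN.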

How many customers are there in total?

def get_unique_count(x):
    return len(np.unique(x))

cnt_srs = orders_df.groupby('eval_set')['user_id'].aggregate(get_unique_count)
cnt_srs

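So there are 206,209 customers in total, and the train/test split is roughly 65:35.

Next, I'm curious how many orders each customer has placed. Do customers buy a lot? The shopping habits of someone who buys a great deal are harder to guess, whereas customers with only a few orders usually have simpler patterns.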

cnt_srs = orders_df.groupby('user_id')['order_number'].aggregate('max').reset_index()
cnt_srs = cnt_srs.order_number.value_counts()

plt.figure(figsize=(12, 8))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[2])
plt.ylabel('Number of occurrences', fontsize=12)
plt.xlabel('Maximum order number', fontsize=12)
plt.xticks(rotation=10)

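So the minimum is 4 orders and the maximum is 100. The bar at 100 is suspiciously tall: it is unlikely that so many customers placed exactly 100 orders, so the most probable explanation is that the marketplace's data science team lumped every count above 100 into 100.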
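A quick way to probe the capping hypothesis is to compare the count at the maximum value with the count just below it. This is only a sketch on synthetic numbers; the helper name pile_up_ratio is hypothetical and not part of any library.

```python
import pandas as pd

def pile_up_ratio(values: pd.Series) -> float:
    """Count at the maximum value divided by the count at the next-lower value.

    A ratio far above 1 hints that larger values were capped at the maximum.
    """
    counts = values.value_counts().sort_index()
    if len(counts) < 2:
        raise ValueError("need at least two distinct values")
    return counts.iloc[-1] / counts.iloc[-2]

# Synthetic per-user order counts: a thin tail, then a pile-up at 100.
max_orders = pd.Series([4] * 50 + [5] * 40 + [98] * 2 + [99] * 2 + [100] * 20)
ratio = pile_up_ratio(max_orders)
print(ratio)
```

On the real data you would pass the per-user maximum of order_number instead of the synthetic series.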

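So which day of the week is busiest? Grocery retail has both seasonality and short-term cycles, so knowing the busiest weekdays matters a lot. You can also schedule in-store events on the busier days to lift spending.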

plt.figure(figsize  = (12, 8 ))
sns.countplot(x = 'order_dow' , data = orders_df , color= color[0])
plt.ylabel('Count' , fontsize = 12)
plt.xlabel('Day of week' , fontsize = 12 )
plt.xticks(rotation = 0)
plt.title("Frequency of order by week day" , fontsize = 15)

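We find that Saturday and Sunday (reading order_dow 0 and 1 as the weekend) see the most orders, and Wednesday the fewest.

Busy weekends make sense: everyone uses their days off to stock up for the week XD

And what about time of day?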

plt.figure(figsize  = (12, 8 ))
sns.countplot(x = 'order_hour_of_day' , data = orders_df , color= color[0])
plt.ylabel('Count' , fontsize = 12)
plt.xlabel('Hour of one day' , fontsize = 12 )
plt.xticks(rotation = 0)
plt.title("Frequency of order by hour in a  day" , fontsize = 15)

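Most orders fall between 10 a.m. and 4 p.m. Worth noting: there is a dip around noon, probably because everyone has gone to lunch.

grouped_df = orders_df.groupby(['order_dow', 'order_hour_of_day'])['order_number'].aggregate('count').reset_index()
# pandas 2.x requires keyword arguments for pivot.
grouped_df = grouped_df.pivot(index='order_dow', columns='order_hour_of_day', values='order_number')

plt.figure(figsize=(12, 6))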
sns.heatmap(grouped_df)
plt.title('Frequency of Day of week Vs Hour of day')

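I like using heatmaps to look at the relationship between two variables. The most common use of a heatmap is actually to inspect correlations between variables.

We find that 10 a.m. on Sunday is the single most popular slot. Zooming out a little, Saturday afternoon and Sunday morning are the busiest periods. Based on these two findings, a store could stock extra fresh produce and bread on Saturday afternoons to build a healthy afternoon-tea snack occasion, and use the relaxed Sunday morning to promote scrambled eggs and instant foods that fit customers' routines. Qualitative methods (interviews, field research) can then tell us more about customers' lives and what really draws them to the store at these times.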
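Reading the hottest cell off a heatmap by eye works, but it can also be pulled out of the pivoted table directly. Here is a minimal sketch on made-up rows that only assumes the two column names from orders_df.

```python
import pandas as pd

# Toy order log (made-up rows) with the same two columns as orders_df.
toy_orders = pd.DataFrame({
    "order_dow":         [0, 0, 0, 1, 1, 1, 1, 3],
    "order_hour_of_day": [14, 14, 15, 10, 10, 10, 9, 12],
})

# Count orders per (day, hour) cell and lay them out as a day x hour table.
counts = (
    toy_orders
    .groupby(["order_dow", "order_hour_of_day"])
    .size()
    .unstack(fill_value=0)
)

# stack() flattens the table back into a Series indexed by (day, hour),
# so idxmax() returns the busiest cell directly.
busiest_day, busiest_hour = counts.stack().idxmax()
print(busiest_day, busiest_hour, counts.loc[busiest_day, busiest_hour])
```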

How many days pass between one order and the next?

plt.figure(figsize = (12 ,8))
sns.countplot(x='days_since_prior_order', data=orders_df, color=color[3])
plt.xlabel('Days since prior order', fontsize=12)
plt.ylabel('Count' ,fontsize = 12)
plt.xticks(rotation = 0)
plt.title('Frequency distribution by days since prior order ')

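A gap of seven days is the most frequent, which suggests people shop on a fixed weekly schedule. Business sense also says these fixed-frequency shoppers are largely the same group of people. Besides ordinary households, they may include restaurant operators.
Their patterns are easier to predict; everyone else is harder.

Interestingly, 14 and 21 days are also higher than their neighbors, which shows that consumers buy on a weekly rhythm. Some people even order twice in one day XD, maybe because they forgot something (?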
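The weekly rhythm can be checked programmatically by asking whether each multiple of 7 beats both of its neighboring days. The sketch below runs on synthetic counts; weekly_peaks is a hypothetical helper name.

```python
import pandas as pd

def weekly_peaks(day_counts: pd.Series, period: int = 7) -> list:
    """Return multiples of `period` whose count exceeds both neighbouring days."""
    peaks = []
    for day in range(period, int(day_counts.index.max()), period):
        left = day_counts.get(day - 1, 0)
        right = day_counts.get(day + 1, 0)
        if day_counts.get(day, 0) > max(left, right):
            peaks.append(day)
    return peaks

# Synthetic counts indexed by days since the prior order (made-up numbers).
day_counts = pd.Series(
    [5, 9, 8, 7, 6, 6, 7, 20, 8, 5, 4, 4, 4, 5, 12, 5],
    index=range(16),
)
print(weekly_peaks(day_counts))
```

On the real data, day_counts would be the value_counts of days_since_prior_order sorted by index. The last day is skipped on purpose because it has no right-hand neighbor to compare against.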

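However, the count at 30 days is abnormally high, and no gap longer than 30 days appears, which makes us suspect the data was capped at 30: any gap longer than 30 days was recorded as 30.
So let's check: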

def detect_potential_exception(df , col):
    """
    Input:
        df:DataFrame
        col: Column name , it must be the continuous variable!
    Output:
        Detect result
    """
    # Rule of thumb: how far the mode sits from the median, in units of the IQR.
    confident_value = abs((df[col].mode().iloc[0] - df[col].median()) / (df[col].quantile(0.75) - df[col].quantile(0.25)))
    confident_value = round(confident_value, 2)
    if confident_value > 0.8:
        print('According to the rule of thumb, this is dangerous!', confident_value)
    else:
        print('Safe!' , confident_value)
detect_potential_exception(orders_df, 'days_since_prior_order')

That's too dangerous~
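To get a feel for the rule of thumb above (the gap between mode and median measured in interquartile ranges, flagged when it exceeds 0.8), here is a small worked sketch on synthetic data. The function name mode_median_gap and the numbers are illustrative assumptions, not taken from the Instacart data.

```python
import pandas as pd

def mode_median_gap(values: pd.Series) -> float:
    """|mode - median| divided by the interquartile range."""
    iqr = values.quantile(0.75) - values.quantile(0.25)
    return abs(values.mode().iloc[0] - values.median()) / iqr

# Synthetic gaps between orders, centred on a 7-day habit ...
natural = pd.Series([3] * 5 + [5] * 10 + [7] * 30 + [9] * 10 + [11] * 5)
# ... and the same data with a large pile-up at 30 days.
capped = pd.concat([natural, pd.Series([30] * 40)], ignore_index=True)

print(round(mode_median_gap(natural), 2))
print(round(mode_median_gap(capped), 2))
```

In the first series the mode and median coincide, so the score is zero. In the second, the pile-up drags the mode to 30 while the median barely moves, which pushes the score past the 0.8 threshold. Note that the score is sensitive to ties in the mode, so it is a heuristic rather than a formal test.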

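Our guess was right. Unless the distribution is genuinely bimodal, a spike this large means the data was probably filled in crudely, and we have to be careful about it when modeling! One option is to treat a sample of the 30-day values as missing and fill them with a predictive model, such as a random forest, trained on the other values. With real company data, just go ask the BI team XD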
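The first half of that idea, turning a random share of the capped values into missing values, might look like the sketch below. The helper mask_capped, the 50% fraction, and the toy series are all assumptions for illustration; the model-based imputation step is deliberately left out.

```python
import numpy as np
import pandas as pd

def mask_capped(values: pd.Series, cap: float, frac: float, seed: int = 0) -> pd.Series:
    """Set a random fraction of the entries equal to `cap` to NaN."""
    rng = np.random.default_rng(seed)
    capped_idx = values.index[values == cap]
    n_mask = int(round(len(capped_idx) * frac))
    to_mask = rng.choice(capped_idx, size=n_mask, replace=False)
    out = values.astype(float).copy()
    out.loc[to_mask] = np.nan
    return out

# Toy gaps between orders (made-up); four of them sit at the cap of 30.
gaps = pd.Series([7, 30, 14, 30, 30, 3, 30, 21])
masked = mask_capped(gaps, cap=30, frac=0.5)
print(masked)
```

The NaN entries would then be the prediction targets for whichever model you choose, with the remaining rows as training data.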

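A similar case of anomalous data appears in another article, a bike-sharing demand prediction tutorial (共享單車需求預測), which explains it very clearly and was my first introduction to anomalous data. Afterwards I looked into how to detect it, and the function above is the result. Real-world data is often messy, and data quality needs the attention of data analysts and data scientists too!

So what do you do when you run into skewed data like this? I handle it by taking the logarithm of highly skewed variables. Financial and consumer datasets in particular are often highly skewed.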

def make_skew_transform(df, feature):
    """
    Log-transform highly skewed columns.
    Input:
        df: DataFrame
        feature: list of column names (the variables used to predict Y)
    Output:
        X_rep: copy of df with a log transform applied to the skewed columns
    """
    X_rep = df.copy()
    # Absolute skewness of each candidate column, largest first
    skew = X_rep[feature].skew().abs().sort_values(ascending=False)
    print(skew)

    var_x_ln = skew.index[skew > 1]
    print('Applying a log transform to columns with |skew| > 1')

    for var in var_x_ln:
        col_min = X_rep[var].min()
        if col_min <= 0:
            # Shift the column so its minimum becomes 0.01 before taking the log
            X_rep[var] = np.log(X_rep[var] - col_min + 0.01)
        else:
            X_rep[var] = np.log(X_rep[var])
    return X_rep
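To see what a log transform does to skewness, here is a small check on synthetic log-normal data. It uses np.log1p in place of the shift-and-log approach, which is an assumption that fits non-negative data; the "spend" series is made up.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "spend" data (log-normal), seeded for reproducibility.
rng = np.random.default_rng(42)
spend = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5_000))

skew_before = spend.skew()
# log1p(x) = log(1 + x) is defined at x = 0, so non-negative data
# needs no manual shift before the transform.
skew_after = np.log1p(spend).skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```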

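And that's our transformation function! Next, what share of purchases are "bought before and bought again"? This is the repurchase rate, an indicator of brand loyalty.

# Share of order lines that are repeat purchases
order_products_prior_df.reordered.sum() / order_products_prior_df.shape[0]
order_products_train_df.reordered.sum() / order_products_train_df.shape[0]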

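It comes out to roughly 0.6, which is quite high: many customers had bought the same products before.

And what share of orders contain at least one reordered item? In other words, how many orders are "probably" from people who have shopped here before?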

grouped_df = order_products_prior_df.groupby('order_id')['reordered'].agg('sum').reset_index()
# Collapse the per-order sum to a 0/1 flag; .loc on the frame avoids chained assignment.
grouped_df.loc[grouped_df['reordered'] > 1, 'reordered'] = 1
grouped_df['reordered'].value_counts() / grouped_df.shape[0]
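The two reorder measures are easy to mix up, so here is a toy sketch (made-up rows) that computes both side by side: the item-level rate over all order lines and the order-level rate over orders that contain at least one repeat purchase.

```python
import pandas as pd

# Toy order lines (made-up): one row per product in an order.
lines = pd.DataFrame({
    "order_id":  [1, 1, 1, 2, 2, 3],
    "reordered": [1, 0, 1, 0, 0, 1],
})

# Item-level rate: share of order lines that are repeat purchases.
item_rate = lines["reordered"].mean()

# Order-level rate: share of orders with at least one repeat purchase.
has_reorder = lines.groupby("order_id")["reordered"].max()
order_rate = has_reorder.mean()

print(item_rate, order_rate)
```

Taking the per-order max of a 0/1 flag gives the same result as summing and then capping at 1.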

How many products does each order contain?

# Careful: the "order" in add_to_cart_order means sequence, not purchase order, so the max tells us how many products the order contains.
grouped_df = order_products_train_df.groupby('order_id')['add_to_cart_order'].aggregate('max').reset_index()
cnt_srs = grouped_df.add_to_cart_order.value_counts()

plt.figure(figsize=(20, 8))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8)
plt.ylabel('Number of occurrences', fontsize=12)
plt.xlabel('Number of products  in the given order' , fontsize = 12)
plt.xticks(rotation = 0)

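Most orders contain about five products. So which products are bought most often? For that we go back to products_df:

I basically treat department as the product category: brick-and-mortar retail splits by department, while e-commerce calls it a category. Both are ways of managing information, sorting items into groups.

Which products sell best?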

order_products_prior_df = pd.merge(order_products_prior_df , products_df , on  = 'product_id' ,  how= 'left')
order_products_prior_df = pd.merge(order_products_prior_df , aisles_df , on = 'aisle_id' , how = 'left')
order_products_prior_df = pd.merge(order_products_prior_df , departments_df ,on = 'department_id' , how = 'left')
order_products_prior_df.head()
cnt_srs = order_products_prior_df['product_name'].value_counts().reset_index().head(20)
cnt_srs.columns = ['product name' , 'frequency_count']
cnt_srs

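Almost all of them are organic! (So healthy!) Apart from seasonings such as garlic, most are fruit.

This store may be a favorite of vegetarians and healthy eaters, or its produce may be very fresh and have earned a good reputation.

Next, let's look at the top MVP-level aisles XD

cnt_srs = order_products_prior_df['aisle'].value_counts().head(20)

plt.figure(figsize=(20, 8))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[5])
plt.ylabel('Number of occurrences', fontsize=12)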
plt.xlabel('Aisle' , fontsize = 12)
plt.xticks(rotation = 25)

Fruit, vegetables, yogurt, and other healthier foods look like this store's main draw! Good news for salad lovers! Aisles work like a second-level category that splits things more finely.

We also want to see whether the order in which items go into the cart is related to reordering, since we'd expect the first few items added to be the ones people always buy.

order_products_prior_df.add_to_cart_order.value_counts()
# The tail is very sparse (the last positions occur only once), so we cap the values to reduce the number of levels.
order_products_prior_df['add_to_cart_mod'] = order_products_prior_df.add_to_cart_order.copy()
order_products_prior_df.loc[order_products_prior_df.add_to_cart_mod > 70, 'add_to_cart_mod'] = 70
grouped_df = order_products_prior_df.groupby('add_to_cart_mod')['reordered'].aggregate('mean').reset_index()

plt.figure(figsize=(18, 8))
sns.pointplot(x=grouped_df['add_to_cart_mod'].values, y=grouped_df['reordered'].values, alpha=0.8, color=color[2])
plt.ylabel('Reorder ratio', fontsize=12)
plt.xlabel('Add to cart order' , fontsize = 12)
plt.title('Add to cart order - Reorder ratio'  , fontsize = 18)

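Interesting: there really is a relationship. It makes sense, too, since we tend to start with a shopping list and load the cart with all the essentials first, then browse the rest of the store at leisure. Still, the relationship is remarkably strong.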
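The capping step can also be written with Series.clip, which caps every value above a threshold in one call. The sketch below uses made-up rows and a cap of 3 only because the toy data is tiny; the analysis above caps at 70.

```python
import pandas as pd

# Toy order lines (made-up): cart position and whether the item was a reorder.
lines = pd.DataFrame({
    "add_to_cart_order": [1, 1, 2, 2, 3, 80, 95],
    "reordered":         [1, 1, 1, 0, 0, 0, 1],
})

# clip(upper=...) replaces every position above the cap with the cap itself.
lines["add_to_cart_mod"] = lines["add_to_cart_order"].clip(upper=3)

# Mean of a 0/1 flag per position = reorder ratio at that cart position.
reorder_by_position = lines.groupby("add_to_cart_mod")["reordered"].mean()
print(reorder_by_position)
```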

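There is far more interesting EDA than we could ever get through! Plenty of other questions are worth digging into, for example: which products do people buy once and then rarely again?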

# temp_series = pd.DataFrame(temp_series)
# temp_series['labels'] = temp_series.index
# temp_series.columns = ['labels' , 'values']
# temp_series.iplot(kind = 'pie' , labels='labels', values='values')
grouped_df = order_products_prior_df.groupby('department')['reordered'].aggregate('mean').reset_index()

plt.figure(figsize=(20, 8))
sns.pointplot(x=grouped_df['department'], y=grouped_df['reordered'].values, alpha=0.8, color=color[2])
plt.ylabel('Reorder ratio' , fontsize = 12)
plt.xlabel('Department' , fontsize= 12)
plt.title('Department wise reorder ratio'  , fontsize = 18)

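Personal care products have the worst reorder rate. Perhaps many of them are new products, or perhaps the products are simply duds.

As for produce, we already knew it sells very well, and we find that protein foods such as dairy and eggs do well too.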
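The same groupby idea can be pushed down to the product level to look for items that are bought once and rarely again. This is a sketch on made-up rows; the product names and the MIN_PURCHASES threshold are hypothetical and would need tuning on the real data.

```python
import pandas as pd

# Toy order lines (made-up product names and flags).
lines = pd.DataFrame({
    "product_name": ["Banana"] * 4 + ["Birthday Candles"] * 3 + ["Saffron"],
    "reordered":    [1, 1, 1, 0, 0, 0, 0, 0],
})

stats = lines.groupby("product_name")["reordered"].agg(["count", "mean"])

# Ignore rarely bought products: a ratio based on one or two purchases is mostly noise.
MIN_PURCHASES = 3  # hypothetical threshold
rarely_rebought = (
    stats[stats["count"] >= MIN_PURCHASES]
    .sort_values("mean")
    .rename(columns={"count": "purchases", "mean": "reorder_ratio"})
)
print(rarely_rebought)
```

Products at the top of this table are purchased often enough to matter yet seldom repurchased, which makes them natural candidates for a closer look.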

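From here you can keep digging into whatever interests you. Data visualization lets us see the distributions, relationships, and features in the data clearly, all of which helps the later modeling to some degree. In the next part I will use K-means to find this supermarket's consumer profiles and behavioral traits, so don't miss it!

That's it for part one of the segmentation series: how to do exploratory data analysis on supermarket data. I think the quality of this dataset is quite high and many columns are very analyst-friendly. While doing the analysis I really admired the supermarket's data science team (or did Kaggle help clean these columns XD), because the columns and their names are all meaningful. There are also plenty of interesting algorithms to practice on it (association analysis, for example), and it is real-world data. As always, this data science series doesn't go deep into the details of the code; I hope my analysis framework is useful to you! See you in the next article!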

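I'd also like to run a small experiment to find out whether you liked this article:
Clap 10 times: just checking in to show support (thanks for the encouragement!)
Clap 20 times: I'd like more business and management posts
Clap 30 times: I'd like more data science posts
Clap 50 times: with a reader like you, writing this was worth it!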