Building a Website Recommendation System: How I Analyzed Airbnb's Data

Published in Finformation: When Data Science Meets Finance · 29 min read · May 19, 2019

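In data science applications, ranking is an important business use case. Showing customers the content they want quickly improves their experience, and companies go to great lengths to push the content each customer most wants in front of them. If you are interested in ranking within data science, have a look at the well-known teams working on it, including the data science teams at Netflix, KKbox, and Airbnb; Airbnb even has its own data science community. This project is an Airbnb data science project, and its business goal is: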

When a user is about to book, push the five Airbnb listings that are most likely and best suited to them to their phone.

Data: https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings

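We can expect to need geographic information and mobile access data, and the competition provides both. Best of all, it provides "first touch point" data, which is used heavily in practice. When building a User Journey Map, the touch point is the starting point of the whole map, and it matters a lot in marketing. It also lets us look at ad conversion rates and at the contexts in which customers are more likely to arrive at our site and go on to use the service, so this data is very valuable. I am grateful to Airbnb for making real e-commerce data public.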

import pandas as pd
import numpy as np

# Load the data. It is split into training and test sets, so we merge the two with pandas concat.
train_data = pd.read_csv('../input/train_users_2.csv')
test_data = pd.read_csv('../input/test_users.csv')
print('train data num:', train_data.shape[0])
print('test data num:', test_data.shape[0])
df = pd.concat([train_data, test_data], axis=0, ignore_index=True)
df.head()

# Get an overview of our data.
def overview_data(df):
    df.info()
    print('rows: ',df.shape[0])
    print('features:  ' , df.shape[1])
    print('Number of missing value :')
    print(df.isnull().sum().values.sum())
    print(df.nunique())
overview_data(df)

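We find that age has many missing values, and its 145 distinct values are clearly unreasonable: that would mean the ages span more than 140 different values, so we need to look into this later. Four gender values is also odd; the metadata shows that some of them stand for missing data. There are about 270,000 rows in total. date_first_booking also has far too few values, and it turns out the test data has none at all! With a gap this large, we may have to consider dropping this column.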

Metadata

There are 16 user features plus 6 features from the session data.

user_id: to be joined with the column ‘id’ in users table
action
action_type
action_detail
device_type
secs_elapsed
id
date_account_created: the date of account creation
timestamp_first_active: timestamp of the first activity, note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
date_first_booking: date of first booking
gender
age
signup_method
signup_flow: the page a user came to sign up from
language: international language preference
affiliate_channel: what kind of paid marketing
affiliate_provider: where the marketing is e.g. google, craigslist, other
first_affiliate_tracked: what's the first marketing the user interacted with before signing up
signup_app
first_device_type
first_browser
country_destination: this is the target variable we are to predict

Clear missing data

print(df.date_first_booking.unique())
print(df.first_browser.unique())
print(df.gender.unique())
df.isna().head()

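Above we spotted the oddity in gender. The metadata shows that the extra values are unknown and other, so we will handle them. By the way, someone actually browsed on a PS Vita; I miss my old PSP XD. It is also worth noting that Chrome and Chrome Mobile are different values! Looking at the other missing data, Airbnb seems to use the -unknown- marker internally for values it does not know. These unknowns come from errors during data collection rather than from values that do not exist. Even so, we first set them to NaN and then think about how to fill them in.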

First I want to see whether they can be filled in, so let's plot the distribution:

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# df[df.date_first_booking.isna()]
df.date_account_created = pd.to_datetime(df.date_account_created)
df.date_first_booking = pd.to_datetime(df.date_first_booking)
# Can we infer first_booking from the account creation date? Look at the gap in days.
booked = df[df.date_first_booking.notna()]
day_collect = (booked.date_first_booking - booked.date_account_created).dt.days
plt.figure(figsize=(8, 4))
plt.title('Should we drop outlier?')
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(day_collect, kde=True)

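Interestingly, some extreme cases are accounts that were created long before they were first used, but most people book shortly after creating the account, perhaps because they suddenly needed a room and signed up on the spot. The distribution resembles a normal distribution with very high kurtosis. Statistically, high kurtosis means that most of the variance in the data comes from extreme events. So here I would lean toward removing values that are too extreme, such as 400 days, and using the remaining data to predict the first_booking date, since dropping a whole feature would be a pity. On the other hand, we cannot confirm that this feature is useful, so to save effort XD I will drop it for now: judging from the day gaps I observed, the first booking date and the account creation date may be quite strongly collinear, which is not something we want to see when modeling.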
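The alternative mentioned above, keeping the feature and trimming the extreme lags, could be sketched roughly as follows. This is only an illustration on made-up dates: the toy frame, the `MAX_LAG_DAYS` name, and the exact cutoff are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data (invented): account creation and first booking dates.
toy = pd.DataFrame({
    "date_account_created": pd.to_datetime(
        ["2014-01-01", "2014-01-05", "2014-02-01", "2013-01-01", "2014-03-10"]),
    "date_first_booking": pd.to_datetime(
        ["2014-01-02", "2014-01-05", "2014-02-04", "2014-06-01", "2014-03-11"]),
})

# Lag in whole days between creating the account and the first booking.
lag_days = (toy["date_first_booking"] - toy["date_account_created"]).dt.days

# Excess kurtosis of the lag; one extreme value is enough to inflate it.
print("kurtosis:", lag_days.kurt())

# Hypothetical cutoff, following the "400 days" mentioned in the text.
MAX_LAG_DAYS = 400
kept = lag_days[lag_days <= MAX_LAG_DAYS]
print(kept.tolist())
```

Only the rows under the cutoff would then be used to model the booking lag; the article itself takes the simpler route and drops the column.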

# Following the analysis above, turn '-unknown-' into NaN
df['gender'] = df['gender'].replace('-unknown-', np.nan)
df['first_browser'] = df['first_browser'].replace('-unknown-', np.nan)
# Drop date_first_booking entirely
df.drop('date_first_booking', axis=1, inplace=True)
df.isnull().sum()
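To see what the placeholder replacement does, here is a minimal sketch on a toy frame. The rows are invented; only the `-unknown-` marker and the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame that mimics the '-unknown-' placeholder; the values are made up.
toy = pd.DataFrame({
    "gender": ["MALE", "-unknown-", "FEMALE", "-unknown-"],
    "first_browser": ["Chrome", "Chrome Mobile", "-unknown-", "Safari"],
    "age": [31.0, np.nan, 27.0, 45.0],
})

# Count how often the placeholder appears in each text column.
text_cols = ["gender", "first_browser"]
placeholder_counts = (toy[text_cols] == "-unknown-").sum()

# Replace the placeholder so pandas treats it as missing.
cleaned = toy.replace("-unknown-", np.nan)
missing_after = cleaned.isna().sum()

print(placeholder_counts)
print(missing_after)
```

Counting the placeholder before replacing it gives a quick idea of how much of each column is really unknown, which the plain `isnull()` summary hides until the replacement is done.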

With that done, we can fill in the data column by column, basically choosing the mode or the mean depending on each distribution.

# First, what does age look like?
df.age.describe()

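Remember the overview above? The age range is obviously very wide, and a closer look shows that someone is "2014 years old", which is clearly unreasonable. One possibility is that they entered a year (such as the sign-up year) in the age field. The minimum age is 1, probably a kid playing around XD. Airbnb basically lets anyone with a credit or debit card register; in Taiwan you have to be 20 to apply for one and in the US 18 is enough, so 1 is ridiculous. Also worth noting: the 28 to 33 range is narrow but covers 25% of users, which suggests that Airbnb's market positioning is around age 30, close to the young-parent segment. Next, let's look at the statistics for ages above 1000.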

df[df.age>1000]['age'].describe()

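Wow, a lot of people (828) entered their age incorrectly. 1920 is obviously a prank. 2014 accounts for most of these rows, so let's see what happened in 2014: Airbnb raised a large funding round that year... look it up if you are interested.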

Since in the US you can get a card at 18 XD, let's look at the group under 18:

df[df.age< 18].age.describe()

16 is actually fairly reasonable, while 5 is too much. We can use 16 as a filter (threshold) to screen these values out.

We decide to restrict the data to a range. 16 is still quite reasonable; after all, high school students can use their parents' card (???):

df_with_year = df.age > 1000
# The data dates from 2015, so we use 2015 as the reference year.
# Use loc[filter, col_name] so that pandas assigns into df rather than into a temporary copy.
df.loc[df_with_year, 'age'] = 2015 - df.loc[df_with_year, 'age']
df.loc[df.age > 95, 'age'] = np.nan
df.loc[df.age < 16, 'age'] = np.nan
df.age.describe()
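The age rule can also be wrapped in a small helper and checked on a toy Series. This is a sketch, not the notebook's code: the function name `clean_age` and its parameters are hypothetical, while the thresholds (1000, 16, 95) and the reference year 2015 follow the text.

```python
import numpy as np
import pandas as pd

def clean_age(age, reference_year=2015, min_age=16, max_age=95):
    """Turn year-like entries into ages, then blank out implausible ages."""
    age = age.astype(float)
    looks_like_year = age > 1000
    # Where the value looks like a year, replace it with reference_year - value.
    age = age.where(~looks_like_year, reference_year - age)
    # Keep only ages inside [min_age, max_age]; everything else becomes NaN.
    return age.where(age.between(min_age, max_age))

raw = pd.Series([1.0, 5.0, 16.0, 30.0, 105.0, 1985.0, 2014.0, np.nan])
cleaned = clean_age(raw)
print(cleaned.tolist())
```

Note what happens to an entry of 2014: it is first converted to an age of 1 and then removed by the lower bound, so those rows end up missing rather than recovered.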

Exploratory Data Analysis (EDA)

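plt.figure(figsize=(12, 6))
# histplot plus rugplot replace distplot(rug=True), which is deprecated in recent seaborn releases
sns.histplot(df.age.dropna(), kde=True)
sns.rugplot(df.age.dropna())
# Remove the top and right plot spines
sns.despine()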

import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)

labels = df.groupby(df.country_destination).agg('count').id.keys().tolist()
values = df.groupby(df.country_destination).agg('count').id.values.tolist()
trace = go.Pie(labels=labels, values=values)
py.iplot([trace])
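The shares that the pie chart displays can also be computed as plain numbers. A minimal sketch on an invented destination column (the proportions here are made up and are not the dataset's):

```python
import pandas as pd

# Invented destinations: 'NDF' marks users without a booking.
dest = pd.Series(["NDF"] * 6 + ["US"] * 3 + ["FR"], name="country_destination")

# Share of every destination among all users.
share = dest.value_counts(normalize=True)

# Share of each destination among users who actually booked.
booked = dest[dest != "NDF"]
share_among_booked = booked.value_counts(normalize=True)

print(share)
print(share_among_booked)
```

Looking at both tables separates two questions: how many users book at all, and where the users who do book end up going.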

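About 60% of users have never booked. Those who did book mostly booked in the US; since this is US data, it seems most people use Airbnb mainly for domestic travel. The following analysis therefore needs to exclude the users who never booked, so that we can learn the behavioral traits of customers who do book. So, is age related to the destination country? For example, are younger people more inclined to travel abroad?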

plt.figure(figsize =(12 , 6))
df_without_NDF = df[df.country_destination != 'NDF'].copy()  # copy so later column assignments do not hit a view of df
sns.boxplot(x = 'country_destination' , y = 'age' , data = df_without_NDF)
plt.xlabel('Destination country box plot ')
plt.ylabel('Age of users')
plt.title('Country destination v.s. age')
sns.despine()

Product data analysis

Next we analyze the sign-up data. I will be using pie charts a lot, so it is handy to write a function:

def call_pie(col_name):
    """Plot a pie chart of the distribution of the given column."""
    labels = df.groupby(df[col_name]).agg('count').id.keys().tolist()
    values = df.groupby(df[col_name]).agg('count').id.values.tolist()
    trace = go.Pie(labels=labels, values=values)
    py.iplot([trace])

call_pie('signup_method')

call_pie('signup_app')
call_pie('affiliate_provider')

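About 80% of users signed up through the website. Registration only requires an email, which is very convenient, and in principle a phone should be even more convenient than the website, so this makes us more curious about where the web traffic comes from. It may be a sign that content marketing (customers discover Airbnb while browsing web pages or blogs and click through to create an account) and ad-driven traffic are working well.
As for the mobile apps, people mainly use apps for social networking, so fewer people sign up there; it is also worth noting that iOS users outnumber Android users. We can also see that among users who arrived through another service, Google is the largest source (23.9%), but 65.8% came directly.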
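Shares of sign-ups say where users come from, but not how well each source converts. A minimal sketch of a per-provider booking rate, on invented rows; treating any destination other than `NDF` as a booking is an assumption made for this illustration.

```python
import pandas as pd

# Invented users: where they came from and where (if anywhere) they booked.
toy = pd.DataFrame({
    "affiliate_provider": ["direct", "direct", "direct", "google", "google", "craigslist"],
    "country_destination": ["US", "NDF", "NDF", "US", "FR", "NDF"],
})

# Assumption: a user "converted" if the destination is anything but NDF.
toy["booked"] = toy["country_destination"] != "NDF"

# Number of users and booking rate per provider, best converters first.
conversion = (
    toy.groupby("affiliate_provider")["booked"]
    .agg(users="size", booking_rate="mean")
    .sort_values("booking_rate", ascending=False)
)
print(conversion)
```

On real data the `users` column matters as much as the rate, since a provider with a handful of users can show an extreme rate by chance.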

Analyzing time series data

We first need to convert the time columns to datetime before we can plot time series.

# Convert the columns to datetime
df_without_NDF['date_account_created'] = pd.to_datetime(df_without_NDF.date_account_created)
# The original timestamp is very fine-grained; we keep only year, month and day
df_without_NDF['timestamp_first_active'] = pd.to_datetime(df_without_NDF.timestamp_first_active // 1000000, format='%Y%m%d')

# Plot the time series
plt.figure(figsize=(12, 6))
df_without_NDF.date_account_created.value_counts().plot(kind='line', linewidth=1.2)
plt.xlabel('Date')
plt.title('New account created over time')
sns.despine()

plt.figure(figsize=(12, 6))
df_without_NDF.timestamp_first_active.value_counts().plot(kind='line', linewidth=1.2)
plt.title('First active date over time')
plt.xlabel('Date')
sns.despine()

2014 was the year Airbnb settled its headquarters and other operational matters, and we can see that Airbnb grew very rapidly from 2014 on.

So what did things look like before the rapid growth of 2014?

# Slice out 2013-01-01 up to (but not including) 2014-01-01, i.e. the whole of 2013
df_2013 = df_without_NDF[df_without_NDF.timestamp_first_active >= pd.to_datetime('2013-01-01')]
df_2013 = df_2013[df_2013.timestamp_first_active < pd.to_datetime('2014-01-01')]
plt.figure(figsize = (12,6))
plt.title('First active date 2013')
plt.xlabel('Date')
df_2013.timestamp_first_active.value_counts().plot(kind = 'line' , linewidth = 1.2)
sns.despine()

Looking at the 2013 time series, we can see a regular periodicity, and another obvious feature is the sharp drop in December.
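One way to check whether such a cycle is weekly is to average the daily counts by day of the week. The sketch below uses synthetic counts (120 on weekdays, 80 on weekends) purely for illustration; it does not reproduce the Airbnb series.

```python
import pandas as pd

# Synthetic daily counts for four full weeks; 2013-01-07 is a Monday.
dates = pd.date_range("2013-01-07", periods=28, freq="D")
counts = pd.Series([120 if d.dayofweek < 5 else 80 for d in dates], index=dates)

# Average count per day of week (0 = Monday, 6 = Sunday).
by_weekday = counts.groupby(counts.index.dayofweek).mean()
print(by_weekday)
```

If the averages differ clearly between weekdays and weekends on the real data, the periodicity seen in the line chart is at least partly a weekly effect.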

Time to analyze the data the site itself stores!

Looking at the app usage data, the first question is which actions are performed most often.

sessions = pd.read_csv('../input/sessions.csv')
print("There were" , len(sessions.user_id.unique()) , 'unique user IDs in the session data.')
sessions['action_type'] = sessions['action_type'].replace('-unknown-', np.nan)
sessions.action_type.value_counts()

And how lopsided are the proportions?

def session_pie(df , col_name , name):
    """
    Draw a donut-style pie chart.
    df: the DataFrame to use
    col_name: name of the column (Series) to plot
    name: title of the chart
    """
    fig = {
      "data": [
        {
          "values": df[col_name].value_counts().tolist(),
          "labels": df[col_name].value_counts().keys().tolist(),
          "domain": {"column": 0},
          "name": name,
          "hoverinfo":"label+percent+name",
          "hole": .4,
          "type": "pie"
        }
        ],
      "layout": {
            "title":name + '   ' +'distribution',
#             "grid": {"rows": 1, "columns": 2},
            "annotations": [
                {
                    "font": {
                        "size": 20
                    },
                    "showarrow": False,
                    "text": "GHG",
                    "x": 0.20,
                    "y": 0.5
                }
            ]
        }
    }
    py.iplot(fig)
session_pie(sessions , 'action_type' , name = 'interaction')

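I don't know whether it is a coincidence, but even though Windows has the larger market share, Mac and iOS users are more numerous here, and the iPad does well too. Combined with the earlier analysis, I can't help wondering whether most people aged roughly 28 to 40 carry an iPhone and own a Mac laptop.
For the younger generation, those currently 16 to 25, this will be even more pronounced. Although this is an Airbnb analysis project, it gives a glimpse of how competitive Apple's products are. The analysis above also tells us who books rooms through Airbnb's website, so do these people show any clear patterns in how they use the site?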

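On the web, browsing still dominates, but it is a bit surprising that message_post is not higher. This reflects that we have indeed entered the era of digital natives. My personal guess is that many customers go straight to the internet to look for similar cases when they hit a problem. Airbnb could try adding KOLs or content marketing and run A/B tests to see whether the customer experience improves.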

Data processing

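Time to start modeling! We go back to the original data. The earlier EDA showed that the data is seasonal, so we first split the dates into separate time components. The code looks complicated, but all it does is break the dates into finer parts: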

# Convert the time columns to datetime
# Split them into year, month, weekday and day
# Build a new feature: the lag between first activity and account creation
# Drop the original time columns
df.date_account_created = pd.to_datetime(df.date_account_created)
df.timestamp_first_active = pd.to_datetime((df.timestamp_first_active // 1000000), format='%Y%m%d')
# dt.day_name() replaces dt.weekday_name, which was removed in pandas 1.0
df['weekday_account_created'] = df.date_account_created.dt.day_name()
df['day_account_created'] = df.date_account_created.dt.day
df['month_account_created'] = df.date_account_created.dt.month
df['year_account_created'] = df.date_account_created.dt.year
df['weekday_first_active'] = df.timestamp_first_active.dt.day_name()
df['day_first_active'] = df.timestamp_first_active.dt.day
df['month_first_active'] = df.timestamp_first_active.dt.month
df['year_first_active'] = df.timestamp_first_active.dt.year
# Lag in whole days between first activity and account creation
df['time_lag'] = (df['date_account_created'] - df.timestamp_first_active).dt.days
cols_to_drop = ['date_account_created', 'timestamp_first_active']
df.drop(cols_to_drop, axis=1, inplace=True)
# Clean age as before, then mark missing ages with -1
df_with_year = df['age'] > 1000
df.loc[df_with_year, 'age'] = 2015 - df.loc[df_with_year, 'age']
df.loc[df.age > 95, 'age'] = np.nan
df.loc[df.age < 16, 'age'] = np.nan
df['age'] = df['age'].fillna(-1)
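The date handling is easier to follow on two hand-made rows. This sketch mirrors the steps above on toy values; the rows are invented, and the timestamp is converted through a string for clarity, which is a choice of this example rather than the notebook's.

```python
import pandas as pd

# Two invented users; timestamp_first_active uses the YYYYMMDDHHMMSS integer layout.
toy = pd.DataFrame({
    "date_account_created": ["2014-01-03", "2014-02-10"],
    "timestamp_first_active": [20140101123456, 20140210080000],
})

toy["date_account_created"] = pd.to_datetime(toy["date_account_created"])
# Integer division by 1,000,000 drops HHMMSS and leaves YYYYMMDD.
day_part = (toy["timestamp_first_active"] // 1000000).astype(str)
toy["timestamp_first_active"] = pd.to_datetime(day_part, format="%Y%m%d")

toy["weekday_account_created"] = toy["date_account_created"].dt.day_name()
toy["month_account_created"] = toy["date_account_created"].dt.month
# Days between first activity and account creation.
toy["time_lag"] = (toy["date_account_created"] - toy["timestamp_first_active"]).dt.days

print(toy)
```

The first user was active two days before creating an account, which is exactly the "searched before signing up" case the metadata describes.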

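Next we look at the session data. A very direct idea is to characterize customers individually, since each customer has their own behavior. The core factor in customer experience is arguably "smoothness", so we can use the 'secs_elapsed' feature to see whether it really has a large effect: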

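sessions.rename(columns={'user_id': 'id'}, inplace=True)
# Count how many times each user performed each action
action_count = sessions.groupby(['id', 'action'])['secs_elapsed'].agg(len).unstack()
# Same for action types
action_type_count = sessions.groupby(['id', 'action_type'])['secs_elapsed'].agg(len).unstack()
# And for action details
action_detail_count = sessions.groupby(['id', 'action_detail'])['secs_elapsed'].agg(len).unstack()
# Device may well matter: could so many people be on iOS because it really performs better than Android?
# Total seconds elapsed per user and device type
device_type_sum = sessions.groupby(['id', 'device_type'])['secs_elapsed'].agg('sum').unstack()
# Combine everything into one table
sessions_data = pd.concat([action_count, action_type_count, action_detail_count, device_type_sum], axis=1)
sessions_data.head()
# Note: the '_count' suffix is also applied to the device-type sums
sessions_data.columns = sessions_data.columns.map(lambda x: str(x) + '_count')

def most_common(s):
    # Most frequent value; .max() on strings would return the alphabetically last device instead
    counts = s.value_counts()
    return counts.idxmax() if len(counts) else np.nan

sessions_data['most_used_device'] = sessions.groupby('id')['device_type'].agg(most_common)
# Name the index 'id', then turn it back into a column so it can be merged on
sessions_data.index.names = ['id']
sessions_data.reset_index(inplace=True)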
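The groupby-then-unstack step turns one row per event into one row per user. A minimal sketch on an invented session log; it uses `size()` with `fill_value=0`, a small variation on the code above that leaves zeros instead of NaN for actions a user never performed.

```python
import pandas as pd

# Invented session log: one row per user action.
toy_sessions = pd.DataFrame({
    "id": ["u1", "u1", "u1", "u2", "u2"],
    "action": ["search", "search", "show", "show", "book"],
    "secs_elapsed": [10.0, 20.0, 5.0, 7.0, 300.0],
})

# One row per user, one column per action, values = number of occurrences.
action_count = (
    toy_sessions.groupby(["id", "action"])
    .size()
    .unstack(fill_value=0)
    .add_suffix("_count")
)
print(action_count)
```

Whether absent actions should be 0 or missing is a modeling choice; the article fills remaining NaN values with -1 before training.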

We compute statistics on the secs_elapsed metric and bin the gaps as follows:

less than 3,600 -> short_pause

greater than 86,400 -> day_pause

greater than 300,000 -> long_pause

# Per-user statistics of secs_elapsed, written as named aggregation
# (passing a renaming dict to agg no longer works since pandas 1.0)
secs_elapsed_stats = sessions.groupby('id')['secs_elapsed'].agg(
    secs_elapsed_sum='sum',
    secs_elapsed_mean='mean',
    secs_elapsed_min='min',
    secs_elapsed_max='max',
    secs_elapsed_median='median',
    secs_elapsed_std='std',
    secs_elapsed_var='var',
    day_pause=lambda x: (x > 86400).sum(),
    long_pause=lambda x: (x > 300000).sum(),
    short_pause=lambda x: (x < 3600).sum(),
    session_length=np.count_nonzero,
)
secs_elapsed_stats.reset_index(inplace=True)
sessions_secs_elapsed = pd.merge(sessions_data, secs_elapsed_stats, on='id', how='left')
df = pd.merge(df, sessions_secs_elapsed, on='id', how='left')
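A toy check of the pause thresholds helps to see how they behave. The rows below are invented; the thresholds are the ones given in the text.

```python
import pandas as pd

# Invented gaps (in seconds) between consecutive actions.
toy_sessions = pd.DataFrame({
    "id": ["u1", "u1", "u1", "u2", "u2"],
    "secs_elapsed": [120.0, 90000.0, 400000.0, 30.0, 4000.0],
})

# Count, per user, how many gaps fall under or over each threshold.
pauses = toy_sessions.groupby("id")["secs_elapsed"].agg(
    short_pause=lambda x: (x < 3600).sum(),
    day_pause=lambda x: (x > 86400).sum(),
    long_pause=lambda x: (x > 300000).sum(),
)
print(pauses)
```

Two things show up here: every long pause is also counted as a day pause, because the thresholds overlap, and a gap between 3,600 and 86,400 seconds (the 4,000 for `u2`) is not counted in any of the three.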

Data encoding

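One more important step: we need to encode the categorical variables for the modeling that follows.
Here we use one-hot encoding throughout, because we treat them all as nominal variables.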

cate_features = ['gender', 'signup_method', 'signup_flow', 'language','affiliate_channel',
                 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser', 
                 'most_used_device', 'weekday_account_created', 'weekday_first_active']
df = pd.get_dummies(df , columns=cate_features)
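For readers new to one-hot encoding, this is what `pd.get_dummies` does to a single nominal column. The rows are invented, and `dtype=int` is used here only so the output prints as 0/1.

```python
import pandas as pd

# Invented users with one categorical column and one numeric column.
toy = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "signup_method": ["basic", "facebook", "basic"],
    "age": [30.0, -1.0, 41.0],
})

# Each category becomes its own 0/1 column; other columns are left untouched.
encoded = pd.get_dummies(toy, columns=["signup_method"], dtype=int)
print(encoded)
```

No category is treated as larger or smaller than another, which is the point of using this encoding for nominal variables.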

Evaluating the ranking with NDCG (Normalized Discounted Cumulative Gain)

It is used a lot in practice, and it is a well-known metric for evaluating website search results.

Reference papers:
1. http://proceedings.mlr.press/v30/Wang13.pdf
2. http://hal.in2p3.fr/file/index/docid/726760/filename/07-busa-fekete.pdf

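Going back to our business goal, "show the user their top 5 bookings", we set the hyperparameter k to 5.
I am just dropping in a function someone else wrote XD, because I haven't fully understood this algorithm yet QQQ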

def ndcg_score(preds , dtrain):
    labels = dtrain.get_label()
    top = []
    
    for i in range(preds.shape[0]):
        top.append(np.argsort(preds[i])[::-1][:5])
        
    mat = np.reshape(np.repeat(labels , np.shape(top)[1]) == np.array(top).ravel() , np.array(top).shape).astype(int)
    score = np.mean(np.sum(mat/np.log2(np.arange(2, mat.shape[1] + 2)),axis = 1))
    return 'ndcg', score
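When each user has exactly one correct destination, NDCG@5 reduces to something simple: the score is 1 / log2(position + 1) if the true label appears at that position in the top 5, and 0 otherwise, because the ideal ranking always scores 1. A minimal sketch of that special case (the function name `ndcg_at_k_single` is hypothetical and is not part of the borrowed code above):

```python
import numpy as np

def ndcg_at_k_single(true_label, ranked_labels, k=5):
    """NDCG@k when exactly one label is relevant."""
    for position, label in enumerate(ranked_labels[:k], start=1):
        if label == true_label:
            # Gain of 1, discounted by the log of the position.
            return 1.0 / np.log2(position + 1)
    return 0.0

first_place = ndcg_at_k_single("US", ["US", "NDF", "other", "FR", "IT"])
second_place = ndcg_at_k_single("US", ["NDF", "US", "other", "FR", "IT"])
missing = ndcg_at_k_single("US", ["NDF", "other", "FR", "IT", "GB"])
print(first_place, second_place, missing)
```

So a correct first guess is worth 1, a correct second guess about 0.63, and the value keeps shrinking for lower positions; the vectorized function above averages this quantity over all users.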

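XGBoost has its own way of handling NaN (it learns a default branch for missing values at each split). Here we simply treat a missing value as a category of its own, so we encode every NaN as -1. Once the model is trained, we are done!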

from sklearn.preprocessing import LabelEncoder
import xgboost as xgb

# Keep only the rows that came from the training set
train_df = df[df['id'].isin(train_data['id'])].copy()
train_df.reset_index(drop=True, inplace=True)
train_df.fillna(-1, inplace=True)
# Pull out the target
y_train = train_df['country_destination']
train_df.drop(['country_destination', 'id'], axis=1, inplace=True)
# Convert the remaining DataFrame into a numeric matrix for XGBoost
x_train = train_df.astype('float32').values

# Label-encode the target: multi:softprob expects integer class ids,
# and a single target column cannot be one-hot encoded
label_encoder = LabelEncoder()
encoded_y_train = label_encoder.fit_transform(y_train)
xgtrain = xgb.DMatrix(x_train, label=encoded_y_train)

# Set the hyperparameters
param = {
    'max_depth': 10,
    'learning_rate': 1,
    'n_estimators': 5,
    'objective': 'multi:softprob',
    'num_class': 12,
    'gamma': 0,
    'min_child_weight': 1,
    'max_delta_step': 0,
    'subsample': 1,
    'colsample_bytree': 1,
    'colsample_bylevel': 1,
    'reg_alpha': 0,
    'reg_lambda': 1,
    'scale_pos_weight': 1,
    'base_score': 0.5,
    'missing': None,
    'silent': True,
    'nthread': 4,
    'seed': 42
}
# Train the model with cross-validation
num_round = 5
# Evaluate with the NDCG function defined above
# (recent XGBoost releases name this argument custom_metric instead of feval)
xgb.cv(param, xgtrain, num_boost_round=num_round, metrics=['mlogloss'], feval=ndcg_score)

This will take a while to train, so please be patient! My kernel crashed twice running this XD
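Once a model outputs class probabilities, the "top 5" from the business goal is an argsort away. The sketch below uses a hand-written probability matrix and an illustrative list of six classes (both are assumptions for the example); with the real model, the class names would come from the fitted `label_encoder`.

```python
import numpy as np

# Illustrative class list in the sorted order a label encoder would produce.
classes = np.array(["FR", "GB", "IT", "NDF", "US", "other"])

# Hand-written probabilities for two users (each row sums to 1).
probs = np.array([
    [0.04, 0.06, 0.10, 0.50, 0.25, 0.05],
    [0.02, 0.03, 0.05, 0.20, 0.60, 0.10],
])

# Sort each row by descending probability and keep the first five columns.
top5_idx = np.argsort(probs, axis=1)[:, ::-1][:, :5]
top5 = classes[top5_idx]
print(top5)
```

Each row of `top5` is the ranked list that would be shown to that user, and it is the same ranking that the NDCG@5 metric evaluates.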

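That is how to use Airbnb's data for search ranking. Airbnb has many other interesting datasets to practice on, and they are real-world data. As always, the data science posts in this series do not go deep into the details of the code; I hope my analysis framework is helpful to you!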
