Product Recommendations for New Users: Solving the Cold Start Problem When Users Have No Data

Recommending products to new users who have no purchase or usage history is a challenge, because we have no data about their preferences or buying behavior.

By fr4nk.xyz, published in myorder, Sep 30, 2024 (6 min read)

Image source: https://www.researchgate.net/figure/Rating-Matrix-rows-represent-users-and-columns-represent-items-The-entries-of-the_fig1_332511384

This article presents several methods and techniques for building an effective product recommendation system for new users.

1. Using Demographic Data

The first method we can use is to rely on the new user's demographic data.

Method:

1. Collect basic information such as age, gender, and address during registration

2. Analyze the purchasing trends of demographic groups with similar characteristics

3. Recommend products that are popular within that demographic group

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import NearestNeighbors

# Assume we have user data and purchase data
users_df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'age': [25, 30, 35, 40, 45],
    'gender': ['M', 'F', 'M', 'F', 'M'],
    'location': ['NY', 'CA', 'TX', 'FL', 'WA']
})

purchases_df = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 4, 5],
    'product_id': ['A', 'B', 'C', 'A', 'D', 'E']
})

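# One-hot encode the categorical columns
# (scikit-learn 1.2+ uses sparse_output; older releases used sparse=False)
enc = OneHotEncoder(sparse_output=False)
encoded_features = enc.fit_transform(users_df[['gender', 'location']])

# Combine the numeric column with the encoded columns into one NumPy array
# (scikit-learn rejects a DataFrame whose column names mix str and int)
features = pd.concat([users_df[['age']], pd.DataFrame(encoded_features)], axis=1).to_numpy()

# Build the Nearest Neighbors model
nn_model = NearestNeighbors(n_neighbors=3, metric='euclidean')
nn_model.fit(features)

# Recommend the products bought by the most similar existing users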
def recommend_products(new_user_features):
    distances, indices = nn_model.kneighbors([new_user_features])
    similar_users = users_df.iloc[indices[0]]['user_id']
    recommended_products = purchases_df[purchases_df['user_id'].isin(similar_users)]['product_id'].unique()
    return recommended_products

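# Example usage: age 28, male, located in NY
# Encode with the fitted encoder so the columns line up with the training data
new_user_categories = pd.DataFrame({'gender': ['M'], 'location': ['NY']})
new_user = [28] + enc.transform(new_user_categories)[0].tolist()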
print(recommend_products(new_user))
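
One caveat with the example above: age is on a much larger scale than the 0/1 one-hot columns, so it largely decides the Euclidean distance. The following is a minimal sketch of one possible fix, assuming that standardizing age with scikit-learn's StandardScaler suits your data. The names demo_users, scaler, encoder, and encode_new_user are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same toy users as in the example above
demo_users = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'age': [25, 30, 35, 40, 45],
    'gender': ['M', 'F', 'M', 'F', 'M'],
    'location': ['NY', 'CA', 'TX', 'FL', 'WA'],
})

# Standardize age so it is comparable to the 0/1 one-hot columns
scaler = StandardScaler()
age_scaled = scaler.fit_transform(demo_users[['age']].to_numpy())

# handle_unknown='ignore' encodes unseen categories as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')
categorical = encoder.fit_transform(
    demo_users[['gender', 'location']].to_numpy()
).toarray()

scaled_features = np.hstack([age_scaled, categorical])
scaled_nn = NearestNeighbors(n_neighbors=3, metric='euclidean').fit(scaled_features)

def encode_new_user(age, gender, location):
    """Build a feature row with the same scaling and column order as training."""
    age_part = scaler.transform(np.array([[age]]))
    cat_part = encoder.transform(
        np.array([[gender, location]], dtype=object)
    ).toarray()
    return np.hstack([age_part, cat_part])

new_row = encode_new_user(28, 'M', 'NY')
_, neighbor_idx = scaled_nn.kneighbors(new_row)
similar_user_ids = demo_users.iloc[neighbor_idx[0]]['user_id'].tolist()
print(similar_user_ids)
```

With scaling, a matching gender and location count for about as much as a few years of age difference. Whether that balance is right depends on the product catalog.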

2. Using Product Popularity

Another simple but effective method is to recommend products that are generally popular.

Method:

1. Analyze the sales or review scores of all products

2. Rank the products by popularity

3. Recommend the top-ranked products

import pandas as pd

# Assume we have product sales data
sales_df = pd.DataFrame({
    'product_id': ['A', 'B', 'C', 'A', 'B', 'D', 'E', 'A', 'C', 'E'],
    'quantity': [1, 2, 1, 3, 1, 2, 1, 2, 3, 1]
})

# Compute the total units sold for each product
product_popularity = sales_df.groupby('product_id')['quantity'].sum().sort_values(ascending=False)

# Recommend the n best-selling products
def recommend_popular_products(n=5):
    return product_popularity.head(n).index.tolist()

# Example usage
print(recommend_popular_products())
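
Step 1 above also mentions review scores as a popularity signal. A plain average rating tends to favor products with only one or two reviews, so the sketch below ranks by a damped mean instead. The review data and the prior weight m=5 are made up for illustration, and rank_by_damped_rating is a hypothetical helper, not part of the original code.

```python
import pandas as pd

# Hypothetical review data: one row per review, rating on a 1-5 scale
reviews_df = pd.DataFrame({
    'product_id': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'C', 'C'],
    'rating':     [5,   5,   4,   5,   5,   4,   5,   2,   3],
})

def rank_by_damped_rating(reviews, m=5):
    """Rank products by (n * mean + m * global_mean) / (n + m).

    m acts like m extra reviews at the global mean, so products with
    few reviews are pulled toward the overall average.
    """
    global_mean = reviews['rating'].mean()
    stats = reviews.groupby('product_id')['rating'].agg(['count', 'mean'])
    damped = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
    return damped.sort_values(ascending=False)

damped_scores = rank_by_damped_rating(reviews_df)
plain_best = reviews_df.groupby('product_id')['rating'].mean().idxmax()

print("Best by plain mean:", plain_best)
print("Ranking by damped mean:", damped_scores.index.tolist())
```

In this toy data, product B has a single 5-star review and wins on the plain mean, while the damped mean ranks the well-reviewed product A first.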

3. Using Content-Based Filtering

This method recommends products based on the attributes of the products the user is interested in.

Method:

1. Let the new user choose the product categories or attributes they are interested in

2. Analyze the attributes of the products in the system

3. Recommend products whose attributes match the user's interests

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assume we have product data
products_df = pd.DataFrame({
    'product_id': ['A', 'B', 'C', 'D', 'E'],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Books', 'Clothing'],
    'description': ['Smartphone with high-res camera', 'Comfortable cotton T-shirt', 'Laptop with fast processor', 'Bestselling novel', 'Stylish denim jeans']
})

# Build TF-IDF vectors from the product descriptions
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(products_df['description'])

# Compute the pairwise similarity between products
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Recommend the n products most similar to the given one (the product itself is skipped)
def recommend_similar_products(product_id, n=3):
    idx = products_df.index[products_df['product_id'] == product_id].tolist()[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:n+1]
    product_indices = [i[0] for i in sim_scores]
    return products_df['product_id'].iloc[product_indices].tolist()

# Example usage
print(recommend_similar_products('A'))
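
The steps above ask a new user to pick categories of interest, while the example starts from a known product ID. The sketch below shows one way the category step could look, assuming the user's profile is simply the average TF-IDF vector of the products in the chosen categories. catalog_df repeats the sample products, and recommend_for_categories is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

catalog_df = pd.DataFrame({
    'product_id': ['A', 'B', 'C', 'D', 'E'],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Books', 'Clothing'],
    'description': ['Smartphone with high-res camera', 'Comfortable cotton T-shirt',
                    'Laptop with fast processor', 'Bestselling novel',
                    'Stylish denim jeans'],
})

# Include the category in the text so products in one category share a term
catalog_text = catalog_df['category'] + ' ' + catalog_df['description']
catalog_tfidf = TfidfVectorizer(stop_words='english').fit_transform(catalog_text)

def recommend_for_categories(selected_categories, n=3):
    """Rank products by similarity to the average vector of the chosen categories."""
    mask = catalog_df['category'].isin(selected_categories).to_numpy()
    if not mask.any():
        return []  # no product matches, so there is no profile to compare against
    profile = np.asarray(catalog_tfidf[mask].mean(axis=0)).reshape(1, -1)
    scores = cosine_similarity(profile, catalog_tfidf).ravel()
    # Keep only products that share at least one term with the profile
    ranked = [i for i in np.argsort(-scores) if scores[i] > 0][:n]
    return catalog_df['product_id'].iloc[ranked].tolist()

electronics_picks = recommend_for_categories(['Electronics'])
print(electronics_picks)
```

With only five short descriptions the vocabulary barely overlaps across categories, so this mostly returns the chosen categories themselves. A real catalog with richer text would surface related products from other categories as well.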

4. Using a Hybrid Approach

Combining the methods above can give better results than relying on any single one.

Method:

1. Use demographic data to find a group of similar users

2. Analyze product popularity within that group

3. Use Content-Based Filtering to filter for products that match the user's interests

4. Present the products that pass all of these filters

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 1. Sample data
users_df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'age': [25, 30, 35, 40, 45],
    'gender': ['M', 'F', 'M', 'F', 'M'],
    'location': ['NY', 'CA', 'TX', 'FL', 'WA']
})

products_df = pd.DataFrame({
    'product_id': ['A', 'B', 'C', 'D', 'E'],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Books', 'Clothing'],
    'description': ['Smartphone with high-res camera', 'Comfortable cotton T-shirt', 'Laptop with fast processor', 'Bestselling novel', 'Stylish denim jeans']
})

purchases_df = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 4, 5, 2, 3, 4, 5],
    'product_id': ['A', 'B', 'C', 'A', 'D', 'E', 'A', 'B', 'C', 'D'],
    'quantity': [1, 2, 1, 3, 1, 2, 1, 2, 3, 1]
})

# 2. Preprocess the data
# One-hot encode the categorical columns
# (scikit-learn 1.2+ uses sparse_output; older releases used sparse=False)
enc = OneHotEncoder(sparse_output=False)
encoded_features = enc.fit_transform(users_df[['gender', 'location']])
# Convert to a NumPy array: scikit-learn rejects mixed str/int column names
user_features = pd.concat([users_df[['age']], pd.DataFrame(encoded_features)], axis=1).to_numpy()

# Build TF-IDF vectors from the product descriptions
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(products_df['description'])
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# 3. Build the model
nn_model = NearestNeighbors(n_neighbors=3, metric='euclidean')
nn_model.fit(user_features)

# 4. Hybrid recommendation function
def hybrid_recommend(new_user_features, user_interests, n=5):
    # 1. Find similar users
    distances, indices = nn_model.kneighbors([new_user_features])
    similar_users = users_df.iloc[indices[0]]['user_id']
    
    # 2. Rank product popularity among the similar users
    popular_products = purchases_df[purchases_df['user_id'].isin(similar_users)]
    popular_products = popular_products.groupby('product_id')['quantity'].sum().sort_values(ascending=False)
    
    # 3. Apply content-based filtering using the user's chosen categories
    user_interest_index = products_df[products_df['category'].isin(user_interests)].index
    if len(user_interest_index) > 0:
        interest_sim_scores = cosine_sim[user_interest_index].mean(axis=0)
    else:
        interest_sim_scores = np.ones(len(products_df))
    
    # 4. Combine the scores and rank the products
    final_scores = popular_products.to_dict()
    for idx, score in enumerate(interest_sim_scores):
        product_id = products_df.iloc[idx]['product_id']
        if product_id in final_scores:
            final_scores[product_id] *= (1 + score)
        else:
            final_scores[product_id] = score
    
    recommended_products = sorted(final_scores.items(), key=lambda x: x[1], reverse=True)[:n]
    return [product[0] for product in recommended_products]

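# 5. Example usage: age 28, male, located in NY
# Encode with the fitted encoder so the columns line up with the training data
new_user_categories = pd.DataFrame({'gender': ['M'], 'location': ['NY']})
new_user = [28] + enc.transform(new_user_categories)[0].tolist()
user_interests = ['Electronics', 'Books']
recommendations = hybrid_recommend(new_user, user_interests)
print("Recommended products:", recommendations)

# 6. Show the details of the recommended products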
recommended_details = products_df[products_df['product_id'].isin(recommendations)]
print("\nRecommended product details:")
print(recommended_details)

Code explanation:

1. Start by creating sample data for users, products, and purchases

2. Preprocess the data:

One-hot encode the users' categorical data
Build TF-IDF vectors from the product descriptions and compute their similarity

3. Build a Nearest Neighbors model for finding similar users

4. Define the hybrid_recommend function, which combines the methods:

Find similar users from demographic data
Rank product popularity within the group of similar users
Apply Content-Based Filtering based on the user's stated interests
Combine the scores from each method and rank the products

5. Show an example by creating a new user and setting their interests

6. Show the details of the recommended products

This approach lets us recommend suitable products to new users by considering demographic data, product popularity, and the user's interests together, which gives more balanced and effective results.
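
In hybrid_recommend, popularity is a purchase quantity while similarity lies between 0 and 1, so the two signals sit on different scales when they are combined. As one alternative, the sketch below rescales both to the range 0 to 1 and blends them with a weight. The weight alpha, the helper names, and the input numbers are all assumptions made for illustration.

```python
def min_max(scores):
    """Rescale a dict of scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def blend_scores(popularity, similarity, alpha=0.5):
    """Weighted sum of normalized popularity and similarity per product.

    alpha=1.0 uses popularity only, alpha=0.0 uses similarity only.
    A product missing from one signal gets 0 for that signal.
    """
    pop_n = min_max(popularity) if popularity else {}
    sim_n = min_max(similarity) if similarity else {}
    products = set(pop_n) | set(sim_n)
    return {
        p: alpha * pop_n.get(p, 0.0) + (1 - alpha) * sim_n.get(p, 0.0)
        for p in products
    }

# Made-up inputs: summed quantities among similar users, and mean similarity
popularity = {'A': 5, 'B': 4, 'C': 1}
similarity = {'A': 0.6, 'B': 0.0, 'C': 0.5, 'D': 0.2, 'E': 0.0}

blended = blend_scores(popularity, similarity, alpha=0.5)
ranking = sorted(blended, key=blended.get, reverse=True)
print(ranking[:3])
```

A single weight makes the trade-off explicit and easy to tune. Note that min-max scaling maps the least popular purchased product to 0, the same as a product nobody in the group bought.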

The next version replaces one-hot encoding of the categorical data with an embedding model. It uses Word2Vec to embed both user and product data, which can capture more complex relationships between features than one-hot encoding can.

import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import Word2Vec

# 1. Sample data
users_df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'age': [25, 30, 35, 40, 45],
    'gender': ['M', 'F', 'M', 'F', 'M'],
    'location': ['NY', 'CA', 'TX', 'FL', 'WA']
})

products_df = pd.DataFrame({
    'product_id': ['A', 'B', 'C', 'D', 'E'],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Books', 'Clothing'],
    'description': ['Smartphone with high-res camera', 'Comfortable cotton T-shirt', 'Laptop with fast processor', 'Bestselling novel', 'Stylish denim jeans']
})

purchases_df = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 4, 5, 2, 3, 4, 5],
    'product_id': ['A', 'B', 'C', 'A', 'D', 'E', 'A', 'B', 'C', 'D'],
    'quantity': [1, 2, 1, 3, 1, 2, 1, 2, 3, 1]
})

# 2. Preprocess the data and build embeddings
def create_user_sentences(df):
    return df.apply(lambda row: [str(row['age']), row['gender'], row['location']], axis=1).tolist()

def create_product_sentences(df):
    return df.apply(lambda row: [row['category']] + row['description'].split(), axis=1).tolist()

user_sentences = create_user_sentences(users_df)
product_sentences = create_product_sentences(products_df)

# Train the Word2Vec model on the user and product token lists
embedding_size = 10
w2v_model = Word2Vec(sentences=user_sentences + product_sentences, vector_size=embedding_size, window=3, min_count=1, workers=4)

# Build user embeddings as the mean of each user's token vectors
user_embeddings = users_df.apply(lambda row: np.mean([w2v_model.wv[str(row['age'])], 
                                                      w2v_model.wv[row['gender']], 
                                                      w2v_model.wv[row['location']]], axis=0), axis=1)
user_embeddings = np.array(user_embeddings.tolist())

# Build product embeddings as the mean of each product's token vectors
product_embeddings = products_df.apply(lambda row: np.mean([w2v_model.wv[word] for word in [row['category']] + row['description'].split() if word in w2v_model.wv], axis=0), axis=1)
product_embeddings = np.array(product_embeddings.tolist())

# 3. Build the model
nn_model = NearestNeighbors(n_neighbors=3, metric='cosine')
nn_model.fit(user_embeddings)

# Compute the pairwise similarity between products
product_similarity = cosine_similarity(product_embeddings)

# 4. Hybrid recommendation function
def hybrid_recommend(new_user_features, user_interests, n=5):
    # Convert new_user_features to an embedding (tokens not in the vocabulary are skipped)
    new_user_embedding = np.mean([w2v_model.wv[feature] for feature in new_user_features if feature in w2v_model.wv], axis=0)
    
    # 1. Find similar users
    distances, indices = nn_model.kneighbors([new_user_embedding])
    similar_users = users_df.iloc[indices[0]]['user_id']
    
    # 2. Rank product popularity among the similar users
    popular_products = purchases_df[purchases_df['user_id'].isin(similar_users)]
    popular_products = popular_products.groupby('product_id')['quantity'].sum().sort_values(ascending=False)
    
    # 3. Apply content-based filtering using the user's chosen categories
    user_interest_index = products_df[products_df['category'].isin(user_interests)].index
    if len(user_interest_index) > 0:
        interest_sim_scores = product_similarity[user_interest_index].mean(axis=0)
    else:
        interest_sim_scores = np.ones(len(products_df))
    
    # 4. Combine the scores and rank the products
    final_scores = popular_products.to_dict()
    for idx, score in enumerate(interest_sim_scores):
        product_id = products_df.iloc[idx]['product_id']
        if product_id in final_scores:
            final_scores[product_id] *= (1 + score)
        else:
            final_scores[product_id] = score
    
    recommended_products = sorted(final_scores.items(), key=lambda x: x[1], reverse=True)[:n]
    return [product[0] for product in recommended_products]

# 5. Example usage
# Age 28, male, located in NY ('28' is not in the training vocabulary, so hybrid_recommend skips it)
new_user = ['28', 'M', 'NY']
user_interests = ['Electronics', 'Books']
recommendations = hybrid_recommend(new_user, user_interests)
print("Recommended products:", recommendations)

recommended_details = products_df[products_df['product_id'].isin(recommendations)]
print("\nRecommended product details:")
print(recommended_details)

The main changes in this code are:

1. We use Word2Vec to create embeddings for both users and products instead of one-hot encoding

2. The functions create_user_sentences and create_product_sentences prepare the data for Word2Vec

3. The Word2Vec model is trained on both the user data and the product data

4. User and product embeddings are computed as the mean of their word vectors

5. The hybrid_recommend function is updated to work with embeddings instead of one-hot encoded features
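
Averaging word vectors fails when none of a user's tokens are in the vocabulary, because the mean of an empty list is undefined. The sketch below shows one way to guard against that case. It uses a plain dictionary as a stand-in for w2v_model.wv, the vector values are made up, and mean_embedding is a hypothetical helper.

```python
import numpy as np

def mean_embedding(tokens, vectors):
    """Average the vectors of the known tokens.

    Returns None when no token is known, so the caller can fall back
    to another strategy such as the popularity ranking.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return np.mean(known, axis=0)

# Toy lookup table standing in for w2v_model.wv (values are made up)
toy_vectors = {
    'M':  np.array([1.0, 0.0]),
    'NY': np.array([0.0, 1.0]),
}

partly_known = mean_embedding(['28', 'M', 'NY'], toy_vectors)  # '28' is skipped
unknown = mean_embedding(['99', 'ZZ'], toy_vectors)            # nothing is known

print(partly_known)
print(unknown)
```

Returning None rather than a zero vector keeps the fallback decision with the caller, which matters here because cosine distance is not meaningful for an all-zero vector.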

Advantages of using an embedding model:

1. It can capture more complex relationships between features than one-hot encoding

2. It reduces the dimensionality of the data, which speeds up processing when there are many distinct categories

3. It can work with continuous attributes such as age, provided they are first mapped to tokens the model has seen (in the code above each age is its own token, so an unseen age such as 28 has no vector)

4. It can find similarities between features that never appeared together in the training data

However, an embedding model may need a large amount of training data to give good results, and its parameters may need further tuning to suit your data.
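
In the Word2Vec example, the new user's age '28' never appears in the training data, so it contributes nothing to the embedding. One possible workaround, assuming coarse age bands are acceptable, is to replace each age with a bucket token before training and before lookup. The bucket edges and label names below are arbitrary choices for illustration.

```python
import bisect

# Hypothetical bucket edges; tune these to your own user base
AGE_EDGES = [18, 25, 35, 45, 55, 65]
AGE_LABELS = ['age_under_18', 'age_18_24', 'age_25_34', 'age_35_44',
              'age_45_54', 'age_55_64', 'age_65_plus']

def age_token(age):
    """Map a numeric age to a coarse bucket token usable as a Word2Vec 'word'."""
    return AGE_LABELS[bisect.bisect_right(AGE_EDGES, age)]

def user_sentence(age, gender, location):
    """Build the token list for one user, with age replaced by its bucket."""
    return [age_token(age), gender, location]

# A training user aged 25 and a new user aged 28 now share the same age token
print(user_sentence(25, 'M', 'NY'))
print(user_sentence(28, 'M', 'NY'))
```

With this mapping, a function like create_user_sentences would emit bucket tokens instead of raw ages, so any new user's age resolves to a token the model was trained on, as long as that bucket occurred in the training data.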

Ref.

References related to recommending for new users (also known as the cold start problem):

1. Schein, A. I., Popescul, A., Ungar, L. H., & Pennock, D. M. (2002). Methods and metrics for cold-start recommendations. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 253–260).

This paper presents methods and metrics for addressing the cold start problem in recommender systems.

2. Lam, X. N., Vu, T., Le, T. D., & Duong, A. D. (2008). Addressing cold-start problem in recommendation systems. In Proceedings of the 2nd international conference on Ubiquitous information management and communication (pp. 208–211).

This paper presents an approach to the cold start problem that uses demographic data and website usage behavior.

3. Bobadilla, J., Ortega, F., Hernando, A., & Gutiérrez, A. (2013). Recommender systems survey. Knowledge-based systems, 46, 109–132.

This survey gives an overview of recommender systems, including approaches to the cold start problem.

4. Elahi, M., Ricci, F., & Rubens, N. (2016). A survey of active learning in collaborative filtering recommender systems. Computer Science Review, 20, 29–50.

This paper surveys the use of active learning to address cold start in collaborative filtering recommender systems.

5. Bernardi, L., Kamps, J., Kiseleva, J., & Mueller, M. J. (2015). The continuous cold start problem in e-commerce recommender systems. arXiv preprint arXiv:1508.01177.

This paper describes the continuous cold start problem in e-commerce recommender systems and proposes ways to address it.

6. Gope, J., & Jain, S. K. (2017). A survey on solving cold start problem in recommender systems. In 2017 International Conference on Computing, Communication and Automation (ICCCA) (pp. 133–138). IEEE.

This survey gives an overview of approaches to the cold start problem in recommender systems.

7. Safoury, L., & Salah, A. (2013). Exploiting user demographic attributes for solving cold-start problem in recommender system. Lecture Notes on Software Engineering, 1(3), 303–307.

This paper presents a way to use users' demographic attributes to address the cold start problem.

8. Rashid, A. M., Albert, I., Cosley, D., Lam, S. K., McNee, S. M., Konstan, J. A., & Riedl, J. (2002). Getting to know you: learning new user preferences in recommender systems. In Proceedings of the 7th international conference on Intelligent user interfaces (pp. 127–134).

This paper presents methods for learning the preferences of new users in recommender systems.

These references cover both the theory and the practical methods for handling recommendations for new users, and should be useful when developing and improving your own recommender system.

Tackling cold-start with deep personalized transfer of user preferences for cross-domain…

https://doi.org/10.1007/s41060-023-00467-9 Journal: International Journal of Data Science and Analytics, 2023…

