Building a Product Recommender System with Neural Collaborative Filtering (NCF)

In this article, we walk through building a recommender system with Neural Collaborative Filtering (NCF), a technique that combines deep learning with collaborative filtering.

fr4nk.xyz

Published in myorder, Oct 3, 2024

Step 1: Preparing the Data and Environment

Start by installing the required libraries:

!pip install torch
!pip install LibRecommender
!pip install tensorflow==2.11.0
!pip install datasets
!pip install surprise

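Then import the required libraries:

import random  # used below to generate user and product IDs

import numpy as np
import pandas as pd
from libreco.data import random_split, DatasetPure, DataInfo  # DataInfo is needed to reload the model in Step 5
from libreco.algorithms import NCF
from libreco.evaluation import evaluate
from datasets import load_dataset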

Step 2: Loading and Preparing the Data

We use the "Porameht/product-img-rating" dataset from Hugging Face:

ds = load_dataset("Porameht/product-img-rating")

Add user_id and product_id columns to the data. The IDs are generated at random:

def generate_user_id():
    return f"user_{random.randint(1, 10000):04d}"

def generate_product_id():
    return f"prod_{random.randint(1, 100000):06d}"

ds['train'] = ds['train'].add_column('user_id', [generate_user_id() for _ in range(len(ds['train']))])
ds['train'] = ds['train'].add_column('product_id', [generate_product_id() for _ in range(len(ds['train']))])
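Because the IDs are drawn at random, every run assigns users and products differently. If reproducibility matters, a seeded generator could be used instead. The following is a minimal sketch: the helper name make_ids and the seed value 42 are hypothetical and not part of the original notebook.

```python
import random

def make_ids(n_rows, seed=42):
    """Return reproducible (user_ids, product_ids) lists of length n_rows."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    user_ids = [f"user_{rng.randint(1, 10000):04d}" for _ in range(n_rows)]
    product_ids = [f"prod_{rng.randint(1, 100000):06d}" for _ in range(n_rows)]
    return user_ids, product_ids

users_a, products_a = make_ids(5)
users_b, products_b = make_ids(5)
# The same seed yields the same IDs on every call
print(users_a == users_b, products_a == products_b)
```

The resulting lists can be passed to add_column in the same way as the list comprehensions above.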

Convert the data to a DataFrame and prepare it for model training:

df = ds['train'].to_pandas()
ratings = df[['user_id', 'product_id', 'overall_rating']]
ratings.columns = ["user", "item", "label"]

products = df[['product_id', 'title']]
products.columns = ["productID", "Title"]

Step 3: Splitting the Data and Building the Datasets

Split the data into training, evaluation, and test sets:

training_set, evaluation_set, testing_set = random_split(ratings, multi_ratios=[0.8, 0.1, 0.1])

training_set, data_info = DatasetPure.build_trainset(training_set)
evaluation_set = DatasetPure.build_evalset(evaluation_set)
testing_set = DatasetPure.build_testset(testing_set)
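To make the 0.8 / 0.1 / 0.1 ratios concrete, here is a rough sketch of a ratio-based random split written with NumPy and pandas. It only illustrates the idea and is not how LibRecommender implements random_split; the function name split_by_ratios and the toy DataFrame are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def split_by_ratios(df, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle the rows of df and cut them into consecutive parts by ratio."""
    rng = np.random.default_rng(seed)
    shuffled = df.iloc[rng.permutation(len(df))].reset_index(drop=True)
    # Cumulative cut points, rounded to whole row counts
    cuts = np.round(np.cumsum(ratios)[:-1] * len(df)).astype(int)
    bounds = [0, *cuts, len(df)]
    return [shuffled.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

toy = pd.DataFrame({"user": range(100), "item": range(100), "label": [3.0] * 100})
train, val, test = split_by_ratios(toy)
print(len(train), len(val), len(test))
```

Every row ends up in exactly one part, so the evaluation and test sets contain only interactions the model has not been trained on.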

Step 4: Building and Training the NCF Model

Create and configure the NCF model:

ncf = NCF(
    task="rating",
    data_info=data_info,
    embed_size=16,
    n_epochs=10, #Example
    lr=1e-3,
    batch_size=256,
    num_neg=1,
    use_bn=False,
    dropout_rate=None,
    hidden_units=(128, 64, 32),
)
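To give a feel for what embed_size and hidden_units control, the sketch below runs a single forward pass of a NeuMF-style network in NumPy: an element-wise product branch (GMF) and an MLP branch whose outputs are concatenated and projected to one score. This follows the general NCF design and is not LibRecommender's implementation; the weights are random, biases are omitted, and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, embed_size = 5, 7, 16
hidden_units = (128, 64, 32)

# Separate embedding tables for the GMF branch and the MLP branch
user_gmf = rng.normal(0, 0.1, (n_users, embed_size))
item_gmf = rng.normal(0, 0.1, (n_items, embed_size))
user_mlp = rng.normal(0, 0.1, (n_users, embed_size))
item_mlp = rng.normal(0, 0.1, (n_items, embed_size))

# Randomly initialised MLP weights; a real model learns these during training
layer_sizes = (2 * embed_size, *hidden_units)
weights = [rng.normal(0, 0.1, (i, o)) for i, o in zip(layer_sizes[:-1], layer_sizes[1:])]
out_w = rng.normal(0, 0.1, embed_size + hidden_units[-1])

def ncf_score(user_idx, item_idx):
    """Forward pass for one (user, item) pair; returns an untrained score."""
    gmf = user_gmf[user_idx] * item_gmf[item_idx]  # element-wise product
    h = np.concatenate([user_mlp[user_idx], item_mlp[item_idx]])
    for w in weights:
        h = np.maximum(h @ w, 0.0)  # dense layer followed by ReLU
    return float(np.concatenate([gmf, h]) @ out_w)

score = ncf_score(0, 3)
print(score)
```

A larger embed_size gives each user and item a richer vector, while hidden_units sets the width and depth of the MLP branch.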

Train the model:

ncf.fit(
    training_set,
    neg_sampling=False,
    verbose=2,
    eval_data=evaluation_set,
    metrics=["rmse", "mae", "loss"],
)

Step 5: Saving and Loading the Model

Save the model and its associated data info:

save_path = "/content/drive/MyDrive/recom/model"
model_name = "product_ncf_model"

data_info.save(path=save_path, model_name=model_name)
ncf.save(path=save_path, model_name=model_name, manual=True, inference_only=True)

Load the model back for use:

loaded_data_info = DataInfo.load(save_path, model_name=model_name)
loaded_model = NCF.load(path=save_path, model_name=model_name, data_info=loaded_data_info, manual=True)

Step 6: Evaluating the Model

Evaluation is an essential step in developing a recommender system because it tells us how well the model actually performs. Here we use several metrics.

6.1 Evaluating with Basic Metrics

We use the evaluate function from LibRecommender to compute basic metrics such as RMSE (Root Mean Square Error) and MAE (Mean Absolute Error):

metrics = ["rmse", "mae", "loss"]

# Use the held-out test set here; the evaluation set was already used during training
test_result = evaluate(
    model=loaded_model,
    data=testing_set,
    metrics=metrics,
    k=10,
    neg_sampling=False,
)
print("Test results:", test_result)

The output reports RMSE, MAE, and loss, which indicate how accurately the model predicts ratings (lower is better).
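As a quick illustration of what these two numbers measure, the sketch below computes RMSE and MAE by hand on four made-up ratings. The ratings and the function names are invented for this example and do not come from the dataset.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error: penalises large errors more heavily."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mae(actual, predicted):
    """Mean absolute error: the average size of the error, in rating points."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(actual - predicted)))

actual = [5, 3, 4, 1]
predicted = [4, 3, 5, 3]
rmse_value = rmse(actual, predicted)  # errors 1, 0, -1, -2 -> sqrt(6 / 4)
mae_value = mae(actual, predicted)    # (1 + 0 + 1 + 2) / 4
print(rmse_value, mae_value)
```

The single two-point miss pulls RMSE above MAE, which is why the two metrics are usually reported together.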

6.2 Evaluating with Advanced Metrics

In addition, we define a RecommenderMetrics class that provides several more advanced evaluation metrics:

import itertools

from surprise import accuracy
from collections import defaultdict

class RecommenderMetrics:

    def MAE(predictions):
        return accuracy.mae(predictions, verbose=False)

    def RMSE(predictions):
        return accuracy.rmse(predictions, verbose=False)

    def GetTopN(predictions, n=10, minimumRating=4.0):
        topN = defaultdict(list)

        # IDs in this article are strings such as "user_0001", so they are used as-is (no int() cast)
        for userID, movieID, actualRating, estimatedRating, _ in predictions:
            if estimatedRating >= minimumRating:
                topN[userID].append((movieID, estimatedRating))

        # Keep only the n highest estimated ratings per user
        for userID, ratings in topN.items():
            ratings.sort(key=lambda x: x[1], reverse=True)
            topN[userID] = ratings[:n]

        return topN

    def HitRate(topNPredicted, leftOutPredictions):
        hits = 0
        total = 0

        # For each left-out rating
        for leftOut in leftOutPredictions:
            userID = leftOut[0]
            leftOutMovieID = leftOut[1]
            # Is it in the predicted top N for this user? (IDs are compared as-is)
            hit = False
            for movieID, predictedRating in topNPredicted.get(userID, []):
                if leftOutMovieID == movieID:
                    hit = True
                    break
            if hit:
                hits += 1

            total += 1

        # Fraction of left-out ratings that appear in the user's top-N list
        return hits / total

    def CumulativeHitRate(topNPredicted, leftOutPredictions, ratingCutoff=0):
        hits = 0
        total = 0

        # For each left-out rating
        for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
            # Only look at ability to recommend things the users actually liked...
            if actualRating >= ratingCutoff:
                # Is it in the predicted top N for this user? (IDs are compared as-is)
                hit = False
                for movieID, predictedRating in topNPredicted.get(userID, []):
                    if leftOutMovieID == movieID:
                        hit = True
                        break
                if hit:
                    hits += 1

                total += 1

        # Hit rate over the left-out ratings at or above the cutoff
        return hits / total

    def RatingHitRate(topNPredicted, leftOutPredictions):
        hits = defaultdict(float)
        total = defaultdict(float)

        # For each left-out rating
        for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
            # Is it in the predicted top N for this user? (IDs are compared as-is)
            hit = False
            for movieID, predictedRating in topNPredicted.get(userID, []):
                if leftOutMovieID == movieID:
                    hit = True
                    break
            if hit:
                hits[actualRating] += 1

            total[actualRating] += 1

        # Print the hit rate separately for each actual rating value
        for rating in sorted(hits.keys()):
            print(rating, hits[rating] / total[rating])

    def AverageReciprocalHitRank(topNPredicted, leftOutPredictions):
        summation = 0
        total = 0
        # For each left-out rating
        for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
            # Is it in the predicted top N for this user, and at which position?
            hitRank = 0
            rank = 0
            for movieID, predictedRating in topNPredicted.get(userID, []):
                rank = rank + 1
                if leftOutMovieID == movieID:
                    hitRank = rank
                    break
            if hitRank > 0:
                # A hit at rank 1 counts fully, rank 2 counts 1/2, and so on
                summation += 1.0 / hitRank

            total += 1

        return summation / total

    # What percentage of users have at least one "good" recommendation
    def UserCoverage(topNPredicted, numUsers, ratingThreshold=0):
        hits = 0
        for userID in topNPredicted.keys():
            hit = False
            for movieID, predictedRating in topNPredicted[userID]:
                if (predictedRating >= ratingThreshold):
                    hit = True
                    break
            if (hit):
                hits += 1

        return hits / numUsers

    def Diversity(topNPredicted, simsAlgo):
        n = 0
        total = 0
        simsMatrix = simsAlgo.compute_similarities()
        for userID in topNPredicted.keys():
            pairs = itertools.combinations(topNPredicted[userID], 2)
            for pair in pairs:
                movie1 = pair[0][0]
                movie2 = pair[1][0]
                innerID1 = simsAlgo.trainset.to_inner_iid(str(movie1))
                innerID2 = simsAlgo.trainset.to_inner_iid(str(movie2))
                similarity = simsMatrix[innerID1][innerID2]
                total += similarity
                n += 1

        S = total / n
        return (1-S)

    def Novelty(topNPredicted, rankings):
        n = 0
        total = 0
        for userID in topNPredicted.keys():
            for rating in topNPredicted[userID]:
                movieID = rating[0]
                rank = rankings[movieID]
                total += rank
                n += 1
        return total / n

These metrics help us understand the recommender's performance along several dimensions:

Hit Rate: how often the system recommends an item the user is actually interested in
Cumulative Hit Rate: like Hit Rate, but counts only items whose actual rating is at or above a given cutoff
Rating Hit Rate: the hit rate broken down by rating value
Average Reciprocal Hit Rank: how high in the recommendation list the items the user is interested in appear
User Coverage: the share of users who receive at least one recommendation at or above a quality threshold
Diversity: how varied the recommended items are
Novelty: how novel (unfamiliar) the recommended items are
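The difference between Hit Rate and Average Reciprocal Hit Rank is easiest to see on a tiny example. The sketch below re-implements both in a few lines and applies them to two made-up users. The helper functions and the data are hypothetical and independent of the RecommenderMetrics class.

```python
def hit_rate(top_n, left_out):
    """Share of left-out (user, item) pairs found in that user's top-N list."""
    hits = 0
    for user, item in left_out:
        ranked = [i for i, _ in top_n.get(user, [])]
        if item in ranked:
            hits += 1
    return hits / len(left_out)

def average_reciprocal_hit_rank(top_n, left_out):
    """Like hit rate, but a hit at position r only counts 1/r."""
    total = 0.0
    for user, item in left_out:
        ranked = [i for i, _ in top_n.get(user, [])]
        if item in ranked:
            total += 1.0 / (ranked.index(item) + 1)
    return total / len(left_out)

top_n = {
    "user_0001": [("prod_000010", 4.8), ("prod_000020", 4.5), ("prod_000030", 4.1)],
    "user_0002": [("prod_000040", 4.9), ("prod_000050", 4.2)],
}
left_out = [("user_0001", "prod_000020"), ("user_0002", "prod_000099")]

hr = hit_rate(top_n, left_out)                       # one hit out of two -> 0.5
arhr = average_reciprocal_hit_rank(top_n, left_out)  # hit at rank 2 -> (1/2) / 2 = 0.25
print(hr, arhr)
```

Hit Rate only asks whether the item appears in the list; ARHR also rewards placing it near the top.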

6.3 Using the Advanced Metrics

To use these metrics, we need a set of test predictions and then call GetTopN to build the top-N recommendation lists:

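# testing_set is a LibRecommender dataset object, not a list of rows, so rebuild the raw test rows.
# Assumption: random_split uses a fixed default seed, so this reproduces the split from Step 3.
_, _, test_df = random_split(ratings, multi_ratios=[0.8, 0.1, 0.1])

# predict returns only the estimated ratings; wrap them in tuples of
# (user, item, actual rating, estimated rating, details), the shape RecommenderMetrics expects
estimated = loaded_model.predict(user=test_df["user"].tolist(), item=test_df["item"].tolist())
predictions = list(zip(test_df["user"], test_df["item"], test_df["label"], estimated, [None] * len(test_df)))

topN = RecommenderMetrics.GetTopN(predictions, n=10, minimumRating=4.0)

hitRate = RecommenderMetrics.HitRate(topN, predictions)
print(f"Hit Rate: {hitRate}")

cumHitRate = RecommenderMetrics.CumulativeHitRate(topN, predictions, ratingCutoff=4.0)
print(f"Cumulative Hit Rate: {cumHitRate}")

arhr = RecommenderMetrics.AverageReciprocalHitRank(topN, predictions)
print(f"Average Reciprocal Hit Rank: {arhr}")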
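UserCoverage and Novelty need two extra inputs that the code above does not build: the number of users and a rankings mapping. The sketch below shows one way to derive them from a ratings table, assuming that rankings means popularity rank (1 for the most-rated item). That interpretation, the helper name popularity_rankings, and the toy data are assumptions made for this example.

```python
import pandas as pd

def popularity_rankings(ratings):
    """Map each item to its popularity rank (1 = most rated)."""
    counts = ratings["item"].value_counts()  # sorted from most to least rated
    return {item: rank for rank, item in enumerate(counts.index, start=1)}

toy = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u1", "u2", "u1"],
    "item": ["a", "a", "a", "b", "b", "c"],
    "label": [5, 4, 5, 3, 4, 2],
})

rankings = popularity_rankings(toy)
num_users = toy["user"].nunique()
print(rankings, num_users)
```

With these in hand, the remaining calls would look like RecommenderMetrics.UserCoverage(topN, num_users, ratingThreshold=4.0) and RecommenderMetrics.Novelty(topN, rankings). A higher Novelty value means the lists lean toward less popular items.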

Step 7: Using the Model to Predict and Recommend

Predict the rating for a specific user and item:

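# IDs in this dataset are strings such as "user_0001" and "prod_000001", so take a known pair from the ratings table
user_id = ratings["user"].iloc[0]
item_id = ratings["item"].iloc[0]
prediction = loaded_model.predict(user=user_id, item=item_id)
print(f"Prediction for user {user_id} and item {item_id}: {prediction}")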

Recommend items for a user:

user = "user_2020"
recs = loaded_model.recommend_user(user=user, n_rec=10)
for rec in recs[user]:
    print(products[products.productID == rec].Title)
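Filtering the products DataFrame once per recommendation prints a whole pandas Series each time. A lookup dictionary is a lighter alternative. The sketch below uses a made-up products table and made-up recommended IDs; the name title_by_id and the placeholder string are hypothetical.

```python
import pandas as pd

products = pd.DataFrame({
    "productID": ["prod_000001", "prod_000002", "prod_000003"],
    "Title": ["Notebook", "Mug", "Backpack"],
})

# Build the ID -> title mapping once; duplicate IDs keep their first title
title_by_id = (
    products.drop_duplicates("productID")
    .set_index("productID")["Title"]
    .to_dict()
)

recommended_ids = ["prod_000002", "prod_000999"]  # stand-in for recs[user]
titles = [title_by_id.get(pid, "<unknown product>") for pid in recommended_ids]
print(titles)
```

Using .get with a fallback keeps the loop from failing if a recommended ID is missing from the products table.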

Summary

Evaluation is an essential step in developing a recommender system: it shows us the model's strengths and weaknesses. In this article, we learned how to build a recommender system with Neural Collaborative Filtering (NCF) and how to evaluate it with a range of metrics, from basic ones such as RMSE and MAE to more advanced ones that measure the accuracy, coverage, diversity, and novelty of the recommendations.

Using several metrics together gives a fuller picture of the recommender's performance and helps us tune the model to better serve both users and the business.

Ref.

NCF - LibRecommender 1.5.1 documentation, librecommender.readthedocs.io