Stories by Muhamad Salim Alwan on Medium

Credit Card Customer Segmentation: Unsupervised Machine Learning K-Means Clustering

Muhamad Salim Alwan — Sat, 23 Aug 2025 17:20:29 GMT

Credit cards are typically issued by banks or financial services companies, allowing cardholders to borrow funds to pay for goods and services at merchants that accept card payments. However, credit cards require cardholders to repay the borrowed funds in the future, plus applicable interest and agreed-upon additional fees, either in full on the billing date or over time.

Based on this, we will cluster credit card customers by their behavior. We will use a dataset provided on Kaggle titled “Credit Card Data- Intermediate Dataset”. The dataset provides insight into the usage patterns of credit card users within a bank. With detailed information on credit card transactions and customer behavior, it enables researchers and analysts to uncover meaningful segments and trends. This time, however, we will focus only on customer segmentation. Before we dive into the data, it would be a good idea to check out the dataset here.

We will divide our work into the following steps:
1. Data preprocessing: handling missing values and exploratory data analysis
2. Filtering and normalizing the data
3. Finding the optimal number of clusters
4. Implementing K-Means Clustering
5. Analyzing segments
6. Adding the cluster results to the data

Awesome, let’s import the required libraries and get to know the dataset first by looking at the features it has.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings('ignore')
sns.set_palette('pastel', 5)

df = pd.read_csv('Customer_Data.csv')
df.head()

- BALANCE: The balance on the customer's credit card account.
- BALANCE_FREQUENCY: How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated).
- PURCHASES: Amount of purchases made from the account. This reflects the customer's purchase history and buying patterns.
- ONEOFF_PURCHASES: Maximum purchase amount done in one-go.
- INSTALLMENTS_PURCHASES: Amount of purchase done in installment.
- CASH_ADVANCE: Cash in advance given by the user.
- PURCHASES_FREQUENCY: How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased).
- ONEOFF_PURCHASES_FREQUENCY: A column that shows how often customers buy things that are not part of a regular plan. It can help businesses group customers based on their buying habits and offer them suitable products. There are different ways to do this using machine learning and data analysis.
- PURCHASES_INSTALLMENTS_FREQUENCY: A column that measures how often customers make purchases in installments, as a percentage of months with at least one installment purchase in the last 12 months. It can help businesses understand the payment preferences and behaviors of different customer segments.
- CASH_ADVANCE_FREQUENCY: A column that indicates how often customers take cash in advance from their credit card, as a fraction of months with at least one cash advance transaction in the last 12 months. It can help businesses understand the borrowing and repayment patterns of different customer segments.
- CASH_ADVANCE_TRX: A column that counts the number of times customers take cash in advance from their credit card in the last 12 months. It can help businesses identify the customers who rely on cash advances more often.
- PURCHASES_TRX: A column that counts the number of purchase transactions made by customers in the last 12 months. It can help businesses understand the purchase frequency and volume of different customer segments.
- CREDIT_LIMIT: A column that shows the maximum amount of money that customers can borrow from their credit card. It can help businesses understand the spending power and credit risk of different customer segments.
- PAYMENTS: A column that shows the total amount of money that customers paid to their credit card in the last 12 months. It can help businesses measure the repayment ability and creditworthiness of different customer segments.
- MINIMUM_PAYMENTS: A column that shows the minimum amount of money that customers have to pay to their credit card in the last 12 months. It can help businesses evaluate the repayment ability and creditworthiness of different customer segments.
- PRC_FULL_PAYMENT: A column that shows the percentage of months with full payment of the due statement balance in the last 12 months. It can help businesses assess the repayment ability and creditworthiness of different customer segments.
- TENURE: A column that shows how long customers have been using their credit card, in months.

Oh no! We found missing values in the columns ‘CREDIT_LIMIT’ and ‘MINIMUM_PAYMENTS’, so we need to handle them first. For ‘MINIMUM_PAYMENTS’, we’ll fill the missing values with zeros. For ‘CREDIT_LIMIT’, we’ll delete the entire row. Why don’t we fill it with zeros as well? Because a credit limit must have a certain minimum value and cannot be zero. We also won’t fill it with the mean, median, or mode, since an imputed limit might not represent that customer. Furthermore, since we’re only deleting one row, it won’t be a significant issue.

df['MINIMUM_PAYMENTS'] = df['MINIMUM_PAYMENTS'].fillna(0)
df.dropna(inplace=True)
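
As a minimal sketch of how the missing values can be spotted and handled, the snippet below uses a tiny made-up DataFrame (the values and the name `toy` are illustrative assumptions, not the real dataset). It counts missing values per column, fills one column with zeros, and drops the rows that are still incomplete.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the credit card data (values are made up for illustration)
toy = pd.DataFrame({
    "CREDIT_LIMIT": [1000.0, np.nan, 2500.0, 4000.0],
    "MINIMUM_PAYMENTS": [50.0, 20.0, np.nan, np.nan],
})

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Fill MINIMUM_PAYMENTS with zeros, then drop rows that still lack a CREDIT_LIMIT
toy["MINIMUM_PAYMENTS"] = toy["MINIMUM_PAYMENTS"].fillna(0)
cleaned = toy.dropna(subset=["CREDIT_LIMIT"])
print(cleaned)
```

Passing `subset` to `dropna` makes it explicit which column decides whether a row is removed.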

Voilà, we have successfully preprocessed the data. Next, we will look at the correlation between features.

From the heatmap above, the pair of columns with the highest correlation is ‘PURCHASES’ and ‘ONEOFF_PURCHASES’, with a value of 0.92. It should be noted that correlation is not causation: the columns are related, but that doesn’t mean one causes the other.
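
A correlation matrix like the one behind the heatmap can be sketched as follows. The data here is synthetic (an assumption for illustration), built so that two columns are strongly related; with the real data you would call `df.corr()` on the numeric columns and pass the result to something like `sns.heatmap(corr, annot=True)`.

```python
import numpy as np
import pandas as pd

# Synthetic data: ONEOFF_PURCHASES is built to track PURCHASES closely
rng = np.random.default_rng(0)
purchases = rng.gamma(2.0, 500.0, size=200)
toy = pd.DataFrame({
    "PURCHASES": purchases,
    "ONEOFF_PURCHASES": purchases * 0.6 + rng.normal(0, 100, size=200),
    "TENURE": rng.integers(6, 13, size=200),
})

# Pearson correlation between every pair of columns
corr = toy.corr()

# Find the strongest off-diagonal pair
off_diag = corr.abs().to_numpy().copy()
np.fill_diagonal(off_diag, 0.0)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
strongest = (corr.index[i], corr.columns[j])
print(strongest, round(corr.iloc[i, j], 2))
```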

Perfect, now we will continue with filtering and normalizing the data. After that, we should also find the optimal number of clusters before implementing K-Means Clustering.

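# Interquartile range (IQR) of PURCHASES
Q1 = df['PURCHASES'].quantile(0.25)
Q3 = df['PURCHASES'].quantile(0.75)
IQR = Q3 - Q1

# Keep rows within 0.75 * IQR of the quartiles (tighter than the common 1.5 * IQR rule)
lower_bound = Q1 - (0.75 * IQR)
upper_bound = Q3 + (0.75 * IQR)

filtered_df = df[(df['PURCHASES'] >= lower_bound) &
                 (df['PURCHASES'] <= upper_bound)]

# Drop the ID column from the filtered data (not from the original df), then scale
filtered_df = filtered_df.drop(columns='CUST_ID')
scaled_df = StandardScaler().fit_transform(filtered_df)

# Project the scaled features onto two principal components
pca = PCA(n_components=2)
pca_components = pca.fit_transform(scaled_df)
pca_df = pd.DataFrame(pca_components, columns=['PCA1', 'PCA2'])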

The filtering aims to reduce outliers in the data. StandardScaler() is needed to put features with different ranges on the same scale: each feature is transformed to have a mean of 0 and a standard deviation of 1. PCA is used to reduce the dimensionality of the features. The reduced dataset is, in some ways, ‘good enough’ to encode the most important relationships between points: even though the features are reduced to just two principal components, the overall relationships between data points are largely preserved.
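
To make the effect of scaling and PCA concrete, here is a small sketch on random data (the array `X` and its scales are assumptions for illustration). It checks that every scaled column has mean 0 and standard deviation 1, and reports how much variance the two components retain, which is a useful check before trusting a 2D projection.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical data: 300 customers, 6 features on very different scales
X = rng.normal(size=(300, 6)) * np.array([1, 10, 100, 1000, 5, 50])

# After scaling, each column has mean 0 and standard deviation 1
scaled = StandardScaler().fit_transform(X)
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)

# Reduce to two components and check how much variance they keep
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
retained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by 2 components: {retained:.1%}")
```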

To find the optimal number of clusters, we use the Elbow Method, a popular technique for K-Means clustering. The best number of clusters is where the inertia curve starts to flatten, forming an elbow. The graph shows that once the number of clusters passes 3, the inertia decreases only gradually. Therefore, the optimal number of clusters is 3.
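
The elbow curve can be sketched like this. The data comes from `make_blobs` with three centers, which is an assumption standing in for the scaled customer features. The loop fits K-Means for several values of k and records the inertia each time.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled customer features (3 true groups)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Fit K-Means for k = 1..8 and record the inertia (within-cluster sum of squares)
k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))

# To draw the curve: plt.plot(k_values, inertias, marker="o")
```

The elbow is the k after which each extra cluster reduces the inertia only slightly.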

Finally, we get the results after applying K-Means Clustering. Here, we have three clusters: 0, 1, and 2. Each cluster behaves differently based on its features. We can analyze them as follows:
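
A minimal sketch of fitting K-Means and profiling the clusters is shown below. The three customer groups are made up for illustration, and the column name `CLUSTER` is an assumption. The idea is to label each row, then compare the per-cluster averages, which is the kind of table the analysis that follows is read from.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical customers drawn from three made-up groups
toy = pd.DataFrame({
    "BALANCE": np.concatenate([rng.normal(500, 50, 50),
                               rng.normal(5000, 300, 50),
                               rng.normal(2000, 150, 50)]),
    "PURCHASES": np.concatenate([rng.normal(300, 40, 50),
                                 rng.normal(200, 40, 50),
                                 rng.normal(4000, 300, 50)]),
})

# Scale, fit K-Means with three clusters, and attach the labels
scaled = StandardScaler().fit_transform(toy)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
toy["CLUSTER"] = kmeans.fit_predict(scaled)

# Average feature values per cluster, used to describe each segment
profile = toy.groupby("CLUSTER").mean().round(1)
print(profile)
```

The cluster numbers themselves are arbitrary; only the profiles carry meaning.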

# CLUSTER 0

Have a small balance
Balance is not updated very often
Purchases are quite low
One-off purchases are very small
Installment purchases are quite low
Cash advance amount is very low
Purchase frequency is quite low
One-off purchase frequency is low
Installment purchase frequency is low
Cash advance withdrawals are very rare
Number of cash advance transactions in the last 12 months is low
Number of purchase transactions in the last 12 months is low
Have a small credit limit
Credit card payments in the last 12 months are very low
Minimum credit card payment amount is quite low
Full payment percentage of balances due in the last 12 months is quite high
Customers are relatively new to using their credit cards

“It could be said that this segment contains fairly passive customers. This could be because their financial capacity is not very high, so they cannot bear an excessively high credit burden. Companies can promote other products/services such as savings, deposits, and so on that are more suited to this customer profile.”

# CLUSTER 1

Very large debt balance
Balance is updated frequently
Very low purchases
Very low one-off purchases
Very low installment purchases
Very high cash advance amount
Very low purchase frequency
Low one-off purchase frequency
Very low installment purchase frequency
Frequent cash advances from credit cards
Very high number of cash advance transactions in the last 12 months
Very few purchase transactions in the last 12 months
Fairly high credit limit
Quite large amount paid to credit card
Very high minimum payment
Very low percentage of full payments due
Relatively new to using a credit card

“This segment has relatively high financial capacity, but rarely uses it. Their financial responsibilities can be quite high. This cluster is not very active in making purchases, but their single payments can be quite large, perhaps for luxury items. Furthermore, this cluster tends not to wait for their bills to come due. Companies can offer promotions or discounts for each purchase or installment to encourage more purchases by this segment.”

# CLUSTER 2

Quite large balance
Balance is updated frequently
Very high purchases
Very high one-off purchases
Very high installment purchases
Low cash advance amount
Very high purchase frequency
Very high one-time purchase frequency
Very high installment purchase frequency
Very rare cash advance withdrawals from credit card
Small number of cash advance withdrawals in the last 12 months
Very frequent purchase transactions in the last 12 months
High credit limit
Very large amount paid to credit card
Fairly high minimum payment
Very high percentage of full payment of balance due
Long-term credit card users

“This segment is highly consumptive. And they are loyal customers. However, companies need to be careful about their very high past-due bills due to their high purchasing/installment patterns. To reward their loyalty, companies can offer them certain discounts.”

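# Assumes filtered_df already holds the cluster label column added after fitting K-Means;
# the inner join keeps only the customers that survived the outlier filter
final_df = df[['CUST_ID']].join(filtered_df, how='inner')
final_df.to_csv("Customers Clusters.csv", index=False)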

Our final step will be to input the clustering results into a dataframe and save it in a file named “Customers Clusters.csv”. This file can be used in the future to practice supervised machine learning, such as prediction and classification tasks.
To see the complete documentation, you can visit my profiles on GitHub and Kaggle. Thank you for reading until the end. Please comment and like if you enjoyed.
~Have a nice coding day!

Extrovert vs. Introvert Personality: A Machine Learning Analysis

Muhamad Salim Alwan — Wed, 11 Jun 2025 07:47:49 GMT

Photo by Tingey Injury Law Firm on Unsplash

Introverts are known as people who enjoy solitude and find their energy when they are alone, while extroverts can be recognized by their social interactions and the ease with which they mix with many people. These two contrasting personality types were first introduced in 1910 by Carl Gustav Jung.

The term introvert is often used to describe a personality type characterized by a preference for quiet environments, introspection, and solitary activities. These solitary activities can include reading, creating, writing, and other things done alone. For introverts, social engagement can often leave them feeling that their social battery is “drained”, and they need time alone to “recharge” it. In addition, introverts tend to be reflective, can be good listeners, and prefer to be left alone with their activities.

Extroverts differ from introverts: they are usually more outgoing and gain energy from being around other people, starting conversations, and taking part in many group activities. Extroverts have high energy in social settings. They benefit from the involvement of others, unlike introverts, who need time alone to recharge after social engagement.

So, from this brief explanation, as a (wannabe) data scientist, can we analyze someone's personality through their activities? And can an introvert not have traits like an extrovert, or vice versa?

To answer these questions, we need to explore Kaggle to find a suitable dataset. You can also view it here. The dataset contains several indicators for analyzing whether someone is an introvert or an extrovert, including:

How much time is spent alone? (0–11 hours)
Do they often experience stage fright? (Yes/No)
How often do they attend social events? (0–10)
How often do they go outside? (0–7)
Do they feel drained after socializing? (Yes/No)
Number of close friends? (0–15)
How often do they post on social media? (0–10)

At a glance, the dataset has a number of missing values, marked as ‘NaN’ (Not a Number). In addition, several features/columns are still of string type, such as Stage_fear, Drained_after_socializing, and Personality. Therefore, we will drop the rows that contain incomplete data and then convert all string/text columns to numbers. This is done to avoid errors and improve the model's accuracy.

from sklearn.preprocessing import LabelEncoder

# Drop rows that contain NaN values
df.dropna(inplace=True)

# Encode each text column as integers (classes are numbered in sorted order)
text_columns = ["Stage_fear", "Drained_after_socializing", "Personality"]
for col in text_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])

Next, we will split the data into 80% training data and 20% test data. The Personality column will be our target, or variable y, while the remaining columns will be variable X because they are assumed to influence the target.

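from sklearn.model_selection import train_test_split

# Features (X) are every column except the target; the target (y) is Personality
X = df.drop("Personality", axis=1)
y = df["Personality"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)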

The model we will use this time is SVM (Support Vector Machine), more precisely the SVM class for classification called SVC, because we are performing a classification task.

Classification is the process of assigning data points to groups based on their similarities and differences. Our dataset is well suited to classifying whether someone is an introvert or an extrovert. Below is an example of the difference before and after classification. Before classification, there are still data points that do not match the color/group around them. In the picture on the right, after classification, each group, namely 0 (blue) for extrovert and 1 (red) for introvert, is separated into its own cluster.

After testing, the SVM model shows quite satisfying results, with an accuracy score of 0.93. This means 93% of the test data was predicted correctly by the model. As seen in the confusion matrix, almost all of the data is predicted correctly; the correctly predicted data is marked in blue.

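from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scale the features, then fit a linear-kernel support vector classifier
SVM_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', SVC(kernel='linear'))
])

SVM_pipeline.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = SVM_pipeline.predict(X_test)

print("Accuracy Score: ", accuracy_score(y_test, y_pred))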
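
The confusion matrix mentioned above can be computed with `confusion_matrix` from scikit-learn. The sketch below uses synthetic data from `make_classification` (an assumption, not the personality dataset) and a pipeline shaped like the one above. It also shows how accuracy relates to the matrix: it is the sum of the diagonal divided by the total.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data with 7 features
X, y = make_classification(n_samples=400, n_features=7, n_informative=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([("scaler", StandardScaler()), ("model", SVC(kernel="linear"))])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print(cm)
print("Accuracy from the diagonal:", cm.trace() / cm.sum())
```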

We also need to pay attention to the correlation between each feature and the target, to see how much each feature contributes to the model. We can see this in the correlation chart below, where a value closer to 1, or a darker color, means a stronger positive correlation, and vice versa. Features such as Stage_fear (stage fright) and Drained_after_socializing (feeling tired after socializing) have the highest correlation with Personality, at about 0.85 each. So we can assume that these two features are strongly correlated with whether a person is an extrovert or an introvert.
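
One way to rank features by their correlation with the target is sketched below. The data is synthetic (an assumption): `Stage_fear` is built to agree with `Personality` about 90% of the time, while `Post_frequency` is random noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
personality = rng.integers(0, 2, size=300)  # 0 = extrovert, 1 = introvert
flip = rng.random(300) < 0.1                # about 10% of rows disagree with the target

toy = pd.DataFrame({
    "Stage_fear": np.where(flip, 1 - personality, personality),
    "Post_frequency": rng.integers(0, 11, size=300),
    "Personality": personality,
})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["Personality"]
    .drop("Personality")
    .sort_values(ascending=False)
)
print(target_corr)
```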

Lastly, we will feed new data into the model and look at the prediction. Before that, though, I'm a little curious about what the average profile of introverts and extroverts looks like. Why? Because I think it can serve as an initial picture, or as a simple validation later on. Remember, after encoding, 0 refers to extrovert and 1 refers to introvert. You can see the details below.
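
The average profile per personality type can be obtained with a `groupby`. The numbers below are made up for illustration; with the real data the same call would run on the encoded DataFrame.

```python
import pandas as pd

# Made-up rows: 0 = extrovert, 1 = introvert
toy = pd.DataFrame({
    "Time_spent_Alone": [2, 1, 3, 8, 9, 7],
    "Friends_circle_size": [12, 10, 14, 3, 2, 4],
    "Personality": [0, 0, 0, 1, 1, 1],
})

# Mean of each feature within each personality class
averages = toy.groupby("Personality").mean()
print(averages)
```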

Okay, now it's time to try the model we built by testing it with new data. To make it fun, let's assume the new data is based on the habits of someone named Joni. Joni often spends at least 8 hours alone. He never feels afraid on stage. He has attended 5 social events and gone outside 4 times. He often feels tired after socializing. His circle of friends is only about 10 people, and he posts on social media with a frequency of just 3. So, what is Joni's personality type? Let's check!

def predict_extrovert_or_introvert(input_dict):
    features = ['Time_spent_Alone', 'Stage_fear', 'Social_event_attendance',
                     'Going_outside', 'Drained_after_socializing', 'Friends_circle_size', 'Post_frequency']
    
    # Create a list input according to the feature order and make prediction
    input_list = [input_dict[feature] for feature in features]
    prediction = SVM_pipeline.predict([input_list])[0]
    return 'Extrovert' if prediction == 0 else 'Introvert'

# Input new data
data = {
    'Time_spent_Alone': 8,
    'Stage_fear': 0,
    'Social_event_attendance': 5,
    'Going_outside': 4,
    'Drained_after_socializing': 1,
    'Friends_circle_size': 10,
    'Post_frequency': 3
}

# Predict
result = predict_extrovert_or_introvert(data)
print(f"The Personality is: {result}")

It turns out that Joni's personality is Introvert. Even though Joni does not feel afraid on stage and has quite a lot of friends, he is still classified as an introvert. Why is that? If you know the answer, let's discuss it in the comments!

And don't forget to check out the complete code here.

References:

https://www.choosingtherapy.com/introvert-vs-extrovert/

Introvert vs. Extrovert Personality: What's The Difference?

Stock Price Prediction Using Python

Muhamad Salim Alwan — Sat, 07 Jun 2025 15:33:01 GMT

Photo by Maxim Hopman on Unsplash

Hi everyone, how are you? I hope you're always in good health. I'm writing this exactly one day before Eid al-Adha 1446 H. May it move us in a positive direction, and for those performing the Hajj, may it be an accepted (mabrur) Hajj. Aamiin.

Anyway, this time I want to share a bit about my practice project, which is (drum roll) stock price prediction. Yep, but this time I wanted to try something a little different: I'm not using a dataset from Kaggle, I'm using data from the YFinance library. By the way, if you haven't installed the library yet, you can install it first and look at the documentation here before continuing.

What Is the YFinance Library?

It is a library that uses Yahoo's publicly available API and is intended for research and educational purposes. It contains the financial information found on Yahoo Finance, covering the stock market, bonds, currencies, and crypto. In YFinance you can also find domestic (Indonesian) stock issuers, for example PT Telkom Indonesia (Persero) Tbk with the code TLKM, but on YFinance and Yahoo Finance we have to add the .JK suffix to refer to Jakarta, Indonesia. Okay, let's dive right in, and read through to the end!

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import yfinance as yf
import warnings
warnings.filterwarnings("ignore")

ticker = yf.Ticker("TLKM.JK")

try:
    # Attempt to fetch historical data
    data = ticker.history(start='2020-01-01', end='2024-12-31')
except Exception as e:
    print("An error occurred while fetching data:", e)

Here we import some basic Python libraries and load the historical price data for TLKM.JK from 2020 to 2024 (5 years). Before going further, you can also find the source code on my GitHub profile here. Each time you load data there are several features, or columns, available, but here I focus on the OHLCV features (Open, High, Low, Close, and Volume), which are useful for technical analysis as well as for prediction later. Beyond those, I don't use features like dividends and stock splits because they have too many null values and I don't think they would significantly affect the prediction. However, I did add several features related to basic technical analysis of stocks, such as Moving Average, RSI, and MACD.

Moving Average, RSI, and MACD

One simple indicator that can help us analyze stock movements is the Moving Average. This indicator describes the average stock price over time. As in the picture above, there are several MA (Moving Average) indicators drawn as lines on the candlestick chart, namely MA 5, 10, 20, and 50, each referring to the number of days over which the price is averaged. Naturally, each gives a slightly different picture. The 50-day MA is used for longer-term price analysis because it gives the average price over each 50 days, or about 2.5 stock market months (one month being roughly 20 trading days). In contrast, MA 5, 10, and 20 are better suited to short- to medium-term analysis, for example daily or weekly trading.
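
A simple moving average can be computed with the pandas `rolling` method. The sketch below uses a short made-up price series (an assumption); on the real data the same call would be applied to the `Close` column with windows of 5, 10, 20, and 50.

```python
import pandas as pd

# Made-up closing prices
close = pd.Series([100, 102, 104, 106, 108, 110, 112], dtype=float)

# 5-day simple moving average: the mean of the current and previous 4 closes
ma_5 = close.rolling(window=5).mean()
print(ma_5)
```

The first four values are NaN because a full 5-day window is not yet available.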

The MACD and RSI indicators are different. The Moving Average Convergence Divergence (MACD) and the Relative Strength Index (RSI) are two popular indicators used by technical analysts and day traders. They differ in what each is designed to measure. The MACD is calculated by subtracting the 26-period EMA from the 12-period EMA; the result is the MACD line. A nine-day EMA of the MACD, called the “signal line”, is then plotted on top of the MACD line and can act as a trigger for buy and sell signals. Traders may buy the security when the MACD crosses above its signal line and sell when it crosses below.
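
The MACD calculation described above can be sketched with pandas exponential moving averages. The price series is a random walk (an assumption for illustration), and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Random-walk stand-in for a closing price series
rng = np.random.default_rng(7)
close = pd.Series(3500 + rng.normal(0, 20, 120).cumsum())

# MACD line = 12-period EMA minus 26-period EMA
ema_12 = close.ewm(span=12, adjust=False).mean()
ema_26 = close.ewm(span=26, adjust=False).mean()
macd = ema_12 - ema_26

# Signal line = 9-period EMA of the MACD line
signal = macd.ewm(span=9, adjust=False).mean()
histogram = macd - signal

# A bullish crossover is a day where MACD moves from below to above the signal line
above = macd > signal
bullish_cross = above & ~above.shift(1, fill_value=False)
print("Bullish crossovers:", int(bullish_cross.sum()))
```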

RSI, meanwhile, aims to indicate whether a market is considered overbought (too many people have been buying the stock, so the price is too high to be pushed up further) or oversold (too many people have been selling the stock, or it is in a downtrend) relative to recent price levels. RSI computes average price gains and losses over a given period. The default is 14 periods, with values bounded from 0 to 100.
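
Below is a sketch of one common RSI variant, which smooths gains and losses with an exponential average using alpha = 1/period (Wilder-style smoothing). The function name `rsi` and the random-walk prices are assumptions for illustration, and charting libraries may use a slightly different smoothing.

```python
import numpy as np
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Relative Strength Index with Wilder-style exponential smoothing."""
    delta = close.diff()
    gain = delta.clip(lower=0)      # positive price changes
    loss = (-delta).clip(lower=0)   # negative price changes, as positive numbers
    avg_gain = gain.ewm(alpha=1 / period, min_periods=period, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1 / period, min_periods=period, adjust=False).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

# Random-walk stand-in for a closing price series
rng = np.random.default_rng(3)
close = pd.Series(3500 + rng.normal(0, 20, 100).cumsum())
rsi_14 = rsi(close)
print(rsi_14.tail())
```

Readings above 70 are commonly treated as overbought and readings below 30 as oversold, though those thresholds are a convention rather than a rule.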

If you want to create a visualization like the one above, you can use mplfinance, which you can access here for the full tutorial.

Machine Learning Modeling

The model used is Random Forest. Why? Because beforehand I compared the accuracy of several models with the help of GridSearchCV, which I also used to determine the best parameters. The results are quite satisfying across R2 Score, MAE, and MSE. The R2 Score of 0.96 indicates how much of the variation in the dependent data (label) can be explained by the independent data (features); it is at most 1 and typically falls between 0 and 1 for a reasonable model, so the result obtained with Random Forest can be considered very good. MAE, or Mean Absolute Error, is the average of the absolute differences between the predicted and actual values; for this model it is fairly low at 74.9. MSE, or Mean Squared Error, is the average of the squared differences between the predicted and actual values, with a result of 11,923.3 for this model.
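
The workflow described here, tuning a Random Forest with GridSearchCV and then scoring it with R2, MAE, and MSE, can be sketched as follows. The data is synthetic and the parameter grid is a small hypothetical one, not the grid used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

# Synthetic regression data: y depends on the first two features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.3, 300)
X_train, X_test, y_train, y_test = X[:240], X[240:], y[:240], y[240:]

# Small hypothetical grid; a real search would usually be wider
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    scoring="r2",
    cv=3,
)
search.fit(X_train, y_train)
y_pred = search.predict(X_test)

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(search.best_params_, round(r2, 3), round(mae, 3), round(mse, 3))
```

For real price data, where order in time matters, a time-aware splitter such as `TimeSeriesSplit` may be a better choice for `cv` than the default folds.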

Conclusion

So, that was a brief explanation of the important steps and tools I used to predict stock prices with machine learning. They include YFinance, which provides historical stock price data, and mplfinance, for finance-specific visualization. For the complete steps, you can look at my GitHub here.

Next step? Well, I think there is still a lot to improve. Maybe next I'll build a dashboard with the help of a back end like Dash or perhaps Streamlit. I'm still looking for the best tool for making visualizations that have the distinctive look of mplfinance but are also interactive like Plotly, if you know of one. I also want to apply this model to predict future prices and compare them with the actual stock prices. I have actually tried that already, but I'm not yet satisfied with the results. Maybe I'll share it next time. If you have a solution for me or want to discuss, feel free to comment. See you :)


RFM Analysis Using Python and SQL

Muhamad Salim Alwan — Fri, 04 Apr 2025 16:16:50 GMT

Photo by Carlos Muza on Unsplash

Recency, frequency, and monetary (RFM) analysis is a tool for understanding and assessing consumer behavior based on purchases. It quantitatively categorizes and classifies customers based on the recency, frequency, and monetary value of their transactions. The ultimate goal is to identify the most valuable customers so that marketing campaigns can be targeted and focused. Each customer is given a numerical score on each of these parameters, making the analysis objective and data-driven. RFM analysis is rooted in a famous marketing axiom, the Pareto Principle or 80/20 rule, under which 80% of the results come from 20% of the causes. The tool has three components: recency, frequency, and monetary.

Recency (R) as days since last purchase: Subtract the most recent purchase date from today to calculate recency. For example, 3 days, 5 days, etc.

Frequency (F) as the total number of transactions: How many times did a customer purchase? For example, 7 if someone ordered 7 times in a given time period.

Monetary (M) as the total money spent by a customer in a given currency and time period.
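
The three definitions above can be sketched in pandas as follows. The orders table, its column names, and the snapshot date are made up for illustration and are not the Northwind schema.

```python
import pandas as pd

# Made-up orders: one row per order
orders = pd.DataFrame({
    "customer_id": ["A", "A", "B", "C", "C", "C"],
    "order_id": [1, 2, 3, 4, 5, 6],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2024-02-10",
        "2024-01-20", "2024-02-15", "2024-03-10",
    ]),
    "amount": [120.0, 80.0, 50.0, 200.0, 150.0, 100.0],
})

# "Today" for the analysis; here one day after the last order in the data
snapshot_date = pd.Timestamp("2024-03-11")

grouped = orders.groupby("customer_id")
rfm = pd.DataFrame({
    "recency": (snapshot_date - grouped["order_date"].max()).dt.days,  # days since last order
    "frequency": grouped["order_id"].nunique(),                        # number of orders
    "monetary": grouped["amount"].sum(),                               # total spend
})
print(rfm)
```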

RFM analysis can be a powerful tool for gaining insights into customer segmentation. It can help answer important questions such as:

Who are our best customers?
Which customers are at risk of churning?
Who has the potential to become more valuable?
Which customers can be effectively retained?
Who are lost customers that you don’t need to pay much attention to?
Which group of customers is most likely to respond to your current campaign?

Preparing The Dataset

In this article, we will analyze a dataset from Kaggle called “Northwind Traders”. The dataset provides various data files, such as orders, categories, customers, employees, order details, products, and shippers. However, for RFM analysis we only need the orders and order details data. We will analyze it using a combination of Python and SQL. I use a Jupyter notebook in Google Colab to make the coding easier. You can access the dataset by clicking this link: click here.

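import pandas as pd

# `engine` is assumed to be a database connection (e.g. a SQLAlchemy engine) created earlier
query = "SELECT * FROM orders LIMIT 5;"
df = pd.read_sql(query, con=engine)
df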

Orders data from Northwind Traders dataset

query = "SELECT * FROM order_details LIMIT 5;"
df = pd.read_sql(query, con=engine)
df

Order details data from Northwind Traders dataset

You can see all the complete code used in this article by clicking here.

Creating RFM Customer Segmentation
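
As a rough sketch of the scoring step, the snippet below turns R, F, and M values into 1 to 5 scores using quintiles. The quintile scheme, the column names, and the sample numbers are assumptions for illustration; the complete code linked in this article may score and name the segments differently.

```python
import pandas as pd

# Made-up RFM table for five customers
rfm = pd.DataFrame(
    {
        "recency": [3, 40, 12, 90, 25],
        "frequency": [9, 2, 6, 1, 4],
        "monetary": [900.0, 150.0, 500.0, 60.0, 300.0],
    },
    index=["A", "B", "C", "D", "E"],
)

# Rank first so ties do not break the quintile edges.
# A lower recency is better, so its labels run from 5 down to 1.
rfm["r_score"] = pd.qcut(rfm["recency"].rank(method="first"), q=5, labels=[5, 4, 3, 2, 1]).astype(int)
rfm["f_score"] = pd.qcut(rfm["frequency"].rank(method="first"), q=5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["m_score"] = pd.qcut(rfm["monetary"].rank(method="first"), q=5, labels=[1, 2, 3, 4, 5]).astype(int)

# Combined score such as "555" for the best customers
rfm["rfm_score"] = (
    rfm["r_score"].astype(str) + rfm["f_score"].astype(str) + rfm["m_score"].astype(str)
)
print(rfm)
```

Segment names such as champions or at risk are then assigned by mapping ranges of these scores to labels.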

Visualizing The Result of RFM Analysis

Visualization of the result with radar chart, scatter plot, heatmap, and pie chart.

Conclusions

Based on the results, there are three dominant segments: loyal customers, champions, and potential loyalists. These three segments are also fairly well balanced, as depicted in the radar chart. This may reflect a well-executed strategy in that time period, whether in marketing performance, discount strategy, competitive pricing, better service, or other areas. To further increase the loyalty of these three segments, the company can sell high-value products, or offer new products and ask for reviews to increase participation. In addition, the company can offer a membership system with various benefits.

Although the results are very good, the company is advised to pay more attention to the other segments to avoid customer churn, for example the hibernating and at-risk segments, which are very likely to stop shopping at the company’s store. For segments like these, the company needs to use its marketing tools, such as email marketing, to keep offering products, both new products and products with special discounts, to attract customers back.

Finally, to predict future sales based on the 11 segments, we can look at the heatmap and the scatter plot. The heatmap shows a strong correlation of 0.94 between frequency and monetary. This means that when one of them goes up or down, the other tends to move in the same direction. We can then use the scatter plot to confirm this relationship by looking at how it plays out across customer segments. There are several outliers to watch out for in the champions and at-risk segments, that is, points that lie far from the main cluster of points, while the loyal customers segment and the others are relatively stable, with points close together.

RFM analysis can be a useful tool, especially in the field of customer relationship management (CRM). However, the analysis in this article could be taken further, for example with K-Means clustering and more engaging visualizations. In addition, the Northwind Traders dataset still allows for other analyses in the future, such as lead time analysis.

Any feedback is much appreciated :)

References:

https://www.putler.com/rfm-analysis/#Customer_segmentation_using_RFM_analysis
Lewaaelhamd, I. (2023). Customer Segmentation Using Machine Learning Model: An Application of RFM Analysis. Journal of Data Science and Intelligent Systems, 2(1), 29–36. https://doi.org/10.47852/bonviewJDSIS32021293
https://clevertap.com/blog/rfm-analysis/