Knowledge Discovery in Databases (KDD): A Practical Approach

Shawn Chumbar
4 min read · Sep 24, 2023


By Shawn Chumbar (Article generated using the help of ChatGPT)

KDD Process. Source: https://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html

Knowledge Discovery in Databases, commonly referred to as KDD, is a systematic approach to uncovering patterns, relationships, and actionable insights from vast datasets. It involves multiple steps, from selecting and preprocessing data to the actual process of data mining and finally to the interpretation and use of the results. In this article, we will walk through the KDD methodology using a real-world example based on daily household transactions.

I will be skipping some of the code here, so you can find the full Google Colab notebook in my GitHub repo at this link.

What is KDD?

KDD stands for Knowledge Discovery in Databases. It is an iterative process in which evaluation metrics can be developed, mining improved, and new data integrated and transformed to produce useful information. The typical steps involved in the KDD process are:

  1. Data Selection
  2. Data Cleaning (Preprocessing)
  3. Data Transformation
  4. Data Mining
  5. Interpretation (Internalization)

For our practical demonstration, we’ll use a dataset containing daily household transactions, which can be found here.

1) Data Selection

The first step in the KDD process is data selection. This involves gathering data from various sources to form a raw dataset. For this demonstration, we’re assuming that this step has been completed and we have our dataset ready.

import pandas as pd

# Load the dataset
data = pd.read_csv("/path/to/daily_household_transactions.csv")

# Display the first few rows of the dataset
data.head()

Our dataset comprises columns such as:

  • Date: The transaction date
  • Mode: Payment mode (Cash, Bank Account, etc.)
  • Category: Transaction category (Transportation, Food, etc.)
  • Subcategory: More detailed classification
  • Note: Additional transaction-related notes
  • Amount: Transaction amount
  • Income/Expense: Specifies if the transaction was income or an expense
  • Currency: Transaction currency (e.g., INR)

Each row represents a transaction, detailing how much was spent, on what, when, and using which payment mode.
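
Before cleaning, it's worth confirming the column types and row counts. A quick sanity check (assuming the dataset loaded above) might look like this:

# Inspect column types and non-null counts
data.info()

# Summary statistics for the numeric 'Amount' column
data['Amount'].describe()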

2) Data Cleaning (Preprocessing)

Data preprocessing is crucial for ensuring that our dataset is of good quality. This step involves handling inconsistencies, missing values, and noise.

# Check for missing values in the dataset
missing_values = data.isnull().sum()
missing_values
Date                0
Mode                0
Category            0
Subcategory       635
Note              521
Amount              0
Income/Expense      0
Currency            0
dtype: int64

In this dataset, the Subcategory and Note columns have a substantial number of missing values (635 and 521 respectively), while the remaining columns are complete. Further preprocessing might involve handling outliers, filling missing values, or encoding categorical variables.

For example, we can choose to fill the empty values with some sort of placeholder:

data['Subcategory'] = data['Subcategory'].fillna('Not Specified')
data['Note'] = data['Note'].fillna('No Note')
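
We could also screen the Amount column for extreme values before mining. Below is a minimal, illustrative sketch using the common 1.5 × IQR rule; it only flags potential outliers rather than dropping them, since whether to remove them depends on the analysis.

# Flag transactions whose Amount falls outside 1.5 * IQR of the middle 50%
q1, q3 = data['Amount'].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (data['Amount'] < q1 - 1.5 * iqr) | (data['Amount'] > q3 + 1.5 * iqr)
print(f"Potential outliers: {outlier_mask.sum()} of {len(data)} transactions")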

3) Data Transformation

Once the data is cleaned, it often needs to be transformed to be suitable for data mining. This might involve scaling, encoding, or feature engineering.

from sklearn.preprocessing import StandardScaler, LabelEncoder

# Encoding categorical variables
label_encoders = {}
for column in ['Mode', 'Category', 'Subcategory', 'Income/Expense', 'Currency']:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le

# Scaling numerical variable: 'Amount'
scaler = StandardScaler()
data['Amount'] = scaler.fit_transform(data[['Amount']])

# Display the preprocessed data
data.head()

Here, we’ve applied label encoding to transform categorical data into a format that’s more digestible for our machine learning algorithms.
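
Keeping each fitted LabelEncoder in the label_encoders dictionary also lets us map the integer codes back to the original category names later, which helps when interpreting results. A small sketch:

# Mapping of integer codes to original payment modes
mode_encoder = label_encoders['Mode']
print(dict(enumerate(mode_encoder.classes_)))

# Decode a few encoded values back to their original labels
print(mode_encoder.inverse_transform(data['Mode'].head().values))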

4) Data Mining

Data mining is the stage where we delve deep into the data to identify patterns, relationships, or anomalies. One of the most common techniques in this phase is clustering, which helps categorize data into different groups based on similarity. For our dataset, we’ll use the K-means clustering algorithm, a popular choice for its simplicity and efficiency.

Because the 'Amount' column was already scaled during the transformation step, K-means won't be biased by the raw magnitude of transactions. We first use the Elbow Method to choose a sensible number of clusters:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Determine the optimal number of clusters using the Elbow Method
# (Date and Note are dropped because they are non-numeric, free-text columns)
inertia = []
for i in range(1, 15):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(data.drop(columns=['Date', 'Note']))
    inertia.append(kmeans.inertia_)

# Plotting the Elbow Curve
plt.figure(figsize=(10, 6))
plt.plot(range(1, 15), inertia, marker='o', linestyle='--')
plt.title('Elbow Curve to Determine Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.show()

With a cluster count chosen from the elbow curve, the K-means algorithm was applied:

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, random_state=0)
data['Cluster'] = kmeans.fit_predict(data[['Mode', 'Category', 'Subcategory', 'Amount', 'Income/Expense']])

Here, the data is segmented into four clusters based on the ‘Mode’, ‘Category’, ‘Subcategory’, ‘Amount’, and ‘Income/Expense’ features.
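
Before interpreting the clusters, a quick check of how many transactions landed in each one helps catch degenerate results (for example, a single cluster absorbing almost everything):

# Number of transactions assigned to each cluster
print(data['Cluster'].value_counts().sort_index())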

5) Interpretation (Internalization)

Having mined the data, the next logical step is interpreting the results. This involves understanding the patterns, behaviors, and characteristics of the clusters we’ve identified.

import seaborn as sns

# Visualizing the clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='Amount', y='Mode', hue='Cluster', palette='viridis', s=60, alpha=0.7)
plt.title('Clusters based on Transaction Mode and Amount')
plt.legend(title='Cluster')
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.show()

This visualization helps in understanding how transactions are grouped based on their amount and mode. Each cluster can represent a specific behavior, like frequent small cash transactions or occasional large bank transactions.

To further interpret the clusters, we can inspect the average values for each cluster:

# Average feature values per cluster (numeric columns only)
cluster_avg = data.groupby('Cluster').mean(numeric_only=True)

By examining the cluster averages, we can deduce patterns such as:

  • Cluster 0 might represent routine daily expenses with a low average amount.
  • Cluster 1 could signify larger, infrequent expenses.
  • Cluster 2 might relate to income transactions, and so on.

Understanding these patterns can be invaluable for tasks like budget planning, expense tracking, or financial forecasting.
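
Because the categorical columns were label encoded, their cluster means are hard to read directly. One option (a sketch reusing the label_encoders dictionary from the transformation step) is to decode a column and look at its most common value per cluster:

# Most frequent payment mode per cluster, decoded back to its original name
mode_names = label_encoders['Mode'].inverse_transform(data['Mode'])
mode_profile = (
    data.assign(ModeName=mode_names)
        .groupby('Cluster')['ModeName']
        .agg(lambda s: s.mode().iloc[0])
)
print(mode_profile)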

Conclusion

In conclusion, the KDD process provides a systematic approach to derive knowledge from data. Through our example with daily household transactions, we showcased how each step of the process is essential in moving from raw data to actionable insights. Whether you’re working with transactional data like in our example or any other type of data, the KDD methodology offers a structured way to extract value from your datasets.
