Knowledge Discovery in Databases (KDD) and Data Mining

From data to knowledge: an overview of the KDD process

Luigi Rossetti
4 min read · Nov 5, 2022
Data mining
Photo by NASA on Unsplash

Knowledge Discovery in Databases (KDD) refers to the entire process of discovering new knowledge from data. The term was coined by Gregory Piatetsky-Shapiro at the first KDD workshop in 1989 to underline that knowledge is the end product of a data-driven, end-to-end process made up of many steps.

What is KDD? According to Fayyad, Piatetsky-Shapiro and Smyth [1]:

Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Let me clarify the formal definition:

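  • Non-trivial process: the term process means that KDD is made up of many steps, which involve data preparation, search for patterns, and knowledge evaluation and refinement, all repeated over multiple iterations with feedback and corrections. The term non-trivial means that some search or inference is involved: it is not a straightforward computation of predefined quantities, such as the average of a set of numbers.
  • Valid, novel, potentially useful: the discovered patterns should hold on new data with some degree of certainty, be new (at least to the system), and lead to some benefit for the user or the task. This is why the process doesn’t simply proceed with blind mining: its many steps are there to minimize the risk of producing meaningless and illusory patterns (what statisticians call “data dredging”).
  • Understandable patterns: the patterns should be understandable to a human, if not immediately then after some post-processing. Together with the previous properties, this implies that we can define quantitative measures for evaluating extracted patterns.
  • Data: here, data stands for a set of facts, for example the records in a database.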
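
To make the "valid" requirement and the risk of data dredging more concrete, here is a minimal sketch in plain Python. It is my own illustration, not something taken from [1]: the function names (`pearson`, `dredge`) and all parameters are assumptions. It searches many pure-noise attributes for the one that best correlates with a pure-noise target, then checks that "pattern" on rows that were not used in the search.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def dredge(n_rows=60, n_features=200, seed=0):
    """Blindly search noise attributes for the best match to a noise target.

    Returns the correlation found during the search and the correlation of
    the same attribute on held-out rows.
    """
    rng = random.Random(seed)
    # Target and attributes are independent random noise: nothing to discover.
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    features = [[rng.gauss(0, 1) for _ in range(n_rows)]
                for _ in range(n_features)]

    half = n_rows // 2
    # Blind mining: keep whichever attribute best "explains" the first half.
    best = max(features, key=lambda f: abs(pearson(f[:half], target[:half])))
    found = pearson(best[:half], target[:half])
    # Validity check: measure the same pattern on rows not used in the search.
    confirmed = pearson(best[half:], target[half:])
    return found, confirmed


if __name__ == "__main__":
    found, confirmed = dredge()
    print(f"correlation found by blind search: {found:+.2f}")
    print(f"same attribute on held-out rows:   {confirmed:+.2f}")
```

Because the search tries 200 attributes on only 30 rows, it will almost always find one that looks related to the target by chance alone. The held-out correlation is the honest estimate, and for noise it has no reason to stay high. This kind of check is what the evaluation step of KDD is for.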

KDD versus Data Mining

The point I want to stress is that there is a clear distinction between the KDD process and the data mining step. The former refers to the whole process of discovering knowledge from data, in which data mining is only one particular step. The latter is the application of specific algorithms for extracting patterns from data.

Image by Author — KDD vs Data Mining

KDD Workflow

As I said before, KDD is made up of many steps. The main workflow includes 9 of them [1]:

  1. Developing an understanding of the application domain: this initial preparatory step establishes what should be done. It covers the relevant prior knowledge and the goals of the analysis, which drive many of the later decisions.
  2. Selecting and creating a data set: this includes finding out what data is available and selecting the subset on which discovery will be performed, according to the goals of the analysis.
  3. Pre-processing and cleaning: this stage enhances data reliability. It includes data cleaning, such as handling missing values and removing noise or outliers.
  4. Data transformation: this stage prepares better data for the data mining step. Methods here include dimensionality reduction and attribute transformations.
  5. Choosing the appropriate data mining task: we are now ready to decide which type of data mining to use, for example classification, regression or clustering. This mostly depends on whether the KDD goals are descriptive or predictive.
  6. Choosing the data mining algorithm: this stage includes selecting the specific method, and therefore the algorithm, to be used for searching for patterns in the data.
  7. Employing the data mining algorithm: finally, the chosen algorithm is run on the prepared data in order to extract patterns.
  8. Evaluation of mined patterns: in this stage we evaluate and interpret the mined patterns with respect to the goals defined in the first step, with the possibility of returning to any of the previous steps.
  9. Using the discovered knowledge: we are now ready to incorporate the knowledge into another system for further action. The knowledge becomes active in the sense that we may make changes to the system and measure their effects.
Image by Author — KDD Workflow with the main steps
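
The nine steps above can be sketched as a tiny program. This is only a toy illustration under my own assumptions, not code from [1]: the records, the attribute names (`hours_studied`, `hours_slept`), the decision-stump "algorithm" and the 0.9 accuracy goal are all hypothetical. The point is to show where the data mining step sits inside the larger loop, and how evaluation can send us back to an earlier step.

```python
from statistics import fmean

# Step 1 (domain understanding, assumed): predict whether a student passes.
# Hypothetical records: (hours_studied, hours_slept, passed_exam).
# None marks a missing value. All names and values are made up.
RAW = [
    (1.0, 7.0, 0), (2.0, 6.5, 0), (2.5, None, 0), (3.0, 8.0, 0),
    (4.5, 7.0, 1), (5.0, 6.0, 1), (6.0, 7.5, 1), (7.0, 8.0, 1),
    (3.5, 7.0, 0), (5.5, 6.5, 1), (1.5, 8.0, 0), (6.5, 7.0, 1),
]
NAMES = ("hours_studied", "hours_slept")


def select(records, column):
    """Step 2: keep one candidate attribute plus the target."""
    return [(r[column], r[-1]) for r in records]


def clean(rows):
    """Step 3: replace missing values with the mean of the observed ones."""
    mean = fmean(x for x, _ in rows if x is not None)
    return [(mean if x is None else x, y) for x, y in rows]


def transform(rows):
    """Step 4: min-max scale the attribute to the range [0, 1]."""
    xs = [x for x, _ in rows]
    lo, hi = min(xs), max(xs)
    return [((x - lo) / (hi - lo), y) for x, y in rows]


def accuracy(rows, threshold):
    """Step 8: share of rows where (x >= threshold) matches the label."""
    return fmean(int((x >= threshold) == bool(y)) for x, y in rows)


def fit_stump(rows):
    """Steps 5-7: the task is classification, the algorithm a decision
    stump. Running it means picking the threshold that best fits the rows."""
    candidates = sorted({x for x, _ in rows})
    return max(candidates, key=lambda t: accuracy(rows, t))


def kdd(records, target_accuracy=0.9):
    """Run the loop, returning to step 2 while the goal is not met."""
    for column in (1, 0):  # deliberately try the weaker attribute first
        # For brevity, cleaning and scaling use all rows; in practice they
        # should be fitted on the training rows only.
        rows = transform(clean(select(records, column)))
        train, holdout = rows[::2], rows[1::2]
        threshold = fit_stump(train)
        score = accuracy(holdout, threshold)
        if score >= target_accuracy:
            # Step 9: hand the discovered knowledge to whoever will use it.
            return {"attribute": NAMES[column], "threshold": threshold,
                    "holdout_accuracy": score}
    return None  # no attribute met the goal: revisit step 1


if __name__ == "__main__":
    print(kdd(RAW))
```

With these toy records the first attempt (`hours_slept`) fails the evaluation, so the loop goes back to the selection step and settles on `hours_studied`. Note that only `fit_stump` is data mining in the strict sense. Everything around it is the rest of the KDD process.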

Conclusions

The entire KDD process is interactive and iterative: it finds, extracts and interprets patterns from data. It involves the repeated application of specific data mining methods and algorithms, and the interpretation of the patterns those algorithms generate. I hope you liked this brief story. Let me know if you want to read more on the topic.

References:

[1] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, Knowledge Discovery and Data Mining: Towards a Unifying Framework (1996)

My work is strongly based on this academic paper, a masterpiece of data science literature.

Thanks for reading! If you like, follow me for more.
