Knowledge Discovery and Data Mining: Towards a Unifying Framework


We define KDD (Fayyad, Piatetsky-Shapiro, & Smyth 1996) as:

Knowledge Discovery in Databases is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.


Here data is a set of facts (e.g., cases in a database) and pattern is an expression in some language describing a subset of the data or a model applicable to that subset. Hence, in our usage here, extracting a pattern also designates fitting a model to data, finding structure in data, or in general producing any high-level description of a set of data. The term process implies that KDD comprises many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple iterations. By nontrivial we mean that some search or inference is involved, i.e., it is not a straightforward computation of predefined quantities such as the average value of a set of numbers. The discovered patterns should be valid on new data with some degree of certainty.
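To make the distinction between a predefined quantity and a discovered pattern concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy (income, defaulted) cases, the rule form "income < threshold implies default", and the function names are hypothetical and do not come from the paper.

```python
from statistics import mean

# Hypothetical toy data: (income, defaulted) pairs, invented for illustration.
cases = [(20, True), (25, True), (30, True), (40, False),
         (55, False), (60, False), (35, True), (45, False)]

def trivial_summary(cases):
    """A predefined quantity: no search is involved, so this is not discovery."""
    return mean(income for income, _ in cases)

def rule_accuracy(cases, threshold):
    """Fraction of cases consistent with the pattern 'income < threshold => default'."""
    hits = sum((income < threshold) == defaulted for income, defaulted in cases)
    return hits / len(cases)

def search_threshold(cases):
    """Search the candidate thresholds and return the one whose rule fits best."""
    candidates = sorted({income for income, _ in cases})
    return max(candidates, key=lambda t: rule_accuracy(cases, t))

best = search_threshold(cases)
# The pattern is an expression that describes a subset of the data.
described_subset = [case for case in cases if case[0] < best]
```

The average is computed directly, whereas the threshold has to be searched for among candidates, which is the sense of nontrivial used above. Whether the rule that is found stays valid on new data is a separate question that this sketch does not address.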

We also want patterns to be novel (at least to the system, and preferably to the user) and potentially useful, i.e., they should lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some post-processing. The above implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possible to define measures of certainty (e.g., estimated prediction accuracy on new data) or utility (e.g., gain, perhaps in dollars saved due to better predictions or speed-up in response time of a system).
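The two kinds of measure mentioned above can be sketched as follows. This is an illustrative sketch only: the pattern, the held-out cases, and the dollar figures (saving_per_catch, cost_per_false_alarm) are hypothetical values chosen for the example, not quantities defined in the paper.

```python
def pattern(income):
    """Hypothetical discovered pattern: predict default when income is below 40."""
    return income < 40

def estimated_accuracy(pattern, holdout):
    """Certainty measure: share of held-out cases the pattern predicts correctly."""
    correct = sum(pattern(x) == y for x, y in holdout)
    return correct / len(holdout)

def utility_in_dollars(pattern, holdout, saving_per_catch, cost_per_false_alarm):
    """Utility measure: dollars saved by correct alarms minus the cost of false alarms."""
    total = 0
    for x, y in holdout:
        if pattern(x) and y:
            total += saving_per_catch      # a default that was correctly flagged
        elif pattern(x) and not y:
            total -= cost_per_false_alarm  # a good case that was wrongly flagged
    return total

# Hypothetical new data that was not used to find the pattern.
holdout = [(22, True), (38, False), (41, False), (33, True), (70, False)]

accuracy = estimated_accuracy(pattern, holdout)
gain = utility_in_dollars(pattern, holdout, saving_per_catch=100, cost_per_false_alarm=30)
```

Keeping the two measures separate reflects that a pattern can be accurate without being useful, and the reverse, depending on how costs and savings are assigned.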
