Project Management in Data Science using KDD
Preface
Knowledge Discovery in Databases (KDD) is a classical data science life cycle in which data is gathered from one or more sources and refined methodically. It aims to remove 'noise' (useless or misleading outliers) while prescribing a phased approach to deriving the patterns and trends that yield essential knowledge.
Data Mining or KDD Process?
In general, the terms Data Mining and KDD are used interchangeably. However, there is a subtle difference between them. KDD is the overall process of extracting insights, from data gathering through data cleaning to data analysis, while data mining is one step within the KDD process.
Data mining is the core step of applying algorithms (modeling) to extract patterns from the data. In brief, KDD represents the complete process, and data mining is a single step in that process.
Why Do We Need the KDD Process?
Our world is becoming data-driven, and we are generating data in abundance. However, data becomes beneficial only when it is filtered and processed to extract its actual value. Many people in an organization never perform such filtering, so the raw data adds little toward the organization's goals.
Moreover, filtering massive datasets has become even more challenging for professionals because of rising data-capture rates and system limitations. Capture rates have grown rapidly in recent years and keep increasing, and processing and filtering such volumes often demands hardware acceleration. This has driven the need for economical, scientific methods that scale up our analytical capabilities to handle big datasets.
KDD Process Steps
The KDD process is a flow of distinct steps or phases, moving from one to the next in the discovery of interesting patterns and knowledge. The number of steps varies (from five to seven) depending on the source. I will walk through seven phases and explain them as follows:
- Data Acquisition & Cleaning: We start by procuring data from databases and then apply cleansing operations (handling noise and inconsistencies) to it. Once these operations succeed, the cleaned data is moved to a data warehouse.
- Data Integration: In this step, we integrate data from one or more sources if required. If the new data is similar, you may need to apply the same cleaning operations as in the previous step. However, if the data is of a different type (such as images or audio), you may need a plan for integrating it with the existing data. Once again, the integrated data is moved to the data warehouse.
- Data Selection: In this step, the data relevant to analysis and model development is selected from the database. A dataset may contain hundreds of features, but not all of them are essential for modeling the problem. That's why it is crucial to select only the features that influence the outcome, which requires both scientific methods and domain knowledge. Once the essential features are determined, they are distributed among team members for analysis.
- Data Transformation: This step revolves around transforming and consolidating the data into a format appropriate for mining knowledge. You may need to apply summary or aggregation operations to achieve this. For example, your dataset may contain categorical data, and most algorithms expect numeric input; to make it machine-readable, you encode the categories as numbers.
- Data Mining: This is the essential step where you apply intelligent methods to extract patterns from the data. To solve the business problem and gain insights, you must know several algorithms. These algorithms may not solve every obstacle, and you may need to alter their standard execution or try a different approach when required.
- Pattern Evaluation: In this step, we measure how interesting the discovered patterns actually are, using metrics that vary with the problem and the implementation specifics. For example, for a classification problem you may use accuracy, precision, or recall; for a regression problem, R-squared or RMSE. Once you have good results (patterns), you can revisit your data and optimize further if required.
- Knowledge Presentation: This is the final step of the process, where visualization and data presentation skills are required. Good storytelling is also essential when presenting to an audience or stakeholders. Developing these skills takes time and effort that not everyone invests.
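The phases above can be sketched end to end on a toy dataset. This is a minimal, pure-Python illustration using made-up churn records and a hypothetical rule-based "model"; a real project would use libraries such as pandas and scikit-learn, but the phase boundaries are the same.

```python
# Toy records; the None age simulates a noisy record to be cleaned out.
raw = [
    {"age": 25, "plan": "basic", "churned": 0},
    {"age": None, "plan": "pro", "churned": 1},  # noisy record
    {"age": 42, "plan": "pro", "churned": 1},
    {"age": 31, "plan": "basic", "churned": 0},
    {"age": 58, "plan": "pro", "churned": 1},
]

# Phases 1-2 (Acquisition & Cleaning, Integration): drop incomplete records.
clean = [r for r in raw if all(v is not None for v in r.values())]

# Phase 3 (Selection): keep only the features relevant to the model.
selected = [{"plan": r["plan"], "churned": r["churned"]} for r in clean]

# Phase 4 (Transformation): encode the categorical feature as a number.
plan_codes = {"basic": 0, "pro": 1}
X = [plan_codes[r["plan"]] for r in selected]
y = [r["churned"] for r in selected]

# Phase 5 (Mining): a trivial discovered "pattern" -- pro-plan users churn.
def predict(plan_code):
    return 1 if plan_code == 1 else 0

preds = [predict(x) for x in X]

# Phase 6 (Evaluation): accuracy of the discovered rule on this toy data.
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)

# Phase 7 (Presentation): here, just a summary line.
print(f"records kept: {len(clean)}, accuracy: {accuracy:.2f}")
```

On this tiny example the rule happens to classify every kept record correctly; the point is the separation of phases, not the model.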
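The pattern-evaluation measures mentioned above (accuracy, precision, recall) can be computed directly from a confusion matrix. The labels below are made-up illustration data, not from any real project:

```python
# Ground-truth labels and a classifier's predictions (illustrative only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion-matrix counts for the positive class (1).
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)   # fraction of all correct predictions
precision = tp / (tp + fp)           # of predicted positives, how many were real
recall = tp / (tp + fn)              # of real positives, how many were found

print(f"accuracy={accuracy}, precision={precision}, recall={recall}")
```

Which measure matters most depends on the problem: precision when false positives are costly, recall when missing a positive is costly.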
Pros/Cons of the KDD Process
Although KDD is a valuable framework for planning and executing data science projects, this life cycle has some pros and cons, which are as follows:
Pros
- KDD helps identify and predict consumer trends. It also supports predicting which other types of products consumers might be willing to use, helping businesses gain a competitive advantage in their field.
- KDD is an iterative procedure: the knowledge acquired is fed back into the process (to the start of the cycle), improving progress toward the established objectives.
- KDD helps identify anomalies efficiently because the process divides the work into distinct steps. If we find an issue or ambiguity at any stage, we can trace back, verify the actions taken, and proceed accordingly.
Cons
- The process fails to address many realities of modern data science projects, such as data ethics, the chosen data architecture, and the roles of the various teams and their members.
- The process is time-consuming because it has to cycle back from the initial phase of the process to refine the established objectives.
- Data security is another weak point of this process. Businesses constantly look for ways to understand their customers better, which means collecting more data, and securing that data is undoubtedly essential. KDD works with data but provides no assurances about its security.
- If the business objectives are not clear, the process will fail miserably. That’s why it is essential to define the problem and the objectives clearly at the start of the project.
- The completion time of a data science project using KDD is sometimes unclear because the iterative refinement tasks are open-ended.
Alternative to KDD Process
There are other frameworks that you can use as an alternative to the KDD process. They also help extract knowledge from raw data, and they iterate over the entire process to refine results when required. These frameworks are:
- SEMMA (stands for Sample, Explore, Modify, Model, and Assess)
- CRISP-DM (stands for CRoss-Industry Standard Process for Data Mining)
These alternatives are broadly similar, sharing the same objective of solving business-oriented problems and gaining knowledge.
That’s it. I hope you liked this traditional data science framework and learned something valuable.
Final Thoughts and Closing Comments
There are some vital points many people fail to understand while they pursue their journey in Computer Science, Data Science and AI. If you are one of them and looking for a way to counterbalance these cons, then Follow me and Subscribe for more forthcoming articles related to Python, Computer Science, Data Science, Machine Learning, and Artificial Intelligence.
If you find this read helpful, then hit the Clap👏. Your encouragement will catalyze inspiration to keep me going and develop more cool stuff like this.