3 Most Popular Data Science Methodologies

Andres Ramirez
10 min read · Sep 5, 2021


Data Science is a complex process in which projects involve a wide variety of stakeholders, data sources, and goals. To help maintain order, various methodologies have been created to give both new and experienced Data Scientists a way to organize and structure their work. I will present three of the most popular methods and provide a description of each, which I hope will help with your future projects!

Cross-Industry Standard Process for Data Mining (CRISP-DM)

[1] CRISP-DM image by IBM

The first one I will cover is the Cross-Industry Standard Process for Data Mining, also known as CRISP-DM. It is considered one of the most popular methods in use to date. The methodology is split into six phases and is iterative, meaning the steps can be repeated as often as necessary until you reach your goals!

Business Understanding

In this phase, an individual or a group will gather the facts and requirements needed to begin the process. This includes answering questions like “How will our stakeholders be using the model we develop?” and “How will the model we develop help our stakeholders reach their target goal?” It is important at this point that the Data Scientist and the stakeholders are on the same page, both to prevent building a model that does not achieve its intended goal and so that everyone shares the same expectations regarding the outcome of the model being built. Put simply, it is important that everyone involved understands what the project is and is not.

Data Understanding

Once we have a firm understanding of the objectives of the project ahead of us, it is only natural that we develop the same understanding of the data available to us. It is important to know not only the information contained in our data but also where our data is coming from. For example, if the data was obtained by scraping a website, it is important to consider its reliability. It is also important to consider the distribution of our data, the amount of data available to us, the predictors within the data, and how accessible our data is. This keeps us from drifting away from the objective of our project and makes sure that we have the right predictors and enough data to achieve our goal.
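A quick first look at a dataset can answer many of these questions. Below is a minimal sketch using pandas; the file name “sales.csv” and its columns are hypothetical and only serve to illustrate the idea.

```python
# A minimal first pass at understanding a dataset with pandas.
# "sales.csv" and its columns are made up for illustration.
import pandas as pd

df = pd.read_csv("sales.csv")

print(df.shape)         # how much data do we actually have?
print(df.dtypes)        # what type is each predictor?
print(df.isna().sum())  # where are values missing?
print(df.describe())    # rough distribution of the numeric columns
```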

Data Preparation

At this stage, we have enough of an understanding of our data that we can begin the process of preparing the data for modeling. Some of the issues that we will resolve in this step include: dealing with missing values, converting and verifying data types, assessing for collinearity, normalizing numeric data, and converting categorical data to numerical format. The previous step is important in informing these decisions, since not understanding our data and objectives can negatively influence how our data is prepared. Luckily, there are many articles available on data preparation if you feel stuck on how to begin or how to manipulate your data to fit your goals.
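To make these steps concrete, here is a hedged sketch of a few common preparation tasks with pandas and scikit-learn. The column names (“price”, “region”, “signup_date”, “target”) are assumptions for illustration, not part of any real dataset.

```python
# A sketch of common data preparation steps; column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("sales.csv")

# Deal with missing values: impute a numeric column, drop rows missing the target
df["price"] = df["price"].fillna(df["price"].median())
df = df.dropna(subset=["target"])

# Verify and convert data types
df["signup_date"] = pd.to_datetime(df["signup_date"])

# Convert categorical data to numerical format
df = pd.get_dummies(df, columns=["region"], drop_first=True)

# Normalize the numeric predictors
df[["price"]] = StandardScaler().fit_transform(df[["price"]])
```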

Modeling

Once our data has been sufficiently cleaned and prepared, we are at the step where we can begin to create models with our data. At this point it is important to consider the type of model we hope to build, guided by our understanding of our objectives and the data at hand. Are we trying to classify images? Are we trying to build a recommendation system? Maybe we are trying to build a predictive model that requires some regression work. Alongside these questions, it is also important to consider how we will deal with issues such as underfitting and overfitting. On top of all this, how do we plan to validate our results? What will our train-test split look like? Do we need a validation set for our data? Arguably the most important question is what threshold of performance we will be comfortable calling a success. The answer will vary with our dataset, the models used, and the objectives at hand. It can be challenging to find a balance between these factors, so just as in the first phase, it is important that everyone involved is on the same page about what counts as a successful model.
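As a minimal sketch of the validation side of this step, here is a simple train-test split and baseline model with scikit-learn. The feature matrix X, target y, and the “target” column are assumptions carried over from the preparation sketch above.

```python
# A minimal modeling sketch: hold out a test set and fit a baseline model.
# X, y, and the "target" column are hypothetical.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = df.drop(columns=["target"])
y = df["target"]

# 80/20 train-test split; stratify keeps the class balance in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out data
```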

Evaluating

During this phase we take a step back and evaluate the models we have made! The evaluation stage is important because it not only helps us assess the current progress and success of our model but also provides the opportunity to see whether new insights can be derived from our work. For example, our model may highlight outcomes we did not expect or point to an extraneous variable that is confounding our results. As the image above shows, CRISP-DM is an iterative process, so we may need to return to the drawing board and alter our data and models to reflect the new information we have obtained! We may have to reassess our data, our model structure, the validation systems in place, or even the entire project in order to realign with the project’s objectives. This is definitely a step where communication between Data Scientists and stakeholders is key to make sure everyone is on the same page!
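Evaluation usually means going beyond a single accuracy number. Here is a small sketch, assuming the fitted model and test split from the modeling sketch above and a classification problem.

```python
# Evaluation beyond accuracy, assuming the earlier model / test split.
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)

# Per-class precision, recall, and F1 often reveal problems accuracy hides
print(classification_report(y_test, y_pred))

# The confusion matrix shows exactly which classes are being mixed up
print(confusion_matrix(y_test, y_pred))
```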

Deployment

Congrats! You have reached the point where the results of your model are satisfactory and you are ready to move on to deployment. At this point the goal is to work with the stakeholders to determine how to put the model into production. This includes automating the preparation steps so that the system can take in raw data and make it usable for modeling, which keeps supporting the insights derived from the model. The reward is seeing your hard work being used live!
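One common way to automate those preparation steps is to bundle them with the model in a single pipeline and save it as one artifact. The sketch below assumes scikit-learn, hypothetical column names (“price”, “region”), and a raw, uncleaned training DataFrame called raw_train with a “target” column.

```python
# A hedged sketch of packaging preprocessing and model together so the
# deployed artifact can accept raw data. Names are hypothetical.
import joblib
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["price"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

pipeline = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Fit on the raw training data so the same cleaning runs at prediction time
pipeline.fit(raw_train.drop(columns=["target"]), raw_train["target"])

# Ship one artifact that handles raw input end to end
joblib.dump(pipeline, "model.joblib")
```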

Knowledge Discovery in Databases

The next methodology we will cover is Knowledge Discovery in Databases (KDD). The term was coined by Gregory Piatetsky-Shapiro, who runs the popular blog site KDnuggets. The KDD process is similar to CRISP-DM, and a wonderful benefit of its standard diagram is that it shows the expected output at the conclusion of each stage. Below, I will provide a brief synopsis of what to expect at each stage.

Selection

In the selection stage, just like CRISP-DM, you begin by developing your business understanding of the project at hand. It is important to take some time to review past projects, research, and literature that resemble your project. This can help you gain insight into how other companies, stakeholders, or research groups have tackled similar projects and what may or may not have worked in their models.

Also similar to CRISP-DM, you will work to increase your understanding not only of the objectives of your project but also of the intricacies of your data. You will be asking yourself questions about where your data is coming from, the validity of your data, and its relevance to your goals.

Preprocessing

In this stage, we work on cleaning our data by implementing the necessary modifications. This is a more “basic” pass at cleaning, focused on issues such as missing data, outliers, incorrect data types, and irrelevant data. The data will begin to take shape here, and the remaining work needed to make it model-ready happens in the following stage.
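A minimal sketch of this “basic” cleaning is shown below, covering missing values, an irrelevant column, and a simple outlier rule. The file name “customers.csv” and the “income” and “internal_id” columns are assumptions for illustration.

```python
# Basic preprocessing: missing data, irrelevant columns, and outliers.
# File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("customers.csv")

# Drop a column irrelevant to the objective, fill a simple gap
df = df.drop(columns=["internal_id"])
df["income"] = df["income"].fillna(df["income"].median())

# Handle outliers with the 1.5 * IQR rule, clipping to the fences
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income"] = df["income"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```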

Transformation

The transformation stage is where we make the final modifications our data needs to be ready for modeling. This is where we really work on the data: feature engineering, and checking for multicollinearity and normality. It is also where we make sure each column is in the format we need it to be, such as converting values to numeric or string types.
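Here is a hedged sketch of a couple of transformation steps: deriving a new feature, a quick multicollinearity check, and a log transform toward normality. The column names (“total_spend”, “num_visits”) are made up for illustration.

```python
# Transformation sketch: feature engineering and a multicollinearity check.
# Column names are hypothetical.
import numpy as np

# Feature engineering: derive a ratio the raw data doesn't expose directly
df["spend_per_visit"] = df["total_spend"] / df["num_visits"].replace(0, np.nan)

# Flag pairs of numeric predictors that are highly correlated (> 0.9)
corr = df.select_dtypes(include="number").corr()
high = corr.abs().gt(0.9) & ~np.eye(len(corr), dtype=bool)
print(corr.where(high).stack())

# Log-transform a heavily skewed column toward normality
df["total_spend"] = np.log1p(df["total_spend"])
```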

Data Mining

This is where we take our processed data and begin applying different modeling techniques to achieve the objectives of our project. “Data mining” still refers to this part of the Data Science process, even if the term is not used as frequently as it once was. This stage encompasses the insights gathered by using algorithms to create actionable results from the databases we are provided. We strive to gain useful information that we can then share with our team and make available to stakeholders.
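As one example of a mining technique that produces actionable results, here is a minimal clustering sketch with k-means, reusing the hypothetical engineered features from the transformation sketch above.

```python
# One data-mining technique: segmenting records with k-means clustering.
# Feature columns are hypothetical.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = df[["spend_per_visit", "num_visits"]].dropna().copy()
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
features["segment"] = kmeans.fit_predict(scaled)

# Summarize each segment so the result is actionable for stakeholders
print(features.groupby("segment").mean())
```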

Interpretation and Evaluation

In the last stage of KDD, we use the insights and information gathered during the modeling stage to make predictions that help answer the objectives of our project. We then present those predictions in a way that makes our work accessible to our stakeholders. This may include doing the extra work to make sure our methods and results are effectively communicated to a non-technical audience. Also like CRISP-DM, the KDD process is iterative, so if issues we did not account for arise, new insights appear, or the objectives of our project change in light of our findings, we are able to repeat the process until we can deliver a successful result.

OSEMN

OSEMN diagram, adapted from KDnuggets

The final methodology I will be covering is the OSEMN model, sometimes also written “OSEMiN.” The beauty of the OSEMN model is its simplicity, which makes it easy to jump between sections. If barriers arise during your modeling work, you can easily jump back to the scrubbing section and rework your data to fit your project objectives.

Obtain

In the Obtain stage, similar to the Selection stage in KDD and the Business/Data Understanding phases in CRISP-DM, the focus is on increasing our understanding of the task at hand and getting on the same page with our team, including stakeholders. The goal is to obtain data, either provided by our stakeholders or scraped from other sources, that will hopefully help us find the answers we are looking for.

Scrub

Scrubbing is just what the name implies: the focus is on making our data nice and clean so it is ready for modeling. We will deal with our data’s outliers, normalize it, and engineer the features we need. We prepare our data so that we can analyze it further in the Explore stage, and it is entirely possible that exploration reveals additional scrubbing we need to do before moving on to modeling.

Explore

This stage is similar to the Data Understanding phase of CRISP-DM, in which we work to increase our knowledge of the information contained in our data. It is highlighted by the visuals we create in order to better understand our data: histograms, heatmaps, point plots, and a variety of other visualization methods. These help us understand the distribution of our data and check for normality and collinearity. Depending on the project at hand, it is important to make sure that modeling assumptions are being met and that our data is balanced, to reduce error in modeling.
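A short exploration sketch is shown below: a set of histograms for distributions and a correlation heatmap for collinearity, assuming a pandas DataFrame named df with numeric columns.

```python
# A hedged EDA sketch, assuming a pandas DataFrame `df`.
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of every numeric column at a glance
df.hist(figsize=(10, 8), bins=30)
plt.tight_layout()
plt.show()

# Correlation heatmap to spot collinear predictors before modeling
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm")
plt.show()
```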

Model

This stage is pretty self-explanatory. We take the data that we have worked so hard to scrub and prepare and put it into action! In modeling, we apply whatever has been agreed upon as the marker of success for our project. We choose the ML algorithms we will work with and focus on those that give us favorable results. Keep in mind that even at this stage we may realize that we left out a pertinent part of our data or that we need to restructure our data to produce better results; it is therefore recommended to revisit the previous two stages with that insight and retrain our models accordingly.
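As a minimal sketch of trying a few algorithms against the agreed-upon metric, here is a cross-validated comparison, assuming prepared features X and a binary target y from the earlier stages.

```python
# Comparing a few candidate models against one agreed metric.
# X and y are assumed to be the prepared features and a binary target.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# 5-fold cross-validated F1 keeps the comparison honest on imbalanced data
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```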

Interpret

In the final step of OSEMN, we take the results obtained from our project and share them with our stakeholders in an accessible way. Similar to the previous methodologies, we want to make sure that as we communicate our results, we frame them so that a non-technical audience can understand not only the results but also why our models did or did not work. This can greatly aid the discussion of the project and can bring to light other items that could potentially improve our results. Good communication also helps us explain what may have been lacking in our project and led to the results obtained, and it makes it easier to advocate for any additional resources we may need for current and future projects. If the team and stakeholders are all satisfied with the results, then we can move forward to model production and automation!

In conclusion, we reviewed three of the most widely used methodologies for outlining the Data Science process. It is important to remember that these are guidelines and are in no way set in stone, leaving the flexibility needed to go back to earlier steps as often as we need in order to obtain favorable results! It is worth exploring these and other methodologies to see which one you enjoy most, or whether different approaches are needed depending on the project at hand. They do a great job of providing checkpoints as we move through our current and future projects.

[1] IBM CRISP-DM Help Overview

https://www.ibm.com/docs/en/spss-modeler/SaaS?topic=dm-crisp-help-overview


Andres Ramirez

Data Science student changing careers from the field of Psychology. I hope to apply the skills I learned over the years to Data Science.