Methodical Questions to Solve Data Science Problems | Towards AI
The Data Science Methodology
Understanding the Data Science Life-cycle…
A docstring guides future users of a function (including your future self) by specifying its parameters and use cases. The Data Science Methodology does the same for data scientists.
The Data Science Methodology is an iterative system of methods that guides data scientists on the ideal approach to solving problems with data science, through a prescribed sequence of steps.
Why Data Science Methodology?
In a nutshell, the Data Science Methodology aims to answer ten basic questions, in a prescribed sequence, that cover the five main aspects of data science projects. These aspects are:
1. From Problem to Approach
2. From Requirements to Collection
3. From Understanding to Preparation
4. From Modelling to Evaluation
5. From Deployment to Feedback
The Data Science Methodology cited in this article was developed by John Rollins, a senior data scientist at IBM, based on the CRISP-DM process and on his experience as a data scientist for over two decades at IBM.
Let’s see a high-level description of each step in the Data Science Methodology, and the ten (10) fundamental questions every data scientist should ask.
From Problem to Approach:
1. What is the problem you are trying to solve?
For example, if a business owner asks, ‘How can we reduce the cost of performing an activity?’, the data scientist needs to understand whether the goal is to improve the efficiency of the activity or to increase business profitability.
Asking the right questions as a data scientist starts with understanding the business owner’s goal.
The right questions will inform the ideal analytical approach for solving the problem.
2. How can you use data to answer the question?
Selecting the right analytical approach depends on the question being asked. This entails having a clear business understanding. The analytical options may include:-
From Requirements to Collection:
3. What data do you need to answer the question?
If the problem to be resolved from steps 1 and 2 is the ‘recipe’ and data is the ‘ingredient’, then the data scientist needs to know which ingredients are required, how to source and collect them, and how to prepare the data to meet the desired outcome.
The task here includes:-
Identifying the necessary data contents, format and sources for initial data collection
4. Where is the data coming from (Identify all sources) and how to get it?
In this stage, the data requirements are revised and decisions are made as to whether the collection requires more data. Once the data ingredients are collected, the data scientist has a good understanding of what they will be working with.
Collecting data requires that you know where to find the data elements. These could come from existing public data repositories (check date stamps, as old data abounds!) or from web scraping; if your project involves geolocation data, API calls to retrieve such real-time data can be made on a portal like foursquare.com.
The data requirements and data collection stages are extremely important because the more relevant data you collect, the better your model is likely to be.
From Understanding to Preparation:
5. Is the Data that you collected representative of the problem to be solved?
In order to understand the data, we use descriptive statistics on the variables or columns. These univariate statistics may include the mean, median, mode, minimum, maximum and standard deviation. In pandas, the DataFrame.describe() method provides a good descriptive statistics summary.
A firmographic understanding of the data is also relevant at this stage, as is pairwise correlation, which shows how closely related the variables are. Where variables are highly correlated, and hence redundant, all but one of them can be dropped before modeling.
Visualization libraries such as Matplotlib and seaborn could be used to gain better insights about the data. Missing values are also evaluated at this stage.
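As a minimal sketch of this data understanding step, the snippet below summarizes a small table with pandas, counts missing values, and flags highly correlated column pairs. The dataset, the column names and the 0.95 threshold are all assumptions made up for illustration; they are not part of the methodology.

```python
import pandas as pd

# Hypothetical toy dataset; column names and values are invented for illustration.
df = pd.DataFrame({
    "age": [25.0, 51.0, 32.0, float("nan"), 47.0, 38.0],
    "income": [30_000.0, 42_000.0, 80_000.0, 95_000.0, 61_000.0, 58_000.0],
})
# A deliberately redundant column: it carries the same information as "income".
df["income_monthly"] = df["income"] / 12

# Descriptive statistics (count, mean, std, min, quartiles, max) per column.
summary = df.describe()

# Number of missing values per column.
missing = df.isna().sum()

# Pairwise Pearson correlation between the numeric columns.
corr = df.corr()

# Flag column pairs whose absolute correlation exceeds an arbitrary threshold.
threshold = 0.95  # assumption: choose a value that suits your problem
redundant_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(summary)
print(missing)
print(redundant_pairs)
```

Here only the income and income_monthly pair is flagged, so one of the two could be dropped before modeling.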
6. What additional work is required to manipulate and work with the data?
In a sense, data preparation is similar to washing freshly picked veggies, insofar as unwanted elements are removed. Together with data collection and understanding, data preparation is the most time-consuming aspect of data science projects, taking up 70%, or even up to 90%, of the overall project time.
Transforming data in this stage is a process of getting the data into a state where it may be easier to work with. Data cleansing and related preparation tasks include:-
- Handling missing data
- Correcting invalid values
- Removing duplicates
- Feature engineering
It is imperative to get this phase right; if it is done haphazardly, you risk going back to the drawing board.
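The preparation tasks listed above could be sketched in pandas roughly as follows. The records, the column names, the rule that negative ages are invalid, and the choice of median imputation are all assumptions for illustration; the right strategy depends on your data.

```python
import pandas as pd

# Hypothetical raw records; names and values are invented for illustration.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34.0, 29.0, 29.0, -5.0, float("nan")],
    "total_spend": [120.0, 80.0, 80.0, 200.0, 50.0],
    "n_orders": [4, 2, 2, 5, 1],
})

# 1. Duplicates: drop rows that are exact copies of an earlier row.
clean = raw.drop_duplicates().copy()

# 2. Invalid values: treat impossible ages (negative here) as missing.
clean.loc[clean["age"] < 0, "age"] = float("nan")

# 3. Missing data: fill missing ages with the median of the valid ages
#    (one of several possible imputation strategies).
clean["age"] = clean["age"].fillna(clean["age"].median())

# 4. Feature engineering: derive the average spend per order.
clean["spend_per_order"] = clean["total_spend"] / clean["n_orders"]

print(clean)
```

The order of the steps matters in this sketch: invalid values are marked as missing before imputation, so that they do not distort the median.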
From Modelling to Evaluation:
7. In what way can the data be visualized to get to the answer that is required?
Modeling is geared towards answering two key questions:-
A. What is the purpose of data modeling?
B. What are the characteristics of the process?
Modeling focuses on developing models that are either descriptive or predictive.
For example, a descriptive model can tell what new service a customer may prefer based on the customer’s existing preferences. Some examples of such algorithms are recommender systems and clustering algorithms.
Predictive modeling, by contrast, can tell a future value or class based on present data. Some examples are classification algorithms and linear or logistic regression.
The choice of model is based on the analytical approach chosen in step 2 for the problem stated in step 1.
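To make the descriptive versus predictive distinction concrete, here is a small sketch using only NumPy. The predictive part fits a least-squares line and predicts an unseen value; the descriptive part groups customers with a hand-rolled two-cluster k-means. The data, the variable names and the simplified k-means loop are assumptions for illustration, and a real project would more likely use a library such as scikit-learn.

```python
import numpy as np

# --- Predictive: fit a straight line and predict a future value ---
# Hypothetical data: advertising spend (x) and sales (y), invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
slope, intercept = np.polyfit(x, y, deg=1)  # ordinary least-squares line
predicted = slope * 6.0 + intercept         # prediction for an unseen x = 6

# --- Descriptive: group customers into two clusters with a tiny k-means ---
# Hypothetical monthly spend per customer.
spend = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 49.0])
centers = np.array([spend.min(), spend.max()])  # simple deterministic start
for _ in range(10):
    # Assign each customer to the nearest center, then recompute the centers.
    labels = np.argmin(np.abs(spend[:, None] - centers[None, :]), axis=1)
    centers = np.array([spend[labels == k].mean() for k in range(2)])

print(f"predicted sales at x=6: {predicted:.2f}")
print("cluster labels:", labels.tolist())
print("cluster centers:", centers.round(2).tolist())
```

The regression answers a "what will happen" question, while the clustering only describes structure that already exists in the data.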
8. Does the model used really answer the initial question or does it need to be adjusted?
Model evaluation goes hand in hand with model building; as such, model creation and evaluation are done iteratively.
Model evaluation is performed during model development and before the model is deployed. Evaluation allows the quality of the model to be assessed and it’s also a way to see if it meets the initial request.
A model evaluation has two main phases:
The Diagnostic Measures phase
The Statistical Significance phase.
The former is concerned with the actual performance of the model on a test data set, while the latter is concerned with how much confidence can be placed in the model’s predictions or descriptions.
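One possible way to illustrate the two phases is sketched below with NumPy. The diagnostic part computes accuracy, precision and recall on a held-out test set. The confidence part uses a bootstrap interval around the accuracy, which is a technique chosen here for illustration and not necessarily the one the methodology prescribes. The labels and predictions are invented.

```python
import numpy as np

# Hypothetical held-out test labels and model predictions (invented for illustration).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1])

# --- Diagnostic measures: how well does the model perform on the test set? ---
correct = y_true == y_pred
accuracy = correct.mean()

tp = int(((y_true == 1) & (y_pred == 1)).sum())  # true positives
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # false positives
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # false negatives
tn = int(((y_true == 0) & (y_pred == 0)).sum())  # true negatives
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# --- Confidence: bootstrap a 95% interval for the accuracy ---
rng = np.random.default_rng(42)
n = len(correct)
boot = np.array([
    correct[rng.integers(0, n, size=n)].mean()  # resample test cases with replacement
    for _ in range(2000)
])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
print(f"95% bootstrap interval for accuracy: [{ci_low:.2f}, {ci_high:.2f}]")
```

With only 20 test cases the interval is wide, which is a reminder that a single accuracy figure says little without a sense of its uncertainty.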
From Deployment to Feedback:
9. Can you put the model into practice?
While the data science model may present a solution, the key to making that solution relevant and useful to solve the initial problem is to get the relevant stakeholders acquainted with the tool produced.
This stage requires effective communication skills for onboarding.
The model may be deployed to a limited number of stakeholders initially or to a test environment to build up confidence in applying it for use across the board.
The model must be relatively intuitive to use, and staff members who may be responsible for applying the model to similar problems must be trained. It is important to document teething problems that may arise at this stage.
10. Can you get constructive feedback into answering the question?
Once deployed, feedback from the users will be used to refine the model and assess it for performance and impact. This will continue for as long as the solution is required.
The feedback process is based on the notion that ‘the more you know, the more you’d want to know’, and this involves gathering new data from the field to further develop the model.
Remember, the Data Science Methodology is an iterative process that follows a prescribed sequence. Thus it provides a structure.
Iterative means it’s a continuous cycle: the model gets trained, evaluated and deployed. The client provides feedback to the data scientist, who collects new data, processes it and further updates the model to perform better.
Prescribed sequence means step by step.
Note: It is possible to have multiple steps co-occurring, such as the data requirements and data collection steps, but the structure of the Data Science Methodology points us to the most effective use of our time and resources as we solve problems with data science.
By answering ten (10) simple questions methodically, we’ve seen that a methodology can help us solve not only data science problems but also other kinds of problems.