Data Science Methodology — How to design your data science project

Ashish Patel
Published in ML Research Lab
7 min read · Aug 8, 2019

Data Science Methodology Series…!!!

[Image. Source: easy projects]

Data science methodology is one of the most important subjects for any data scientist to know. I got stuck many times thinking about this problem, wondering how the data science cycle actually runs and how big companies design their data science methodology. My search ended when I found an excellent course on the topic on Coursera. It presents one of the best methodologies for converting a business problem into a data science solution, and it covers the whole project cycle. In this article I write up what I learned from the course. After reading it, you will know how to convert a business problem into a data-science-based solution.

Outline of this article

  1. Business understanding
  2. Analytic approach
  3. Data requirements
  4. Data collection
  5. Data understanding
  6. Data preparation
  7. Model Training
  8. Model Evaluation
  9. Deployment
  10. Feedback

Toward Data Science methodology

Welcome to Data Science Methodology 101! This is the beginning of a story that you will tell others in the years to come. It will not be as you experience it here, but through the stories you share with others as you explain how your understanding of a question led to an answer that changed the way in which something was done. Despite the increased computing power and access to data in recent decades, our ability to use data in the decision-making process is too often lost or not maximized. Too often we lack a solid understanding of the questions being asked and of how to apply the data correctly to the problem at hand.

That is why methodology comes into the picture when designing a solution to any problem.

[Image: definition of the word "methodology". Source: Business Dictionary]

Here is a definition of the word methodology. It is worth thinking about, because the temptation is often great to bypass the methodology and jump directly to solutions. Doing so, however, undermines our best intentions in trying to solve a problem.

# Data Science Methodology and Questions

[Image: the ten questions of the data science methodology. Source: Coursera.org]

The data science methodology aims to answer 10 basic questions in a given order. As you can see in the image above,

  1. Two questions define the problem and determine the approach to use.
  2. Four questions help you organize the data you need.
  3. Four final questions validate both the data and the approach.

Take a moment to familiarize yourself with the ten questions that are critical to your success.

This article series contains 5 modules:

  1. From Problem to Approach
  2. From Requirement to Collection
  3. From Understanding to Preparation
  4. From Modelling to Evaluation
  5. From Deployment to Feedback

In this article, we focus on:

#1) Business understanding

Understand the business
  • What is the problem you are trying to solve?

Every project, whatever its size, begins with business understanding, which forms the basis of an effective solution to the business problem. Business partners who need the analytics solution play a critical role in this phase by defining the problem, the project objectives, and the solution requirements from a business perspective. This is the first step in any data science methodology.

#2) Analytic approach

  • How can you use the data to answer the question?

Once a business problem has been clearly identified, the Data Scientist can define the analytical approach. To do this, the problem must be expressed in the context of statistical learning and machine learning techniques so that the Data Scientist can identify the techniques to achieve the desired result.
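As a rough illustration of expressing a business question in terms of learning techniques, here is a minimal Python sketch. The question categories and the technique families they map to are my own illustrative assumptions, not a list taken from the course, and `suggest_approach` is a hypothetical helper.

```python
# Hypothetical mapping from the kind of business question to a family of
# techniques. These categories are illustrative assumptions only.
APPROACH_BY_QUESTION = {
    "yes_no": "classification",
    "how_much": "regression",
    "which_group": "clustering",
    "what_is_unusual": "anomaly detection",
}


def suggest_approach(question_kind: str) -> str:
    """Return a candidate technique family for a kind of business question."""
    if question_kind not in APPROACH_BY_QUESTION:
        raise ValueError(f"Unknown question kind: {question_kind!r}")
    return APPROACH_BY_QUESTION[question_kind]


# "Will this customer churn?" is a yes/no question.
print(suggest_approach("yes_no"))
```

In practice the choice is rarely a lookup; the point is only that the wording of the question narrows down which techniques are worth trying.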

#3) Data requirements

  • What data do you need to answer the question?

The analytic approach determines the data requirements, because the analysis methods to be used call for specific content, formats, and data representations, guided by domain knowledge.

#4) Data collection

  • Where is the data coming from (identify all sources) and how will you get it?

The Data Scientist identifies and collects data resources (structured, unstructured and semi-structured) that are relevant to the problem area. If the data scientist finds gaps in the data collection, they may need to revise the data requirements and collect more data.
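Checking collected sources against the data requirements can be sketched in a few lines of Python. The field names, source names, and the `find_gaps` helper are hypothetical and only serve the example.

```python
# Fields the analytic approach is assumed to need (hypothetical names).
REQUIRED_FIELDS = {"customer_id", "age", "monthly_spend", "churned"}


def find_gaps(required, sources):
    """Return the required fields that no collected source provides."""
    collected = set()
    for fields in sources.values():
        collected.update(fields)
    return sorted(set(required) - collected)


# Hypothetical sources and the fields each one offers.
sources = {
    "crm_export": {"customer_id", "age"},
    "billing_db": {"customer_id", "monthly_spend"},
}

gaps = find_gaps(REQUIRED_FIELDS, sources)
print(gaps)  # ['churned']
```

A non-empty result is the signal to go back, either to find another source or to revise the requirements.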

#5) Data understanding

  • Is the data that you collected representative of the problem to be solved?

Descriptive statistics and visualization techniques can help a data scientist understand the content of the data, assess its quality, and obtain initial insights about the data. A return to the previous step, data collection, may be necessary to fill gaps in understanding.
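A first pass at descriptive statistics might look like the sketch below. It uses only Python's standard `statistics` module; the records and the `describe` helper are made up for illustration, and a real project would more likely use a data-frame library.

```python
import statistics

# Hypothetical sample records; None marks a missing value.
records = [
    {"age": 34, "monthly_spend": 52.0},
    {"age": 45, "monthly_spend": None},
    {"age": 29, "monthly_spend": 48.0},
    {"age": 51, "monthly_spend": 75.0},
]


def describe(rows, column):
    """Summarize one numeric column, counting missing values separately."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),  # raises if every value is missing
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


summary = describe(records, "monthly_spend")
print(summary)
```

Even this small summary answers quality questions early: how many values are missing, and whether the range looks plausible for the problem.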

#6) Data preparation

  • What additional work is required to manipulate and work with the data?

The data preparation step includes all the activities used to build the data set for the modeling phase. This includes cleansing data, combining data from multiple sources, and transforming data into more useful variables. In addition, feature engineering and text analysis can be used to derive new structured variables that enrich the set of predictors and improve model accuracy.
The data preparation phase is usually the longest. I have seen it take up to 90% of a project's total duration, although the figure is more typically around 70%. It can drop to as low as 50% if the data resources are well managed, well integrated, and analytically clean, not merely stored. Automating parts of data preparation can reduce the share further: members of a telecommunications marketing team once told me that their team had cut the average time to create and implement promotions from three months to three weeks.
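The three activities named above (cleansing, combining sources, and deriving new variables) can be sketched together. The two sources, the field names, the `prepare` helper, and the spend threshold are all hypothetical choices for the example.

```python
def prepare(customers, billing):
    """Clean, combine, and enrich two hypothetical data sources."""
    # Combining: index billing rows by customer_id for the join.
    spend_by_id = {row["customer_id"]: row["monthly_spend"] for row in billing}

    prepared = []
    for c in customers:
        spend = spend_by_id.get(c["customer_id"])
        # Cleansing: drop rows with a missing age or no billing record.
        if c.get("age") is None or spend is None:
            continue
        prepared.append({
            "customer_id": c["customer_id"],
            "age": c["age"],
            "monthly_spend": spend,
            # Derived variable: the 60.0 threshold is an arbitrary assumption.
            "high_spender": spend >= 60.0,
        })
    return prepared


customers = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": None},
    {"customer_id": 3, "age": 51},
    {"customer_id": 4, "age": 40},
]
billing = [
    {"customer_id": 1, "monthly_spend": 52.0},
    {"customer_id": 2, "monthly_spend": 30.0},
    {"customer_id": 3, "monthly_spend": 75.0},
]

dataset = prepare(customers, billing)
print(dataset)
```

Dropping incomplete rows is only one option; imputing missing values is another, and which one is right depends on the problem.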

#7) Model Training

  • In what way can the data be visualized to get the answer that is required?

Starting with the first version of the prepared data set, data scientists use a training set (historical data in which the desired outcome is known) to develop predictive or descriptive models using the analytic approach described earlier. The modeling process is highly iterative, and it may vary from one problem to another.
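To make "training on historical data with a known outcome" concrete, here is a minimal sketch that fits a straight line by ordinary least squares. Linear regression is just one technique I picked for illustration; the methodology itself does not prescribe it, and the toy data is invented.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Training set: inputs with known outcomes (toy data following y = 2x + 1).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

model = fit_line(train_x, train_y)
print(model, predict(model, 5.0))
```

The iterative part comes from repeating this with different features, techniques, or parameters until the evaluation step is satisfied.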

#8) Model Evaluation

  • Does the model used really answer the initial question or does it need to be adjusted?

The Data Scientist evaluates the quality of the model and verifies that it addresses the business problem completely and adequately. This involves computing several diagnostic measures and other outputs, such as tables and graphs, using a test set for a predictive model.
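For a binary classifier, a few common diagnostic measures can be computed on a held-out test set as sketched below. The labels and predictions are invented, and accuracy, precision, and recall are my choice of example metrics, not a list from the course.

```python
def evaluate(y_true, y_pred):
    """Compute confusion-matrix counts and basic metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical test-set labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

report = evaluate(y_true, y_pred)
print(report)
```

Which measure matters most depends on the business problem: a missed positive and a false alarm rarely cost the same.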

#9) Deployment

  • Can you put the model into practice?

Once a satisfactory model has been developed and approved by the business sponsors, it is deployed in the production environment or in a comparable test environment. Such deployment is often limited at first to allow for performance evaluation. Implementing a model in an operational business process generally involves multiple groups, capabilities, and technologies.

#10) Feedback

  • Can you get constructive feedback into answering the question?

By collecting the results of the implemented model, the organization receives feedback on the performance of the model and its impact on the implementation environment. By analyzing this information, the data scientist can refine the model, increasing its accuracy and, therefore, its utility.

This phase, often neglected, can have significant additional benefits when carried out as part of the overall process. The flow of this methodology illustrates the iterative nature of the problem-solving process.
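The feedback loop can be sketched as a simple check that compares live performance with the performance measured during evaluation. The `needs_refinement` helper, the tolerance value, and the sample outcomes are hypothetical.

```python
def needs_refinement(baseline_accuracy, outcomes, tolerance=0.05):
    """Flag the model for refinement if live accuracy drops too far.

    outcomes is a list of (predicted, actual) pairs collected after
    deployment. The tolerance of 0.05 is an arbitrary assumption.
    """
    if not outcomes:
        return False  # nothing collected yet, so nothing to judge
    correct = sum(1 for predicted, actual in outcomes if predicted == actual)
    live_accuracy = correct / len(outcomes)
    return live_accuracy < baseline_accuracy - tolerance


# Hypothetical results gathered from the deployed model.
outcomes = [(1, 1), (0, 1), (1, 0), (0, 0), (1, 1)]

flag = needs_refinement(0.75, outcomes)
print(flag)  # True: live accuracy is 0.6, below 0.75 - 0.05
```

When the flag is raised, the cycle starts again: revisit the data, retrain, and re-evaluate.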

I hope this gives you a basic understanding of the process cycle and of how to think at each stage, so that you can steer your data science project toward a successful methodology.

Thanks for reading! This is a continuing series of articles, so stay tuned for the upcoming modules.

References :

  1. https://www.coursera.org/learn/data-science-methodology
  2. element61.be/en/competence/data-science-methodology
