Part-2 Data Science Methodology From Requirement to Collection

Ashish Patel
ML Research Lab
6 min read · Aug 10, 2019

From Requirement to Collection…!!!

Source : Coursera.org

In the first articles of this series I described the basics of the Data Science Methodology, including the key questions to ask at each stage. Here I walk through the next module of the same course. This is the second article of the series; if you haven't read the first one, just go through the links below.

Article Series :

  1. Overview of Data Science Methodology
  2. Part-1 Data Science Methodology- From Problem to Approach
  3. Part-2 Data Science Methodology From Requirement to Collection
  4. Part-3 Data Science Methodology From Understanding to Preparation
  5. Part-4 Data Science Methodology From Modelling to Evaluation
  6. Part-5 Data Science Methodology From Deployment to Feedback

#1) Data Requirement

Imagine that your goal is to prepare a spaghetti dinner, but you don't have the right ingredients for the dish; your success will suffer.

Think of this section of data science methodology as cooking with data. Each step is essential for the preparation of the meal.

So, if the problem to be solved is, so to speak, the recipe, and the data are the ingredients, then the data scientist must identify the necessary ingredients, how to obtain or collect them, and how to understand and work with them, so that the data are ready to achieve the desired result.

  • Based on the understanding of the problem and the chosen analytic approach, the data scientist is ready to begin. Let's look at some examples of data needs within the data science methodology. Before the data collection and preparation stages are performed, it is important to define the data requirements for decision tree classification.
  • This involves identifying the content, formats, and sources of the data needed for the initial collection. Now consider how the case study applies "data requirements".

Case Study:

In the case study, the first task was to define the data required for the selected decision tree classification approach. This involved selecting a suitable cohort of patients from the health insurance provider's member base.

In order to compile complete medical records, three criteria were defined for inclusion in the cohort.

  • First, a patient had to be admitted within the provider's service area, so that the required information would be accessible.
  • Second, for one year, they focused on patients with a primary diagnosis of heart failure.
  • Third, a patient must have had a continuous record of at least six months prior to the initial heart failure diagnosis, so that a complete medical history could be compiled.

Patients with congestive heart failure who had also been diagnosed with other serious conditions were excluded from the cohort, because such conditions could cause above-average readmission rates and therefore distort the results.
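Taken together, the three inclusion criteria plus the exclusion rule amount to a simple row filter over the member data. A minimal sketch in pandas, assuming purely illustrative column names (none of these come from the actual case study):

```python
import pandas as pd

# Hypothetical member data; column names are illustrative only.
members = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "in_service_area": [True, True, True, False],
    "primary_diagnosis": ["CHF", "CHF", "CHF", "CHF"],
    "months_of_history": [12, 4, 9, 24],
    "other_serious_condition": [False, False, True, False],
})

cohort = members[
    members["in_service_area"]                     # criterion 1: admitted in the provider's service area
    & (members["primary_diagnosis"] == "CHF")      # criterion 2: primary diagnosis of heart failure
    & (members["months_of_history"] >= 6)          # criterion 3: at least six months of prior history
    & ~members["other_serious_condition"]          # exclusion: other serious conditions
]
print(cohort["patient_id"].tolist())  # → [1]
```

Only patient 1 satisfies all three criteria without triggering the exclusion; the others fail on history length, a co-occurring serious condition, or service area.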

  • Next, the content, format, and representations of the data needed for decision tree classification were defined.
  • This modeling technique requires one record per patient, with columns representing the model's variables. To model readmission outcomes, data covering all aspects of the patient's medical history had to be available.
  • This content includes admissions; primary, secondary, and tertiary diagnoses; procedures; prescriptions; and other services provided during hospitalizations or patient/doctor visits.

In this way, a given patient could have thousands of records representing all of their attributes. To arrive at a one-record-per-patient format, the data analysts aggregated the transactional records and created a set of new variables to represent that information. This was really a task for the data preparation stage, which shows why it is important to anticipate the next phases.
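Collapsing many transaction rows into one row per patient is essentially a pivot/aggregation. A rough sketch, again with invented record types and codes rather than the case study's real data:

```python
import pandas as pd

# Hypothetical transaction-level data: many rows per patient.
transactions = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "record_type": ["diagnosis", "procedure", "prescription",
                    "diagnosis", "prescription"],
    "code": ["I50", "93000", "RX42", "I50", "RX07"],
})

# Derive one row per patient, with a count per record type
# serving as a set of new model variables.
patient_level = (
    transactions
    .pivot_table(index="patient_id", columns="record_type",
                 values="code", aggfunc="count", fill_value=0)
    .reset_index()
)
print(patient_level)
```

In a real project the new variables would be far richer (flags, durations, most recent values), but the reshaping idea is the same.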

#2) Data Collection

  • Once the initial data collection is completed, the data scientist takes stock to determine whether they have everything required. As when buying ingredients for a meal, some ingredients may be out of season, harder to obtain, or more expensive than originally planned.
  • At this stage, the data requirements are revisited and a decision is made as to whether more or less data is needed.
  • Once the data components have been collected, the data scientist will have a good sense of what they will be working with.
  • Techniques such as descriptive statistics and visualization can be applied to the dataset to assess the content, quality, and initial insights of the data. Gaps in the data are identified, and plans must be made to fill or replace them.

Essentially, the ingredients are now sitting on the cutting board. Now let's look at some examples of the data collection phase in the data science methodology. This stage follows directly from the data requirements stage: to collect data, you must know the source, or at least where the required data items are located. Let us now consider how the case study applies "data collection".
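The assessment described above (descriptive statistics, then spotting and filling gaps) can be sketched in a few lines of pandas; the dataset and columns here are invented for illustration:

```python
import pandas as pd

# Illustrative collected dataset; None marks a gap to fill or replace later.
df = pd.DataFrame({
    "age": [63, 71, None, 58],
    "ejection_fraction": [35.0, 40.0, 30.0, None],
})

print(df.describe())        # descriptive statistics: count, mean, std, quartiles
print(df.isna().sum())      # per-column count of gaps in the data

# One simple replacement plan: impute each gap with the column median.
df_filled = df.fillna(df.median())
```

Median imputation is just one of many strategies; the point of this stage is to make the gaps visible and decide deliberately how to handle them.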

Case Study :

  • In our case study, this information included demographic, clinical, and coverage information about patients, provider information, claims records, as well as pharmaceutical and other information related to all heart failure diagnoses.
  • The case study also required specific drug information, but that data source was not yet integrated with the rest of the sources.
  • This brings up an important point: it is acceptable to defer decisions about unavailable data and to try to acquire it later.
  • For example, this can happen even after intermediate results have been obtained from predictive modeling. If those results had indicated that drug information might be important for a good model, it would have been worth spending the time to acquire it.

However, it turned out that a reasonably good model could be built without the drug information.

  • Database administrators and programmers often work together to extract data from various sources and then merge it.
  • In this way, redundant data can be removed, and the data made available for the next stage of the methodology, namely data understanding.
  • At this stage, data scientists and analytics team members can discuss ways to better manage the data, including automating certain database processes to facilitate collection.
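The extract-merge-deduplicate step the team performs can be sketched as a join across two hypothetical source extracts (table and column names are invented, not from the case study):

```python
import pandas as pd

# Hypothetical extracts from two separate source systems.
clinical = pd.DataFrame({"patient_id": [1, 2],
                         "diagnosis": ["CHF", "CHF"]})
claims = pd.DataFrame({"patient_id": [1, 1, 2],
                       "claim_amount": [120.0, 120.0, 340.0]})  # one redundant row

# Remove redundant records, then combine the sources on the shared key.
merged = clinical.merge(claims.drop_duplicates(),
                        on="patient_id", how="left")
print(merged)
```

Deduplicating before the join matters: a redundant claims row would otherwise produce a duplicated patient row in the merged output.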

Thanks for Reading…!!! Happy Learning…!!!

References :

  1. https://www.coursera.org/learn/data-science-methodology
