Data Science: Data understanding, gathering and exploration

Sourabh Potnis
3 min read · Mar 8, 2022


Photo by Lukas Blazek on Unsplash

In the previous post, we discussed Business understanding and defining a problem. In this post, we will discuss data gathering, understanding, and exploration.

Once the business and ML problem is defined, the next step is to identify and gather the data, understand its schema and meaning, and then explore it.

Identify data sources —

The first step is to identify the different data sources that are relevant and useful for solving the problem. These data sources can be internal (at the business unit or org level) or external (free or paid). Business end users/analysts will help you identify some of the sources; others, especially external ones, need to be researched. You should also consider the cost and time factors for each dataset, i.e. whether the data is free or needs to be purchased, and how much time it will take to have the data ready for modelling.

Feasibility and accessibility of data —

Understand how feasible it is to get access to the data. Based on the compliance and regulations at the org/industry/government level, you should check whether you can use the identified data as is or whether some data modification is needed, such as:

  • Data masking/encryption
  • Data quality improvement
  • PII (Personally Identifiable Information) or biased features handling/removal
  • Regulatory constraints handling
  • Data aggregation at a particular level

This is important for maintaining data privacy and fairness.
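As an illustration, here is a minimal sketch of two of the modifications listed above: masking a PII column by hashing it, and aggregating to a coarser level. The dataset and column names are assumptions made up for the example.

```python
import hashlib

import pandas as pd

# Hypothetical customer dataset containing a PII column (email).
df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "age": [34, 29, 41],
    "balance": [1200.0, 560.5, 980.0],
})

# Mask the PII column by hashing it, so records can still be joined or
# deduplicated without exposing the raw value, then drop the original.
df["email_hash"] = df["email"].apply(
    lambda x: hashlib.sha256(x.encode("utf-8")).hexdigest()
)
df = df.drop(columns=["email"])

# Aggregate to a coarser level when row-level data cannot be shared.
agg = df.groupby("age", as_index=False)["balance"].mean()

print(df.head())
print(agg)
```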

Also check whether you can use the data (especially external data) from a security and legal/licensing perspective, e.g. you should not use external sources if they are not verified or are blacklisted (not allowed by the security and/or legal teams of the organisation or country).

There can also be restrictions in terms of sensitivity and availability of the data, e.g. some data can be used only within a certain department/context or cannot be merged with certain types of datasets.

If you are using APIs to fetch the data, there will be limits on the number of calls that can be made per minute/hour, and there will be costs involved for consuming some of the external sources. Do a cost-benefit analysis and get budget approvals for such data, or try to use verified open source datasets.
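As a rough sketch of respecting such call limits, the snippet below throttles requests to a hypothetical paginated endpoint. The URL, quota, and response shape are assumptions, so adapt them to your provider's documentation.

```python
import time

import requests

API_URL = "https://api.example.com/v1/records"  # hypothetical endpoint
MAX_CALLS_PER_MINUTE = 60  # check the provider's documented rate limit


def fetch_page(page: int) -> dict:
    """Fetch one page of data, then pause to stay under the per-minute quota."""
    response = requests.get(API_URL, params={"page": page}, timeout=30)
    response.raise_for_status()
    time.sleep(60.0 / MAX_CALLS_PER_MINUTE)
    return response.json()


records = []
for page in range(1, 6):  # fetch the first five pages as an example
    records.extend(fetch_page(page).get("results", []))
```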

You should maintain a document that tracks each dataset's metadata (data about the dataset), its provenance (which source/application the data comes from), the regulatory requirements to be followed, and its data lineage (the pipeline steps taken to get from the raw source data to clean data ready for modelling).
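One lightweight way to keep such a tracking document is a small structured record per dataset. The sketch below uses a Python dataclass; the dataset, source, owner, and pipeline step names are purely illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """One entry in the dataset-tracking document."""
    name: str
    source: str                                       # provenance: originating system/application
    owner: str                                        # team or application that owns the data
    regulations: list = field(default_factory=list)   # e.g. GDPR, PCI DSS
    lineage: list = field(default_factory=list)       # pipeline steps from raw to clean


# Hypothetical example entry.
transactions = DatasetRecord(
    name="card_transactions",
    source="payments_core_db",
    owner="payments-data-team",
    regulations=["PCI DSS", "GDPR"],
    lineage=["extract from source DB", "mask PAN", "deduplicate", "load to data lake"],
)
print(transactions)
```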

Get data access —

Find out where the feasible and accessible data resides. It can be in files such as CSV or Excel, or in data marts, a data warehouse, or a data lake. Find out who (team/application) owns it and get access to this data.
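Once access is granted, loading the data usually comes down to a few reader calls. The sketch below shows pandas reads from flat files and from a warehouse table via SQLAlchemy; the file paths, connection string, and table names are assumptions for illustration.

```python
import pandas as pd
from sqlalchemy import create_engine

# Flat files (paths are hypothetical).
orders = pd.read_csv("data/orders.csv")
targets = pd.read_excel("data/targets.xlsx", sheet_name="2022")  # needs openpyxl installed

# Tables in a data mart / warehouse, via a SQLAlchemy connection string.
engine = create_engine("postgresql://user:password@warehouse-host:5432/analytics")
customers = pd.read_sql("SELECT * FROM customers", con=engine)
```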

Data understanding —

Once you get the required data access, the next step is to understand the data from a business, process, and user perspective. There will be multiple data sources such as databases, tables, files, and APIs. Identify the relevant entities and their columns that are useful for the analysis and modelling.
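A quick way to start this step is to inspect the schema and a sample of each table, then keep only the columns that matter for the problem. Here is a minimal pandas sketch; the file path and column names are hypothetical.

```python
import pandas as pd

df = pd.read_csv("data/orders.csv")  # hypothetical dataset

# Schema: column names, dtypes, and non-null counts.
df.info()

# A quick look at the actual values and basic statistics.
print(df.head())
print(df.describe(include="all"))

# Keep only the entities/columns relevant to the problem (names are assumptions).
relevant_columns = ["order_id", "customer_id", "order_date", "amount"]
df = df[relevant_columns]
```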

Data exploration —

Once we have understood the data from a process and user perspective, the next step is to explore it statistically for insights.

For basic data exploration, the following open source packages are available that we can readily use (a minimal usage sketch follows the list):

  • Pandas profiling — HTML profiling reports from pandas DataFrame objects
  • lux — Automatically visualize your pandas dataframe via a single print!
  • Sweetviz — Visualize and compare datasets, target values and associations
  • dataprep — Connect, explore and prepare your data
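As an example of how little code basic profiling takes, here is a minimal pandas-profiling sketch; the input file is a hypothetical placeholder, and the other packages listed above follow a similarly small API.

```python
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("data/orders.csv")  # hypothetical dataset

# Generate an HTML report covering types, missing values, distributions,
# and correlations for every column.
profile = ProfileReport(df, title="Orders profiling report")
profile.to_file("orders_profile.html")
```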

For custom and detailed data exploration, the following packages are available (see the plotting sketch after this list):

  • Superset — Data Visualization and Data Exploration Platform
  • Metabase — business intelligence and analytics
  • Matplotlib — For data plotting
  • Seaborn — Statistical data visualization in Python
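For the custom route, a short Matplotlib/Seaborn sketch is below; the dataset and column names (amount, discount, segment) are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data/orders.csv")  # hypothetical dataset

# Distribution of a numeric column.
plt.figure()
sns.histplot(data=df, x="amount", bins=30)
plt.title("Order amount distribution")
plt.show()

# Relationship between two numeric columns, split by a categorical feature.
plt.figure()
sns.scatterplot(data=df, x="amount", y="discount", hue="segment")
plt.title("Amount vs. discount by segment")
plt.show()
```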

Once you have gathered insights and found issues in the data through exploration, the next step is to preprocess this data to make it clean and ready for modelling.
