Data Science: Data understanding, gathering and exploration

Sourabh Potnis
3 min read · Mar 8, 2022


Photo by Lukas Blazek on Unsplash

In the previous post, we discussed Business understanding and defining a problem. In this post, we will discuss data gathering, understanding, and exploration.

Once the business and ML problem is defined, the next step is to identify and gather the data, understand its schema and meaning, and then explore it.

Identify data sources —

The first step is to identify the different data sources that are relevant and useful for solving the problem. These data sources can be internal (at the business unit or org level) or external (free or paid). Business end users/analysts will help you identify some of the sources; others, especially external ones, need to be researched. You should also consider the cost and time factors for each dataset, i.e. whether the data is free or needs to be purchased, and how much time it will take to have the data ready for modelling.

Feasibility and accessibility of data —

Understand how feasible it is to get access to the data. Based on the compliance and regulations at the org/industry/government level, you should check whether you can use the identified data as is or whether some data modification is needed, such as:

  • Data masking/encryption
  • Data quality improvement
  • PII (Personally Identifiable Information) or biased features handling/removal
  • Regulatory constraints handling
  • Data aggregation at a particular level

This is important for maintaining data privacy and fairness.
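As an illustration, here is a minimal sketch of two of the modifications listed above: masking a PII column by hashing it, and aggregating to a coarser level. The dataset and column names are assumptions made up for the example.

```python
import hashlib

import pandas as pd

# Hypothetical customer dataset containing a PII column (email).
df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "age": [34, 29, 41],
    "balance": [1200.0, 560.5, 980.0],
})

# Mask the PII column by hashing it, so records can still be joined or
# deduplicated without exposing the raw value, then drop the original.
df["email_hash"] = df["email"].apply(
    lambda x: hashlib.sha256(x.encode("utf-8")).hexdigest()
)
df = df.drop(columns=["email"])

# Aggregate to a coarser level when row-level data cannot be shared.
agg = df.groupby("age", as_index=False)["balance"].mean()

print(df.head())
print(agg)
```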

Also check whether you can use the data (especially external data) from a security and legal/licensing perspective, e.g. you should not use external sources if they are not verified or are blacklisted (not allowed by the security and/or legal teams of the organisation or country).

There can also be restrictions in terms of sensitivity and availability of the data, e.g. some data can be used only within a certain department/context or cannot be merged with certain types of datasets.

If you are using APIs to fetch the data, there will be limits on the number of calls that can be made per minute/hour, and there will be costs involved for consuming some of the external sources. Do a cost-benefit analysis and get budget approvals for such data, or try to use verified open source datasets.
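As a rough sketch of respecting such call limits, the snippet below throttles requests to a hypothetical paginated endpoint. The URL, quota, and response shape are assumptions, so adapt them to your provider's documentation.

```python
import time

import requests

API_URL = "https://api.example.com/v1/records"  # hypothetical endpoint
MAX_CALLS_PER_MINUTE = 60  # check the provider's documented rate limit


def fetch_page(page: int) -> dict:
    """Fetch one page of data, then pause to stay under the per-minute quota."""
    response = requests.get(API_URL, params={"page": page}, timeout=30)
    response.raise_for_status()
    time.sleep(60.0 / MAX_CALLS_PER_MINUTE)
    return response.json()


records = []
for page in range(1, 6):  # fetch the first five pages as an example
    records.extend(fetch_page(page).get("results", []))
```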

You should maintain a document that tracks each dataset's metadata (data about the dataset), its provenance (which source/application the data comes from), the regulatory requirements to be followed, and its data lineage (the pipeline steps taken to get from the raw source data to clean data ready for modelling).
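One lightweight way to keep such a tracking document is a small structured record per dataset. The sketch below uses a Python dataclass; the dataset, source, owner, and pipeline step names are purely illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """One entry in the dataset-tracking document."""
    name: str
    source: str                                       # provenance: originating system/application
    owner: str                                        # team or application that owns the data
    regulations: list = field(default_factory=list)   # e.g. GDPR, PCI DSS
    lineage: list = field(default_factory=list)       # pipeline steps from raw to clean


# Hypothetical example entry.
transactions = DatasetRecord(
    name="card_transactions",
    source="payments_core_db",
    owner="payments-data-team",
    regulations=["PCI DSS", "GDPR"],
    lineage=["extract from source DB", "mask PAN", "deduplicate", "load to data lake"],
)
print(transactions)
```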

Get data access —

Find out where the feasible and accessible data resides. It can be in files such as CSV or Excel, or in data marts, a data warehouse, or a data lake. Find out who (team/application) owns it and get access to this data.
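Once access is granted, loading the data usually comes down to a few reader calls. The sketch below shows pandas reads from flat files and from a warehouse table via SQLAlchemy; the file paths, connection string, and table names are assumptions for illustration.

```python
import pandas as pd
from sqlalchemy import create_engine

# Flat files (paths are hypothetical).
orders = pd.read_csv("data/orders.csv")
targets = pd.read_excel("data/targets.xlsx", sheet_name="2022")  # needs openpyxl installed

# Tables in a data mart / warehouse, via a SQLAlchemy connection string.
engine = create_engine("postgresql://user:password@warehouse-host:5432/analytics")
customers = pd.read_sql("SELECT * FROM customers", con=engine)
```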

Data understanding —

Once you get the required data access, the next step is to understand the data from a business, process, and user perspective. There will be multiple data sources such as databases, tables, files, and APIs. Identify the relevant entities and their columns that are useful for the analysis and modelling.
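A quick way to start this step is to inspect the schema and a sample of each table, then keep only the columns that matter for the problem. Here is a minimal pandas sketch; the file path and column names are hypothetical.

```python
import pandas as pd

df = pd.read_csv("data/orders.csv")  # hypothetical dataset

# Schema: column names, dtypes, and non-null counts.
df.info()

# A quick look at the actual values and basic statistics.
print(df.head())
print(df.describe(include="all"))

# Keep only the entities/columns relevant to the problem (names are assumptions).
relevant_columns = ["order_id", "customer_id", "order_date", "amount"]
df = df[relevant_columns]
```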

Data exploration —

Once we have understood the data from a process and user perspective, the next step is to explore it statistically for insights.

For basic data exploration, the following open source packages are available that we can readily use (a minimal usage sketch follows the list):

  • Pandas profiling — HTML profiling reports from pandas DataFrame objects
  • lux — Automatically visualize your pandas dataframe via a single print!
  • Sweetviz — Visualize and compare datasets, target values and associations
  • dataprep — Connect, explore and prepare your data
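As an example of how little code basic profiling takes, here is a minimal pandas-profiling sketch; the input file is a hypothetical placeholder, and the other packages listed above follow a similarly small API.

```python
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("data/orders.csv")  # hypothetical dataset

# Generate an HTML report covering types, missing values, distributions,
# and correlations for every column.
profile = ProfileReport(df, title="Orders profiling report")
profile.to_file("orders_profile.html")
```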

For custom and detailed data exploration, the following packages are available (see the plotting sketch after this list):

  • Superset — Data Visualization and Data Exploration Platform
  • Metabase — business intelligence and analytics
  • Matplotlib — For data plotting
  • Seaborn — Statistical data visualization in Python
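For the custom route, a short Matplotlib/Seaborn sketch is below; the dataset and column names (amount, discount, segment) are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data/orders.csv")  # hypothetical dataset

# Distribution of a numeric column.
plt.figure()
sns.histplot(data=df, x="amount", bins=30)
plt.title("Order amount distribution")
plt.show()

# Relationship between two numeric columns, split by a categorical feature.
plt.figure()
sns.scatterplot(data=df, x="amount", y="discount", hue="segment")
plt.title("Amount vs. discount by segment")
plt.show()
```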

Once you have gathered insights and found issues in the data through exploration, the next step is to preprocess this data to make it clean and ready for modelling.
