An Underrated Skill in Data Science: Asking Good Questions

May 9, 2022

This is Part 2 of a series of blog posts on how to build a data science portfolio project. You can also check out the previous and next posts.

Data science is usually defined as answering questions with data. An underrated skill, though, is asking questions: asking plenty of them and, most importantly, asking the right ones.

I gathered a list of questions to ask as you move through the data science project life cycle. Please help me expand the list if you have more questions to add. Following the phases of the CRISP-DM process, here are the questions you should be asking at each stage:

Stage 1: Business Understanding

What does the business need?
How does the business need translate into a DS problem?
Is the data available?
How will you measure whether the solution is successful?
Are the goals realistic?

Stage 2: Data Understanding

What data do you need to answer your analytical question? What data exists?
Where can you find this data? Is it public? How do you sample from this data?
Do you need to collect your own data?
Are there any privacy/legal issues that you must consider prior to using the data?
Is the data structured or unstructured?
Do you need more data?
What features do you need? Which features are the most important according to stakeholders?

Stage 3: Data Preparation

Are there any inconsistencies in the data?
Is the data balanced/imbalanced?
Is the data up to date?
Are there missing values? How will you deal with them?
Are there any duplicates?
Is the data valid in terms of data types, value ranges, regular expression patterns, etc.?
How are different features distributed?
Are there any outliers or anomalies?
Are there any specific trends?
What are the relationships between features?
Do you need new features? Can you transform raw data into more meaningful features?
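
Several of the questions above (missing values, duplicates, balance, outliers) can be answered with a few lines of code. Here is a minimal sketch using only the Python standard library. The records, the field names (id, age, label) and the helper functions are all made up for illustration, and the 1.5 x IQR rule is just one common convention for flagging outliers, not the only one.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical toy records; the field names are assumptions for this example.
rows = [
    {"id": 1, "age": 34, "label": "no"},
    {"id": 2, "age": 29, "label": "no"},
    {"id": 3, "age": None, "label": "yes"},  # missing age
    {"id": 4, "age": 41, "label": "no"},
    {"id": 4, "age": 41, "label": "no"},     # exact copy of the previous row
    {"id": 5, "age": 230, "label": "no"},    # implausible age
    {"id": 6, "age": 38, "label": "no"},
]


def missing_counts(rows):
    """Count None values per field (fields with no missing values are omitted)."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                counts[field] += 1
    return dict(counts)


def duplicate_count(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = set()
    dupes = 0
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            dupes += 1
        seen.add(key)
    return dupes


def class_balance(rows, field):
    """Return the share of each value of `field`, e.g. the target label."""
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def iqr_outliers(values):
    """Flag values outside 1.5 * IQR of the quartiles (one common rule of thumb)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


ages = [row["age"] for row in rows if row["age"] is not None]
print("missing:", missing_counts(rows))
print("duplicates:", duplicate_count(rows))
print("balance:", class_balance(rows, "label"))
print("outliers:", iqr_outliers(ages))
```

On a real project you would likely run the same checks with a data frame library, but the questions stay the same: how much is missing, what is repeated, how skewed is the target, and which values look implausible.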

Stage 4: Modeling/Evaluation

Do you even need machine learning for the business problem?
Which ML model is most appropriate for the problem and the data?
How do you balance the accuracy and computational cost of the analysis process?
What is the baseline model and how do you improve from there?
What metrics will prove that the project is successful?
How does your solution help with the business problem?
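
To make the baseline question concrete, here is a small sketch of one possible baseline for a classification problem: always predict the most frequent label seen in training. The labels and the candidate model's predictions below are invented for illustration, and accuracy is only one metric you might pick.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return the most frequent training label; the baseline always predicts it."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and predictions, made up for this example.
train_labels = ["no", "no", "yes", "no", "no", "yes", "no", "no"]
test_labels = ["no", "yes", "no", "no", "yes"]
model_preds = ["no", "yes", "no", "yes", "yes"]  # output of some candidate model

baseline_label = majority_baseline(train_labels)
baseline_acc = accuracy(test_labels, [baseline_label] * len(test_labels))
model_acc = accuracy(test_labels, model_preds)

print(f"baseline ({baseline_label!r} every time): {baseline_acc:.2f}")
print(f"candidate model: {model_acc:.2f}")
```

If a candidate model barely beats a baseline like this, that is a good moment to return to the first question in this stage and ask whether you need machine learning at all. With imbalanced data, the baseline's accuracy can look deceptively high, which is one reason to choose the success metric carefully.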

Stage 5: Delivery/Deployment

How do you present your results?
How do you make your visualizations simple but powerful?
Is your presentation easy for non-technical people to understand?
Do you need a living model? Which tools do you need to deploy it?
How do you keep your model up to date?

Next in this series, I will be writing about how to identify a business problem and design a machine learning solution. Please check it out!

Written by neslihan bisgin