Gowtham Dongari
Jul 8, 2017 · 3 min read

Before diving deep into the concepts of data science, I wanted to get a feel for what real datasets look like, so I explored the Wine Quality dataset. It contains data on red and white wines and is available from the UCI Machine Learning Repository.

The data covers red and white variants of the Portuguese "Vinho Verde" wine, and I set out to analyse what it says about wine quality.

As a first step I loaded the datasets using pandas and started exploring them. I found the following attributes:

1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol
Output variable (based on sensory data):
12. quality (score between 0 and 10)

These attributes describe the chemical composition and characteristics of each wine sample.
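Here is a minimal sketch of that first look. It is not my exact code. It assumes the semicolon-separated CSV files that the UCI repository distributes; the file names in the usage comment, the helper name `load_wine` and the extra `color` column are assumptions added for illustration.

```python
import pandas as pd


def load_wine(red_source, white_source):
    """Load the red and white wine files and stack them into one DataFrame.

    Each source can be a file path or a file-like object.
    """
    # Assumption: the files use ';' as the field separator.
    red = pd.read_csv(red_source, sep=";")
    white = pd.read_csv(white_source, sep=";")

    # Hypothetical helper column so the two wine types stay distinguishable.
    red["color"] = "red"
    white["color"] = "white"

    return pd.concat([red, white], ignore_index=True)


# Usage (assumed file names, adjust to wherever the files were saved):
# wine = load_wine("winequality-red.csv", "winequality-white.csv")
# print(wine.shape)    # number of rows and columns
# print(wine.dtypes)   # type of each attribute
# print(wine.head())   # first few samples
```

Keeping both wine types in one frame with a label column makes it easy to compare them later, but loading them separately works just as well.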

Now the interesting part is to evaluate the data and work out the key findings: what is the data trying to tell us? After some street-fighting of my own, I started checking solutions for this dataset from different resources and looked at the distribution patterns.

distribution patterns

I was puzzled to see what the conclusion was!

It said that wine quality does not appear to be well explained by its chemical properties.
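One rough way to probe a conclusion like this yourself is to look at how the quality scores are distributed and how strongly each attribute correlates with quality. The sketch below assumes a pandas DataFrame `df` that holds the attributes listed earlier, including a `quality` column. The function names are hypothetical, and Pearson correlation is my choice here, not necessarily what the solutions I read used.

```python
import pandas as pd


def quality_distribution(df, target="quality"):
    """Count how many samples received each quality score, ordered by score."""
    return df[target].value_counts().sort_index()


def quality_correlations(df, target="quality"):
    """Pearson correlation of every numeric attribute with the target.

    The result is ordered by absolute strength, strongest first.
    """
    numeric = df.select_dtypes("number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Usage, assuming `df` is the loaded wine data:
# print(quality_distribution(df))
# print(quality_correlations(df))
```

Correlation only captures linear, one-attribute-at-a-time relationships, so weak values here are a hint and not proof that the chemistry is irrelevant.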

So after this exercise my first takeaway was that we should figure out which questions we want the data to answer before we experiment on it.

To question anything well, we should first know about the “5 Whys”, an extremely basic and important concept. For those who don’t know it, the 5 Whys is an iterative interrogative technique used to explore the cause-and-effect relationships underlying a particular problem.

The primary goal of the technique is to determine the root cause of a defect or problem by repeating the question “Why?”

In this case, the questions might be:

1. How is the quality of the wines rated?

2. What factors define a high-quality wine?

3. What is causing wine defects, and what does the distribution of the data look like?

This kind of questioning helps in understanding the problem. After an initial look at the data, we should figure out whether we can interpret anything from it: whether its distribution patterns point to interdependencies between the attributes, whether the attributes are independent, or whether there is any other clue in the data.
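As a rough illustration of checking for interdependency between attributes, the sketch below lists the attribute pairs whose pairwise correlation is strong. It assumes a pandas DataFrame `df` with the numeric attributes. The function name `strongly_related_pairs` and the 0.7 cut-off are arbitrary assumptions on my part, not a standard rule.

```python
import pandas as pd


def strongly_related_pairs(df, threshold=0.7):
    """Return (attribute_a, attribute_b, correlation) for strongly related pairs.

    A pair qualifies when the absolute Pearson correlation is at least
    `threshold`. The result is ordered from strongest to weakest.
    """
    corr = df.select_dtypes("number").corr()
    cols = list(corr.columns)
    pairs = []
    # Visit each unordered pair once (upper triangle of the matrix).
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Usage, assuming `df` is the loaded wine data:
# for a, b, r in strongly_related_pairs(df):
#     print(f"{a} <-> {b}: {r:.2f}")
```

An empty result suggests the attributes are roughly independent in the linear sense; a long list suggests several of them carry overlapping information.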

My next observation was about how relevant the data we are working on is to the underlying problem.

This means being able to understand the business perspective and recognising that not all problems have a single root cause. If one wishes to uncover multiple root causes, the method must be repeated, asking a different sequence of questions each time. You may have to learn to look at things in a more strategic way (i.e. by thinking laterally 😉).


GreyAtom is committed to building an educational ecosystem for learners to upskill & help them make a career in data science.
