Understanding data science in a simpler way

Published in Analytics Vidhya, Aug 1, 2021 (6 min read)

In our school days, we learned history because it taught us about our origins and evolution, ancient civilisations, agriculture, urbanisation, and much more. Data are like human behaviour: we learn from each other and behave in certain ways, and that behaviour forms patterns that help us make assumptions and predict outcomes.

I was discussing data science with a group of people who make their living by training athletes. They were unable to follow the topics because there was too much technical jargon. So how do you explain data science to them in a way they will understand?

I will cover this under the following headings:

Data
Exploratory Data Analysis
Group Analysis
Clustering
Regression or Classification

Data

I started simply by asking, “What do you mean by data?” To know your audience, you have to start with questions. Most of the answers leaned towards numerical forms of data only.

I told them that data is everywhere; even what I am saying at this very moment is a form of data. I added that if we count how many people are present in this room, that count is quantitative data, and if we group ourselves into categories such as athletics, soccer, basketball, badminton and others, those categories are qualitative data.

It is important to explain with examples that are relevant to the audience’s life or work.
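The room example can be sketched in a few lines of Python. This is only an illustration: the list of sports below is made up, and the variable names are my own.

```python
from collections import Counter

# Hypothetical room: one entry per person, naming the sport they train.
sports_in_room = [
    "athletics", "soccer", "basketball", "soccer",
    "badminton", "athletics", "soccer",
]

# Quantitative data: something we can count or measure.
head_count = len(sports_in_room)

# Qualitative (categorical) data: labels we can group and tally.
people_per_sport = Counter(sports_in_room)

print("People in the room:", head_count)
print("People per sport:", dict(people_per_sport))
```

The head count is a number, while the sport labels are categories; tallying the labels produces numbers again, which is how qualitative data usually enters an analysis.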

Exploratory Data Analysis

Now that we know what data is, in this section we dig deeper into it to gain insights such as behaviour, patterns, relationships and associations. I continued with the earlier example of ourselves in the room: how many members are there in total, how many belong to each category, what is the average age of the members present, what is the highest qualification obtained, and so on. If someone in the group does not match our pattern, segment or behaviour, that person is an outlier. In simpler words, if a person from a chemistry background is present in our room of sports trainers, he will be an outlier.

If we want to establish a relationship or association between two variables, we can’t simply claim that one exists, folks. We need some evidence behind it, and from a data science perspective that means formulating a hypothesis and checking it with a statistical test.
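Here is a minimal sketch of those first questions in Python. The members, their sports and their ages are made up for illustration. The outlier in the room example is a categorical one (the chemistry person); to show the numeric version, the sketch flags unusual ages with the common 1.5 × IQR rule, which is my choice of rule and only one of several in use.

```python
from collections import Counter
from statistics import mean, quantiles

# Hypothetical members of the room; sports and ages are made up.
members = [
    {"sport": "athletics", "age": 22},
    {"sport": "soccer", "age": 24},
    {"sport": "basketball", "age": 25},
    {"sport": "soccer", "age": 23},
    {"sport": "badminton", "age": 26},
    {"sport": "athletics", "age": 24},
    {"sport": "soccer", "age": 58},
]

# Basic exploratory questions: how many, how many per category, average age.
total_members = len(members)
members_per_sport = Counter(m["sport"] for m in members)
ages = [m["age"] for m in members]
average_age = mean(ages)

# 1.5 * IQR rule: values far outside the middle 50% are flagged as outliers.
q1, _, q3 = quantiles(ages, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outlier_ages = [age for age in ages if age < lower_fence or age > upper_fence]

print("Total members:", total_members)
print("Members per sport:", dict(members_per_sport))
print("Average age:", round(average_age, 2))
print("Outlier ages:", outlier_ages)
```

Notice how a single unusual age pulls the average up, which is one reason outliers are worth spotting before drawing conclusions.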

Group Analysis

I asked the audience to suppose that you are an event manager and have to decide which venue is best for an event. You did your research and came up with a shortlist of three restaurants, A, B and C. Now you have to select one of the three, and the admin will ask why you chose that restaurant. To answer, you formulate a hypothesis and perform a statistical test. For parametric data (data that follows a normal distribution), that is a Student’s t-test for two groups or ANOVA for more than two groups. For non-parametric data, it is the Mann-Whitney U test for two independent groups, the Wilcoxon signed-rank test for two paired groups, or the Kruskal-Wallis test for more than two groups. I am naming these tests only to give you a basic idea, but the question arises: what do we mean by parametric data?

Figure 1: Normal Distribution — **image by author**

Figure 2: Non-Normal Distribution — **image by author**

The two graphs above help in understanding the distribution of data. If an attribute is normally distributed, it looks like Figure 1 and can be treated as parametric data; otherwise it looks like Figure 2 and can be treated as non-parametric data.
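For the three-restaurant case, a rough sketch of what ANOVA computes is shown below. The guest ratings are invented, and the sketch assumes they are roughly normally distributed. It calculates only the F statistic by hand; turning F into a p-value needs the F distribution, which in practice would come from a statistics library (for example `scipy.stats.f_oneway`).

```python
from statistics import mean

# Hypothetical guest ratings (1-10) for the three shortlisted restaurants.
ratings = {
    "A": [7, 8, 6, 9, 7],
    "B": [5, 6, 5, 7, 6],
    "C": [8, 9, 9, 7, 8],
}


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    group_count = len(groups)
    total_count = len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the individual values around their own group mean.
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)

    ms_between = ss_between / (group_count - 1)
    ms_within = ss_within / (total_count - group_count)
    return ms_between / ms_within


f_statistic = one_way_anova_f(list(ratings.values()))
print("F statistic:", round(f_statistic, 2))
```

A large F means the restaurants differ from each other by more than the guests disagree within each restaurant, which is the kind of evidence the admin is asking for.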

Clustering

Next comes clustering, which in simple words means grouping. You may be wondering why we are discussing it here and how we can do it. We are discussing it because clustering helps us understand behaviour or patterns better, and so aids in forming segments.

Here is a common example: in our childhood days we used to look at the stars and form various shapes out of them, didn’t we? We formed each shape by connecting nearby stars, because that made sense. In data science, imagine the data points as stars; we form clusters by measuring how close the points are, using a distance formula such as Euclidean distance, Manhattan distance, Minkowski distance and many more. There is one catch: how many clusters should you form? That can be checked visually with an elbow curve, which plots the within-cluster sum of squares (WSS) against the number of clusters, or with the silhouette score (also called the silhouette coefficient).
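The distance formulas named above are short enough to write out. In this sketch the two “stars” are made-up coordinates; Manhattan and Euclidean distance are simply the Minkowski distance with p = 1 and p = 2.

```python
def minkowski_distance(point_a, point_b, p):
    """Minkowski distance of order p between two points of equal length."""
    return sum(abs(a - b) ** p for a, b in zip(point_a, point_b)) ** (1 / p)


def manhattan_distance(point_a, point_b):
    """Sum of absolute differences (Minkowski with p = 1)."""
    return minkowski_distance(point_a, point_b, 1)


def euclidean_distance(point_a, point_b):
    """Straight-line distance (Minkowski with p = 2)."""
    return minkowski_distance(point_a, point_b, 2)


# Two hypothetical stars on a flat sky map.
star_a = (0.0, 0.0)
star_b = (3.0, 4.0)

print("Manhattan:", manhattan_distance(star_a, star_b))
print("Euclidean:", euclidean_distance(star_a, star_b))
```

A clustering algorithm repeats this kind of measurement for many pairs of points and groups together the ones that are close.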

Regression or Classification

To explain classification, imagine a small boy who is hungry and goes to his mother for food. He is a fussy one and only likes tasty, yummy food. His mother goes to the kitchen knowing that if ingredients x1, x2, x3, x4 and x5 are present, her son will like the food and eat it. So she prepares the dish Y with those ingredients. In this example x1, x2, x3, x4 and x5 are independent variables and Y depends on them. In future, therefore, we can predict the taste of the food from these independent variables, and, folks, we can also find out which ingredients matter most. Here we are predicting one of two classes, whether the boy will like the food or not, so this is a classification problem. It can be solved with algorithms such as logistic regression, decision trees, random forests, LDA and many more, and the model is evaluated through a classification report, the ROC curve and AUC.
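To give a feel for what a classification report contains, here is a small sketch that scores a set of predictions by hand. The actual and predicted labels are invented (1 means the boy liked the dish, 0 means he did not), and no model is trained here; the sketch covers only the evaluation step.

```python
# Hypothetical outcomes for eight dishes: 1 = liked, 0 = not liked.
actual = [1, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(actual, predicted))

# The four cells of the confusion matrix.
true_positives = sum(1 for a, p in pairs if a == 1 and p == 1)
true_negatives = sum(1 for a, p in pairs if a == 0 and p == 0)
false_positives = sum(1 for a, p in pairs if a == 0 and p == 1)
false_negatives = sum(1 for a, p in pairs if a == 1 and p == 0)

# Accuracy: share of all predictions that were right.
accuracy = (true_positives + true_negatives) / len(pairs)
# Precision: of the dishes predicted "liked", how many really were.
precision = true_positives / (true_positives + false_positives)
# Recall: of the dishes he really liked, how many we predicted.
recall = true_positives / (true_positives + false_negatives)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```

Precision and recall are two of the numbers a typical classification report lists for each class, alongside accuracy.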

Linear regression is one algorithm for predicting a continuous variable. Remember that same boy? On his way to school he went shopping every day to buy candy, always of the same brand and in large numbers: 100 one day, then 66, 71, 78, 86, 99, 45 and so on. To predict how many candies he will buy the next day we can perform linear regression, in which Y_how_many is the dependent variable and his past purchases are the independent variable. The metrics used to measure the accuracy of the model are RMSE (root mean square error), R-squared and adjusted R-squared.
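The candy example can be sketched as follows, using the purchase counts from the story. One assumption is mine: “past purchase” is taken to mean the previous day’s count, so each day is predicted from the day before. With only seven days of data this is a toy illustration, not a reliable forecast.

```python
from math import sqrt
from statistics import mean


def fit_line(x_values, y_values):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x_values), mean(y_values)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_values, y_values))
    variance = sum((x - x_bar) ** 2 for x in x_values)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Daily candy counts from the story.
candies = [100, 66, 71, 78, 86, 99, 45]

# Assumption: the independent variable is the previous day's purchase.
previous_day = candies[:-1]
next_day = candies[1:]

slope, intercept = fit_line(previous_day, next_day)
predictions = [intercept + slope * x for x in previous_day]

# RMSE: typical size of the prediction error, in candies.
squared_errors = [(y - p) ** 2 for y, p in zip(next_day, predictions)]
rmse = sqrt(mean(squared_errors))

# R-squared: share of the variation in Y explained by the line.
y_bar = mean(next_day)
total_variation = sum((y - y_bar) ** 2 for y in next_day)
r_squared = 1 - sum(squared_errors) / total_variation

# Forecast for the day after the last recorded purchase.
tomorrow = intercept + slope * candies[-1]

print("RMSE:", round(rmse, 2))
print("R-squared:", round(r_squared, 3))
print("Predicted candies tomorrow:", round(tomorrow, 1))
```

A low RMSE and an R-squared close to 1 would mean the line describes his buying habit well; values far from that suggest yesterday’s purchase alone is not enough to predict today’s.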

EndNote

The purpose of this article is to explain data science in a simple way, because we should always prefer simplicity. That idea is beautifully expressed by Occam’s razor, which states:

“The principle gives precedence to simplicity: of two competing theories, the simpler explanation of an entity is to be preferred. The principle is also expressed as ‘Entities are not to be multiplied beyond necessity.’”

Written by Swetank Pathak