3 Weeks Beginners Guide to Ace Data Science Interview: #Day 5

Statistical aspects used in data science

Vinay Vikram
Accredian
10 min read · Feb 7, 2020

About the Series

Data Science is an exciting career choice, with a lot of hiring across fresher, lateral and experienced positions. It’s one thing to know the concepts and quite another to crack the rigorous interviews for data science positions. A candidate who is aware of the different questions and the interview process is on the right path to an excellent career in the evolving Data Science field.

This 3-week beginner’s guide to acing the Data Science interview will be a useful asset for anyone preparing for Data Science interviews. Every day for the next 21 days, we will talk about a different area of the Data Science field and cover it in detail. So sit back and read on to get a finer understanding of the field and go into your interviews prepared.

You and your friends decide to play a football match, but when you go out you see there is a possibility of rain, so you decide to stay at home instead. Why did you do this? What does making a decision really mean?

The art of decision making is just this — choosing a plan of action when faced with uncertainty.

There are two ways to make a decision:

  • The intuitive way, wherein one takes a decision out of a “gut feeling”.
  • The data-driven way, which employs data or information.

The intuitive way is a purely personal and artistic way of making a decision. The data-driven way is a logical and scientific way of arriving at the right approach using the available historical data.

This quantitative approach which enables decision making is called “Statistics”.

Statistics and Data Scientist

A Data Scientist is only as good as the questions they ask. A Data Scientist asks probing questions like:

  • Does a small rise or dip in temperature affect ice cream sales in the market?
  • How does a person’s salary impact their online shopping behavior?
  • How is Google able to “guess” my search question?

Statistics is the art of connecting numbers to these questions so that the “answers” evolve! Establishing quantitative connections to largely qualitative questions is the heart of statistics.

This reminds me of a famous saying about Data Scientists:

“A Data Scientist is one who knows more statistics than a programmer and more programming than a statistician”

So basically, Data Science is the sweet spot that sits right in the middle of computer programming, statistics and the domain on which the analysis is performed. To work across all three, the Data Scientist needs a fair amount of domain knowledge.

A Deep Dive into the world of Statistics:

A brief background:

The word “Statistics” is derived from the Latin word “status”, which refers to information related to a state or a province. It is said to be an ancient technique used by kings to learn the details of their state or province.

So, in traditional and very simple terms, statistics is concerned with developing and studying methods for collecting, analyzing and presenting empirical data.

The field of statistics is composed of two broad categories:

  • Descriptive Statistics
  • Inferential statistics

Both fulfill different statistical aspects and give us different insights about the data.

Before we dive deep into descriptive and inferential statistics, let us get the basic idea of population and sample. Without these two terms, any explanation of statistics is baseless.

Population:

In statistics, a population is the total set of observations that can be made. For example, if we are studying the weight of adult women, the population is the set of weights of all the adult women in the world.

A population does not have to consist of people. It could be a batch of batteries, measurements of rainfall in an area or a group of people.

Sample:

A sample is a part of the population that is selected randomly for the study. The sample should be selected such that it represents all the characteristics of the population. The process of selecting the subset from the population is called sampling, and the subset selected is called the sample.

Descriptive Statistics:

Descriptive statistics describes the important characteristics or properties of the data using measures of central tendency, such as the mean, median and mode, and measures of dispersion, such as the range, standard deviation and variance.

Three musketeers of descriptive statistics

Fundamentally, all three (mean, median, mode) refer to one single aspect called the “Central Tendency”. The idea of central tendency is that there may be one single value that best describes the data.
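To make these measures concrete, here is a minimal sketch using Python’s standard `statistics` module. The `sales` numbers are made up for illustration, and the variance and standard deviation shown are the sample versions (dividing by n - 1).

```python
import statistics

# Hypothetical sample: daily ice cream sales (units) over nine days.
sales = [12, 15, 15, 18, 20, 22, 15, 30, 24]

# Measures of central tendency
mean = statistics.mean(sales)      # arithmetic average
median = statistics.median(sales)  # middle value of the sorted data
mode = statistics.mode(sales)      # most frequent value

# Measures of dispersion
data_range = max(sales) - min(sales)   # gap between the extremes
variance = statistics.variance(sales)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(sales)      # square root of the sample variance

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, variance={variance:.2f}, std_dev={std_dev:.2f}")
```

For population-level figures, `statistics.pvariance` and `statistics.pstdev` divide by n instead.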

Data can be summarized and represented in an accurate way using charts, tables, and graphs.

“A statistical chart speaks a thousand words more than the data itself which has only 1000 words!”

Inferential Statistics

This is actually statistical inference, wherein we make an inference about a large data set based on “testing” a small sample of the data.

It is about using data from the sample and then making statistical inferences about the larger original population from which the sample is drawn. The goal of inferential statistics is to draw conclusions from a sample and generalize them to the population.

The most common methodologies used are:

  • Hypothesis tests
  • Analysis of variance
  • Analysis of Means

So again, I am reiterating the same point on Domain + Coding + Statistics = Data Scientist.

Descriptive Statistics Based Questions:

Question 1: Name a few methods/techniques used in statistics for analyzing data.

Answer:

  • Median
  • Mode
  • Mean
  • Regression
  • Standard deviation

Question 2: What are the two main branches of statistics?

Answer:

  • Descriptive Statistics
  • Inferential Statistics

Question 3: What is a sample in statistics, and what are the sampling methods?

Answer:

In a statistical study, a sample is a portion of data collected from a statistical population by a structured and defined procedure. The elements within the sample are known as sample points.

Below are the 4 sampling methods:

  • Cluster: The population is divided into groups or clusters, and whole clusters are then selected at random.
  • Simple random: Members are selected purely at random, so every member has an equal chance of being picked.
  • Stratified: The population is divided into groups or strata, and a random sample is drawn from each stratum.
  • Systematic: Every kth member of the population is picked.
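The four methods above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the `population` of 100 people, the `city` grouping and all function names are hypothetical, and the same grouping is reused as both the strata and the clusters to keep the sketch short.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 people, each tagged with a city (the group).
cities = ["Delhi", "Mumbai", "Pune", "Chennai"]
population = [{"id": i, "city": cities[i % 4]} for i in range(100)]

def group_by_city(pop):
    """Split the population into groups keyed by city."""
    groups = {}
    for person in pop:
        groups.setdefault(person["city"], []).append(person)
    return groups

def simple_random(pop, n):
    """Every member has the same chance of being picked."""
    return random.sample(pop, n)

def systematic(pop, k):
    """Pick every k-th member after a random starting point."""
    start = random.randrange(k)
    return pop[start::k]

def stratified(pop, per_stratum):
    """Draw a random sample from every stratum."""
    sample = []
    for members in group_by_city(pop).values():
        sample.extend(random.sample(members, per_stratum))
    return sample

def cluster(pop, n_clusters):
    """Randomly pick whole clusters and keep all of their members."""
    groups = group_by_city(pop)
    chosen = random.sample(sorted(groups), n_clusters)
    return [person for city in chosen for person in groups[city]]

print(len(simple_random(population, 10)))
print(len(systematic(population, 10)))
print(len(stratified(population, 3)))
print(len(cluster(population, 2)))
```

Notice the difference between the last two: stratified sampling takes a few members from every group, while cluster sampling takes every member from a few groups.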

Question 4: What are correlation and covariance in statistics?

Answer:

Covariance and correlation are two mathematical concepts that are widely used in statistics. Both describe the relationship and measure the dependency between two random variables. Though they do similar work, they are mathematically different from each other.

Correlation: Correlation is a widely used technique for measuring and estimating the quantitative relationship between two variables. It measures how strongly two variables are related.

  • Measures the strength and direction of a linear relationship between two variables.
  • Ranges from -1 to +1.

Covariance: Covariance is a measure that indicates the extent to which two random variables change together. It describes the systematic relationship between a pair of random variables, wherein a change in one variable is reciprocated by a corresponding change in the other.

  • The measure of how changes in one variable are associated with changes in a second variable.
  • Ranges from -infinity to +infinity.
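As a rough sketch of how the two are computed, the snippet below implements the sample covariance and the Pearson correlation by hand. The `temperature` and `sales` values are invented for illustration, and the function names are my own.

```python
import math

def covariance(x, y):
    """Sample covariance: summed product of deviations, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical data: temperature in Celsius and ice cream sales in units.
temperature = [20, 22, 25, 27, 30]
sales = [110, 125, 150, 160, 190]

cov = covariance(temperature, sales)   # expressed in "degrees x units", unbounded
corr = correlation(temperature, sales)  # unit-free, always between -1 and +1

print(f"covariance={cov:.2f}, correlation={corr:.4f}")
```

The covariance carries the units of both variables, which is why it can take any value, while the correlation is the same quantity normalised into the -1 to +1 range.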

Question 5: Which of the following measures of central tendency will always change if a single value in the data changes?

A) Mean

B) Median

C) Mode

D) All of these

Answer: A

The mean of the dataset would always change if we change any value of the data set. Since we are summing up all the values together to get it, every value of the data set contributes to its value. Median and mode may or may not change with altering a single value in the dataset.
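A quick check of this answer, using a small made-up list: changing one value moves the mean, while the median and mode happen to stay where they were.

```python
import statistics

# Hypothetical data set.
data = [2, 3, 3, 5, 7, 9, 3]

# Change a single value: the 9 becomes 10.
changed = data.copy()
changed[-2] = 10

for label, values in (("original", data), ("changed", changed)):
    print(
        label,
        "mean =", round(statistics.mean(values), 3),
        "median =", statistics.median(values),
        "mode =", statistics.mode(values),
    )
```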

Question 6: If a positively skewed distribution has a median of 50, which of the following statements is true?

A) Mean is greater than 50

B) Mean is less than 50

C) Mode is less than 50

D) Mode is greater than 50

E) Both A and C

F) Both B and D

Answer: E

For a positively skewed (right-skewed) unimodal distribution, the usual ordering is Mode < Median < Mean, because the long right tail pulls the mean upward. So if the median is 50, the mean would be more than 50 and the mode would be less than 50.
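The ordering is easy to see on a small right-skewed data set. The `incomes` values below are hypothetical: most are small and a few large ones stretch the right tail.

```python
import statistics

# Hypothetical right-skewed data: most values are small, a few are large.
incomes = [20, 25, 25, 25, 30, 30, 35, 40, 60, 120]

mode = statistics.mode(incomes)
median = statistics.median(incomes)
mean = statistics.mean(incomes)

# For this data set the mode is smallest and the mean is largest.
print(f"mode={mode}, median={median}, mean={mean}")
```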

Question 6: What is the meaning of normal distribution?

Solution:

Data can be distributed in many ways, with a skew to the left or to the right. Often, however, data is concentrated around a middle value without any particular lean to the left or the right. When such a distribution forms a bell-shaped curve, we call it a normal distribution.

The normal distribution has the following properties:

  • Unimodal: it has a single mode.
  • Both the left and right halves are symmetrical and are mirror images of each other.
  • It is bell-shaped with a maximum height at the center.
  • The mean, median and mode are all located at the center.
  • Asymptotic: the tails approach the horizontal axis but never touch it.
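These properties can be checked numerically with `statistics.NormalDist` (available from Python 3.8). The mean of 50 and standard deviation of 10 are arbitrary values chosen for this sketch.

```python
from statistics import NormalDist

dist = NormalDist(mu=50, sigma=10)  # hypothetical mean and standard deviation

# Symmetry: points equally far from the mean have the same density.
left, right = dist.pdf(40), dist.pdf(60)

# Maximum height at the center: the density peaks at the mean.
peak = dist.pdf(50)

# Mean and median coincide: half of the probability lies below the mean.
below_mean = dist.cdf(50)

# Asymptotic tails: six standard deviations out, the density is tiny but not zero.
tail = dist.pdf(50 + 6 * 10)

print(left == right, peak > left, below_mean, tail > 0)
```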

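Question 7: Explain the Central Limit Theorem.

Answer:

The Central Limit Theorem (CLT) states that, given a sufficiently large sample size from a population with finite variance, the distribution of the sample means will be approximately normal, whatever the shape of the population distribution. The mean of those sample means will be approximately equal to the mean of the population, and their standard deviation will be approximately the population standard deviation divided by the square root of the sample size.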
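A small simulation can make the theorem tangible. In this sketch the population is deliberately skewed (exponential), and the population size, sample size and number of samples are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical skewed population: exponential values with a mean of about 2.0.
population = [random.expovariate(1 / 2.0) for _ in range(100_000)]
pop_mean = statistics.fmean(population)
pop_std = statistics.pstdev(population)

def sample_means(sample_size, n_samples=2_000):
    """Draw many random samples and return the mean of each one."""
    return [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(n_samples)
    ]

means = sample_means(sample_size=50)

# The sample means cluster around the population mean ...
print(f"population mean      = {pop_mean:.3f}")
print(f"mean of sample means = {statistics.fmean(means):.3f}")

# ... and their spread is close to pop_std / sqrt(sample_size).
print(f"std of sample means  = {statistics.stdev(means):.3f}")
print(f"pop_std / sqrt(50)   = {pop_std / math.sqrt(50):.3f}")
```

Plotting a histogram of `means` would show the familiar bell shape even though the population itself is far from normal.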

Question 8: The correlation between two variables (Var1 and Var2) is 0.65. Now, after multiplying all the values of Var1 by 2, the correlation coefficient will _______?

A) Increase

B) Decrease

C) None of the above

Answer: C

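Multiplying either variable by a positive constant, or adding or subtracting a constant, leaves the correlation coefficient unchanged. This is easy to see from the formula for the correlation:

$$r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \sum_i (Y_i - \bar{Y})^2}}$$

If we multiply all the values of X by 2, $X_i$ and $\bar{X}$ both double, so every difference $(X_i - \bar{X})$ doubles. The numerator and the denominator are then both multiplied by 2, and the factor cancels. Similarly, if we add a constant to all the values of X, $X_i$ and $\bar{X}$ change by the same amount and the differences remain the same. Hence, there is no change in the correlation coefficient.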
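The same result can be verified numerically. This is a minimal sketch with made-up values for `var1` and `var2`; the `pearson` helper is a hand-written version of the standard formula, not a library call.

```python
import math

def pearson(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data.
var1 = [1, 2, 3, 4, 5, 6]
var2 = [2, 1, 4, 3, 6, 8]

r_original = pearson(var1, var2)
r_scaled = pearson([2 * v for v in var1], var2)    # multiply Var1 by 2
r_shifted = pearson([v + 10 for v in var1], var2)  # add a constant to Var1
r_negated = pearson([-2 * v for v in var1], var2)  # negative multiplier

print(r_original, r_scaled, r_shifted, r_negated)
```

Only the negative multiplier changes anything: it flips the sign of the coefficient but keeps its magnitude.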

Question 9: Pearson captures how linearly dependent two variables are whereas Spearman captures the monotonic behaviour of the relation between the variables.

A) TRUE

B) FALSE

Answer: (A)

The statement is true. Pearson correlation evaluates the linear relationship between two continuous variables. A relationship is linear when a change in one variable is associated with a proportional change in the other variable.

Spearman correlation evaluates a monotonic relationship. A monotonic relationship is one where the variables tend to move in the same (or opposite) direction, but not necessarily at a constant rate.

The Spearman rank-order correlation coefficient (Spearman’s correlation, for short) is a nonparametric measure of the strength and direction of association that exists between two variables measured on at least an ordinal scale.
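To see the difference, consider a relationship that is perfectly monotonic but not linear, such as y = x³. The sketch below computes Spearman’s coefficient as the Pearson correlation of the ranks; the helper functions are my own, and the simple `ranks` function assumes there are no tied values.

```python
import math

def pearson(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest) upward. Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# Monotonic but non-linear relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [v ** 3 for v in x]

print(f"Pearson  = {pearson(x, y):.4f}")   # below 1: the relation is not linear
print(f"Spearman = {spearman(x, y):.4f}")  # 1: the relation is perfectly monotonic
```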

Question 10: What is the relationship between significance level and confidence level?

A) Significance level = Confidence level

B) Significance level = 1- Confidence level

C) Significance level = 1/Confidence level

D) Significance level = sqrt (1 — Confidence level)

Answer: (B)

The significance level is 1 minus the confidence level. If the significance level is 0.05, the corresponding confidence level is 95% or 0.95. The significance level is the probability of rejecting the null hypothesis when it is actually true. A result is called significant when its p-value, the probability of obtaining a result as extreme as or more extreme than the one actually observed when the null hypothesis is true, falls below this level. The confidence interval is the range of likely values for a population parameter, such as the population mean. For example, if you compute a 95% confidence interval for the average price of ice cream, then you can be 95% confident that the interval contains the true average price of all ice creams.

The significance level and the confidence level are complementary: together they add up to 1.
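Here is a minimal sketch of how the two levels show up when computing a confidence interval. The `prices` sample is invented, and the interval uses the normal approximation for simplicity; for a sample this small, a t-based interval would usually be the better choice.

```python
import math
from statistics import NormalDist, mean, stdev

confidence_level = 0.95
significance_level = 1 - confidence_level  # alpha = 0.05

# Hypothetical sample of ice cream prices.
prices = [2.5, 3.0, 2.8, 3.2, 2.9, 3.1, 2.7, 3.3, 3.0, 2.6, 2.9, 3.1]

# Two-sided critical value from the standard normal distribution (about 1.96).
z = NormalDist().inv_cdf(1 - significance_level / 2)

# Normal-approximation interval around the sample mean (an assumption, see above).
standard_error = stdev(prices) / math.sqrt(len(prices))
lower = mean(prices) - z * standard_error
upper = mean(prices) + z * standard_error

print(f"alpha={significance_level:.2f}, z={z:.3f}")
print(f"95% confidence interval: ({lower:.3f}, {upper:.3f})")
```

Raising the confidence level to 0.99 lowers the significance level to 0.01, increases `z`, and widens the interval.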

If this blog helped you in any way, then do Follow and Clap👏, because your encouragement catalyzes inspiration and helps to create more cool stuff like this. As always, I welcome feedback and constructive criticism. I can be reached on Twitter @vikramvinay or through LinkedIn; I would love to hear from you.

Check what’s on Day1, Day2, Day3, and Day4.

Other Sources:

  1. https://towardsdatascience.com/a-gentle-intro-to-probability-and-statistics-for-data-science-95d3980e19da
  2. https://blog.floydhub.com/statistics-for-data-science/
  3. https://stattrek.com/statistics/dictionary.aspx?definition=population

Final Thoughts and Closing Comments

There are some vital points many people fail to understand while they pursue their Data Science or AI journey. If you are one of them and are looking for a way to fill these gaps, check out the certification programs provided by INSAID on their website. If you liked this story, I recommend the Global Certificate in Data Science & AI, because it covers the foundations, machine learning algorithms, and deep neural networks (basic to advanced).

Artificial Intelligence Researcher at @MOTHERSON | Check My Data Science Portfolio: https://vikramvinay.github.io/