Data science fundamentals often asked in interviews

In this article, I share answers to the theory and concept questions that I have been asked, or have asked others, in data science interviews.

shubham badaya
4 min read · Jan 7, 2024

1. What is gradient descent?

Gradient descent is an optimization algorithm.

Why? It is used in machine learning and deep learning to minimize a model’s error or loss function, i.e., to find the weights and biases that best fit the data.

How? By iteratively stepping in the direction of steepest decrease of the function (the negative gradient).

Gradient descent is older — much, much older — than machine learning.
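As a sketch, the update rule w ← w − η·∇L(w) can be illustrated on a simple one-dimensional quadratic loss (the loss, learning rate, and starting point below are all made up for illustration):

```python
# Minimal gradient descent sketch on a 1-D quadratic loss.
# Illustrative only: loss, gradient, and hyperparameters are made up.

def loss(w):
    return (w - 3.0) ** 2          # minimized at w = 3

def grad(w):
    return 2.0 * (w - 3.0)         # derivative of the loss

def gradient_descent(w0, lr=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)          # step against the gradient
    return w

w_star = gradient_descent(w0=0.0)
print(round(w_star, 4))            # converges close to 3.0
```

The same loop generalizes to vectors of weights and biases: replace the scalar derivative with the gradient computed by backpropagation.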

2. What is the difference between a t-test and a z-test?

T-tests and z-tests are statistical methods used for hypothesis testing, specifically for comparing the means of two groups.

The main difference between the two lies in the context in which they are applicable and the assumptions they make.

Type of Data:

  • Z-test: Typically used when the sample size is large (usually n > 30) or when the population standard deviation is known.
  • T-test: Used when the sample size is small (typically n < 30) or when the population standard deviation is unknown.

Population Standard Deviation:

  • Z-test: Assumes that the population standard deviation is known.
  • T-test: Does not assume knowledge of the population standard deviation and instead estimates it from the sample data.

Sample Size:

  • Z-test: Suitable for large sample sizes, where the central limit theorem ensures that the sampling distribution of the sample mean is approximately normal.
  • T-test: More appropriate for small sample sizes, where the normality assumption of the sampling distribution is not guaranteed, but the t-distribution provides a better approximation.

Formula:

  • Z-test: Uses the standard normal distribution (z-distribution) for hypothesis testing.
  • T-test: Uses the t-distribution for hypothesis testing, which has heavier tails compared to the normal distribution, reflecting the increased uncertainty associated with estimating the population standard deviation from a small sample.
[Figure: comparison of the t-distribution and the z-distribution. Image source: https://medium.com/@dhaval.sony.504/know-t-distribution-z-test-765fb34a2ef3]

Example:

  • Z-test: comparing the average height of a sample of 200 adults to a known population mean, where the population standard deviation is also known.
  • T-test: comparing the average test scores of a sample of 25 students to a hypothesized population mean, where the population standard deviation must be estimated from the sample.

In summary, choose between a z-test and a t-test based on the size of your sample and whether you know the population standard deviation.

If the sample size is large and the population standard deviation is known, a z-test may be appropriate.

If the sample size is small or the population standard deviation is unknown, a t-test is often more suitable.
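The two statistics can be computed side by side with the standard library alone; the sample heights, hypothesized mean, and population standard deviation below are invented for illustration (a t-distribution p-value would additionally need something like `scipy.stats`, which is not in the standard library):

```python
import math
import statistics

# Hedged sketch with made-up data: one-sample z-test vs t-test.
sample = [172, 168, 181, 175, 169, 174, 178, 171, 176, 173]  # heights (cm)
mu0 = 170.0    # hypothesized population mean
sigma = 6.0    # population standard deviation (known -> z-test applies)

n = len(sample)
xbar = statistics.mean(sample)

# z-test: population standard deviation is known
z = (xbar - mu0) / (sigma / math.sqrt(n))
p_z = 2 * (1 - statistics.NormalDist().cdf(abs(z)))  # two-sided p-value

# t-test: sigma unknown, estimated from the sample (n - 1 degrees of freedom)
s = statistics.stdev(sample)
t = (xbar - mu0) / (s / math.sqrt(n))
# The p-value for t requires the t-distribution CDF (e.g. scipy.stats.t),
# whose heavier tails reflect the extra uncertainty from estimating sigma.

print(f"z = {z:.3f}, p = {p_z:.4f}, t = {t:.3f}")
```

Note how the two formulas differ only in which standard deviation appears in the denominator; that single substitution is what changes the reference distribution from normal to t.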

3. What is the p-value?

In the general hypothesis testing procedure, we will have some hypotheses about the population parameter and we investigate it using a sample extracted from the population.

A p-value is the probability of observing a sample at least as extreme as the one obtained, given that the null hypothesis is true. If this probability is very small, we doubt the null hypothesis and reject it; otherwise, we fail to reject it, saying we do not have enough evidence against it.

The p-value reflects the strength of evidence against the null hypothesis.

The p-value helps the statistician draw conclusions about the null hypothesis and always lies between 0 and 1.

  • P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected (at the conventional 5% significance level).
  • P-value < 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.
  • P-value = 0.05 is the marginal value, indicating the result could go either way.
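The definition "probability of a sample at least as extreme, assuming the null hypothesis" can be made concrete with a small Monte Carlo sketch; the coin-flip scenario and all numbers are invented for illustration:

```python
import random

# Illustrative Monte Carlo estimate of a two-sided p-value.
# Observed: 60 heads in 100 flips. H0: the coin is fair (p = 0.5).
random.seed(0)

n_flips, observed_heads, trials = 100, 60, 20_000
extreme = 0
for _ in range(trials):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    # Count simulated samples at least as extreme as the observed one
    if abs(heads - n_flips / 2) >= abs(observed_heads - n_flips / 2):
        extreme += 1

p_value = extreme / trials
print(f"estimated p-value ≈ {p_value:.3f}")
```

The estimate lands near the exact binomial two-sided p-value of about 0.057: just above 0.05, i.e., a genuinely marginal case where the conclusion depends on the chosen significance level.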

4. In which scenarios can random forest models give better results than gradient boosting models?

  1. Random forests can handle extremely high-dimensional datasets well, which makes them an excellent choice when dealing with a large number of variables: each tree sees a random subset of features, and averaging the trees reduces variance.
  2. Gradient boosting can struggle with such high-dimensional (especially noisy) data, because its sequential residual fitting makes it more prone to overfitting.

5. What are bias and variance in modeling?

Bias: relates to the model’s simplicity; a high-bias model is too simple to capture the underlying patterns (underfitting).

Variance: relates to the model’s flexibility; a high-variance model is so sensitive to fluctuations in the training data that it fits the noise (overfitting).

The goal of machine learning is to find a model that achieves a balance between bias and variance to generalize well to new, unseen data.
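The two failure modes can be contrasted with a deliberately extreme pair of models; the synthetic data, noise level, and model choices below are all illustrative:

```python
import random

# Hedged sketch: a high-bias vs a high-variance model on synthetic
# data y = x^2 + noise (all numbers are made up for illustration).
random.seed(1)

def make_data(n):
    xs = [random.uniform(-2, 2) for _ in range(n)]
    ys = [x * x + random.gauss(0, 0.3) for x in xs]
    return xs, ys

def mse(pred, xs, ys):
    return sum((pred(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

x_tr, y_tr = make_data(50)
x_te, y_te = make_data(50)

# High bias: always predict the training mean (too simple for y = x^2)
mean_y = sum(y_tr) / len(y_tr)
def high_bias(x):
    return mean_y

# High variance: 1-nearest-neighbour "memorizer" of the training set
def high_variance(x):
    i = min(range(len(x_tr)), key=lambda j: abs(x_tr[j] - x))
    return y_tr[i]

print("high bias     train/test MSE:", mse(high_bias, x_tr, y_tr), mse(high_bias, x_te, y_te))
print("high variance train/test MSE:", mse(high_variance, x_tr, y_tr), mse(high_variance, x_te, y_te))
```

The mean predictor misses the pattern on both sets (high bias), while the memorizer scores a perfect zero on training data yet degrades on the test set (high variance); a well-balanced model sits between these extremes.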

6. Why is gradient boosting prone to overfitting?

There are multiple factors influencing overfitting in boosting algorithms. However, they can be characterized as shown below:

  1. Complexity of weak learners: if the individual learners are too complex or have too much depth, there is a higher risk of overfitting.
  2. Number of iterations (boosting rounds): increasing the number of boosting rounds keeps driving training error down and can eventually fit the noise, leading to overfitting.
  3. Data characteristics: if the dataset is noisy or has outliers, boosting algorithms may be more susceptible to fitting the noise.
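Factor 2 above can be demonstrated with a hand-rolled boosting loop; the depth-1 "stump" learners, learning rate, and noisy linear data are all simplifications chosen for illustration, not a production GBM:

```python
import random

# Hedged sketch: gradient boosting with decision stumps on noisy 1-D
# data. Training error falls monotonically with rounds, so with enough
# rounds the later stumps are fitting the noise, not the signal.
random.seed(42)

xs = [i / 10 for i in range(40)]
ys = [x + random.gauss(0, 0.5) for x in xs]   # linear signal + noise

def fit_stump(xs, residual):
    """Best single-threshold split minimizing squared error on residuals."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residual) if x <= t]
        right = [r for x, r in zip(xs, residual) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds, lr=0.5):
    pred = [sum(ys) / len(ys)] * len(xs)      # F0: predict the mean
    for _ in range(rounds):
        residual = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residual)       # fit the current residuals
        pred = [p + lr * stump(x) for x, p in zip(xs, pred)]
    return sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(ys)

print("train MSE after  1 round :", boost(xs, ys, 1))
print("train MSE after 50 rounds:", boost(xs, ys, 50))
```

Because each round fits whatever residual is left, including the Gaussian noise, training MSE keeps shrinking with more rounds even though the true signal was learned early; held-out error would eventually rise, which is why GBMs rely on early stopping, shrinkage, and limited tree depth.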
