Top Data Science Interview Questions & Answers
Answers to AI and Data Science interview questions asked repeatedly.
If you have been following Acing AI, I have been regularly posting interview questions for Data Science and AI interviews from some of the top technology companies. I have been asked on countless occasions to post answers to those questions. In this article I have chosen the top 5 algorithm/theory-based questions that come up most frequently and answered them.
1. What is p-value?
When you perform a hypothesis test in statistics, a p-value helps you judge the strength of your evidence. A p-value is a number between 0 and 1, and its size tells you how compatible your data are with the claim on trial, which is called the null hypothesis.
- A low p-value (≤ 0.05) indicates strong evidence against the null hypothesis, which means we can reject the null hypothesis.
- A high p-value (> 0.05) indicates weak evidence against the null hypothesis, which means we fail to reject it (note that "failing to reject" is not the same as proving the null hypothesis true).
- A p-value right around 0.05 is marginal and could go either way; the 0.05 cutoff is a convention, not a law.
To put it in another way,
- High p-values: your data are likely under a true null hypothesis.
- Low p-values: your data are unlikely under a true null hypothesis.
To learn more about p-values: Youtube video
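The decision rule above can be sketched with a one-sample t-test. This is a minimal illustration on synthetic data; the sample, its true mean of 0.5, and the use of scipy are my own assumptions for demonstration, not part of the original question:

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 100 measurements whose true mean is 0.5.
# The null hypothesis on trial claims the mean is 0.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.5, scale=1.0, size=100)

# One-sample t-test of "mean == 0" against this sample
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(f"p-value: {p_value:.4g}")

if p_value <= 0.05:
    print("Low p-value: reject the null hypothesis")
else:
    print("High p-value: fail to reject the null hypothesis")
```

Because the sample really was drawn with a non-zero mean, the test returns a very small p-value and we reject the null.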
2. How is the k-nearest neighbours (KNN) algorithm different from k-means clustering?
KNN is a supervised learning algorithm used for classification: the training data is labeled, and the goal is to classify an unlabeled point by looking at the labels of its k nearest neighbours. K-means is an unsupervised learning algorithm used for clustering: it requires only a set of unlabeled points and the number of clusters k, then gradually groups the points by repeatedly assigning each point to its nearest cluster centre and recomputing each centre as the mean of its assigned points. The primary difference is that KNN needs labeled training data (supervised learning), while k-means works entirely on unlabeled data (unsupervised learning).
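To make the contrast concrete, here is a minimal sketch using scikit-learn on two synthetic blobs of points. The data, the query point, and parameters like `n_neighbors=3` are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Toy 2-D data: one blob around (0, 0), another around (5, 5)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)   # labels exist -> supervised setting

# KNN: uses the labels to classify a new, unlabeled point
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[4.5, 5.2]]))    # classified via its labeled neighbours

# k-means: ignores the labels entirely and just partitions into k groups
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])               # cluster assignments it discovered itself
```

Note that KNN needed `y` to do anything, while k-means was only given `X` and the number of clusters.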
3. What is overfitting? How do we avoid it?
“Overfitting” is traditionally defined as training some flexible representation until it memorizes the training data, noise included, and therefore fails to predict well on future, unseen data. To avoid overfitting:
- Use fewer parameters: Use a simpler model with fewer parameters. A simpler model is less able to capture the noise, reducing overfitting.
- Better performance measures: Use performance measures suited to the task. The problem should dictate which measures best reveal whether the model actually generalizes.
- Cross-validation: Cross-validation techniques help avoid overfitting. Repeated random sub-sampling validation and k-fold cross-validation are good techniques to use depending on the dataset.
To read more: Clever methods to avoid overfitting
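The cross-validation point above can be sketched in a few lines with scikit-learn. The choice of the iris dataset, a decision tree, and 5 folds are my assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unconstrained decision tree can memorize its training data.
# 5-fold CV always scores on data held out from training, so a model
# that merely memorized would show poor held-out fold accuracies.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("Fold accuracies:", scores.round(2))
print(f"Mean CV accuracy: {scores.mean():.2f}")
```

The spread of the per-fold scores is also informative: large variance across folds can itself be a warning sign.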
4. How do you decide between model accuracy and model performance?
This question is directly related to the accuracy paradox and tests your knowledge of situations where higher accuracy means poorer predictions. The accuracy paradox for predictive analytics states that a predictive model with lower accuracy may have greater predictive power than a model with higher accuracy. It can therefore be better to set aside the accuracy metric in favour of other metrics such as precision and recall. Let's take the situation of predicting invalid password attempts. In a hypothetical company, invalid password attempts are extremely rare. A model that simply predicts "valid" every time would be highly accurate, yet it would never flag a single invalid attempt, which is not helpful at all. Hence, sometimes, a model with a lower level of accuracy provides higher predictive power.
5. What’s the difference between Type I and Type II error?
Type I error is equivalent to a false positive; Type II error is equivalent to a false negative. A Type I error is rejecting a null hypothesis that is actually true, while a Type II error is failing to reject a null hypothesis that is actually false. Let's take the example of biometrics. When someone scans their finger, a Type I error is the system rejecting the scan even though the match is authorized. A Type II error is the system accepting the scan even though the match is wrong/unauthorized.
To read more: Type I and Type II key differences
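The biometric example can be simulated with a simple score threshold. The score distributions and the 0.6 cutoff below are invented for illustration, not measurements from any real system:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical match scores: genuine users tend to score higher
genuine = rng.normal(0.8, 0.1, 1000)    # authorized fingerprints
impostor = rng.normal(0.4, 0.1, 1000)   # unauthorized fingerprints

threshold = 0.6  # accept a scan only if its score exceeds this

# Type I error: rejecting an authorized user (false rejection)
type_1_rate = (genuine <= threshold).mean()
# Type II error: accepting an unauthorized user (false acceptance)
type_2_rate = (impostor > threshold).mean()
print(f"Type I  (false rejection) rate:  {type_1_rate:.3f}")
print(f"Type II (false acceptance) rate: {type_2_rate:.3f}")
```

Moving the threshold trades one error for the other: raise it and Type II errors fall while Type I errors rise, which is the classic tension behind this question.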
At Acing AI, the aim is to help you to get into Data Science and AI. I have profiled some of the best technology companies and written articles about AI interviews at Microsoft, Google, Amazon, Netflix, LinkedIn, Ebay, Twitter, Walmart, Apple, Facebook, Zillow, Salesforce, Uber, Intel, Adobe, Tesla and most recently IBM. This has led to being the top writer in Artificial Intelligence on Medium. The AI interview preparation guides Part 1, Part 2 go over the details which help you ace any AI interview. Acing AI Portfolios helps you to showcase your AI work. Expert interviews and analyses give you a sneak peek into the lives of AI/Data Science leaders and analyses of AI tech companies.
Subscribe to our newsletter here. We are building a new course to help people ace data science interviews. Sign up below to join the wait-list!
Acing Data Science Interviews (Coming Soon)
3-Month course for Data Science Interviews
Thanks for reading! 😊 If you enjoyed it, see how many times you can hit 👏 in 5 seconds. It’s great cardio for your fingers AND will help other people see the story.
The sole motivation of this blog article is to provide answers to some Data Science Interview Questions. I aim to make this a living document, so any updates and suggested changes can always be included. Please provide relevant feedback.