30 Most Asked Machine Learning Questions Answered

To Check and Increase Your Knowledge of ML 🤔

Abhay Parashar
Mar 18 · 13 min read

Machine learning is a path to a better, more advanced future. Machine Learning Developer is one of the most in-demand jobs in 2021, and demand is expected to grow by 20–30% over the next 3–5 years. At its core, machine learning is built on statistics and programming concepts. Python is the language most commonly used by machine learning developers because of its simplicity. In this blog, you will find some of the most asked machine learning questions that every machine learning enthusiast has to answer one day. Let’s start.

0. What is Machine Learning?

1. Explain the Basic Difference Between Supervised, Unsupervised, and Semi-Supervised Machine Learning?

Unsupervised Learning: A model is trained on unlabeled data. Since there are no labels, the model tries to find patterns and relationships in the data and groups similar data points together accordingly.

Semi-Supervised Learning: A type of machine learning that uses a small amount of labeled data together with a large amount of unlabeled data to train the model. The goal is to classify the unlabeled data with the help of the labeled data.

2. What do you understand by Reinforcement learning?

3. What are the different types of data used in Machine Learning?

4. Feature vs Labels?

5. Explain The Difference Between Regression and Classification?

Classification: Classification is the process of finding a function that divides the data into different classes. It is used when the target variable is discrete. In classification, our aim is to find the decision boundary that separates the dataset into different classes.

6. What is Scikit learn used for?

7. What are the Training Set and Test Set in Machine Learning and Why are They Important?

8. Explain The Stages of Building A Machine Learning Model?

Data Processing: In this stage, the data collected in the first stage is preprocessed by handling null values, categorical data, and so on. In the same stage, the features are also brought into the same range if they are not already.

Model Building: In this stage, we first choose an appropriate algorithm and then build the model with the help of sklearn.

Model Evaluation: After the model is created, it is evaluated using metrics such as the accuracy score, the confusion matrix, and more.

Model Saving and Testing: After a successful evaluation, the model is saved for future use and is then tested on real-time data.

9. Overfitting vs Underfitting?

Underfitting: Model performance is poor on the training data as well as the test data. In this case, the model fails to learn the underlying pattern, so it cannot generalize to new data points.

“Machine intelligence is the last invention that humanity will ever need to make.” ~Nick Bostrom

10. Explain Confusion Matrix with Respect to Model Evaluation?

True Positive: Actual Value = Predicted Value when o/p is 1
True Negative: Actual Value = Predicted Value when o/p is 0
False Positive: Actual Value = 0 but Predicted Value = 1 (Type I Error)
False Negative: Actual Value = 1 but Predicted Value = 0 (Type II Error)
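
As a rough illustration, the four cells can be counted directly from two lists of 0/1 labels. The function name `confusion_counts` and the sample labels below are made up for this sketch and do not come from any library.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1          # predicted positive, really positive
        elif a == 0 and p == 0:
            tn += 1          # predicted negative, really negative
        elif a == 0 and p == 1:
            fp += 1          # Type I error
        else:
            fn += 1          # Type II error
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Hypothetical labels, invented for illustration only
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)
```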


11. What’s the difference between Type I and Type II errors?

Type II Error (False Negative Error): It occurs when the null hypothesis is not rejected even though it is false. In other words, it claims nothing has happened when something has.

Example: Consider a scenario in which the null hypothesis is that a person is innocent. Convicting an innocent person is a Type I error; on the other hand, letting a guilty person go free is a Type II error.

12. Differentiate Precision, Recall, accuracy, and F1 Score?

Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.
Recall = TP / (TP + FN)

F1-Score is the harmonic mean of recall and precision.
F1-Score = 2*(Recall * Precision) / (Recall + Precision)

Accuracy is the ratio of correctly predicted observations (both positive and negative) to the total number of observations.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
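
The formulas above can be checked with a few lines of Python. This is a minimal sketch: `classification_metrics` is a hypothetical helper, the counts are invented, and precision is taken in its standard form TP / (TP + FP).

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * (recall * precision) / (recall + precision)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts, invented for illustration only
metrics = classification_metrics(tp=40, tn=45, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note how the parentheses matter: without them, `TP+TN/TP+TN+FP+FN` would be evaluated in a different order and give a different number.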

13. What Do You Understand by P-Value?

14. Explain How a ROC Curve Works?

15. How is KNN Different from K-means Clustering?

K-means clustering is an unsupervised machine learning algorithm that divides the data into k clusters, where k is the number of clusters, by assigning each point to its nearest centroid.
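
To make the role of k and the centroids concrete, here is a small sketch of the usual assign-then-update loop on one-dimensional data. The function `kmeans_1d`, the starting centroids, and the data are assumptions chosen for illustration; a real project would more likely use a library implementation.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # An empty cluster keeps its previous centroid
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical data with two obvious groups; k = 2 starting centroids
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centroids, final_clusters = kmeans_1d(points, centroids=[1.0, 12.0])
print(final_centroids, final_clusters)
```

No labels are used anywhere in the loop, which is what makes the algorithm unsupervised.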

16. What is ‘Naive’ in Naive Bayes?

Suppose we have a dataset that contains information about fruits, and we want to detect whether a fruit is an apple or not. A sample of this data contains a fruit that is red, round, and about 4'' in diameter. Even if these features depend on each other or on the existence of the other features, a Naive Bayes classifier will always treat them as independent contributors to the prediction.
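
The independence assumption shows up in code as a plain product of per-feature probabilities. In the sketch below every probability is a made-up number, and the names `prior`, `likelihood`, and `naive_bayes_posterior` are hypothetical; the point is only the shape of the calculation.

```python
# Hypothetical probabilities, invented for illustration only
prior = {"apple": 0.5, "not_apple": 0.5}
likelihood = {
    "apple":     {"red": 0.8, "round": 0.9, "about_4_inches": 0.7},
    "not_apple": {"red": 0.3, "round": 0.4, "about_4_inches": 0.2},
}

def naive_bayes_posterior(features):
    """Score each class as prior * product of P(feature | class), then normalize.
    Multiplying the likelihoods is exactly the 'naive' independence assumption."""
    scores = {}
    for label in prior:
        score = prior[label]
        for feature in features:
            score *= likelihood[label][feature]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

posterior = naive_bayes_posterior(["red", "round", "about_4_inches"])
print(posterior)
```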

17. How Does Ensemble Learning Work?

Ensemble learning can be done in two ways. One is to combine the predictions of different algorithms to generate a new, more accurate prediction. The other is to train a single algorithm multiple times and then combine the predictions of each model into a better overall prediction.
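
One simple way to combine predictions is majority voting, sketched below. The three prediction lists stand in for the outputs of three already-trained models and are invented for illustration; `majority_vote` is a hypothetical helper.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine class predictions from several models by picking,
    for each sample, the label that most models agree on."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from three models on the same four samples
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Each model is wrong somewhere, but as long as their mistakes fall on different samples the vote can still land on the right label.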

“Don’t Let Yesterday Take Up Too Much Of Today.” — Will Rogers

18. What is bagging and Boosting in machine learning?

Boosting is a way of combining the predictions of models that are built sequentially, where each new model focuses on the errors of the previous ones. Ex: Gradient boosting. The new model is highly influenced by the performance of the previously built models. It reduces the bias.

19. What is a bias-variance tradeoff?

If our model has few parameters, it may have high bias and low variance; because of that, it will be consistent but inaccurate on average.
A model with a large number of parameters may have low bias and high variance; such models are mostly accurate on average but inconsistent in nature.

A good model strikes a balance between the two, keeping both bias and variance as low as possible.

20. Explain L1 and L2 Regularization?

21. What are the different ways you know to handle missing values in machine learning?

22. What are the different techniques you can use to select Features?

2. Extra Tree Classifier: This technique gives you a score for each feature of the data. The higher the score, the more important and relevant that feature is. You can import the ExtraTreesClassifier class from sklearn.ensemble.

3. Correlation Matrix: It is a table that displays the correlation of all the features against each other. Each cell in the table displays the correlation between two variables. We can use a threshold value to keep the less correlated variables and drop features that are highly correlated with one another.

4. Mutual Information: It is a measure of how much information each feature shares with the dependent feature (the target). The higher the mutual information, the more relevant the feature is.
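
Items 3 and 4 can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`pearson`, `drop_correlated`, `mutual_information`), the 0.9 threshold, and all data are invented, and the mutual information here is the simple discrete estimate from counts, not a library estimator.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if it is not strongly correlated with one already kept."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

def mutual_information(feature, target):
    """Discrete mutual information (in nats) estimated from value counts."""
    n = len(feature)
    joint, px, py = Counter(zip(feature, target)), Counter(feature), Counter(target)
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

# Hypothetical data: two columns measure the same thing in different units
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [30, 22, 41, 35, 27],
}
selected = drop_correlated(features)

target = [0, 0, 1, 1]
mi_informative = mutual_information([0, 0, 1, 1], target)    # copies the target
mi_uninformative = mutual_information([0, 1, 0, 1], target)  # unrelated to it
print(selected, mi_informative, mi_uninformative)
```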

“Torture the data, and it will confess to anything.” — By Ronald Coase

23. What Approaches Can You Follow to Handle Categorical Values in the Dataset?

  1. Nominal Encoding: When data do not have an inherent order.
    1.1 One Hot Encoding
    1.2 One Hot Encoding with many features
    1.3 Mean Encoding
  2. Ordinal Encoding: When data have an inherent order.
    2.1 Label Encoding
    2.2 Target Guided Encoding
  3. Count Encoding
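
Three of the encodings listed above are simple enough to sketch by hand. The helper names and sample categories are assumptions made for this example; libraries such as pandas and sklearn offer their own versions.

```python
from collections import Counter

def one_hot(values):
    """One column per category; 1 marks the category of each row."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def label_encode(values, order):
    """Map each category to its position in an explicit ordering."""
    mapping = {category: i for i, category in enumerate(order)}
    return [mapping[v] for v in values]

def count_encode(values):
    """Replace each category with how often it occurs in the column."""
    counts = Counter(values)
    return [counts[v] for v in values]

# Hypothetical columns, invented for illustration only
colors = ["red", "green", "blue", "green", "red", "green"]   # no inherent order
sizes = ["small", "large", "medium", "small"]                # inherent order

colors_one_hot = one_hot(colors)   # columns are: blue, green, red
sizes_label = label_encode(sizes, order=["small", "medium", "large"])
colors_count = count_encode(colors)
print(colors_one_hot[0], sizes_label, colors_count)
```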

24. What are Outliers and How Can You Handle Them in Machine Learning?

  1. Remove all the outliers
  2. Replace the outlier values with a suitable value (like the value at the 3rd standard deviation)
  3. Use a Different algorithm that is not sensitive to outliers.
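
Option 2 could look like the sketch below, which clips anything beyond three standard deviations from the mean. The function `cap_outliers` and the data are hypothetical, and three standard deviations is only one common choice of cut-off.

```python
from statistics import mean, pstdev

def cap_outliers(values, n_std=3.0):
    """Clip values that lie more than n_std standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    lower, upper = mu - n_std * sigma, mu + n_std * sigma
    return [min(max(v, lower), upper) for v in values]

# Hypothetical measurements with one extreme value at the end
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 13, 12, 11, 12, 10, 13, 200]
capped = cap_outliers(values)
print(capped[-1])
```

Keep in mind that the outlier itself inflates the mean and standard deviation used for the cut-off, so on very small datasets this rule may never trigger.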

25. What is Feature Scaling and Transformation and Why are They Necessary?

Sometimes our dataset has columns with different units; for example, one column can be age while another can be the salary of the person. In this scenario, the age column ranges from 0 to 100, and the salary column ranges from 0 to 10000. Because of this large difference in magnitude, the column with larger values will influence the output more, which results in a badly performing model. Thus we need to perform feature scaling and transformation.
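
Two common ways to bring such columns onto a comparable scale are min-max scaling and standardization. The sketch below uses made-up age and salary values, and the helper names are assumptions for this example.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical columns on very different scales
ages = [20, 30, 40, 50, 60]
salaries = [2000, 4000, 6000, 8000, 10000]

scaled_ages = min_max_scale(ages)
scaled_salaries = min_max_scale(salaries)
standardized_salaries = standardize(salaries)
print(scaled_ages, scaled_salaries)
```

After scaling, both columns span the same range, so neither dominates simply because of its units.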

26. How Will You Handle an Imbalanced Dataset?

  1. Collecting More Data.
  2. Apply Oversampling when we have a small amount of data
  3. Apply Undersampling
  4. Try Some Other Algorithm
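
As one possible illustration of oversampling, the sketch below duplicates randomly chosen minority-class rows until every class is as large as the biggest one. `random_oversample`, the seed, and the data are assumptions for this example; dedicated libraries provide more refined strategies.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all
    classes have as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target_size = max(len(members) for members in by_class.values())
    new_rows, new_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target_size - len(members))]
        for row in members + extra:
            new_rows.append(row)
            new_labels.append(label)
    return new_rows, new_labels

# Hypothetical dataset: five rows of class 0 and a single row of class 1
rows = [[1], [2], [3], [4], [5], [6]]
labels = [0, 0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(balanced_labels)
```

Oversampling should be applied to the training split only, otherwise duplicated rows can leak into the test set.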

27. What is A/B Testing?

In a real-world scenario, suppose you create two models that recommend products for users. A/B testing can be used to compare these two models to check which one gives the best recommendations.

28. What is Cross-Validation in Machine Learning?

  • k-fold cross-validation
  • Holdout method
  • Stratified k-fold cross-validation
  • Leave-p-out cross-validation
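
The first technique in the list, k-fold cross-validation, comes down to splitting the row indices into k parts and letting each part serve once as the test set. A minimal sketch, with `k_fold_indices` as a hypothetical helper and no shuffling:

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples.
    Earlier folds get one extra sample when n_samples is not divisible by k."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(n_samples=10, k=3)
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

The model would be trained and scored once per fold, and the k scores averaged.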

29. What is PCA and How is it Useful?

In real life, we often come across datasets with a large number of dimensions, which makes visualizing and analyzing them difficult. PCA can help reduce the dimensionality of a dataset by projecting it onto a smaller number of new dimensions (principal components) that retain most of the variance, discarding the rest.
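
To give a feel for the idea, the sketch below reduces two-dimensional points to one dimension by finding the direction of greatest variance with power iteration on the covariance matrix. This is an illustrative shortcut, not how libraries implement PCA, and the function names and data are assumptions.

```python
import math

def first_principal_component(points, iterations=100):
    """Estimate the direction of greatest variance via power iteration."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Sample covariance matrix of the centered data
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    v = [1.0] * dims  # starting guess; assumed not orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(points, means, direction):
    """Project each centered point onto the given direction (2-D -> 1-D)."""
    return [
        sum((p[d] - means[d]) * direction[d] for d in range(len(direction)))
        for p in points
    ]

# Hypothetical points lying close to the line y = x
points = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
means, direction = first_principal_component(points)
reduced = project(points, means, direction)
print(direction, reduced)
```

Because the points almost lie on a line, one number per point keeps nearly all of the information that the original two columns carried.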

“The More You Learn, The More You Earn.” — Warren Buffett

30. How is a Pipeline Used in Machine Learning?

Pipelines are commonly used in NLP. One part of the pipeline does the cleaning and vectorization, while another part does the model training and validation.
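
The cleaning and vectorization part can be sketched as a chain of steps that each take the output of the previous one. `SimplePipeline`, `clean`, and `vectorize` are hypothetical names written for this example and are not sklearn's `Pipeline` class; a training step could be appended to the list in the same way.

```python
import string

def clean(texts):
    """Lowercase each text and strip punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return [t.lower().translate(table) for t in texts]

def vectorize(texts):
    """Turn each text into word counts over a shared, sorted vocabulary."""
    vocabulary = sorted({word for t in texts for word in t.split()})
    return [[t.split().count(word) for word in vocabulary] for t in texts]

class SimplePipeline:
    """Run named steps in order, feeding each step's output to the next."""
    def __init__(self, steps):
        self.steps = steps

    def run(self, data):
        for _name, step in self.steps:
            data = step(data)
        return data

pipeline = SimplePipeline([("clean", clean), ("vectorize", vectorize)])
vectors = pipeline.run(["Good movie!", "Bad movie."])
print(vectors)  # columns are: bad, good, movie
```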

Thanks For Reading

The Pythoneers

Sharing Projects, Codes, and Ideas.
