Can you Crack this Data-Science Interview?

Nishesh Gogia
24 min read · Aug 4, 2023


Hey folks, I hope you are doing well. Today I am gonna tell you a story. My personal story. So if you want to be a data scientist, if you dream day in and day out of being a data scientist and want to work on good projects, this blog, my friend, is for you.

Sometimes life teaches you things in a hundred different ways. Let me share my part of that learning!!!

Those were the days when I was preparing for my first job. I used to do a lot of coding and a lot of practice because I wanted a Data Scientist role, and as a fresher it’s difficult to crack one. I would spend 10 hours a day understanding a topic and then coding it from scratch.

One day I was preparing for the upcoming campus drive at my college and my younger brother asked me, “What is Data Science?”

I told him, “Data Science is a field where companies and organisations want to understand the patterns in their existing data and get meaningful insights from it, so that they can use those insights for their business.”

He looked at me weirdly: “That sounds so boring. I was thinking of getting into data science because everybody is talking about it, but it sounds too boring to me.”

I was genuinely shocked, because I could not find anything more interesting than data science. Then I asked my younger brother, “Why do you think it’s boring?”

He replied, “It has so many complex terms, I don’t want to study this”.

I said, “What if I told you that whenever you take a cab, you automatically get to know the time it will take to reach your destination; or when you order food, you know exactly when the driver is going to reach your home; or in your Gmail, spam mail automatically goes to the spam folder and normal mail comes to the inbox. All of this is Data Science.”

He replied, “Ohhhh!!! This is so cool. Now it sounds interesting.”

I understood one simple thing that day: INTERPRETABILITY. It does not matter what you know; what matters is how easily it can be interpreted by the other person.

“Our perception changes if we interpret things better!!”

I decided to learn things in such a way that if I needed to explain any topic to my younger brother, I would know how to do it. It was difficult at the beginning because I had to break every topic down to its fundamentals, but I continued learning like that…

THE INTERVIEW DAY!!!

So it was the day when I had a chance to become a Data Scientist; I had been preparing for this day for the last 2 months.

The first round was a simple coding round where I needed to write simple Python programs, and I did that comfortably.

The second round was the technical round. The technical manager introduced himself and the other team mates. Now the twist in the game was that the Managing Director of the company was a school friend of our College Director, so he had also come to visit the campus and meet his friend.

And he wanted to sit in on the personal interview round too, so he was sitting with the technical manager and the other interviewers.

The interview was about to start when the Director of the company asked me in a very friendly tone, “Please explain your answers like you would explain them to a 10-year-old kid; I don’t know anything about Data Science.”

Listening to this, I knew it, it was my game!!!

Before it started, I asked if I could use a simple piece of paper to explain topics, and they agreed.

And then the interview started…

Question-1-> What is Supervised Learning and What is Unsupervised Learning?

Ans-1 Let’s say, hypothetically, I am playing with a small kid of age 3. I take a tray and put three fruits in front of him: the first fruit is red apples, the second is pink cherries, the third is yellow bananas.

I am assuming that the kid has never seen these fruits, never eaten them, and knows nothing about them.

Then I told him 100 times that this red fruit is called an “Apple”, this pink fruit is called a “Cherry”, and this yellow fruit is called a “Banana”. I repeated and repeated.

Here I provide the kid with the input, which is a fruit, and the output, which is the label “Apple”. The kid is able to learn that this red-looking fruit, which is round in shape, is called an “Apple”.

That’s what we call Supervised Learning, where you give both the input and the output, and the machine needs to learn how a particular input is mapped to a particular output.

For example- Gmail Spam/NotSpam Classifier.

Then what is Unsupervised Learning?

Let’s say I get another kid, and I just show him the 3 fruits but don’t tell him which fruit is called what. I am sure that after looking at the tray multiple times, the kid will be able to see that all the red fruits look the same, all the yellow fruits look the same, and so on. He won’t be able to tell the names, but he will be able to put them into different categories based on the colours.

That is unsupervised learning, where we give the machine only the input, and from the input it needs to figure out which group or cluster a particular input belongs to.

For example- Credit Card Fraud Detection
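To make the distinction concrete, here is a minimal, illustrative scikit-learn sketch (the toy fruit features and labels are made up for this example): a classifier learns from labelled inputs, while a clustering algorithm groups unlabelled inputs on its own.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Supervised: inputs (colour, roundness) AND labels are given.
X = [[1, 1], [1, 1], [0, 0], [0, 0]]          # toy fruit features
y = ["apple", "apple", "banana", "banana"]    # known answers
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[1, 1]]))                  # -> ['apple']

# Unsupervised: only inputs are given; the model finds groups itself.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                             # cluster ids, no names
```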

Question-2-> Can you tell me the difference between A Data Analyst and A Data Scientist?

Ans-2 Let’s say I am an employee at Swiggy and my manager asks me to “Give the top 5 cities from where we are getting the least orders in the last 6 months”. Now let’s say Swiggy got 10 crore orders in the last 6 months; finding the least-order cities manually would take months. The answer to this question is already there in my data, but manually it’s very difficult to find, so here comes the role of a Data Analyst, who uses libraries like NumPy, Pandas, Matplotlib, and even tools like Power BI to find the answer.

But let’s say my manager says, “Give the estimated sales value of our company by the end of this year.” Now my job is to predict the sales value: this is Data Science or Machine Learning.

If you go into the past in the data, it’s data analysis.

If you go into the future, it’s data science.
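Here is a hedged sketch of the analyst side of that example with pandas; the orders DataFrame and its order_id and city columns are hypothetical, just to show the mechanics.

```python
import pandas as pd

# Hypothetical order data; in reality this would come from a database.
orders = pd.DataFrame({
    "order_id": range(8),
    "city": ["Delhi", "Delhi", "Mumbai", "Pune", "Pune", "Agra", "Agra", "Agra"],
})

# Bottom 5 cities by number of orders in the period.
least_orders = orders.groupby("city")["order_id"].count().nsmallest(5)
print(least_orders)
```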

Question-3-> Then, Do you know Who is a Data Engineer?

Ans-3 Let’s say I am using OLA and I book a cab; the cab comes to my door and I complete my trip. Now, after 10 days, if I want any information about the ride I took, I can simply go and check the OLA app.

Now how is every click of mine getting stored as data: what time I took my cab, my ride duration, the price, and many other things? One job of a Data Engineer is to create data pipelines and get the data stored in a cloud database like MongoDB.

Question-4-> Define p-value and tell me why it is important?

Ans-4 -When scientists do experiments, they want to know if their results are actually meaningful or just a coincidence. The p-value helps them make that determination.

A p-value is a number that tells us the likelihood that the results we see in an experiment are due to chance. The lower the p-value, the less likely it is that the results are just a coincidence.

For example, let’s say a group of scientists is testing a new medicine. They give the medicine to one group of people and a placebo (fake medicine) to another group. Then they measure how many people in each group get better. If the group that got the medicine has a much higher percentage of people who get better, the scientists can calculate the p-value to see if this difference is likely due to chance or if it is statistically significant.

In general, scientists consider a p-value of 0.05 or lower to be statistically significant. This means that there is less than a 5% chance that the results are due to chance, and that they are probably meaningful.
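For instance, a two-sample t-test in SciPy returns a p-value directly; the recovery scores below are made-up numbers, purely to show the mechanics.

```python
from scipy import stats

# Hypothetical recovery scores for the medicine and placebo groups.
medicine = [8, 9, 7, 9, 8, 9, 10, 8]
placebo  = [6, 7, 6, 5, 7, 6, 7, 6]

t_stat, p_value = stats.ttest_ind(medicine, placebo)
print(p_value)   # if p < 0.05, the difference is unlikely to be just chance
```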

Question-5-> What is PDF and CDF and why do you think it’s important in Machine Learning?

Ans-5 PDF stands for Probability Density Function. In probability theory, a PDF is a function that describes the relative likelihood of a random variable taking on a particular value or set of values.

CDF stands for Cumulative Distribution Function. It’s a function that gives the probability that a random variable is less than or equal to a certain value.

PDFs and CDFs are important because they allow us to model the distribution of data. By understanding the distribution of data, we can make more informed decisions about how to process and analyze it. For example, if we have a dataset of images, we might want to know what the distribution of pixel values looks like. This information can help us choose appropriate preprocessing techniques or models that take into account the structure of the data.
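As a quick, illustrative check with SciPy’s standard normal distribution:

```python
from scipy.stats import norm

# PDF: relative likelihood of the value 0 under a standard normal.
print(norm.pdf(0))        # ~0.3989

# CDF: probability that the variable is <= 0.
print(norm.cdf(0))        # 0.5
```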

Question-6-> Can you explain the Central Limit theorem?

Ans-6 The Central Limit Theorem is a statistical concept that helps us understand how the means of random variables are distributed. It states that if we take a large number of random samples from a population and calculate the mean of each sample, the distribution of those means will be approximately normal, regardless of the shape of the original population.

For example, imagine we wanted to know the average height of all students in a school. We could measure the height of every student, but that would be very time-consuming. Instead, we could take a random sample of students and measure their heights. We could then take the mean height of that sample and repeat the process many times, each time taking a different random sample of students. According to the Central Limit Theorem, the distribution of those means would be approximately normal, even if the distribution of heights in the original population was not normal.
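A small NumPy simulation (with toy numbers) shows the effect: sample means drawn from a clearly non-normal population still pile up in a roughly bell-shaped curve.

```python
import numpy as np

rng = np.random.default_rng(0)

# A clearly non-normal population: exponential "heights" (toy data).
population = rng.exponential(scale=10, size=100_000)

# Take many random samples and record each sample's mean.
sample_means = [rng.choice(population, size=50).mean() for _ in range(2000)]

# The means cluster around the population mean with a roughly normal shape.
print(np.mean(sample_means), np.std(sample_means))
```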

Question-7-> What are data structures, and what data structures do we have in Python?

Ans-7 Data structures are the ways I can store my data, but the question is: why do I need different ways to store data?

Is storing data in a single structure not enough?

So yes, we cannot store data in only one data structure, because every data structure has its own advantages and its own disadvantages.

For example, Python has 4 built-in data structures (a quick sketch follows the list):

  1. List- A list is a mutable data structure, but searching it has O(n) time complexity, which is not great; appending at the end is O(1).
  2. Tuple- A tuple is immutable and has the same O(n) search time complexity as a list.
  3. Dictionary- A dictionary is mutable and has O(1) average search time complexity by key, at the cost of the extra memory used by the hash table.
  4. Set- A set is mutable because we can insert new elements into it, it does not allow duplicates, and membership checks are also O(1) on average.
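A quick sketch of those four structures and the trade-offs mentioned above; membership checks are where lists and dictionaries/sets differ the most.

```python
fruits_list = ["apple", "cherry", "banana"]         # mutable, O(n) search
fruits_tuple = ("apple", "cherry", "banana")        # immutable, O(n) search
fruit_colour = {"apple": "red", "cherry": "pink"}   # mutable, O(1) average lookup
fruit_set = {"apple", "cherry", "banana", "apple"}  # duplicates dropped

print("cherry" in fruits_list)   # scans the list -> O(n)
print(fruit_colour["apple"])     # hash lookup     -> O(1) on average
print(fruit_set)                 # only unique elements remain
```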

Question-8-> What are CNNs, Can you explain any famous CNN architectures?

Ans-8 CNN stands for Convolutional Neural Network, where convolution means element-wise multiplication followed by addition. CNNs are generally used in Computer Vision problems, where the basic task is to generate vectors out of images.

How do we turn the information in an image into a mathematical vector so we can use that information for problems like object detection, face recognition, etc.?
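To show what “element-wise multiplication and addition” means, here is a tiny NumPy sketch of a single 3x3 convolution step (toy numbers, ignoring padding and stride):

```python
import numpy as np

patch = np.array([[1, 2, 0],      # a 3x3 patch of pixel values (toy)
                  [0, 1, 3],
                  [2, 1, 0]])
kernel = np.array([[1, 0, -1],    # a 3x3 filter (toy weights)
                   [1, 0, -1],
                   [1, 0, -1]])

# Convolution at this position = multiply element-wise, then sum.
value = np.sum(patch * kernel)
print(value)                      # one number in the output feature map
```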

These are some famous CNN architectures:

  1. LeNET-LeNet (short for LeNet-5) is a convolutional neural network architecture that was developed by Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner in 1998. It was one of the first successful convolutional neural networks for image recognition and classification tasks.
  2. AlexNet- AlexNet is a convolutional neural network architecture designed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It won the ImageNet Large Scale Visual Recognition Challenge in 2012 and is widely considered as one of the breakthroughs that sparked the deep learning revolution.
  3. VGG16- VGG16 is a convolutional neural network architecture that consists of 16 layers, including 13 convolutional layers and 3 fully connected layers. It was developed by the Visual Geometry Group at the University of Oxford and achieved state-of-the-art performance on the ImageNet dataset in 2014.

Question-9-> What is Max Pooling and why do we need Max Pooling?

Ans-9 Max Pooling involves dividing an input image into a set of non-overlapping rectangular regions and then taking the maximum value of each region. The result is a down-sampled version of the input image with reduced spatial dimensions.

The primary reason for using Max Pooling is to reduce the size of the feature maps produced by the convolutional layers in a CNN, while retaining the most important features. This helps to reduce the computational complexity of the network, making it easier to train and faster to process new input images.
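A minimal sketch of 2x2 max pooling on a 4x4 feature map with toy values:

```python
import numpy as np

feature_map = np.array([[1, 3, 2, 1],
                        [4, 2, 0, 1],
                        [5, 1, 2, 2],
                        [0, 1, 3, 4]])

# Split into non-overlapping 2x2 blocks and keep the max of each block.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)        # [[4 2]
                     #  [5 4]]
```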

Question-10-> What is NLP?

Ans-10 NLP stands for Natural Language Processing. It is a branch of artificial intelligence that focuses on teaching machines how to understand, interpret, and generate human language. It involves developing algorithms and models that can analyze and manipulate text, speech, and other forms of natural language data.

It enables machines to perform tasks such as sentiment analysis, language translation, text summarization, chatbot conversations, and more. NLP is used in a wide range of applications, from virtual assistants like Siri and Alexa to spam filters in email, and even in healthcare for medical diagnosis and treatment planning.

Question-11-> What are the algorithms to embed a sentence into a vector?

Ans-11 Some of the algorithms used to embed a sentence into a vector are (a TF-IDF sketch follows the list):

  1. Bag of Words
  2. TF-IDF
  3. Word2Vec
  4. BERT (Transformer-based)
  5. GPT models
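As one concrete example from that list, here is a hedged TF-IDF sketch with scikit-learn; the two sentences are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "the food was delivered on time",
    "the cab arrived on time",
]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(sentences)   # one row per sentence

print(vectors.shape)          # (2, number_of_unique_words)
print(vectors.toarray())      # each sentence is now a numeric vector
```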

Question-12-> What’s your favourite algorithm?

Ans-12 My favourite algorithm is Random Forest.

Random Forest is a popular machine learning algorithm that uses the concept of ensemble learning to build a robust and accurate model.

Ensemble learning involves combining multiple models to make a prediction. In the case of Random Forest, multiple decision trees are built using randomly selected subsets of the training data and features. Each decision tree is trained independently, and the final prediction is made by taking the average or majority vote of the predictions from all the decision trees.

The random selection of subsets of data and features helps to reduce the impact of individual noisy data points or features on the model’s predictions, and the combination of multiple decision trees helps to improve the model’s accuracy and reduce overfitting.
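A small, illustrative Random Forest sketch with scikit-learn on its built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 decision trees, each trained on a random subset of rows and features.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(model.score(X_test, y_test))   # accuracy from the majority vote of the trees
```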

Question-13-> What is a hyper parameter?

Ans-13 In machine learning, a hyperparameter is a parameter that is set before the training process begins and determines the overall behavior and performance of the machine learning algorithm.

Hyperparameters are used to control the learning process and affect how the model is trained, such as the learning rate, regularization parameter, number of hidden layers, number of neurons per layer, etc.

The process of choosing the best hyperparameters for a given problem is known as hyperparameter tuning.
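A minimal tuning sketch with scikit-learn’s GridSearchCV; the parameter grid here is just an illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Hyperparameters are fixed before training; the grid search tries each combination.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```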

Question-14-> What is Under-fitting?

Ans-14 Underfitting is a common problem in machine learning where a model is unable to capture the underlying patterns and relationships in the training data, resulting in poor performance on both the training and test data.

In simple terms, underfitting occurs when the model is too simple or not complex enough to represent the data. This means that the model cannot learn the relevant features and relationships between the input variables and the target variable. As a result, the model produces high bias and low variance, and it performs poorly on both the training data and new unseen data.

Question-15->What is Over-Fitting ?

Ans-15 Overfitting is a common problem in machine learning where a model is too complex and fits the training data too closely, resulting in poor generalization performance on new unseen data.

In simple terms, overfitting occurs when the model is too complex and learns the noise in the training data along with the underlying patterns and relationships between the input variables and the target variable. This means that the model is not able to generalize well to new data and produces high variance and low bias, which leads to poor performance on new data.

Question-16-> What is model error?

Ans-16 The model error is defined as-:

Model error = (Bias)² + Variance + Irreducible error

Question-17-> What is One-Hot-Encoding?

Ans-17 One hot encoding is a technique used in machine learning to represent categorical data as numerical data. It involves creating a binary vector for each category in a categorical variable, where each vector has a length equal to the number of unique categories in the variable.

In simple terms, one hot encoding replaces each category in a categorical variable with a binary vector that has a value of 1 in the corresponding index and a value of 0 in all other indices.

For example, suppose we have a categorical variable “Color” with three unique categories: Red, Green, and Blue. One hot encoding would create a binary vector of length three for each color: [1, 0, 0] for Red, [0, 1, 0] for Green, and [0, 0, 1] for Blue.
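The same Color example as a quick pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Red"]})

# Each category becomes its own 0/1 column.
one_hot = pd.get_dummies(df["Color"], dtype=int)
print(one_hot)
#    Blue  Green  Red
# 0     0      0    1
# 1     0      1    0
# 2     1      0    0
# 3     0      0    1
```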

Question-18-> What is Multi-class classification?

Ans-18 Multi-class classification is the task of predicting a target variable that has more than two possible outcomes or classes. For example, predicting the type of fruit in an image as either an apple, banana, or orange would be a multi-class classification problem.

The main difference between binary classification and multi-class classification is the number of classes or categories in the target variable. In binary classification, the target variable has only two classes, while in multi-class classification, the target variable has more than two classes.

Question-19-> Can we use all the algorithms directly if we have a multi-class classification problem?

Ans-19 No, we cannot use every algorithm directly for multi-class classification; we have to use techniques like One-vs-All to make it work.

For example, Logistic Regression is defined for binary classes, but we can use One-vs-All to make it work for multi-class classification.
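A hedged sketch of that One-vs-All idea with scikit-learn, wrapping a binary logistic regression so it can handle the 3-class iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # 3 classes

# One binary logistic regression is trained per class ("this class vs the rest").
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

print(len(ovr.estimators_))   # 3 underlying binary classifiers
print(ovr.predict(X[:3]))
```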

Question-20-> What is the performance metric you will use if you have a medical related problem like “Cancer Prediction”?

Ans-20 One commonly used performance metric in medical applications is sensitivity, also known as the true positive rate. Sensitivity measures the proportion of actual positive cases that are correctly identified as positive by the model. In cancer prediction, sensitivity is important because it reflects the ability of the model to correctly identify individuals who have cancer and who may require further medical evaluation.

Another important performance metric for cancer prediction is specificity, also known as the true negative rate. Specificity measures the proportion of actual negative cases that are correctly identified as negative by the model. In cancer prediction, specificity is important because it reflects the ability of the model to correctly identify individuals who do not have cancer and who may not require further medical evaluation.

Question-21-> What is F1 Score?

Ans-21 F1 score is the harmonic mean of Precision and Recall.

F1 = 2PR / (P + R)
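A quick check with scikit-learn on toy labels, just to confirm the formula:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(p, r, f1_score(y_true, y_pred))   # f1 equals 2*p*r / (p + r)
print(2 * p * r / (p + r))
```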

Question-22-> Why do you think Data Science is the future?

Ans-22 Data science is becoming more and more important because there is a lot of information being created and stored every day due to the growth of the internet and technology. Data science helps us make sense of all this information and find useful patterns and insights that can help individuals and businesses make better decisions. As a result, there is a growing demand for data scientists who can analyse and interpret data to solve real-world problems.

As more and more people understand how data science can actually bring more business to their company, more hiring will happen.

Question-23-> What is NLTK?

Ans 23-NLTK is an open-source software library written in Python that provides tools and resources for natural language processing (NLP) tasks, such as tokenization, stemming, tagging, parsing, and semantic reasoning. It also offers access to a vast collection of language corpora and lexical resources. NLTK is widely used in academia and industry for NLP research, development, and education.

Question-24-> What is Dropout?

Ans 24-Dropout is a regularization technique used in deep learning neural networks to prevent overfitting. Overfitting occurs when a model becomes too complex and is trained to fit the training data too closely, which can cause it to perform poorly on new, unseen data.

Dropout works by randomly dropping out (i.e., setting to zero) some of the neurons in a neural network during training. This forces the network to learn more robust features and prevents any single neuron from becoming too important in making predictions.

During the training process, each neuron in the network is either dropped out with a specified probability or retained with probability (1 - dropout rate). The dropout rate is typically set between 0.1 and 0.5, and the optimal value can depend on the specific problem and architecture of the neural network.
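A tiny NumPy sketch of the mechanism (the “inverted dropout” convention most frameworks use, with toy activations):

```python
import numpy as np

rng = np.random.default_rng(0)

activations = np.array([0.8, 0.2, 0.5, 0.9, 0.1, 0.7])  # toy neuron outputs
dropout_rate = 0.5

# During training: each neuron is kept with probability (1 - dropout_rate)...
mask = rng.random(activations.shape) > dropout_rate
# ...and surviving activations are scaled up so the expected value stays the same.
dropped = activations * mask / (1 - dropout_rate)
print(dropped)     # roughly half the values are zeroed out
```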

Question-25-> What is Batch Normalisation?

Ans 25- Batch normalization works by normalizing the output of each layer before applying the activation function. During training, the mean and standard deviation of the output of each layer are computed for each mini-batch, and these statistics are used to normalize the output. Specifically, the output is normalized by subtracting the mean and dividing by the standard deviation, which is then scaled and shifted by learnable parameters called gamma and beta.

Batch normalization has several benefits, including:

  • It reduces the dependence of the model on the initialization of the weights, making it easier to train deep networks.
  • It can improve the generalization performance of the model by reducing overfitting.
  • It makes the optimization process more stable, allowing for the use of higher learning rates.

Question-26-> What is a Perceptron?

Ans 26- A perceptron is a type of artificial neural network that is commonly used in machine learning for binary classification problems. It is a simple model that takes a vector of inputs, applies a linear function to the inputs, and produces a binary output based on a threshold.

The most commonly used activation function in perceptrons is the step function, which produces an output of 1 if the weighted sum of the inputs is greater than or equal to a threshold, and 0 otherwise. Other activation functions, such as the sigmoid or ReLU functions, can also be used in more complex models.
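A minimal perceptron forward pass in NumPy, with toy weights and the classic step function mentioned above:

```python
import numpy as np

def perceptron(x, w, b):
    # Weighted sum of inputs plus bias, then a step-function threshold at 0.
    z = np.dot(w, x) + b
    return 1 if z >= 0 else 0

x = np.array([1.0, 0.5])     # toy input features
w = np.array([0.6, -0.4])    # toy weights
b = -0.1                     # toy bias

print(perceptron(x, w, b))   # 1, because 0.6*1.0 - 0.4*0.5 - 0.1 = 0.3 >= 0
```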

Question-27-> How is Logistic Regression a single-neuron model?

Ans-27 It is a single neuron model in the sense that it is based on a single output neuron that applies a logistic function (also known as a sigmoid function) to a linear combination of the inputs.

In logistic regression, the input features x1, x2, …, xn are combined linearly using weights w1, w2, …, wn and an intercept b to produce a single output z = w1*x1 + w2*x2 + … + wn*xn + b. This output is then passed through a logistic function to produce the predicted probability y_hat of the positive class:

y_hat = 1 / (1 + exp(-z))

Only one neuron produces the output; that’s why we call it a single-neuron model.

Question-28-> Why does the name logistic regression contain “regression” when it’s a classification technique?

Ans 28- The output of the logistic regression is:-

y_hat = 1 / (1 + exp(-z))

y_hat will always be between 0 and 1, so it can take any value in that range, which resembles regression. Classification means the output is 0 or 1; regression means the output can be any value in between. So the nature of the output is regression-like, but then we take a threshold value to decide whether it belongs to class 0 or class 1.

For example, if the output comes out to be 0.8 and the threshold is 0.5, then because the output > 0.5, we say it belongs to class 1.

Question-29-> What is a sigmoid function? Why it is used in logistic regression?

Ans-29 Sigmoid Function is

Sigmoid(x)= 1 / (1 + exp(-x))

Sigmoid function is a differentiable function and has a probabilistic nature.

It is used in logistic regression because it tapers off the outliers in the data: the sigmoid squashes extreme values towards 0 or 1, so outliers have much less influence on the output.

Question-30-> What do you mean by a hyperplane and how hyperplane is important with respect to machine learning?

Ans-30 Let’s say I have a classification task where I want to classify two classes, spam and not spam, and I have only 2 features.

Two features means the geometry of the problem lies in 2D: the x-axis is feature 1 and the y-axis is feature 2.

Now if you plot the points, you can see that to separate the 2 classes I would need a line.

If I have three features, to separate these two classes I need a plane.

If I have more than three features, I would need a hyperplane.

In most machine learning problems we have more than 3 features, and to solve them I need a hyperplane. I would want to know the equation of the hyperplane because the hyperplane is actually the decision boundary.

Question-31-> What is Backpropagation?

Ans-31 Backpropagation is a way where we leverage the chain rule of differentiation and the concept of memoization to find the derivative of the loss function with respect to the weights, so that we can update the weights in the update equation of gradient descent.

Backpropagation is the backbone of deep learning, because at the end we need to find the weights to complete the training process.

Question-32-> What is Gradient Descent?

Ans32- Gradient descent is an optimization algorithm used to minimize the cost function of a machine learning model. The cost function is a measure of the difference between the predicted output of the model and the actual output.

The algorithm works by starting at a random point in the parameter space of the model and iteratively adjusting the parameters in the direction of steepest descent of the cost function, which is determined by the gradient of the cost function. The gradient is a vector that points in the direction of the greatest rate of increase of the cost function, and the algorithm updates the parameters in the opposite direction of the gradient, in order to move towards a minimum of the cost function.
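A bare-bones sketch of gradient descent minimising the toy cost f(w) = (w - 3)², with a fixed learning rate:

```python
# Minimise f(w) = (w - 3)**2; its gradient is 2*(w - 3).
w = 0.0                 # starting point (could be chosen at random)
learning_rate = 0.1

for step in range(100):
    gradient = 2 * (w - 3)
    w = w - learning_rate * gradient   # move opposite to the gradient

print(w)                # converges close to 3, the minimum of the cost
```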

Question-33->What is the problem with gradient descent?

Ans33- While gradient descent is a widely used and effective optimization algorithm for many machine learning models, there are some potential problems associated with it:

  1. Local minima: Gradient descent can converge to a local minimum of the cost function, which may not be the global minimum. This can lead to suboptimal solutions that do not perform as well as they could.
  2. Learning rate: The learning rate determines the step size taken by the algorithm in the direction of the gradient. If the learning rate is too small, the algorithm may converge too slowly, while if the learning rate is too large, the algorithm may overshoot the minimum and fail to converge.

Question-34-> What is SGD(Stochastic Gradient Descent)?

Ans34-

  1. SGD is a variant of gradient descent used for training machine learning models.
  2. It updates the model parameters after each individual training example or a small batch of examples.
  3. This makes the algorithm more computationally efficient and can lead to faster convergence.
  4. SGD is commonly used for large datasets and deep learning models.

Question-35-> Explain the problem of Vanishing Gradient?

Ans35- The problem of vanishing gradients refers to a situation where the gradients of the cost function with respect to the model parameters become very small, particularly in deep neural networks, making it difficult to train the network effectively.

When the gradients are small, the updates to the model parameters during training become very small, and the network may converge to a suboptimal solution or not converge at all. The problem is particularly acute in deep neural networks with many layers, where the gradients can become exponentially small as they propagate through the layers.

Question-36-> What are the different activation functions we use in deep learning?

Ans36-

  1. Sigmoid: The sigmoid function maps any input value to a value between 0 and 1. It is often used in the output layer of binary classification models.
  2. Hyperbolic tangent (tanh): The tanh function maps any input value to a value between -1 and 1. It is commonly used as an activation function in hidden layers.
  3. Rectified Linear Unit (ReLU): The ReLU function outputs the input value if it is positive, and 0 otherwise. It is widely used in deep neural networks due to its simplicity and effectiveness.
  4. Leaky ReLU: The leaky ReLU function is a variant of the ReLU function that allows for a small positive gradient when the input is negative. This can help alleviate the “dying ReLU” problem where the gradients become zero for negative inputs.

Question-37-> How to deal with Outliers in Machine Learning?

Ans -37- Here are some techniques that can be used to deal with outliers in machine learning:

  1. Detection: Before dealing with outliers, it is important to first detect them. This can be done using various statistical techniques such as Z-score, interquartile range (IQR), and box plots (a quick sketch follows this list).
  2. Removal: One approach to dealing with outliers is to simply remove them from the dataset. However, this approach should be used with caution, as removing too many data points can lead to a loss of information and biased results.
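Here is a short IQR-based detection sketch with NumPy (toy data; the 1.5x multiplier is the usual rule of thumb):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95, 10, 12, 13])   # 95 is the outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)          # [95]
```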

Question-38-> What will you do if your model is overfit?

Ans-38 Here are some ways to fix it:

  1. Increase the size of the dataset: One way to reduce overfitting is to increase the size of the training dataset. This can help the model generalize better to unseen data.
  2. Feature selection: Another approach is to carefully select the most relevant features that are likely to have a strong impact on the model’s predictions. This can help reduce noise in the data and improve the model’s performance.
  3. Regularisation: Regularization is a technique that adds a penalty term to the cost function during training, which helps to prevent the model from overfitting the training data. Popular regularisation techniques include L1 and L2 regularisation, dropout, and early stopping.

Question-39-> What is RFR(Randomisation For Regularisation)?

Ans39- Randomized Regularization is a technique that combines the concepts of regularization and randomization to improve the performance of machine learning models. It involves adding a random component to the regularization process, which helps to prevent the model from overfitting to the training data.

Question-40-> Can you explain the difference between Convex and Non Convex function?

Ans40 -In simple terms, a convex function is a function that has a bowl-like shape, where any line segment connecting two points on the function lies above or on the function.

On the other hand, a non-convex function is a function that has a more complex shape, where there may be multiple local minimums and maximums.

In optimization problems, convex functions are easier to optimize because they have a single global minimum, which can be found efficiently. Non-convex functions, on the other hand, are more challenging to optimize because they may have multiple local minimums, and finding the global minimum requires exploring a larger search space.

CODING ROUND

Question-41-> Find the majority element in a list.
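A rough sketch of one possible solution, using the O(n) Boyer-Moore voting idea and assuming a majority element is guaranteed to exist:

```python
def majority_element(nums):
    # Boyer-Moore voting: the majority element survives all the cancellations.
    candidate, count = None, 0
    for num in nums:
        if count == 0:
            candidate = num
        count += 1 if num == candidate else -1
    return candidate

print(majority_element([2, 2, 1, 2, 3, 2, 2]))   # 2
```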

Question-42-> Rotate the array by d elements.

Question-43-> Given a string, reverse the order of characters in each word within a sentence while preserving whitespace and initial word order.

Question-44-> Given 2 strings, you need to find the common elements from both the strings.

One update in the code: please lowercase the strings first and then solve it.
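One possible sketch using sets, with the lowercase step they asked for:

```python
def common_characters(s1, s2):
    # Lowercase first, then intersect the sets of characters.
    return sorted(set(s1.lower()) & set(s2.lower()))

print(common_characters("DataScience", "Machine"))   # ['a', 'c', 'e', 'i', 'n']
```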

Question-45-> Write a code in python to print the pattern.

Question-46-> What is the search time complexity of a list and a dictionary?

Ans 46- The search time complexity of a list is O(n), which is not good when compared to a dictionary, which is O(1) on average.

Question-47-> Write a Python program to define the power function from scratch.

Question-48-> Write a Python function to find the cubic sum using only recursion.

Question-49-> Write a program to find the dot product in Python from scratch.

Question-50-> Write a Python program which gives us the Euclidean distance from scratch.
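And a from-scratch take on that last one, using just the square root of the sum of squared differences:

```python
import math

def euclidean_distance(p, q):
    # Square the difference in each dimension, sum, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance([1, 2, 3], [4, 6, 3]))   # 5.0
```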

That was a long, long interview, and I was very happy and satisfied with my answers.

The team looked happy; the director smiled at me but said nothing.

An hour later, I got to know that I was selected.

So if you ask me what went right for me, I would say, “My ability to break complex definitions into simple words made all the difference.”

The job of a Data Analyst/Data Scientist is not only to analyse the data or build the model but to actually communicate the results to the decision makers in the company, people who might not know a single word about data science but who know the business. As soon as you learn to leverage your technical skills to solve a business problem and explain it in simple terms, you are good to go.

So yes, the interpretability of a machine learning model, and the interpretability of your words, is very important in any interview.

That’s all for today. Thank you so much for reading…
