10 Data Science Interview Questions that You May Have Missed

Review concepts you may think you already know; they can come back to bite you during interviews

Seungjun (Josh) Kim
Mind Talk
5 min read · Oct 11, 2022



Python Lists vs. Tuples

Lists can be edited in place and are hence mutable. Tuples, on the other hand, are immutable. Although it depends on the specific situation, tuples are generally faster than lists in terms of operational speed.
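A quick illustration of the mutability difference (the variable names here are just for demonstration):

```python
# Lists are mutable: elements can be replaced or appended in place.
nums_list = [1, 2, 3]
nums_list[0] = 10
nums_list.append(4)

# Tuples are immutable: any attempt to modify one raises a TypeError.
nums_tuple = (1, 2, 3)
try:
    nums_tuple[0] = 10
except TypeError as e:
    print("Tuples cannot be modified:", e)

# Because tuples are immutable (and hashable), they can serve as
# dictionary keys; lists cannot.
coords = {(0, 0): "origin"}
```

A side benefit of immutability is exactly that last line: tuples can be used anywhere a hashable value is required, such as dictionary keys or set members.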

What is Inheritance?

Inheritance is a feature of many object-oriented programming languages, such as Python, where one class gains access to all the members of another class. Here, the term "members" refers to things like attributes and methods. Inheritance is useful because it allows users to re-use specific components, say methods, of one class in other, newly created classes without having to rewrite them. The class that we are inheriting from is called the parent or super class, while the class that is receiving the inheritance is called the child class.

There are mainly 4 types of inheritance as you can see below:

  • Single Inheritance: A child class acquires the members of a single super class.
  • Hierarchical Inheritance: Any number of child classes inherit from one base class.
  • Multi-level Inheritance: A child class (c1) inherits from a base class (b1), and another child class (c2) in turn inherits from c1, forming a chain.
  • Multiple Inheritance: A child class inherits from more than one base class.

One thing to note is that Python does support multiple inheritance while Java does not.
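A minimal sketch of three of these forms in Python (the class names are made up for illustration):

```python
class Animal:                      # base (super) class
    def eat(self):
        return "eating"

class Dog(Animal):                 # single inheritance: Dog acquires Animal's members
    def bark(self):
        return "woof"

class Puppy(Dog):                  # multi-level inheritance: Animal -> Dog -> Puppy
    pass

class Swimmer:
    def swim(self):
        return "swimming"

class Labrador(Dog, Swimmer):      # multiple inheritance: two base classes
    pass

print(Puppy().eat())      # inherited down the Animal -> Dog -> Puppy chain
print(Labrador().swim())  # inherited from the second base class
```

When multiple bases define the same method, Python resolves the conflict with its method resolution order (MRO), which you can inspect via `Labrador.__mro__`.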

Loss Function vs. Cost Function

For loss functions, we usually consider a single data point, prediction, or label. The cost function, on the other hand, is a more general measurement that aggregates the difference across the entire dataset (e.g. the training data). Representative examples of loss functions are squared error and hinge loss. They are defined as follows:

Square Loss: L(y, ŷ) = (y - ŷ)²
Hinge Loss: L(y, ŷ) = max(0, 1 - y·ŷ), where y ∈ {-1, +1}

Note that mean squared error can be used for most regression algorithms while hinge loss is often used for Support Vector Machines (SVM).

In a deep learning context, the cost function helps us evaluate how well a model performs. It is used to compute the error of the output layer during back-propagation, where that error is pushed backwards through the neural network.
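The distinction can be made concrete in a few lines: the loss functions score a single prediction, while the cost aggregates those losses over a dataset (here, mean squared error as the cost):

```python
import numpy as np

def squared_loss(y, y_hat):
    """Loss for a single prediction."""
    return (y - y_hat) ** 2

def hinge_loss(y, y_hat):
    """Loss for a single prediction; y is expected to be -1 or +1."""
    return max(0.0, 1.0 - y * y_hat)

def mse_cost(y, y_hat):
    """Cost: mean of the squared losses over the whole dataset."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

print(squared_loss(3.0, 2.5))              # 0.25
print(hinge_loss(1, 0.3))                  # 0.7
print(mse_cost([3.0, -0.5], [2.5, 0.0]))   # (0.25 + 0.25) / 2 = 0.25
```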

Feed-forward Neural Network (FNN) vs. Recurrent Neural Network (RNN)

The signals of a Recurrent Neural Network (RNN) travel in loops: it generates a layer's output by combining the present input with previously received inputs, and it can recall prior data through its internal memory structure. This trait makes it an appropriate architecture for time-series modelling such as stock market prediction.

On the other hand, feed-forward Neural Networks (FNN), such as the Convolutional Neural Network (CNN), do not have the recurrent loops that RNNs have; they simply compute on the current input and pass the result to the next layer.
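The core difference fits in two update rules. As a sketch (random toy weights, not a trained model): a feed-forward layer sees only the current input x, while a recurrent layer also feeds its previous hidden state h back in:

```python
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))   # input-to-hidden weights
W_h = rng.normal(size=(4, 4))   # hidden-to-hidden (recurrent) weights

def feedforward_step(x):
    # An FNN layer depends only on the current input.
    return np.tanh(W_x @ x)

def recurrent_step(x, h_prev):
    # An RNN layer combines the current input with the previous hidden state.
    return np.tanh(W_x @ x + W_h @ h_prev)

h = np.zeros(4)                       # the internal "memory" starts empty
for x_t in rng.normal(size=(5, 3)):   # a toy sequence of 5 time steps
    h = recurrent_step(x_t, h)        # h carries information across steps
```

After the loop, `h` summarizes the whole sequence, which is exactly the memory the feed-forward step lacks.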

What is DBMS and what are the two types of it? How are they different from each other?

DBMS stands for Database Management System. It provides an interface for users to interact with the database itself. Here, the word "interact" refers to any kind of activity performed on the database for the purpose of deleting, editing, or retrieving data.

There are two kinds of DBMS. One is the Relational Database Management System (RDBMS). The data in this kind of database is organized in a "relational" manner, where different tables are linked with one another based on rules, patterns, and keys. One example of this type of database is MySQL.

Another kind of DBMS is the Non-Relational Database Management System (often called NoSQL). There are no fixed relations among tables, and one example of this kind of database is MongoDB.
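The "relational" part is easiest to see with two linked tables. A small sketch using Python's built-in sqlite3 module as a stand-in RDBMS (the table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two tables linked through a key -- the "relational" part of an RDBMS.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
    "user_id INTEGER REFERENCES users(id), item TEXT)"
)
cur.execute("INSERT INTO users VALUES (1, 'Ada')")
cur.execute("INSERT INTO orders VALUES (1, 1, 'laptop')")

# A JOIN retrieves data across the linked tables.
row = cur.execute(
    "SELECT users.name, orders.item "
    "FROM users JOIN orders ON users.id = orders.user_id"
).fetchone()
print(row)  # ('Ada', 'laptop')
```

A document store like MongoDB would instead nest the order inside the user document, with no JOIN and no schema-enforced link between collections.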

Error vs. Residual (Error)

An error refers to the difference between an observed value and the true value from the underlying population, which is typically unobservable. A residual is the difference between an observed value and the value estimated by a model (for instance, a fitted regression line); unlike errors, residuals can actually be computed from the sample at hand.

Normalization vs. Standardization

Normalization is a statistical method that rescales all the values to lie between 0 and 1, putting them on the same scale.

Normalization Formula: x' = (x - min(x)) / (max(x) - min(x))

It is useful for algorithms that are sensitive to unequal variances and scales (e.g. k-nearest neighbors), though note that min-max normalization is itself sensitive to outliers, since the extreme values define the range.

Standardization refers to transforming a set of values so that they fall into a standard normal distribution with a mean of 0 and standard deviation of 1.
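Both transformations are one-liners with NumPy (toy data for illustration):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization: rescale the values to [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): shift and scale to mean 0, std 1.
x_std = (x - x.mean()) / x.std()

print(x_norm)          # [0.   0.25 0.5  0.75 1.  ]
print(x_std.mean())    # ~0.0
print(x_std.std())     # ~1.0
```

In practice you would fit these statistics (min/max or mean/std) on the training set only and reuse them on the test set, so that no test information leaks into preprocessing.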

Entropy vs. Gain

Entropy and gain are two closely related concepts that have the following formulaic relationship.

Gain(T, X) = Entropy(T) - Entropy(T, X)

where Entropy(T) is the entropy of the target T and Entropy(T, X) is the weighted entropy of T after splitting the dataset on feature X.

One application of these two concepts is in building decision tree models, where the information gain is based on the decrease in entropy after a dataset is split on a feature. The algorithm attempts to find the features that return the highest information gain during the training process.
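The relationship above can be computed by hand. A small sketch (the toy labels and feature values are invented for the example):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Gain(T, X) = Entropy(T) - weighted entropy of T split on X."""
    n = len(labels)
    split_entropy = 0.0
    for v in set(feature_values):
        subset = [y for y, x in zip(labels, feature_values) if x == v]
        split_entropy += (len(subset) / n) * entropy(subset)
    return entropy(labels) - split_entropy

labels  = ["yes", "yes", "no", "no"]
feature = ["sunny", "sunny", "rainy", "rainy"]   # perfectly separates the labels
print(information_gain(labels, feature))          # 1.0: all uncertainty removed
```

Because the feature separates the classes perfectly, each subset after the split is pure (entropy 0), so the gain equals the full initial entropy of 1 bit.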

What Is the Difference Between Epoch, Batch, and Iteration in Deep Learning?

  • Batch: A fragment of the dataset fed into a machine or deep learning model, usually because the entire dataset is too big to be input into the model at once. Batch size is the number of instances included in a single batch.
  • Iteration: One training step on a single batch. The number of iterations needed to cover the dataset is inversely proportional to the batch size.
  • Epoch: One full pass over the entire dataset.
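The arithmetic tying the three together (the dataset and batch sizes below are arbitrary examples):

```python
import math

dataset_size = 10_000
batch_size = 100
epochs = 5

# Iterations per epoch: how many batches it takes to see the dataset once.
# ceil() accounts for a final, possibly smaller, batch.
iterations_per_epoch = math.ceil(dataset_size / batch_size)
total_iterations = iterations_per_epoch * epochs

print(iterations_per_epoch)  # 100
print(total_iterations)      # 500
```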

TF-IDF

TF-IDF (term frequency-inverse document frequency) is a metric in the domain of Natural Language Processing (NLP) that evaluates how relevant a word is to a document in a collection of documents.

This metric is a product of term frequency and inverse document frequency and those two sub-metrics can be calculated as the following:

tf(t, d) = (number of times term t appears in document d) / (total number of terms in d)
idf(t) = log(N / number of documents containing t), where N is the number of documents
tf-idf(t, d) = tf(t, d) × idf(t)
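Those two sub-metrics are short enough to compute by hand. A sketch using the plain log-of-ratio idf variant (libraries such as scikit-learn use smoothed variants; the toy corpus is invented):

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf(term, doc):
    """Term frequency: share of the document's words that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(num docs / docs containing term)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "cat" appears in only one of the three documents, so it scores higher
# in docs[0] than "sat", which appears in two documents.
print(tf_idf("cat", docs[0], docs))
print(tf_idf("sat", docs[0], docs))
```

The intuition: a term that is frequent in one document but rare across the corpus is a good discriminator for that document, and tf-idf rewards exactly that combination.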

About the Author

Data Scientist. 1st Year PhD student in Informatics at UC Irvine.

Former research area specialist at the Criminal Justice Administrative Records System (CJARS) economics lab at the University of Michigan, working on statistical report generation, automated data quality review, building data pipelines, and data standardization & harmonization. Former Data Science Intern at Spotify Inc. (NYC).

He loves sports, working out, cooking good Asian food, watching K-dramas, making and performing music, and most importantly worshiping Jesus Christ, our Lord. Check out his website!
