<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Mrinal Gupta on Medium]]></title>
        <description><![CDATA[Stories by Mrinal Gupta on Medium]]></description>
        <link>https://medium.com/@mrinalgupta1704?source=rss-7cf002511db6------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*2udlXdIXAumWul8fcFVCEg.jpeg</url>
            <title>Stories by Mrinal Gupta on Medium</title>
            <link>https://medium.com/@mrinalgupta1704?source=rss-7cf002511db6------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 19:21:32 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@mrinalgupta1704/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[G.O.A.T Way to Store Images in SQL Database Using Multi-Threading]]></title>
            <link>https://medium.com/data-science/g-o-a-t-way-to-store-images-in-sql-database-using-multi-threading-3ef7281b6247?source=rss-7cf002511db6------2</link>
            <guid isPermaLink="false">https://medium.com/p/3ef7281b6247</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[Mrinal Gupta]]></dc:creator>
            <pubDate>Tue, 08 Feb 2022 04:27:44 GMT</pubDate>
            <atom:updated>2022-02-08T04:27:44.899Z</atom:updated>
            <content:encoded><![CDATA[<h4>Process thousands of images in minutes!</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*DJKMHnfz3e5pncms" /><figcaption>Photo by <a href="https://unsplash.com/@robingaillotdrevon?utm_source=medium&amp;utm_medium=referral">Robin GAILLOT-DREVON</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Have you ever come across a use case where you cannot use the parallel processing power of Spark’s executors? In this article, we are going to explore a cool way to use multithreading in Python for storing images in a SQL database. This use case cannot be handled in Spark alone, as there is no way to plot your data unless you convert the data frame to pandas format, and once you convert to pandas, you lose all the advantages of Spark. Therefore, to achieve a comparable execution speed (though perhaps not as fast as Spark), we can leverage multithreading in Python.</p><p>Upon reading this article, you will learn:</p><p>- How to multithread in Python</p><p>- How to efficiently store images in a SQL database</p><h3>Use-Case</h3><p>We are going to look at an hourly energy consumption dataset, where we need to save a consumption plot for each individual day over a duration of three months. Let’s look at the dataset, which I downloaded from <a href="https://www.kaggle.com/robikscube/hourly-energy-consumption/version/3">Kaggle</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/248/1*EjgUD5X-ltC7NSXwqKXoLA.png" /><figcaption>Image by author</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/785/1*1rCm0KfOAVUbJrKgw3zTmA.png" /><figcaption>Image by author</figcaption></figure><h3>Preparations</h3><h4>1. Install Libraries</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/283/1*RUFHnN_jQsrXszaYcHnAYw.png" /><figcaption>Image by author</figcaption></figure><p>You are probably familiar with most of the libraries listed above, so I’ll go through the less frequently used ones:</p><ol><li><a href="https://docs.python.org/3/library/concurrent.futures.html">Concurrent.futures</a> — Used for launching parallel threads in Python</li><li><a href="https://docs.python.org/3/library/itertools.html#itertools.repeat">Repeat</a> — Used to supply a stream of constant values</li><li><a href="https://docs.microsoft.com/en-us/sql/connect/python/pyodbc/python-sql-driver-pyodbc?view=sql-server-ver15">Pyodbc</a> — A Python module that uses an ODBC driver to connect to SQL Server. The following image shows how to connect to the SQL server, which requires all the credentials:</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/711/1*gVF7Ih_UKPxQvR-Fb6TCag.png" /></figure><h4>2. Make a folder in your directory to store images</h4><p>To store images in a SQL database, you first need to save each plot as a .png file.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/488/1*kc7RMeB7fRiSA-GYNi0Ixg.png" /></figure><h4>3. Generating and Storing Images</h4><p>Now that you have installed the libraries and connected to the SQL server, you can begin storing the images. There are two different approaches through which we can achieve this:</p><ol><li>Our beloved ‘For loop’:</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/846/1*15wjWVnJaFPB3fJ2cY6z5A.png" /><figcaption>Image by author</figcaption></figure><ul><li>The above code uses a for loop to iterate over all the unique days, creating a plot for each day and saving it in the folder that we created earlier</li><li>Afterward, we open each saved image and store it in the table we already created in the SQL database</li><li>This process takes a long time if you have thousands of images to process, so it is not a scalable approach to move forward with</li></ul><p>2. With Multi-threading:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/853/1*b0CfVvO6y7CGtnhxj5EpaA.png" /><figcaption>Image by author</figcaption></figure><p>The above code uses the concurrent.futures library to implement multi-threading. To the map function, you pass the ‘<em>plot_consumption</em>’ function, which generates an image for each date in `<em>list_dates</em>`, also passed as one of the arguments. Additionally, you can see how the dataframe is wrapped in the repeat function, which supplies the same dataframe to all the concurrent threads processing the individual days.</p>
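<p>Since the code above is shown as screenshots, here is a minimal runnable sketch of the multi-threaded approach. The table name, column names, and connection credentials are illustrative placeholders rather than the exact code from the images, and since matplotlib’s pyplot API is not strictly thread-safe, the non-interactive Agg backend is used:</p><pre>import concurrent.futures<br>from itertools import repeat<br>import matplotlib<br>matplotlib.use(&#39;Agg&#39;)  # non-interactive backend, safer when plotting from threads<br>import matplotlib.pyplot as plt<br>import pyodbc<br><br># placeholder credentials -- replace with your own server details<br>CONN_STR = &#39;DRIVER={ODBC Driver 17 for SQL Server};SERVER=my_server;DATABASE=my_db;UID=my_user;PWD=my_password&#39;<br><br>def plot_consumption(date, df):<br>    # plot one day&#39;s consumption and save it as a .png in the images folder<br>    day = df[df[&#39;date&#39;] == date]<br>    fig, ax = plt.subplots()<br>    ax.plot(day[&#39;hour&#39;], day[&#39;consumption_mw&#39;])<br>    path = &#39;images/{}.png&#39;.format(date)<br>    fig.savefig(path)<br>    plt.close(fig)<br>    # pyodbc connections should not be shared across threads, so open one per task<br>    conn = pyodbc.connect(CONN_STR)<br>    with open(path, &#39;rb&#39;) as f:<br>        conn.cursor().execute(&#39;INSERT INTO consumption_plots (plot_date, img) VALUES (?, ?)&#39;, date, f.read())<br>    conn.commit()<br>    conn.close()<br><br># df and list_dates come from the dataset loaded earlier<br>with concurrent.futures.ThreadPoolExecutor() as executor:<br>    # repeat(df) supplies the same dataframe to every thread; consuming<br>    # the map iterator surfaces any per-task exceptions<br>    results = list(executor.map(plot_consumption, list_dates, repeat(df)))</pre><p>Threads work well here because saving the plots to disk and inserting into the database are largely I/O-bound operations, so they overlap nicely despite Python’s GIL.</p>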
<h4>It’s a wrap! I hope you learned something new today and can implement it in your projects and/or at your workplace.</h4><h3>Thank you for reading!</h3><ul><li><em>If you like my writing, then please subscribe to my </em><a href="https://medium.com/subscribe/@mrinalgupta1704"><em>list</em></a></li><li><em>If you liked it, </em><a href="https://medium.com/@mrinalgupta1704"><em>follow me on Medium</em></a></li><li><em>Stay in touch on </em><a href="https://www.linkedin.com/in/mrinal-gupta-5319a9ab/"><em>LinkedIn</em></a></li></ul><h3>References</h3><ul><li><a href="https://www.pjm.com/Search%20Results.aspx?#q=hourly%20data&amp;sort=relevancy&amp;f:_E50F7924-B7B4-46B5-BA96-7E2D0F3D7882=[Csv]">PJM — Search Results</a> — Data Source</li><li><a href="https://www.kaggle.com/robikscube/hourly-energy-consumption">Hourly Energy Consumption | Kaggle</a></li><li><a href="https://creativecommons.org/publicdomain/zero/1.0/">License</a> to use the dataset</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3ef7281b6247" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/g-o-a-t-way-to-store-images-in-sql-database-using-multi-threading-3ef7281b6247">G.O.A.T Way to Store Images in SQL Database Using Multi-Threading</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Uncovering Data Science Interview Questions asked to me — Part 2]]></title>
            <link>https://medium.com/swlh/uncovering-data-science-interview-questions-asked-to-me-part-2-22f25fb031c8?source=rss-7cf002511db6------2</link>
            <guid isPermaLink="false">https://medium.com/p/22f25fb031c8</guid>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <dc:creator><![CDATA[Mrinal Gupta]]></dc:creator>
            <pubDate>Sun, 30 Jan 2022 21:55:24 GMT</pubDate>
            <atom:updated>2022-01-31T16:51:21.700Z</atom:updated>
            <content:encoded><![CDATA[<h3>Uncovering Data Science Interview Questions asked to me — Part 2</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*LT382l1xImdaFYEl" /><figcaption>Photo by <a href="https://unsplash.com/@kanereinholdtsen?utm_source=medium&amp;utm_medium=referral">Kane Reinholdtsen</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Hello readers, I hope you liked <a href="https://medium.com/@mrinalgupta1704/uncovering-all-data-science-questions-asked-to-me-part-1-8143a72cd247">Part-1</a> of this article series, where I am uncovering all the Data Science interview questions that I have been asked since I graduated with my Master’s. In Part-1, we saw questions on ML theory and a case study. In this part, I’ll go through some of the statistics and programming questions. Let’s get started!</p><h3>A) Statistics</h3><p><strong>1) Explain the p-value to a 10-year-old boy</strong></p><p>Ans. Suppose a person is one of the suspects in a theft crime. When he is caught, he simply denies that he is the one who did it and does not accept the charges. Now, assume the cop studied statistics in school and thinks of settling the matter using hypothesis testing. He writes two statements:</p><p>H0 — He is not guilty</p><p>Ha — He is guilty</p><p>He looks in the police database and finds 5 out of the 8 crimes in the past month in the suspect’s name. He then calculates the probability of the suspect not being guilty = 3/8. Here, that probability (3/8) is the p-value that we calculated. Let’s assume his threshold, i.e., the level of significance, is 4/8 (1/2). If the p-value is lower than the threshold, you reject the NULL hypothesis; otherwise, you fail to reject it. In this case, 3/8 is below 4/8, so the cop rejects the NULL hypothesis that the suspect is not guilty.</p><p><strong>2) How do you deal with NULL values?</strong></p><p>Ans. The key to answering this question lies in your logical reasoning as well as your understanding of the data. You can name different ways of dealing with missing values, namely imputation using the mean, KNN, dropping values, etc. However, I always like to answer this question using an example:</p><p>You first need to look at the percentage of null values present in the data; if it is &lt;20%, you should consider filling the values. Moreover, imputation shouldn’t be done blindly, as it may reduce the variance in the data. What you can do is impute by making groups using the other columns and taking the respective mean of each group. For example, if you want to fill a column with the heights of people, you can’t fill it with the mean of the whole table, as there is a difference between the heights of females and males. You can make two groups and fill each with its respective mean.</p><p><strong>3) How would you introduce the uncertainty in your final likelihood results?</strong></p><p>Ans. Carrying out a bootstrapped sampling technique on the final results would help us report statistically meaningful uncertainty around them.</p><p>The following link explains how we can perform bootstrap sampling:</p><p><a href="https://carpentries-incubator.github.io/machine-learning-novice-python/08-bootstrapping/index.html">https://carpentries-incubator.github.io/machine-learning-novice-python/08-bootstrapping/index.html</a></p><p><strong>4) What test would you carry out to check the difference between the data of heights of men and women?</strong></p><p>Ans. A two-sample independent t-test, where the hypotheses would be:</p><p>H0 — The difference between the mean heights of men and women is 0</p><p>Ha — The difference between the mean heights of men and women is not equal to zero</p>
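<p>As a quick aside, this test is essentially a one-liner with scipy; the height samples below are simulated purely for illustration:</p><pre>import numpy as np<br>from scipy import stats<br><br># simulated height samples in cm -- replace with real data<br>rng = np.random.default_rng(0)<br>men = rng.normal(175, 7, size=200)<br>women = rng.normal(162, 6, size=200)<br><br># equal_var=False gives Welch&#39;s t-test, which drops the equal-variance assumption<br>t_stat, p_value = stats.ttest_ind(men, women, equal_var=False)<br>print(t_stat, p_value)  # reject H0 when p_value is below your significance level</pre>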
<h3>B) Programming</h3><p><strong>1) Create a normally distributed histogram in Python</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/875/0*w2PGzA3Bo7aD3NFa.png" /><figcaption>Photo by author</figcaption></figure><p><strong>2) You are given an array </strong><strong>prices where </strong><strong>prices[i] is the price of a given stock on the </strong><strong>ith day.</strong></p><p><strong>You want to maximize your profit by choosing a single day to buy one stock and choosing a different day in the future to sell that stock.</strong></p><p><strong>Return <em>the maximum profit you can achieve from this transaction</em>. If you cannot achieve any profit, return </strong><strong>0.</strong></p><p>Ans. This question is available on Leetcode as well and was asked by C3.ai:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/875/0*KgHQjeL7u-O-GcaP.png" /><figcaption>Photo by author</figcaption></figure><p><strong>3) How would you reverse an integer?</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/875/0*bbNFlPvmmUzS8Yjv.png" /><figcaption>Photo by author</figcaption></figure><p><strong>4) Difference between list and tuple</strong></p><p>Ans. A list is a mutable data structure whereas a tuple is immutable.</p><p><strong>5) Difference between Union and Union All</strong></p><p>Ans. The only difference is that Union All keeps duplicate rows in the final merged table, whereas Union removes them.</p><p><strong>6) Difference between multi-processing and multi-threading</strong></p><p>Ans. Multiprocessing is a technique that uses multiple CPUs to increase the computing speed of the system, whereas in multithreading a single processor runs multiple threads concurrently over different code segments. An application of multithreading from my personal experience: when you want to plot consumption/demand curves for many unique customers in near real-time, you can run one thread per customer concurrently, cutting a lot of processing time.</p><p><strong>7) What is inheritance?</strong></p><p>Ans. It is an object-oriented programming technique where we define a class that inherits all the methods and attributes from a base class. This lets us take advantage of already defined classes and reuse them to achieve new functionality.</p><p>Inheritance is widely used by Data Scientists when they want to deploy their models into a production environment.</p>
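<p>A minimal illustration of inheritance (the class and method names here are invented for the example):</p><pre>class BaseModel:<br>    def __init__(self, name):<br>        self.name = name<br><br>    def predict(self, X):<br>        raise NotImplementedError<br><br>class MeanModel(BaseModel):<br>    # inherits __init__ (and the name attribute) from BaseModel<br>    def fit(self, y):<br>        self.mean_ = sum(y) / len(y)<br>        return self<br><br>    def predict(self, X):<br>        # overrides the base class method<br>        return [self.mean_] * len(X)<br><br>model = MeanModel(&#39;baseline&#39;).fit([2, 4, 6])<br>print(model.name, model.predict([0, 0]))  # baseline [4.0, 4.0]</pre>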
<h3>Thank you!</h3><p>If you like my work, please follow me on Medium to read more articles in the near future.</p><ul><li>Read my other articles on <a href="https://medium.com/@mrinalgupta1704/uncovering-all-data-science-questions-asked-to-me-part-1-8143a72cd247">Machine Learning Questions</a>, <a href="https://towardsdatascience.com/10-problems-to-practice-almost-all-sql-concepts-37545e7c5219">Top 10 SQL problems</a>, <a href="https://towardsdatascience.com/the-art-of-engineering-features-for-a-strong-machine-learning-model-a47a876e654c">Feature Engineering</a> &amp; <a href="https://towardsdatascience.com/learn-how-to-automate-the-basic-steps-of-data-analysis-45e118048172">Automating basic data analysis</a>.</li><li>Connect with me on <a href="https://www.linkedin.com/in/mrinal-gupta-5319a9ab/">LinkedIn</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=22f25fb031c8" width="1" height="1" alt=""><hr><p><a href="https://medium.com/swlh/uncovering-data-science-interview-questions-asked-to-me-part-2-22f25fb031c8">Uncovering Data Science Interview Questions asked to me — Part 2</a> was originally published in <a href="https://medium.com/swlh">The Startup</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Uncovering all Data Science Questions asked to me — Part 1]]></title>
            <link>https://medium.com/data-science/uncovering-all-data-science-questions-asked-to-me-part-1-8143a72cd247?source=rss-7cf002511db6------2</link>
            <guid isPermaLink="false">https://medium.com/p/8143a72cd247</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[interview]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Mrinal Gupta]]></dc:creator>
            <pubDate>Sat, 29 Jan 2022 16:30:20 GMT</pubDate>
            <atom:updated>2022-02-15T03:24:00.138Z</atom:updated>
            <content:encoded><![CDATA[<h3>Uncovering all Data Science Interview Questions asked to me — Part 1</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*PrOnX7bZ7Idp-NgX" /><figcaption>Photo by <a href="https://unsplash.com/@officestock?utm_source=medium&amp;utm_medium=referral">Sebastian Herrmann</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>It’s been about 15 months since I published my last article in Towards Data Science. A lot has happened during this period: a part-time job, conversion to a full-time Data Scientist role, and finally a switch to a new company in a new state kept me away from contributing here. Nevertheless, I am back with a new article containing a consolidated overview of all the Data Science questions that I have been asked across the companies I interviewed with. The companies include Oracle, C3.ai, Experian, Zest AI, Credit Suisse, Visa, and CVS Health, among others.</p><p>In this article, you can find questions in the following categories:</p><p>- ML Case Study — Part 1</p><p>- ML Theory questions — Part 1</p><p>- Statistics — Part 2</p><p>- Programming — Python &amp; SQL — Part 2</p><p>I hope this article will help you in preparing for your future interviews. Let’s get started with the fun part!</p><h3>A) ML Case Study</h3><p>In such case studies, asking the right questions is very important, as it shows the interviewer that you can think in the right direction and that you have the critical thinking skills to approach any problem.</p><p>I’ve been asked a couple of case study questions at Oracle (Utilities Division) and C3.ai:</p><p><strong>1) How would you determine houses that have electric vehicles from hourly electricity consumption data?</strong></p><p>The answer to this question is subjective. However, in my view, we can apply various unsupervised ML techniques, namely PCA, autoencoders, or clustering, to find the outliers with higher electricity consumption than their neighboring houses. It may also be helpful to mention which features you would create. To name a few, you may create aggregate consumption features to track the min, max, and average electricity consumption in the past 1, 3, 7, and 15 days, average consumption relative to the neighboring houses within the same zip code, etc.</p><p><strong>2) How would you predict an out-of-stock inventory list?</strong></p><p>1. It is important to ask for the market location the inventory serves, as that helps establish the size of the market and how the demographics of the market affect the stock.</p><p>2. Asking for the past year’s demand data would also be very helpful, as it would show us the various seasonalities, patterns, holiday demands, etc. necessary for modeling.</p><p>3. For feature engineering, you can introduce lags and one-hot encoded variables to account for any seasonality.</p><h3>B) ML Theory Questions</h3><p><strong>1) List different types of Regression &amp; Classification metrics.</strong></p><p>Ans. Regression Metrics — Mean squared error, root mean squared error, mean absolute error.</p><p>Classification Metrics — Accuracy, Precision, Recall, F1 Score, AUC, ROC.</p><p><strong>2) What are the pros and cons of Mean squared error?</strong></p><p>Ans.<strong> Cons:</strong></p><p>1. Affected by outliers</p><p>2. Loses interpretability if the values are high</p><p>3. Doesn’t tell you the direction of the error, as it is always positive</p><p><strong>Pros:</strong></p><p>1. Very easy to implement</p><p>2. Easy to numerically optimize</p><p><strong>3) Can you use Mean Absolute Error (MAE) as your loss function?</strong></p><p>Ans. MAE is not differentiable at zero, which makes it awkward for gradient-based optimization; in practice, subgradients or smooth alternatives such as the Huber loss are used instead.</p><p><strong>4) Can R-squared ever be negative? If yes, why. Write its formula.</strong></p><p>Ans. Yes, R-squared can be negative. It means that your model’s predictions fit the data worse than simply predicting the mean value.</p><p>Formula:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/406/0*nKKnTb_MXxfFefmF.jpeg" /></figure><p><strong>5) How do you perform cross-validation in time series data?</strong></p><p>Ans. The following link provides a great explanation of time series CV:</p><p><a href="https://otexts.com/fpp3/tscv.html">https://otexts.com/fpp3/tscv.html</a></p>
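<p>In code, scikit-learn’s TimeSeriesSplit implements one version of this expanding-window scheme:</p><pre>import numpy as np<br>from sklearn.model_selection import TimeSeriesSplit<br><br>X = np.arange(12).reshape(-1, 1)  # 12 chronologically ordered observations<br><br># each fold trains on past data only and validates on the block that follows it<br>tscv = TimeSeriesSplit(n_splits=3)<br>for train_idx, test_idx in tscv.split(X):<br>    print(&#39;train:&#39;, train_idx, &#39;test:&#39;, test_idx)</pre>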
<p><strong>6) Differentiate between Bagging and Boosting? (Asked in almost all the interviews)</strong></p><p>Ans. Bagging is short for Bootstrap Aggregation. It is a meta-algorithm where random samples of the training set are selected with replacement to build ‘m’ models. In the end, the results from the ‘m’ models are averaged in the case of regression or majority-voted in classification.</p><p>Boosting is another meta-algorithm that helps in boosting the accuracy of a single learner. This is done by training a series of weak learners that grow into a strong learner, with each subsequent weak learner correcting the errors of the previous ones.</p><p><strong>7) What is vanishing gradient?</strong></p><p>Ans. Vanishing gradient is a popular problem in artificial neural networks where a large change in the input of certain activation functions, like the sigmoid, results in a very small change in the output. As more and more layers are added to a network, the gradient of the loss function approaches zero, making the network harder to train.</p><p><strong>8) How do Support vector machines work?</strong></p><p>Ans. In SVM, the objective is to find the optimal hyperplane that maximizes the minimum distance between the plane and the nearest data points. This gives the selected hyperplane the best chance of segregating the data points into their respective classes.</p><p>More can be found at: <a href="https://www.analyticsvidhya.com/blog/2021/03/beginners-guide-to-support-vector-machine-svm/">https://www.analyticsvidhya.com/blog/2021/03/beginners-guide-to-support-vector-machine-svm/</a></p><p><strong>9) What are the assumptions of linear regression?</strong></p><p>Ans. There are mainly four assumptions of linear regression:</p><p>1. <strong>Linear relationship </strong>— There is a linear relationship between the independent and dependent variables</p><p>2. <strong>Normality</strong> — It assumes that all variables follow multivariate normality</p><p>3. <strong>No multicollinearity</strong> — The independent variables are not correlated with each other</p><p>4. <strong>Homoscedasticity</strong> — It assumes that the error terms have constant variance across all the values of the independent variables</p><p><strong>10) How would you tackle overfitting in Random Forest?</strong></p><p>Ans. Random Forest trains a series of uncorrelated, deeply grown trees, which is important for understanding how it could overfit. There are some major hyperparameters that you can play with:</p><p><strong>N_estimators</strong> — As each tree is deep, you need to make sure the number of trees is not very high. Personally, I like to keep the number around 100–200.</p><p><strong>Max_depth </strong>— Depth is important in all decision trees and shouldn’t be kept very high even in Random Forest. Tuning max_depth using grid search helps.</p><p><strong>Max_features </strong>— The trees are decorrelated through the use of a random subset of features, so one should not use all the features for training each tree; that defeats the purpose of Random Forest, and the model may start to overfit. An optimal number defined in textbooks is sqrt(# of features).</p><p>Apart from the above, you can play with other hyperparameters such as min_samples_split, min_samples_leaf, etc.</p>
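<p>A short sketch of what tuning these hyperparameters with a grid search might look like in scikit-learn; the dataset and grid values are illustrative:</p><pre>from sklearn.datasets import make_classification<br>from sklearn.ensemble import RandomForestClassifier<br>from sklearn.model_selection import GridSearchCV<br><br>X, y = make_classification(n_samples=300, n_features=20, random_state=0)<br><br># illustrative grid over the hyperparameters discussed above<br>param_grid = {<br>    &#39;n_estimators&#39;: [100, 200],<br>    &#39;max_depth&#39;: [4, 8, None],<br>    &#39;max_features&#39;: [&#39;sqrt&#39;, 0.5],<br>}<br>search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)<br>search.fit(X, y)<br>print(search.best_params_)</pre>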
<p><strong>11) How would you tackle overfitting in Neural Networks?</strong></p><p>Ans. There are many ways to tackle overfitting in Neural Networks:</p><p>1. <strong>Simplifying the model</strong> — Reducing the number of nodes and hidden layers to make the network less complex should be your first instinct</p><p>2. <strong>Regularization</strong> — Ridge (L2), Lasso (L1), and elastic net are some of the common regularization techniques used to penalize large coefficients</p><p>3. <strong>Early Stopping </strong>— While training neural networks, a point comes when the validation error starts increasing after having decreased for a number of iterations; we can stop training the model there, as that point essentially indicates the beginning of overfitting</p><p><strong>12) Define learning rate in gradient boosting.</strong></p><p>Ans. Gradient boosted trees train a series of weak learners, which means each tree has some control over the overall result. The learning rate becomes a critical component here, as it controls the amount of change that each tree makes to the result. The higher the learning rate, the faster the training, and vice versa.</p><p><strong>13) What is the relationship between the learning rate &amp; the number of estimators?</strong></p><p>Ans. They are inversely related: if the learning rate is very low, we need a higher number of estimators to reach the final result, and vice versa.</p><p><strong>14) What are the different feature selection techniques?</strong></p><p>Ans. <strong>Filter methods</strong> — Filter methods use statistical measures to evaluate the relationship (correlation) of two distributions; they measure the correlation between the distribution of each feature across classes and the dependent variable. The features chosen are the ones with the highest correlation with the dependent variable, e.g., via the Kolmogorov-Smirnov test.</p><p><strong>Wrapper Methods </strong>— Wrapper methods utilize statistical models to evaluate the performance of each feature (or a subset of features) based on a performance metric (accuracy, AUC, F1 score, etc.). A common wrapper method is recursive feature elimination, in which a model recursively uses smaller and smaller sets of features until a desired number of features is reached.</p><p><strong>Embedded Methods </strong>— Embedded methods perform feature elimination as the model is built. A common embedded method for feature selection is regularization, in which a norm is included in the loss function of a statistical model to penalize the number of features used.</p><p><strong>15) How to determine your model is overfitting?</strong></p><p>Ans. You can determine overfitting by plotting the learning curves, which compare model performance on the train and test data. If the gap between the train and test curves increases with higher model complexity, that indicates overfitting.</p>
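<p>One way to produce such a plot with scikit-learn, using an illustrative dataset and sweeping tree depth as the complexity axis:</p><pre>import matplotlib.pyplot as plt<br>from sklearn.datasets import make_classification<br>from sklearn.model_selection import validation_curve<br>from sklearn.tree import DecisionTreeClassifier<br><br>X, y = make_classification(n_samples=500, random_state=0)<br><br># score train and validation folds as tree depth (model complexity) grows<br>depths = [1, 2, 4, 8, 16]<br>train_scores, val_scores = validation_curve(<br>    DecisionTreeClassifier(random_state=0), X, y,<br>    param_name=&#39;max_depth&#39;, param_range=depths, cv=5)<br><br>plt.plot(depths, train_scores.mean(axis=1), label=&#39;train&#39;)<br>plt.plot(depths, val_scores.mean(axis=1), label=&#39;validation&#39;)<br>plt.xlabel(&#39;max_depth&#39;)<br>plt.ylabel(&#39;accuracy&#39;)<br>plt.legend()<br>plt.show()  # a widening gap between the two curves signals overfitting</pre>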
<p><strong>16) What is the effect of multi-collinearity on feature importances of XGBoost?</strong></p><p>Ans. Multi-collinearity has a huge effect on feature importances: if two variables are highly correlated with each other, one variable compensates for the absence of the other in the feature importance scores. Due to this, the importance score decreases for what could be a very important feature, and you may end up dropping it.</p><p><strong>17) What is the effect of multi-collinearity on model performances and model interpretation?</strong></p><p>Ans. Multi-collinearity makes it harder to interpret your coefficients, as they become very sensitive to small changes in the model.</p><p><strong>18) What is the effect of a higher number of features than the number of rows?</strong></p><p>Ans. To answer this, you can give an example: if you have two columns and only one data point, there can be infinitely many lines that are a solution to that case. In other words, there won’t be a unique solution to the problem. Hence, it is important to have more rows than columns. However, one can use techniques like ridge and lasso to tackle such cases.</p><p><strong>19) What is the difference between Ridge and Lasso regression?</strong></p><p>Ans. In Ridge, the penalty term is the sum of squares of the coefficients whereas, in Lasso, it is the sum of the absolute values of the coefficients.</p><p><strong>20) Explain overfitting to a non-technical audience.</strong></p><p>Ans. Let’s assume you have a maths exam tomorrow for which you have practiced all the book problems and have somehow memorized most of the solutions. However, during the exam, the questions asked are a little different from the ones that you memorized, and you can’t score well. This is essentially what happens in overfitting, where the model learns the training data so well that it can’t perform well on test data it has never seen.</p><p>If you have reached this point, then thank you so much for reading my article. I’ll be back with Part 2 of this article, answering statistics and programming questions. Stay tuned!</p><h3>Thank you!</h3><p>If you like my work, please follow me on Medium to read more articles in the near future.</p><ul><li>Read my other articles on <a href="https://towardsdatascience.com/10-problems-to-practice-almost-all-sql-concepts-37545e7c5219">Top 10 SQL problems</a>, <a href="https://towardsdatascience.com/the-art-of-engineering-features-for-a-strong-machine-learning-model-a47a876e654c">Feature Engineering</a> &amp; <a href="https://towardsdatascience.com/learn-how-to-automate-the-basic-steps-of-data-analysis-45e118048172">Automating basic data analysis</a>.</li><li>Would love to connect with you on <a href="https://www.linkedin.com/in/mrinal-gupta-5319a9ab/">LinkedIn</a>.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8143a72cd247" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/uncovering-all-data-science-questions-asked-to-me-part-1-8143a72cd247">Uncovering all Data Science Questions asked to me — Part 1</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Dog Breed Classifier Using Convolutional Neural Networks]]></title>
            <link>https://medium.com/swlh/dog-breed-classifier-using-convolutional-neural-networks-6052edfab487?source=rss-7cf002511db6------2</link>
            <guid isPermaLink="false">https://medium.com/p/6052edfab487</guid>
            <category><![CDATA[udacity]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <dc:creator><![CDATA[Mrinal Gupta]]></dc:creator>
            <pubDate>Fri, 21 Aug 2020 17:14:05 GMT</pubDate>
            <atom:updated>2020-08-22T00:36:37.338Z</atom:updated>
            <content:encoded><![CDATA[<h4>Do you also want to identify the breed of any dog in just 5 seconds?</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*OacRr8sB6PSc945I" /><figcaption>Photo by <a href="https://unsplash.com/@nate_dumlao?utm_source=medium&amp;utm_medium=referral">Nathan Dumlao</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>Introduction</h3><p>Do you know the breed of the dog in the picture above? If you don’t, then it’s completely fine, because I don’t know either. Well, we come across a lot of different dog breeds while walking down the street, and the second thing we want to know is the breed (wondering what the first thing is?! The name!). So why waste time? Let’s take some help from one of the most popular machine learning methods, the Convolutional Neural Network (CNN), to detect the breed of a dog. In this article, we will review a full algorithm to detect the breed of a dog using the given <a href="https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip"><strong>Dataset</strong></a>. We will also see how to use a pre-trained ResNet-50 model to detect whether an image contains a dog at all.</p><h3>Step by Step!</h3><ul><li>Import the Dataset</li><li>Detect Humans using CV2</li><li>Detect Dogs</li><li>Create a CNN to classify Dog Breeds (from Scratch)</li><li>Use a CNN to Classify Dog Breeds (using Transfer Learning)</li><li>Create a CNN to Classify Dog Breeds (using Transfer Learning)</li><li>Write the Algorithm</li><li>Test the Algorithm</li></ul><h3>STEP-1 Import the Dataset</h3><p>Our dataset contains 8,351 dog images in total, across 133 different breed categories.</p><pre><strong>def </strong>load_dataset(path):<br>    data = load_files(path)<br>    dog_files = np.array(data[&#39;filenames&#39;])<br>    dog_targets = np_utils.to_categorical(np.array(data[&#39;target&#39;]), 133)<br>    <strong>return </strong>dog_files, dog_targets</pre><p>After calling the above function with the path where the images are stored, dog_files will contain the paths of all the images in the whole dataset, and dog_targets will contain the 133 one-hot encoded target variables. Let’s load the train, test, and validation sets using the above function.</p><pre>train_files, train_targets = load_dataset(&#39;../../../data/dog_images/train&#39;)<br>valid_files, valid_targets = load_dataset(&#39;../../../data/dog_images/valid&#39;)<br>test_files, test_targets = load_dataset(&#39;../../../data/dog_images/test&#39;)</pre><h3>STEP-2 Detect Humans</h3><p>So that a human disguised in a dog costume cannot fool our classification results, we will detect humans through OpenCV’s implementation of Haar feature-based cascade classifiers. The following code implements this:</p><pre>import cv2<br><em># extract pre-trained face detector<br># cv2.CascadeClassifier is the model for detecting faces</em><br>face_cascade = cv2.CascadeClassifier(&#39;haarcascades/haarcascade_frontalface_alt.xml&#39;)</pre><pre><em># load color (BGR) image<br># cv2.imread(image_file_name) reads an image</em><br>img = cv2.imread(human_files[3])</pre><pre><em># convert BGR image to grayscale</em><br>gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)</pre><pre><em># find faces in image</em><br>faces = face_cascade.detectMultiScale(gray)</pre><pre><em># print number of faces detected in the image</em><br>print(&#39;Number of faces detected:&#39;, len(faces))</pre><p>You can expect to see something like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/367/1*oqhYMIo-88FK2r2bkzRO3Q.png" /><figcaption>Image by Author</figcaption></figure><p>On the test data, the detector flagged 11% of the dog images as containing human faces.</p>
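<p>The snippet above can be wrapped into the face_detector helper that the final algorithm later in this article relies on; a minimal sketch:</p><pre><em># returns True if at least one face is detected in the image at img_path</em><br>def face_detector(img_path):<br>    img = cv2.imread(img_path)<br>    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)<br>    faces = face_cascade.detectMultiScale(gray)<br>    return len(faces) &gt; 0</pre>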
<h3>STEP-3 Detect Dogs</h3><p>In this section, we will use one of the most powerful CNN architectures available, ResNet-50. It is pre-trained on ImageNet, a very large, very popular dataset used for image classification and other vision tasks. We will use this pre-trained model to detect whether an image contains a dog or not.</p><pre><em># This line will download the ResNet-50 model, along with weights that have been trained on ImageNet</em><br>from keras.applications.resnet50 import ResNet50</pre><pre><em># define ResNet50 model</em><br>ResNet50_model = ResNet50(weights=&#39;imagenet&#39;)</pre><h4>Data Pre-processing</h4><p>When using TensorFlow as the backend, Keras CNNs require a 4D array (which we’ll also refer to as a 4D tensor) as input, with shape:</p><pre>		(nb_samples,rows,columns,channels)</pre><p>where nb_samples corresponds to the total number of images (or samples), and rows, columns, and channels correspond to the number of rows, columns, and channels for each image, respectively.</p><p>The path_to_tensor function below takes a string-valued file path to a color image as input and returns a 4D tensor suitable for supplying to a Keras CNN. The function first loads the image and resizes it to a square image of 224×224 pixels. Next, the image is converted to an array, which is then reshaped into a 4D tensor. In this case, since we are working with color images, each image has three channels. Likewise, since we are processing a single image (or sample), the returned tensor will always have shape</p><pre>			(1,224,224,3)</pre><p>The paths_to_tensor function takes a numpy array of string-valued image paths as input and returns a 4D tensor with shape</p><pre>		    (nb_samples,224,224,3)</pre><p>Here, nb_samples is the number of samples, or number of images, in the supplied array of image paths. It is best to think of nb_samples as the number of 3D tensors (where each 3D tensor corresponds to a different image) in your dataset!</p>
<p>The following code performs the data pre-processing:</p><pre>from keras.preprocessing import image<br>from tqdm import tqdm</pre><pre>def path_to_tensor(img_path):<br>    <em># loads RGB image as PIL.Image.Image type</em><br>    img = image.load_img(img_path, target_size=(224, 224))<br>    <em># convert PIL.Image.Image type to 3D tensor with shape (224, 224, 3)</em><br>    x = image.img_to_array(img)<br>    <em># convert 3D tensor to 4D tensor with shape (1, 224, 224, 3) and return 4D tensor</em><br>    return np.expand_dims(x, axis=0)</pre><pre>def paths_to_tensor(img_paths):<br>    list_of_tensors = [path_to_tensor(img_path) for img_path in tqdm(img_paths)]<br>    return np.vstack(list_of_tensors)</pre><p>Predicting using ResNet50:</p><pre>from keras.applications.resnet50 import preprocess_input, decode_predictions</pre><pre>def ResNet50_predict_labels(img_path):<br>    <em># returns prediction vector for image located at img_path</em><br>    img = preprocess_input(path_to_tensor(img_path))<br>    return np.argmax(ResNet50_model.predict(img))</pre><pre><em># returns &quot;True&quot; if a dog is detected in the image stored at img_path</em><br>def dog_detector(img_path):<br>    prediction = ResNet50_predict_labels(img_path)<br>    return ((prediction &lt;= 268) &amp; (prediction &gt;= 151))</pre><p>Note: ImageNet categories 151–268 correspond to dog breeds, so the function returns True only when the top predicted category falls in that range; since ImageNet is a huge dataset, we are only concerned with its dog categories here. Also, the test results were as expected, and we got 100% accuracy in detecting whether a given image is of a dog or not.</p><h3>STEP-4 Creating your own CNN from scratch</h3><p>Now it’s time to create our own CNN architecture, deciding everything from the number of convolutional and max pooling layers to the other parameters. The following is the architecture that you can build:</p><pre>model = Sequential()<br>model.add(Conv2D(filters = 6, kernel_size=5, strides=1, padding=&#39;same&#39;, activation=&#39;relu&#39;, input_shape = (224,224,3)))<br>model.add(MaxPooling2D(pool_size=2))</pre><pre>model.add(Conv2D(filters=16, kernel_size=5, activation=&#39;relu&#39;, padding=&#39;same&#39;, strides=1))<br>model.add(MaxPooling2D(pool_size=2))</pre><pre>model.add(Dropout(0.2))<br>model.add(Flatten())</pre><pre>model.add(Dense(200, activation=&#39;relu&#39;))<br>model.add(Dropout(0.4))</pre><pre>model.add(Dense(133,activation=&#39;softmax&#39;))</pre><pre>model.summary()</pre><p>Features of my architecture:</p><ul><li>My architecture contains two convolutional layers to extract the features from the images.</li><li>The first convolutional layer is made up of 6 filters of size 5x5 with ReLU as the activation function, since it mitigates the vanishing gradient problem we face with sigmoid. 
Similarly, the second convolutional layer has 16 filters of the same size and the same activation function.</li><li>To reduce the number of parameters and keep only the most important features, a max pooling layer of size 2x2 is added after each convolutional layer.</li><li>Further, a Dropout layer with a probability of 0.2 is added in order to prevent overfitting.</li><li>Towards the end, we have a fully connected layer with 200 nodes and a ReLU activation function.</li><li>Another dropout layer with a probability of 0.4 is added to further guard against overfitting.</li><li>Finally, we have the output layer with the number of nodes equal to the number of dog breeds in the dataset, with a softmax activation function to predict the probabilities of the different breeds.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/658/1*3Ov8V8zLWS999T7CtwjFaw.png" /><figcaption>Image by author</figcaption></figure><h4>Training the CNN</h4><p>To train our model, we will use 20 epochs with a batch size of 20, saving the best model weights using the ModelCheckpoint callback. The following is the code:</p><pre><em># Compile the model</em><br>model.compile(optimizer=&#39;rmsprop&#39;, loss=&#39;categorical_crossentropy&#39;, metrics=[&#39;accuracy&#39;])</pre><pre>epochs = 20</pre><pre>checkpointer = ModelCheckpoint(filepath=&#39;saved_models/weights.best.from_scratch.hdf5&#39;, verbose=1, save_best_only=True)</pre><pre>model.fit(train_tensors, train_targets, validation_data=(valid_tensors, valid_targets), epochs=epochs, batch_size=20, callbacks=[checkpointer], verbose=1)</pre><pre><em># load the best weights</em><br>model.load_weights(&#39;saved_models/weights.best.from_scratch.hdf5&#39;)</pre><pre><em># Test the model<br># get index of predicted dog breed for each image in test set</em><br>dog_breed_predictions = [np.argmax(model.predict(np.expand_dims(tensor, axis=0))) for tensor in test_tensors]</pre><pre><em># report test accuracy</em><br>test_accuracy = 100*np.sum(np.array(dog_breed_predictions)==np.argmax(test_targets, axis=1))/len(dog_breed_predictions)<br>print(&#39;Test accuracy: %.4f%%&#39; % test_accuracy)</pre><p>The test accuracy that I got was 5.8612%, which is modest but well above random guessing across 133 breeds. You can increase the number of epochs and try hyperparameter tuning to improve it further.</p><h3>STEP-5 Using Transfer learning to classify breeds</h3><p>Transfer learning is a technique that saves the time of building a CNN from scratch: we extract the bottleneck features from a pre-trained classifier and make a few modifications to that model to use it for a different dataset. For example, we can remove the final dense layer from that model and add another dense layer with a different number of outputs using a softmax activation function. Here, we are going to use VGG16, extract features from it, feed them to a global average pooling layer to decrease the number of parameters, and finally add the fully connected layer with 133 output nodes using a softmax activation function.</p><h4>Bottleneck features</h4><pre>bottleneck_features = np.load(&#39;bottleneck_features/DogVGG16Data.npz&#39;)<br>train_VGG16 = bottleneck_features[&#39;train&#39;]<br>valid_VGG16 = bottleneck_features[&#39;valid&#39;]<br>test_VGG16 = bottleneck_features[&#39;test&#39;]</pre><h4>Add layers at the end</h4><pre>VGG16_model = Sequential()<br>VGG16_model.add(GlobalAveragePooling2D(input_shape=train_VGG16.shape[1:]))<br>VGG16_model.add(Dense(133, activation=&#39;softmax&#39;))</pre>
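<p>The training call for this model isn’t shown in the article; it would mirror the compile/fit/checkpoint pattern used for the scratch model, roughly as follows (the checkpoint filename here is illustrative):</p><pre>VGG16_model.compile(loss=&#39;categorical_crossentropy&#39;, optimizer=&#39;rmsprop&#39;, metrics=[&#39;accuracy&#39;])</pre><pre>checkpointer = ModelCheckpoint(filepath=&#39;saved_models/weights.best.VGG16.hdf5&#39;, verbose=1, save_best_only=True)</pre><pre>VGG16_model.fit(train_VGG16, train_targets, validation_data=(valid_VGG16, valid_targets), epochs=20, batch_size=20, callbacks=[checkpointer], verbose=1)</pre><pre><em># load the best weights saved during training</em><br>VGG16_model.load_weights(&#39;saved_models/weights.best.VGG16.hdf5&#39;)</pre>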
<p>After compiling and fitting the model in this way and evaluating it on the test dataset, you should get an accuracy of around 43%.</p><h3>STEP-6 Create your own CNN to classify Dog breeds using Transfer learning</h3><p>For this section, let’s use ResNet-50. The steps remain the same as in the previous section, but this time the model will be different and, hopefully, the accuracy too. ResNet-50 is 50 layers deep and hence very powerful in image classification problems, as discussed earlier.</p><h4>Model Architecture</h4><pre>ResNet50_model_transfer = Sequential()<br>ResNet50_model_transfer.add(GlobalAveragePooling2D(input_shape=train_ResNet.shape[1:]))<br>ResNet50_model_transfer.add(Dense(133, activation=&#39;softmax&#39;))</pre><pre>ResNet50_model_transfer.summary()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/657/1*03D9qw8rbTwLJ35dxh31PQ.png" /><figcaption>Architecture image by author</figcaption></figure><p>After running it with 20 epochs and a batch size of 20, you will observe a test accuracy of around 82%.</p><p>Other models that you can try are the following:</p><ul><li><a href="https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/DogVGG19Data.npz">VGG-19</a></li><li><a href="https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/DogInceptionV3Data.npz">Inception</a></li><li><a href="https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/DogXceptionData.npz">Xception</a></li></ul><h3>STEP-7 Write your Algorithm</h3><p>Now, we will combine all the above steps into a complete algorithm that does the following:</p><ul><li>if a <strong>dog</strong> is detected in the image, return the predicted breed.</li><li>if a <strong>human</strong> is detected in the image, return the resembling dog breed.</li><li>if <strong>neither</strong> is detected in the image, provide output that indicates an error.</li></ul><p>The following function performs the above tasks:</p><pre>from PIL import Image</pre><pre>def dog_classifier(img_path):<br>    if dog_detector(img_path):<br>        print(&#39;Lemme guess, Hey! You are a Dog!&#39;)<br>        image = Image.open(img_path)<br>        plt.imshow(image, interpolation=&#39;nearest&#39;)<br>        plt.axis(&#39;off&#39;)<br>        plt.show()<br>        breed = Resnet_predict_breed(img_path)<br>        print(&#39;OMG! You are {}\n\n&#39;.format(breed))<br>    <br>    elif face_detector(img_path):<br>        print(&#39;Lemme guess, Hey! You are Human!&#39;)<br>        image = Image.open(img_path)<br>        plt.imshow(image, interpolation=&#39;nearest&#39;)<br>        plt.axis(&#39;off&#39;)<br>        plt.show()<br>        breed = Resnet_predict_breed(img_path)<br>        print(&#39;Hahahha! You look like {}\n\n&#39;.format(breed))<br>    <br>    else:<br>        print(&#39;Sorry, you are neither a dog nor a human&#39;)<br>        image = Image.open(img_path)<br>        plt.imshow(image, interpolation=&#39;nearest&#39;)<br>        plt.axis(&#39;off&#39;)<br>        plt.show()</pre>
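<p>Note that dog_classifier calls a Resnet_predict_breed helper that isn’t defined above. A sketch consistent with the transfer-learning model from Step 6, assuming an extract_Resnet50 bottleneck-feature helper and a dog_names list holding the 133 breed labels:</p><pre>def Resnet_predict_breed(img_path):<br>    <em># extract the bottleneck features for this image and predict the breed</em><br>    bottleneck_feature = extract_Resnet50(path_to_tensor(img_path))<br>    predicted_vector = ResNet50_model_transfer.predict(bottleneck_feature)<br>    <em># dog_names maps output indices to breed names</em><br>    return dog_names[np.argmax(predicted_vector)]</pre>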
<h3>STEP-8 Testing Time!</h3><p>Let’s test some images and see the results:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/430/1*EH02pAFMYLbgrllz-aWgwg.png" /><figcaption>Image by author</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/343/1*HRSt7Kdk35T9O676A9KKiQ.png" /><figcaption>Image by author</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/437/1*yITsj4kmMp3Sih-PgptlLg.png" /><figcaption>Image by author</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/484/1*kZ27vAmHeGyMc44I8sam-Q.png" /><figcaption>Image by author</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/508/1*C8xpsCGXKj8xjZ6cpJBTmw.png" /><figcaption>Image by author</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/429/1*qdI8Teu95Rf5KAEreTnjJg.png" /><figcaption>Image by author</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/475/1*SRjhcmPvYw_dBVSYRWmFiA.png" /></figure><h4>Areas of improvement</h4><ul><li>A few cat breeds resemble certain dog breeds, as we can see in the last picture. The model still predicted the cat to be a dog, so we could improve this behavior with better hyperparameters and/or a deeper CNN.</li><li>Detecting a dog and a human together in an image, even when they are not facing the camera, would be a great feature to add. We could do this by augmenting the training set with images where the dog is not facing the camera, and similarly for humans.</li><li>I tried an animated dog image, which the model did not recognise. Predicting the breed from an animated image of a dog would be quite interesting to see.</li></ul><h3>Conclusion</h3><p>I hope you enjoyed going through my article and learned something new today. The whole code for this project is uploaded to my <a href="https://github.com/mrinal1704/Dog-Breed-Classifier-using-CNN"><strong>GitHub</strong></a>. Please feel free to have a look at it.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6052edfab487" width="1" height="1" alt=""><hr><p><a href="https://medium.com/swlh/dog-breed-classifier-using-convolutional-neural-networks-6052edfab487">Dog Breed Classifier Using Convolutional Neural Networks</a> was originally published in <a href="https://medium.com/swlh">The Startup</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[10 problems to practice almost all SQL concepts]]></title>
            <link>https://medium.com/data-science/10-problems-to-practice-almost-all-sql-concepts-37545e7c5219?source=rss-7cf002511db6------2</link>
            <guid isPermaLink="false">https://medium.com/p/37545e7c5219</guid>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Mrinal Gupta]]></dc:creator>
            <pubDate>Mon, 06 Jul 2020 08:41:24 GMT</pubDate>
            <atom:updated>2020-07-06T15:51:01.415Z</atom:updated>
            <content:encoded><![CDATA[<h3>Top 10 problems to practice almost all SQL concepts</h3><h4>Covers all the SQL concepts of JOINs, aggregates, window functions, and subqueries</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ipRIWc6d0q0K928I" /><figcaption>Photo by <a href="https://unsplash.com/@alexacea?utm_source=medium&amp;utm_medium=referral">Alexandru Acea</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>Introduction</h3><p>I recently completed all 117 SQL questions on Leetcode in 25 days. Leetcode is probably one of the most popular websites for practicing your coding skills in various programming languages, and it neatly sorts the questions into three categories, namely Easy, Medium, and Hard, with the level of difficulty rising at each subsequent level. After completing all of them, I decided to highlight 10 questions that together cover almost all the concepts, ranging from basic to advanced SQL, which you can practice in order to brush up your SQL programming skills. Additionally, all of these questions have been asked in interviews at almost all the big tech companies.</p><h4>The following is the breakdown of SQL skills tested in every question:</h4><ul><li><strong>Q1 Average Salary — </strong>CTE, aggregates in window functions, CASE WHEN, date functions such as DATE_PART, INNER JOIN</li><li><strong>Q2 Find Quiet Students in Results — </strong>Subqueries, MIN, MAX, window functions, window alias, INNER JOIN, ALL keyword</li><li><strong>Q3 Human Traffic of Stadium — </strong>LEFT JOIN with subqueries, CTE, ROW_NUMBER</li><li><strong>Q4 Number of Transactions per Visit — </strong>RECURSIVE CTE, COALESCE, COUNT</li><li><strong>Q5 Report Contiguous Dates (MySQL) — </strong>DATE_SUB, ROW_NUMBER</li><li><strong>Q6 Sales by Day of the Week — </strong>Pivot table, CASE WHEN</li><li><strong>Q7 Department Top 3 Salaries — </strong>DENSE_RANK</li><li><strong>Q8 Restaurant Growth — </strong>PRECEDING for moving average, OFFSET</li><li><strong>Q9 Shortest Distance in a Plane — </strong>CROSS JOIN, SQRT, POW</li><li><strong>Q10 Consecutive Numbers — </strong>LAG, LEAD</li></ul><p>So, let’s get down to business!</p><ol><li><strong><em>Given the two tables below, write a query to display the comparison result (higher/lower/same) of the average salary of employees in a department to the company’s average salary.</em></strong></li></ol><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b5f8866f84014d770c858ddd99ccd62e/href">https://medium.com/media/b5f8866f84014d770c858ddd99ccd62e/href</a></iframe><p><strong><em>Solution 1:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e333a7af87b5aa14f38a5c2cc9765e0c/href">https://medium.com/media/e333a7af87b5aa14f38a5c2cc9765e0c/href</a></iframe><p><strong><em>2.</em></strong> <strong><em>Write an SQL query to report the students (student_id, student_name) being “quiet” in ALL exams. 
A “quiet” student is one who took at least one exam and scored neither the highest nor the lowest score.</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/052921e037d0cbd8c0abf017615697b3/href">https://medium.com/media/052921e037d0cbd8c0abf017615697b3/href</a></iframe><p><strong><em>Solution 2</em></strong>:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/344c185eeccd96fd7ebdc80eaac30342/href">https://medium.com/media/344c185eeccd96fd7ebdc80eaac30342/href</a></iframe><p><strong><em>3. Write a query to display the records which have 3 or more consecutive rows and a number of people of 100 or more.</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a3a6004f0f006c4ac46fbb64adf2c236/href">https://medium.com/media/a3a6004f0f006c4ac46fbb64adf2c236/href</a></iframe><p><strong><em>Solution 3:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/61a85701703338f962eb34a9114dda0f/href">https://medium.com/media/61a85701703338f962eb34a9114dda0f/href</a></iframe><p><strong><em>4. Write an SQL query to find how many users visited the bank and didn’t do any transactions, how many visited the bank and did one transaction, and so on.</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/044e66f63e7a2316d46683715fd1003c/href">https://medium.com/media/044e66f63e7a2316d46683715fd1003c/href</a></iframe><p><strong><em>Solution 4:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d2de7cc3a744a02fd725f0fd20f14131/href">https://medium.com/media/d2de7cc3a744a02fd725f0fd20f14131/href</a></iframe><p><strong><em>5. Write an SQL query to generate a report of period_state for each continuous interval of days in the period from 2019-01-01 to 2019-12-31.</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/fc4b83a4b2b4d268c8c6b3a80624a536/href">https://medium.com/media/fc4b83a4b2b4d268c8c6b3a80624a536/href</a></iframe><p><strong><em>Solution 5:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c4f695c755f9bb1722cf28b8a26263c2/href">https://medium.com/media/c4f695c755f9bb1722cf28b8a26263c2/href</a></iframe><p><strong><em>6. Write an SQL query to report how many units in each category have been ordered on each day of the week.</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/84a5747682b8d30d35cea495157e5c0c/href">https://medium.com/media/84a5747682b8d30d35cea495157e5c0c/href</a></iframe><p><strong><em>Solution 6:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/93ff5f4b1b4e5354e597a0503cde51c3/href">https://medium.com/media/93ff5f4b1b4e5354e597a0503cde51c3/href</a></iframe><p><strong><em>7. Write an SQL query to find the employees who earn the top three salaries in each of the departments. For the above tables, your SQL query should return the following rows (the order of rows does not matter).</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f563c243bab1b6a5f4f3fc37e521b3a0/href">https://medium.com/media/f563c243bab1b6a5f4f3fc37e521b3a0/href</a></iframe><p><strong><em>Solution 7:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4a8edacae76049073c4929c478f14084/href">https://medium.com/media/4a8edacae76049073c4929c478f14084/href</a></iframe>
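<p>Since the embedded gists may not render in every RSS reader, here is the core pattern behind Solution 7, with table and column names kept illustrative:</p><pre>SELECT d.Name AS Department, e.Name AS Employee, e.Salary<br>FROM (SELECT *,<br>             DENSE_RANK() OVER (PARTITION BY DepartmentId ORDER BY Salary DESC) AS rnk<br>      FROM Employee) e<br>JOIN Department d ON e.DepartmentId = d.Id<br>WHERE e.rnk &lt;= 3;  -- DENSE_RANK keeps ties, so tied salaries all qualify</pre>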
<p><strong><em>7. Write an SQL query to find the employees who earn the top three salaries in each of the departments. For the above tables, your SQL query should return the following rows (order of rows does not matter).</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f563c243bab1b6a5f4f3fc37e521b3a0/href">https://medium.com/media/f563c243bab1b6a5f4f3fc37e521b3a0/href</a></iframe><p><strong><em>Solution 7:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4a8edacae76049073c4929c478f14084/href">https://medium.com/media/4a8edacae76049073c4929c478f14084/href</a></iframe><p><strong><em>8. Write an SQL query to compute the moving average of how much customers paid in a 7-day window (the current day plus the 6 days before).</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/76faea6e2eaa476d1f57cabc8db256b9/href">https://medium.com/media/76faea6e2eaa476d1f57cabc8db256b9/href</a></iframe><p><strong><em>Solution 8:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/23c2d5db9ac862068e160027773f5dee/href">https://medium.com/media/23c2d5db9ac862068e160027773f5dee/href</a></iframe><p><strong><em>9. Write a query to find the shortest distance between these points, rounded to 2 decimal places.</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b75d95b9491e4f014532867c0186d837/href">https://medium.com/media/b75d95b9491e4f014532867c0186d837/href</a></iframe><p><strong><em>Solution 9:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0494d689dceb83b59429a499c96d1abe/href">https://medium.com/media/0494d689dceb83b59429a499c96d1abe/href</a></iframe><p><strong><em>10. Write an SQL query to find all numbers that appear at least three times consecutively.</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/224c6436eb5a528beeebe613daf8b01b/href">https://medium.com/media/224c6436eb5a528beeebe613daf8b01b/href</a></iframe><p><strong><em>Solution 10:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e1b306fec69c48fad748ed800ef8fca6/href">https://medium.com/media/e1b306fec69c48fad748ed800ef8fca6/href</a></iframe>
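<p>And one more sketch, for Q10: LAG and LEAD expose each row’s neighbours, so a number equal to both its previous and next value appears at least three times in a row. The logs table below is a toy stand-in for the question’s schema.</p><pre>
import sqlite3

# Toy data: only 1 appears three times consecutively.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logs (id INTEGER PRIMARY KEY, num INT);
INSERT INTO logs (num) VALUES (1), (1), (1), (2), (1), (2), (2);
""")

# LAG/LEAD pull each row's previous and next value; a row whose num
# matches both is the middle of a run of three or more.
query = """
WITH neighbours AS (
    SELECT num,
           LAG(num)  OVER (ORDER BY id) AS prev_num,
           LEAD(num) OVER (ORDER BY id) AS next_num
    FROM logs
)
SELECT DISTINCT num AS consecutive_num
FROM neighbours
WHERE num = prev_num AND num = next_num;
"""
for row in conn.execute(query):
    print(row)   # (1,) for the sample data
</pre>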
<p>That’s a wrap! I hope you liked the questions and were able to practice some of the most important concepts of SQL. If you want to practice more questions like these, feel free to head over to my <a href="https://github.com/mrinal1704/SQL-Leetcode-Challenge"><strong>Github</strong></a> page, where I have uploaded all 117 solutions.</p><h3>Thank you!</h3><p>If you like my work, please follow me on Medium to read more articles in the near future.</p><ul><li>Read my other articles on <a href="https://towardsdatascience.com/the-art-of-engineering-features-for-a-strong-machine-learning-model-a47a876e654c">Feature Engineering</a> &amp; <a href="https://towardsdatascience.com/learn-how-to-automate-the-basic-steps-of-data-analysis-45e118048172">Automating basic data analysis</a>.</li><li>I would love to connect with you on <a href="https://www.linkedin.com/in/mrinal-gupta-5319a9ab/">LinkedIn</a>.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=37545e7c5219" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/10-problems-to-practice-almost-all-sql-concepts-37545e7c5219">10 problems to practice almost all SQL concepts</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Learn how to automate the basic steps of Data Analysis]]></title>
            <link>https://medium.com/data-science/learn-how-to-automate-the-basic-steps-of-data-analysis-45e118048172?source=rss-7cf002511db6------2</link>
            <guid isPermaLink="false">https://medium.com/p/45e118048172</guid>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[automation]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <dc:creator><![CDATA[Mrinal Gupta]]></dc:creator>
            <pubDate>Wed, 17 Jun 2020 21:15:57 GMT</pubDate>
            <atom:updated>2020-06-17T23:53:58.893Z</atom:updated>
<content:encoded><![CDATA[<h3>Learn how to automate the basic steps of data analysis</h3><h4>Are you also bored with writing df.shape, df.info() again and again?</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Qb0uxS_nGGgQl1st" /><figcaption>Photo by <a href="https://unsplash.com/@markusspiske?utm_source=medium&amp;utm_medium=referral">Markus Spiske</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>Introduction</h3><p>Are you bored of writing<em> df.shape, df.info(), df.plot(kind=’bar’), df[‘column_name’].nunique()</em>, and many other basic functions again and again just to get basic insights from your data? I am sure this process has started to feel monotonous. After reading this article, you will see how to automate these basic functions in five simple steps by developing your own Python package in a matter of minutes.</p><h4>Preview</h4><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fk9s5Z4OK8os&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dk9s5Z4OK8os&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/ef5a84f5c6169b62560db9c26d5dc72f/href">https://medium.com/media/ef5a84f5c6169b62560db9c26d5dc72f/href</a></iframe><h3>Let’s get started</h3><p>To begin developing your own customised Python package, perform the following steps:</p><h4>STEP 1 — Creation of the Python script file</h4><p>This file will contain the Python code necessary to run the basic data analysis. To demonstrate, let us automate steps such as:</p><ul><li>The dimensions of the dataset</li><li>The data types of all the columns</li><li>The number of unique values per column</li><li>The percentage of NA values per column</li><li>A bar chart for each categorical column</li><li>A histogram for each numeric column to see the distribution of the data</li><li>A heatmap showing the null values</li></ul><p>The following is the snippet of the code that I wrote:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/980/1*ydJ9g9gI5gkzROOdA8WYvg.png" /><figcaption>Image by Author</figcaption></figure><p>The file should carry the name you want the package to be called by, the way Pandas and Numpy do, and the name should be unique. In our case, I have named it ‘Mrinal’.</p><h4>STEP 2 Create a Setup.py file</h4><p>This file is necessary to install the package and contains information such as the package name and the author name. It resides outside the folder which contains the Python script file from Step 1 and the other files discussed later.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/435/1*mooZx0KxgPLlgM60bCd8UQ.png" /><figcaption>Image by Author</figcaption></figure><p>The above image shows the code to be written in the Setup.py. Note that the name of your package must be unique: if you want to publish to pypi.org later, you cannot use a name that is already present on the website. For example, you cannot create a package named ‘Pandas’ or ‘Numpy’ as they are already in the library.</p>
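<p>For reference, a minimal Setup.py along the lines described above might look like the sketch below; the metadata values are placeholders, and only the package name ‘Mrinal’ comes from this article. Note that the file is conventionally named setup.py, in lowercase.</p><pre>
# setup.py - a minimal sketch; the metadata values below are
# placeholders, only the package name 'Mrinal' comes from the article.
from setuptools import setup

setup(
    name="Mrinal",                 # must be unique if published to PyPI
    version="0.0.1",
    description="Automates the basic steps of data analysis",
    author="Mrinal Gupta",
    packages=["Mrinal"],           # the folder holding __init__.py
    install_requires=["pandas", "matplotlib", "seaborn"],  # assumed deps
)
</pre>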
<h4>STEP 3 Create an __init__.py file</h4><p>This file tells Python that the folder contains a package. It should be present in the same folder as the Python script file created in Step 1.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/352/1*lQBD6tX2_n7vOurHUWHMMg.png" /><figcaption>Image by Author</figcaption></figure><p>The above code references the class that we created in the Python script, ‘Insights’, and the name of the package, ‘Mrinal’ in our case. The ‘.’ is mandatory in Python 3.</p><h4>STEP 4 Arrange the files in the right folder</h4><p>For this step:</p><ul><li>Create a folder; you can name it anything you want, as the name doesn&#39;t affect the installation in any way. Let’s name it ‘My first Python package’ for reference</li><li>Store the Setup.py file inside this folder</li><li>Create another folder inside it and give it the same name as the package, ‘Mrinal’ in our case. Whenever you want to import the package, you will write ‘from Mrinal import Insights’</li><li>Store the Python script file named ‘Mrinal.py’ and the ‘__init__.py’ file inside the newly created folder</li></ul><h4>STEP 5 Pip Install</h4><ul><li>Open the command prompt</li><li>Use the ‘cd’ command to navigate to the ‘My first Python package’ folder</li><li>Type ‘pip install .’</li><li>This installs your package</li><li>Then open any IDE, such as Jupyter Notebook, and type ‘from Mrinal import Insights’</li><li>Create a class object, for instance, insight_1 = Insights(). You can also look at the preview video.</li><li>Then call the ‘automate_analysis()’ function, just like in the video. You will see how those repeated steps are now automated: you call this one function and it does all the work.</li></ul><h3>Congratulations!</h3><p>You built your first Python package on your own and will save a lot of time in the future by not writing those functions again and again. Similarly, you can add more functions and classes to extend your package and make your data analysis process smoother.</p><h3>Resources</h3><ul><li>You can also download all the code files from <a href="https://github.com/mrinal1704/My-First-Python-Package">my GitHub</a> page</li><li>If you want to upload your package to pypi.org, you can follow this <a href="https://dev.to/prahladyeri/python-checklist-publishing-a-package-to-pypi-4jlc">link</a></li></ul><p>If you liked this article, do read my other article on <a href="https://towardsdatascience.com/the-art-of-engineering-features-for-a-strong-machine-learning-model-a47a876e654c">how you can develop the critical skills needed to perform Feature Engineering for a strong Machine Learning model</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=45e118048172" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/learn-how-to-automate-the-basic-steps-of-data-analysis-45e118048172">Learn how to automate the basic steps of Data Analysis</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Art of engineering features for a strong Machine learning model]]></title>
            <link>https://medium.com/data-science/the-art-of-engineering-features-for-a-strong-machine-learning-model-a47a876e654c?source=rss-7cf002511db6------2</link>
            <guid isPermaLink="false">https://medium.com/p/a47a876e654c</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[feature-engineering]]></category>
            <category><![CDATA[anti-money-laundering]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <dc:creator><![CDATA[Mrinal Gupta]]></dc:creator>
            <pubDate>Sun, 31 May 2020 08:29:49 GMT</pubDate>
            <atom:updated>2020-06-04T18:18:46.248Z</atom:updated>
<content:encoded><![CDATA[<h3><strong>The art of engineering features for a strong machine learning model</strong></h3><h4>The most critical process in any data science problem that you should learn.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xTcU74Gqaqi-KTxv5up7uw.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@franckinjapan?utm_source=medium&amp;utm_medium=referral">Franck V.</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3><strong>What you’ll learn</strong></h3><ul><li>Develop the critical thinking skills required for feature engineering</li><li>Feature engineering for an anti-money laundering algorithm</li></ul><h3>Introduction</h3><h4><strong>Feature Engineering? [1]</strong></h4><p>A feature is a numeric representation of raw data. In structured data, features are the independent variables on which the target variable depends. The features already present in a dataset are commonly known as data fields, and the ones created through domain knowledge are known as candidate variables or expert variables. This process of encoding information into the form of a new variable is known as feature engineering.</p><h4><strong>Why do we need more features?</strong></h4><p>A machine learning model’s performance is directly tied to how accurately the independent features capture the right information about the problem at hand. As a result, we should create as many variables as we can, so that we can later select the most important features for our model and thereby enhance its performance. However, creating new features is a tedious job and requires a good understanding of the problem along with some domain knowledge. In this article, I am going to work through an example to demonstrate how you can create various candidate variables for an <strong>anti-money laundering machine learning model</strong>.</p><p>We will start by understanding the problem and then apply that knowledge to perform feature engineering, in the series of steps described below.</p><h3><strong>STEP 1</strong></h3><h4><strong>Understanding the problem</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*cldVG-_d0EZpr4ga" /><figcaption>Photo by <a href="https://unsplash.com/@alschim?utm_source=medium&amp;utm_medium=referral">Alexander Schimmeck</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Money laundering is the illegal process of turning “dirty” money (money obtained from illegal businesses, such as selling drugs) into “clean” money (legitimate money), either through an obscure sequence of banking transfers or through commercial transactions.</p><p>The three broad stages of money laundering are:</p><p><strong>Placement — </strong>The stage at which the “dirty” money is placed into the legitimate financial system. The most common way of achieving this is smurfing, which involves spreading deposits small enough to stay below anti-money laundering reporting thresholds across multiple bank accounts, from which the money is later returned to the same sender.</p><p><strong>Layering </strong>— The second and most complex stage, which involves making the money as hard to detect as possible and moving it further away from the source.
The money is deliberately moved so quickly that banks cannot trace it.</p><p><strong>Integration </strong>— The final stage involves putting the “clean” money back into the economy. One of the most common ways is to buy a property in the name of a shell company, which makes the purchase look like a legitimate transaction.</p><p>Because of space constraints, I have given only this broad outline of the problem for demonstration. In practice, you should research the problem properly by reading research papers, patents, and more.</p><h3><strong>STEP 2</strong></h3><h4><strong>Break down the problem into smaller fragments for effective variable creation [2]</strong></h4><p>After researching the problem, you should write down all the insights you have developed. For instance, here are a few that follow directly from Step 1:</p><ul><li>Substantial increases in cash deposits of any individual or business without apparent cause</li><li>Deposits subsequently transferred within a short period out of the account and to a destination not normally associated with the customer</li><li>Accounts dominated by cash transactions rather than cheques or letters of credit</li><li>A large number of individuals making payments into the same account without an adequate explanation</li><li>Large cash withdrawals from a brand-new account, or from an account which has just received an unexpected large credit from abroad</li></ul><p>The better the understanding you develop of the problem, the more insights you will gather, and the better the features available to enhance the performance of your model. All the above insights should therefore be accounted for while creating the candidate variables, to inject more information into the model.</p><h3><strong>STEP 3</strong></h3><h4><strong>Understand the Dataset</strong></h4><p>To progress further, let’s assume we have a hypothetical dataset with the following data fields and their descriptions for developing an anti-money laundering model. In a real scenario, data scientists working for banks can easily get data with these fields. You can also take a look at <a href="https://www.kaggle.com/x09072993/aml-detection">the public dataset</a> available.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/708/1*xpDIOG6AGt82CKmBJ4BxKQ.png" /><figcaption>Image by author</figcaption></figure><h3><strong>STEP 4</strong></h3><h4><strong>Building Candidate Variables</strong></h4><p>Here comes the most interesting, most crucial, and most difficult part of any data science problem, aka feature engineering:</p><p><strong>4.1. Concatenate two or more different data fields to form a new categorical variable</strong></p><p>For our first set of variables, you can join two or more data fields together to make a new variable. To understand this, let’s make pairs of “Origin_acct”, “Destination_acct”, and the “Transaction_type” as shown in the table below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/926/1*wl7hlDHIaYKfuVBoyt0T2A.png" /><figcaption>Image by author</figcaption></figure><p>In the table above, you can see a new column named “Origin_acct-Destination_acct” containing the concatenated values of their respective data fields. The other columns are built in the same way, as in the sketch below.</p>
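<p>As a rough illustration, with toy account numbers in the spirit of the tables above, the concatenation is a one-liner in pandas:</p><pre>
import pandas as pd

# Toy transactions using the data fields discussed above.
df = pd.DataFrame({
    "Origin_acct":      ["4586524", "4586524", "9772354"],
    "Destination_acct": ["8251666", "8251666", "4586524"],
    "Transaction_type": ["CASH", "CHEQUE", "CASH"],
})

# Pairwise and three-way concatenations become new categorical features.
df["Origin_acct-Destination_acct"] = (
    df["Origin_acct"] + "-" + df["Destination_acct"]
)
df["Origin_acct-Destination_acct-Transaction_type"] = (
    df["Origin_acct-Destination_acct"] + "-" + df["Transaction_type"]
)
print(df)
</pre>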
<h4>Why?</h4><p>Concatenating “Origin_acct” with “Destination_acct” helps in policing the process of “smurfing”, where multiple intermediate accounts transfer small amounts to a single sender multiple times. Additionally, in one of the problems discussed earlier, it was observed that these criminals prefer cash transactions over instruments such as cheques and bills of exchange. Concatenating with the “Transaction_type” therefore gives our algorithm another dimension for learning about the nature of transactions, and lets us track whether the number of cash transactions for a particular account has increased (discussed in 4.2). Such activities are far from normal, and you will see how other numerical candidate variables (discussed later) linked to these concatenated fields help us move in the right direction.</p><p><strong>4.2. Frequency Candidate Variables [1]</strong></p><p>The frequency variables encode the number of transactions done by each entity (shown in the figure), capturing information such as an increase in the number of transactions for a particular pair of accounts, which could be a signal of suspicious activity. The figure below shows different combinations of frequency variables. For example, you can calculate the number of times the origin_acct was used on the same day (0), in the last 1 day, in the last 3 days, and so on; a sketch of the computation follows the tables below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/834/1*mYWR-fxwH29w_0eUNfPxHA.png" /><figcaption>Image by author</figcaption></figure><p>A high value over a given time period suggests something abnormal in the behaviour of that account. In the table below you can see what the frequency variables look like for one of the origin accounts:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/714/1*mBaeVEr6zGBqg7LT5Fp9jw.png" /><figcaption>Image by author</figcaption></figure><p>The column Origin_frequency_0 starts with 1 (assuming this is the first time the account is used for a transaction), and the number is 1 again on 05/02/2014 because that is the first time the account is seen that day. Similarly, you can deduce how the other numbers were calculated.</p>
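<p>Here is a minimal pandas sketch of the idea, assuming toy column names: each variable counts how often the same origin account appeared within a trailing time window that includes the current transaction.</p><pre>
import pandas as pd

# Toy transactions; rows must be sorted by account and date so the
# rolling counts align back to the frame positionally.
df = pd.DataFrame({
    "Origin_acct": ["A", "A", "A", "B"],
    "Date": pd.to_datetime(["2014-02-01", "2014-02-01",
                            "2014-02-05", "2014-02-05"]),
    "Amount": [50.0, 20.0, 98.4, 10.0],
}).sort_values(["Origin_acct", "Date"], ignore_index=True)

# One frequency variable per trailing window of 1, 3, and 7 days.
for days in (1, 3, 7):
    counts = (
        df.set_index("Date")
          .groupby("Origin_acct")["Amount"]
          .rolling(f"{days}D")
          .count()
    )
    df[f"Origin_frequency_{days}"] = counts.to_numpy()

print(df)
</pre>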
<p><strong>4.3. Amount Variables</strong></p><p>The amount variables capture the average, maximum, median, and total amount transacted from each account over the past 0, 1, 3, 7, 14, and 30 days (0 indicates the same day). They help in tracking the third stage, Integration, where a large sum of money is withdrawn from a bank account without any adequate reason, possibly to buy a property, and hence help the model identify abnormalities in transaction amounts. For instance, one column would contain the total amount transacted by a Destination account over the last 3 days. Similarly, other combinations can be formed as shown below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/813/1*eX58aioaE-tNNRjRAJOaKw.png" /><figcaption>Image by author</figcaption></figure><p>The table below shows a pair of amount variables:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/858/1*vP-JXfBPcEdq2ROU_Di0lQ.png" /><figcaption>Image by author</figcaption></figure><p>In the table above, the column “Origin_acct-total_Amount_3_days” contains the total amount transacted by Origin account #4586524 over the past 3 days. The total remains 98.4 in the last row because the account was not used in the previous 3 days. The other column is the amount transacted on the same day divided by the total amount over the last 3 days.</p><p><strong>4.4. Time-since Variables [1]</strong></p><p>These variables are very handy for encapsulating how fast transactions are taking place on an account. Each one measures the time between an account’s previous transaction and its current transaction. The faster the subsequent transactions for a single entity, the higher the probability of fraud, so these variables help in tracking the second stage, Layering. The following table shows an example of a time-since variable:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/515/1*BMwWBPdidd2KR7G1TwQ1dQ.png" /><figcaption>Image by author</figcaption></figure><p><strong>4.5. Velocity-change candidate Variables [1]</strong></p><p>This last set of variables tracks sudden changes in the normal behaviour of an account by calculating how the number of transactions, or the amount transferred, in the past day (0 &amp; 1 day) has changed relative to longer periods (7, 14, &amp; 30 days). The formula is as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/872/1*OX434fd87kzBML6XCtsNwA.png" /><figcaption>Image by author</figcaption></figure><p>Hence, if there is an unexpected change in the number of transactions or in the average amount for that account, our model will be able to learn that change; a sketch of one such ratio follows the example table below. The following table shows an example of velocity-change variables:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/752/1*B_bCs_ABBlfV1Sxz7FYOLA.png" /><figcaption>Image by author</figcaption></figure>
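<p>The exact formula is the one shown in the image above; as a hedged approximation, the sketch below computes one plausible velocity-change ratio, the share of an account’s 30-day transaction count that falls in the last day, using the same toy columns as before:</p><pre>
import pandas as pd

# Toy transactions, sorted so rolling results align positionally.
df = pd.DataFrame({
    "Origin_acct": ["A", "A", "A", "A"],
    "Date": pd.to_datetime(["2014-02-01", "2014-02-10",
                            "2014-02-28", "2014-02-28"]),
    "Amount": [50.0, 20.0, 98.4, 70.0],
}).sort_values(["Origin_acct", "Date"], ignore_index=True)

def trailing_count(frame, window):
    """Transaction count per account over a trailing time window."""
    return (
        frame.set_index("Date")
             .groupby("Origin_acct")["Amount"]
             .rolling(window)
             .count()
             .to_numpy()
    )

# Velocity change: share of the last 30 days' activity that falls
# within the last 1 day; a sudden burst pushes this ratio toward 1.
df["velocity_change_1_30"] = trailing_count(df, "1D") / trailing_count(df, "30D")
print(df)
</pre>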
<h3>Summary</h3><p>You saw how we were able to encode more and more information about the given problem through many candidate variables, using only the given data fields and no external data. To summarise, you learned the following variables and the information each one encodes:</p><p><strong>Concatenated Variables — </strong>Linked the origin account, destination account, and transaction type with each other, which helped in tracking smurfing and large cash withdrawals</p><p><strong>Frequency Variables — </strong>Captured how frequently an account is used</p><p><strong>Amount Variables — </strong>Captured the magnitude of transaction amounts</p><p><strong>Time-since Variables — </strong>Captured the speed of transactions</p><p><strong>Velocity-change Variables — </strong>Identified sudden changes in the behaviour of accounts</p><p>I know the problem discussed above seems specific to fraud detection models but, trust me, it will surely help you develop the critical thinking skills required for creating expert variables for any data science problem. I hope you found it helpful and worth reading. Cheers!</p><h3><strong>References</strong></h3><p>[1] Gao, J.X., Zhou, Z.R., Ai, J.S., Xia, B.X. and Coggeshall, S. (2019) Predicting Credit Card Transaction Fraud Using Machine Learning Algorithms. Journal of Intelligent Learning Systems and Applications, 11, 33–63. <a href="https://doi.org/10.4236/jilsa.2019.113003">https://doi.org/10.4236/jilsa.2019.113003</a></p><p>[2] Guideline on Combating Money Laundering and Terrorist Financing. <a href="https://www.imolin.org/doc/amlid/Trinidad&amp;Tobago_Guidlines%20on%20Combatting%20Money%20Laundering%20&amp;%20Terrorist%20Financing.pdf">https://www.imolin.org/doc/amlid/Trinidad&amp;Tobago_Guidlines%20on%20Combatting%20Money%20Laundering%20&amp;%20Terrorist%20Financing.pdf</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a47a876e654c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/the-art-of-engineering-features-for-a-strong-machine-learning-model-a47a876e654c">The Art of engineering features for a strong Machine learning model</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>