<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Mrinal Gupta on Medium]]></title>
        <description><![CDATA[Stories by Mrinal Gupta on Medium]]></description>
        <link>https://medium.com/@mrinalgupta1704?source=rss-7cf002511db6------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*2udlXdIXAumWul8fcFVCEg.jpeg</url>
            <title>Stories by Mrinal Gupta on Medium</title>
            <link>https://medium.com/@mrinalgupta1704?source=rss-7cf002511db6------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 19:21:32 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@mrinalgupta1704/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[G.O.A.T Way to Store Images in SQL Database Using Multi-Threading]]></title>
            <link>https://medium.com/data-science/g-o-a-t-way-to-store-images-in-sql-database-using-multi-threading-3ef7281b6247?source=rss-7cf002511db6------2</link>
            <guid isPermaLink="false">https://medium.com/p/3ef7281b6247</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[Mrinal Gupta]]></dc:creator>
            <pubDate>Tue, 08 Feb 2022 04:27:44 GMT</pubDate>
            <atom:updated>2022-02-08T04:27:44.899Z</atom:updated>
            <content:encoded><![CDATA[<h4>Process thousands of images in minutes!</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*DJKMHnfz3e5pncms" /><figcaption>Photo by <a href="https://unsplash.com/@robingaillotdrevon?utm_source=medium&amp;utm_medium=referral">Robin GAILLOT-DREVON</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Have you ever come across a use case where you cannot use the parallel processing power of Spark’s executors? In this article, we are going to explore a cool way to use multithreading in Python for storing images in a SQL database. This use case cannot be handled in Spark alone, as there is no way to plot your data unless you convert the data frame to pandas format, and once you convert to pandas, you lose all the advantages of Spark. Therefore, to achieve a comparable execution speed (though perhaps not as fast as Spark), we can leverage multithreading in Python.</p><p>Upon reading this article, you will learn:</p><p>- How to multithread in Python</p><p>- How to efficiently store images in a SQL database</p><h3>Use-Case</h3><p>We are going to look at an hourly energy consumption dataset, where we need to save a consumption plot for each individual day over a duration of three months. Let’s look at the dataset, which I downloaded from <a href="https://www.kaggle.com/robikscube/hourly-energy-consumption/version/3">Kaggle</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/248/1*EjgUD5X-ltC7NSXwqKXoLA.png" /><figcaption>Image by author</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/785/1*1rCm0KfOAVUbJrKgw3zTmA.png" /><figcaption>Image by author</figcaption></figure><h3>Preparations</h3><h4>1. Install Libraries</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/283/1*RUFHnN_jQsrXszaYcHnAYw.png" /><figcaption>Image by author</figcaption></figure><p>You are probably familiar with most of the libraries listed above, so I’ll go through the less frequently used ones:</p><ol><li><a href="https://docs.python.org/3/library/concurrent.futures.html">Concurrent.futures</a> — Used for launching parallel threads in Python</li><li><a href="https://docs.python.org/3/library/itertools.html#itertools.repeat">Repeat</a> — Used to supply a stream of constant values</li><li><a href="https://docs.microsoft.com/en-us/sql/connect/python/pyodbc/python-sql-driver-pyodbc?view=sql-server-ver15">Pyodbc</a> — A Python module that uses an ODBC driver to connect to SQL Server. The following image shows how to connect to the SQL server, which requires all the credentials:</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/711/1*gVF7Ih_UKPxQvR-Fb6TCag.png" /></figure><h4>2. Make a folder in your directory to store images</h4><p>To store images in a SQL database, you first need to save each plot as a .png file.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/488/1*kc7RMeB7fRiSA-GYNi0Ixg.png" /></figure><h4>3. Generating and Storing Images</h4><p>Now that you have installed the libraries and connected to the SQL server, you can begin storing the images. There are two different approaches through which we can achieve this:</p><ol><li>Our beloved ‘For loop’:</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/846/1*15wjWVnJaFPB3fJ2cY6z5A.png" /><figcaption>Image by author</figcaption></figure><ul><li>The above code uses a for loop to iterate over all the unique days, creating a plot for each day and saving it in the folder that we created earlier</li><li>Afterward, we open each saved image and store it in the table we already created in the SQL database</li><li>This process takes a long time if you have thousands of images to process, so it is not a scalable approach to move forward with</li></ul><p>2. With Multi-threading:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/853/1*b0CfVvO6y7CGtnhxj5EpaA.png" /><figcaption>Image by author</figcaption></figure><p>The above code uses the concurrent.futures library to implement multi-threading. To the map function, you pass the ‘<em>plot_consumption</em>’ function, which generates an image for each date in `<em>list_dates</em>`, also passed as one of the arguments. Additionally, you can see how the dataframe is wrapped in the repeat function, which supplies the same dataframe to all the concurrent threads processing the individual days.</p>
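<p>Since the code above is shown as screenshots, here is a minimal runnable sketch of the multi-threaded approach. The table name, column names, and connection credentials are illustrative placeholders rather than the exact code from the images, and since matplotlib’s pyplot API is not strictly thread-safe, the non-interactive Agg backend is used:</p><pre>import concurrent.futures<br>from itertools import repeat<br>import matplotlib<br>matplotlib.use(&#39;Agg&#39;)  # non-interactive backend, safer when plotting from threads<br>import matplotlib.pyplot as plt<br>import pyodbc<br><br># placeholder credentials -- replace with your own server details<br>CONN_STR = &#39;DRIVER={ODBC Driver 17 for SQL Server};SERVER=my_server;DATABASE=my_db;UID=my_user;PWD=my_password&#39;<br><br>def plot_consumption(date, df):<br>    # plot one day&#39;s consumption and save it as a .png in the images folder<br>    day = df[df[&#39;date&#39;] == date]<br>    fig, ax = plt.subplots()<br>    ax.plot(day[&#39;hour&#39;], day[&#39;consumption_mw&#39;])<br>    path = &#39;images/{}.png&#39;.format(date)<br>    fig.savefig(path)<br>    plt.close(fig)<br>    # pyodbc connections should not be shared across threads, so open one per task<br>    conn = pyodbc.connect(CONN_STR)<br>    with open(path, &#39;rb&#39;) as f:<br>        conn.cursor().execute(&#39;INSERT INTO consumption_plots (plot_date, img) VALUES (?, ?)&#39;, date, f.read())<br>    conn.commit()<br>    conn.close()<br><br># df and list_dates come from the dataset loaded earlier<br>with concurrent.futures.ThreadPoolExecutor() as executor:<br>    # repeat(df) supplies the same dataframe to every thread; consuming<br>    # the map iterator surfaces any per-task exceptions<br>    results = list(executor.map(plot_consumption, list_dates, repeat(df)))</pre><p>Threads work well here because saving the plots to disk and inserting into the database are largely I/O-bound operations, so they overlap nicely despite Python’s GIL.</p>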
<h4>It’s a wrap! I hope you learned something new today and can implement it in your projects and/or at your workplace.</h4><h3>Thank you for reading!</h3><ul><li><em>If you like my writing, then please subscribe to my </em><a href="https://medium.com/subscribe/@mrinalgupta1704"><em>list</em></a></li><li><em>If you liked it, </em><a href="https://medium.com/@mrinalgupta1704"><em>follow me on Medium</em></a></li><li><em>Stay in touch on </em><a href="https://www.linkedin.com/in/mrinal-gupta-5319a9ab/"><em>LinkedIn</em></a></li></ul><h3>References</h3><ul><li><a href="https://www.pjm.com/Search%20Results.aspx?#q=hourly%20data&amp;sort=relevancy&amp;f:_E50F7924-B7B4-46B5-BA96-7E2D0F3D7882=[Csv]">PJM — Search Results</a> — Data Source</li><li><a href="https://www.kaggle.com/robikscube/hourly-energy-consumption">Hourly Energy Consumption | Kaggle</a></li><li><a href="https://creativecommons.org/publicdomain/zero/1.0/">License</a> to use the dataset</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3ef7281b6247" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/g-o-a-t-way-to-store-images-in-sql-database-using-multi-threading-3ef7281b6247">G.O.A.T Way to Store Images in SQL Database Using Multi-Threading</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Uncovering Data Science Interview Questions asked to me — Part 2]]></title>
            <link>https://medium.com/swlh/uncovering-data-science-interview-questions-asked-to-me-part-2-22f25fb031c8?source=rss-7cf002511db6------2</link>
            <guid isPermaLink="false">https://medium.com/p/22f25fb031c8</guid>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <dc:creator><![CDATA[Mrinal Gupta]]></dc:creator>
            <pubDate>Sun, 30 Jan 2022 21:55:24 GMT</pubDate>
            <atom:updated>2022-01-31T16:51:21.700Z</atom:updated>
            <content:encoded><![CDATA[<h3>Uncovering Data Science Interview Questions asked to me — Part 2</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*LT382l1xImdaFYEl" /><figcaption>Photo by <a href="https://unsplash.com/@kanereinholdtsen?utm_source=medium&amp;utm_medium=referral">Kane Reinholdtsen</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Hello readers, I hope you liked <a href="https://medium.com/@mrinalgupta1704/uncovering-all-data-science-questions-asked-to-me-part-1-8143a72cd247">Part-1</a> of this article series, where I am uncovering all the Data Science interview questions that I have been asked since I graduated with my Master’s. In Part-1, we saw questions on ML theory and a case study. In this part, I’ll go through some of the statistics and programming questions. Let’s get started!</p><h3>A) Statistics</h3><p><strong>1) Explain the p-value to a 10-year-old boy</strong></p><p>Ans. Suppose a person is one of the suspects in a theft crime. When he is caught, he simply denies that he is the one who did it and does not accept the charges. Now, assume the cop studied statistics in school and thinks of settling the matter using hypothesis testing. He writes two statements:</p><p>H0 — He is not guilty</p><p>Ha — He is guilty</p><p>He looks in the police database and finds 5 out of the 8 crimes in the past month in the suspect’s name. He then calculates the probability of the suspect not being guilty = 3/8. Here, that probability (3/8) is the p-value that we calculated. Let’s assume his threshold, i.e., the level of significance, is 4/8 (1/2). If the p-value is lower than the threshold, you reject the NULL hypothesis; otherwise, you fail to reject it. In this case, 3/8 is below 4/8, so the cop rejects the NULL hypothesis that the suspect is not guilty.</p><p><strong>2) How do you deal with NULL values?</strong></p><p>Ans. The key to answering this question lies in your logical reasoning as well as your understanding of the data. You can name different ways of dealing with missing values, namely imputation using the mean, KNN, dropping values, etc. However, I always like to answer this question using an example:</p><p>You first need to look at the percentage of null values present in the data; if it is &lt;20%, you should consider filling the values. Moreover, imputation shouldn’t be done blindly, as it may reduce the variance in the data. What you can do is impute by making groups using the other columns and taking the respective mean of each group. For example, if you want to fill a column with the heights of people, you can’t fill it with the mean of the whole table, as there is a difference between the heights of females and males. You can make two groups and fill each with its respective mean.</p><p><strong>3) How would you introduce the uncertainty in your final likelihood results?</strong></p><p>Ans. Carrying out a bootstrapped sampling technique on the final results would help us report statistically meaningful uncertainty around them.</p><p>The following link explains how we can perform bootstrap sampling:</p><p><a href="https://carpentries-incubator.github.io/machine-learning-novice-python/08-bootstrapping/index.html">https://carpentries-incubator.github.io/machine-learning-novice-python/08-bootstrapping/index.html</a></p><p><strong>4) What test would you carry out to check the difference between the data of heights of men and women?</strong></p><p>Ans. A two-sample independent t-test, where the hypotheses would be:</p><p>H0 — The difference between the mean heights of men and women is 0</p><p>Ha — The difference between the mean heights of men and women is not equal to zero</p>
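<p>As a quick aside, this test is essentially a one-liner with scipy; the height samples below are simulated purely for illustration:</p><pre>import numpy as np<br>from scipy import stats<br><br># simulated height samples in cm -- replace with real data<br>rng = np.random.default_rng(0)<br>men = rng.normal(175, 7, size=200)<br>women = rng.normal(162, 6, size=200)<br><br># equal_var=False gives Welch&#39;s t-test, which drops the equal-variance assumption<br>t_stat, p_value = stats.ttest_ind(men, women, equal_var=False)<br>print(t_stat, p_value)  # reject H0 when p_value is below your significance level</pre>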
<h3>B) Programming</h3><p><strong>1) Create a normally distributed histogram in Python</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/875/0*w2PGzA3Bo7aD3NFa.png" /><figcaption>Photo by author</figcaption></figure><p><strong>2) You are given an array </strong><strong>prices where </strong><strong>prices[i] is the price of a given stock on the </strong><strong>ith day.</strong></p><p><strong>You want to maximize your profit by choosing a single day to buy one stock and choosing a different day in the future to sell that stock.</strong></p><p><strong>Return <em>the maximum profit you can achieve from this transaction</em>. If you cannot achieve any profit, return </strong><strong>0.</strong></p><p>Ans. This question is available on Leetcode as well and was asked by C3.ai:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/875/0*KgHQjeL7u-O-GcaP.png" /><figcaption>Photo by author</figcaption></figure><p><strong>3) How would you reverse an integer?</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/875/0*bbNFlPvmmUzS8Yjv.png" /><figcaption>Photo by author</figcaption></figure><p><strong>4) Difference between list and tuple</strong></p><p>Ans. A list is a mutable data structure whereas a tuple is immutable.</p><p><strong>5) Difference between Union and Union All</strong></p><p>Ans. The only difference is that Union All keeps duplicate rows in the final merged table, whereas Union removes them.</p><p><strong>6) Difference between multi-processing and multi-threading</strong></p><p>Ans. Multiprocessing is a technique that uses multiple CPUs to increase the computing speed of the system, whereas in multithreading a single processor runs multiple threads concurrently over different code segments. An application of multithreading from my personal experience: when you want to plot consumption/demand curves for many unique customers in near real-time, you can run one thread per customer concurrently, cutting a lot of processing time.</p><p><strong>7) What is inheritance?</strong></p><p>Ans. It is an object-oriented programming technique where we define a class that inherits all the methods and attributes from a base class. This lets us take advantage of already defined classes and reuse them to achieve new functionality.</p><p>Inheritance is widely used by Data Scientists when they want to deploy their models into a production environment.</p>
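<p>A minimal illustration of inheritance (the class and method names here are invented for the example):</p><pre>class BaseModel:<br>    def __init__(self, name):<br>        self.name = name<br><br>    def predict(self, X):<br>        raise NotImplementedError<br><br>class MeanModel(BaseModel):<br>    # inherits __init__ (and the name attribute) from BaseModel<br>    def fit(self, y):<br>        self.mean_ = sum(y) / len(y)<br>        return self<br><br>    def predict(self, X):<br>        # overrides the base class method<br>        return [self.mean_] * len(X)<br><br>model = MeanModel(&#39;baseline&#39;).fit([2, 4, 6])<br>print(model.name, model.predict([0, 0]))  # baseline [4.0, 4.0]</pre>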
<h3>Thank you!</h3><p>If you like my work, please follow me on Medium to read more articles in the near future.</p><ul><li>Read my other articles on <a href="https://medium.com/@mrinalgupta1704/uncovering-all-data-science-questions-asked-to-me-part-1-8143a72cd247">Machine Learning Questions</a>, <a href="https://towardsdatascience.com/10-problems-to-practice-almost-all-sql-concepts-37545e7c5219">Top 10 SQL problems</a>, <a href="https://towardsdatascience.com/the-art-of-engineering-features-for-a-strong-machine-learning-model-a47a876e654c">Feature Engineering</a> &amp; <a href="https://towardsdatascience.com/learn-how-to-automate-the-basic-steps-of-data-analysis-45e118048172">Automating basic data analysis</a>.</li><li>Connect with me on <a href="https://www.linkedin.com/in/mrinal-gupta-5319a9ab/">LinkedIn</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=22f25fb031c8" width="1" height="1" alt=""><hr><p><a href="https://medium.com/swlh/uncovering-data-science-interview-questions-asked-to-me-part-2-22f25fb031c8">Uncovering Data Science Interview Questions asked to me — Part 2</a> was originally published in <a href="https://medium.com/swlh">The Startup</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Uncovering all Data Science Questions asked to me — Part 1]]></title>
            <link>https://medium.com/data-science/uncovering-all-data-science-questions-asked-to-me-part-1-8143a72cd247?source=rss-7cf002511db6------2</link>
            <guid isPermaLink="false">https://medium.com/p/8143a72cd247</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[interview]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Mrinal Gupta]]></dc:creator>
            <pubDate>Sat, 29 Jan 2022 16:30:20 GMT</pubDate>
            <atom:updated>2022-02-15T03:24:00.138Z</atom:updated>
            <content:encoded><![CDATA[<h3>Uncovering all Data Science Interview Questions asked to me — Part 1</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*PrOnX7bZ7Idp-NgX" /><figcaption>Photo by <a href="https://unsplash.com/@officestock?utm_source=medium&amp;utm_medium=referral">Sebastian Herrmann</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>It’s been about 15 months since I published my last article in Towards Data Science. A lot has happened during this period: a part-time job, conversion to a full-time Data Scientist role, and finally a switch to a new company in a new state kept me away from contributing here. Nevertheless, I am back with a new article containing a consolidated overview of all the Data Science questions that I have been asked across the companies I interviewed with. The companies include Oracle, C3.ai, Experian, Zest AI, Credit Suisse, Visa, and CVS Health, among others.</p><p>In this article, you can find questions in the following categories:</p><p>- ML Case Study — Part 1</p><p>- ML Theory questions — Part 1</p><p>- Statistics — Part 2</p><p>- Programming — Python &amp; SQL — Part 2</p><p>I hope this article will help you in preparing for your future interviews. Let’s get started with the fun part!</p><h3>A) ML Case Study</h3><p>In such case studies, asking the right questions is very important, as it shows the interviewer that you can think in the right direction and that you have the critical thinking skills to approach any problem.</p><p>I’ve been asked a couple of case study questions at Oracle (Utilities Division) and C3.ai:</p><p><strong>1) How would you determine houses that have electric vehicles from hourly electricity consumption data?</strong></p><p>The answer to this question is subjective. However, in my view, we can apply various unsupervised ML techniques, namely PCA, autoencoders, or clustering, to find the outliers with higher electricity consumption than their neighboring houses. It may also be helpful to mention which features you would create. To name a few, you may create aggregate consumption features to track the min, max, and average electricity consumption in the past 1, 3, 7, and 15 days, average consumption relative to the neighboring houses within the same zip code, etc.</p><p><strong>2) How would you predict an out-of-stock inventory list?</strong></p><p>1. It is important to ask for the market location the inventory serves, as that helps establish the size of the market and how the demographics of the market affect the stock.</p><p>2. Asking for the past year’s demand data would also be very helpful, as it would show us the various seasonalities, patterns, holiday demands, etc. necessary for modeling.</p><p>3. For feature engineering, you can introduce lags and one-hot encoded variables to account for any seasonality.</p><h3>B) ML Theory Questions</h3><p><strong>1) List different types of Regression &amp; Classification metrics.</strong></p><p>Ans. Regression Metrics — Mean squared error, root mean squared error, mean absolute error.</p><p>Classification Metrics — Accuracy, Precision, Recall, F1 Score, AUC, ROC.</p><p><strong>2) What are the pros and cons of Mean squared error?</strong></p><p>Ans.<strong> Cons:</strong></p><p>1. Affected by outliers</p><p>2. Loses interpretability if the values are high</p><p>3. Doesn’t tell you the direction of the error, as it is always positive</p><p><strong>Pros:</strong></p><p>1. Very easy to implement</p><p>2. Easy to numerically optimize</p><p><strong>3) Can you use Mean Absolute Error (MAE) as your loss function?</strong></p><p>Ans. MAE is not differentiable at zero, which makes it awkward for gradient-based optimization; in practice, subgradients or smooth alternatives such as the Huber loss are used instead.</p><p><strong>4) Can R-squared ever be negative? If yes, why. Write its formula.</strong></p><p>Ans. Yes, R-squared can be negative. It means that your model’s predictions fit the data worse than simply predicting the mean value.</p><p>Formula:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/406/0*nKKnTb_MXxfFefmF.jpeg" /></figure><p><strong>5) How do you perform cross-validation in time series data?</strong></p><p>Ans. The following link provides a great explanation of time series CV:</p><p><a href="https://otexts.com/fpp3/tscv.html">https://otexts.com/fpp3/tscv.html</a></p>
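<p>In code, scikit-learn’s TimeSeriesSplit implements one version of this expanding-window scheme:</p><pre>import numpy as np<br>from sklearn.model_selection import TimeSeriesSplit<br><br>X = np.arange(12).reshape(-1, 1)  # 12 chronologically ordered observations<br><br># each fold trains on past data only and validates on the block that follows it<br>tscv = TimeSeriesSplit(n_splits=3)<br>for train_idx, test_idx in tscv.split(X):<br>    print(&#39;train:&#39;, train_idx, &#39;test:&#39;, test_idx)</pre>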
<p><strong>6) Differentiate between Bagging and Boosting? (Asked in almost all the interviews)</strong></p><p>Ans. Bagging is short for Bootstrap Aggregation. It is a meta-algorithm where random samples of the training set are selected with replacement to build ‘m’ models. In the end, the results from the ‘m’ models are averaged in the case of regression or majority-voted in classification.</p><p>Boosting is another meta-algorithm that helps in boosting the accuracy of a single learner. This is done by training a series of weak learners that grow into a strong learner, with each subsequent weak learner correcting the errors of the previous ones.</p><p><strong>7) What is vanishing gradient?</strong></p><p>Ans. Vanishing gradient is a popular problem in artificial neural networks where a large change in the input of certain activation functions, like the sigmoid, results in a very small change in the output. As more and more layers are added to a network, the gradient of the loss function approaches zero, making the network harder to train.</p><p><strong>8) How do Support vector machines work?</strong></p><p>Ans. In SVM, the objective is to find the optimal hyperplane that maximizes the minimum distance between the plane and the nearest data points. This gives the selected hyperplane the best chance of segregating the data points into their respective classes.</p><p>More can be found at: <a href="https://www.analyticsvidhya.com/blog/2021/03/beginners-guide-to-support-vector-machine-svm/">https://www.analyticsvidhya.com/blog/2021/03/beginners-guide-to-support-vector-machine-svm/</a></p><p><strong>9) What are the assumptions of linear regression?</strong></p><p>Ans. There are mainly four assumptions of linear regression:</p><p>1. <strong>Linear relationship </strong>— There is a linear relationship between the independent and dependent variables</p><p>2. <strong>Normality</strong> — It assumes that all variables follow multivariate normality</p><p>3. <strong>No multicollinearity</strong> — The independent variables are not correlated with each other</p><p>4. <strong>Homoscedasticity</strong> — It assumes that the error terms have constant variance across all the values of the independent variables</p><p><strong>10) How would you tackle overfitting in Random Forest?</strong></p><p>Ans. Random Forest trains a series of uncorrelated, deeply grown trees, which is important for understanding how it could overfit. There are some major hyperparameters that you can play with:</p><p><strong>N_estimators</strong> — As each tree is deep, you need to make sure the number of trees is not very high. Personally, I like to keep the number around 100–200.</p><p><strong>Max_depth </strong>— Depth is important in all decision trees and shouldn’t be kept very high even in Random Forest. Tuning max_depth using grid search helps.</p><p><strong>Max_features </strong>— The trees are decorrelated through the use of a random subset of features, so one should not use all the features for training each tree; that defeats the purpose of Random Forest, and the model may start to overfit. An optimal number defined in textbooks is sqrt(# of features).</p><p>Apart from the above, you can play with other hyperparameters such as min_samples_split, min_samples_leaf, etc.</p>
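<p>A short sketch of what tuning these hyperparameters with a grid search might look like in scikit-learn; the dataset and grid values are illustrative:</p><pre>from sklearn.datasets import make_classification<br>from sklearn.ensemble import RandomForestClassifier<br>from sklearn.model_selection import GridSearchCV<br><br>X, y = make_classification(n_samples=300, n_features=20, random_state=0)<br><br># illustrative grid over the hyperparameters discussed above<br>param_grid = {<br>    &#39;n_estimators&#39;: [100, 200],<br>    &#39;max_depth&#39;: [4, 8, None],<br>    &#39;max_features&#39;: [&#39;sqrt&#39;, 0.5],<br>}<br>search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)<br>search.fit(X, y)<br>print(search.best_params_)</pre>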
<p><strong>11) How would you tackle overfitting in Neural Networks?</strong></p><p>Ans. There are many ways to tackle overfitting in Neural Networks:</p><p>1. <strong>Simplifying the model</strong> — Reducing the number of nodes and hidden layers to make the network less complex should be your first instinct</p><p>2. <strong>Regularization</strong> — Ridge (L2), Lasso (L1), and elastic net are some of the common regularization techniques used to penalize large coefficients</p><p>3. <strong>Early Stopping </strong>— While training neural networks, a point comes when the validation error starts increasing after having decreased for a number of iterations; we can stop training the model there, as that point essentially indicates the beginning of overfitting</p><p><strong>12) Define learning rate in gradient boosting.</strong></p><p>Ans. Gradient boosted trees train a series of weak learners, which means each tree has some control over the overall result. The learning rate becomes a critical component here, as it controls the amount of change that each tree makes to the result. The higher the learning rate, the faster the training, and vice versa.</p><p><strong>13) What is the relationship between the learning rate &amp; the number of estimators?</strong></p><p>Ans. They are inversely related: if the learning rate is very low, we need a higher number of estimators to reach the final result, and vice versa.</p><p><strong>14) What are the different feature selection techniques?</strong></p><p>Ans. <strong>Filter methods</strong> — Filter methods use statistical measures to evaluate the relationship (correlation) of two distributions; they measure the correlation between the distribution of each feature across classes and the dependent variable. The features chosen are the ones with the highest correlation with the dependent variable, e.g., via the Kolmogorov-Smirnov test.</p><p><strong>Wrapper Methods </strong>— Wrapper methods utilize statistical models to evaluate the performance of each feature (or a subset of features) based on a performance metric (accuracy, AUC, F1 score, etc.). A common wrapper method is recursive feature elimination, in which a model recursively uses smaller and smaller sets of features until a desired number of features is reached.</p><p><strong>Embedded Methods </strong>— Embedded methods perform feature elimination as the model is built. A common embedded method for feature selection is regularization, in which a norm is included in the loss function of a statistical model to penalize the number of features used.</p><p><strong>15) How to determine your model is overfitting?</strong></p><p>Ans. You can determine overfitting by plotting the learning curves, which compare model performance on the train and test data. If the gap between the train and test curves increases with higher model complexity, that indicates overfitting.</p>
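<p>One way to produce such a plot with scikit-learn, using an illustrative dataset and sweeping tree depth as the complexity axis:</p><pre>import matplotlib.pyplot as plt<br>from sklearn.datasets import make_classification<br>from sklearn.model_selection import validation_curve<br>from sklearn.tree import DecisionTreeClassifier<br><br>X, y = make_classification(n_samples=500, random_state=0)<br><br># score train and validation folds as tree depth (model complexity) grows<br>depths = [1, 2, 4, 8, 16]<br>train_scores, val_scores = validation_curve(<br>    DecisionTreeClassifier(random_state=0), X, y,<br>    param_name=&#39;max_depth&#39;, param_range=depths, cv=5)<br><br>plt.plot(depths, train_scores.mean(axis=1), label=&#39;train&#39;)<br>plt.plot(depths, val_scores.mean(axis=1), label=&#39;validation&#39;)<br>plt.xlabel(&#39;max_depth&#39;)<br>plt.ylabel(&#39;accuracy&#39;)<br>plt.legend()<br>plt.show()  # a widening gap between the two curves signals overfitting</pre>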
<p><strong>16) What is the effect of multi-collinearity on feature importances of XGBoost?</strong></p><p>Ans. Multi-collinearity has a huge effect on feature importances: if two variables are highly correlated with each other, one variable compensates for the absence of the other in the feature importance scores. Due to this, the importance score decreases for what could be a very important feature, and you may end up dropping it.</p><p><strong>17) What is the effect of multi-collinearity on model performances and model interpretation?</strong></p><p>Ans. Multi-collinearity makes it harder to interpret your coefficients, as they become very sensitive to small changes in the model.</p><p><strong>18) What is the effect of a higher number of features than the number of rows?</strong></p><p>Ans. To answer this, you can give an example: if you have two columns and only one data point, there can be infinitely many lines that are a solution to that case. In other words, there won’t be a unique solution to the problem. Hence, it is important to have more rows than columns. However, one can use techniques like ridge and lasso to tackle such cases.</p><p><strong>19) What is the difference between Ridge and Lasso regression?</strong></p><p>Ans. In Ridge, the penalty term is the sum of squares of the coefficients whereas, in Lasso, it is the sum of the absolute values of the coefficients.</p><p><strong>20) Explain overfitting to a non-technical audience.</strong></p><p>Ans. Let’s assume you have a maths exam tomorrow for which you have practiced all the book problems and have somehow memorized most of the solutions. However, during the exam, the questions asked are a little different from the ones that you memorized, and you can’t score well. This is essentially what happens in overfitting, where the model learns the training data so well that it can’t perform well on test data it has never seen.</p><p>If you have reached this point, then thank you so much for reading my article. I’ll be back with Part 2 of this article, answering statistics and programming questions. Stay tuned!</p><h3>Thank you!</h3><p>If you like my work, please follow me on Medium to read more articles in the near future.</p><ul><li>Read my other articles on <a href="https://towardsdatascience.com/10-problems-to-practice-almost-all-sql-concepts-37545e7c5219">Top 10 SQL problems</a>, <a href="https://towardsdatascience.com/the-art-of-engineering-features-for-a-strong-machine-learning-model-a47a876e654c">Feature Engineering</a> &amp; <a href="https://towardsdatascience.com/learn-how-to-automate-the-basic-steps-of-data-analysis-45e118048172">Automating basic data analysis</a>.</li><li>Would love to connect with you on <a href="https://www.linkedin.com/in/mrinal-gupta-5319a9ab/">LinkedIn</a>.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8143a72cd247" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/uncovering-all-data-science-questions-asked-to-me-part-1-8143a72cd247">Uncovering all Data Science Questions asked to me — Part 1</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Dog Breed Classifier Using Convolutional Neural Networks]]></title>
            <link>https://medium.com/swlh/dog-breed-classifier-using-convolutional-neural-networks-6052edfab487?source=rss-7cf002511db6------2</link>
            <guid isPermaLink="false">https://medium.com/p/6052edfab487</guid>
            <category><![CDATA[udacity]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <dc:creator><![CDATA[Mrinal Gupta]]></dc:creator>
            <pubDate>Fri, 21 Aug 2020 17:14:05 GMT</pubDate>
            <atom:updated>2020-08-22T00:36:37.338Z</atom:updated>
            <content:encoded><![CDATA[<h4>Do you also want to identify the breed of any dog in just 5 seconds?</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*OacRr8sB6PSc945I" /><figcaption>Photo by <a href="https://unsplash.com/@nate_dumlao?utm_source=medium&amp;utm_medium=referral">Nathan Dumlao</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>Introduction</h3><p>Do you know the breed of the dog in the picture above? If you don’t, then it’s completely fine, because I don’t know either. Well, we come across a lot of different dog breeds while walking down the street, and the second thing we want to know is the breed (wondering what the first thing is?! The name!). So why waste time? Let’s take some help from one of the most popular machine learning methods, the Convolutional Neural Network (CNN), to detect the breed of a dog. In this article, we will review a full algorithm to detect the breed of a dog using the given <a href="https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip"><strong>Dataset</strong></a>. We will also see how to use a pre-trained ResNet-50 model to detect whether an image contains a dog at all.</p><h3>Step by Step!</h3><ul><li>Import the Dataset</li><li>Detect Humans using CV2</li><li>Detect Dogs</li><li>Create a CNN to classify Dog Breeds (from Scratch)</li><li>Use a CNN to Classify Dog Breeds (using Transfer Learning)</li><li>Create a CNN to Classify Dog Breeds (using Transfer Learning)</li><li>Write the Algorithm</li><li>Test the Algorithm</li></ul><h3>STEP-1 Import the Dataset</h3><p>Our dataset contains 8,351 dog images in total, across 133 different breed categories.</p><pre><strong>def </strong>load_dataset(path):<br>    data = load_files(path)<br>    dog_files = np.array(data[&#39;filenames&#39;])<br>    dog_targets = np_utils.to_categorical(np.array(data[&#39;target&#39;]), 133)<br>    <strong>return </strong>dog_files, dog_targets</pre><p>After calling the above function with the path where the images are stored, dog_files will contain the paths of all the images in the whole dataset, and dog_targets will contain the 133 one-hot encoded target variables. Let’s load the train, test, and validation sets using the above function.</p><pre>train_files, train_targets = load_dataset(&#39;../../../data/dog_images/train&#39;)<br>valid_files, valid_targets = load_dataset(&#39;../../../data/dog_images/valid&#39;)<br>test_files, test_targets = load_dataset(&#39;../../../data/dog_images/test&#39;)</pre><h3>STEP-2 Detect Humans</h3><p>So that a human disguised in a dog costume cannot fool our classification results, we will detect humans through OpenCV’s implementation of Haar feature-based cascade classifiers. The following code implements this:</p><pre>import cv2<br><em># extract pre-trained face detector<br># cv2.CascadeClassifier is the model for detecting faces</em><br>face_cascade = cv2.CascadeClassifier(&#39;haarcascades/haarcascade_frontalface_alt.xml&#39;)</pre><pre><em># load color (BGR) image<br># cv2.imread(image_file_name) reads an image</em><br>img = cv2.imread(human_files[3])</pre><pre><em># convert BGR image to grayscale</em><br>gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)</pre><pre><em># find faces in image</em><br>faces = face_cascade.detectMultiScale(gray)</pre><pre><em># print number of faces detected in the image</em><br>print(&#39;Number of faces detected:&#39;, len(faces))</pre><p>You can expect to see something like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/367/1*oqhYMIo-88FK2r2bkzRO3Q.png" /><figcaption>Image by Author</figcaption></figure><p>On the test data, the detector flagged 11% of the dog images as containing human faces.</p>
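<p>The snippet above can be wrapped into the face_detector helper that the final algorithm later in this article relies on; a minimal sketch:</p><pre><em># returns True if at least one face is detected in the image at img_path</em><br>def face_detector(img_path):<br>    img = cv2.imread(img_path)<br>    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)<br>    faces = face_cascade.detectMultiScale(gray)<br>    return len(faces) &gt; 0</pre>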
<h3>STEP-3 Detect Dogs</h3><p>In this section, we will use one of the most powerful CNN architectures available, ResNet-50. It is pre-trained on ImageNet, a very large, very popular dataset used for image classification and other vision tasks. We will use this pre-trained model to detect whether an image contains a dog or not.</p><pre><em># This line will download the ResNet-50 model, along with weights that have been trained on ImageNet</em><br>from keras.applications.resnet50 import ResNet50</pre><pre><em># define ResNet50 model</em><br>ResNet50_model = ResNet50(weights=&#39;imagenet&#39;)</pre><h4>Data Pre-processing</h4><p>When using TensorFlow as the backend, Keras CNNs require a 4D array (which we’ll also refer to as a 4D tensor) as input, with shape:</p><pre>		(nb_samples,rows,columns,channels)</pre><p>where nb_samples corresponds to the total number of images (or samples), and rows, columns, and channels correspond to the number of rows, columns, and channels for each image, respectively.</p><p>The path_to_tensor function below takes a string-valued file path to a color image as input and returns a 4D tensor suitable for supplying to a Keras CNN. The function first loads the image and resizes it to a square image of 224×224 pixels. Next, the image is converted to an array, which is then reshaped into a 4D tensor. In this case, since we are working with color images, each image has three channels. Likewise, since we are processing a single image (or sample), the returned tensor will always have shape</p><pre>			(1,224,224,3)</pre><p>The paths_to_tensor function takes a numpy array of string-valued image paths as input and returns a 4D tensor with shape</p><pre>		    (nb_samples,224,224,3)</pre><p>Here, nb_samples is the number of samples, or number of images, in the supplied array of image paths. It is best to think of nb_samples as the number of 3D tensors (where each 3D tensor corresponds to a different image) in your dataset!</p>
<p>The following code performs the data pre-processing:</p><pre>from keras.preprocessing import image<br>from tqdm import tqdm</pre><pre>def path_to_tensor(img_path):<br>    <em># loads RGB image as PIL.Image.Image type</em><br>    img = image.load_img(img_path, target_size=(224, 224))<br>    <em># convert PIL.Image.Image type to 3D tensor with shape (224, 224, 3)</em><br>    x = image.img_to_array(img)<br>    <em># convert 3D tensor to 4D tensor with shape (1, 224, 224, 3) and return 4D tensor</em><br>    return np.expand_dims(x, axis=0)</pre><pre>def paths_to_tensor(img_paths):<br>    list_of_tensors = [path_to_tensor(img_path) for img_path in tqdm(img_paths)]<br>    return np.vstack(list_of_tensors)</pre><p>Predicting using ResNet50:</p><pre>from keras.applications.resnet50 import preprocess_input, decode_predictions</pre><pre>def ResNet50_predict_labels(img_path):<br>    <em># returns prediction vector for image located at img_path</em><br>    img = preprocess_input(path_to_tensor(img_path))<br>    return np.argmax(ResNet50_model.predict(img))</pre><pre><em># returns &quot;True&quot; if a dog is detected in the image stored at img_path</em><br>def dog_detector(img_path):<br>    prediction = ResNet50_predict_labels(img_path)<br>    return ((prediction &lt;= 268) &amp; (prediction &gt;= 151))</pre><p>Note: ImageNet categories 151–268 correspond to dog breeds, so the function returns True only when the top predicted category falls in that range; since ImageNet is a huge dataset, we are only concerned with its dog categories here. Also, the test results were as expected, and we got 100% accuracy in detecting whether a given image is of a dog or not.</p><h3>STEP-4 Creating your own CNN from scratch</h3><p>Now it’s time to create our own CNN architecture, deciding everything from the number of convolutional and max pooling layers to the other parameters. The following is the architecture that you can build:</p><pre>model = Sequential()<br>model.add(Conv2D(filters = 6, kernel_size=5, strides=1, padding=&#39;same&#39;, activation=&#39;relu&#39;, input_shape = (224,224,3)))<br>model.add(MaxPooling2D(pool_size=2))</pre><pre>model.add(Conv2D(filters=16, kernel_size=5, activation=&#39;relu&#39;, padding=&#39;same&#39;, strides=1))<br>model.add(MaxPooling2D(pool_size=2))</pre><pre>model.add(Dropout(0.2))<br>model.add(Flatten())</pre><pre>model.add(Dense(200, activation=&#39;relu&#39;))<br>model.add(Dropout(0.4))</pre><pre>model.add(Dense(133,activation=&#39;softmax&#39;))</pre><pre>model.summary()</pre><p>Features of my architecture:</p><ul><li>My architecture contains two convolutional layers to extract the features from the images.</li><li>The first convolutional layer is made up of 6 filters of size 5x5 with ReLU as the activation function, since it mitigates the vanishing gradient problem we face with sigmoid. 
Similarly, the second convolutional layer has 16 filters of the same size and the same activation function.</li><li>To reduce the number of parameters and keep only the most important features, a max pooling layer of size 2x2 is added after each convolutional layer.</li><li>Further, a Dropout layer with a probability of 0.2 is added in order to prevent overfitting.</li><li>Towards the end, we have a fully connected layer with 200 nodes and a ReLU activation function.</li><li>Another dropout layer with a probability of 0.4 is added to further guard against overfitting.</li><li>Finally, we have the output layer with the number of nodes equal to the number of dog breeds in the dataset, with a softmax activation function to predict the probabilities of the different breeds.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/658/1*3Ov8V8zLWS999T7CtwjFaw.png" /><figcaption>Image by author</figcaption></figure><h4>Training the CNN</h4><p>To train our model, we will use 20 epochs with a batch size of 20, saving the best model weights using the ModelCheckpoint callback. The following is the code:</p><pre><em># Compile the model</em><br>model.compile(optimizer=&#39;rmsprop&#39;, loss=&#39;categorical_crossentropy&#39;, metrics=[&#39;accuracy&#39;])</pre><pre>epochs = 20</pre><pre>checkpointer = ModelCheckpoint(filepath=&#39;saved_models/weights.best.from_scratch.hdf5&#39;, verbose=1, save_best_only=True)</pre><pre>model.fit(train_tensors, train_targets, validation_data=(valid_tensors, valid_targets), epochs=epochs, batch_size=20, callbacks=[checkpointer], verbose=1)</pre><pre><em># load the best weights</em><br>model.load_weights(&#39;saved_models/weights.best.from_scratch.hdf5&#39;)</pre><pre><em># Test the model<br># get index of predicted dog breed for each image in test set</em><br>dog_breed_predictions = [np.argmax(model.predict(np.expand_dims(tensor, axis=0))) for tensor in test_tensors]</pre><pre><em># report test accuracy</em><br>test_accuracy = 100*np.sum(np.array(dog_breed_predictions)==np.argmax(test_targets, axis=1))/len(dog_breed_predictions)<br>print(&#39;Test accuracy: %.4f%%&#39; % test_accuracy)</pre><p>The test accuracy that I got was 5.8612%, which is modest but well above random guessing across 133 breeds. You can increase the number of epochs and try hyperparameter tuning to improve it further.</p><h3>STEP-5 Using Transfer learning to classify breeds</h3><p>Transfer learning is a technique that saves the time of building a CNN from scratch: we extract the bottleneck features from a pre-trained classifier and make a few modifications to that model to use it for a different dataset. For example, we can remove the final dense layer from that model and add another dense layer with a different number of outputs using a softmax activation function. Here, we are going to use VGG16, extract features from it, feed them to a global average pooling layer to decrease the number of parameters, and finally add the fully connected layer with 133 output nodes using a softmax activation function.</p><h4>Bottleneck features</h4><pre>bottleneck_features = np.load(&#39;bottleneck_features/DogVGG16Data.npz&#39;)<br>train_VGG16 = bottleneck_features[&#39;train&#39;]<br>valid_VGG16 = bottleneck_features[&#39;valid&#39;]<br>test_VGG16 = bottleneck_features[&#39;test&#39;]</pre><h4>Add layers at the end</h4><pre>VGG16_model = Sequential()<br>VGG16_model.add(GlobalAveragePooling2D(input_shape=train_VGG16.shape[1:]))<br>VGG16_model.add(Dense(133, activation=&#39;softmax&#39;))</pre>
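<p>The training call for this model isn’t shown in the article; it would mirror the compile/fit/checkpoint pattern used for the scratch model, roughly as follows (the checkpoint filename here is illustrative):</p><pre>VGG16_model.compile(loss=&#39;categorical_crossentropy&#39;, optimizer=&#39;rmsprop&#39;, metrics=[&#39;accuracy&#39;])</pre><pre>checkpointer = ModelCheckpoint(filepath=&#39;saved_models/weights.best.VGG16.hdf5&#39;, verbose=1, save_best_only=True)</pre><pre>VGG16_model.fit(train_VGG16, train_targets, validation_data=(valid_VGG16, valid_targets), epochs=20, batch_size=20, callbacks=[checkpointer], verbose=1)</pre><pre><em># load the best weights saved during training</em><br>VGG16_model.load_weights(&#39;saved_models/weights.best.VGG16.hdf5&#39;)</pre>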
<p>After compiling and fitting the model in this way and evaluating it on the test dataset, you should get an accuracy of around 43%.</p><h3>STEP-6 Create your own CNN to classify Dog breeds using Transfer learning</h3><p>For this section, let’s use ResNet-50. The steps remain the same as in the previous section, but this time the model will be different and, hopefully, the accuracy too. ResNet-50 is 50 layers deep and hence very powerful in image classification problems, as discussed earlier.</p><h4>Model Architecture</h4><pre>ResNet50_model_transfer = Sequential()<br>ResNet50_model_transfer.add(GlobalAveragePooling2D(input_shape=train_ResNet.shape[1:]))<br>ResNet50_model_transfer.add(Dense(133, activation=&#39;softmax&#39;))</pre><pre>ResNet50_model_transfer.summary()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/657/1*03D9qw8rbTwLJ35dxh31PQ.png" /><figcaption>Architecture image by author</figcaption></figure><p>After running it with 20 epochs and a batch size of 20, you will observe a test accuracy of around 82%.</p><p>Other models that you can try are the following:</p><ul><li><a href="https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/DogVGG19Data.npz">VGG-19</a></li><li><a href="https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/DogInceptionV3Data.npz">Inception</a></li><li><a href="https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/DogXceptionData.npz">Xception</a></li></ul><h3>STEP-7 Write your Algorithm</h3><p>Now, we will combine all the above steps into a complete algorithm that does the following:</p><ul><li>if a <strong>dog</strong> is detected in the image, return the predicted breed.</li><li>if a <strong>human</strong> is detected in the image, return the resembling dog breed.</li><li>if <strong>neither</strong> is detected in the image, provide output that indicates an error.</li></ul><p>The following function performs the above tasks:</p><pre>from PIL import Image</pre><pre>def dog_classifier(img_path):<br>    if dog_detector(img_path):<br>        print(&#39;Lemme guess, Hey! You are a Dog!&#39;)<br>        image = Image.open(img_path)<br>        plt.imshow(image, interpolation=&#39;nearest&#39;)<br>        plt.axis(&#39;off&#39;)<br>        plt.show()<br>        breed = Resnet_predict_breed(img_path)<br>        print(&#39;OMG! You are {}\n\n&#39;.format(breed))<br>    <br>    elif face_detector(img_path):<br>        print(&#39;Lemme guess, Hey! You are Human!&#39;)<br>        image = Image.open(img_path)<br>        plt.imshow(image, interpolation=&#39;nearest&#39;)<br>        plt.axis(&#39;off&#39;)<br>        plt.show()<br>        breed = Resnet_predict_breed(img_path)<br>        print(&#39;Hahahha! You look like {}\n\n&#39;.format(breed))<br>    <br>    else:<br>        print(&#39;Sorry, you are neither a dog nor a human&#39;)<br>        image = Image.open(img_path)<br>        plt.imshow(image, interpolation=&#39;nearest&#39;)<br>        plt.axis(&#39;off&#39;)<br>        plt.show()</pre>
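<p>Note that dog_classifier calls a Resnet_predict_breed helper that isn’t defined above. A sketch consistent with the transfer-learning model from Step 6, assuming an extract_Resnet50 bottleneck-feature helper and a dog_names list holding the 133 breed labels:</p><pre>def Resnet_predict_breed(img_path):<br>    <em># extract the bottleneck features for this image and predict the breed</em><br>    bottleneck_feature = extract_Resnet50(path_to_tensor(img_path))<br>    predicted_vector = ResNet50_model_transfer.predict(bottleneck_feature)<br>    <em># dog_names maps output indices to breed names</em><br>    return dog_names[np.argmax(predicted_vector)]</pre>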
<h3>STEP-8 Testing Time!</h3><p>Let’s test some images and see the results:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/430/1*EH02pAFMYLbgrllz-aWgwg.png" /><figcaption>Image by author</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/343/1*HRSt7Kdk35T9O676A9KKiQ.png" /><figcaption>Image by author</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/437/1*yITsj4kmMp3Sih-PgptlLg.png" /><figcaption>Image by author</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/484/1*kZ27vAmHeGyMc44I8sam-Q.png" /><figcaption>Image by author</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/508/1*C8xpsCGXKj8xjZ6cpJBTmw.png" /><figcaption>Image by author</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/429/1*qdI8Teu95Rf5KAEreTnjJg.png" /><figcaption>Image by author</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/475/1*SRjhcmPvYw_dBVSYRWmFiA.png" /></figure><h4>Areas of improvement</h4><ul><li>A few cat breeds resemble certain dog breeds, as we can see in the last picture. The model still predicted the cat to be a dog, so we could improve this behavior with better hyperparameters and/or a deeper CNN.</li><li>Detecting a dog and a human together in an image, even when they are not facing the camera, would be a great feature to add. We could do this by augmenting the training set with images where the dog is not facing the camera, and similarly for humans.</li><li>I tried an animated dog image, which the model did not recognise. Predicting the breed from an animated image of a dog would be quite interesting to see.</li></ul><h3>Conclusion</h3><p>I hope you enjoyed going through my article and learned something new today. The whole code for this project is uploaded to my <a href="https://github.com/mrinal1704/Dog-Breed-Classifier-using-CNN"><strong>GitHub</strong></a>. Please feel free to have a look at it.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6052edfab487" width="1" height="1" alt=""><hr><p><a href="https://medium.com/swlh/dog-breed-classifier-using-convolutional-neural-networks-6052edfab487">Dog Breed Classifier Using Convolutional Neural Networks</a> was originally published in <a href="https://medium.com/swlh">The Startup</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[10 problems to practice almost all SQL concepts]]></title>
            <link>https://medium.com/data-science/10-problems-to-practice-almost-all-sql-concepts-37545e7c5219?source=rss-7cf002511db6------2</link>
            <guid isPermaLink="false">https://medium.com/p/37545e7c5219</guid>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Mrinal Gupta]]></dc:creator>
            <pubDate>Mon, 06 Jul 2020 08:41:24 GMT</pubDate>
            <atom:updated>2020-07-06T15:51:01.415Z</atom:updated>
            <content:encoded><![CDATA[<h3>Top 10 problems to practice almost all SQL concepts</h3><h4>Covers all the SQL concepts of JOINs, aggregates, window functions, and subqueries</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ipRIWc6d0q0K928I" /><figcaption>Photo by <a href="https://unsplash.com/@alexacea?utm_source=medium&amp;utm_medium=referral">Alexandru Acea</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>Introduction</h3><p>I recently completed all 117 SQL questions on Leetcode in 25 days. Leetcode is probably one of the most popular websites for practicing your coding skills in various programming languages, and it neatly sorts the questions into three categories, namely Easy, Medium, and Hard, with the level of difficulty rising at each subsequent level. After completing all of them, I decided to highlight 10 questions that together cover almost all the concepts, ranging from basic to advanced SQL, which you can practice in order to brush up your SQL programming skills. Additionally, all of these questions have been asked in interviews at almost all the big tech companies.</p><h4>The following is the breakdown of SQL skills tested in every question:</h4><ul><li><strong>Q1 Average Salary — </strong>CTE, aggregates in window functions, CASE WHEN, date functions such as DATE_PART, INNER JOIN</li><li><strong>Q2 Find Quiet Students in Results — </strong>Subqueries, MIN, MAX, window functions, window alias, INNER JOIN, ALL keyword</li><li><strong>Q3 Human Traffic of Stadium — </strong>LEFT JOIN with subqueries, CTE, ROW_NUMBER</li><li><strong>Q4 Number of Transactions per Visit — </strong>RECURSIVE CTE, COALESCE, COUNT</li><li><strong>Q5 Report Contiguous Dates (MySQL) — </strong>DATE_SUB, ROW_NUMBER</li><li><strong>Q6 Sales by Day of the Week — </strong>Pivot table, CASE WHEN</li><li><strong>Q7 Department Top 3 Salaries — </strong>DENSE_RANK</li><li><strong>Q8 Restaurant Growth — </strong>PRECEDING for moving average, OFFSET</li><li><strong>Q9 Shortest Distance in a Plane — </strong>CROSS JOIN, SQRT, POW</li><li><strong>Q10 Consecutive Numbers — </strong>LAG, LEAD</li></ul><p>So, let’s get down to business!</p><ol><li><strong><em>Given the two tables below, write a query to display the comparison result (higher/lower/same) of the average salary of employees in a department to the company’s average salary.</em></strong></li></ol><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b5f8866f84014d770c858ddd99ccd62e/href">https://medium.com/media/b5f8866f84014d770c858ddd99ccd62e/href</a></iframe><p><strong><em>Solution 1:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e333a7af87b5aa14f38a5c2cc9765e0c/href">https://medium.com/media/e333a7af87b5aa14f38a5c2cc9765e0c/href</a></iframe><p><strong><em>2.</em></strong> <strong><em>Write an SQL query to report the students (student_id, student_name) being “quiet” in ALL exams. 
A “quiet” student is one who took at least one exam and scored neither the highest nor the lowest score.</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/052921e037d0cbd8c0abf017615697b3/href">https://medium.com/media/052921e037d0cbd8c0abf017615697b3/href</a></iframe><p><strong><em>Solution 2</em></strong>:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/344c185eeccd96fd7ebdc80eaac30342/href">https://medium.com/media/344c185eeccd96fd7ebdc80eaac30342/href</a></iframe><p><strong><em>3. Write a query to display the records which have 3 or more consecutive rows and a number of people of 100 or more.</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a3a6004f0f006c4ac46fbb64adf2c236/href">https://medium.com/media/a3a6004f0f006c4ac46fbb64adf2c236/href</a></iframe><p><strong><em>Solution 3:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/61a85701703338f962eb34a9114dda0f/href">https://medium.com/media/61a85701703338f962eb34a9114dda0f/href</a></iframe><p><strong><em>4. Write an SQL query to find how many users visited the bank and didn’t do any transactions, how many visited the bank and did one transaction, and so on.</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/044e66f63e7a2316d46683715fd1003c/href">https://medium.com/media/044e66f63e7a2316d46683715fd1003c/href</a></iframe><p><strong><em>Solution 4:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d2de7cc3a744a02fd725f0fd20f14131/href">https://medium.com/media/d2de7cc3a744a02fd725f0fd20f14131/href</a></iframe><p><strong><em>5. Write an SQL query to generate a report of period_state for each continuous interval of days in the period from 2019-01-01 to 2019-12-31.</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/fc4b83a4b2b4d268c8c6b3a80624a536/href">https://medium.com/media/fc4b83a4b2b4d268c8c6b3a80624a536/href</a></iframe><p><strong><em>Solution 5:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c4f695c755f9bb1722cf28b8a26263c2/href">https://medium.com/media/c4f695c755f9bb1722cf28b8a26263c2/href</a></iframe><p><strong><em>6. Write an SQL query to report how many units in each category have been ordered on each day of the week.</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/84a5747682b8d30d35cea495157e5c0c/href">https://medium.com/media/84a5747682b8d30d35cea495157e5c0c/href</a></iframe><p><strong><em>Solution 6:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/93ff5f4b1b4e5354e597a0503cde51c3/href">https://medium.com/media/93ff5f4b1b4e5354e597a0503cde51c3/href</a></iframe><p><strong><em>7. Write an SQL query to find the employees who earn the top three salaries in each of the departments. For the above tables, your SQL query should return the following rows (the order of rows does not matter).</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f563c243bab1b6a5f4f3fc37e521b3a0/href">https://medium.com/media/f563c243bab1b6a5f4f3fc37e521b3a0/href</a></iframe><p><strong><em>Solution 7:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4a8edacae76049073c4929c478f14084/href">https://medium.com/media/4a8edacae76049073c4929c478f14084/href</a></iframe>
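<p>Since the embedded gists may not render in every RSS reader, here is the core pattern behind Solution 7, with table and column names kept illustrative:</p><pre>SELECT d.Name AS Department, e.Name AS Employee, e.Salary<br>FROM (SELECT *,<br>             DENSE_RANK() OVER (PARTITION BY DepartmentId ORDER BY Salary DESC) AS rnk<br>      FROM Employee) e<br>JOIN Department d ON e.DepartmentId = d.Id<br>WHERE e.rnk &lt;= 3;  -- DENSE_RANK keeps ties, so tied salaries all qualify</pre>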
<p><strong><em>7. Write an SQL query to find the employees who earn the top three salaries in each of the departments. For the above tables, your SQL query should return the following rows (order of rows does not matter).</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f563c243bab1b6a5f4f3fc37e521b3a0/href">https://medium.com/media/f563c243bab1b6a5f4f3fc37e521b3a0/href</a></iframe><p><strong><em>Solution 7:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4a8edacae76049073c4929c478f14084/href">https://medium.com/media/4a8edacae76049073c4929c478f14084/href</a></iframe><p><strong><em>8. Write an SQL query to compute the moving average of how much customers paid in a 7-day window (the current day plus the 6 days before).</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/76faea6e2eaa476d1f57cabc8db256b9/href">https://medium.com/media/76faea6e2eaa476d1f57cabc8db256b9/href</a></iframe><p><strong><em>Solution 8:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/23c2d5db9ac862068e160027773f5dee/href">https://medium.com/media/23c2d5db9ac862068e160027773f5dee/href</a></iframe><p><strong><em>9. Write a query to find the shortest distance between these points, rounded to 2 decimal places.</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b75d95b9491e4f014532867c0186d837/href">https://medium.com/media/b75d95b9491e4f014532867c0186d837/href</a></iframe><p><strong><em>Solution 9:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0494d689dceb83b59429a499c96d1abe/href">https://medium.com/media/0494d689dceb83b59429a499c96d1abe/href</a></iframe><p><strong><em>10. Write an SQL query to find all numbers that appear at least three times consecutively.</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/224c6436eb5a528beeebe613daf8b01b/href">https://medium.com/media/224c6436eb5a528beeebe613daf8b01b/href</a></iframe><p><strong><em>Solution 10:</em></strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e1b306fec69c48fad748ed800ef8fca6/href">https://medium.com/media/e1b306fec69c48fad748ed800ef8fca6/href</a></iframe>
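<p>And one more sketch, for Q10: LAG and LEAD expose each row’s neighbours, so a number equal to both its previous and next value appears at least three times in a row. The logs table below is a toy stand-in for the question’s schema.</p><pre>
import sqlite3

# Toy data: only 1 appears three times consecutively.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logs (id INTEGER PRIMARY KEY, num INT);
INSERT INTO logs (num) VALUES (1), (1), (1), (2), (1), (2), (2);
""")

# LAG/LEAD pull each row's previous and next value; a row whose num
# matches both is the middle of a run of three or more.
query = """
WITH neighbours AS (
    SELECT num,
           LAG(num)  OVER (ORDER BY id) AS prev_num,
           LEAD(num) OVER (ORDER BY id) AS next_num
    FROM logs
)
SELECT DISTINCT num AS consecutive_num
FROM neighbours
WHERE num = prev_num AND num = next_num;
"""
for row in conn.execute(query):
    print(row)   # (1,) for the sample data
</pre>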
<p>That’s a wrap! I hope you liked the questions and were able to practice some of the most important concepts of SQL. If you want to practice more questions like these, feel free to head over to my <a href="https://github.com/mrinal1704/SQL-Leetcode-Challenge"><strong>Github</strong></a> page, where I have uploaded all 117 solutions.</p><h3>Thank you!</h3><p>If you like my work, please follow me on Medium to read more articles in the near future.</p><ul><li>Read my other articles on <a href="https://towardsdatascience.com/the-art-of-engineering-features-for-a-strong-machine-learning-model-a47a876e654c">Feature Engineering</a> &amp; <a href="https://towardsdatascience.com/learn-how-to-automate-the-basic-steps-of-data-analysis-45e118048172">Automating basic data analysis</a>.</li><li>I would love to connect with you on <a href="https://www.linkedin.com/in/mrinal-gupta-5319a9ab/">LinkedIn</a>.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=37545e7c5219" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/10-problems-to-practice-almost-all-sql-concepts-37545e7c5219">10 problems to practice almost all SQL concepts</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Learn how to automate the basic steps of Data Analysis]]></title>
            <link>https://medium.com/data-science/learn-how-to-automate-the-basic-steps-of-data-analysis-45e118048172?source=rss-7cf002511db6------2</link>
            <guid isPermaLink="false">https://medium.com/p/45e118048172</guid>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[automation]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <dc:creator><![CDATA[Mrinal Gupta]]></dc:creator>
            <pubDate>Wed, 17 Jun 2020 21:15:57 GMT</pubDate>
            <atom:updated>2020-06-17T23:53:58.893Z</atom:updated>
<content:encoded><![CDATA[<h3>Learn how to automate the basic steps of data analysis</h3><h4>Are you also bored with writing df.shape, df.info() again and again?</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Qb0uxS_nGGgQl1st" /><figcaption>Photo by <a href="https://unsplash.com/@markusspiske?utm_source=medium&amp;utm_medium=referral">Markus Spiske</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>Introduction</h3><p>Are you bored of writing<em> df.shape, df.info(), df.plot(kind=’bar’), df[‘column_name’].nunique()</em>, and many other basic functions again and again just to get basic insights from your data? I am sure this process has started to feel monotonous. After reading this article, you will see how to automate these basic functions in five simple steps by developing your own Python package in a matter of minutes.</p><h4>Preview</h4><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fk9s5Z4OK8os&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dk9s5Z4OK8os&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/ef5a84f5c6169b62560db9c26d5dc72f/href">https://medium.com/media/ef5a84f5c6169b62560db9c26d5dc72f/href</a></iframe><h3>Let’s get started</h3><p>To begin developing your own customised Python package, perform the following steps:</p><h4>STEP 1 — Creation of the Python script file</h4><p>This file will contain the Python code necessary to run the basic data analysis. To demonstrate, let us automate steps such as:</p><ul><li>The dimensions of the dataset</li><li>The data types of all the columns</li><li>The number of unique values per column</li><li>The percentage of NA values per column</li><li>A bar chart for each categorical column</li><li>A histogram for each numeric column to see the distribution of the data</li><li>A heatmap showing the null values</li></ul><p>The following is the snippet of the code that I wrote:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/980/1*ydJ9g9gI5gkzROOdA8WYvg.png" /><figcaption>Image by Author</figcaption></figure><p>The file should carry the name you want the package to be called by, the way Pandas and Numpy do, and the name should be unique. In our case, I have named it ‘Mrinal’.</p><h4>STEP 2 Create a Setup.py file</h4><p>This file is necessary to install the package and contains information such as the package name and the author name. It resides outside the folder which contains the Python script file from Step 1 and the other files discussed later.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/435/1*mooZx0KxgPLlgM60bCd8UQ.png" /><figcaption>Image by Author</figcaption></figure><p>The above image shows the code to be written in the Setup.py. Note that the name of your package must be unique: if you want to publish to pypi.org later, you cannot use a name that is already present on the website. For example, you cannot create a package named ‘Pandas’ or ‘Numpy’ as they are already in the library.</p>
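<p>For reference, a minimal Setup.py along the lines described above might look like the sketch below; the metadata values are placeholders, and only the package name ‘Mrinal’ comes from this article. Note that the file is conventionally named setup.py, in lowercase.</p><pre>
# setup.py - a minimal sketch; the metadata values below are
# placeholders, only the package name 'Mrinal' comes from the article.
from setuptools import setup

setup(
    name="Mrinal",                 # must be unique if published to PyPI
    version="0.0.1",
    description="Automates the basic steps of data analysis",
    author="Mrinal Gupta",
    packages=["Mrinal"],           # the folder holding __init__.py
    install_requires=["pandas", "matplotlib", "seaborn"],  # assumed deps
)
</pre>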
<h4>STEP 3 Create an __init__.py file</h4><p>This file tells Python that the folder contains a package. It should be present in the same folder as the Python script file created in Step 1.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/352/1*lQBD6tX2_n7vOurHUWHMMg.png" /><figcaption>Image by Author</figcaption></figure><p>The above code references the class that we created in the Python script, ‘Insights’, and the name of the package, ‘Mrinal’ in our case. The ‘.’ is mandatory in Python 3.</p><h4>STEP 4 Arrange the files in the right folder</h4><p>For this step:</p><ul><li>Create a folder; you can name it anything you want, as the name doesn&#39;t affect the installation in any way. Let’s name it ‘My first Python package’ for reference</li><li>Store the Setup.py file inside this folder</li><li>Create another folder inside it and give it the same name as the package, ‘Mrinal’ in our case. Whenever you want to import the package, you will write ‘from Mrinal import Insights’</li><li>Store the Python script file named ‘Mrinal.py’ and the ‘__init__.py’ file inside the newly created folder</li></ul><h4>STEP 5 Pip Install</h4><ul><li>Open the command prompt</li><li>Use the ‘cd’ command to navigate to the ‘My first Python package’ folder</li><li>Type ‘pip install .’</li><li>This installs your package</li><li>Then open any IDE, such as Jupyter Notebook, and type ‘from Mrinal import Insights’</li><li>Create a class object, for instance, insight_1 = Insights(). You can also look at the preview video.</li><li>Then call the ‘automate_analysis()’ function, just like in the video. You will see how those repeated steps are now automated: you call this one function and it does all the work.</li></ul><h3>Congratulations!</h3><p>You built your first Python package on your own and will save a lot of time in the future by not writing those functions again and again. Similarly, you can add more functions and classes to extend your package and make your data analysis process smoother.</p><h3>Resources</h3><ul><li>You can also download all the code files from <a href="https://github.com/mrinal1704/My-First-Python-Package">my GitHub</a> page</li><li>If you want to upload your package to pypi.org, you can follow this <a href="https://dev.to/prahladyeri/python-checklist-publishing-a-package-to-pypi-4jlc">link</a></li></ul><p>If you liked this article, do read my other article on <a href="https://towardsdatascience.com/the-art-of-engineering-features-for-a-strong-machine-learning-model-a47a876e654c">how you can develop the critical skills needed to perform Feature Engineering for a strong Machine Learning model</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=45e118048172" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/learn-how-to-automate-the-basic-steps-of-data-analysis-45e118048172">Learn how to automate the basic steps of Data Analysis</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Art of engineering features for a strong Machine learning model]]></title>
            <link>https://medium.com/data-science/the-art-of-engineering-features-for-a-strong-machine-learning-model-a47a876e654c?source=rss-7cf002511db6------2</link>
            <guid isPermaLink="false">https://medium.com/p/a47a876e654c</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[feature-engineering]]></category>
            <category><![CDATA[anti-money-laundering]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <dc:creator><![CDATA[Mrinal Gupta]]></dc:creator>
            <pubDate>Sun, 31 May 2020 08:29:49 GMT</pubDate>
            <atom:updated>2020-06-04T18:18:46.248Z</atom:updated>
<content:encoded><![CDATA[<h3><strong>The art of engineering features for a strong machine learning model</strong></h3><h4>The most critical process in any data science problem that you should learn.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xTcU74Gqaqi-KTxv5up7uw.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@franckinjapan?utm_source=medium&amp;utm_medium=referral">Franck V.</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3><strong>What you’ll learn</strong></h3><ul><li>Develop the critical thinking skills required for feature engineering</li><li>Feature engineering for an anti-money laundering algorithm</li></ul><h3>Introduction</h3><h4><strong>Feature Engineering? [1]</strong></h4><p>A feature is a numeric representation of raw data. In structured data, features are the independent variables on which the target variable depends. The features already present in a dataset are commonly known as data fields, and the ones created through domain knowledge are known as candidate variables or expert variables. This process of encoding information into the form of a new variable is known as feature engineering.</p><h4><strong>Why do we need more features?</strong></h4><p>A machine learning model’s performance is directly tied to how accurately the independent features capture the right information about the problem at hand. As a result, we should create as many variables as we can, so that we can later select the most important features for our model and thereby enhance its performance. However, creating new features is a tedious job and requires a good understanding of the problem along with some domain knowledge. In this article, I am going to work through an example to demonstrate how you can create various candidate variables for an <strong>anti-money laundering machine learning model</strong>.</p><p>We will start by understanding the problem and then apply that knowledge to perform feature engineering, in the series of steps described below.</p><h3><strong>STEP 1</strong></h3><h4><strong>Understanding the problem</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*cldVG-_d0EZpr4ga" /><figcaption>Photo by <a href="https://unsplash.com/@alschim?utm_source=medium&amp;utm_medium=referral">Alexander Schimmeck</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Money laundering is the illegal process of turning “dirty” money (money obtained from illegal businesses, such as selling drugs) into “clean” money (legitimate money), either through an obscure sequence of banking transfers or through commercial transactions.</p><p>The three broad stages of money laundering are:</p><p><strong>Placement — </strong>The stage at which the “dirty” money is placed into the legitimate financial system. The most common way of achieving this is smurfing, which involves spreading deposits small enough to stay below anti-money laundering reporting thresholds across multiple bank accounts, from which the money is later returned to the same sender.</p><p><strong>Layering </strong>— The second and most complex stage, which involves making the money as hard to detect as possible and moving it further away from the source.
The money is deliberately moved so quickly that banks cannot trace it.</p><p><strong>Integration </strong>— The final stage involves putting the “clean” money back into the economy. One of the most common ways is to buy a property in the name of a shell company, which makes the purchase look like a legitimate transaction.</p><p>Because of space constraints, I have given only this broad outline of the problem for demonstration. In practice, you should research the problem properly by reading research papers, patents, and more.</p><h3><strong>STEP 2</strong></h3><h4><strong>Break down the problem into smaller fragments for effective variable creation [2]</strong></h4><p>After researching the problem, you should write down all the insights you have developed. For instance, here are a few that follow directly from Step 1:</p><ul><li>Substantial increases in cash deposits of any individual or business without apparent cause</li><li>Deposits subsequently transferred within a short period out of the account and to a destination not normally associated with the customer</li><li>Accounts dominated by cash transactions rather than cheques or letters of credit</li><li>A large number of individuals making payments into the same account without an adequate explanation</li><li>Large cash withdrawals from a brand-new account, or from an account which has just received an unexpected large credit from abroad</li></ul><p>The better the understanding you develop of the problem, the more insights you will gather, and the better the features available to enhance the performance of your model. All the above insights should therefore be accounted for while creating the candidate variables, to inject more information into the model.</p><h3><strong>STEP 3</strong></h3><h4><strong>Understand the Dataset</strong></h4><p>To progress further, let’s assume we have a hypothetical dataset with the following data fields and their descriptions for developing an anti-money laundering model. In a real scenario, data scientists working for banks can easily get data with these fields. You can also take a look at <a href="https://www.kaggle.com/x09072993/aml-detection">the public dataset</a> available.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/708/1*xpDIOG6AGt82CKmBJ4BxKQ.png" /><figcaption>Image by author</figcaption></figure><h3><strong>STEP 4</strong></h3><h4><strong>Building Candidate Variables</strong></h4><p>Here comes the most interesting, most crucial, and most difficult part of any data science problem, aka feature engineering:</p><p><strong>4.1. Concatenate two or more different data fields to form a new categorical variable</strong></p><p>For our first set of variables, you can join two or more data fields together to make a new variable. To understand this, let’s make pairs of “Origin_acct”, “Destination_acct”, and the “Transaction_type” as shown in the table below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/926/1*wl7hlDHIaYKfuVBoyt0T2A.png" /><figcaption>Image by author</figcaption></figure><p>In the table above, you can see a new column named “Origin_acct-Destination_acct” containing the concatenated values of their respective data fields. The other columns are built in the same way, as in the sketch below.</p>
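<p>As a rough illustration, with toy account numbers in the spirit of the tables above, the concatenation is a one-liner in pandas:</p><pre>
import pandas as pd

# Toy transactions using the data fields discussed above.
df = pd.DataFrame({
    "Origin_acct":      ["4586524", "4586524", "9772354"],
    "Destination_acct": ["8251666", "8251666", "4586524"],
    "Transaction_type": ["CASH", "CHEQUE", "CASH"],
})

# Pairwise and three-way concatenations become new categorical features.
df["Origin_acct-Destination_acct"] = (
    df["Origin_acct"] + "-" + df["Destination_acct"]
)
df["Origin_acct-Destination_acct-Transaction_type"] = (
    df["Origin_acct-Destination_acct"] + "-" + df["Transaction_type"]
)
print(df)
</pre>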
<h4>Why?</h4><p>Concatenating “Origin_acct” with “Destination_acct” helps in policing the process of “smurfing”, where multiple intermediate accounts transfer small amounts to a single sender multiple times. Additionally, in one of the problems discussed earlier, it was observed that these criminals prefer cash transactions over instruments such as cheques and bills of exchange. Concatenating with the “Transaction_type” therefore gives our algorithm another dimension for learning about the nature of transactions, and lets us track whether the number of cash transactions for a particular account has increased (discussed in 4.2). Such activities are far from normal, and you will see how other numerical candidate variables (discussed later) linked to these concatenated fields help us move in the right direction.</p><p><strong>4.2. Frequency Candidate Variables [1]</strong></p><p>The frequency variables encode the number of transactions done by each entity (shown in the figure), capturing information such as an increase in the number of transactions for a particular pair of accounts, which could be a signal of suspicious activity. The figure below shows different combinations of frequency variables. For example, you can calculate the number of times the origin_acct was used on the same day (0), in the last 1 day, in the last 3 days, and so on; a sketch of the computation follows the tables below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/834/1*mYWR-fxwH29w_0eUNfPxHA.png" /><figcaption>Image by author</figcaption></figure><p>A high value over a given time period suggests something abnormal in the behaviour of that account. In the table below you can see what the frequency variables look like for one of the origin accounts:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/714/1*mBaeVEr6zGBqg7LT5Fp9jw.png" /><figcaption>Image by author</figcaption></figure><p>The column Origin_frequency_0 starts with 1 (assuming this is the first time the account is used for a transaction), and the number is 1 again on 05/02/2014 because that is the first time the account is seen that day. Similarly, you can deduce how the other numbers were calculated.</p>
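<p>Here is a minimal pandas sketch of the idea, assuming toy column names: each variable counts how often the same origin account appeared within a trailing time window that includes the current transaction.</p><pre>
import pandas as pd

# Toy transactions; rows must be sorted by account and date so the
# rolling counts align back to the frame positionally.
df = pd.DataFrame({
    "Origin_acct": ["A", "A", "A", "B"],
    "Date": pd.to_datetime(["2014-02-01", "2014-02-01",
                            "2014-02-05", "2014-02-05"]),
    "Amount": [50.0, 20.0, 98.4, 10.0],
}).sort_values(["Origin_acct", "Date"], ignore_index=True)

# One frequency variable per trailing window of 1, 3, and 7 days.
for days in (1, 3, 7):
    counts = (
        df.set_index("Date")
          .groupby("Origin_acct")["Amount"]
          .rolling(f"{days}D")
          .count()
    )
    df[f"Origin_frequency_{days}"] = counts.to_numpy()

print(df)
</pre>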
<p><strong>4.3. Amount Variables</strong></p><p>The amount variables capture the average, maximum, median, and total amount transacted from each account over the past 0, 1, 3, 7, 14, and 30 days (0 indicates the same day). They help in tracking the third stage, Integration, where a large sum of money is withdrawn from a bank account without any adequate reason, possibly to buy a property, and hence help the model identify abnormalities in transaction amounts. For instance, one column would contain the total amount transacted by a Destination account over the last 3 days. Similarly, other combinations can be formed as shown below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/813/1*eX58aioaE-tNNRjRAJOaKw.png" /><figcaption>Image by author</figcaption></figure><p>The table below shows a pair of amount variables:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/858/1*vP-JXfBPcEdq2ROU_Di0lQ.png" /><figcaption>Image by author</figcaption></figure><p>In the table above, the column “Origin_acct-total_Amount_3_days” contains the total amount transacted by Origin account #4586524 over the past 3 days. The total remains 98.4 in the last row because the account was not used in the previous 3 days. The other column is the amount transacted on the same day divided by the total amount over the last 3 days.</p><p><strong>4.4. Time-since Variables [1]</strong></p><p>These variables are very handy for encapsulating how fast transactions are taking place on an account. Each one measures the time between an account’s previous transaction and its current transaction. The faster the subsequent transactions for a single entity, the higher the probability of fraud, so these variables help in tracking the second stage, Layering. The following table shows an example of a time-since variable:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/515/1*BMwWBPdidd2KR7G1TwQ1dQ.png" /><figcaption>Image by author</figcaption></figure><p><strong>4.5. Velocity-change candidate Variables [1]</strong></p><p>This last set of variables tracks sudden changes in the normal behaviour of an account by calculating how the number of transactions, or the amount transferred, in the past day (0 &amp; 1 day) has changed relative to longer periods (7, 14, &amp; 30 days). The formula is as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/872/1*OX434fd87kzBML6XCtsNwA.png" /><figcaption>Image by author</figcaption></figure><p>Hence, if there is an unexpected change in the number of transactions or in the average amount for that account, our model will be able to learn that change; a sketch of one such ratio follows the example table below. The following table shows an example of velocity-change variables:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/752/1*B_bCs_ABBlfV1Sxz7FYOLA.png" /><figcaption>Image by author</figcaption></figure>
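<p>The exact formula is the one shown in the image above; as a hedged approximation, the sketch below computes one plausible velocity-change ratio, the share of an account’s 30-day transaction count that falls in the last day, using the same toy columns as before:</p><pre>
import pandas as pd

# Toy transactions, sorted so rolling results align positionally.
df = pd.DataFrame({
    "Origin_acct": ["A", "A", "A", "A"],
    "Date": pd.to_datetime(["2014-02-01", "2014-02-10",
                            "2014-02-28", "2014-02-28"]),
    "Amount": [50.0, 20.0, 98.4, 70.0],
}).sort_values(["Origin_acct", "Date"], ignore_index=True)

def trailing_count(frame, window):
    """Transaction count per account over a trailing time window."""
    return (
        frame.set_index("Date")
             .groupby("Origin_acct")["Amount"]
             .rolling(window)
             .count()
             .to_numpy()
    )

# Velocity change: share of the last 30 days' activity that falls
# within the last 1 day; a sudden burst pushes this ratio toward 1.
df["velocity_change_1_30"] = trailing_count(df, "1D") / trailing_count(df, "30D")
print(df)
</pre>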
<h3>Summary</h3><p>You saw how we were able to encode more and more information about the given problem through many candidate variables, using only the given data fields and no external data. To summarise, you learned the following variables and the information each one encodes:</p><p><strong>Concatenated Variables — </strong>Linked the origin account, destination account, and transaction type with each other, which helped in tracking smurfing and large cash withdrawals</p><p><strong>Frequency Variables — </strong>Captured how frequently an account is used</p><p><strong>Amount Variables — </strong>Captured the magnitude of transaction amounts</p><p><strong>Time-since Variables — </strong>Captured the speed of transactions</p><p><strong>Velocity-change Variables — </strong>Identified sudden changes in the behaviour of accounts</p><p>I know the problem discussed above seems specific to fraud detection models but, trust me, it will surely help you develop the critical thinking skills required for creating expert variables for any data science problem. I hope you found it helpful and worth reading. Cheers!</p><h3><strong>References</strong></h3><p>[1] Gao, J.X., Zhou, Z.R., Ai, J.S., Xia, B.X. and Coggeshall, S. (2019) Predicting Credit Card Transaction Fraud Using Machine Learning Algorithms. Journal of Intelligent Learning Systems and Applications, 11, 33–63. <a href="https://doi.org/10.4236/jilsa.2019.113003">https://doi.org/10.4236/jilsa.2019.113003</a></p><p>[2] Guideline on Combating Money Laundering and Terrorist Financing. <a href="https://www.imolin.org/doc/amlid/Trinidad&amp;Tobago_Guidlines%20on%20Combatting%20Money%20Laundering%20&amp;%20Terrorist%20Financing.pdf">https://www.imolin.org/doc/amlid/Trinidad&amp;Tobago_Guidlines%20on%20Combatting%20Money%20Laundering%20&amp;%20Terrorist%20Financing.pdf</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a47a876e654c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/the-art-of-engineering-features-for-a-strong-machine-learning-model-a47a876e654c">The Art of engineering features for a strong Machine learning model</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>