Published in Nerd For Tech

SQL for Data Science is way more than SELECT *

Photo by Vincentiu Solomon on Unsplash

So you have finished your Data Science course: you have learned Python, statistics, linear algebra, machine learning, and deep learning; you have mastered the main libraries for exploring data; you practiced a lot with Kaggle datasets, built an impressive portfolio, and landed a Data Scientist job!
You are excited to arrive for your first week at work, open the .csv or .xlsx files with pandas, and start exploring the data! But, wait…
“Where are the .csv/.xlsx files???”

Nowadays, most companies store their data in structured or unstructured databases in the cloud (AWS, Google Cloud, Azure), so it is essential for a Data Scientist to master SQL and NoSQL in order to correctly extract information from those databases.
Unfortunately, everything your Data Science course taught about SQL was probably a simple query like the famous SELECT * FROM table! Well, it does work, but believe me: when you are working with petabytes of information, you won’t want to import two different tables just to merge them with pandas (in fact, depending on your PC configuration, you won’t be able to).

For this reason, I decided to write this article with some SQL tips I have learned during the past six months working as a Data Scientist.

1. Subqueries (a SELECT statement whose WHERE clause contains another SELECT statement)

Imagine you work at a pet shop and you want to know the name, age, and address of every customer who has a dog, but the customer information is in one table and the pet information is in another. You could pull both tables and merge them with Python, but you can do better than that! You can use a subquery to extract from table2 the names of all customers who have a dog, and then use those names as the parameter of a WHERE clause to extract the desired information, as in the example below:

SELECT column_name, column_age, column_address
FROM table1
WHERE column_name IN (SELECT column_name
                      FROM table2
                      WHERE column_pet_type = 'dog')
ORDER BY column_age DESC;

The inner SELECT extracts the names of all customers who have a dog, and those names become the parameters of the WHERE clause in the outer SELECT. Make sure the subquery returns a single column; otherwise the outer SELECT statement will not work.
This way, extracting the information from the database is faster than pulling everything out and then joining the tables yourself.
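To make the pattern concrete, here is a minimal runnable sketch using Python’s built-in sqlite3 driver, with an in-memory database and hypothetical rows that mirror the pet-shop example (the table and column names follow the query above):

```python
import sqlite3

# In-memory database with hypothetical data mirroring the pet-shop example
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (column_name TEXT, column_age INTEGER, column_address TEXT);
CREATE TABLE table2 (column_name TEXT, column_pet_type TEXT);
INSERT INTO table1 VALUES ('Alice', 30, '1 Oak St'),
                          ('Bob',   25, '2 Elm St'),
                          ('Carol', 40, '3 Pine St');
INSERT INTO table2 VALUES ('Alice', 'dog'), ('Bob', 'cat'), ('Carol', 'dog');
""")

# The subquery fetches the names of dog owners; the outer SELECT
# uses those names to filter the customer table, newest first by age.
rows = conn.execute("""
    SELECT column_name, column_age, column_address
    FROM table1
    WHERE column_name IN (SELECT column_name
                          FROM table2
                          WHERE column_pet_type = 'dog')
    ORDER BY column_age DESC
""").fetchall()
print(rows)  # [('Carol', 40, '3 Pine St'), ('Alice', 30, '1 Oak St')]
```

Bob is filtered out because he only owns a cat; the database does all the work, and only the two matching rows ever leave the server.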

2. Temporary Tables

A temporary table in SQL Server is a database table that exists temporarily on the server. It stores a subset of data from a normal table for a certain period of time. Let’s see an example:
Suppose you want the ID of every customer with a specific name and a specific age. Unfortunately, your table does not have the age, but it does contain the birth date. You can transform the birth date column into an age, store the result in a temporary table, and then use that temporary table in a new query.

SELECT column_id, column_name,
       DATEDIFF(hour, column_birth_date, GETDATE()) / 8766 AS Age
INTO #TemporaryTable
FROM table1;

SELECT column_id
FROM #TemporaryTable
WHERE column_name = 'Daniel'
  AND Age = 33;

In the first part of the query, I transform the column birth_date into Age using the DATEDIFF function (the difference in hours divided by 8,766, the approximate number of hours in a year) and store the result INTO #TemporaryTable. Note that column_name must also be selected into the temporary table, since the second query filters on it.
In the second part, I query the temporary table I have just created for the IDs that match a specific name and age.

3. Parameterized Queries

A parameterized query (also known as a prepared statement) is a means of pre-compiling a SQL statement so that all you need to supply are the “parameters” that need to be inserted into the statement for it to be executed. It’s commonly used to avoid SQL injection attacks.
SQL injection is a web security vulnerability that allows an attacker to interfere with the queries that an application makes to its database. It generally allows an attacker to view data that they are not normally able to retrieve. In many cases, an attacker can modify or delete this data, causing persistent changes to the application’s content or behavior.
In the example below, I pass “name” and “age” as parameters to my query. The %(parameter)s syntax is the paramstyle used by the pymssql driver for Microsoft SQL Server; for other databases, check the driver’s documentation for the proper syntax.

import pandas as pd
import pymssql

query = """SELECT column_name, column_address,
                  column_email, column_age
           FROM table1
           WHERE column_name = %(name)s
             AND column_age = %(age)s"""

conn = pymssql.connect(host, user, password, database)
dataframe = pd.read_sql(query, conn,
                        params=dict(name='Daniel', age=33))

# Or, if you execute the query using a cursor:
cur = conn.cursor()
cur.execute(query, dict(name='Daniel', age=33))
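If you don’t have a SQL Server instance at hand, the same pattern can be tried with the built-in sqlite3 driver, which uses the :name paramstyle instead of %(name)s (hypothetical data below). The key point is identical: the driver escapes the values, so user input never gets spliced into the SQL string:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (column_name TEXT, column_age INTEGER, column_email TEXT);
INSERT INTO table1 VALUES ('Daniel', 33, 'daniel@example.com'),
                          ('Eve',    33, 'eve@example.com');
""")

# sqlite3's paramstyle is :name; the values travel separately from the SQL,
# which is what defeats injection attempts like "Daniel' OR '1'='1".
query = """SELECT column_name, column_email
           FROM table1
           WHERE column_name = :name
             AND column_age = :age"""
rows = conn.execute(query, {"name": "Daniel", "age": 33}).fetchall()
print(rows)  # [('Daniel', 'daniel@example.com')]
```

Whatever string the user supplies for :name is compared as data, never executed as SQL, so the classic injection payloads simply match no rows.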


This article covers only a tiny part of what SQL can provide; if you want to improve your code’s performance, take a deep dive into SQL fundamentals.
I hope you enjoyed the article and learned something new along the way.
If so, be sure to let me know in the comments.




Daniel Rosa Mille
