Analytics Vidhya
Published in

Analytics Vidhya

SQL for Data Models — 1

Data modeling is a way to organize and join data together for the purpose of data analysis. Data modelling is what we use to organize information for multiple tables and how they relate to each other.
This helps tremendously in providing structure to the information in the system.
Usually a data model represents a business process and it can also help you understand a business process. A lot of times you’ll work with a business person in understanding the data and how it fits together.
But at the same time, that business person will learn a lot from the data modeler to better understand how their business actually works together by seeing the data and how it interacts with each other.
The thing to remember about data models is that it should always represent a real world problem as closely as possible.

Example:

Data Model

Queries

  1. Select Statement

The majority of what data scientists are doing with SQL is retrieving data. To be able to do that and get you started, the first statement to use is the SELECT statement. A basic SELECT statement, tell a database what table you want your data FROM, SELECT either all or particular columns from a table in a query, and limit the amount of data that is returned in a query. With the SELECT statement you’re going to specify two pieces of information, what you want to select and where you want it from.

Syntax:

SELECT column1, column2, …
FROM table_name

The way you create tables, update them and insert them is heavily dependent on the relational database management system you’re using.

Here we’re using SQLite.

2. Temporary Tables

Another option we have is to create a copy, essentially of another table or pull a subset from another table. We can create a whole table from this, or we can create what is called a temporary table.

First of all, the most important thing to know about these temporary tables is that these will be deleted when the current client session is terminated. That’s why they’re called temporary tables.

However, these are really good to use because they are lot faster than creating a real table. So if you have complex queries and you want to simplify it a bit by creating a subset and then joining to that subset and driving a new calculation from that, then temporary tables are a great option.

Syntax:
CREATE TEMPORARY TABLE new_tbl SELECT * FROM orig_tbl LIMIT 0;

The way you create tables, update them and insert them is heavily dependent on the relational database management system you’re using.

Here we’re using SQLite.

3. Filtering

So we’ve spoken about the basics of actually acquiring data from a table using select and from commands.

But that’s only part of the story, because most of the databases contain thousands or even millions of records. Often we don’t want to look at all that data.

I know I don’t.

In this section, we’re going to go over filtering with SQL.

Filtering is extremely important because it allows us to narrow the data we want to retrieve. Filtering is also used when you’re doing analysis to get very specific about the data you want to analyze as part of your model.

Filtering SQL is important to understand as there’s some huge benefits when we’re doing it directly with SQL instead of relying on the client application to do it.

First of all, when we filter our data down, it will often reduce the number of records we’re retrieving. Instead of just going and grabbing a whole table and pulling every column and row from it, we can get really specific about the data we want to obtain from that table. And subsequently, that reduces how much data we’re pulling in from the system. Reducing the amount of data you’re analyzing will, of course, speed up the query performance, which, in turn, will speed up our overall processing.

It also helps when we add the filtering at the database level because this reduces the strain on the client-side of the application which will also allow it to run better.

You have to select your columns and then choose which table you want the columns from. And then you add where along with your column name operator value. There is a couple different operators that you could use. You could use equals, not equals, greater than, less then, greater than or equal to, less than or equal to, between, and is null.

SELECT ProductName, UnitPrice from products where ProductName = ‘coffee’

The above example will return all coffee products along with their unit price. The number of records that we are going to get depends on how many ‘coffee’ products we have in the products table.

But maybe we want to look at products whose prices are greater than or equal to 75. So this time we’re going to look at the records where the unit price is greater than or equal to 75.

SELECT ProductName, UnitePrice from products where UnitPrice > = 75

This will return all of the records with unit price greater than or over a value of 75.

Another way we can filter is by looking for non-matches, maybe you don’t have just a list of one product that you want to go after, such as coffee, maybe you have a whole bunch of products. It would be easy just to say give me everything except blank.

So basically, we want to pull all records except ‘coffee’.

SELECT ProductName, UnitPrice from products WHERE ProductName <> ‘coffee’

Another cool thing that you can do is to filter for a range of values. This is a little bit different because it doesn’t have an operator. What it does use is between and, and.

SELECT ProductName, UnitPrice from products WHERE UnitPrice BETWEEN 15 AND 50

Another example is filtering for null value. In this example, what we’re doing is filtering for something that is null.

SELECT ProductName, UnitPrice from products WHERE ProductName IS NULL

Does the column representing the product names have some type of information for every record?

4. Math Operators

So far we managed to get the data, filter the data, and so forth, but as you’re probably aware, there’s so much more we can actually do with our data.

In this section, we’re going to use basic math calculations with our data. As previously explained, the more we can push down to the database and have the processing power at the database level, the better. We’re usually working with larger amounts of data, so we really want to reduce that strain on the client application. The closer we can do this to the source, the better we’ll be in the long run.

So, we’ll start by doing the simple ones, addition, subtraction, multiplication and division.

In this example, what I want to do is to get the total units on in order. And I want to have, that times by the units price to get the total order cost.

SELECT ProductName, UnitPrice, UnitsOnOrder, UnitsOnOrder * UnitPrice AS Total_Order_Cost from products

One important thing to note here, is to remember the order of operations, as you may have previously studied in math in the past.
The idea that things in parentheses are handled first, then powers are exponents, multiplication, division, addition, and subtraction.
In the United States, the popular mnemonic device, “Please excuse my dear Aunt Sally”, is often used.

So, for this example, I’m now just going to combine division with subtraction. In parentheses, I’m just going to put what operator I want to happen first. For this, I have my unit price minus my discount, divided by the quantity.

SELECT ProductID, Quantity, UnitPrice, Discount, (UnitPrice — Discount) / Quantity AS Total_Cost from OrderDetails

So, these are pretty simple examples, but again really powerful when you start to add these together.

5. Aggregate Functions

Are we having fun yet?

I know I am, but it’s going to get even more fun in this section because it’s time to really take a deep dive into data analysis by going over some of the aggregate functions found in SQL.

Aggregate functions provide various ways to summarize your data, which in turn really helps you to analyze your data and see what you have.
Use various aggregate functions including AVERAGE, COUNT, MIN, MAX, and SUM to summarize and analyze data, not to forget DISTINCT function and how it can be useful.

Aggregate functions are used for all sorts of things and they can be really helpful in finding the highest or lowest values, total number of records, average value, etc.
It really builds off of some of those math operators that are listed in the previous section.

Except they’re just pre-built in functions, so it makes it super easy to start to aggregate and summarize some of your data, a lot of times in descriptive statistics we are getting to know and understand our data.

In this example, we AVERAGE the values for UnitPrice as we are selecting them from the table products.

SELECT AVG(UnitPrice) AS avg_price FROM products

Moving on, the COUNT function is really helpful when we want to get an idea of the contents of a table. This is just helpful to understand how many records are in a table, or how many records contain information.

SELECT count (*) AS total_customers from customers

If you do a COUNT with a star and brackets, you’re going to count all the rows in a table, and this could be all the values or no values because this is just counting any row in a table.

You could also then count an individual column just by selecting count and then the column name. This will then count the rows for that specific column and would ignore the null values.

SELECT COUNT(CustomerID) AS total_customers FROM customers

Here we may get the same results or we may get different results if there are no values in the customer ID column.

Okay, another aggregate function we have is the MIN and MAX, again, you’re starting to understand your data, you just got a new table, you want to see what some of this data looks like.

It’s always great to get a range of your data, pulling something like what’s the minimum value, what’s the maximum value, starts to give you an idea of what are the distributions, and all of this can be done on the database.

SELECT MAX(UnitPrice) AS max_prod_price FROM products

SELECT MAX(UnitPrice) AS max_prod_price, MIN(UnitPrice) AS min_prod_price from products

Another thing to note is that the null values will be ignored with the MIN and MAX functions.

Finally, we have the SUM aggregate function, you can use this in a similar fashion as with the other aggregate functions.

For this example, I want to look at the unit price in units in stock, I want to get a total for that but I also want to add all of those together.

SELECT SUM(UnitPrice) AS total_prod_price FROM products

SELECT SUM(UnitPrice * UnitsInStock) AS total_price FROM products WHERE SupplierID = ‘23’

Now I have the total price for all of the products with the unit price and the units in stock.

Now you can see how with the math operators in some of these aggregate function you can really start to get a better understanding of your data, and now even start to do some analysis and enter your own data.

One more important thing to use with aggregate functions which is the word, distinct. If the word distinct isn’t specific in a statement, SQL will always assume you want all the data.

For example, you may have a customer who’s in a table multiple times. If you’re simply doing a count on your customer IDs and you don’t distinguish to count just the distinct customer IDs, you may be getting duplicate records in there, and this is really helpful to run queries where you’re counting distinct and to see where there even might be some potential duplicates in a column.

There are some things to keep in mind when using DISTINCT with our aggregate function of count. You can’t use DISTINCT on the COUNT function with the star.

Just keep this in mind, not only for when and where you can use DISTINCT, but also to think through when you want to use DISTINCT and when you don’t.

I hope this was helpful :-), if you like to understand more about processing data from multiple tables, please check here my next article.

For SQL sample code, you can check my github repository below:
LeenaKH123/SQL-Queries (github.com)

References:
1. Learn SQL Basics for Data Science | Coursera
2. SQL Tutorial (w3schools.com)

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store