Getting Started with SQL

Andrew Jamieson
Analytics Vidhya
Published in
8 min readNov 16, 2020

How to use basic commands

If Data Scientist really is the sexiest job in the world, SQL is the decidedly unsexy tool that you’ll use to do most of your work. Based on job descriptions, SQL is even more in demand than other popular programming languages like Python and Java.

TL,DR — go to the summary table of basic commands at the end

What? by Rhys Kentish on Unsplash

What is SQL

SQL is the most commonly used language for interacting with databases. Why? It is relatively easy to understand and allows us to access data directly where it is stored.

SQL can be pronounced in two different ways. You can say the letters individually “S”, “Q”, “L”, or like the word “Sequel”. Either way is fine, although be warned … you may come across some people that adamantly insist that their pronunciation is the correct one. Even the big tech companies seem to disagree on this: Microsoft and Oracle databases use “Sequel”, while IBM and most open source databases use “S.Q.L”

SQL was originally developed in the ’70s by IBM researchers and it was first called SEQUEL. However, an aircraft company already held the trademark, so they dropped the vowels and renamed it SQL. SQL is also sometimes referred to as Structured Query Language … but that’s even more of a mouthful.

There are different types of SQL Databases. Some of the most popular are Oracle Database, SQL Server, DB2, PostgreSQL, and MySQL. These all use SQL, but there are some minor variations between them.

More on the history here

What are data, databases, tables?

Data are a collection of facts relating to an object. Data could be numbers, words, measurements, or even an image, a sound, or a video.

A database is an organized collection of data. It is a way to store information so that it can be accessed later. There are many different types of databases. Some examples are relational, object-oriented, NoSQL, graph … In this blog, we’ll be discussing relational databases — where the data are stored in tables that are related to each other.

What’s a table? This is where data is grouped in rows and columns. If you are familiar with excel or Google sheets, tables are similar. All the data in a column must be the same type of data. Data types are strings (a sequence of characters), numbers, dates, and Boolean (which means true or false). There is more to it than that. For example, you might want to specify whether your data has a decimal or not (integer vs float), or its size (tiny vs medium vs big). Check out more here

Table (not a data table) by Abel Y Costa on Unsplash

Why do we use tables?

Storing data in tables allows us to store data efficiently and to access it easily.

Imagine that we wanted to track players over a basketball season. We could try to build one large table that not only contained every play during every game, but all the player information, and team information as well. This would quickly get very complicated. We’d have a lot of repeated information. Every time LeBron James made a play we’d also have to list his position, his height, his weight. And imagine if something changed. If a played grew taller, put on weight, or even renamed themselves, then we’d have to change our data many different times

If instead, we had a game table with statistics for each play, a team table with the season record and team roster, and a player table with name and individual statistics, it would be a lot easier to keep track of our data and to make changes.

How to use SQL

SQL is written in statements. The statement that we’ll be using is SELECT — this allows us to access and display data. SELECT statements are known as a query. SQL is not case sensitive but it is best practice to capitalize SQL keywords so that they stand out from other parts of the query.

We also want to make our code as readable as possible. This will often mean a new line for each keyword, or to prevent the code from getting too long. We also use indents and whitespace to make it clear that code relates to a keyword.

It is also a good practice to end SQL statements with a semicolon. The semicolon is the sign for the end of your statement. It is a requirement to use this in some SQL databases.

First Steps Photo by Christian Chen on Unsplash

First steps

To view all the data in a table we would write:

SELECT * FROM table;

SELECT is where we choose what columns we want to display. Here we separate the column names by commas, or in the above case we use “*” to display all the columns

FROM is where we choose the tables that have the columns that we want.

That’s it? With just three words we can see the whole table. SQL is easy.

Not so fast. This is fine if we only have one table, and it has just a handful of rows. In reality, data is always more complex … so there are a few more tricks we have to learn. There are some things in common though. With a few exceptions, every query will always have a SELECT and a FROM statement.

Some databases are huge — millions or even billions of rows. It would take a long time to retrieve all this data. Fortunately, if we only want to see what the data in the table looks like, we don’t have to retrieve everything. We can use the LIMIT command to return only the number of rows that we specify.

SELECT *
FROM table
LIMIT 10;

Maybe we’d like to see who were the highest-scoring players from a game. We can sort the data using the ORDER BY clause on the column that we would like to sort. To sort the highest scorers at the top, we’d follow the column name with DESC for descending. In our SQL query ORDER BY comes after we have applied filters — which we’ll learn about next.

Filtering by Marc Babin on Unsplash

Filtering our results

It’s one thing to sort the data, but what if we have something specific that we’re searching for — for example back to our basketball example we’d like to find all the players who scored more than 30 points in a game. To do that we can filter the data using the WHERE clause. WHERE gives us a subset of the data that meets the condition that we specify. We can use numeric filters like equal to, not equal to, greater than, less than; or non-numeric filters — LIKE, IN, NOT …

LIKE

LIKE allows us to filter on similar values, rather than exact values. What makes LIKE especially powerful is that we can use wildcards as a substitute for a character or series of characters. The wildcard “%” represents any sequence of characters. It can be applied before or after the values that we are filtering on. The wildcard “_” is a substitute for a single character. WE can also use ILIKE, which works in the same way as LIKE, but ignores the character’s case.

SELECT last_name, height
FROM players
WHERE last_name ILIKE ‘jam%’;

IN

IN allows us to filter on a list of values that we would like to include. In the example below

SELECT *
FROM games
WHERE month IN (‘May’, ‘June’);

NOT

We put NOT before a conditional statement to select rows where the statement is false. NOT is often used with LIKE and IN.

SELECT *
FROM games
WHERE month NOT IN (‘May’, ‘June’);

AND

AND is used to filter on rows that satisfy two or more conditions. AND makes the filter more restrictive because the data has to satisfy more conditions. In this way, it works in the opposite way to an English sentence. For example, if we wanted data from LeBron James and Anthony Davis we can’t apply a filter of:

WHERE last_name is ‘James’ AND ‘Davis’

because there are no players with the last name of both James and Davis.

SELECT team,
FROM games
WHERE month NOT IN (‘May’, ‘June’)
AND home_score <= 100;

OR

OR is used to filter on rows that meet any of the conditions. OR makes the filter more expansive. From the previous example to get both LeBron James and Anthony Davis we would use OR. We often use AND and OR together.

SELECT *
FROM players
WHERE team LIKE ‘%Lakers’
AND (last_name = ‘Davis’ OR last_name = ‘James');

BETWEEN

This is used to filter rows where the values in the column are within a specific range. While we could replicate the same thing using AND / OR, using BETWEEN is more intuitive. Note that BETWEEN is inclusive, it includes the values that we specify.

SELECT *
FROM games
WHERE date BETWEEN ‘2020–01–01’ AND ‘2020–04–01’;

NULL

We can use IS NULL or to identify rows that do not have data. This is a useful step in checking for missing data. Note we can’t use = NULL.

SELECT *
FROM players
WHERE last_name IS NULL;

That covers the basic steps of working with data in one table. Our next step is to learn to join tables, but that’s for another blog.

Summary by Aaron Burden on Unsplash

Summary

Keyword: Code Example (How to use)

SELECT: SELECT column1, column2 (Choose the columns you want to analyze)

FROM: FROM table (Choose the table that has the columns you want)

LIMIT: LIMIT 5 (Limits the rows of data return based number you specify)

ORDER BY: ORDER BY column1 (Sorts the data using column specified)

WHERE: WHERE Column1 > 3 (Filters the result to values greater than specified)

Logical Operators

LIKE: WHERE first_name LIKE ‘%Bron%’ (Returns rows when the column contains the specified text ‘Bron’)

IN: WHERE month IN (‘Aug’, ‘Sep’) (Specifies a list of values to include, in this case only the months of ‘Aug’ and ‘Sep’)

NOT: WHERE month NOT IN (‘Aug’, ‘Sep’) (NOT reverses conditionals)

AND: WHERE points > 20 AND rebounds > 10 (Filter on rows where all the specified conditions are true)

OR: WHERE points > 10 OR assists > 10 (Filter where at least one condition is true)

BETWEEN: WHERE minutes_played BETWEEN 20 AND 30 (Select values within a range)

IS NULL: WHERE last_name IS NULL (Filter on rows with no value in that column)

Resources

More on SQL here:

This blog only used basketball as an example. If you want to actually explore this here are some great resources:

--

--

Andrew Jamieson
Analytics Vidhya

Andrew has an analytics background with over 20 years of experience in various industries, working with world-leading brands.