Introduction to SQL Using Python: Filtering Data with the WHERE Statement

Erika D
Analytics Vidhya
Published in
11 min readOct 27, 2019

Structured Query Language, more popularly known as SQL is a powerful Relational Database language. This language allows you to access and manipulate data stored in a SQL database. This tutorial will show you how to access the power of SQL using Python and how data can be filtered using the WHERE statement.

The first step is to import the necessary libraries. You will need to import sqlite3, this will allow you to connect to a SQL database and run SQL queries. You will also need to import pandas. Pandas will allow you to view the queried results in a clean, easy-to-read data frame format.

Import necessary libraries

For this tutorial, the Football Delphi database will be used. You can download the database here, https://www.kaggle.com/laudanum/footballdelphi. Take some time to read the description to learn more about the database and tables we will be using.

Next you will need to connect to the database and create a cursor object. Later we will call the cursors execute method to run SQL queries.

Connect to database and create cursor object

You should now be connected to the Football Delphi database. This is a relational database. A relational database includes tables. Each table is its own unique dataset that stores information. Each table is organized by columns and rows. This database includes four tables:

  • Unique_Teams
  • Matches
  • Teams
  • Teams_in_Matches

To run a SQL query and view it in an easy to view data frame format we will use the following code:

SQL query in Python format

For the first query, we will preview the contents of the table, Unique_Teams. To do this we can run the following query:

Unique_Teams table

The SELECT statement tells the database what columns we are trying to pull from the dataset. The asterisk, (*), tells the database that we want to select all the columns available in the table. The FROM statement tells the database that we want to select data from the table, Unique_Teams. The SQL query is ended by a semicolon, (;). This tells the database that you have ended your query similar to the way a period ends a sentence.

You can practice this on your own by trying to query all the contents of the Matches table and comparing your query to the answer below.

Matches table

This is the table we will use for the rest of the tutorial. There are 9 columns in the Matches table. The Kaggle page this database was downloaded from describes each column as follows:

  • Match_ID (int): unique ID per match
  • Div (str): identifies the division the match was played in (D1 = Bundesliga, D2 = Bundesliga 2, E0 = English Premier League)
  • Season (int): Season the match took place in (usually covering the period of August till May of the following year)
  • Date (str): Date of the match
  • HomeTeam (str): Name of the home team
  • AwayTeam (str): Name of the away team
  • FTHG (int) (Full Time Home Goals): Number of goals scored by the home team
  • FTAG (int) (Full Time Away Goals): Number of goals scored by the away team
  • FTR (str) (Full Time Result): 3-way result of the match (H = Home Win, D = Draw, A = Away Win)

Perhaps we only want to select the column, Match_ID. To do this we will just write Match_ID in our select statement instead of the asterisk. The code below shows how this is done:

View Match_ID

Try selecting only the column HomeTeam from the Matches table and compare your code to the query below:

SELECT HomeTeam From Matches;

You can select multiple columns by typing the name of each column in the select statement and separating each column name by a comma, (,). You can see the names of both the Home Teams and Away Teams in each match below:

SELECT HomeTeam, AwayTeam FROM Matches;

Practice selecting multiple columns by selecting Match_ID and Date from the Matches table and compare your query to the code below:

SELECT Match_ID, Date FROM Matches;

THE WHERE STATEMENT

There are 24,625 rows in the Matches table. This makes it difficult to find specific information. Perhaps we only wanted information from the 2015 Season. To do this, we can add a WHERE statement to our SQL query. The WHERE statement allows us to select information that meets a certain condition. To query the information only pertaining to the 2015 season from the Matches table, we can use the query below:

SELECT * FROM Matches WHERE Season = 2015;

Now we only have 992 rows and each row is from a 2015 season match. This allows us to see the data from the 2015 season much easier than scrolling through the entire table as queried before. Just as the SELECT statement allows us to return specified columns from the data table, the WHERE statement allows us to return specific rows from the dataset that meet a certain condition, like Season = 2015. To practice using the WHERE statement, write a query that selects all the columns from the Matches table where the Full Time Home Goals (FTHG) is equivalent to 5 and compare your query to the query below:

SELECT * FROM Matches WHERE FTHG =5;

So far we have only used numerical data in the WHERE statement. In order to use string data in the WHERE statement, we need to wrap quotation marks around the data we are using as a condition. To query all the data where Arsenal is the Home Team we can use the query below:

SELECT * FROM Matches WHERE HomeTeam = ‘Arsenal’;

To practice using string data as a condition in the WHERE statement, select all the columns that are in the D2 division and compare your query to the one below:

SELECT * FROM Matches WHERE Div = ‘D2’;

Comparison Operators

So far, we have only used the equal sign (=) as an operator in the WHERE statement. In addition to the equality sign you can use other operators including:

  • > (greater than)
  • < (less than)
  • >= (greater than or equal to)
  • <= (less than or equal to)
  • != (not equal to)

To see all the data from the Matches table where the Away Team scored 3 or more times, we can use the following query:

SELECT * FROM Matches WHERE FTAG ≥ 3;

To practice using conditional operators in the WHERE statement, query all the data from the Matches table, where the Home Team did not win.

SELECT * FROM Matches WHERE FTR != ‘H’;

AND & OR: Using Multiple Conditions in the WHERE Statement

So far we have only put one condition in the WHERE statement but it is possible to put more than one condition in the WHERE statement. To do this you can add an AND or OR clause into your SQL query to further filter your results. When using the AND clause in the WHERE statement, each statement on either side of the AND clause must be true for an observation to be returned. To see how this works, the query below will show all observations from the Matches table where Man United was the Home Team and the final outcome of the game was a draw.

SELECT * FROM Matches WHERE HomeTeam = ‘Man United’ AND FTR = ‘D’;

When using the OR clause in a WHERE statement, rows will only be included in the results when either the first condition is true or the second condition is true. If both conditions are true for an observation then that observation will also be returned. To see how the OR clause works, view the following query that shows all the rows from the Matches table that were in either the 2012 season or the 2013 season:

SELECT * FROM Matches WHERE Season = 2012 OR Season = 2013;

To practice using the AND clause in the WHERE statement, select all the rows from the Matches table where the Home Team scored more than 3 points and the Away team won the game and compare to the query below:

SELECT * FROM Matches WHERE FTHG > 3 AND FTR = ‘A’;

To practice using the OR clause in the WHERE statement, select the HomeTeam, AwayTeam, FTHG, FTAG, FTR from the Matches table where the Home Team scored more than 7 goals or the Away Team scored more than 7 goals and compare to the query below:

SELECT HomeTeam, AwayTeam, FTHG, FTAG, FTR FROM Matches WHERE FTHG > 7 OR FTAG > 7;

BETWEEN & NOT BETWEEN: Filter Rows that Fall Into a Range

Now, we will introduce two more clauses that can be included in the WHERE statement, BETWEEN and NOT BETWEEN. With a BETWEEN clause, you can filter your results based on the condition that the value in the column being filtered has to be between two values in order to be returned. For example, if we wanted to select all the matches from the 2012 season to all the matches from the 2015 season, we would use the following query:

SELECT * FROM Matches WHERE Season BETWEEN 2012 AND 2015;

The NOT BETWEEN clause works in the opposite way. If we were to write NOT BETWEEN in place of BETWEEN in the query above, it would only show the matches that were not included in the 2012–2015 season range. So the matches prior to the 2012 season would be included and the matches after the 2015 season would have been included in the returned result but none of the matches between the 2012 season and the 2015 season would have been included. To query the matches that have a Match_ID that is not between 25 and 46,750, we could use the following query:

SELECT * FROM Matches WHERE Match_ID NOT BETWEEN 25 AND 46750;

To practice on your own, select FTHG, FTAG, FTR from all the matches where the number of goals scored by the Home Team is between 7 and 9 AND the number of goals scored by the Away Team is not between 0 and 3. When you have done this, compare your query to the one below:

SELECT FTHG, FTAG, FTR FROM Matches WHERE (FTHG BETWEEN 7 AND 9) AND (FTAG NOT BETWEEN 0 AND 3);

IN( ): Selecting Values from a List

Sometimes you may only want to select data that is included in a certain list. This is when the IN operator comes into use. You can use the IN operator in the WHERE statement if you want to only return rows when the condition is included in a list. For example, the following query will return all the matches where Chelsea, Hull or Watford was the Home team:

SELECT * FROM Matches WHERE HomeTeam IN(‘Chelsea’, ‘Hull’, ‘Watford’);

Practice using the IN operator to select the AwayTeam from the Matches table when the Away team is either Liverpool, Man City or Swansea and compare your query to the one below:

SELECT AwayTeam FROM Matches WHERE AwayTeam IN(‘Liverpool’, ‘Man City’, ‘Swansea’);

LIKE Operator: Filtering String Data

Next, we will introduce the LIKE operator. This operator is useful when you want to filter string data. To see how it works, look at the query below. This query will return all the matches where the Home Team name begins with “A”.

SELECT * FROM Matches WHERE HomeTeam LIKE ‘A%’;

After the LIKE statement we wrapped in quotation marks, A%. The A tells us that the statement must begin with an “A”. The percent sign is used like a wild card. Anything after the A is not required to be any specific value. To see how this works view the following:

  • WHERE HomeTeam LIKE ‘Be%’ (Returns rows where HomeTeam begins with “Be”)
  • WHERE HomeTeam LIKE ‘%ty’ (Returns rows where HomeTeam ends with “ty”)
  • WHERE HomeTeam LIKE ‘%a%’ (Returns rows where HomeTeam includes “a”)

To practice using the LIKE statement to return all matches where the Away Team name includes the term “City” and compare to the following code:

SELECT * FROM Matches WHERE AwayTeam LIKE ‘%City%’;

You can also use the term NOT LIKE. If you wanted to return all the matches where the Date did not begin with 2017 you could use the following query:

SELECT * FROM Matches WHERE Date NOT LIKE ‘2017%’;

To practice on your own, select all columns from the Matches table where the Date does not begins with 2016 and the HomeTeam name ends with a “y” and the AwayTeam name does not includes an “e”. Compare your query to the one below:

SELECT * FROM Matches WHERE (Date NOT LIKE ‘2016%’) AND (HomeTeam Like ‘%y’) AND (AwayTeam NOT LIKE ‘%e%’);

IS NULL & IS NOT NULL: Dealing with Missing Values

Sometimes, data may be missing in some rows. These values are known as NULL data. You can check if any of the data is missing for a specific column using the clause IS NULL in the WHERE statement. To see if the outcome of any match is missing from the Matches table we can use the following query:

SELECT * FROM Matches WHERE FTR IS NULL;

We received a length mismatch error. This is because there are now observations where the outcome of a game is unknown. That is good. Practice using the IS NULL clause to check whether any of the Home Team Goal Count is missing from any of the observations in the Matches table and compare your query to the one below:

SELECT * FROM Matches WHERE FTHG IS NULL;

There should be no values missing. The statement IS NOT NULL works in the opposite way. This will return all values where a column does not have a NULL value. The following query shows all the observations from the Matches table where there is not a null value in the HomeTeam column.

SELECT * FROM Matches WHERE HomeTeam IS NOT NULL;

Practice using the NOT NULL clause by querying all the columns from the Matches table where Date does not have a null value and compare to the query below:

SELECT * FROM Matches WHERE Date IS NOT NULL;

An important thing to note when using IS NULL and IS NOT NULL, a value of zero is not equivalent to a NULL value. A NULL is a missing value and a value of 0 is not a missing value.

This is the end of this tutorial. We have covered how to connect to and query through a SQL database with python. We also discussed what the WHERE statement is and the following filtering techniques that can be used with the WHERE statement:

  • Comparison Operators
  • AND & OR
  • IN BETWEEN & NOT BETWEEN
  • IN
  • LIKE & NOT LIKE
  • IS NULL & IS NOT NULL

I encourage you to keep practice using these techniques to gain a deeper understanding of how to filter a SQL table using the WHERE statement.

--

--