Observing the Cosmos With SQL
Solving a really difficult problem in PostgreSQL
Cloudy nights, those nights I can’t observe the stars, are filled with reading books, watching movies, and solving SQL problems. One problem I came across this week was really difficult! I mean really difficult. I want to adapt it to an astronomy scenario and talk about the solution, here.
Problem Statement
So, what is this really difficult problem? Suppose we have two RDBMS tables, a table named observations_info
and another named users_info
.
- The
observations_info
table holds all attempted observations at many observatories. The first five rows follow.
- The
users_info
table holds all information about users (both astronomers and observatories). The first five rows follow.
And, some middle rows look like:
The task is to write a SQL query to find the cancellation rate (defined as when it is cloudy) of observing sessions as well as information about the canceled sessions.
This problem is difficult because there are a lot of pieces that need to be fit together… and work together. Also, I use new functionality that I’ve encountered, the FILTER
clause.
The Data
Let’s talk about the data a little bit. Like I said before, the observations_info
table holds all attempted observations at many major observatories by astronomers at a certain observatory. Here is an explanation of all the columns.
id
— anINT
. This is a unique key that identifies each record.astronomer_id
—anINT
. This is a foreign key that matches a unique id in theusers_info
table. It is an id of the astronomer using the telescope.observatory_id
— anINT
. This is a foreign key that matches a unique id in theusers_info
table. It is an id of the observatory with the telescope.location_id
— anINT
. This is a unique key that identifies each location of the observatory.weather_status
—aTEXT
. This field hasclear
if the weather was clear on observation night, orcloudy
if it was not clear.observation_date
—aTEXT
. This field represents the date of the observation.
Let’s look at the table, again:
The users_info
table that we will be querying holds all information about users (both astronomers and observatories). Here is the column information:
id
— anINT
. This holds the unique id of each record in the table. It matches theobservatory_id
or theastronomer_id
in the other table.successful
—aTEXT
. This captures if the observing event was successful or not.user_role
— aTEXT
. This field displays the role of each user — an astronomer or an observatory.
A quick view of the table, again:
The Solution
Ok… after all that information about the data, let me state the problem, again: The task is to write a SQL query to find the cancellation rate of observing sessions as well as information about the canceled sessions. My solution to this problem (in PostgreSQL) looks like this:
SELECT
oi.observation_date
, SUM( CASE WHEN oi.weather_status='cloudy' THEN 1 ELSE 0 END) AS cloudy
, COUNT(oi.observation_date) AS total
, ROUND( CAST( SUM(CASE WHEN oi.weather_status='cloudy' THEN 1 ELSE 0 END) AS NUMERIC) / COUNT(oi.observation_date), 3) AS cancellation_rate
, ARRAY_AGG(a.id || ', ' || o.id) FILTER (WHERE oi.weather_status='cloudy' ) AS idsFROM observation_info oi
INNER JOIN user_info a ON oi.astronomer_id = a.id
INNER JOIN user_info o ON oi.observatory_id = o.id
GROUP BY oi.observation_date;
Woah, hold up! There’s a lot going on in that query. Let’s break-it-down and start with the fields I am going to display from the query:
- the observation date,
- the number of cloudy nights,
- the total count of observation days with the same date,
- the rate of cloudy days that caused a cancellation, and
- the ids of the astronomer and observatory that didn’t have a successful observing session.
Here are the field names in the table:
I’ll get back to the details about how this information is produced, later.
Next, I want to consider the FROM
component of the query:
FROM observation_info oi
INNER JOIN user_info a ON oi.astronomer_id = a.id
INNER JOIN user_info o ON oi.observatory_id = o.id
This part of the query combines three tables, the observation_info
table, and information from the users_info
used by two tables. These are aliased as oi
, a
, and o
, respectively.
Digging deeper into the above code, I am joining on foreign keys to get the extra information about the astronomer and the observatory. If I want to display the first record (Note: this is in expanded display):
Now, let’s break down the calculations in the SELECT
statement:
SUM(CASE WHEN oi.weather_status='cloudy' THEN 1 ELSE 0 END) AS cloudy
This “counts” the cloudy days (by using a SUM
aggregate function and aCASE
statement).
COUNT(oi.observation_date)
and
GROUP BY oi.observation_date
These two pieces of code count all the dates (cloudy or clear) that have the same observation date.
In the following piece of code, I calculate the rate of cloudy days, rounded to three decimal places:
ROUND( CAST( SUM(CASE WHEN oi.weather_status='cloudy' THEN 1 ELSE 0 END) AS NUMERIC) / COUNT(oi.observation_date), 3) AS cancellation_rate
In this last piece of code, I create an array that displays the astronomer and observatory ids associated with the date.
ARRAY_AGG(a.id || ', ' || o.id) FILTER (WHERE oi.weather_status='cloudy' ) AS ids
I want to tell you about the FILTER
clause, which is new to me. This clause provides a much simpler way to deal with aggregate functions in this case (instead of creating a subquery). The FILTER
clause follows an aggregate function and filters the instances with the cloudy nights.
So, the result of this long query is three rows of information:
For the second record with an observation_date
of 2017–06–30, there where two cloudy nights out of eight observations on that date. The cancellation rate is calculated as 0.250 (which is what we expect from the previous numbers). The ids of the astronomer and the observatory, respectively, are in an array and read 6, 11 for one cloudy observation and 5, 12 for another observation on that same date.
QED.
That’s it for this tutorial, folks! Another set of tools for your SQL (flavor: PostgreSQL) toolbox.
For your reference, I have a series of SQL tutorials that cover basic and intermediate queries. Take a look!
- Black holes, planets, and SQL — The first tutorial that introduces SQL and covers
SELECT
,WHERE
, andLIMIT
. - Touring the Solar System with SQL — Which covers common aggregate functions and subqueries.
- Searching for moons with SQL — Here, I cover more complex queries with aggregate functions.
- Classifying black holes with SQL — A tutorial with a
CASE
function. - Joining Constellations with SQL — Where I introduce the idea of joining two tables.
- Solar System classifications with SQL — Here, I cover the array aggregate function, OFFSET, and the topic of Table vs Expanded Displays.
- Many (to many) relations among the stars — A tutorial that covers relationships in a database.
- Creating tables in SQL — Which covers two methods of creating tables.
- A window to the solar system — Covering window functions.
- LAG functionality on SQL Data and Finding seasonality in sales data — Tutorials that use the
LAG
andLEAD
functionality. - The Nth largest object — Here, I write about creating PostgreSQL functions.
Like these SQL posts? Have a question? Let’s chat in the comments!