Observing the Cosmos With SQL

Solving a really difficult problem in PostgreSQL

Karen Warmbein
The Startup
6 min readAug 16, 2020

--

Photo by Paxton Tomko on Unsplash

Cloudy nights, those nights I can’t observe the stars, are filled with reading books, watching movies, and solving SQL problems. One problem I came across this week was really difficult! I mean really difficult. I want to adapt it to an astronomy scenario and talk about the solution, here.

Problem Statement

So, what is this really difficult problem? Suppose we have two RDBMS tables, a table named observations_info and another named users_info.

  • The observations_info table holds all attempted observations at many observatories. The first five rows follow.
  • The users_info table holds all information about users (both astronomers and observatories). The first five rows follow.

And, some middle rows look like:

The task is to write a SQL query to find the cancellation rate (defined as when it is cloudy) of observing sessions as well as information about the canceled sessions.

This problem is difficult because there are a lot of pieces that need to be fit together… and work together. Also, I use new functionality that I’ve encountered, the FILTER clause.

The Data

Let’s talk about the data a little bit. Like I said before, the observations_info table holds all attempted observations at many major observatories by astronomers at a certain observatory. Here is an explanation of all the columns.

  • id— an INT. This is a unique key that identifies each record.
  • astronomer_id —an INT. This is a foreign key that matches a unique id in the users_info table. It is an id of the astronomer using the telescope.
  • observatory_id — an INT. This is a foreign key that matches a unique id in the users_info table. It is an id of the observatory with the telescope.
  • location_id — an INT. This is a unique key that identifies each location of the observatory.
  • weather_status—a TEXT. This field has clear if the weather was clear on observation night, or cloudy if it was not clear.
  • observation_date—a TEXT. This field represents the date of the observation.

Let’s look at the table, again:

The users_info table that we will be querying holds all information about users (both astronomers and observatories). Here is the column information:

  • id — an INT. This holds the unique id of each record in the table. It matches the observatory_id or the astronomer_id in the other table.
  • successful —a TEXT. This captures if the observing event was successful or not.
  • user_role— a TEXT. This field displays the role of each user — an astronomer or an observatory.

A quick view of the table, again:

The Solution

Ok… after all that information about the data, let me state the problem, again: The task is to write a SQL query to find the cancellation rate of observing sessions as well as information about the canceled sessions. My solution to this problem (in PostgreSQL) looks like this:

Woah, hold up! There’s a lot going on in that query. Let’s break-it-down and start with the fields I am going to display from the query:

  • the observation date,
  • the number of cloudy nights,
  • the total count of observation days with the same date,
  • the rate of cloudy days that caused a cancellation, and
  • the ids of the astronomer and observatory that didn’t have a successful observing session.

Here are the field names in the table:

I’ll get back to the details about how this information is produced, later.

Next, I want to consider the FROM component of the query:

This part of the query combines three tables, the observation_info table, and information from the users_info used by two tables. These are aliased as oi, a, and o, respectively.

Digging deeper into the above code, I am joining on foreign keys to get the extra information about the astronomer and the observatory. If I want to display the first record (Note: this is in expanded display):

Now, let’s break down the calculations in the SELECT statement:

This “counts” the cloudy days (by using a SUM aggregate function and aCASE statement).

and

These two pieces of code count all the dates (cloudy or clear) that have the same observation date.

In the following piece of code, I calculate the rate of cloudy days, rounded to three decimal places:

In this last piece of code, I create an array that displays the astronomer and observatory ids associated with the date.

I want to tell you about the FILTER clause, which is new to me. This clause provides a much simpler way to deal with aggregate functions in this case (instead of creating a subquery). The FILTER clause follows an aggregate function and filters the instances with the cloudy nights.

So, the result of this long query is three rows of information:

For the second record with an observation_date of 2017–06–30, there where two cloudy nights out of eight observations on that date. The cancellation rate is calculated as 0.250 (which is what we expect from the previous numbers). The ids of the astronomer and the observatory, respectively, are in an array and read 6, 11 for one cloudy observation and 5, 12 for another observation on that same date.

QED.

That’s it for this tutorial, folks! Another set of tools for your SQL (flavor: PostgreSQL) toolbox.

For your reference, I have a series of SQL tutorials that cover basic and intermediate queries. Take a look!

Like these SQL posts? Have a question? Let’s chat in the comments!

--

--