Solar System classifications with SQL

Another tutorial on intermediate SQL Queries

Karen Warmbein
The Startup
6 min readJun 4, 2020

--

Image Credit: Encyclopedia Britannica

What are the largest objects in the solar system? Well, it depends on how I define ‘largest’ — some people mean the largest as the most massive thing (like the Sun) or the object where the diameter is the biggest (again, the Sun). But, does the largest object need to be spherical? Do we count orbits of planets or Kuiper Belt objects in this idea of ‘largest’? What about the path of human-made satellites like Voyager?

Philosophy aside, I define ‘largest’ as the width, or the diameter, of a single object in the solar system. For example, the Sun is the largest object, not the Oort Cloud. In fact, the Sun is the first object listed in solar_system_20, a relational database table I built with the following columns:

  • ss_name — the unique name of the Solar System object
  • diameter— the equatorial width of the object (in miles)
  • classification — the type of object (like star or planet)
  • ss_location — the location of the object in the Solar System

Recall, this table has facts about the 20 largest objects in the Solar System. I used solar_sytem_20 in two previous tutorials, Touring the Solar System with SQL and Searching for Moons with SQL, to show hands-on examples of aggregate functions, subqueries, aliases, and the GROUP BY and HAVING commands in SQL.

I’m going to query the table again in this tutorial. Again in this post, I won’t show the entire table filled with data, but encourage you to read through this post to explore the table. This mimics a real-world scenario where you have a big dataset to query and you can’t possibly view all the records.

To get started, let’s remind ourselves what the first few records look like.

SELECT *
FROM solar_system_20
LIMIT 2;

Result:

The first two records displayed as rows in the table.

And, how many classifications are there?

SELECT classification, COUNT(*) 
FROM solar_system_20
GROUP BY classification;

Result:

There are four distinct classifications in this table.

Ok, there are four types of objects (from the column classification) in the table: moon, planet, dwarf planet, and star with 9, 8, 2, and 1 objects in each bin, respectively.

Distinct

Suppose we want to eliminate duplicate records from a set of results. For example, say that we want to obtain the classifications without the count of objects. Enter the DISTINCT keyword.

SELECT DISTINCT classification
FROM solar_system_20;

Result:

Four distinct types in the classification column.

Instead of counting the number of classifications on your hand (here we count 1… 2… 3… 4), SQL can count for you by combining SELECT DISTINCT with an aggregate function.

SELECT COUNT(DISTINCT classification)
FROM solar_system_20;
Again, four distinct types in the classification column.

If we don’t use DISTINCT:

SELECT COUNT(classification)
FROM solar_system_20;

Result:

What?!?!

which doesn’t make any sense.

Offset

Recall that, at the beginning of this post, we found the first two rows of the table. We can also do the reverse and view the last row of the table. Or, if we want to be more data-savvy and use our new SQL skills, we can find the smallest two objects in the table.

SELECT *
FROM solar_system_20
ORDER BY diameter
LIMIT 2;

Result:

Records for the smallest two objects in the table.

But what if we wanted to display one or multiple records for middle-range diameters? Enter OFFSET. Let’s see this in action: find the 12th, 13th, and 14th largest objects in solar_system_20.

SELECT *
FROM solar_system_20
ORDER BY diameter
LIMIT 3
OFFSET 11;

Result:

Records for middle-range diameters.

Table vs Expanded Displays

Before I discuss the last topic in this post, I need to prep you. The following images have a different view format. So far in this post, the SQL views have been in a table format, with all the information in single rows.

I’m going to switch to an extended view for query results in the next topic. Why? The extended display does not show all the information crowded into one line, but separates one record into multiple lines. This is especially important when text begins to wrap to the next line in a table display — it becomes difficult to read!

Read on to see examples of this different display.

Array Aggregate Function

The array aggregate function collects a set of elements and combines them into an array. I think the best way to describe this is with an example. From the examples above, we know that there are four classifications for 20 total objects. Let’s create a view that displays all the objects in each classification.

SELECT
classification,
COUNT(classification) AS ss_count,
ARRAY_AGG(ss_name) AS class_name,
ARRAY_AGG(diameter) AS class_diameter
FROM solar_system_20
GROUP BY classification;

Result:

The first expanded display.

What does this mean? The names of the objects in each classification are grouped together in an array. Same with the diameter of the objects. So, in the “moon” classification, there are nine objects with the names in an array and the nine diameters in an array.

ARRAY_AGG() is an aggregate function (like COUNT()) but the function argument can take more than one input. For example, let’s list the name of the object and each diameter together in the array.

SELECT
classification,
COUNT(classification) AS ss_count,
ARRAY_AGG((ss_name, diameter) ORDER BY ss_name) AS info
FROM solar_system_20
GROUP BY classification;

Result:

Another extended display.

In fact, we have all 20 objects from solar_system_20 listed in this view! Now we know the largest objects, their diameters, and their classifications in the Solar System.

This brings up the question: can we slice these arrays? Why yes we can! Just add parentheses around arr_agg() and add [lower_bound:upper_bound] after the right parenthesis.

SELECT
classification,
COUNT(classification) AS ss_count,
(ARRAY_AGG((ss_name,diameter) ORDER BY ss_name))[2:3] AS names FROM solar_system_20
GROUP BY classification;

Here, we are asking for the 2nd and 3rd objects listed in each classification’s array. The result?

The result of a complex but short query.

Notice that the star classification only has one object so the names column is NULL — meaning there aren’t 2 or 3 objects to list.

SELECT DISTINCT, OFFSET, and ARR_AGG() — more techniques to add to your SQL toolbox. Stay tuned for next week's post on using SQL with Python’s Pandas package!

For your reference, I have a series of SQL tutorials that cover basic and intermediate queries. Take a look!

Like these SQL posts? Have a question? Let’s chat in the comments!

--

--