Celestial Bodies Database: Data Preparation (Part 1)

Sri Hartini
7 min readSep 12, 2022

--

Photo by Bryan Goff on Unsplash

I need to create a celestial bodies database using PostgreSQL for this freeCodeCamp project. If the term “celestial bodies” is unfamiliar to you, it refers to the planets, stars, moons, and other natural things found in space. And, without a doubt, the first step in creating a database with multiple tables is data preparation.

In this article, I’ll demonstrate how I prepare data before beginning to write SQL queries. Every piece of information in the tables that I will build has been gathered from the internet, so any corrections to the information provided here are greatly appreciated.

Rules for celestial bodies database

Let’s look at the guidelines we need to follow for this project in order to build acceptable tables.

These are the requirements that we must fulfill while creating the tables:

  • The database should have at least 5 tables.
  • There must be a table of the galaxy (≥ 6 rows), star (≥ 6 rows), planet (≥ 12 rows), and moon (≥20 rows).
  • Each table should have at least 3 rows and 3 columns.
  • The galaxy, star, planet, and moon tables should each have at least 5 columns.
  • A primary key should be included in each table. The naming standard for primary key columns should be table_name_id. The moon table, for example, should include a primary key column called moon_id.
  • Each table should have a name column.
  • The INT data type must be used in at least two columns that are not the primary or foreign keys.
  • The NUMERIC data type must be used at least once.
  • The TEXT data type must be used at least once.
  • The BOOLEAN data type must be used in at least two columns.
  • At least 2 columns per table should not accept NULL values.
  • At least 1 column from each table should be required to be UNIQUE.
  • Each star should have a foreign key that references one of the rows in galaxy.
  • Each planet should have a foreign key that references one of the rows in star.
  • Each moon should have a foreign key that references one of the rows in planet.
  • Each foreign key column should have the same name as the column it is referencing.

Finding data sources

I was never interested in astronomy before this assignment. I didn’t know much about the names of stars, moons (except our moon), or even the name of the galaxy we live in. I’m sorry. However, I am quite familiar with the names of planets. So, the first question that came to my mind was, “What are things that I can classify as celestial bodies?” There are seven of them, as it turns out: stars, planets, satellites (our moon is Earth’s natural satellite), comets, asteroids, meteors and meteorites, and galaxies.

So, we can make seven tables about celestial bodies, but let’s keep things simple. You may recall from the rules that I am required to create a table of galaxies, stars, planets, and moons. They are four tables in total, and I require at least five. Let’s start with the four; perhaps there’s a way to figure out what’s needed in the other tables.

In my search for websites that can provide me with tables, I came across a website called “Little Astronomy,” which I found to be both interesting and useful. Here are the links I chose to start web scraping with:

When you click on the links, you’ll notice that the table of galaxy and star names contains a column called “Constellation,” which describes the constellation in which the galaxy is located or the group of stars to which the star belongs.

It inspires me to create its own table, the constellation table. However, there isn’t much information about constellations on this website other than their names, so I went to another website, IAU designated constellations from Wikipedia. Moreover, I find a planet table on this website, Planet from Wikipedia.

So, that’s all the information I have for now. We can always find additional sources as we work on the tables. Let’s get started on the first table right away.

We can begin at any table, but I prefer to begin with the moon table. Why? Because the rules state that the moon table should refer to the planet table, which refers to the star table, which refers to the galaxy table. So I believe I can begin with the first table in this chain. That’s what I think, but surely you can start building from any table.

Creating the first table: moon table

The web scraping codes provided below were inspired by this Medium article. I like its simplicity and detailed explanation of the article, so you might want to check it out as well.

Hmm, there should be at least five columns, not four (including moon id, which we will add later). Let’s add one more column.

I find this website helpful. It includes many additional columns, but I’ll only include one that I find interesting. I’ll go with “Diameter (km)”, then. Let us now retrieve another table.

But wait. While I could obtain all of the rows and columns, I only require the Satellite Name and Diameter (km) columns. If you look at the table on the website, you'll notice that some of the <tr> tags have eight <td> in them, while others only have seven. This is due to the fact that they combine rows in Planet (moons) column for moons from the same planet. So, be careful when writing the codes.

Now we just need to join those two tables together.

There are three moons that don’t appear on the df_moon's table. However, the row with indexes 166 and 167 appears to have a name that is similar which raises suspicions.

According to the website, Dysnomea is Eris’s moon, MK2 is Makemake’s moon, and Hi’iaka is Haumea’s moon. So, let’s check the moons of these planets from the df_moon table.

# First, we first check what planets are in the table
df_moon.planet.unique()
## Output
## array(['Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune', 'Orcus', 'Pluto', 'Salacia', 'Haumea', 'Quaoar', 'Varda', 'Gonggong', 'Eris'], dtype=object)

Planet Makemake doesn't exist in the df_moon table in the first place, so let's add it to the table with a little change in the moon's name.

We will now find out about the moon of Eris and Haumea that are listed in the df_moon and df_moon_diameter table.

df_moon[df_moon['planet'].isin(['Eris', 'Haumea'])]
df_moon_diameter[df_moon_diameter['name'].isin(['Dysnomea', 'Hi’iaka'])]

Since they actually refer to the same moon, we can change Dysnomea to Dysnomia, but what's wrong with Hi'iaka, the moon of Haumea?

print(df_moon.loc[161, 'name']) # Output: Hiʻiaka
print(df_moon_diameter.loc[43, 'name']) # Output : Hi’iaka

Great. They do differ in one character as well. Let’s give it the same name as it appears in the df_moon table.

df_moon_diameter.loc[41, 'name'] = 'Dysnomia' 
df_moon_diameter.loc[43, 'name'] = 'Hiʻiaka'

The diameter_in_km column of the moon_outer_join table would now be the only one that has null values.

If we tried to remove the rows with null values or use an inner join, the same table would be returned to us in this case. I’m fine with having multiple null values in a single column, but I’d rather not have too many of them. Additionally, as shown by the code below, if we use an inner join, we will only end up with 10 names for planets, whereas if we use an outer join, we can have 15 names. I made the decision to keep all of the names of the planets with moons. How can we then maintain the names of 15 planets while minimizing the number of null values?

In my opinion, the names of planets that don’t exist in planet_inner should be kept. Or, to put it another way, we will remove rows with null values if the planet name is not what we want to keep.

I think it’s safe now to create the next table, the planet table (see Part 2).

Later, using the planet table we created, we will replace the planet column in the df_moon table with the planet_id.

But before we move on, consider clapping, following, and subscribing to show your support for my writing. I’ll come back with more projects so we can all learn. See you in Part 2!

--

--