Data-driven astronomy with IBM Netezza

Discovering exoplanets by analyzing open data powered by Netezza Performance Server

7 min readSep 22, 2022

Motivation

The Kepler mission is the first mission dedicated to the search for exoplanets the size of Earth and smaller in the habitable zone of other stars¹. The search for exoplanets was based on watching for transits — events in which a planet passes in front of its star as seen from the point of view of the Earth. A maddened galaxy world has appeared to us, full of planets smaller than Earth and even planets orbiting two stars². Let’s take a closer look at the data collected during this mission using Netezza Performance Server (NPS)¹⁰. Doing research on thousands of objects requires an efficient way in discovering their properties.

Introduction

We will use the data provided in the NASA Exoplanets Archive¹. The data contains parameters for all targets observed on about a 100-square-degree patch of the sky near the constellation Cygnus. The data is split into quarters over 90 days because the Kepler telescope had to be rotated every 90 days in order for its solar panels to face the Sun.

We organized the necessary data in two files: stellar.parquet⁴ and exoplanets.parquet³ and stored them in a data lake. Amazon AWS S3⁷ was chosen for this purpose because it is perfect for storing large unstructured files. It guarantees high data availability and allows for trouble-free scaling of storage resources thanks to high data durability. The advantage of using Amazon AWS S3 with IBM NPS is the ease of connection, which is established by entering the appropriate credentials in the query.

Data analysis

Let’s take a closer look at the collected data. The first step is to allow working with external data. In order to do so, ensure that the settings enable_external_datasource and enable_parquet_optimization are set to true, see NPS documentation¹¹. Then we have to provide information on where the data is placed and how to authenticate to our data source. We will keep our data source under the name keplers3. The create external datasource command can be used to achieve that:

Now we can fetch the data to build the external table from the external data source keplers3. Let’s call it stellar. Here create external table command is needed:

Discovering the schema and sampling some data is the logical next step. While Netezza automatically discovers the schema, one may interactively look at it using show schema like this:

The stellar table contains a few more columns but we showed only those which we will need for further analysis.

Each target of observation has its own identification number called kepler id, as listed in the Kepler Input Catalog (KIC). The kepler id is unique to a target and there is only one kepler id per target. In our stellar table, the ID does not repeat, so we can be sure that there are no duplicate objects⁶. Knowing that, we can select spherical coordinates expressed as a pair of right ascension ra[deg] and declination dec[deg], without a distance coordinate and create our own map of observed targets based on that:

We used python driver for IBM Netezza⁵ to connect to the database, access the data obtained above in the Jupyter notebook and plot full Kepler’s Field of View:

Many exoplanets have been verified using the transit method and statistical validation. We took information about them from NASA Exoplanet Archive and put it into a separate table called exoplanets³.

To visualize the distribution of exoplanets, we will use the coordinates of the stars with respect to which they orbit⁶. So, we run following query:

Note the special meaning of NULL in the query above⁶. It can mean ‘not applicable’ while false positives do not receive planet names or ’not yet known’ and ‘value undefined’.

One of the most interesting pieces of information about transiting planets is their size. This measurement requires transit depth and the size of the star it is transiting⁶.

Then we placed new points on our observation map. The results obtained were plotted considering the radius prad of the exoplanets — the colored dots on the graph below represent confirmed exoplanets and the color intensity and size of the dots indicate how big the planets are compared to each other:

So now let’s analyze deeper exoplanet distribution in the radius–period plane. We need to join the radius computed above with the period for each confirmed exoplanet. Let’s do that it this way:

Querying a list of planets by their size is useful because it is a good indicator of what its surface looks like⁶. A giant planet the size of Jupiter is likely to be gaseous, while planets smaller than Earth are almost certainly going to be rocky. The Kepler mission showed that Earth-size planets are more common than we thought. The most common planets of all are the so-called Super-Earths or Mini-Neptunes. In the chart below you can see how many of them have a radius between Earth’s and Neptune’s radius.

Our next question is which Earth-sized planets are in the habitable zone of the host star⁶. A very important property of the exoplanet for accessing its habitability is the surface temperature. It determines whether liquid water could be found on the exoplanet surface. For this, we are going to calculate the equilibrium temperatures of palents — how much energy it receives from its star versus how much it radiates back into space⁸. Running one query

we calculated temperatures and showed them on the histogram below. It turned out that 363 of exoplanets have a temperature in the range of 273–373 K, which may indicate the presence of liquid water.

Now let’s ask ourselves another question: how many planets are in a multi-planet system? Planets sharing the same star will have the same kepid, but different koi_name values⁶. We can calculate that with the following query:

After executing the query above, we can see, that there are many systems with more than one exoplanet orbiting the star (yellow dots). Most of the systems have two planets, but there are also such examples with six or seven exoplanets.

One system turned out to be particularly interesting. 5 planets orbit around the red dwarf Kepler-186 (kepid = 8120608)⁶. The red dot on the graphs below shows where this system is located:

One of the exoplanets orbiting around Kepler-186-star is Kepler-186f with the radius of 1.11 times radius of Earth and it is the most distant of all five planets in the system¹². At the time of its discovery, it was the most Earth-like planet orbiting in the star’s habitable zone. The visualization below roughly shows the whole Kepler-186 system:

Conclusion

This article showed how much information can be read from the data collected during the Kepler mission. Working with such a large dataset was significantly simplified by using the functionality from IBM Netezza. This allowed for efficient execution of the necessary queries, downloading from the data lake only the columns and even only chunks of those columns that match our condtions. This was achieved by reading the statistics of each row group first. Additionaly, the query is executed in parallel on all SPUs which are available in NPS.

Appendix

All graphs were prepared in a Jupiter notebook using the matplotlib package. We gained access to the data returned by our query by configuring the database connection. Here is a very simple example of what it looks like:

Bibliography:

NASA Exoplanets Archive, https://exoplanetarchive.ipac.caltech.edu/
Kepler and K2 mission overview: https://www.nasa.gov/mission_pages/kepler/overview/index.html
Kepler Objects of Interest (KOI) Cumulative Table, DOI:10.26133/NEA4
Kepler Stellar Properties Table, DOI:10.26133/NEA6
Pure python driver for IBM Netezza: https://pypi.org/project/nzpy/
Murphy, T. and Murphy, S. Lecture 3 — Querying your data, Coursera: Data-driven Astronomy by The University of Sydney, https://www.coursera.org/learn/data-driven-astronomy
Amazon S3, https://aws.amazon.com/s3
Strauss, M., Writeup 3: Planetary Thermodynamics, Equilibrium Temperatures of Planets, https://www.astro.princeton.edu/~strauss/FRS113/writeup3/
Appache Parquet: https://parquet.apache.org/
IBM Netezza Performace Server: https://www.ibm.com/product/netezza
IBM SQL basics: https://www.ibm.com/docs/en/psfa/7.2.1?topic=documentation-netezza-sql-basics
NASA Ames Research Center, “NASA’s Kepler Discovers First Earth-Size Planet In The ‘Habitable Zone’ of Another Star”, Apr 17, 2014 https://www.nasa.gov/ames/kepler/nasas-kepler-discovers-first-earth-size-planet-in-the-habitable-zone-of-another-star

Acknowledgments:

“This research has made use of the NASA Exoplanet Archive, which is operated by the California Institute of Technology, under contract with the National Aeronautics and Space Administration under the Exoplanet Exploration Program.”

Co-author: Natalia Janik