Analyzing SpaceX Launch Statistics With Apache Pinot and Streamlit

How to perform an exploratory data analysis with Apache Pinot and visualize the results with Streamlit, along with your favorite visualization libraries.

Dunith Danushka

Published in

Tributary Data

9 min readOct 18, 2021

Elon Musk has been busy making rocketry at SpaceX for making interplanetary travel a reality.

Meanwhile, we, mere mortals, can analyze his launch statistics to see how far he’s been successful at it.

This article will discuss how you can use Apache Pinot to do exploratory analysis on a SpaceX launch data set and visualize the results using Streamlit.

Overview

Assume that you’ve been a data analyst. You are given a SpaceX launch statistics data set and asked to produce the following metrics.

Breakdown of the landing outcome
No. of launches by the customer
Launches by launch site
Payload mass variation over the past ten years

The results should be put on an interactive dashboard to make the visual interpretation easier for the business users.

What would be your toolset?

Apache Pinot — for fast ad-hoc analysis

As an analyst, you will be spending much time in your career doing exploratory analysis on large data sets. SQL has been your lingua franca for writing expressive and powerful queries.

Your expectation is fast and accurate results. That’s where Apache Pinot comes into play. Pinot is a real-time distributed OLAP datastore, purpose-built for low-latency high throughput analytics.

We will ingest the data set into Pinot and write SQL queries to produce the above metrics in this example.

Streamlit — for beautiful data-driven applications.

Once you get the metrics, the next challenge is to visualize them for the interpretation of business stakeholders. So you need a dashboarding solution here.

Streamlit allows you to build data-driven web applications with Python. On top of that, you’ll benefit from a great assortment of data visualization libraries in Python, such as matplotlib, seaborn, and Altair.

We will build a streamlit application to query data in Pinot and visualize them.

The winning combination

Pinot allows exploratory data analysis with SQL, a massive win for data analysts. Along with that, Streamlit helps you build great visualizations without being fluent in a bunch of front-end frameworks.

Pinot and streamlit work declaratively, meaning that you express what you want to do rather than instructing how to do that.

Isn’t that the analyst’s dream?

Sometimes, it’s easier to think in SQL terms-Meme credits to the Internet

Before we begin

Let’s get these products set up in your local workstation first.

Pinot is inherently a distributed system made of different components. For simplicity, let’s spin up a local single-node Pinot cluster.

Here, I’m running a Pinot binary with the launcher scripts. To get started, type this in a terminal.

cd apache-pinot-$PINOT_VERSION-bin
bin/quick-start-batch.sh

If everything goes well, you should be able to access the Pinot query console using http://localhost:9000.

If you want learn more about Pinot, refer to my article below.

Rise of the Streaming Databases — Episode 2 : Apache Pinot

How Pinot solves the toughest problems in the data analytics today with its low-latency, high throughput query…

medium.com

Follow this guide to install streamlit locally. I’m on a macOS, so I will first use Pipenv to create a virtual environment and install streamlit.

As usual, impatient readers can find the source code for the completed project from here

Step 1: Ingest the launch data set

I’ve sourced this CSV data set from Kaggle notebook, where it shows several SpaceX missions along with some key dimensions like the launch date and time, customer, payload, orbit, and the landing outcome.

I changed the column headers to remove the white spaces and make them lower case. A sample would look like this:

>head spacex.csvlaunch_date,launch_time,booster_version,launch_site,payload,payload_mass_kg,orbit,customer,mission_outcome,landing_outcome04-06-2010,18:45:00,F9 v1.0  B0003,CCAFS LC-40,Dragon Spacecraft Qualification Unit,0,LEO,SpaceX,Success,Failure (parachute)08-12-2010,15:43:00,F9 v1.0  B0004,CCAFS LC-40,"Dragon demo flight C1, two CubeSats, barrel of Brouere cheese",0,LEO (ISS),NASA (COTS) NRO,Success,Failure (parachute)

Before the analysis, we have to load this CSV file into Pinot. Once ingested, Pinot will create segments out of raw data so that it can perform faster queries.

Create the launches schema

Before loading data, Pinot requires you to define a schema for the data set. A schema declares the metadata such as data type (string, integer, etc.) and the column type (dimension, metric, date-time) for each field in the CSV file.

The following shows the mapping of the fields in the data set into their corresponding Pinot data types.

Once you figure out the mapping, create the launches-schema.json file with the following content.

{
   "schemaName":"launches",
   "dimensionFieldSpecs":[
      {
         "name":"booster_version",
         "dataType":"STRING"
      },
      {
         "name":"launch_site",
         "dataType":"STRING"
      },
      {
         "name":"payload",
         "dataType":"STRING"
      },
      {
         "name":"payload_mass_kg",
         "dataType":"FLOAT"
      },
      {
         "name":"orbit",
         "dataType":"STRING"
      },
      {
         "name":"customer",
         "dataType":"STRING"
      },
      {
         "name":"mission_outcome",
         "dataType":"STRING"
      },
      {
         "name":"landing_outcome",
         "dataType":"STRING"
      }
   ],
   "metricFieldSpecs":[
      
   ],
   "dateTimeFieldSpecs":[
      {
         "name":"launch_date",
         "dataType":"STRING",
         "format":"1:DAYS:SIMPLE_DATE_FORMAT:dd-MM-yyyy",
         "granularity":"1:DAYS"
      },
      {
         "name":"launch_time",
         "dataType":"STRING",
         "format":"1:MINUTES:SIMPLE_DATE_FORMAT:HH:mm:ss",
         "granularity":"1:MINUTES"
      }
   ]
}

Notice that the launch_date and launch_time fields are marked as dateTimeFields with the necessary formatting to extract them from the input string.

Create the launches table

Once the schema is in place, let’s create a table definition. You can think of a table as a logical abstraction that represents a collection of related data. A schema determines the structure of a table. A table further breaks down into segments where each segment holds a subset of the table’s data.

Go ahead and create the launches-table.json file with the following content.

{
   "tableName":"launches",
   "segmentsConfig":{
      "timeColumnName":"launch_date",
      "timeType":"DAYS",
      "replication":"1",
      "schemaName":"launches"
   },
   "tableIndexConfig":{
      "invertedIndexColumns":[
         
      ],
      "loadMode":"MMAP"
   },
   "tenants":{
      "broker":"DefaultTenant",
      "server":"DefaultTenant"
   },
   "tableType":"OFFLINE",
   "metadata":{
      
   }
}

Note that we have set the tableType as OFFLINE to nominate this as an offline table where we build the segments beforehand and ask Pinot to ingest them for this table. That is called batch ingestion.

Execute this command in a terminal to create a table and a schema inside Pinot.

./pinot-admin.sh AddTable \
> -tableConfigFile /path/to/launches-table.json \
> -schemaFile /path/to/launches-schema.json -exec

That will result in an output like this at the end.

{"status":"Table launches_OFFLINE succesfully added"}

Create the batch ingestion job

Let’s create an ingestion job to ingest our CSV file into Pinot.

Create a file called batch-job-spec.yaml and add the following content.

executionFrameworkSpec:
  name: standalone
  segmentGenerationJobRunnerClassName:
  org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
  segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner
  segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner
jobType: SegmentCreationAndTarPush
inputDirURI: /path/to/rawdata/
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: /path/to/segments/
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: csv
  className: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader
  configClassName: org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig
tableSpec:
  tableName: launches
  schemaURI: 'http://localhost:9000/tables/launches/schema'
  tableConfigURI: 'http://localhost:9000/tables/launches'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'

The file above instructs Pinot where to find the raw data file and how to create segments. Make sure you replace the value of inputDirURI to point to the launch data set.

Run this command to commence the ingestion.

bin/pinot-admin.sh LaunchDataIngestionJob \
-jobSpecFile /path/to/batch-job-spec.yml

You should see the launches table inside the Pinot query console populated with data if everything goes well.

Launches table should show up in the Query Console

Step 2: Configure Streamlit

Now that we have our launch data set ingested into Pinot. Here comes the fun part, let’s create a streamlit application to connect to Pinot, issue queries, and visualize the results.

Go inside the directory where you want to keep the dashboard files and execute this to create a Python virtual environment and install streamlit there.

pipenv shell
pip install streamlit

Then, create the app.py file inside the directory with the following content.

import streamlit as st
import pandas as pd
import numpy as npst.title('SpaceX Launch Statistics')
st.markdown("Perform exploratory data analysis on SpaceX launch data set with Apache Pinot")

Verify the installation by:

streamlit run app.py

That should open a new browser tab and display the streamlit application in this URL:

http://localhost:8501/

Step 3: Install the pinotdb Python module.

Let’s try to query Pinot from our streamlit application. For that, we need the pinotdb module to be installed in the virtual environment.

pip install pinotdb

Step 4: Add metrics

Now that we have the data and dashboard application in place. Let’s start adding the metrics one by one.

4.1 Breakdown of landing outcomes

The data set has information about 101 launches in total. Out of that, we are interested to see a breakdown of different landing outcomes. For example, how many successful launches and failures, etc.

We can use the Plotly library to draw a pie chart to visualize this information as follows.

import plotly.express as px
import altair as alt
from pinotdb import connectconn = connect(host='localhost', port=8000, path='/query/sql', scheme='http')curs = conn.cursor()# Breakdown of the landing outcome
st.subheader('Breakdown of landing outcome')curs.execute("""
SELECT
 landing_outcome,
 count(landing_outcome) as frequency
FROM launches
GROUP BY landing_outcome
LIMIT 200
""")df = pd.DataFrame(curs, columns=[item[0] for item in curs.description])fig = px.pie(df, values='frequency', names='landing_outcome')
st.plotly_chart(fig, use_container_width=True)

Notice that Pinot’s broker port is set to 8000 as we are running Pinot in quick-start-batch mode. You can read the query result into a Pandas data frame and do further manipulations if needed. After generating a figure with Plotly, the data frame is visualized by streamlit.

It is clear that the majority of launches were successful.

4.2 Launches by customer

Curious to see the breakdown of launches by the customer?

Add the following code block.

# Launches by customer
st.subheader('Launches by customer')curs.execute("""
SELECT
 customer,
 count(customer) as total_launches
FROM launches
GROUP BY customer
ORDER BY total_launches DESC
""")df = pd.DataFrame(curs, columns=[item[0] for item in curs.description])chart = alt.Chart(df).mark_bar().encode(
 x='customer:N',
 y='total_launches:Q'
)st.altair_chart(chart,use_container_width=True)

That will result in the following bar chart.

Now we know that NASA has been the biggest paying customer of SpaceX.

4.3 Launches by launch site

SpaceX uses many launch sites for its missions. Let’s see how many sites have attributed to the launches along with the landing outcome.

# Let's calaculate launches per launch site
st.subheader('Launches per launch site')curs.execute("""
SELECT
 launch_site,
 landing_outcome,
 count(launch_site) as frequency
FROM launches
GROUP BY launch_site,landing_outcome
LIMIT 200
""")df = pd.DataFrame(curs, columns=[item[0] for item in curs.description])chart = alt.Chart(df).mark_bar().encode(
x='launch_site:N',
y='frequency:Q',
color='landing_outcome:N'
)st.altair_chart(chart,use_container_width=True)

The above renders a stacked bar chart where X and Y axes represent the launch site and the frequency. The color of each bar represents the landing outcome for a particular site.

4.4 Payload mass variation over the past 10 years

When the technology improves over time, SpaceX must’ve been able to carry more payload mass to orbit.

Let’s confirm the accuracy of that statement by plotting a line chart against the launch year and payload mass.

# Payload mass variation by yearst.subheader('Payload mass variation over the years')curs.execute("""
SELECT
 year(FromDateTime(launch_date,'dd-MM-yyyy')) as launch_year,
 payload_mass_kg
FROM launches
""")df = pd.DataFrame(curs, columns=[item[0] for item in curs.description])chart = alt.Chart(df).mark_line().encode(
x='launch_year:N',
y='sum(payload_mass_kg):Q'
)st.altair_chart(chart,use_container_width=True)

Notice that we’ve used a transformation function to extract out the year from the launch_date. Pinot supports a wide variety of data-time transformation functions OOTB.

The code results in this:

That confirms my statement on the payload capacity — it’s been growing since the first launch.

Let’s put it all together.

The completed app.py file looks like this:

Now your analysis is ready for sharing with a broader audience. Streamlit allows a cloud-based sharing option to deploy your application and share the dashboard URL with the audience.

Takeaways

Apache Pinot allows data analysts to perform ad-hoc querying on large data sets with low latency.
Pinot supports querying over SQL so that analysts can bring in their SQL expertise to Pinot without needing to learn anything new.
For those who want additional flexibility and power over packaged BI solutions, Streamlit is a good option.
Streamlit allows analysts to build data-driven web applications with Python. It is declarative and does not demand any UI skills from users.
Streamlit applications can query Pinot data with the pinotdb Python module and generate visualizations with popular visualization libraries.

Where next?

If you are interested in more visualization projects like this, please checkout these amazing posts by Kenny Bastani and Mark Needham.

Building a Climate Dashboard with Apache Pinot and Superset — Kenny Bastani
Analysing GitHub Events with Apache Pinot and Streamlit — Mark Needham

I stood on their shoulders to bring this article to you.

Visit here to learn more about Apache Pinot and try out Pinot recipes.