Using Python To Analyze Historical UFO Data

Mr. Data Science
The Data Science Publication
10 min read · Jun 5, 2021



Throughout this article, we will analyze some data on UFO sightings. Recent press releases from the Pentagon have sparked new interest in the topic of UFOs/UAPs, so it is a trendy and interesting way to introduce some data science and data analytics concepts. However, we need to be realistic about what we can discover from publicly available datasets on this topic. These datasets usually consist of eyewitness accounts, so the data should be considered low quality from a scientific perspective. Science does not put as much faith in eyewitness accounts as the legal system does; reference [1] describes some of the research into the unreliability of eyewitness testimony. The observers are untrained, and in many cases, the events are not independently verified.

Background on recent UFO/UAP sightings:

In 2021 the Pentagon confirmed that leaked military videos showing unexplained flying objects were genuine. The videos and still photographs appear to show objects that can fly at speeds over 13,000 mph and can accelerate and change direction in ways that seem beyond the capabilities of the best human technology. These objects have no obvious propulsion systems, can accelerate to supersonic speeds without producing a sonic boom, and can travel both through the air and underwater. They also have the ability to disappear, not just visually but also from radar. These objects used to be called UFOs (unidentified flying objects); these days, they are often referred to as UAPs (unidentified aerial phenomena). Some possible explanations include:

  1. These objects are real and were built by humans.
  2. These objects are real but were not built by humans.
  3. They are unexplained atmospheric phenomena.
  4. They are explainable phenomena, but the observer was confused by what they saw.

On the topic of alien life and UFOs, the science-fiction writer Arthur C. Clarke said: “Two possibilities exist: Either we are alone in the universe, or we are not. Both are equally terrifying.” Points 1 and 2 above are both possible, but neither has been shown to be true. Points 3 and 4 are also possible. Our scientific understanding of the atmosphere is not complete; there are strange phenomena like ball lightning that seem mysterious but are completely natural, and perhaps other phenomena remain to be discovered and understood. The name change from UFO to UAP is perhaps an attempt to get away from the idea that the only possible explanation is alien spacecraft. Most people are not very familiar with the night sky; for example, every time the planet Venus becomes visible in the evening or morning sky, the number of ‘UFO’ reports increases. In the examples below, we’ll see that most UFO sightings occur between 5 pm and midnight, not during daytime hours. So it is possible that many UFO reports are simply due to people seeing things they can’t explain.

Analyzing UFO Data With Python

In order to follow along with this article, you will need to install the following Python libraries: pandas, geopandas, and matplotlib.

In the following sections, we will use UFO/UAP data to 1) explain why it is important to normalize data, 2) explore categorical distributions, and 3) go over some ideas for working with imperfect data.

Example 1 — The Importance Of Data Normalization

The data used in this example is available on Kaggle.

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
df1 = pd.read_csv('UFO/ufo_sighting_data.csv')

DtypeWarning: Columns (5,9) have mixed types. Specify dtype option on import or set low_memory=False.

Importing the data generates a warning about some of the columns in the dataset. This is not an error, just a warning, so we will ignore it. Let's look at the first few rows to see what the data looks like.

df1.head(2)
[table: first two rows of df1]

We can plot the USA data as a choropleth, giving each state a different color or color intensity to indicate the relative number of UFO sightings per state. If we just plot the data as it is, we will get potentially misleading results. I can predict that CA, TX, FL, and NY will probably have higher levels of UFO sightings, not because I have psychic powers but because these states have larger populations, and the assumption that more people means more sightings is probably true, at least to some extent. We will illustrate this problem and then use normalization to get a better picture.

usa_map = gpd.read_file('UFO/us_state/tl_2020_us_state.shp')

To convert the output from value_counts() to a dataframe we can use:

df_states = df1['state/province'].value_counts().rename_axis('state').reset_index(name='counts')
df_states.head(2)
[table: first two rows of df_states]

To match the state codes in the two dataframes, we need to convert the values in df_states to uppercase:

df_states['state'] = df_states['state'].str.upper()
df_states.head(2)
[table: first two rows of df_states with uppercase state codes]

Now we can visualize the UFO counts per state:

merged = usa_map.set_index('STUSPS').join(df_states.set_index('state'))
variable = 'counts'

fig, ax = plt.subplots(1, figsize=(10, 6))
plt.ylim(22, 51)
plt.xlim(-130, -65)
merged.plot(column=variable, cmap='Blues', linewidth=0.8, ax=ax, edgecolor='0.8');
[figure: choropleth of UFO sighting counts per state]

If we plot a choropleth of state population, will the visualization look the same as this map? I created a new CSV file using population estimates available online.

state_pop = pd.read_csv('UFO/state_population.csv')

Now that we have two dataframes, we can join them using the common features ‘state_code’ and ‘STUSPS’. This adds a new column to the dataframe with the population of each state.

merged_pop = usa_map.set_index('STUSPS').join(state_pop.set_index('state_code'))
variable = 'population'

fig, ax = plt.subplots(1, figsize=(10, 6))
plt.ylim(22, 51)
plt.xlim(-130, -65)
merged_pop.plot(column=variable, cmap='Blues', linewidth=0.8, ax=ax, edgecolor='0.8');
[figure: choropleth of state population]

The maps are quite similar and demonstrate that in many cases a larger population does mean more sightings.
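One way to go beyond eyeballing the two maps is to compute the correlation between sighting counts and population. Below is a minimal sketch with invented per-state figures; the real values would come from the df_states and state_pop dataframes built above.

```python
import pandas as pd

# Hypothetical per-state figures for illustration only; the real values
# would come from df_states['counts'] and state_pop['population'].
sightings = pd.Series({'CA': 8912, 'TX': 3447, 'WA': 3966, 'ME': 570, 'WY': 150})
population = pd.Series({'CA': 39500000, 'TX': 29100000, 'WA': 7600000,
                        'ME': 1340000, 'WY': 580000})

# Pearson correlation: a value near 1 means sightings scale with population
r = sightings.corr(population)
print(round(r, 2))
```

A strongly positive correlation supports the "more people means more sightings" assumption and motivates the per capita normalization that follows.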

Now we can normalize the counts: X * (count of UFO sightings) / (state population), where X is a scaling constant (1000 in the code below) used to avoid very small fractional values.

merged_1 = usa_map.set_index('STUSPS').join(df_states.set_index('state'))
merged_normalized = merged_1.set_index('NAME').join(state_pop.set_index('state'))
merged_normalized['normalized_count'] = (merged_normalized['counts']/merged_normalized['population'])*1000

Plotting the normalized (per capita) data:

variable = 'normalized_count'

fig, ax = plt.subplots(1, figsize=(10, 6))
plt.ylim(22, 51)
plt.xlim(-130, -65)
merged_normalized.plot(column=variable, cmap='Blues', linewidth=0.8, ax=ax, edgecolor='0.8');
[figure: choropleth of per capita UFO sighting counts]

This map is much more interesting. CA has an average number of UFO sightings per capita, while places like WA and ME have relatively high rates. If we hadn't normalized the data, this story would have remained hidden.

In the next example, we will look at data types, in particular categorical data.

Example 2 — Exploring Categorical Data Distributions

The dataset for this example is available on Kaggle. It is very similar to the first dataset but also includes a brief weather description.

df_weather = pd.read_csv('UFO/ufo_fullset.csv')
df_weather.head(2)
[table: first two rows of df_weather]

Data can be grouped into different types, for example:

  • Continuous data
  • Categorical data

In this dataset, examples of continuous data include latitude, longitude, and eventTime. This data has maximum and minimum values but can take any value within those limits. Examples of categorical data include weather, shape, and sighting. This data can only take discrete values; the values could be text or numerical.
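A quick way to make this distinction explicit in pandas is to mark a column as categorical. Here is a small sketch with invented values mirroring the dataset's columns:

```python
import pandas as pd

# Toy frame mirroring the dataset's column types (values invented)
df = pd.DataFrame({
    'latitude': [47.6, 34.0, 25.8],           # continuous: any value in a range
    'shape':    ['circle', 'disk', 'circle'],  # categorical: a fixed set of labels
})

# Telling pandas a column is categorical saves memory and documents intent
df['shape'] = df['shape'].astype('category')
print(df.dtypes)
print(df['shape'].cat.categories.tolist())  # ['circle', 'disk']
```

Marking a column as categorical is optional for the analysis below, but it makes the intended data type self-documenting.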

The way we visualize and understand how data is distributed differs depending on the data type. For categorical data, bar charts are common. Pandas also provides a very useful function for categorical data: value_counts().

Using the pandas value_counts function

We can demonstrate some uses of value_counts() by answering some questions:

Were UFO sightings associated with particular weather?

To answer this question, we can apply the value_counts function to the ‘weather’ column:

df_weather['weather'].value_counts()

clear            3206
mostly cloudy    3079
partly cloudy    2704
rain             2605
stormy           2162
fog              2123
snow             2121
Name: weather, dtype: int64

There is little difference between UFO sightings in clear weather and cloudy weather.

Are some UFO shapes more common?

df_weather['shape'].value_counts()

circle      6047
disk        5920
light       1699
square      1662
triangle    1062
sphere      1020
box          200
oval         199
pyramid      189
Name: shape, dtype: int64

Circles and disks are by far the most common. If you want fractions of the total rather than absolute counts, use the normalize=True parameter.

df_weather['shape'].value_counts(normalize=True)

circle      0.335982
disk        0.328925
light       0.094399
square      0.092344
triangle    0.059007
sphere      0.056673
box         0.011112
oval        0.011057
pyramid     0.010501
Name: shape, dtype: float64

So circles and disks together account for over 66% of all sightings.

What is the weather like when the different shapes are seen?

This requires combining the groupby and value_counts functions:

df_weather.groupby('shape')['weather'].value_counts()

shape     weather
box       partly cloudy      78
          rain               70
          clear              22
          mostly cloudy      21
          stormy              7
          snow                2
circle    clear            1103
          mostly cloudy    1061
          partly cloudy     823
          stormy            792
          rain              786
          snow              751
          fog               731
disk      clear            1100
          mostly cloudy    1029
          fog               789
          partly cloudy     787
          snow              753
          stormy            740
          rain              722
light     clear             298
          partly cloudy     281
          mostly cloudy     273
          rain              272
          stormy            199
          snow              190
          fog               186
oval      rain               72
          partly cloudy      60
          mostly cloudy      30
          clear              27
          snow                7
          stormy              3
pyramid   rain               73
          partly cloudy      70
          mostly cloudy      22
          clear              14
          stormy              7
          snow                3
sphere    mostly cloudy     181
          partly cloudy     173
          rain              166
          clear             164
          fog               127
          stormy            106
          snow              103
square    clear             302
          mostly cloudy     283
          partly cloudy     255
          rain              239
          snow              201
          stormy            199
          fog               183
triangle  rain              204
          mostly cloudy     179
          clear             176
          partly cloudy     176
          snow              111
          stormy            109
          fog               107
Name: weather, dtype: int64

Oval, pyramid and triangle shapes are interesting in that they are most common when it is raining.
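The long groupby output is easier to compare when each shape's counts are turned into within-shape proportions. One way to do that is pd.crosstab with normalize='index'; here is a sketch with invented observations standing in for df_weather's columns:

```python
import pandas as pd

# Invented observations standing in for df_weather's shape/weather columns
df = pd.DataFrame({
    'shape':   ['oval', 'oval', 'oval', 'disk', 'disk', 'disk'],
    'weather': ['rain', 'rain', 'clear', 'clear', 'clear', 'rain'],
})

# normalize='index' turns each row (shape) into proportions that sum to 1,
# so shapes with very different totals can be compared directly
table = pd.crosstab(df['shape'], df['weather'], normalize='index')
print(table.round(2))
```

With the real dataset, this would make the rain bias of oval, pyramid, and triangle sightings visible at a glance.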

The output of value_counts can be plotted directly:

df_weather.groupby('shape')['weather'].value_counts().plot(kind = 'bar');
[figure: bar chart of weather counts per shape]

We also saw in the first example how the output from a value_counts function could be converted to a dataframe and then merged with other dataframes. The documentation has more information on this useful function. The next example looks at a common issue in data science — dealing with less than perfect data.

Example 3 — Working With Imperfect Data.

Incomplete, corrupted, or misleading data is the default situation for data scientists much of the time. In this example, we’ll analyze the first UFO dataset further; this will require dealing with some problems. Sometimes the column of data you need is missing, but it may be possible to create a column based on the existing columns:

The dataframe contains a date in the format mm/dd/yyyy. We want to count the number of sightings per year and plot them. To do so, we will perform the following:

  • step 1 — Create a year column by extracting the year from the date
  • step 2 — Use the value_counts function
  • step 3 — Visualize these results

df1['date_documented'] = pd.to_datetime(df1['date_documented'])
df1['year'] = pd.DatetimeIndex(df1['date_documented']).year
df1['year'].value_counts(sort=False).plot(kind = 'barh');
[figure: horizontal bar chart of sightings per year]

In this case, the sort=False parameter stops value_counts from sorting by frequency; here that leaves the years in chronological order, although sorting the index explicitly is the more reliable way to guarantee it.
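If you want chronological order regardless of how the rows happen to be arranged, sort_index is more robust than sort=False. A small sketch with invented years:

```python
import pandas as pd

# Invented years; note they are not in chronological order
years = pd.Series([2003, 2001, 2003, 2002, 2003, 2002])

by_count = years.value_counts()               # most frequent first: 2003 on top
by_year = years.value_counts().sort_index()   # guaranteed chronological order

print(by_year)
```

sort_index sorts by the labels (the years) rather than the counts, so the plot's axis always runs from earliest to latest.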

We can also get the day of week from the date:

df1['day'] = df1['date_documented'].dt.dayofweek

then we can look at the distribution of sightings by day of the week:

df1['day'].value_counts().plot(kind='bar');
[figure: bar chart of sightings by day of week]

In this case, 0 = Monday. So in this dataset, UFO sightings are reported more often at the start of the week than at the weekend. This result might be telling us more about the people reporting the sightings than about the events themselves.
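For more readable axis labels, the integer day codes can be mapped to names before plotting. A sketch with invented day-of-week codes (the real ones would come from df1['day']):

```python
import pandas as pd

# Invented day-of-week codes as produced by .dt.dayofweek (0 = Monday)
days = pd.Series([0, 0, 1, 4, 6])

# Map the integer codes to readable labels before counting and plotting
names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
labeled = days.map(lambda d: names[d])
print(labeled.value_counts())
```

The same mapping applied to df1['day'] would give the bar chart named weekday labels instead of 0–6.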

Sometimes you don’t have to look very hard to find bad data; let’s look at the ‘length_of_encounter_seconds’ column.

column = df1['length_of_encounter_seconds']
max_value = column.max()
min_value = column.min()
mean_value = column.mean()
median_value = column.median()
print(max_value)
print(min_value)
print(mean_value)
print(median_value)
97836000.0
0.001
9017.225634092296
180.0

The maximum value of 97,836,000 seconds corresponds to about 3 years. This seems unlikely. The minimum value of 0.001 seconds also seems unlikely; how could an eyewitness measure with such accuracy? It's always good practice to take some time to look at the data and consider whether it seems reasonable.
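One common follow-up is to trim such implausible extremes before computing summary statistics. The percentile cutoffs below are a judgment call, not something dictated by the dataset; this is a sketch with invented durations:

```python
import pandas as pd

# Invented encounter durations in seconds, with two implausible extremes
durations = pd.Series([0.001, 30, 120, 180, 300, 600, 97836000.0])

# One common approach: keep values between the 1st and 99th percentiles.
# The cutoffs are arbitrary; domain knowledge should guide the choice.
low, high = durations.quantile(0.01), durations.quantile(0.99)
trimmed = durations[durations.between(low, high)]
print(trimmed.median())
```

Note how much the extremes distort the mean while barely moving the median, which is why the median is often the safer summary for data like this.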

The next issue is inconsistent data:

When trying to extract the hour from the Date_time column, this error was generated: ParserError: hour must be in 0..23: 10/11/2006 24:00

In the data, midnight is sometimes represented as 24 and sometimes as 00.

There are different ways to deal with this; we'll create a time column by extracting the hour part of the Date_time string, then replace '00' with '24' so midnight is always represented the same way.

df1['time'] = df1['Date_time'].str[-5:-3]
df1['time'] = df1['time'].replace({'00':'24'}, regex=True)
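If you need a real datetime column rather than a string slice, an alternative workaround (a sketch, not the approach used above) is to rewrite the offending '24:00' values before parsing:

```python
import pandas as pd

# Invented timestamps in the dataset's mm/dd/yyyy hh:mm format
raw = pd.Series(['10/11/2006 24:00', '10/11/2006 00:15', '10/11/2006 17:30'])

# pd.to_datetime rejects hour 24, so rewrite '24:00' as '00:00' first.
# (This treats 24:00 as midnight of the same calendar date; rolling it
# over to the next day would also be a defensible choice.)
fixed = raw.str.replace(' 24:00', ' 00:00', regex=False)
parsed = pd.to_datetime(fixed, format='%m/%d/%Y %H:%M')
print(parsed.dt.hour.tolist())  # [0, 0, 17]
```

A proper datetime column gives access to the whole .dt accessor (hour, dayofweek, and so on) without further string handling.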

Now we can visualize the distribution of sightings over the time of day.

df1['time'].value_counts().sort_index(ascending=True).plot(kind='bar');
[figure: bar chart of sightings by hour of day]

So a lot of UFOs are spotted between 5 pm (17) and midnight (24). One way to interpret this: most people are asleep between midnight and 8 am, so there are fewer sightings then; also, people are less familiar with the night sky, so there are more UFO sightings after 5 pm. In other words, many of the ‘lights’ and ‘flashes’ seen at night might be due to things like shooting stars (meteors), aircraft, or artificial satellites: explainable phenomena that are unfamiliar to the observer.

A Quick Overview Of What You’ve Learned

If you’ve made it this far, you should:

  • Know how to apply a function to columns in a dataset
  • Understand the importance of normalizing a dataset
  • Be able to investigate categorical variables using the value_counts function
  • Be aware of some of the issues associated with bad and missing data

I hope you’ll take the UFO analysis we did a step further and integrate it with more open-source data on the internet. If you have any feedback or suggestions for improving this article, we would love to hear from you!

References:

  1. Hurley, G., The Trouble with Eyewitness Identification Testimony in Criminal Cases, 06/01/2021, link

Connect With Mr. Data Science:

MrDataScience.com, GitHub, Medium
