Using Python To Analyze Historical UFO Data

Mr. Data Science
The Data Science Publication
10 min read · Jun 5, 2021



Throughout this article, we will analyze some data on UFO sightings. Recent press releases from the Pentagon have sparked new interest in the topic of UFOs/UAPs, so it is a trendy and interesting way to introduce some data science and data analytics concepts. However, we need to be realistic about what we can discover from publicly available datasets on this topic. These datasets usually consist of eyewitness accounts, so the data should be considered low quality from a scientific perspective. Science does not put as much faith in eyewitness accounts as the legal system does; reference [1] describes some of the research into the unreliability of eyewitness testimony. The observers are untrained, and in many cases, the events are not independently verified.

Background on recent UFO/UAP sightings:

In 2021 the Pentagon confirmed that leaked military videos showing unexplained flying objects were genuine. The videos and still photographs appear to show objects that can fly at speeds over 13,000 mph and can accelerate and change direction in ways that seem beyond the capabilities of the best human technology. These objects have no obvious propulsion systems, can accelerate to supersonic speeds without producing a sonic boom, and can travel both through the air and underwater. They also have the ability to disappear, not just visually but also from radar. These objects used to be called UFOs (unidentified flying objects); these days, they are often referred to as UAPs (unidentified aerial phenomena). Some possible explanations include:

  1. These objects are real and were built by humans.
  2. These objects are real but were not built by humans.
  3. They are unexplained atmospheric phenomena.
  4. They are explainable phenomena, but the observer was confused by what they saw.

On the topic of alien life and UFOs, the science-fiction writer Arthur C. Clarke said: “Two possibilities exist: Either we are alone in the universe, or we are not. Both are equally terrifying.” Points 1 and 2 above are both possible, but neither has been shown to be true. Points 3 and 4 are also possible. Our scientific understanding of the atmosphere is not complete; there are strange phenomena like ball lightning that seem mysterious but are completely natural, and perhaps other phenomena remain to be discovered and understood. The name change from UFO to UAP is perhaps an attempt to get away from the idea that the only possible explanation is alien spacecraft. Most people are not very familiar with the night sky; for example, every time the planet Venus becomes visible in the evening or morning sky, the number of ‘UFO’ reports increases. In the examples below, we’ll see that most UFO sightings occur between 5 pm and midnight, not during daytime hours. So it is possible that many UFO reports are simply due to people seeing things they can’t explain.

Analyzing UFO Data With Python

In order to follow along with this article, you will need to install the following Python libraries: pandas, geopandas, and matplotlib.

In the following sections, we will use UFO/UAP data to 1) explain why it is important to normalize data, 2) explore categorical distributions, and 3) go over some ideas for working with imperfect data.

Example 1 — The Importance Of Data Normalization

The data used in this example is available on Kaggle.

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
df1 = pd.read_csv('UFO/ufo_sighting_data.csv')

DtypeWarning: Columns (5,9) have mixed types. Specify dtype option on import or set low_memory=False.

Importing the data generates a warning about some of the columns in the dataset. This is not an error, just a warning, so we will ignore it. Let's look at the first few rows to see what the data looks like.

df1.head(2)
[table: first two rows of df1]

We can plot the USA data as a choropleth, giving each state a different color or color intensity to indicate the relative number of UFO sightings per state. If we just plot the data as it is, we will get potentially misleading results. I can predict that CA, TX, FL, and NY will probably have higher levels of UFO sightings, not because I have psychic powers but because these states have larger populations, and the assumption that more people means more sightings is probably true, at least to some extent. We will illustrate this problem and then use normalization to get a better picture.

usa_map = gpd.read_file('UFO/us_state/tl_2020_us_state.shp')

To convert the output from value_counts() to a dataframe we can use:

df_states = df1['state/province'].value_counts().rename_axis('state').reset_index(name='counts')
df_states.head(2)
[table: first two rows of df_states]

To match the state codes in the two dataframes, we need to convert the values in df_states to uppercase:

df_states['state'] = df_states['state'].str.upper()
df_states.head(2)
[table: first two rows of df_states with uppercase state codes]

Now we can visualize the UFO counts per state:

merged = usa_map.set_index('STUSPS').join(df_states.set_index('state'))
variable = 'counts'

fig, ax = plt.subplots(1, figsize=(10, 6))
plt.ylim(22, 51)
plt.xlim(-130, -65)
merged.plot(column=variable, cmap='Blues', linewidth=0.8, ax=ax, edgecolor='0.8');
[figure: choropleth of UFO sighting counts per state]

If we plot a choropleth of state population, will the visualization look the same as this map? I created a new CSV file using population estimates available online.

state_pop = pd.read_csv('UFO/state_population.csv')

Now that we have two dataframes, we can join them using the common features ‘state_code’ and ‘STUSPS’. This adds a new column to the dataframe with the population of each state.

merged_pop = usa_map.set_index('STUSPS').join(state_pop.set_index('state_code'))
variable = 'population'

fig, ax = plt.subplots(1, figsize=(10, 6))
plt.ylim(22, 51)
plt.xlim(-130, -65)
merged_pop.plot(column=variable, cmap='Blues', linewidth=0.8, ax=ax, edgecolor='0.8');
[figure: choropleth of state population]

The maps are quite similar and demonstrate that in many cases a larger population does mean more sightings.
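One way to go beyond eyeballing the two maps is to compute the correlation between sighting counts and population. Below is a minimal sketch with invented per-state figures; the real values would come from the df_states and state_pop dataframes built above.

```python
import pandas as pd

# Hypothetical per-state figures for illustration only; the real values
# would come from df_states['counts'] and state_pop['population'].
sightings = pd.Series({'CA': 8912, 'TX': 3447, 'WA': 3966, 'ME': 570, 'WY': 150})
population = pd.Series({'CA': 39500000, 'TX': 29100000, 'WA': 7600000,
                        'ME': 1340000, 'WY': 580000})

# Pearson correlation: a value near 1 means sightings scale with population
r = sightings.corr(population)
print(round(r, 2))
```

A strongly positive correlation supports the "more people means more sightings" assumption and motivates the per capita normalization that follows.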

Now we can normalize the counts: X * (count of UFO sightings) / (state population), where X is a scaling constant (1000 in the code below) used to avoid very small fractional values.

merged_1 = usa_map.set_index('STUSPS').join(df_states.set_index('state'))
merged_normalized = merged_1.set_index('NAME').join(state_pop.set_index('state'))
merged_normalized['normalized_count'] = (merged_normalized['counts']/merged_normalized['population'])*1000

Plotting the normalized (per capita) data:

variable = 'normalized_count'

fig, ax = plt.subplots(1, figsize=(10, 6))
plt.ylim(22, 51)
plt.xlim(-130, -65)
merged_normalized.plot(column=variable, cmap='Blues', linewidth=0.8, ax=ax, edgecolor='0.8');
[figure: choropleth of per capita UFO sighting counts]

This map is much more interesting. CA has an average number of UFO sightings per capita, while places like WA and ME have relatively high rates. If we hadn't normalized the data, this story would have remained hidden.

In the next example, we will look at data types, in particular categorical data.

Example 2 — Exploring Categorical Data Distributions

The dataset for this example is available on Kaggle. It is very similar to the first dataset but also includes a brief weather description.

df_weather = pd.read_csv('UFO/ufo_fullset.csv')
df_weather.head(2)
[table: first two rows of df_weather]

Data can be grouped into different types, for example:

  • Continuous data
  • Categorical data

In this dataset, examples of continuous data include latitude, longitude, and eventTime. This data has maximum and minimum values but can take any value within those limits. Examples of categorical data include weather, shape, and sighting. This data can only take discrete values; the values could be text or numerical.
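A quick way to make this distinction explicit in pandas is to mark a column as categorical. Here is a small sketch with invented values mirroring the dataset's columns:

```python
import pandas as pd

# Toy frame mirroring the dataset's column types (values invented)
df = pd.DataFrame({
    'latitude': [47.6, 34.0, 25.8],           # continuous: any value in a range
    'shape':    ['circle', 'disk', 'circle'],  # categorical: a fixed set of labels
})

# Telling pandas a column is categorical saves memory and documents intent
df['shape'] = df['shape'].astype('category')
print(df.dtypes)
print(df['shape'].cat.categories.tolist())  # ['circle', 'disk']
```

Marking a column as categorical is optional for the analysis below, but it makes the intended data type self-documenting.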

The way we visualize and understand how data is distributed differs depending on the data type. For categorical data, bar charts are common. Pandas also provides a very useful function for categorical data: value_counts().

Using the pandas value_counts function

We can demonstrate some uses of value_counts() by answering some questions:

Were UFO sightings associated with particular weather?

To answer this question, we can apply the value_counts function to the ‘weather’ column:

df_weather['weather'].value_counts()

clear            3206
mostly cloudy    3079
partly cloudy    2704
rain             2605
stormy           2162
fog              2123
snow             2121
Name: weather, dtype: int64

There is little difference between UFO sightings in clear weather and cloudy weather.

Are some UFO shapes more common?

df_weather['shape'].value_counts()

circle      6047
disk        5920
light       1699
square      1662
triangle    1062
sphere      1020
box          200
oval         199
pyramid      189
Name: shape, dtype: int64

Circles and disks are by far the most common. If you want fractions of the total rather than absolute counts, use the normalize=True parameter.

df_weather['shape'].value_counts(normalize=True)

circle      0.335982
disk        0.328925
light       0.094399
square      0.092344
triangle    0.059007
sphere      0.056673
box         0.011112
oval        0.011057
pyramid     0.010501
Name: shape, dtype: float64

So circles and disks together account for over 66% of all sightings.

What is the weather like when the different shapes are seen?

This requires combining the groupby and value_counts functions:

df_weather.groupby('shape')['weather'].value_counts()

shape     weather
box       partly cloudy      78
          rain               70
          clear              22
          mostly cloudy      21
          stormy              7
          snow                2
circle    clear            1103
          mostly cloudy    1061
          partly cloudy     823
          stormy            792
          rain              786
          snow              751
          fog               731
disk      clear            1100
          mostly cloudy    1029
          fog               789
          partly cloudy     787
          snow              753
          stormy            740
          rain              722
light     clear             298
          partly cloudy     281
          mostly cloudy     273
          rain              272
          stormy            199
          snow              190
          fog               186
oval      rain               72
          partly cloudy      60
          mostly cloudy      30
          clear              27
          snow                7
          stormy              3
pyramid   rain               73
          partly cloudy      70
          mostly cloudy      22
          clear              14
          stormy              7
          snow                3
sphere    mostly cloudy     181
          partly cloudy     173
          rain              166
          clear             164
          fog               127
          stormy            106
          snow              103
square    clear             302
          mostly cloudy     283
          partly cloudy     255
          rain              239
          snow              201
          stormy            199
          fog               183
triangle  rain              204
          mostly cloudy     179
          clear             176
          partly cloudy     176
          snow              111
          stormy            109
          fog               107
Name: weather, dtype: int64

Oval, pyramid and triangle shapes are interesting in that they are most common when it is raining.
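The long groupby output is easier to compare when each shape's counts are turned into within-shape proportions. One way to do that is pd.crosstab with normalize='index'; here is a sketch with invented observations standing in for df_weather's columns:

```python
import pandas as pd

# Invented observations standing in for df_weather's shape/weather columns
df = pd.DataFrame({
    'shape':   ['oval', 'oval', 'oval', 'disk', 'disk', 'disk'],
    'weather': ['rain', 'rain', 'clear', 'clear', 'clear', 'rain'],
})

# normalize='index' turns each row (shape) into proportions that sum to 1,
# so shapes with very different totals can be compared directly
table = pd.crosstab(df['shape'], df['weather'], normalize='index')
print(table.round(2))
```

With the real dataset, this would make the rain bias of oval, pyramid, and triangle sightings visible at a glance.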

The output of value_counts can be plotted directly:

df_weather.groupby('shape')['weather'].value_counts().plot(kind = 'bar');
[figure: bar chart of weather counts per shape]

We also saw in the first example how the output from a value_counts function could be converted to a dataframe and then merged with other dataframes. The documentation has more information on this useful function. The next example looks at a common issue in data science — dealing with less than perfect data.

Example 3 — Working With Imperfect Data.

Incomplete, corrupted, or misleading data is the default situation for data scientists much of the time. In this example, we’ll analyze the first UFO dataset further; this will require dealing with some problems. Sometimes the column of data you need is missing, but it may be possible to create a column based on the existing columns:

The dataframe contains a date in the format mm/dd/yyyy. We want to count the number of sightings per year and plot them. To do so, we will perform the following:

  • step 1 — Create a year column by extracting the year from the date
  • step 2 — Use the value_counts function
  • step 3 — Visualize these results

df1['date_documented'] = pd.to_datetime(df1['date_documented'])
df1['year'] = pd.DatetimeIndex(df1['date_documented']).year
df1['year'].value_counts(sort=False).plot(kind = 'barh');
[figure: horizontal bar chart of sightings per year]

In this case, the sort=False parameter stops value_counts from sorting by frequency; here that leaves the years in chronological order, although sorting the index explicitly is the more reliable way to guarantee it.
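If you want chronological order regardless of how the rows happen to be arranged, sort_index is more robust than sort=False. A small sketch with invented years:

```python
import pandas as pd

# Invented years; note they are not in chronological order
years = pd.Series([2003, 2001, 2003, 2002, 2003, 2002])

by_count = years.value_counts()               # most frequent first: 2003 on top
by_year = years.value_counts().sort_index()   # guaranteed chronological order

print(by_year)
```

sort_index sorts by the labels (the years) rather than the counts, so the plot's axis always runs from earliest to latest.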

We can also get the day of week from the date:

df1['day'] = df1['date_documented'].dt.dayofweek

then we can look at the distribution of sightings by day of the week:

df1['day'].value_counts().plot(kind='bar');
[figure: bar chart of sightings by day of week]

In this case, 0 = Monday. So in this dataset, UFO sightings are reported more often at the start of the week than at the weekend. This result might be telling us more about the people reporting the sightings than about the events themselves.
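For more readable axis labels, the integer day codes can be mapped to names before plotting. A sketch with invented day-of-week codes (the real ones would come from df1['day']):

```python
import pandas as pd

# Invented day-of-week codes as produced by .dt.dayofweek (0 = Monday)
days = pd.Series([0, 0, 1, 4, 6])

# Map the integer codes to readable labels before counting and plotting
names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
labeled = days.map(lambda d: names[d])
print(labeled.value_counts())
```

The same mapping applied to df1['day'] would give the bar chart named weekday labels instead of 0–6.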

Sometimes you don’t have to look very hard to find bad data; let’s look at the ‘length_of_encounter_seconds’ column.

column = df1['length_of_encounter_seconds']
max_value = column.max()
min_value = column.min()
mean_value = column.mean()
median_value = column.median()
print(max_value)
print(min_value)
print(mean_value)
print(median_value)
97836000.0
0.001
9017.225634092296
180.0

The maximum value of 97,836,000 seconds corresponds to about 3 years. This seems unlikely. The minimum value of 0.001 seconds also seems unlikely; how could an eyewitness measure with such accuracy? It's always good practice to take some time to look at the data and consider whether it seems reasonable.
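One common follow-up is to trim such implausible extremes before computing summary statistics. The percentile cutoffs below are a judgment call, not something dictated by the dataset; this is a sketch with invented durations:

```python
import pandas as pd

# Invented encounter durations in seconds, with two implausible extremes
durations = pd.Series([0.001, 30, 120, 180, 300, 600, 97836000.0])

# One common approach: keep values between the 1st and 99th percentiles.
# The cutoffs are arbitrary; domain knowledge should guide the choice.
low, high = durations.quantile(0.01), durations.quantile(0.99)
trimmed = durations[durations.between(low, high)]
print(trimmed.median())
```

Note how much the extremes distort the mean while barely moving the median, which is why the median is often the safer summary for data like this.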

The next issue is inconsistent data:

When trying to extract the hour from the Date_time column, this error was generated: ParserError: hour must be in 0..23: 10/11/2006 24:00

In the data, midnight is sometimes represented as 24 and sometimes as 00.

There are different ways to deal with this; we'll create a time column by extracting the hour part of the Date_time string, then replace '00' with '24' so midnight is always represented the same way.

df1['time'] = df1['Date_time'].str[-5:-3]
df1['time'] = df1['time'].replace({'00':'24'}, regex=True)
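If you need a real datetime column rather than a string slice, an alternative workaround (a sketch, not the approach used above) is to rewrite the offending '24:00' values before parsing:

```python
import pandas as pd

# Invented timestamps in the dataset's mm/dd/yyyy hh:mm format
raw = pd.Series(['10/11/2006 24:00', '10/11/2006 00:15', '10/11/2006 17:30'])

# pd.to_datetime rejects hour 24, so rewrite '24:00' as '00:00' first.
# (This treats 24:00 as midnight of the same calendar date; rolling it
# over to the next day would also be a defensible choice.)
fixed = raw.str.replace(' 24:00', ' 00:00', regex=False)
parsed = pd.to_datetime(fixed, format='%m/%d/%Y %H:%M')
print(parsed.dt.hour.tolist())  # [0, 0, 17]
```

A proper datetime column gives access to the whole .dt accessor (hour, dayofweek, and so on) without further string handling.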

Now we can visualize the distribution of sightings over the time of day.

df1['time'].value_counts().sort_index(ascending=True).plot(kind='bar');
[figure: bar chart of sightings by hour of day]

So a lot of UFOs are spotted between 5 pm (17) and midnight (24). One way to interpret this: most people are asleep between midnight and 8 am, so there are fewer sightings then; also, people are less familiar with the night sky, so there are more UFO sightings after 5 pm. In other words, many of the ‘lights’ and ‘flashes’ seen at night might be due to things like shooting stars (meteors), aircraft, or artificial satellites: explainable phenomena that are unfamiliar to the observer.

A Quick Overview Of What You’ve Learned

If you’ve made it this far, you should:

  • Know how to apply a function to columns in a dataset
  • Understand the importance of normalizing a dataset
  • Be able to investigate categorical variables using the value_counts function
  • Be aware of some of the issues associated with bad and missing data

I hope you’ll take the UFO analysis we did a step further and integrate it with more open-source data on the internet. If you have any feedback or suggestions for improving this article, we would love to hear from you!

References:

  1. Hurley, G., The Trouble with Eyewitness Identification Testimony in Criminal Cases, 06/01/2021, link

Connect With Mr. Data Science:

MrDataScience.com, GitHub, Medium
