More Than Meets The Eye: Unsupervised Learning on UFO Reports — Part I

Katie Lazell-Fairman
8 min read · Mar 5, 2018


Using unsupervised text clustering on a UFO report dataset to identify latent topics.

This is Part I of a two-part series. Part I walks through text vectorization and clustering with Python’s scikit-learn library and reveals latent topics in the reports.

Photo by Ryan Lange via Unsplash

I’ve been looking to get more practice with clustering and natural language processing and recently learned about the National UFO Reporting Center (NUFORC). So I decided to take a look at some of the most heavily reported events in the United States.

This article shares the data science techniques I used, what works and what doesn’t, and key insights from the analysis. What I found was interesting and challenged some of my assumptions about the data.

The Dataset

The National UFO Reporting Center is a non-profit organization that has collected reports since the 1970s; its dataset also includes historical newspaper reports dating back to 1762. I subsampled the dataset to analyze US reports occurring between 1947 and 2017. Out of a total of 94,057 reports, I found:

  • 788 (0.84%) were investigated by NUFORC.
  • 265 (0.28%) included hyperlinks to video footage on YouTube.
  • 293 (0.31%) mentioned the word ‘abducted’.

Given its size, there are a variety of approaches you could take to understanding this dataset. I decided to focus my analysis on the top 10 most heavily reported events.

I first ran an unsupervised learning algorithm to cluster the types of reports into four categories. In Part II, I take a closer look at five of the 10 dates and run a sentiment analysis on the broader dataset to understand what type of language witnesses use while writing these reports.

Most Reported Events

Not surprisingly, I found the data was skewed towards more recent years (2014–2017). I chose to account for population growth by factoring in yearly estimates of online users in the US and removed ‘approximate bucket dates,’ where the exact dates were unknown. After these corrections, the following top 10 dates emerged as highly reported (1035 reports total).
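For the curious, here’s a rough pandas sketch of how that correction could look; the reports DataFrame and the online_users mapping (year → estimated US internet users) are hypothetical stand-ins, since the post doesn’t share the exact code.

import pandas as pd

# reports: one row per report, with a 'date' column (hypothetical schema)
per_date = reports.groupby('date').size().rename('n_reports').reset_index()
per_date['year'] = pd.to_datetime(per_date['date']).dt.year

# Divide each date's raw count by the estimated online population that year,
# then rank dates by the normalized score.
per_date['score'] = per_date['n_reports'] / per_date['year'].map(online_users)
top10 = per_date.nlargest(10, 'score')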

Notice that of the top 10, three dates have significantly higher popularity scores than the rest. Interestingly, six are in the 1990s and two are on the 4th of July.

Topic Analysis with Unsupervised Learning

I wanted to take a deeper look at this data to identify any distinct themes. The reports typically include a paragraph describing the event in detail. To analyze these descriptions, I applied basic natural language processing (TF-IDF) and unsupervised learning (k-means clustering) to group similar reports into topics.

K-means Clustering on Text Explained…

Feel free to skip this part; it’s an overview of how the algorithm works for text analysis.

This process creates a number of matrices:

  • A term-frequency matrix (TF), which counts the number of times a term (i.e. a word or combination of words) appears in each report.
  • An inverse-document-frequency matrix (IDF), a logarithmically scaled inverse fraction representing how rare a term is across the whole dataset: the logarithm of the total number of documents in the corpus divided by the number of documents containing the term.
  • These two matrices are then multiplied element-wise to create a vector for each report. Each vector represents the report as a point in space, i.e. the mathematical ‘location’ of the report relative to other reports.
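As a concrete illustration, here’s a minimal scikit-learn sketch of this vectorization; the toy reports are made up, and note that scikit-learn’s IDF is a slightly smoothed variant of the formula above.

from sklearn.feature_extraction.text import TfidfVectorizer

reports = [
    'bright orange fireball moving across the sky',
    'orange lights in formation over the horizon',
    'fireball with a glowing green tail',
]

vectorizer = TfidfVectorizer()             # TF x IDF with default settings
tfidf = vectorizer.fit_transform(reports)  # sparse matrix: one row per report

print(tfidf.shape)                         # (3 reports, vocabulary size)
print(vectorizer.get_feature_names_out()[:5])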

Once we have our matrix of TF-IDF vectors, we can compare the reports’ similarity to each other using k-means clustering. The algorithm randomly chooses k of the vectors to initialize the centroids. Think of these centroids as comparison reference points (vectors). On each iteration, the following happens:

  1. Each report point is compared to all the centroids by Euclidean distance (straight-line distance) and assigned, or re-assigned, to the closest centroid.
  2. Once all points have been assigned to a centroid, the average vector or “middle point” of the cluster is calculated from its cluster members, and this becomes the new centroid point for comparison in the next iteration.

The two steps are repeated many times. In early iterations the centroids may sit close to one another, but over successive iterations they adapt and move until the assignments stop changing (or an iteration limit is reached).
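For the curious, here’s a bare-bones NumPy sketch of that two-step loop; it’s the textbook algorithm rather than scikit-learn’s optimized implementation, and X stands in for a dense array of report vectors.

import numpy as np

def kmeans(X, k, n_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    # Step 0: pick k random report vectors as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest centroid (straight-line distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        new_centroids = []
        for j in range(k):
            members = X[labels == j]
            # keep the old centroid if no points were assigned to this cluster
            new_centroids.append(members.mean(axis=0) if len(members) else centroids[j])
        centroids = np.array(new_centroids)
    return labels, centroids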

Visualization of K-means Clustering method using 4 clusters. Image reference: vinhkhuc

For this project, I used scikit-learn’s implementation of k-means, which runs for a maximum of 300 iterations by default, fits the model 10 separate times (with 10 different seeds), and keeps the run with the lowest inertia (the sum of squared distances from each point to its assigned centroid). One caveat: by default, scikit-learn seeds the centroids with the k-means++ scheme rather than purely at random.

Clustering Method and Results

I removed common stop words (e.g. ‘in’, ‘the’, ‘from’, ‘are’, ‘!’) and custom stop words (e.g. ‘looked’, ‘shape’, ‘like’, ‘noticed’, ‘thought’), then ran a text vectorizer on the report text and applied TF-IDF weighting to the dataset.

For this analysis, I ran a TF-IDF vectorizer using n-grams ranging from a single word up to four-word phrases, capped the vocabulary at 12,000 features, and applied L1 normalization to the report vectors. (A note on terminology: L1 normalization rescales each vector to unit L1 norm; it’s the feature cap, rather than the normalization, that reduces dimensionality, and it isn’t the same as LASSO’s L1 regularization.)
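Pieced together, the pipeline described above would look roughly like this; the variable names and random_state are illustrative, and report_texts stands in for the raw report descriptions.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

custom_stops = ['looked', 'shape', 'like', 'noticed', 'thought']

vectorizer = TfidfVectorizer(
    stop_words=list(ENGLISH_STOP_WORDS) + custom_stops,
    ngram_range=(1, 4),    # single words up to four-word phrases
    max_features=12000,    # keep only the 12,000 highest-frequency terms
    norm='l1',             # L1-normalize each report vector
)
X = vectorizer.fit_transform(report_texts)  # report_texts: list of report strings

# scikit-learn's defaults: up to 300 iterations, 10 runs with different seeds,
# keeping the run with the lowest inertia
km = KMeans(n_clusters=4, n_init=10, max_iter=300, random_state=42).fit(X)
labels = km.labels_

The per-cluster term lists below can then be recovered by sorting each centroid’s weights and mapping the indices back through vectorizer.get_feature_names_out().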

Even though 3 clusters would have been sufficient for this analysis (the clusters seemed more distinct and less repetitive), I chose 4 so I could demonstrate a typical case of unintended clustering behavior from the k-means algorithm. In general, though, you’d use an elbow plot to decide how many clusters are optimal for your clustering problem, as in the sketch below.
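As a sketch, an elbow plot just sweeps k and records the model’s inertia; this reuses the X matrix from the pipeline sketch above.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(2, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker='o')  # look for the 'elbow' where the curve flattens
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()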

Here are the top 50 n-gram results of the 4 clusters…

Cluster 1: 4th July fireball, fireworks and flashing lights of various colors

['looking' 'show' 'quickly' 'round' 'clouds' 'behind' 'wife' 'circular' 'watched' 'across sky' 'minute' 'fireball' 'still' '30' 'approximately' 'line' 'night' 'slow' 'west east' 'speed' 'glowing' 'fast' 'direction' 'watching' 'green' 'seemed' 'later' '10' 'across' 'fly' '4th' 'noticed' 'july' 'sound' 'flying' 'moved' 'two' 'traveling' 'ball' 'slowly' 'high' 'flashing' 'formation' 'minutes' 'disappeared' 'north' 'objects' 'fireworks' 'red' 'orange']

Cluster 2: Blue white light and US Navy missile launch references

['contact' 'provides contact information' 'provides contact'
'information pd' 'contact information pd' 'provides' 'contact information' 'clouds' 'slowly' 'shaped' 'glowing' 'leaving' 'shape' 'looked like' 'light sky' 'away' 'ball' 'craft' 'blue light' 'bright white' 'disappeared' 'left' 'minutes' 'large' 'smoke' 'us' 'trail' 'bright light' 'white light' 'note us' 'note us navy' 'us navy' 'us navy missile' 'us navy missile launch' 'note us navy missile' 'green' 'note navy missile launch' 'note navy missile' 'note navy' 'cloud' 'navy missile launch pd' 'navy missile' 'navy missile launch' 'missile launch pd' 'navy' 'launch pd' 'missile launch' 'missile' 'blue' 'launch']

Cluster 3: Moving objects in formation of various colors

['aircraft' 'shaped' 'witnessed' 'around' 'seemed' 'behind' 'slowly'
'sound' 'direction' 'fire' 'meteor' 'red' 'area' 'said' 'three' 'fast' 'noticed' 'flying' 'wife' 'seconds' 'moved' 'orange' 'ufo' 'craft' 'thought' 'night' 'speed' 'large' 'tail' 'shape' 'traveling' 'fireball' 'across' 'green' 'horizon' 'summary' 'formation' 'blue' 'fireworks' 'north' 'two' 'objects']

Cluster 4: Descriptive words beginning with letters: L, F, and G (!)

['longer' 'long' 'little' 'line' 'light sky' 'length' 'left' 'look' 'end' 'heard' 'headed' 'flew' 'flashing' 'first thought' 'fireworks' 'firework' 'fireball' 'fire' 'flight' 'feet' 'far' 'faded' 'extremely' 'ever' 'event' 'evening' 'even' 'fast' 'fly' 'flying' 'followed' 'head' 'happened' 'half' 'ground' 'green' 'got' 'gone' 'going' 'yellow' 'go' 'glowing' 'glow' 'get' 'front' 'friend' 'formation' 'following' 'heading' 'information']

What most jumped out at me were the first two clusters. Cluster 1 references the 4th of July, while Cluster 2 heavily references a US Navy missile launch. I was curious to see if these clusters represented specific events, especially regarding the missile launch, so I plotted the cluster groups by their event date.

Pretty interesting! As you can see in the chart, the reports in Cluster 1 were mostly dated the 4th of July 2014, rather than the 4th of July 1997 identified in the Most Reported Events chart above. The reports in Cluster 2 were almost entirely dated the 7th of November 2015. This suggests the clusters were influenced by a class imbalance: these two dates (in 2014 and 2015) account for 52% of the reports that make up the top 10 dates subset.

Class Imbalance: similar to an unfair game of tug-of-war!

To fix the class imbalance problem, you would want to downsample the data from these two groups so that the number of reports per date is more balanced before clustering. Read more about resolving class imbalance.
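A minimal sketch of that downsampling, assuming a reports DataFrame with a date column (the names and the per-date cap are hypothetical):

import pandas as pd

cap = 100  # illustrative per-date ceiling on report counts
balanced = (
    reports.groupby('date', group_keys=False)
           .apply(lambda g: g.sample(n=min(len(g), cap), random_state=42))
)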

What’s cool about this chart, though, is that the language used by witnesses is similar enough for an unsupervised model to pick up on these distinct events without knowing the report dates.

Cluster 3 is an example of an unintended clustering outcome. It can barely be seen on the chart; the majority of its reports are from the 7th of November 2015, yet they do not reference the missile launch or blue/white lights. This may indicate a separate, less reported event that also occurred on that date, or these reports may simply form a small pocket that is very similar to Cluster 2.

Cluster 4 is spread out across all the dates, which makes sense, given that the words in that cluster are mostly generic, descriptive words. This is a good (and typical) example of another unintentional clustering outcome, where instead of a topical group, the algorithm has found reports which share a similarity in language structure.

ufo by Nook Fulloption from the Noun Project

In this post, I identified the top 10 most heavily reported UFO events in the United States, flagging three dates for a closer look, and used text vectorization and clustering to identify two more dates for targeted analysis. In the next part, I’ll take a deeper look at these five dates, investigate explanations for the events that took place, and run a sentiment analysis on the whole dataset to understand what type of language witnesses use in these reports.

Check out Part II of the analysis.

Hi, I’m Katie Lazell-Fairman. I’m a data scientist based in New York City. Check out my other data projects on Github. Got a question about this post, or are you curious about the project? Comment below or feel free to contact me!
