UFO sightings, nuclear plants and a data science starter kit.

Brenda Leyva
Published in MCD-UNISON · Dec 12, 2020 · 8 min read

There’s a myth going around that UFO sightings are more likely to occur near nuclear facilities. How does one go about exploring this idea?


Disclaimer: I am in no way, shape or form getting involved in the debate about UFOs being actual E.T. stuff. I do, however, see this topic as folklore of sorts and find it very entertaining; this post is only for demonstration purposes in data manipulation and analysis.

Taking on a project like this can feel overwhelming. To avoid that, it is always better to take things step by step, so that's how we are going to go over them.

1. Have a question:

Everything starts with a question. It makes things easier in the long run and keeps you focused during the analysis, so it's easier to identify findings, know when to stop, complete the process and deliver a good product. For this topic, I went off of the following:

Is it possible to establish a clear relation between UFO sightings and the location of nuclear facilities?

2. Get your data:

A beginner-friendly source of datasets I can recommend is Kaggle. This is where I ran into my first dataset and was able to collect my second one too. We need information on UFO sightings and on nuclear plants in order to explore possible answers to our question.

3. Get comfortable with your dataset(s):

I chose to begin exploring my data with Python in Jupyter notebooks, and created a tidy final dataset that I then used in R Markdown for my dashboard.

Some libraries to use for this are the basics for Python data tasks.
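A minimal set of imports along those lines might look like this, assuming pandas, NumPy and dfply, the libraries mentioned throughout this post:

```python
import pandas as pd   # tabular data handling
import numpy as np    # numeric helpers
from dfply import *   # pipe-style data manipulation (mask, X, ...)
```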

Read your CSV and understand your data:
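For example, something along these lines (the file name here is just a placeholder for the Kaggle download):

```python
ufo = pd.read_csv("ufo_sightings.csv", low_memory=False)

ufo.head()    # first few rows
ufo.shape     # number of rows and columns
ufo.dtypes    # column types
```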

A useful thing to do is to write a function that will create an overview of your data as follows:
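One possible version of such a function, summarizing types, missing values and unique counts per column:

```python
def overview(df):
    """Per-column summary: dtype, missing values and unique counts."""
    return pd.DataFrame({
        "dtype": df.dtypes,
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
        "unique": df.nunique(),
    })

overview(ufo)
```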

I worked with dfply to clean out the missing values and kept only the data from the UK. The dataset is really big and contains reports from several countries, and due to time constraints I chose to work only with the subset corresponding to one country.
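A sketch of that filtering step, assuming the sightings file codes the United Kingdom as "gb" in its country column:

```python
ufo_uk = ufo >> mask(X.country == "gb")   # keep only UK reports
ufo_uk = ufo_uk.dropna()                  # drop rows with missing values
```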

The same was done with the second dataset.

This one had only a few columns we needed, so the rest were dropped.

Then all countries except the United Kingdom were filtered out.

Once the datasets included only the one country and had no missing or null values, each was written to a new CSV:
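Roughly like this; the plant dataset's column names below are assumptions for illustration, not the exact ones from the Kaggle file:

```python
plants = pd.read_csv("nuclear_power_plants.csv")

# Keep only the columns we actually need (names assumed for illustration).
plants = plants[["Name", "Latitude", "Longitude", "Country"]]
plants_uk = plants[plants["Country"] == "United Kingdom"]

# Save the reduced, UK-only versions of both datasets.
ufo_uk.to_csv("ufo_uk.csv", index=False)
plants_uk.to_csv("nuclear_plants_uk.csv", index=False)
```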

4. Work with your data to improve it:

Both datasets have latitude and longitude in them, but the location in terms of city, county and country isn't really clear or is missing. This can be fixed with a process called reverse geocoding: using your latitude and longitude columns to populate the city, state and country information from the address attached to the coordinates. There are many ways to do this.

Here are two of the ways I found; both of them worked but had their details to consider:

HEREpy uses the HERE Maps API to handle a lot of geocoding situations, reverse geocoding being one of them. You can use it for a big dataset without issues. The response comes as JSON and can be parsed and handled easily.

Nominatim doesn't require an API key but has a query limit that you need to be aware of. I have found that it works well if you have fewer than a thousand rows of data. It returns a list with the address items separated by commas.

First I ran them both to confirm everything was working and to take a good look at the response’s format.

The problem with the HEREpy function was that, within the JSON response, not every address had the same attributes, so the loop that populated the lists of the different fields constantly stopped with errors. Instead, I decided to go with the Nominatim function: my dataset was reduced once I selected only one country, so the number of rows I had could be passed through a loop. The nice thing about Nominatim's response is that it always gives back a list, and the country and state/county were always at the same element number. With the city there were variations here and there, sometimes it took the street name instead, but the loop didn't constantly stop, as it always got back a piece of data.
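A minimal sketch of that loop, using geopy's Nominatim wrapper; the column names and the negative indexes used to pick out country, county and city are assumptions for illustration, and the right positions depend on how Nominatim formats the addresses for your rows:

```python
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="ufo-nuclear-demo")
reverse = RateLimiter(geolocator.reverse, min_delay_seconds=1)  # respect the usage limit

countries, counties, cities = [], [], []
for lat, lon in zip(ufo_uk["latitude"], ufo_uk["longitude"]):
    parts = reverse((lat, lon)).address.split(", ")
    countries.append(parts[-1])   # the country is the last element
    counties.append(parts[-3])    # state/county at a fixed position from the end (assumed)
    cities.append(parts[-4])      # the city varies; occasionally a street name shows up here

ufo_uk["country"] = countries
ufo_uk["county"] = counties
ufo_uk["city"] = cities
```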

At the end of this process we have the same two datasets, now with more detail about the locations.

The rest of the data manipulation was mostly about format: renaming a few columns and rearranging them in a way that made more sense.

The next issue with this dataset was that the duration column had very inconsistent values; cleaning it was a little more challenging than expected. The first step was to take a look at the values in the column and then separate them with split(), to get mostly the numbers in a first column and the describing word, like minutes, hours or seconds, in a second column.

Cleaning consisted mostly of detecting common typos and similar words and mapping them to the same label: all those mimutes, mnuits and minuites became "min".

This is not ideal and I don’t encourage it, but at the moment this got me through to the other side.

The next step is simply to merge the first two columns back together and make that the new duration column in the dataframe.
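Put together, that cleanup might look something like this, assuming a free-text column named "duration" (the typo map is only a small sample):

```python
# Split "10 mimutes" into an amount ("10") and a unit ("mimutes").
parts = ufo_uk["duration"].astype(str).str.split(n=1, expand=True)
parts.columns = ["amount", "unit"]

# Map common typos and variants onto a single label.
unit_map = {
    "minutes": "min", "minute": "min", "mins": "min",
    "mimutes": "min", "mnuits": "min", "minuites": "min",
    "seconds": "sec", "second": "sec", "secs": "sec",
    "hours": "hr", "hour": "hr", "hrs": "hr",
}
parts["unit"] = parts["unit"].str.strip().str.lower().map(unit_map).fillna(parts["unit"])

# Merge the two columns back into the dataframe as the new duration column.
ufo_uk["duration"] = parts["amount"] + " " + parts["unit"].fillna("")
```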

After this I dropped the "latlong" columns I had created for the reverse geocoding step and saved the datasets as new, tidy CSV files.

The goal here is to have one tidy dataset that combines the two, to be used afterwards to create a product, in this case a dashboard. So the last step was to detect the cities with UFO sightings that also have nuclear plants, as in the sketch below.
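A hedged sketch of that final step, assuming both datasets now carry the reverse-geocoded city column:

```python
# Cities that host a nuclear plant, lowercased for a more forgiving match.
plant_cities = set(plants_uk["city"].str.lower())

# Flag each sighting by whether its city also appears in the plants dataset.
ufo_uk["has_nuclear_plant"] = ufo_uk["city"].str.lower().isin(plant_cities)
ufo_uk["has_nuclear_plant"].value_counts()

# Save the combined, tidy result.
ufo_uk.to_csv("ufo_nuclear_uk_tidy.csv", index=False)
```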

Doing that we can see that 1,459 UFO sightings occurred in cities that do not have a nuclear plant.

I saved this as a final tidy dataset.

5. Build a product:

Building a dashboard with R Markdown and flexdashboard is a pretty simple task once you are familiar with the environment. Once flexdashboard is installed, you can create a new R Markdown document using the Flex Dashboard template.

The rest consists of tweaking a few options for the dashboard's columns and rows to add the content; more details on this can be accessed here.

I loaded several libraries to work on this markdown, some at the get-go and some as needs appeared.

The first thing I knew I wanted to create was a map showing where the UFO sightings were reported and where the nuclear plants are located. I added Stonehenge to the mix just as fan service.

The resulting map is really pretty and interactive thanks to plotly.

I continued with some visual aids, like a bar plot comparing cities with UFO sightings and nuclear plants against those with UFO sightings and no nuclear plant, and charts showing the distribution of the most common durations of the UFO events as well as the reported shapes.

A really fun part came with the text analysis, which allowed me to create a bar plot of the most common words within the comments of the events and a word cloud to match. I want to give credit to Sanil Mhatre for sharing this sentiment analysis with R, which made it really easy to reproduce some text analysis elements.

Once the contents of the dashboard are ready, one can simply go to the Knit option in RStudio and knit to HTML for a beautiful, functional dashboard.

There are a few more details to the process that took place in order to achieve the goal of the project, but the intention behind this post is to share an overview of what a project with data that isn't very tidy may involve.

I decided to share it all, the good, the bad and the worst, on my GitHub page; maybe it will be useful for someone out there.

All the credit goes to Flex and RStudio for having such cool functionality.

6. Now you have an educated guess.

After all of that, where did my question go? Is there a relation between UFO sightings and the location of nuclear facilities? According to what I found, there is no evidence to back up that claim. In fact, from what I have gathered, the opposite might be true. I did see that events had shorter durations around nuclear plants, and even though the majority of sightings described the shape as a "light", around nuclear plants the most popular shape was a triangle, so there's that.

Maybe the findings would be very different if we were to analyze the situation worldwide, but for now I wasn't able to find a pattern that would agree with these findings, or these ones, or even these ones.

This process will look different for everybody, every project and every context, but the general idea is the same. To summarize:

  1. Have a question
  2. Get your data from trustworthy sources
  3. Get comfortable with your data
  4. Clean your datasets and adapt them to your needs
  5. Use tidy datasets to create a product
  6. Present your findings

Much success on your projects and happy data science-ing!

Brenda Leyva
MCD-UNISON

Former business administration professional turned physicist, turned data scientist with a unique approach to problem solving and data analysis.