Big Data for Big Sharks

The Exploratory Data Analysis

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location', 'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time', 'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href', 'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22', 'Unnamed: 23'], dtype='object')
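A couple of things jump out of that column list: 'Sex ' and 'Species ' carry a trailing space, and there are two 'Unnamed' columns at the end. Here is a minimal sketch of one way to tidy that up, assuming the data lives in a pandas DataFrame. The helper name `tidy_columns` and the toy values are made up for illustration and are not from the original analysis.

```python
import pandas as pd


def tidy_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Strip stray whitespace from column names and drop 'Unnamed: N' columns."""
    cleaned = df.rename(columns=lambda name: name.strip())
    keep = [name for name in cleaned.columns if not name.startswith("Unnamed:")]
    return cleaned[keep]


# Toy frame mimicking a few of the columns listed above (values are invented).
toy = pd.DataFrame(
    {"Sex ": ["M"], "Species ": ["Bull shark"], "Unnamed: 22": [None]}
)
tidy = tidy_columns(toy)
print(list(tidy.columns))
```

After this, `df["Sex"]` and `df["Species"]` work without having to remember the trailing space.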
Look at all that missing data 😱
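To put a number on "all that missing data", one option is to compute the share of missing values per column. This is a rough sketch with pandas; the function name `missing_share` and the toy frame are assumptions for illustration, not the dataset itself.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame with invented values, only to show the shape of the result.
toy = pd.DataFrame(
    {
        "Age": [25, None, None, 40],
        "Country": ["USA", "AUSTRALIA", None, "BRAZIL"],
    }
)
share = missing_share(toy)
print(share)
```

Columns at the top of the resulting Series are the ones most in need of imputation or dropping.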
          FLORIDA   NORTH CAROLINA   WA (AUS)   NSW (AUS)
Summer:   35%       75%              31%        48%
Winter:   7%        2%               16%        9%
Spring:   26%       3%               32%        18%
Autumn:   32%       20%              22%        25%
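A seasonal breakdown like the one above has to account for the hemisphere, since January is winter in Florida but summer in NSW. The sketch below shows one way that could be done with the standard library only. It assumes meteorological seasons (December to February is northern winter, and so on), which may not be the definition used for the table; the dates are invented.

```python
from collections import Counter
from datetime import date

# Assumption: meteorological seasons, keyed by month, for the northern hemisphere.
NORTHERN_SEASONS = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}
# Southern-hemisphere seasons are the northern ones shifted by six months.
OPPOSITE = {"Winter": "Summer", "Summer": "Winter", "Spring": "Autumn", "Autumn": "Spring"}


def season_of(day: date, southern: bool) -> str:
    """Return the season name for a date, flipping it for the southern hemisphere."""
    season = NORTHERN_SEASONS[day.month]
    return OPPOSITE[season] if southern else season


def season_shares(days, southern: bool) -> dict:
    """Percentage of dates falling in each season, rounded to whole percent."""
    counts = Counter(season_of(day, southern) for day in days)
    total = sum(counts.values())
    return {season: round(100 * n / total) for season, n in counts.items()}


# Invented dates, only to exercise the functions.
toy_dates = [date(2015, 1, 10), date(2015, 1, 25), date(2015, 7, 4), date(2015, 10, 31)]
north = season_shares(toy_dates, southern=False)
south = season_shares(toy_dates, southern=True)
print(north, south)
```

The same four dates come out winter-heavy for a northern region and summer-heavy for a southern one, which is why the flip matters before comparing columns.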
Don’t try this at home!

The Machine Learning Bit

No offense Brazil.
A subsection of the embedding visualisation showing the Florida cluster — even the little sharks got clustered together! #smort
╔═══════════════════╦══════════════╦═══════════════╦═══════════════╗
║ Model             ║ DL4ML Feats. ║ One-hot Enc.  ║ Categorical   ║
╠═══════════════════╬══════════════╬═══════════════╬═══════════════╣
║ Nearest Neighbors ║ 0.73         ║ 0.73          ║ 0.69          ║
║ Linear SVM        ║ 0.76         ║ 0.76          ║ 0.69          ║
║ RBF SVM           ║ 0.61         ║ 0.68          ║ 0.74          ║
║ Decision Tree     ║ 0.73         ║ 0.73          ║ 0.73          ║
║ Random Forest     ║ 0.68         ║ 0.62          ║ 0.79          ║
║ AdaBoost          ║ 0.73         ║ 0.76          ║ 0.78          ║
║ Naive Bayes       ║ 0.69         ║ 0.69          ║ 0.71          ║
║ Neural Net        ║ 0.81         ║ 0.78          ║ 0.76          ║
╚═══════════════════╩══════════════╩═══════════════╩═══════════════╝
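For anyone wanting to produce one column of a table like this, here is a hedged sketch of scoring several scikit-learn classifiers on the same feature matrix. Everything specific is an assumption: the data is synthetic (from `make_classification`), the metric is 5-fold cross-validated accuracy, and the hyperparameters are placeholders. The table does not say which metric or settings were actually used, and only a subset of the listed models is shown to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the article's real features are not reproduced here.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hyperparameters below are illustrative placeholders, not the article's settings.
models = {
    "Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
    "Linear SVM": SVC(kernel="linear", C=0.025),
    "RBF SVM": SVC(kernel="rbf", gamma="scale"),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Mean 5-fold cross-validated accuracy per model (assumed metric).
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name:<18} {score:.2f}")
```

Running the same loop once per feature representation (embedding features, one-hot, plain categorical codes) would fill in the remaining columns.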
The two t-SNE dimensions for Location, with bull sharks in orange and other sharks in blue: bull sharks show up all over the place! (The clusters aren’t labeled, but they correspond to e.g. Florida, NSW, Brazil etc.)
Class vs. Principal Component 1 plot showing the distinction between bull sharks and other sharks, along with unknown sharks that potentially fall within the bull shark mapping.
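The "Principal Component 1" axis in a plot like this is each data point's coordinate along the direction of greatest variance. As a rough illustration, that coordinate can be computed with NumPy via an SVD of the centered feature matrix. The function name and the random toy features below are assumptions; the features actually fed into the PCA are not shown in the post.

```python
import numpy as np


def first_principal_component_scores(features) -> np.ndarray:
    """Project each row onto the first principal component (direction of max variance)."""
    X = np.asarray(features, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]


# Random toy features with very different spreads per column (not real shark data).
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 3)) * np.array([5.0, 1.0, 0.2])
pc1 = first_principal_component_scores(toy)
print(pc1[:5])
```

Plotting `pc1` against the class label (bull shark, other, unknown) gives the kind of one-dimensional separation view described in the caption.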
Data points with unknown species that the model predicts to be bull sharks.

The Watery End…
