Okay, so disclaimer here: the data isn’t really ‘big’ by any means and, come to think of it, only some of the sharks are ‘big’ either. But hey, a title like ‘Small and Messy Data for Sharks of Various Sizes’ is a little less catchy, isn’t it? Why would anyone want to look into this dataset? Well, sharks might be a bit scary, but they’re also pretty important, and from a data science perspective this dataset is a unique challenge: it isn’t well standardised and it contains a mix of numerical and text data. So this article aims to do two main things. Firstly, it’ll look at some nifty ‘deep learning for machine learning’ techniques in Python, especially the tradeoff between maximal data cleaning and newer methods of text processing and embeddings that avoid feature engineering, even on structured data. And secondly, it’ll hopefully uncover some interesting sharky insights buried in this curious hodge-podge of data.
The Jupyter Notebook and associated helper functions can be found on Github, here. The processed dataset (with both the original columns and additional ‘corrected’ columns — with cleaned and aggregated entries) can be found on Kaggle, here.
The Exploratory Data Analysis
I won’t go into too much depth here, because it’s all in the Jupyter Notebook, but here are the columns we have to work with:
Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location', 'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time', 'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href', 'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22', 'Unnamed: 23'], dtype='object')
Some interesting discoveries once it was all done and dug through include:
- It’s a mess. Never underestimate how long it takes to clean things up — if you want your organisation to do ‘big data’ or ‘machine learning’ or any of the related buzzwords, get your data hygiene in order now. This one looks good at the start (i.e. the first 50 or so rows) but every column needs work to be standardised. Dates aren’t perfectly formatted, ages are sometimes ranges, the species might be a weird description of the shark, and because it’s a .csv file, the fatality information sometimes winds up in the Time column! And there is a lot missing. Using Missingno, we can visualise the missing data in each column:
- The most likely activity related to any sharky encounter was surfing, BUT the activity most related to a fatal one was swimming.
- Contrary to what they say on the internet, the USA is actually the shark encounter capital of the world (not Australia, thank you very much)! Okay, okay, that’s just on raw counts — 2229 vs. 1338. Accounting for coastal population (USA ~ 39% or 126 mil vs. AUS ~ 85% or 21 mil), Australia does, in fact, have 3.6 times as many encounters per coastal capita, so I guess we are pretty sharky after all. Just goes to show how easy it is to manipulate the numbers if you want to. A further breakdown of encounters per country for your viewing pleasure:
- The type of shark you encounter depends a lot on where you are. In Hawaii, it’s most likely to be a tiger shark, but in Florida (aside from the ones we don’t know) it’s more likely to be a bull shark, blacktip or spinner. And yes, living up to its sharky reputation, in Australia, it’s the great white.
- The time of year seems to count a bit too. Hemisphere-wide, more people swim in summer, so overall the encounter numbers are high then, but in some parts of the world, shark encounters are just as likely in spring or autumn as they are in summer. Some examples (WA = Western Australia, NSW = New South Wales, for brevity):
║ Season ║ Florida ║ North Carolina ║ WA (AUS) ║ NSW (AUS) ║
║ Summer ║ 35%     ║ 75%            ║ 31%      ║ 48%       ║
║ Winter ║ 7%      ║ 2%             ║ 16%      ║ 9%        ║
║ Spring ║ 26%     ║ 3%             ║ 32%      ║ 18%       ║
║ Autumn ║ 32%     ║ 20%            ║ 22%      ║ 25%       ║
- Using some coarse data filters for imputation (i.e. finding the non-surfing activities conducted solely in rivers), it looks like 127 incidents could almost certainly be attributed to bull sharks — which would put them in the second deadliest spot, ahead of tiger sharks!
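That coarse riverine filter can be sketched in a few lines of pandas. The column names below are approximated from the dataset index shown earlier (the real columns have quirks like trailing spaces), and the toy rows are made up purely to show the logic:

```python
import pandas as pd

# Toy stand-in for the shark encounter DataFrame; in the real dataset
# the relevant columns are 'Location', 'Activity' and 'Species ' (note
# the trailing space), so names here are illustrative.
df = pd.DataFrame({
    "Location": ["Brisbane River", "Bondi Beach", "Swan River", "Limpopo River mouth"],
    "Activity": ["Swimming", "Surfing", "Wading", "Surfing"],
    "Species": ["Unknown", "Unknown", "Unknown", "Unknown"],
})

# Rivers but not beaches, and anything except surfing.
loc = df["Location"].str.lower()
riverine = loc.str.contains("river") & ~loc.str.contains("beach")
not_surfing = ~df["Activity"].str.lower().str.contains("surf")
likely_bull = df[riverine & not_surfing]
print(len(likely_bull))  # 2 of the 4 toy rows match
```

Applied to the full dataset, a filter along these lines is what surfaces the 127 likely bull shark incidents mentioned above.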
The Machine Learning Bit
While it seems tempting to try and ‘predict’ when and where there might be more encounters, this dataset doesn’t really contain all the information that you’d need for that task. It likely relies on the number of people in the water, what human activities were occurring around that time, prey migration, the weather, shark migrations, and sightings (Update: Found one that might work, here). Since I didn’t have those readily on hand, I aimed to investigate two problems — firstly, is it possible to predict whether an encounter was fatal based on the data available? And secondly, given the information available, is it possible to infer what kind of shark was responsible, which could be useful for data imputation.
So this section will largely cover the progression of feature selection and model tweaks to improve accuracy. One thing I wanted to investigate was the comparison between new semantic embedding methods for feature extraction from text and the standard machine learning techniques involving feature engineering, such as aggregation of text into categorical or one-hot encoded variables.
As a benchmark, I used the aggregated and corrected columns for country, activity, species and hemisphere (calculated with country name mapping), as well as age, sex and date split into year and month. All of these were painstakingly processed a la traditional data science, with spelling errors fixed up, similar entries aggregated and reduced to fit the most popular categories, mostly leveraging ‘expert knowledge’. I couldn’t use the Location or Area column because the processing/cleaning burden was just way too high. I mean, look at this junk:
Enter Deep Learning for Machine Learning!
Don’t want to spend the rest of your life massaging crappy text data into a form that can be readily and usefully ingested by a machine learning model? Well, it turns out there are some new, fancy deep learning methods that enable you to capture semantic relationships between data points without having to process the data AT ALL (caveat: you probably do still want to tidy up the continuous values lol). In this case, the trick is to string each of the text-containing column values (country, location, activity, species) together to form a ‘sentence’, e.g. ‘Australia, Brisbane River, riding an inflatable unicorn, 1.5m bull shark, stole his vegemite sandwich’. Next, grab the ELMo Tensorflow model, calculate text embeddings from these ‘sentences’, then reduce with PCA to however many features you’d like for your ML model (the raw embedding is 1024-dimensional, so quite big). The code for using the model is on my Github and an example from the visualisation of the output is here:
The full HTML example is available here — recommend downloading and opening in browser for the full, interactive experience!
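The pipeline boils down to three steps: build the ‘sentences’, embed them, reduce with PCA. Here’s a minimal sketch with toy rows; a random matrix stands in for the actual ELMo call (the real version on Github wraps the TF Hub module and returns 1024-dimensional vectors per sentence), so the commented-out line marks where the real embedder would slot in:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy rows standing in for the dataset's text columns.
df = pd.DataFrame({
    "Country": ["Australia", "USA", "USA", "South Africa"],
    "Location": ["Brisbane River", "New Smyrna Beach", "Maui", "Durban"],
    "Activity": ["Swimming", "Surfing", "Snorkelling", "Spearfishing"],
    "Species": ["Bull shark", "Blacktip shark", "Tiger shark", "Great white"],
})

# 1. String the text columns together into one 'sentence' per row.
sentences = df.apply(lambda r: ", ".join(r.astype(str)), axis=1).tolist()

# 2. Embed each sentence. ELMo returns a 1024-dim vector per sentence;
#    a random matrix stands in for it here so the sketch runs without
#    TensorFlow. In the real pipeline:
#    embeddings = elmo_embedder.embed(sentences)  # hypothetical call
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(sentences), 1024))

# 3. Reduce to a manageable feature count for the downstream model
#    (the article's experiments reduced to around 110 features).
pca = PCA(n_components=3)  # capped by the 4 toy samples here
features = pca.fit_transform(embeddings)
print(features.shape)  # (4, 3)
```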
As you can see, this method works so well at properly clustering the encounters by geographical location, activity and species because it is w̶i̶t̶c̶h̶c̶r̶a̶f̶t̶ a pre-trained character-level bi-directional LSTM — meaning it doesn’t care about your garbage spelling or missing characters and still does a fine job of calculating context-aware sentence embeddings. Hoorah!
Cool, but how do these new ‘features’ for location, activity and species hold up against the aggregated and heavily processed features? Short answer — effectively the same and in some cases, even better! The main benefit comes not just from improving model accuracy by capturing more relevant data but mostly from achieving the same goals without all the effort and time required to clean the dataset. Here’s a side-by-side comparison of the different ML methods and features. The number of features for each experiment was 110 (DL4ML), 217 (One-hot Encoded Features) and 8 (Categorical Features).
║ Model ║ DL4ML Feats. ║ One-hot Enc. ║ Categorical ║
║ Nearest Neighbors ║ 0.73 ║ 0.73 ║ 0.69 ║
║ Linear SVM ║ 0.76 ║ 0.76 ║ 0.69 ║
║ RBF SVM ║ 0.61 ║ 0.68 ║ 0.74 ║
║ Decision Tree ║ 0.73 ║ 0.73 ║ 0.73 ║
║ Random Forest ║ 0.68 ║ 0.62 ║ 0.79 ║
║ AdaBoost ║ 0.73 ║ 0.76 ║ 0.78 ║
║ Naive Bayes ║ 0.69 ║ 0.69 ║ 0.71 ║
║ Neural Net ║ 0.81 ║ 0.78 ║ 0.76 ║
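A comparison table like the one above can be generated with the usual scikit-learn classifier loop. This is a sketch, not the exact experiment: synthetic data stands in for the real feature matrices and fatal/non-fatal labels, and the model hyperparameters are illustrative defaults:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for (features, fatal/non-fatal label) pairs;
# the first feature carries most of the signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Nearest Neighbors": KNeighborsClassifier(3),
    "Linear SVM": SVC(kernel="linear"),
    "RBF SVM": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "AdaBoost": AdaBoostClassifier(),
    "Neural Net": MLPClassifier(max_iter=1000),
}

# Fit each model and report held-out accuracy, one row per classifier.
results = {}
for name, model in models.items():
    results[name] = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name:>18}: {results[name]:.2f}")
```

Swapping in the DL4ML, one-hot or categorical feature matrices for `X` reproduces the three columns of the table.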
Another thing to note is that, although there has been a lot of success using embeddings for categorical variables (as per Fast AI and the Rossmann dataset), this only works if your data is easily put into categorical variables, and not a heap of semi-structured, steaming text poop. A couple of extra bits to make the process work better:
- Picking the right number of classes for the data — in this case, rejecting any encounters that were ‘invalid’ (i.e. not involving a shark) or ‘unknown’ (i.e. fatality wasn’t reported or otherwise). The intuition is that if you couldn’t effectively reason as to how to distinguish between the classes, then don’t expect the model to either.
- Ditto for the features — in this case, ditch the rows where too many columns are ‘unknown’. Keeping them in is like asking someone to guess what you’re thinking when you haven’t told them anything at all.
- Lastly, sub-sample the dataset to get an even distribution of class labels. This helps prevent overfitting and inflated accuracy scores (created by the model just guessing the overrepresented class).
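The class balancing in that last point is a short pandas operation: down-sample every class to the size of the rarest one. A sketch with a toy imbalanced label column (the real label is the cleaned ‘Fatal (Y/N)’ column):

```python
import pandas as pd

# Toy imbalanced labels standing in for the fatality column:
# 90 non-fatal vs. 10 fatal encounters.
df = pd.DataFrame({"fatal": ["N"] * 90 + ["Y"] * 10, "feature": range(100)})

# Down-sample every class to the size of the rarest one.
n_min = df["fatal"].value_counts().min()
balanced = df.groupby("fatal").sample(n=n_min, random_state=0)
print(len(balanced))  # 20 rows: 10 of each class
```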
Semantic Embeddings for Data Imputation
Given the location of an encounter and the activity, can you guess what kind of shark it was? Or in the data imputation mindset, given that most encounters with, e.g. bull sharks, have a certain signature (like in rivers, not surfing, shallow-ish murky water), can we find other data points with an unknown shark species that probably belong to the same class? I compared two approaches. By literally string matching on the data frame to find Locations containing ‘river’ but not ‘beach’ and Activity != ‘surfing’, there are about 127 additional likely bull sharky culprits. But could there be more?
Visualising the semantic embeddings of location data alone with the shark species represented by the colour, there really isn’t a very clear cluster of ‘bull shark locations’ (which makes sense, because they’re in a lot of countries). However, this is actually more of a semi-supervised problem. We know which data is bull sharky and which isn’t, so using something like Linear Discriminant Analysis is more useful for reducing the semantic embedding feature set to the components that most distinguish between bull sharks and other sharks. And this is what we get! Cool!
There is at least some distinction between the location text related to bull shark incidents compared to other sharks, which can be seen where the yellow bull shark dots don’t overlap the dark blue other-shark dots. Including the Activity data only marginally improves things (separates out about 20 additional points). Comparing the actual data frame entries from both the coarse string-matchy search and the LDA exercise shows an overlap of about 50%, so not a perfect match, but we also pick up other extremely likely incidents — like ones in harbours or estuary-like places known for bull sharks, as well as actual places with prior bull shark encounters, like Ballina.
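The LDA step itself is small: fit on the points with known labels, project everything onto the discriminant axis, then score the unknowns. A sketch with synthetic vectors standing in for the location-text embeddings (two classes with a small shift in a few dimensions):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic embeddings: 'other shark' vs. 'bull shark' location vectors,
# with the bull shark class shifted slightly in the first 5 dimensions
# to mimic a weak but real signal in the text.
rng = np.random.default_rng(1)
other = rng.normal(size=(200, 50))
bull = rng.normal(size=(60, 50)) + np.r_[np.ones(5) * 1.5, np.zeros(45)]
X = np.vstack([other, bull])
y = np.array([0] * 200 + [1] * 60)

# With two classes, LDA gives a single discriminant component —
# the axis that best separates bull sharks from the rest.
lda = LinearDiscriminantAnalysis(n_components=1)
z = lda.fit_transform(X, y)
print(z.shape, round(lda.score(X, y), 2))
```

In the imputation setting, `lda.predict_proba` on the unknown-species rows would then flag the ones whose location text looks most bull sharky.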
The Watery End…
So there you have it. A great way to deal with unstructured text within a structured dataset and a bunch of cool facts about sharks. If you try this deep learning for machine learning method (jump on Github and grab the ‘ElmoEmbedder’ class), let me know how it goes! Future work in this area will include investigating interpretability methods for this type of modelling (i.e. deep learning -> machine learning) so we can see what text actually contributes, further investigation of the unsupervised learning for imputation and expanding the methods to even more multi-modal data. Looking forward to sharing those with you when they’re ready :)
Thanks to these lovely folks for their great content and inspiration too: