Ape-xploring the GBIF: A Monkey’s Safari through Python and Panda’s Data Analysis Habitat

Exploring and Enhancing North American Species Data with Pandas: Data Cleansing, Formatting and Optimization

Bruno Marquié
Nerd For Tech
23 min read · Oct 18, 2023


It’s Christmas time and the whole family is gathered around the dinner table, ready to gobble up some roast duck. The zoology major daughter is chatting with her neuropsychologist mom when all of a sudden, Darwin jumps into the conversation and starts quacking on about species instead of just thinking of the duck as a tasty meal. As a poor keyboard-pushing monkey, I feel left out and fowl-ly. But there’s hope! There are plenty of rounds of discussion during this holiday season, giving me ample time to go do some research and come back with some stuffing to say (pun intended). Maybe I’ll even be able to contribute something more substantial than just a handful of cranberry sauce.

Hey there, data lovers! As we all know, the first step in any data analysis journey is to get our hands dirty and clean up that messy data. It’s not the most glamorous part of the process, but it’s super important! Not only does it give us a chance to get to know our data set a little better, but it can also lead to some seriously serendipitous insights and wild hypotheses.

In this post, we’ll take a look at a GBIF dataset and see how we can tackle the data wrangling process using some tried-and-true tools and approaches. Get ready to roll up your sleeves and dive into the data!

Discovering North American Species with GBIF and Panda

The GBIF dataset is an incredible resource for anyone interested in species and biodiversity. It contains a vast amount of information on species from all around the globe, including details on their distribution, biology, and conservation status. Right now, we’re taking a closer look at the species that can be found in North America between 2016 and 2022.

This dataset includes a wide variety of animals, from the tiniest insects to the mightiest mammals. Whether you’re a scientist, a nature enthusiast, or just curious about the world around you, this dataset has something to offer. With it, we’ll try to learn more about the species that call North America home and how they’re faring over time. Our friend Panda is going to help us explore the dataset and see what insights we can uncover.

The GBIF dataset is constantly being updated with new information; it’s always worth checking back to see what’s new. So if you’re interested in North American species or just want to explore the world of biodiversity, join me on this adventure!

Throughout this post, I will provide some partial code snippets. The full notebook is available on Google Colab and GitHub.

Obtaining access to the raw data

Without further ado, let’s start by obtaining the GBIF dataset for North America from 2016 to 2022.

Here’s how to do it:

  1. Visit the GBIF website (https://www.gbif.org/) and click on the ‘Occurrences’ tab in the top menu.
  2. In the ‘Continent’ filter, enter ‘North America’.
  3. In the ‘Year’ filter, enter ‘2016–2022’.
  4. In the ‘License’ filter, select ‘CC0 1.0’. CC0 1.0 is a public domain dedication: the person who associated the work with it has waived all rights to it, so anyone can use it for any purpose, including commercial use, without conditions.
  5. (Optionally, this URL should directly select the right dataset.)
  6. On the central page, click on the ‘Download’ tab.
  7. You’ll see a list of available download options. Select the ‘SIMPLE’ option to begin the download process.

I stored this zip file in a GitHub repository (GBIF.org (13 January 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.qq349x).

Pandas Dataframe Loading

All that’s left now is to use pandas to load it as a dataframe.

First, let’s import some libraries.

import pandas as pd
import numpy as np
import git       # GitPython, used to clone the repository that hosts the data
import os
import zipfile

Then, we can proceed with the actual loading.

# Clone the repository that hosts the zipped GBIF download
repo = git.Repo.clone_from("https://github.com/marqub/gbif-species-distribution-analysis.git", "gbif-species-distribution-analysis")
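
The clone only fetches the archive; here is a minimal sketch of the remaining loading steps, under the assumption that the zip file sits at the root of the cloned repository and contains a single tab-separated CSV (the format of GBIF’s SIMPLE download):

# Locate the zip archive in the cloned repository (assumed to be the only .zip file there),
# then read the tab-separated occurrence file it contains into a dataframe.
repo_dir = "gbif-species-distribution-analysis"
zip_name = next(f for f in os.listdir(repo_dir) if f.endswith(".zip"))
with zipfile.ZipFile(os.path.join(repo_dir, zip_name)) as zf:
    # The archive is assumed to contain a single data file
    with zf.open(zf.namelist()[0]) as f:
        df = pd.read_csv(f, sep="\t", low_memory=False)
df.info()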

First look at the data

Now we have a dataframe that looks like this:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675566 entries, 0 to 4675565
Data columns (total 50 columns):
# Column Dtype
--- ------ -----
0 gbifID int64
1 datasetKey object
2 occurrenceID object
3 kingdom object
4 phylum object
5 class object
6 order object
7 family object
8 genus object
9 species object
10 infraspecificEpithet object
11 taxonRank object
12 scientificName object
13 verbatimScientificName object
14 verbatimScientificNameAuthorship object
15 countryCode object
16 locality object
17 stateProvince object
18 occurrenceStatus object
19 individualCount object
20 publishingOrgKey object
21 decimalLatitude float64
22 decimalLongitude float64
23 coordinateUncertaintyInMeters object
24 coordinatePrecision object
25 elevation object
26 elevationAccuracy float64
27 depth float64
28 depthAccuracy float64
29 eventDate object
30 day float64
31 month float64
32 year float64
33 taxonKey float64
34 speciesKey float64
35 basisOfRecord object
36 institutionCode object
37 collectionCode object
38 catalogNumber object
39 recordNumber object
40 identifiedBy object
41 dateIdentified object
42 license object
43 rightsHolder object
44 recordedBy object
45 typeStatus object
46 establishmentMeans object
47 lastInterpreted object
48 mediaType object
49 issue object
dtypes: float64(10), int64(1), object(39)
memory usage: 1.7+ GB

The output of df.info() is useful:

  • There are 4,675,566 rows in the dataset, which can be thought of as observations. The dataset is 1.7+ GB in size, making it a substantial dataset. When working with datasets that have millions of rows, it can take a long time for a single line of code to execute, resulting in extended waiting times and less time spent on actual analysis.
  • I notice that the data type for most of the 50 properties is ‘object’, which may not be efficient or optimal for processing purposes.
  • Some rows in the dataset are missing values for certain properties, so a decision needs to be made about how to handle these incomplete rows.
  • The names of the properties are not immediately clear to me, so I need to determine if they are relevant for my purposes and what they represent.

Let’s also take a quick look at a few rows of the dataset by using df.head().

  • I see some properties that seem to describe the species, even though it’s not my area of expertise. After some digging, I learn that the categories “kingdom,” “phylum,” “class,” “order,” “family,” “genus,” and “species” are part of a hierarchical classification system used to organize and name living organisms. Together, these categories form what is known as a “taxonomic hierarchy.” The classification of a species within this hierarchy is based on its characteristics and its evolutionary relationships with other species.
  • There are a bunch of IDs that I need to look at in more detail. They may not all be useful to me.
  • I also see some inconsistencies between properties; for example, several pieces of temporal information are not in the same format or are incomplete.
  • The property issue contains a list of issues that are relevant to a particular row in the dataset. Examining this property can help me determine which elements of the dataset are valuable and which may need to be cleaned or modified.

# Split the semicolon-separated issue flags and count how often each one appears
tokens = df['issue'].str.split(';', expand=True).stack()
counts = tokens.value_counts()
print(counts)

I’ll comb through all these points, possibly finding even more hidden treasures. My goal is to get a thorough understanding of the dataset, optimizing and cleaning it as needed, and determining the best ways to put it to use in the future.

With a broad perspective on this dataset, I am particularly interested in gaining a deeper understanding of the distribution of species in a specific location. For example, focusing on the state level could be a good starting point.

  1. I am not particularly concerned with keeping track of the origin of the observations, such as the organization, people, or method used to record the observations.
  2. I want to use the observations solely to focus on the species.
  3. Additionally, I want to focus on properties that are statistically significant in order to extract and explain global trends and patterns.

Ultimately, if I plan to use these data to explain species distributions and correlate them over time and locations, the first step would be to clean and retain the right data. This will be a crucial starting point for computing basic statistics.

Cleaning and Removing Unnecessary Data with Pandas

My dataset has 50 columns, and some of them contain data that is incomplete, unreliable, duplicated, or otherwise not useful to me. I will begin by addressing these issues.

  • The properties verbatimScientificNameAuthorship and verbatimScientificName appear to be similar or derived from scientificName, so I will only keep scientificName. In addition, scientificName has a value for every row, while the other two do not. While I could use a regex to infer the missing values in the other properties from scientificName, I do not currently see a need to do so. Therefore, I will drop these two properties.
  • There are several ID fields: recordNumber, catalogNumber, taxonKey, and speciesKey. These are likely foreign keys to other datasets. I do not need them, and some of them are sometimes null. For speciesKey, I will prefer the field scientificName, which provides a more human-readable species identification. Therefore, I will drop these properties.
  • There are also several date properties: year, month, and day appear to correspond to eventDate, but they contain some null values when eventDate does not. Therefore, I will only keep eventDate. dateIdentified has many null values and may be unreliable, as indicated by the issue IDENTIFIED_DATE_UNLIKELY. However, null values in this field could simply mean that the species has not yet been identified, making it potentially useful. lastInterpreted is complete, so I will keep it. The process of documenting an observation likely progresses through observation, interpretation, and identification phases, with the latter two potentially repeated.
  • I am going to conduct some basic analysis on three fields: gbifID, datasetKey, and occurrenceID. gbifID is a unique identifier provided by GBIF for each record in the dataset, which can be used to retrieve specific records or cross-reference with other datasets. It can also be helpful for tracking the source of the data and maintaining data integrity. datasetKey is an opaque identifier that I don’t need, so I have chosen to remove it. The function of occurrenceID is not clear to me; it could potentially identify the species, as occurrenceStatus can be set to either PRESENT or ABSENT. But I already have the speciesKey property, and since occurrenceID can sometimes be null and I have multiple ways to identify a species, I will not use it.
# Iterate over the specified columns
for col in ['gbifID', 'datasetKey', 'occurrenceID']:
    # Count the number of unique values in the column
    uniq = df[col].nunique()
    # Count the number of non-null values in the column
    nonnull = df[col].count()
    # Print the results
    print(f"Column {col}: {uniq} unique values, {nonnull} non-null values")

At this point, there are already 11 columns that can be dropped:

df = df.drop(columns=[
    "verbatimScientificNameAuthorship",
    "verbatimScientificName",
    "recordNumber",
    "catalogNumber",
    "taxonKey",
    "speciesKey",
    "year",
    "month",
    "day",
    "datasetKey",
    "occurrenceID"
])
df.info()

I have managed to reduce the size of the dataset by more than 300MB out of 1.7GB, while also improving the value of the data. This represents a reduction of approximately 20%, which is a good start!

Understanding the presence and distribution of null values in our dataset is crucial as it can reveal different insights. By analyzing the proportion of null values to the total number of rows for each column, we can gain further insights.

# Some of the properties are mostly null. If I cannot extrapolate or make sense of the values, it is better to drop them.
from operator import itemgetter

# Create a list of tuples containing the column name and its null value ratio
ratios = [(col, round(df[col].isnull().sum() / df[col].shape[0], 2)) for col in df.columns]
# Sort the list by the null value ratio, highest first
ratios = sorted(ratios, key=itemgetter(1), reverse=True)
# Print the sorted list
print("\n".join(f"{col} {ratio}" for col, ratio in ratios))

I will focus on columns that have a high ratio of null values, as they may not provide valuable information if they cannot be inferred or used meaningfully.

Let’s focus on all the properties with a ratio greater than or around 50%:

  • coordinatePrecision has a null value ratio of 1.0. In cases where this value is missing, I don’t believe that the location is 100% precise. Therefore, I don’t think this property is useful at the whole dataset level and I will discard it for now.
  • typeStatus has a null value ratio of 1.0, which is too high. I will drop it.
  • depthAccuracy has a null value ratio of 0.75. I will discard it for the same reason as coordinatePrecision.
  • depth has a null value ratio of 0.75. I will keep it, as it seems logical that some species are not found “underground”.
  • infraspecificEpithet has a null value ratio of 0.98. After some investigation, I will keep it. An epithet is a word or phrase added to a name in biological classification to specify a subspecies or variety. For example, in the scientific name Canis lupus, “lupus” is the specific epithet, indicating that this is a species of “wolf”. In the case of infraspecific epithets, it refers to lower levels of classification like subspecies or variety. It is used to specify a more specific type of organism within a species.
  • establishmentMeans has a null value ratio of 0.95. Apart from NaN, there are only two potential values: "Native" and "Introduced". Given that missing values in this property may not provide a clear indication of population growth and my goal is to have a broad perspective on the dataset rather than focusing on specific subsets identified by certain properties, I have decided to discard it for now.
  • mediaType has a null value ratio of 0.72. This rate is too high, and I don’t see how I could use this property, so I will get rid of it.
  • elevationAccuracy has a null value ratio of 0.82, which is too high. I will discard it for the same reason as coordinatePrecision.
  • elevation has a null value ratio of 0.77. I will keep it, for the same reason as depth.
  • coordinateUncertaintyInMeters has a null value ratio of 0.6. I will discard it for the same reason as coordinatePrecision.
  • dateIdentified has a null value ratio of 0.57. I will keep it. I’ve explained why earlier.
  • rightsHolder has a null value ratio of 0.47. I don’t see any potential usage, so I will drop it.
  • individualCount has a null value ratio of 0.83. I will keep the property and assume that when the value is missing, it means that only one individual was observed.
  • identifiedBy has a null value ratio of 0.5. I don’t see any potential usage, so I will drop it.

At this point, I can clean up the dataset a bit more based on my intended use. I plan to perform statistical analysis on these data and extract relationships between species, as well as correlate them with their environment.
Therefore, I don’t need information like license, recordedBy, collectionCode, and institutionCode. Additionally, these last two have been flagged as untrustworthy (COLLECTION_MATCH_NONE, COLLECTION_MATCH_FUZZY, INSTITUTION_MATCH_FUZZY, AMBIGUOUS_INSTITUTION, INSTITUTION_MATCH_NONE).

As a result, my refined dataset now looks like this:

df = df.drop(columns=[
    "coordinatePrecision",
    "typeStatus",
    "depthAccuracy",
    "establishmentMeans",
    "mediaType",
    "elevationAccuracy",
    "coordinateUncertaintyInMeters",
    "rightsHolder",
    "identifiedBy",
    "license",
    "recordedBy",
    "collectionCode",
    "institutionCode"
])

I have now successfully retained only the necessary information in my dataset. As a result, the dataset size has been reduced from 235 MB to 122 MB and the number of columns has been halved.

Improving Data Clarity and Efficiency

Data Cleaning and Grouping — Renaming Columns for Clarity

I’ve looked at all the columns and now have a better idea of what they represent. I have identified three groups of properties: those related to species, observation, and the environment. To improve the clarity of the dataset, I will rename these properties to better convey their respective group or type of information.

I also change the names of the date properties to make them more uniform.

dateColumnNames = {"lastInterpreted":"lastInterpretationDate","dateIdentified":"identificationDate","eventDate":"eventObservationDate"}
df = df.rename(columns=dateColumnNames)
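
The group-prefix renaming itself is not shown above, so here is a sketch of what it might look like; the mapping below is reconstructed from the final column list shown further down, not code from the original notebook:

# Prefix each column with its group: species_, env_, or meta_
# (mapping reconstructed from the final df.info() output shown later)
groupColumnNames = {
    "kingdom": "species_kingdom", "phylum": "species_phylum", "class": "species_class",
    "order": "species_order", "family": "species_family", "genus": "species_genus",
    "species": "species_species", "infraspecificEpithet": "species_infraspecificEpithet",
    "taxonRank": "species_taxonRank", "scientificName": "species_scientificName",
    "issue": "species_issue",
    "countryCode": "env_countryCode", "locality": "env_locality", "stateProvince": "env_stateProvince",
    "decimalLatitude": "env_decimalLatitude", "decimalLongitude": "env_decimalLongitude",
    "elevation": "env_elevation", "depth": "env_depth",
    "eventObservationDate": "env_eventObservationDate",
    "occurrenceStatus": "meta_occurrenceStatus", "individualCount": "meta_individualCount",
    "publishingOrgKey": "meta_publishingOrgKey", "basisOfRecord": "meta_basisOfRecord",
    "identificationDate": "meta_identificationDate", "lastInterpretationDate": "meta_lastInterpretationDate",
}
df = df.rename(columns=groupColumnNames)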

When using Pandas to read a CSV file, the data types of the columns are inferred automatically and default to the widest types (such as int64, float64, or object), which can cause excessive memory consumption.

However, by examining the data more closely, we can optimize the memory usage and have access to data type specific operations.

Unifying Date Formats for Consistent Data Manipulation

Initially, there are 3 distinct date formats among the date properties. As meta_lastInterpretationDate is in Coordinated Universal Time (UTC), I will also convert the other two to UTC. This will enable us to manipulate these dates at the month, day, or year level, keeping in mind that we have removed the properties year, month, and day.

dateColumnNames = ["meta_lastInterpretationDate", "meta_identificationDate", "env_eventObservationDate"]
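
A minimal sketch of that conversion, assuming the values are ISO 8601 strings as provided by GBIF’s simple download (unparseable values become NaT):

# Parse each date column and normalize it to UTC; values that cannot be parsed become NaT.
for col in dateColumnNames:
    df[col] = pd.to_datetime(df[col], errors='coerce', utc=True)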

Optimizing Categorical Data with Pandas’ ‘category’ Type

The “category” data type in pandas is a data type for categorical variables. Categorical variables are variables that take on a limited, fixed set of values.

One of the main benefits is that “category” dtype uses less memory compared to “object” dtype, as it maps the values to integers under the hood. It also allows for faster computation and aggregation of data.

If the ratio is small, meaning that there are relatively few unique values compared to the number of non-null values, it may be beneficial to convert the column to the “category” dtype.

# Get the object columns in the DataFrame
object_columns = df.select_dtypes(include=['object']).columns.tolist()
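
A quick sketch of the unique-to-non-null ratio check described above:

# For each object column, compare the number of unique values to the number of non-null values.
# A small ratio means few distinct values, which makes the column a good 'category' candidate.
for col in object_columns:
    ratio = df[col].nunique() / max(df[col].count(), 1)
    print(f"{col}: {ratio:.4f}")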

Based on these ratios and the column names, all these properties are good candidates for being converted to the category data type.

df[object_columns] = df[object_columns].astype('category')

Optimizing Numeric Properties for Storage Efficiency

Let’s optimize the storage of all float64 numeric properties. I will assume for now that they are all floats and use pandas to identify the most efficient storage method.

floats = df.select_dtypes(include=['float64']).columns.tolist()
df[floats] = df[floats].apply(pd.to_numeric, downcast='float')

I have observed that the properties env_depth and env_elevation are currently in feet. In order to work in metric units, I will convert them to meters. Based on the assumption that null values indicate 0 meters, I will also rename these properties for better understanding and coherence:

print(df[['env_depth', 'env_elevation']].describe().loc[['min', 'max']])
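
A sketch of the conversion and rename, under the assumptions stated above (source values in feet, missing values treated as 0 meters):

# Convert feet to meters, treat missing values as 0, and rename the columns for clarity.
FEET_TO_METERS = 0.3048
for old, new in [("env_depth", "env_depthInMeters"), ("env_elevation", "env_elevationInMeters")]:
    df[new] = (pd.to_numeric(df[old].astype("object"), errors="coerce").fillna(0) * FEET_TO_METERS).astype("float32")
    df = df.drop(columns=[old])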

I noticed that the property meta_individualCount should be numeric and that it has many missing values. I will assume that a missing value means that only one individual was observed, and update the rows accordingly.

print(df[['meta_individualCount']].describe().loc[['min', 'max']])
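
A sketch of that update, applying the interpretation of missing values explained above:

# Convert the counts to numeric and treat a missing count as a single observed individual.
df["meta_individualCount"] = pd.to_numeric(df["meta_individualCount"].astype("object"), errors="coerce").fillna(1)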

Maximizing the Value of My Dataset

To improve the efficiency of my analysis, I reduced the number of columns from 50 to 26. This resulted in a significant decrease in memory usage, roughly a factor of four (from 1.7+ GB down to about 415 MB). Before proceeding with deeper analysis of these data, it was important to take this initial step of cleaning and preparing the data.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675566 entries, 0 to 4675565
Data columns (total 26 columns):
# Column Dtype
--- ------ -----
0 gbifID int64
1 species_kingdom category
2 species_phylum category
3 species_class category
4 species_order category
5 species_family category
6 species_genus category
7 species_species category
8 species_infraspecificEpithet category
9 species_taxonRank category
10 species_scientificName category
11 env_countryCode category
12 env_locality category
13 env_stateProvince category
14 meta_occurrenceStatus category
15 meta_individualCount float64
16 meta_publishingOrgKey category
17 env_decimalLatitude float32
18 env_decimalLongitude float32
19 env_elevationInMeters float32
20 env_depthInMeters float32
21 env_eventObservationDate datetime64[ns, UTC]
22 meta_basisOfRecord category
23 meta_identificationDate datetime64[ns, UTC]
24 meta_lastInterpretationDate datetime64[ns, UTC]
25 species_issue category
dtypes: category(17), datetime64[ns, UTC](3), float32(4), float64(1), int64(1)
memory usage: 414.6 MB

However, I don’t want to get caught up in minor details and spend too much time on unnecessary tasks, so I will stop here for now. I can always return to further data wrangling if I encounter any obstacles during my analysis.

At this point, I have:

  • A more comprehensible (to me) dataset: the meaning of missing values, the right data types, explicit names. It will help me during my analysis and when presenting compelling facts and reports.
  • A more compact dataset: I’ve gotten rid of all the unnecessary data that I don’t think will be useful to me. As a result, the memory usage is much lower, which will speed up my analyses.
  • Built up some domain knowledge: If it looks like a duck and quacks like a duck, it could be a common duck (Kingdom: Animalia, Phylum: Chordata, Class: Aves, Order: Anseriformes, Family: Anatidae, Genus: Anas, Species: A. platyrhynchos) or a wood duck (Kingdom: Animalia, Phylum: Chordata, Class: Aves, Order: Anseriformes, Family: Anatidae, Genus: Aix, Species: A. sponsa) or a roasted duck.
  • Figured out some potential statistical analysis and machine learning next steps to take.

Analysis of Species Distribution and Family Trends in the US by State

Now that I have a clean dataset and understand what it’s all about, I’m ready to start exploring it and discover some interesting insights!

What are the top 5 species kingdoms per state in the US?

To gain a better understanding of the distribution of species in a specific location, I plan to count the number of individuals observed by state and kingdom. To do this, I will sum up the values in the meta_individualCount property. I will focus on the top 5 states and display the counts using a bar chart with a logarithmic scale to account for the large differences between the top 5 kingdoms.

import matplotlib.pyplot as plt
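
Here is a sketch of that aggregation and plot. I assume the dataframe has been filtered to US observations first (a hypothetical df_us, via env_countryCode == 'US'); the column names follow the earlier renaming:

# Filter to US observations, then sum individual counts per state and kingdom.
df_us = df[df["env_countryCode"] == "US"]
counts = df_us.groupby(["env_stateProvince", "species_kingdom"], observed=True)["meta_individualCount"].sum()

# Keep the 5 states with the most observed individuals, and plot on a logarithmic scale.
top_states = counts.groupby(level="env_stateProvince").sum().nlargest(5).index
ax = counts.loc[list(top_states)].unstack().plot(kind="bar", logy=True, figsize=(10, 6))
ax.set_ylabel("Individuals observed (log scale)")
plt.tight_layout()
plt.show()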

What are the top 5 classes for the Animalia kingdom in these states?

It’s interesting to take a closer look at the animal kingdom in these 5 states across the US. I would like to compare the diversity in these states and see what the top 5 animal classes are in each one.

# create a dictionary to store the top 5 classes for each state
top_classes_per_state = {}
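
A sketch of how that dictionary might be filled, reusing the hypothetical df_us and top_states names from the previous sketch:

# For each of the top states, keep the 5 Animalia classes with the most observed individuals.
df_animalia = df_us[df_us["species_kingdom"] == "Animalia"]
for state in top_states:
    top_classes_per_state[state] = (
        df_animalia[df_animalia["env_stateProvince"] == state]
        .groupby("species_class", observed=True)["meta_individualCount"]
        .sum()
        .nlargest(5)
    )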

I decided to use a treemap to present this data. A treemap is a type of chart that uses nested rectangles to show hierarchical data. The size of each rectangle represents the quantity of data. This format makes it easy to see the relative abundance of different species classes within each state and compare the proportions between them. However, it doesn’t allow for comparison of the actual numbers between states (which a bar chart would have done). To address this, I included the number of observations for each rectangle for added context.

import seaborn as sns
import squarify
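
And a sketch of the treemap rendering, one panel per state, with the observation count appended to each label for context:

# Draw one treemap per state; rectangle size is the number of observed individuals per class.
fig, axes = plt.subplots(1, len(top_classes_per_state), figsize=(20, 4))
for ax, (state, class_counts) in zip(axes, top_classes_per_state.items()):
    labels = [f"{cls}\n{int(n):,}" for cls, n in class_counts.items()]
    squarify.plot(sizes=class_counts.values, label=labels, ax=ax,
                  color=sns.color_palette("pastel", len(class_counts)), text_kwargs={"fontsize": 8})
    ax.set_title(state)
    ax.axis("off")
plt.tight_layout()
plt.show()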

I won’t try to draw too many conclusions, but from what I can see, insects are a big part of the observations in all states, particularly in Colorado and Texas. However, in Florida, sea life comes in second when you combine Elasmobranchii and Anthozoa. In California, Aves are more common (could this be due to it being on a migration path?), and in New York you have to love frogs. This is just a high-level overview, but it suggests a lot of potential for further analysis and makes me want to dive deeper into the dataset.

What’s going on with the frogs in New York?

I am curious about the high number of frogs in New York! I am unsure if this is typical or just a result of the dataset, but since I mentioned earlier that I wanted to investigate trends, I will quickly plot the number of Amphibia observations each year in New York.

Since I already converted env_eventObservationDate to a date object, I can now extract the year and use it to group the Amphibia observations in New York by year.

# Create a new dataframe that only contains observations from New York and of Amphibia
df_ny_amphibia = df_us[(df_us['env_stateProvince'] == "New York") & (df_us['species_class'] == "Amphibia")]
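
A sketch of the year-level grouping and plot (summing individual counts per year; counting rows would be a reasonable alternative):

# Group the New York Amphibia observations by the year of the event date and plot the totals.
obs_per_year = (
    df_ny_amphibia
    .groupby(df_ny_amphibia["env_eventObservationDate"].dt.year)["meta_individualCount"]
    .sum()
)
ax = obs_per_year.plot(kind="bar", figsize=(8, 4))
ax.set_ylabel("Amphibia individuals observed")
plt.tight_layout()
plt.show()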

Something definitely looks off here, and it’s worth further investigation, but that’s a task for another time.

From Cleaning to Querying: A Data Science Adventure

As you begin analyzing your dataset, you may discover that your data is not as clean as you initially thought. This can include issues such as duplicated states, incorrect spelling, and year data that falls outside of the expected range. It’s important to keep in mind that the data analysis process is iterative and requires going back and forth between cleaning, analyzing, and interpreting the data.

Now, as I wrap up this journey, I can say that it was a lot of fun using my data science skills to explore a new subject. Along the way, I also reminded myself (and hopefully you too!) how important it is to clean our data before diving into analysis. I may not have uncovered any groundbreaking insights, but it was a great journey overall and I hope you enjoyed it too.

Not too bad for a monkey who’s still learning the ropes!

Found this article useful? Follow me on Linkedin, Hackernoon, and Medium! Please 👏 this article to share it!
