Ape-xploring the GBIF: A Monkey’s Safari through Python and Panda’s Data Analysis Habitat

Exploring and Enhancing North American Species Data with Pandas: Data Cleansing, Formatting and Optimization

Bruno Marquié
Nerd For Tech
23 min read · Oct 18, 2023


It’s Christmas time and the whole family is gathered around the dinner table, ready to gobble up some roast duck. The zoology major daughter is chatting with her neuropsychologist mom when all of a sudden, Darwin jumps into the conversation and starts quacking on about species instead of just thinking of the duck as a tasty meal. As a poor keyboard-pushing monkey, I feel left out and fowl-ly. But there’s hope! There are plenty of rounds of discussion during this holiday season, giving me ample time to go do some research and come back with some stuffing to say (pun intended). Maybe I’ll even be able to contribute something more substantial than just a handful of cranberry sauce.

Hey there, data lovers! As we all know, the first step in any data analysis journey is to get our hands dirty and clean up that messy data. It’s not the most glamorous part of the process, but it’s super important! Not only does it give us a chance to get to know our data set a little better, but it can also lead to some seriously serendipitous insights and wild hypotheses.

In this post, we’ll take a look at a GBIF dataset and see how we can tackle the data wrangling process using some tried-and-true tools and approaches. Get ready to roll up your sleeves and dive into the data!

Discovering North American Species with GBIF and Panda

The GBIF dataset is an incredible resource for anyone interested in species and biodiversity. It contains a vast amount of information on species from all around the globe, including details on their distribution, biology, and conservation status. Right now, we’re taking a closer look at the species that can be found in North America between 2016 and 2022.

This dataset includes a wide variety of animals, from the tiniest insects to the mightiest mammals. Whether you’re a scientist, a nature enthusiast, or just curious about the world around you, this dataset has something to offer. With it, we’ll try to learn more about the species that call North America home and how they’re faring over time. Our friend Panda is going to help us explore the dataset and see what insights we can uncover.

The GBIF dataset is constantly being updated with new information; it’s always worth checking back to see what’s new. So if you’re interested in North American species or just want to explore the world of biodiversity, join me on this adventure!

Throughout this post, I will provide some partial code snippets. The full notebook is available on Google Colab and GitHub.

Obtaining access to the raw data

Without further ado, let’s start by obtaining the GBIF dataset for North America from 2016 to 2022.

Here’s how to do it:

  1. Visit the GBIF website (https://www.gbif.org/) and click on the ‘Occurrences’ tab in the top menu.
  2. In the ‘Continent’ filter, enter ‘North America’.
  3. In the ‘Year’ filter, enter ‘2016–2022’.
  4. In the ‘License’ filter, select ‘CC0 1.0’. CC0 1.0 is a public domain dedication: the person who associated the work with it has waived all rights to it, so anyone can use it for any purpose, including commercial use, without conditions.
  5. (Optionally, this URL should directly select the right dataset.)
  6. On the central page, click on the ‘Download’ tab.
  7. You’ll see a list of available download options. Select the ‘SIMPLE’ option to begin the download process.

I stored this zip file in a GitHub repository (GBIF.org (13 January 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.qq349x).

Pandas Dataframe Loading

All that’s left now is to use pandas to load it as a dataframe.

First, let’s import some libraries.

import pandas as pd
import numpy as np
import git       # GitPython, used to clone the repository that hosts the data
import os
import zipfile

Then, we can proceed with the actual loading.

# Clone the repository that hosts the zipped GBIF download
repo = git.Repo.clone_from("https://github.com/marqub/gbif-species-distribution-analysis.git", "gbif-species-distribution-analysis")
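
The clone only fetches the archive; here is a minimal sketch of the remaining loading steps, under the assumption that the zip file sits at the root of the cloned repository and contains a single tab-separated CSV (the format of GBIF’s SIMPLE download):

# Locate the zip archive in the cloned repository (assumed to be the only .zip file there),
# then read the tab-separated occurrence file it contains into a dataframe.
repo_dir = "gbif-species-distribution-analysis"
zip_name = next(f for f in os.listdir(repo_dir) if f.endswith(".zip"))
with zipfile.ZipFile(os.path.join(repo_dir, zip_name)) as zf:
    # The archive is assumed to contain a single data file
    with zf.open(zf.namelist()[0]) as f:
        df = pd.read_csv(f, sep="\t", low_memory=False)
df.info()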

First look at the data

Now we have a dataframe that looks like this:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675566 entries, 0 to 4675565
Data columns (total 50 columns):
# Column Dtype
--- ------ -----
0 gbifID int64
1 datasetKey object
2 occurrenceID object
3 kingdom object
4 phylum object
5 class object
6 order object
7 family object
8 genus object
9 species object
10 infraspecificEpithet object
11 taxonRank object
12 scientificName object
13 verbatimScientificName object
14 verbatimScientificNameAuthorship object
15 countryCode object
16 locality object
17 stateProvince object
18 occurrenceStatus object
19 individualCount object
20 publishingOrgKey object
21 decimalLatitude float64
22 decimalLongitude float64
23 coordinateUncertaintyInMeters object
24 coordinatePrecision object
25 elevation object
26 elevationAccuracy float64
27 depth float64
28 depthAccuracy float64
29 eventDate object
30 day float64
31 month float64
32 year float64
33 taxonKey float64
34 speciesKey float64
35 basisOfRecord object
36 institutionCode object
37 collectionCode object
38 catalogNumber object
39 recordNumber object
40 identifiedBy object
41 dateIdentified object
42 license object
43 rightsHolder object
44 recordedBy object
45 typeStatus object
46 establishmentMeans object
47 lastInterpreted object
48 mediaType object
49 issue object
dtypes: float64(10), int64(1), object(39)
memory usage: 1.7+ GB

The output of df.info() is useful:

  • There are 4,675,566 rows in the dataset, which can be thought of as observations. The dataset is 1.7+ GB in size, making it a substantial dataset. When working with datasets that have millions of rows, it can take a long time for a single line of code to execute, resulting in extended waiting times and less time spent on actual analysis.
  • I notice that the data type for most of the 50 properties is ‘object’, which may not be efficient or optimal for processing purposes.
  • Some rows in the dataset are missing values for certain properties, so a decision needs to be made about how to handle these incomplete rows.
  • The names of the properties are not immediately clear to me, so I need to determine if they are relevant for my purposes and what they represent.

Let’s also take a quick look at a few rows of the dataset by using df.head().

  • I see some properties that seem to describe the species, even though it’s not my area of expertise. After some digging, I learn that the categories “kingdom,” “phylum,” “class,” “order,” “family,” “genus,” and “species” are part of a hierarchical classification system used to organize and name living organisms. Together, these categories form what is known as a “taxonomic hierarchy.” The classification of a species within this hierarchy is based on its characteristics and its evolutionary relationships with other species.
  • There are a bunch of IDs that I need to look at in more detail. They may not all be useful to me.
  • I also see some inconsistencies between properties; for example, several pieces of temporal information are not in the same format or are incomplete.
  • The property issue contains a list of issues that are relevant to a particular row in the dataset. Examining this property can help me determine which elements of the dataset are valuable and which may need to be cleaned or modified.

# Split the semicolon-separated issue flags and count how often each one appears
tokens = df['issue'].str.split(';', expand=True).stack()
counts = tokens.value_counts()
print(counts)

I’ll comb through all these points, possibly finding even more hidden treasures. My goal is to get a thorough understanding of the dataset, optimizing and cleaning it as needed, and determining the best ways to put it to use in the future.

With a broad perspective on this dataset, I am particularly interested in gaining a deeper understanding of the distribution of species in a specific location. For example, focusing on the state level could be a good starting point.

  1. I am not particularly concerned with keeping track of the origin of the observations, such as the organization, people, or method used to record the observations.
  2. I want to use the observations solely to focus on the species.
  3. Additionally, I want to focus on properties that are statistically significant in order to extract and explain global trends and patterns.

Ultimately, if I plan to use these data to explain species distributions and correlate them over time and locations, the first step would be to clean and retain the right data. This will be a crucial starting point for computing basic statistics.

Cleaning and Removing Unnecessary Data with Pandas

My dataset has 50 columns, and some of them contain data that is incomplete, unreliable, duplicated, or otherwise not useful to me. I will begin by addressing these issues.

  • The properties verbatimScientificNameAuthorship and verbatimScientificName appear to be similar or derived from scientificName, so I will only keep scientificName. In addition, scientificName has a value for every row, while the other two do not. While I could use a regex to infer the missing values in the other properties from scientificName, I do not currently see a need to do so. Therefore, I will drop these two properties.
  • There are several ID fields: recordNumber, catalogNumber, taxonKey, and speciesKey. These are likely foreign keys to other datasets. I do not need them, and some of them are sometimes null. For speciesKey, I will prefer the field scientificName, which provides a more human-readable species identification. Therefore, I will drop these properties.
  • There are also several date properties: year, month, and day appear to correspond to eventDate, but they contain some null values when eventDate does not. Therefore, I will only keep eventDate. dateIdentified has many null values and may be unreliable, as indicated by the issue IDENTIFIED_DATE_UNLIKELY. However, null values in this field could simply mean that the species has not yet been identified, making it potentially useful. lastInterpreted is complete, so I will keep it. The process of documenting an observation likely progresses through observation, interpretation, and identification phases, with the latter two potentially repeated.
  • I am going to conduct some basic analysis on three fields: gbifID, datasetKey, and occurrenceID. gbifID is a unique identifier provided by GBIF for each record in the dataset, which can be used to retrieve specific records or cross-reference with other datasets. It can also be helpful for tracking the source of the data and maintaining data integrity. datasetKey is an opaque identifier that I don’t need, so I have chosen to remove it. The function of occurrenceID is not clear to me; it could potentially identify the species, as occurrenceStatus can be set to either PRESENT or ABSENT. But I already have the speciesKey property, and since occurrenceID can sometimes be null and I have multiple ways to identify a species, I will not use it.
# Iterate over the specified columns
for col in ['gbifID', 'datasetKey', 'occurrenceID']:
    # Count the number of unique values in the column
    uniq = df[col].nunique()
    # Count the number of non-null values in the column
    nonnull = df[col].count()
    # Print the results
    print(f"Column {col}: {uniq} unique values, {nonnull} non-null values")

At this point, there are already 11 columns that can be dropped:

df = df.drop(columns=[
    "verbatimScientificNameAuthorship",
    "verbatimScientificName",
    "recordNumber",
    "catalogNumber",
    "taxonKey",
    "speciesKey",
    "year",
    "month",
    "day",
    "datasetKey",
    "occurrenceID"
])
df.info()

I have managed to reduce the size of the dataset by more than 300MB out of 1.7GB, while also improving the value of the data. This represents a reduction of approximately 20%, which is a good start!

Understanding the presence and distribution of null values in our dataset is crucial as it can reveal different insights. By analyzing the proportion of null values to the total number of rows for each column, we can gain further insights.

# Some of the properties are mostly null. If I cannot extrapolate or make sense of the values, it is better to drop them.
from operator import itemgetter

# Create a list of tuples containing the column name and its null value ratio
ratios = [(col, round(df[col].isnull().sum() / df[col].shape[0], 2)) for col in df.columns]
# Sort the list by the null value ratio, highest first
ratios = sorted(ratios, key=itemgetter(1), reverse=True)
# Print the sorted list
print("\n".join(f"{col} {ratio}" for col, ratio in ratios))

I will focus on columns that have a high ratio of null values, as they may not provide valuable information if they cannot be inferred or used meaningfully.

Let’s focus on all the properties with a ratio greater than or around 50%:

  • coordinatePrecision has a null value ratio of 1.0. In cases where this value is missing, I don’t believe that the location is 100% precise. Therefore, I don’t think this property is useful at the whole dataset level and I will discard it for now.
  • typeStatus has a null value ratio of 1.0, which is too high. I will drop it.
  • depthAccuracy has a null value ratio of 0.75. I will discard it for the same reason as coordinatePrecision.
  • depth has a null value ratio of 0.75. I will keep it, as it seems logical that some species are not found “underground”.
  • infraspecificEpithet has a null value ratio of 0.98. After some investigation, I will keep it. An epithet is a word or phrase added to a name in biological classification to specify a subspecies or variety. For example, in the scientific name Canis lupus, “lupus” is the specific epithet, indicating that this is a species of “wolf”. In the case of infraspecific epithets, it refers to lower levels of classification like subspecies or variety. It is used to specify a more specific type of organism within a species.
  • establishmentMeans has a null value ratio of 0.95. Apart from NaN, there are only two potential values: "Native" and "Introduced". Given that missing values in this property may not provide a clear indication of population growth and my goal is to have a broad perspective on the dataset rather than focusing on specific subsets identified by certain properties, I have decided to discard it for now.
  • mediaType has a null value ratio of 0.72. This rate is too high, and I don’t see how I could use this property, so I will get rid of it.
  • elevationAccuracy has a null value ratio of 0.82, which is too high. I will discard it for the same reason as coordinatePrecision.
  • elevation has a null value ratio of 0.77. I will keep it, for the same reason as depth.
  • coordinateUncertaintyInMeters has a null value ratio of 0.6. I will discard it for the same reason as coordinatePrecision.
  • dateIdentified has a null value ratio of 0.57. I will keep it. I’ve explained why earlier.
  • rightsHolder has a null value ratio of 0.47. I don’t see any potential usage, so I will drop it.
  • individualCount has a null value ratio of 0.83. I will keep the property and assume that when the value is missing, it means that only one individual was observed.
  • identifiedBy has a null value ratio of 0.5. I don’t see any potential usage, so I will drop it.

At this point, I can clean up the dataset a bit more based on my intended use. I plan to perform statistical analysis on these data and extract relationships between species, as well as correlate them with their environment.
Therefore, I don’t need information like license, recordedBy, collectionCode, and institutionCode. Additionally, these last two have been flagged as untrustworthy (COLLECTION_MATCH_NONE, COLLECTION_MATCH_FUZZY, INSTITUTION_MATCH_FUZZY, AMBIGUOUS_INSTITUTION, INSTITUTION_MATCH_NONE).

As a result, my refined dataset now looks like this:

df = df.drop(columns=[
    "coordinatePrecision",
    "typeStatus",
    "depthAccuracy",
    "establishmentMeans",
    "mediaType",
    "elevationAccuracy",
    "coordinateUncertaintyInMeters",
    "rightsHolder",
    "identifiedBy",
    "license",
    "recordedBy",
    "collectionCode",
    "institutionCode"
])

I have now successfully retained only the necessary information in my dataset. As a result, the dataset size has been reduced from 235 MB to 122 MB and the number of columns has been halved.

Improving Data Clarity and Efficiency

Data Cleaning and Grouping — Renaming Columns for Clarity

I’ve looked at all the columns and now have a better idea of what they represent. I have identified three groups of properties: those related to species, observation, and the environment. To improve the clarity of the dataset, I will rename these properties to better convey their respective group or type of information.

I also change the names of the date properties to make them more uniform.

dateColumnNames = {"lastInterpreted":"lastInterpretationDate","dateIdentified":"identificationDate","eventDate":"eventObservationDate"}
df = df.rename(columns=dateColumnNames)
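
The group-prefix renaming itself is not shown above, so here is a sketch of what it might look like; the mapping below is reconstructed from the final column list shown further down, not code from the original notebook:

# Prefix each column with its group: species_, env_, or meta_
# (mapping reconstructed from the final df.info() output shown later)
groupColumnNames = {
    "kingdom": "species_kingdom", "phylum": "species_phylum", "class": "species_class",
    "order": "species_order", "family": "species_family", "genus": "species_genus",
    "species": "species_species", "infraspecificEpithet": "species_infraspecificEpithet",
    "taxonRank": "species_taxonRank", "scientificName": "species_scientificName",
    "issue": "species_issue",
    "countryCode": "env_countryCode", "locality": "env_locality", "stateProvince": "env_stateProvince",
    "decimalLatitude": "env_decimalLatitude", "decimalLongitude": "env_decimalLongitude",
    "elevation": "env_elevation", "depth": "env_depth",
    "eventObservationDate": "env_eventObservationDate",
    "occurrenceStatus": "meta_occurrenceStatus", "individualCount": "meta_individualCount",
    "publishingOrgKey": "meta_publishingOrgKey", "basisOfRecord": "meta_basisOfRecord",
    "identificationDate": "meta_identificationDate", "lastInterpretationDate": "meta_lastInterpretationDate",
}
df = df.rename(columns=groupColumnNames)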

When using Pandas to read a CSV file, the data types of the columns are inferred automatically and default to the widest types (such as int64, float64, or object), which can cause excessive memory consumption.

However, by examining the data more closely, we can optimize the memory usage and have access to data type specific operations.

Unifying Date Formats for Consistent Data Manipulation

Initially, there are 3 distinct date formats among the date properties. As meta_lastInterpretationDate is in Coordinated Universal Time (UTC), I will also convert the other two to UTC. This will enable us to manipulate these dates at the month, day, or year level, keeping in mind that we have removed the properties year, month, and day.

dateColumnNames = ["meta_lastInterpretationDate", "meta_identificationDate", "env_eventObservationDate"]
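
A minimal sketch of that conversion, assuming the values are ISO 8601 strings as provided by GBIF’s simple download (unparseable values become NaT):

# Parse each date column and normalize it to UTC; values that cannot be parsed become NaT.
for col in dateColumnNames:
    df[col] = pd.to_datetime(df[col], errors='coerce', utc=True)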

Optimizing Categorical Data with Pandas’ ‘category’ Type

The “category” data type in pandas is a data type for categorical variables. Categorical variables are variables that take on a limited, fixed set of values.

One of the main benefits is that “category” dtype uses less memory compared to “object” dtype, as it maps the values to integers under the hood. It also allows for faster computation and aggregation of data.

If the ratio is small, meaning that there are relatively few unique values compared to the number of non-null values, it may be beneficial to convert the column to the “category” dtype.

# Get the object columns in the DataFrame
object_columns = df.select_dtypes(include=['object']).columns.tolist()
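
A quick sketch of the unique-to-non-null ratio check described above:

# For each object column, compare the number of unique values to the number of non-null values.
# A small ratio means few distinct values, which makes the column a good 'category' candidate.
for col in object_columns:
    ratio = df[col].nunique() / max(df[col].count(), 1)
    print(f"{col}: {ratio:.4f}")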

Based on these ratios and the column names, all these properties are good candidates for being converted to the category data type.

df[object_columns] = df[object_columns].astype('category')

Optimizing Numeric Properties for Storage Efficiency

Let’s optimize the storage of all float64 numeric properties. I will assume for now that they are all floats and use pandas to identify the most efficient storage method.

floats = df.select_dtypes(include=['float64']).columns.tolist()
df[floats] = df[floats].apply(pd.to_numeric, downcast='float')

I have observed that the properties env_depth and env_elevation are currently in feet. In order to work in metric units, I will convert them to meters. Based on the assumption that null values indicate 0 meters, I will also rename these properties for better understanding and coherence:

print(df[['env_depth', 'env_elevation']].describe().loc[['min', 'max']])
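
A sketch of the conversion and rename, under the assumptions stated above (source values in feet, missing values treated as 0 meters):

# Convert feet to meters, treat missing values as 0, and rename the columns for clarity.
FEET_TO_METERS = 0.3048
for old, new in [("env_depth", "env_depthInMeters"), ("env_elevation", "env_elevationInMeters")]:
    df[new] = (pd.to_numeric(df[old].astype("object"), errors="coerce").fillna(0) * FEET_TO_METERS).astype("float32")
    df = df.drop(columns=[old])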

I noticed that the property meta_individualCount should be numeric and that it has many missing values. I will assume that a missing value means that only one individual was observed, and update the rows accordingly.

print(df[['meta_individualCount']].describe().loc[['min', 'max']])
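
A sketch of that update, applying the interpretation of missing values explained above:

# Convert the counts to numeric and treat a missing count as a single observed individual.
df["meta_individualCount"] = pd.to_numeric(df["meta_individualCount"].astype("object"), errors="coerce").fillna(1)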

Maximizing the Value of My Dataset

To improve the efficiency of my analysis, I reduced the number of columns from 50 to 26. This resulted in a significant decrease in memory usage, roughly a factor of four (from 1.7+ GB down to about 415 MB). Before proceeding with deeper analysis of these data, it was important to take this initial step of cleaning and preparing the data.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675566 entries, 0 to 4675565
Data columns (total 26 columns):
# Column Dtype
--- ------ -----
0 gbifID int64
1 species_kingdom category
2 species_phylum category
3 species_class category
4 species_order category
5 species_family category
6 species_genus category
7 species_species category
8 species_infraspecificEpithet category
9 species_taxonRank category
10 species_scientificName category
11 env_countryCode category
12 env_locality category
13 env_stateProvince category
14 meta_occurrenceStatus category
15 meta_individualCount float64
16 meta_publishingOrgKey category
17 env_decimalLatitude float32
18 env_decimalLongitude float32
19 env_elevationInMeters float32
20 env_depthInMeters float32
21 env_eventObservationDate datetime64[ns, UTC]
22 meta_basisOfRecord category
23 meta_identificationDate datetime64[ns, UTC]
24 meta_lastInterpretationDate datetime64[ns, UTC]
25 species_issue category
dtypes: category(17), datetime64[ns, UTC](3), float32(4), float64(1), int64(1)
memory usage: 414.6 MB

However, I don’t want to get caught up in minor details and spend too much time on unnecessary tasks, so I will stop here for now. I can always return to further data wrangling if I encounter any obstacles during my analysis.

At this point, I have:

  • A more comprehensible (to me) dataset: the meaning of missing values, the right data types, explicit names. It will help me during my analysis and when presenting compelling facts and reports.
  • A more compact dataset: I’ve gotten rid of all the unnecessary data that I don’t think will be useful to me. As a result, the memory usage is much lower, which will speed up my analyses.
  • Built up some domain knowledge: If it looks like a duck and quacks like a duck, it could be a common duck (Kingdom: Animalia, Phylum: Chordata, Class: Aves, Order: Anseriformes, Family: Anatidae, Genus: Anas, Species: A. platyrhynchos) or a wood duck (Kingdom: Animalia, Phylum: Chordata, Class: Aves, Order: Anseriformes, Family: Anatidae, Genus: Aix, Species: A. sponsa) or a roasted duck.
  • Figured out some potential statistical analysis and machine learning next steps to take.

Analysis of Species Distribution and Family Trends in the US by State

Now that I have a clean dataset and understand what it’s all about, I’m ready to start exploring it and discover some interesting insights!

What are the top 5 species kingdoms per state in the US?

To gain a better understanding of the distribution of species in a specific location, I plan to count the number of individuals observed by state and kingdom. To do this, I will sum up the values in the meta_individualCount property. I will focus on the top 5 states and display the counts using a bar chart with a logarithmic scale to account for the large differences between the top 5 kingdoms.

import matplotlib.pyplot as plt
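
Here is a sketch of that aggregation and plot. I assume the dataframe has been filtered to US observations first (a hypothetical df_us, via env_countryCode == 'US'); the column names follow the earlier renaming:

# Filter to US observations, then sum individual counts per state and kingdom.
df_us = df[df["env_countryCode"] == "US"]
counts = df_us.groupby(["env_stateProvince", "species_kingdom"], observed=True)["meta_individualCount"].sum()

# Keep the 5 states with the most observed individuals, and plot on a logarithmic scale.
top_states = counts.groupby(level="env_stateProvince").sum().nlargest(5).index
ax = counts.loc[list(top_states)].unstack().plot(kind="bar", logy=True, figsize=(10, 6))
ax.set_ylabel("Individuals observed (log scale)")
plt.tight_layout()
plt.show()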

What are the top 5 classes for the Animalia kingdom in these states?

It’s interesting to take a closer look at the animal kingdom in these 5 states across the US. I would like to compare the diversity in these states and see what the top 5 animal classes are in each one.

# create a dictionary to store the top 5 classes for each state
top_classes_per_state = {}
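
A sketch of how that dictionary might be filled, reusing the hypothetical df_us and top_states names from the previous sketch:

# For each of the top states, keep the 5 Animalia classes with the most observed individuals.
df_animalia = df_us[df_us["species_kingdom"] == "Animalia"]
for state in top_states:
    top_classes_per_state[state] = (
        df_animalia[df_animalia["env_stateProvince"] == state]
        .groupby("species_class", observed=True)["meta_individualCount"]
        .sum()
        .nlargest(5)
    )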

I decided to use a treemap to present this data. A treemap is a type of chart that uses nested rectangles to show hierarchical data. The size of each rectangle represents the quantity of data. This format makes it easy to see the relative abundance of different species classes within each state and compare the proportions between them. However, it doesn’t allow for comparison of the actual numbers between states (which a bar chart would have done). To address this, I included the number of observations for each rectangle for added context.

import seaborn as sns
import squarify
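
And a sketch of the treemap rendering, one panel per state, with the observation count appended to each label for context:

# Draw one treemap per state; rectangle size is the number of observed individuals per class.
fig, axes = plt.subplots(1, len(top_classes_per_state), figsize=(20, 4))
for ax, (state, class_counts) in zip(axes, top_classes_per_state.items()):
    labels = [f"{cls}\n{int(n):,}" for cls, n in class_counts.items()]
    squarify.plot(sizes=class_counts.values, label=labels, ax=ax,
                  color=sns.color_palette("pastel", len(class_counts)), text_kwargs={"fontsize": 8})
    ax.set_title(state)
    ax.axis("off")
plt.tight_layout()
plt.show()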

I won’t try to draw too many conclusions, but from what I can see, insects are a big part of the observations in all states, particularly in Colorado and Texas. However, in Florida, sea life comes in second when you combine Elasmobranchii and Anthozoa. In California, Aves are more common (could this be due to it being on a migration path?), and in New York you have to love frogs. This is just a high-level overview, but it suggests a lot of potential for further analysis and makes me want to dive deeper into the dataset.

What’s going on with the frogs in New York?

I am curious about the high number of frogs in New York! I am unsure if this is typical or just a result of the dataset, but since I mentioned earlier that I wanted to investigate trends, I will quickly plot the number of Amphibia observations each year in New York.

Since I already converted env_eventObservationDate to a date object, I can now extract the year and use it to group the Amphibia observations in New York by year.

# Create a new dataframe that only contains observations from New York and of Amphibia
df_ny_amphibia = df_us[(df_us['env_stateProvince'] == "New York") & (df_us['species_class'] == "Amphibia")]
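
A sketch of the year-level grouping and plot (summing individual counts per year; counting rows would be a reasonable alternative):

# Group the New York Amphibia observations by the year of the event date and plot the totals.
obs_per_year = (
    df_ny_amphibia
    .groupby(df_ny_amphibia["env_eventObservationDate"].dt.year)["meta_individualCount"]
    .sum()
)
ax = obs_per_year.plot(kind="bar", figsize=(8, 4))
ax.set_ylabel("Amphibia individuals observed")
plt.tight_layout()
plt.show()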

Something definitely looks off here, and it’s worth further investigation, but that’s a task for another time.

From Cleaning to Querying: A Data Science Adventure

As you begin analyzing your dataset, you may discover that your data is not as clean as you initially thought. This can include issues such as duplicated states, incorrect spelling, and year data that falls outside of the expected range. It’s important to keep in mind that the data analysis process is iterative and requires going back and forth between cleaning, analyzing, and interpreting the data.

Now, as I wrap up this journey, I can say that it was a lot of fun using my data science skills to explore a new subject. Along the way, I also reminded myself (and hopefully you too!) how important it is to clean our data before diving into analysis. I may not have uncovered any groundbreaking insights, but it was a great journey overall and I hope you enjoyed it too.

Not too bad for a monkey who’s still learning the ropes!

Found this article useful? Follow me on Linkedin, Hackernoon, and Medium! Please 👏 this article to share it!
