Ape-xploring the GBIF: A Monkey’s Safari through Python and Panda’s Data Analysis Habitat
Exploring and Enhancing North American Species Data with Pandas: Data Cleansing, Formatting and Optimization
It’s Christmas time and the whole family is gathered around the dinner table, ready to gobble up some roast duck. The zoology major daughter is chatting with her neuropsychologist mom when all of a sudden, Darwin jumps into the conversation and starts quacking on about species instead of just thinking of the duck as a tasty meal. As a poor keyboard-pushing monkey, I feel left out and fowl-ly. But there’s hope! There are plenty of rounds of discussion during this holiday season, giving me ample time to go do some research and come back with some stuffing to say (pun intended). Maybe I’ll even be able to contribute something more substantial than just a handful of cranberry sauce.
Table of Contents
— Discovering North American Species with GBIF and Panda
— — Obtaining access to the raw data
— — Pandas Dataframe Loading
— — First look at the data
— Cleaning and Removing Unnecessary Data with Pandas
— Improving Data Clarity and Efficiency
— — Data Cleaning and Grouping — Renaming Columns for Clarity
— — Unifying Date Formats for Consistent Data Manipulation
— — Optimizing Categorical Data with Pandas’ ‘category’ Type
— — Optimizing Numeric Properties for Storage Efficiency
— Maximizing the Value of My Dataset
— Analysis of Species Distribution and Family Trends in the US by State
— — What are the top 5 species kingdoms per state in the US?
— — What are the top 5 classes for the Animalia kingdom in these states?
— — What’s going on with the frogs in New York?
Hey there, data lovers! As we all know, the first step in any data analysis journey is to get our hands dirty and clean up that messy data. It’s not the most glamorous part of the process, but it’s super important! Not only does it give us a chance to get to know our data set a little better, but it can also lead to some seriously serendipitous insights and wild hypotheses.
In this post, we’ll take a look at a GBIF dataset and see how we can tackle the data wrangling process using some tried-and-true tools and approaches. Get ready to roll up your sleeves and dive into the data!
Discovering North American Species with GBIF and Panda
The GBIF dataset is an incredible resource for anyone interested in species and biodiversity. It contains a vast amount of information on species from all around the globe, including details on their distribution, biology, and conservation status. Right now, we’re taking a closer look at the species that can be found in North America between 2016 and 2022.
This dataset includes a wide variety of animals, from the tiniest insects to the mightiest mammals. Whether you’re a scientist, a nature enthusiast, or just curious about the world around you, this dataset has something to offer. With it, we’ll try to learn more about the species that call North America home and how they’re faring over time. Our friend Panda is going to help us explore the dataset and see what insights we can uncover.
The GBIF dataset is constantly being updated with new information; it’s always worth checking back to see what’s new. So if you’re interested in North American species or just want to explore the world of biodiversity, join me on this adventure!
Through this post, I will provide some partial code snippets. The full notebook is available on Google Colab and GitHub.
Obtaining access to the raw data
Without further ado, let’s start by obtaining the GBIF dataset for North America from 2016 to 2022.
Here’s how to do it:
1. Visit the GBIF website (https://www.gbif.org/) and click on the ‘Occurrences’ tab in the top menu.
2. In the ‘Continent’ filter, enter ‘North America’.
3. In the ‘Year’ filter, enter ‘2016–2022’.
4. In the ‘License’ filter, select ‘CC0 1.0’. CC0 1.0 is a public domain dedication: the person who associated the work with it has waived all rights, so anyone can use it for any purpose, including commercially, without conditions.
5. (Optionally, this URL should directly select the right dataset.)
6. On the central page, click on the ‘Download’ tab.
7. You’ll see a list of available download options. Select the ‘SIMPLE’ option to begin the download process.
I stored this zip file in a GitHub repository (GBIF.org (13 January 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.qq349x).
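If you prefer to stay in Python, the same filters can be previewed against GBIF’s public occurrence search API before committing to a full download. Below is a minimal sketch using requests; the endpoint is the standard GBIF v1 search API, and the parameter values are my attempt to mirror the filters above, so treat them as assumptions rather than a tested recipe:
import requests

# Preview how many occurrence records match the filters used above
# (North America, 2016-2022, CC0 1.0 license) without downloading anything.
params = {
    "continent": "NORTH_AMERICA",
    "year": "2016,2022",
    "license": "CC0_1_0",
    "limit": 0,  # we only need the total count, not the records themselves
}
resp = requests.get("https://api.gbif.org/v1/occurrence/search", params=params)
resp.raise_for_status()
print("Matching occurrences:", resp.json()["count"])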
Pandas Dataframe Loading
All that’s left now is to use Panda to load it as a dataframe.
First, let’s import some libraries.
import pandas as pd
import numpy as np
import git
import os
import zipfile
Then, we can proceed with the actual loading.
# Load GBIF dataset
repo = git.Repo.clone_from("https://github.com/marqub/gbif-species-distribution-analysis.git", "gbif-species-distribution-analysis")
os.chdir("gbif-species-distribution-analysis/data")
with zipfile.ZipFile("gbif_data_2016to2022_northamerica.csv.zip", "r") as zip_ref:
    zip_ref.extractall(".")

df = pd.read_csv("0233944-220831081235567.csv", sep='\t', on_bad_lines='skip')
First look at the data
Now we have a dataframe that looks like this:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675566 entries, 0 to 4675565
Data columns (total 50 columns):
# Column Dtype
--- ------ -----
0 gbifID int64
1 datasetKey object
2 occurrenceID object
3 kingdom object
4 phylum object
5 class object
6 order object
7 family object
8 genus object
9 species object
10 infraspecificEpithet object
11 taxonRank object
12 scientificName object
13 verbatimScientificName object
14 verbatimScientificNameAuthorship object
15 countryCode object
16 locality object
17 stateProvince object
18 occurrenceStatus object
19 individualCount object
20 publishingOrgKey object
21 decimalLatitude float64
22 decimalLongitude float64
23 coordinateUncertaintyInMeters object
24 coordinatePrecision object
25 elevation object
26 elevationAccuracy float64
27 depth float64
28 depthAccuracy float64
29 eventDate object
30 day float64
31 month float64
32 year float64
33 taxonKey float64
34 speciesKey float64
35 basisOfRecord object
36 institutionCode object
37 collectionCode object
38 catalogNumber object
39 recordNumber object
40 identifiedBy object
41 dateIdentified object
42 license object
43 rightsHolder object
44 recordedBy object
45 typeStatus object
46 establishmentMeans object
47 lastInterpreted object
48 mediaType object
49 issue object
dtypes: float64(10), int64(1), object(39)
memory usage: 1.7+ GB
The output of df.info() is useful:
- There are 4,675,566 rows in the dataset, which can be thought of as observations. The dataset is 1.7+ GB in size, making it a substantial dataset. When working with datasets that have millions of rows, it can take a long time for a single line of code to execute, resulting in extended waiting times and less time spent on actual analysis.
- I notice that the data type for most of the 50 properties is ‘Object’, which may not be efficient or optimal for processing purposes.
- Some rows in the dataset are missing values for certain properties, so a decision needs to be made about how to handle these incomplete rows.
- The names of the properties are not immediately clear to me, so I need to determine if they are relevant for my purposes and what they represent.
Let’s also take a quick look at a few rows of the dataset by using df.head().
- I see some properties that seem to describe the species, even though it’s not my area of expertise. After some digging, I learn that the categories “kingdom,” “phylum,” “class,” “order,” “family,” “genus,” and “species” are part of a hierarchical classification system used to organize and name living organisms. Together, these categories form what is known as a “taxonomic hierarchy.” The classification of a species within this hierarchy is based on its characteristics and its evolutionary relationships with other species.
- There are a bunch of IDs that I need to look at in more detail. They may not all be useful to me.
- I also see some inconsistencies between properties, for example, we have several pieces of temporal information that are not in the same format or are incomplete.
- The property issue contains a list of issues that are relevant to a particular row in the dataset. Examining this property can help me determine which elements of the dataset are valuable and which may need to be cleaned or modified.
# Split the semicolon-separated issue flags into individual tokens and count how often each appears
tokens = df['issue'].str.split(';', expand=True).stack()
counts = tokens.value_counts()
print(counts)
------------------------------------------------------------
CONTINENT_DERIVED_FROM_COORDINATES 4341570
GEODETIC_DATUM_ASSUMED_WGS84 1976773
COORDINATE_ROUNDED 1603999
GEODETIC_DATUM_INVALID 1134822
REFERENCES_URI_INVALID 587791
INSTITUTION_MATCH_FUZZY 326699
COLLECTION_MATCH_FUZZY 266620
DIFFERENT_OWNER_INSTITUTION 192224
OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT 184592
OCCURRENCE_STATUS_UNPARSABLE 171696
RECORDED_DATE_INVALID 155747
TAXON_MATCH_HIGHERRANK 149980
INSTITUTION_MATCH_NONE 96314
TAXON_MATCH_NONE 72634
INSTITUTION_COLLECTION_MISMATCH 63484
AMBIGUOUS_COLLECTION 50552
COUNTRY_DERIVED_FROM_COORDINATES 35926
COLLECTION_MATCH_NONE 34953
CONTINENT_INVALID 22258
AMBIGUOUS_INSTITUTION 21256
COORDINATE_PRECISION_INVALID 17124
COORDINATE_UNCERTAINTY_METERS_INVALID 13329
CONTINENT_DERIVED_FROM_COUNTRY 12184
TAXON_MATCH_FUZZY 11491
PRESUMED_NEGATED_LONGITUDE 7693
COORDINATE_REPROJECTED 5881
MODIFIED_DATE_UNLIKELY 4851
TYPE_STATUS_INVALID 4432
INDIVIDUAL_COUNT_CONFLICTS_WITH_OCCURRENCE_STATUS 2553
RECORDED_DATE_UNLIKELY 1215
COORDINATE_INVALID 1147
COUNTRY_COORDINATE_MISMATCH 1031
CONTINENT_COUNTRY_MISMATCH 487
CONTINENT_COORDINATE_MISMATCH 418
MULTIMEDIA_DATE_INVALID 343
INDIVIDUAL_COUNT_INVALID 292
OCCURRENCE_STATUS_INFERRED_FROM_BASIS_OF_RECORD 216
COORDINATE_OUT_OF_RANGE 168
ELEVATION_MIN_MAX_SWAPPED 65
DEPTH_NON_NUMERIC 49
COUNTRY_MISMATCH 48
DEPTH_MIN_MAX_SWAPPED 28
MULTIMEDIA_URI_INVALID 25
FOOTPRINT_WKT_INVALID 18
IDENTIFIED_DATE_UNLIKELY 16
PRESUMED_NEGATED_LATITUDE 14
COUNTRY_INVALID 12
BASIS_OF_RECORD_INVALID 3
PRESUMED_SWAPPED_COORDINATE 1
ELEVATION_NON_NUMERIC 1
I’ll comb through all these points, possibly finding even more hidden treasures. My goal is to get a thorough understanding of the dataset, optimizing and cleaning it as needed, and determining the best ways to put it to use in the future.
With a broad perspective on this dataset, I am particularly interested in gaining a deeper understanding of the distribution of species in a specific location. For example, focusing on the state level could be a good starting point.
- I am not particularly concerned with keeping track of the origin of the observations, such as the organization, people, or method used to record the observations.
- I want to use the observations to learn about the species themselves.
- Additionally, I want to focus on properties with enough coverage to support meaningful aggregate statistics, so that I can extract and explain broad trends and patterns.
Ultimately, if I plan to use these data to explain species distributions and correlate them over time and locations, the first step would be to clean and retain the right data. This will be a crucial starting point for computing basic statistics.
Cleaning and Removing Unnecessary Data with Pandas
My dataset has 50 columns, and some of them contain data that is incomplete, unreliable, duplicated, or otherwise not useful to me. I will begin by addressing these issues.
- The properties verbatimScientificNameAuthorship and verbatimScientificName appear to be similar to, or derived from, scientificName, so I will only keep scientificName. In addition, scientificName has a value for every row, while the other two do not. While I could use a regex to infer the missing values in the other properties from scientificName, I do not currently see a need to do so. Therefore, I will drop these two properties.
- There are several ID fields: recordNumber, catalogNumber, taxonKey, and speciesKey. These are likely foreign keys to other datasets. I do not need them, and some of them are sometimes null. Instead of speciesKey, I will prefer the field scientificName, which provides a more human-readable species identification. Therefore, I will drop these properties.
- There are also several date properties: year, month, and day appear to correspond to eventDate, but they contain some null values where eventDate does not. Therefore, I will only keep eventDate. dateIdentified has many null values and may be unreliable, as indicated by the issue IDENTIFIED_DATE_UNLIKELY. However, null values in this field could simply mean that the species has not yet been identified, making it potentially useful. lastInterpreted is complete, so I will keep it. The process of documenting an observation likely progresses through observation, interpretation, and identification phases, with the latter two potentially repeated.
- I am going to conduct some basic analysis on three fields: gbifID, datasetKey, and occurrenceID. gbifID is a unique identifier provided by GBIF for each record in the dataset, which can be used to retrieve specific records or cross-reference with other datasets. It can also be helpful for tracking the source of the data and maintaining data integrity. datasetKey is an encrypted ID and I have chosen to remove it. The function of occurrenceID is not clear to me; it could potentially identify the species, as occurrenceStatus can be set to either PRESENT or ABSENT. But I already have other ways to identify a species (such as scientificName), and since occurrenceID can sometimes be null, I will not use it.
# Iterate over the specified columns
for col in ['gbifID', 'datasetKey', 'occurrenceID']:
    # Count the number of unique values in the column
    uniq = df[col].nunique()
    # Count the number of non-null values in the column
    nonnull = df[col].count()
    # Print the results
    print(f"Column {col}: {uniq} unique values, {nonnull} non-null values")
------------------------------------------------------------
Column gbifID: 4675566 unique values, 4675566 non-null values
Column datasetKey: 933 unique values, 4675566 non-null values
Column occurrenceID: 4675124 unique values, 4675561 non-null values
At this point, there are already 11 columns that can be dropped:
df = df.drop(columns=[
"verbatimScientificNameAuthorship",
"verbatimScientificName",
"recordNumber",
"catalogNumber",
"taxonKey",
"speciesKey",
"year",
"month",
"day",
"datasetKey",
"occurrenceID"
])
df.info()
------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675566 entries, 0 to 4675565
Data columns (total 39 columns):
# Column Dtype
--- ------ -----
0 gbifID int64
1 kingdom object
2 phylum object
3 class object
4 order object
5 family object
6 genus object
7 species object
8 infraspecificEpithet object
9 taxonRank object
10 scientificName object
11 countryCode object
12 locality object
13 stateProvince object
14 occurrenceStatus object
15 individualCount object
16 publishingOrgKey object
17 decimalLatitude float64
18 decimalLongitude float64
19 coordinateUncertaintyInMeters object
20 coordinatePrecision object
21 elevation object
22 elevationAccuracy float64
23 depth float64
24 depthAccuracy float64
25 eventDate object
26 basisOfRecord object
27 institutionCode object
28 collectionCode object
29 identifiedBy object
30 dateIdentified object
31 license object
32 rightsHolder object
33 recordedBy object
34 typeStatus object
35 establishmentMeans object
36 lastInterpreted object
37 mediaType object
38 issue object
dtypes: float64(5), int64(1), object(33)
memory usage: 1.4+ GB
I have managed to reduce the size of the dataset by more than 300MB out of 1.7GB, while also improving the value of the data. This represents a reduction of approximately 20%, which is a good start!
Understanding the presence and distribution of null values in our dataset is crucial as it can reveal different insights. By analyzing the proportion of null values to the total number of rows for each column, we can gain further insights.
# Some of the properties have a high proportion of null values; if I cannot
# extrapolate or make sense of the missing values, it is better to drop them.
# Create a list of tuples containing the column name and its null value ratio
ratios = [(col, round(df[col].isnull().sum() / df[col].shape[0], 2)) for col in df.columns]

# Sort the list by the null value ratio
from operator import itemgetter
ratios = sorted(ratios, key=itemgetter(1), reverse=True)

# Print the sorted list
print("\n".join(f"{col} {ratio}" for col, ratio in ratios))
------------------------------------------------------------
coordinatePrecision 1.0
typeStatus 1.0
infraspecificEpithet 0.98
establishmentMeans 0.95
individualCount 0.83
elevationAccuracy 0.82
elevation 0.77
depth 0.75
depthAccuracy 0.75
mediaType 0.72
coordinateUncertaintyInMeters 0.6
dateIdentified 0.57
institutionCode 0.54
recordedBy 0.52
collectionCode 0.51
identifiedBy 0.5
rightsHolder 0.45
stateProvince 0.44
locality 0.43
species 0.23
class 0.2
genus 0.15
family 0.06
order 0.05
phylum 0.03
decimalLatitude 0.01
decimalLongitude 0.01
issue 0.01
gbifID 0.0
kingdom 0.0
taxonRank 0.0
scientificName 0.0
countryCode 0.0
occurrenceStatus 0.0
publishingOrgKey 0.0
eventDate 0.0
basisOfRecord 0.0
license 0.0
lastInterpreted 0.0
I will focus on columns that have a high ratio of null values, as they may not provide valuable information if they cannot be inferred or used meaningfully.
Let’s focus on all the properties with a ratio greater than or around 50%:
- coordinatePrecision has a null value ratio of 1.0. In cases where this value is missing, I don’t believe that the location is 100% precise. Therefore, I don’t think this property is useful at the whole-dataset level, and I will discard it for now.
- typeStatus has a null value ratio of 1.0, which is too high. I will drop it.
- depthAccuracy has a null value ratio of 0.75. I will discard it for the same reason as coordinatePrecision.
- depth has a null value ratio of 0.75. I will keep it, as it seems logical that some species are not found “underground”.
- infraspecificEpithet has a null value ratio of 0.98. After some investigation, I will keep it. An epithet is a word or phrase added to a name in biological classification to specify a subspecies or variety. For example, in the scientific name Canis lupus, “lupus” is the specific epithet, indicating that this is a species of wolf. An infraspecific epithet refers to lower levels of classification, like subspecies or variety, and is used to specify a more specific type of organism within a species.
- establishmentMeans has a null value ratio of 0.95. Apart from NaN, there are only two potential values: “Native” and “Introduced”. Given that missing values in this property may not provide a clear indication of population growth, and my goal is to have a broad perspective on the dataset rather than focusing on specific subsets identified by certain properties, I have decided to discard it for now.
- mediaType has a null value ratio of 0.72. This rate is too high, and I don’t see how I could use this property, so I will get rid of it.
- elevationAccuracy has a null value ratio of 0.82, which is too high. I will discard it for the same reason as coordinatePrecision.
- elevation has a null value ratio of 0.77. I will keep it, for the same reason as depth.
- coordinateUncertaintyInMeters has a null value ratio of 0.6. I will discard it for the same reason as coordinatePrecision.
- dateIdentified has a null value ratio of 0.57. I will keep it; I’ve explained why earlier.
- rightsHolder has a null value ratio of 0.45. I don’t see any potential usage, so I will drop it.
- individualCount has a null value ratio of 0.83. I will keep this property and assume that when the value is missing, only one individual was observed.
- identifiedBy has a null value ratio of 0.5. I don’t see any potential usage, so I will drop it.
At this point, I can clean up the dataset a bit more based on my intended use. I plan to perform statistical analysis on these data and extract relationships between species, as well as correlate them with their environment.
Therefore, I don’t need information like license, recordedBy, collectionCode, and institutionCode. Additionally, these last two have been flagged as untrustworthy (COLLECTION_MATCH_NONE, COLLECTION_MATCH_FUZZY, INSTITUTION_MATCH_FUZZY, AMBIGUOUS_INSTITUTION, INSTITUTION_MATCH_NONE).
As a result, my refined dataset now looks like this:
df = df.drop(columns=["coordinatePrecision",
"typeStatus",
"depthAccuracy",
"establishmentMeans",
"mediaType",
"elevationAccuracy",
"coordinateUncertaintyInMeters",
"rightsHolder",
"identifiedBy",
"license",
"recordedBy",
"collectionCode",
"institutionCode"])
------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675566 entries, 0 to 4675565
Data columns (total 26 columns):
# Column Dtype
--- ------ -----
0 gbifID int64
1 kingdom object
2 phylum object
3 class object
4 order object
5 family object
6 genus object
7 species object
8 infraspecificEpithet object
9 taxonRank object
10 scientificName object
11 countryCode object
12 locality object
13 stateProvince object
14 occurrenceStatus object
15 individualCount object
16 publishingOrgKey object
17 decimalLatitude float64
18 decimalLongitude float64
19 elevation object
20 depth float64
21 eventDate object
22 basisOfRecord object
23 dateIdentified object
24 lastInterpreted object
25 issue object
dtypes: float64(3), int64(1), object(22)
memory usage: 927.5+ MB
I have now successfully retained only the necessary information in my dataset. As a result, the in-memory footprint reported by df.info() has dropped from 1.4+ GB to roughly 930 MB, and the number of columns has been nearly halved, from 50 to 26.
Improving Data Clarity and Efficiency
Data Cleaning and Grouping — Renaming Columns for Clarity
I’ve looked at all the columns and now have a better idea of what they represent. I have identified three groups of properties: those related to species, observation, and the environment. To improve the clarity of the dataset, I will rename these properties to better convey their respective group or type of information.
I also change the names of the date properties to make them more uniform.
dateColumnNames = {"lastInterpreted": "lastInterpretationDate", "dateIdentified": "identificationDate", "eventDate": "eventObservationDate"}
df = df.rename(columns=dateColumnNames)

env_columns = ["countryCode", "locality", "stateProvince", "decimalLatitude", "decimalLongitude", "elevation", "depth", "eventObservationDate"]
meta_columns = ["occurrenceStatus", "individualCount", "basisOfRecord", "typeStatus", "establishmentMeans", "lastInterpretationDate", "publishingOrgKey", "identificationDate"]
species_columns = ["kingdom", "phylum", "class", "order", "family", "genus", "species", "infraspecificEpithet", "taxonRank", "scientificName", "issue"]

# Prefix each column with its group (typeStatus and establishmentMeans were already dropped, so rename simply ignores them)
df = df.rename(columns={col: "env_" + col for col in env_columns})
df = df.rename(columns={col: "meta_" + col for col in meta_columns})
df = df.rename(columns={col: "species_" + col for col in species_columns})

df.info()
------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675566 entries, 0 to 4675565
Data columns (total 26 columns):
# Column Dtype
--- ------ -----
0 gbifID int64
1 species_kingdom object
2 species_phylum object
3 species_class object
4 species_order object
5 species_family object
6 species_genus object
7 species_species object
8 species_infraspecificEpithet object
9 species_taxonRank object
10 species_scientificName object
11 env_countryCode object
12 env_locality object
13 env_stateProvince object
14 meta_occurrenceStatus object
15 meta_individualCount object
16 meta_publishingOrgKey object
17 env_decimalLatitude float64
18 env_decimalLongitude float64
19 env_elevation object
20 env_depth float64
21 env_eventObservationDate object
22 meta_basisOfRecord object
23 meta_identificationDate object
24 meta_lastInterpretationDate object
25 species_issue object
dtypes: float64(3), int64(1), object(22)
memory usage: 927.5+ MB
When Pandas reads a CSV file, it infers the data types of the columns and defaults to the widest ones (such as int64, float64, or object for strings), which can cause excessive memory consumption. By examining the data more closely, we can optimize memory usage and gain access to data-type-specific operations.
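As a practical aside, once the right types are known (which is what the next sections work out), they can be handed to read_csv up front, so a future reload doesn’t start from the 64-bit defaults again. A minimal sketch; the file name is hypothetical and the dtype map is only an excerpt:
# Hypothetical reload of a cleaned export with explicit dtypes, so pandas does
# not default everything to int64/float64/object. "gbif_cleaned.csv" is an
# illustrative name, not a file produced in this post.
dtype_map = {
    "species_kingdom": "category",
    "species_class": "category",
    "env_stateProvince": "category",
    "env_decimalLatitude": "float32",
    "env_decimalLongitude": "float32",
}
df_reloaded = pd.read_csv("gbif_cleaned.csv", sep="\t", dtype=dtype_map,
                          parse_dates=["env_eventObservationDate"])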
Unifying Date Formats for Consistent Data Manipulation
Initially, there are three distinct date formats across the date properties. As meta_lastInterpretationDate is in Coordinated Universal Time (UTC), I will also convert the other two to UTC. This will enable us to manipulate these dates at the month, day, or year level, keeping in mind that we have removed the properties year, month, and day.
dateColumnNames = ["meta_lastInterpretationDate", "meta_identificationDate", "env_eventObservationDate"]

# Iterate over the date columns
for col in dateColumnNames:
    # Convert the data in the column to datetime format
    df[col] = pd.to_datetime(
        df[col],
        # Infer the datetime format
        infer_datetime_format=True,
        # Set the timezone to UTC
        utc=True,
        # Handle errors by replacing invalid data with NaT
        errors="coerce"
    )
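As a quick illustration of that point, the dropped year, month, and day columns are easy to recover on the fly through the .dt accessor once the column has a proper datetime dtype (a minimal sketch; the yearly count is just an example aggregation):
# With a real datetime dtype, year/month/day are available on demand,
# for example the number of observations recorded per year:
obs_per_year = df.groupby(df['env_eventObservationDate'].dt.year).size()
print(obs_per_year)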
Optimizing Categorical Data with Pandas’ ‘category’ Type
The “category” data type in pandas is a data type for categorical variables. Categorical variables are variables that take on a limited, fixed set of values.
One of the main benefits is that “category” dtype uses less memory compared to “object” dtype, as it maps the values to integers under the hood. It also allows for faster computation and aggregation of data.
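To make that concrete, here is a quick before-and-after check on a single column (illustrative only; the exact figures depend on your copy of the dataset):
# Compare the memory footprint of one low-cardinality column stored as 'object'
# versus 'category' (deep=True accounts for the underlying string storage).
before = df['species_kingdom'].memory_usage(deep=True)
after = df['species_kingdom'].astype('category').memory_usage(deep=True)
print(f"object: {before / 1e6:.1f} MB -> category: {after / 1e6:.1f} MB")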
If the ratio is small, meaning that there are relatively few unique values compared to the number of non-null values, it may be beneficial to convert the column to the “category” dtype.
# Get the object columns in the DataFrame
object_columns = df.select_dtypes(include=['object']).columns.tolist()

# Iterate over the object columns
for col in object_columns:
    # Calculate the ratio of unique values to rows in the column
    ratio = df[col].nunique() / df[col].count()
    # Print the ratio
    print(f"Ratio of unique values to rows for column '{col}': {ratio:.2f}")
------------------------------------------------------------
Ratio of unique values to rows for column 'species_kingdom': 0.00
Ratio of unique values to rows for column 'species_phylum': 0.00
Ratio of unique values to rows for column 'species_class': 0.00
Ratio of unique values to rows for column 'species_order': 0.00
Ratio of unique values to rows for column 'species_family': 0.00
Ratio of unique values to rows for column 'species_genus': 0.01
Ratio of unique values to rows for column 'species_species': 0.02
Ratio of unique values to rows for column 'species_infraspecificEpithet': 0.06
Ratio of unique values to rows for column 'species_taxonRank': 0.00
Ratio of unique values to rows for column 'species_scientificName': 0.02
Ratio of unique values to rows for column 'env_countryCode': 0.00
Ratio of unique values to rows for column 'env_locality': 0.04
Ratio of unique values to rows for column 'env_stateProvince': 0.00
Ratio of unique values to rows for column 'meta_occurrenceStatus': 0.00
Ratio of unique values to rows for column 'meta_individualCount': 0.00
Ratio of unique values to rows for column 'meta_publishingOrgKey': 0.00
Ratio of unique values to rows for column 'env_elevation': 0.01
Ratio of unique values to rows for column 'meta_basisOfRecord': 0.00
Ratio of unique values to rows for column 'species_issue': 0.00
Based on these ratios and the column names, all these properties are good candidates for being converted to the category data type.
df[object_columns] = df[object_columns].astype('category')
Optimizing Numeric Properties for Storage Efficiency
Let’s optimize the storage of all float64 numeric properties. I will assume for now that they are all floats and use pandas to identify the most efficient storage method.
floats = df.select_dtypes(include=['float64']).columns.tolist()
df[floats] = df[floats].apply(pd.to_numeric, downcast='float')
df[floats].info()
# elevation should be a float too but there is some textual values
df['env_elevation'] = pd.to_numeric(df['env_elevation'], downcast='float', errors='coerce')
------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675566 entries, 0 to 4675565
Data columns (total 3 columns):
# Column Dtype
--- ------ -----
0 env_decimalLatitude float32
1 env_decimalLongitude float32
2 env_depth float32
dtypes: float32(3)
memory usage: 53.5 MB
I have observed that the properties env_depth and env_elevation are currently in feet. In order to work with metric units, I will convert them to meters. Based on the assumption that null values indicate 0 meters, I will also rename these properties for better understanding and coherence:
print(df[['env_depth', 'env_elevation']].describe().loc[['min', 'max']])
------------------------------------------------------------
     env_depth  env_elevation
min       0.00        -330.00
max    2067.00       16917.50
------------------------------------------------------------
df = df.rename(columns={"env_depth": "env_depthInMeters", "env_elevation": "env_elevationInMeters"})
df[['env_depthInMeters', 'env_elevationInMeters']] = (df[['env_depthInMeters', 'env_elevationInMeters']] * 0.3048).fillna(0).round(1).astype("float32")
I noticed that the property meta_individualCount should be numeric and that it has many missing values. I will assume that a missing value means only one individual was observed, and update the rows accordingly.
print(df[['meta_individualCount']].describe().loc[['min', 'max']])
------------------------------------------------------------
     meta_individualCount
min                  0.00
max              49000.00
------------------------------------------------------------
# Replace NaN values in 'meta_individualCount' column with 1
df['meta_individualCount'].fillna(1, inplace=True)
# Convert the 'meta_individualCount' column to integer type and handle errors
df['meta_individualCount'] = pd.to_numeric(df['meta_individualCount'], downcast='integer', errors='coerce')
Maximizing the Value of My Dataset
To improve the efficiency of my analysis, I reduced the number of columns from 50 to 26. Together with the data type optimizations, this cut memory usage by roughly a factor of four. Before proceeding with deeper analysis of these data, it was important to take this initial step of cleaning and preparing them.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675566 entries, 0 to 4675565
Data columns (total 26 columns):
# Column Dtype
--- ------ -----
0 gbifID int64
1 species_kingdom category
2 species_phylum category
3 species_class category
4 species_order category
5 species_family category
6 species_genus category
7 species_species category
8 species_infraspecificEpithet category
9 species_taxonRank category
10 species_scientificName category
11 env_countryCode category
12 env_locality category
13 env_stateProvince category
14 meta_occurrenceStatus category
15 meta_individualCount float64
16 meta_publishingOrgKey category
17 env_decimalLatitude float32
18 env_decimalLongitude float32
19 env_elevationInMeters float32
20 env_depthInMeters float32
21 env_eventObservationDate datetime64[ns, UTC]
22 meta_basisOfRecord category
23 meta_identificationDate datetime64[ns, UTC]
24 meta_lastInterpretationDate datetime64[ns, UTC]
25 species_issue category
dtypes: category(17), datetime64[ns, UTC](3), float32(4), float64(1), int64(1)
memory usage: 414.6 MB
However, I don’t want to get caught up in minor details and spend too much time on unnecessary tasks, so I will stop here for now. I can always return to further data wrangling if I encounter any obstacles during my analysis (and, as sketched after the list below, I can save the cleaned dataframe so I don’t have to redo this work).
At this point, I have:
- A more comprehensive (to me) dataset: meaning of missing values, right data types, explicit names. It will help me during my analysis or to present compelling facts and reports.
- A more compact dataset: I’ve gotten rid of all the unnecessary data that I don’t think will be useful to me. As a result, the memory usage is much lower, which will speed up my analyses.
- Built up some domain knowledge: If it looks like a duck and quacks like a duck, it could be a common duck (Kingdom: Animalia, Phylum: Chordata, Class: Aves, Order: Anseriformes, Family: Anatidae, Genus: Anas, Species: A. platyrhynchos) or a wood duck (Kingdom: Animalia, Phylum: Chordata, Class: Aves, Order: Anseriformes, Family: Anatidae, Genus: Aix, Species: A. sponsa) or a roasted duck.
- Figured out some potential statistical analysis and machine learning next steps to take.
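One practical note before moving on: since I may come back for more wrangling later, it is worth persisting the cleaned dataframe in a format that keeps the optimized dtypes (a plain CSV would lose the category and datetime types). A minimal sketch, assuming pyarrow is installed; the file name is my own choice:
# Save the cleaned, typed dataframe; Parquet preserves category/datetime/float32
# dtypes, so a later pd.read_parquet() picks up exactly where we left off.
df.to_parquet("gbif_cleaned.parquet")
# df = pd.read_parquet("gbif_cleaned.parquet")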
Analysis of Species Distribution and Family Trends in the US by State
Now that I have a clean dataset and understand what it’s all about, I’m ready to start exploring it and discover some interesting insights!
What are the top 5 species kingdoms per state in the US?
To gain a better understanding of the distribution of species in a specific location, I plan to count the number of individuals observed by state and kingdom. To do this, I will sum up the values in the meta_individualCount
property. I will focus on the top 5 states and display the counts using a bar chart with a logarithmic scale to account for the large differences between the top 5 kingdoms.
import matplotlib.pyplot as plt

# Create a new dataframe with only observations from the US
df_us = df[df['env_countryCode'] == "US"]

# Group the data by state and get the top 5 states with the most observations
top_states = df_us.groupby(["env_stateProvince"]).size().sort_values(ascending=False).head(5)

# Create a dictionary to store the top 5 kingdoms for each state
top_species_per_state = {}

# Iterate through the top states and get the top 5 kingdoms (by summed individual counts) for each state
for state in top_states.index:
    top_species_per_state[state] = df_us[df_us['env_stateProvince'] == state].groupby("species_kingdom").meta_individualCount.sum().nlargest(5).sort_values(ascending=False)

# Print the top 5 kingdoms for each state
for state, species in top_species_per_state.items():
    print(f"{state} : {species}")

# Convert the dictionary to a dataframe and transpose it so that states become rows
df_top_species = pd.DataFrame.from_dict(top_species_per_state)
df_top_species_transposed = df_top_species.transpose()

# Create a bar plot of the data with a logarithmic y-axis
ax = df_top_species_transposed.plot.bar()
ax.set_yscale('log')
plt.xlabel('State')
plt.ylabel('Count')
plt.title('Top 5 species_kingdom count per state (logarithmic scale)')
plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))
plt.show()
What are the top 5 classes for the Animalia kingdom in these states?
It’s interesting to take a closer look at the animal kingdom in these 5 states across the US. I would like to compare the diversity in these states and see what the top 5 animal classes are in each one.
# Create a dictionary to store the top 5 classes for each state
top_classes_per_state = {}

for state in top_states.index:
    # Create a new dataframe for the current state
    df_state = df_us[df_us['env_stateProvince'] == state]
    # Group by kingdom and class, and sum the individual counts
    classes_count = df_state.groupby(["species_kingdom", "species_class"])["meta_individualCount"].sum()
    # For the Animalia kingdom, get the top 5 classes by count
    animalia_classes = classes_count.loc["Animalia"].nlargest(5)
    # Add the top classes for the current state to the dictionary
    top_classes_per_state[state] = {"Animalia": animalia_classes}

# Print the top classes for each state
for state, classes in top_classes_per_state.items():
    print(state + ":")
    print("Animals:", classes["Animalia"])
I decided to use a treemap to present this data. A treemap is a type of chart that uses nested rectangles to show hierarchical data. The size of each rectangle represents the quantity of data. This format makes it easy to see the relative abundance of different species classes within each state and compare the proportions between them. However, it doesn’t allow for comparison of the actual numbers between states (which a bar chart would have done). To address this, I included the number of observations for each rectangle for added context.
import seaborn as sns
import squarify

def assign_colors(species_class):
    """
    Assign a color to a species class if it hasn't been assigned one yet.
    """
    if species_class not in class_colors:
        class_colors[species_class] = color_pool[0]
        color_pool.remove(class_colors[species_class])
    return class_colors[species_class]

def plot_treemap(ax, state, species_class, species_class_counts):
    """
    Plot a treemap of the species classes and their counts for a given state.
    """
    colors = [assign_colors(sc) for sc in species_class]
    labels = [f"{sc}\n({scc})" for sc, scc in zip(species_class, species_class_counts)]
    squarify.plot(sizes=species_class_counts, label=labels, color=colors, ax=ax)
    ax.set_title(state)

# Create a figure with 3 rows and 2 columns of subplots
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(20, 20))

# Flatten the list of axes
axes_list = [item for sublist in axes for item in sublist]

# Create a color palette with 30 colors
color_pool = sns.color_palette("pastel", 30)

# Initialize an empty dictionary to store the assigned colors for each species class
class_colors = {}

# Iterate over the top states
for i, state in enumerate(top_states.index):
    # Get the species classes and their counts for the current state
    species_classes = top_classes_per_state[state]["Animalia"].index
    species_class_counts = top_classes_per_state[state]["Animalia"].values
    plot_treemap(axes_list[i], state, species_classes, species_class_counts)

# Show the plot
plt.show()
I won’t try to draw too many conclusions, but from what I can see, insects are a big part of the observations in all states, particularly in Colorado and Texas. However, in Florida, sea life comes in second when you combine Elasmobranchii and Anthozoa. In California, Aves are more common (could this be due to it being a migration path?), and in New York you have to love frogs. This is just a high-level overview, but it raises a lot of potential for further analysis and makes me want to dive deeper into the dataset.
What’s going on with the frogs in New York?
I am curious about the high number of frogs in New York! I am unsure if this is typical or just a result of the dataset, but since I mentioned earlier that I wanted to investigate trends, I will quickly plot the number of Amphibia observations each year in New York.
Since I already converted env_eventObservationDate to a date object, I can now extract the year and use it to group the number of Amphibia observations in New York by year.
# Create a new dataframe that only contains Amphibia observations from New York
df_ny_amphibia = df_us[(df_us['env_stateProvince'] == "New York") & (df_us['species_class'] == "Amphibia")]

# Group the data by year and sum the number of individuals observed each year
yearly_counts = df_ny_amphibia.groupby(df_ny_amphibia['env_eventObservationDate'].dt.year)['meta_individualCount'].sum()

# Create a line plot of the yearly counts
plt.plot(yearly_counts.index, yearly_counts.values)

# Add axis labels and a title
plt.xlabel('Year')
plt.ylabel('Number of Amphibia observations')
plt.title('Yearly trend of Amphibia observations in New York')

# Display the plot
plt.show()
Something looks definitely off and it’s worth further investigation, but that’s a task for another time.
From Cleaning to Querying: A Data Science Adventure
As you begin analyzing your dataset, you may discover that your data is not as clean as you initially thought. This can include issues such as duplicated states, incorrect spelling, and year data that falls outside of the expected range. It’s important to keep in mind that the data analysis process is iterative and requires going back and forth between cleaning, analyzing, and interpreting the data.
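For instance, here are two quick checks along those lines (a sketch; the 2016-2022 bounds simply mirror the window used for the download):
# 1) Spot potentially duplicated or misspelled state names by scanning the counts
print(df_us['env_stateProvince'].value_counts().head(20))
# 2) Spot observation years outside the expected 2016-2022 window
years = df['env_eventObservationDate'].dt.year
print(years[(years < 2016) | (years > 2022)].value_counts())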
Now, as I wrap up this journey, I can say that it was a lot of fun using my data science skills to explore a new subject. Along the way, I also reminded myself (and hopefully you too!) how important it is to clean our data before diving into analysis. I may not have uncovered any groundbreaking insights, but it was a great journey overall and I hope you enjoyed it too.
Not too bad for a monkey who’s still learning the ropes!
Found this article useful? Follow me on LinkedIn, Hackernoon, and Medium! Please 👏 this article to share it!