Biodiversity in national parks: project walkthrough

Nov 27, 2021

This is a summary of my EDA process and results for a Codecademy portfolio project. I admit to not doing a very extensive project scoping in advance. I just went where the data took me and had some fun with it.

Two data sets were provided as csv files for the project: one named “observations.csv” and the other named “species.csv”. I started with the one named observations.

I did all the usual preliminary investigations, using the .head(), .info(), and .describe() methods to get a feeling for the data set.
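A minimal sketch of that first look, assuming pandas. The small inline table stands in for observations.csv so the snippet runs on its own; the column names (scientific_name, park_name, observations) and all values are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for observations.csv (column names and values are assumed)
csv_text = """scientific_name,park_name,observations
Canis lupus,Yellowstone National Park,330
Canis lupus,Bryce National Park,130
Vicia benghalensis,Yosemite National Park,148
"""
observations = pd.read_csv(io.StringIO(csv_text))

print(observations.head())      # first rows of the table
observations.info()             # column dtypes and non-null counts
print(observations.describe())  # summary statistics of the numeric column

# Explicit check for missing values: all zeros means nothing is missing
missing_per_column = observations.isna().sum()
print(missing_per_column)
```

With the real file, the StringIO wrapper would simply be replaced by the path to the CSV.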

The number of entries is relatively large, but the data is pretty straightforward. There are only 3 columns: 2 containing strings (species and location) and one containing integers, the number of observations of a species at a given location. There seemed to be no missing data, which should make analyzing the data easier.

Figure 1: histogram of the number of observations per species and park

The overall distribution of the number of observations is bimodal. I wanted to plot a histogram of the distribution per park, but I got error messages when I tried to pivot the data, which meant there were inconsistencies in the entries.

After dropping duplicate entries, there were still inconsistencies left. There were roughly 5820 entries for each of the four parks, but only 5541 unique ones. Since .drop_duplicates() did not remove them, they must be duplicate species entries that differ from each other in the number of observations.
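One way to tell the two kinds of duplicates apart can be sketched as follows, assuming pandas. The column names and the counts are made up for the example: drop_duplicates() only removes rows that match in every column, while duplicated() with a subset flags rows that share species and park but disagree on the count.

```python
import pandas as pd

# Hypothetical rows: one exact duplicate (Vicia) and one conflicting triplicate (Canis lupus)
obs = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus", "Canis lupus",
                        "Vicia benghalensis", "Vicia benghalensis", "Abies bifolia"],
    "park_name": ["Bryce"] * 6,
    "observations": [27, 130, 74, 104, 104, 109],
})

# Removes only rows that are identical in every column
deduped = obs.drop_duplicates()

# Flags every row whose (species, park) pair occurs more than once
key = ["scientific_name", "park_name"]
conflicts = deduped[deduped.duplicated(subset=key, keep=False)]
print(conflicts)
```

Here keep=False marks all members of a conflicting group, not just the later ones, which makes it easy to inspect them side by side.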

OK, so there were not only duplicate but also triplicate entries of the same species for all the parks. I wanted to look at an example.

For the wolf, three different entries, each with a different number of observations, were made for each of the parks. The differences in the observation counts between entries were quite large, so just keeping one of the entries or averaging them did not seem like appropriate strategies. I decided to completely remove the species with multiple entries from the data set. The analysis will not be complete, but it will not contain inaccuracies.

There are 1082 indices with multiple entries in the dataframe. Removing them altogether means losing between 2164 and 3246 rows (duplicate and triplicate entries of the same species) from an initial total of 23296. That is somewhere between 9% and 14% of the data, but I proceeded with the plan. To be able to go back on the decision further along the road without reloading the data, I saved the purged data in a new dataframe, imaginatively named df1.
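The purge itself could look something like the sketch below, assuming pandas. Column names and values are hypothetical, and this version drops a conflicted species from every park, which is one possible reading of the decision described above.

```python
import pandas as pd

# Hypothetical data: Canis lupus has conflicting counts in Bryce
obs = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus", "Canis lupus",
                        "Abies bifolia", "Abies bifolia"],
    "park_name": ["Bryce", "Bryce", "Yosemite", "Bryce", "Yosemite"],
    "observations": [27, 130, 35, 109, 72],
})

key = ["scientific_name", "park_name"]
# Species that have more than one entry for at least one park
is_conflict = obs.duplicated(subset=key, keep=False)
conflicted_species = obs.loc[is_conflict, "scientific_name"].unique()

# Keep the original frame untouched and store the purged copy separately
df1 = obs[~obs["scientific_name"].isin(conflicted_species)].copy()
print(len(obs), "rows before,", len(df1), "rows after")
```

Working on a copy means the original frame is still available if the decision needs to be revisited.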

Figure 2: histogram of the distribution of the number of observations per species grouped by park

After cleaning the data I could finally plot the histograms by park. Yellowstone National Park has by far the highest number of observations. A higher number of observations does not necessarily mean higher diversity; it can also mean a larger park area. Biodiversity relates more to the number of species present in each park, and from the counts plotted, there seems to be no major difference among the parks in the number of species present. Viewing the histogram by park, it was obvious that the second peak in Figure 1 represented Yellowstone, and that the wider, taller peak represented the overlapping data for the other three parks.
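A minimal sketch of the reshaping step, assuming pandas and hypothetical column names and counts. DataFrame.pivot raises a ValueError when an index/column pair appears more than once, so it only works once each species has a single entry per park.

```python
import pandas as pd

# Hypothetical cleaned data: exactly one row per species and park
df1 = pd.DataFrame({
    "scientific_name": ["Abies bifolia", "Abies bifolia",
                        "Vicia benghalensis", "Vicia benghalensis"],
    "park_name": ["Bryce", "Yellowstone", "Bryce", "Yellowstone"],
    "observations": [109, 215, 104, 247],
})

# One row per species, one column per park
by_park = df1.pivot(index="scientific_name", columns="park_name", values="observations")

park_totals = by_park.sum()               # total observations per park
species_per_park = by_park.notna().sum()  # number of species recorded per park
print(park_totals)
print(species_per_park)

# With matplotlib installed, by_park.plot.hist(alpha=0.5, bins=30)
# would overlay one histogram per park.
```

Comparing species_per_park rather than park_totals is what separates "more species" from "more sightings".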

Figure 3: histogram of the overall distribution of the species for all parks

The distribution of the total number of observations per species is roughly normal and quite narrow, with a long left tail made up of a small number of species with very low counts. I guess those should make up the endangered species.

The new dataframe contains some missing values, which means that not all species are present in all parks. I am not American, but I googled the parks, and they seem to be located in different parts of the country and in different environments, so I am actually wondering why there is so much overlap between the four. I would have expected the parks to share fewer species. Yellowstone has the highest overall observation numbers, but I was curious whether there is any difference in the distribution of the species between the parks.

Not unexpectedly, most species also have their maximum frequency in Yellowstone National Park, but at least one is most frequently found in Bryce National Park, three in Yosemite and six in the Great Smoky Mountains National Park.
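Finding the park where each species peaks can be sketched with idxmax on the pivoted table, assuming pandas. The species names, park labels and counts are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: rows are species, columns are parks
by_park = pd.DataFrame(
    {"Bryce": [109.0, 250.0, np.nan], "Yellowstone": [215.0, 120.0, 240.0]},
    index=["Abies bifolia", "Species A", "Species B"],
)

# Park with the highest count for each species (missing values are skipped)
top_park = by_park.idxmax(axis=1)

# How many species peak in each park
top_park_counts = top_park.value_counts()
print(top_park_counts)
```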

Next I wanted to examine those missing entries in the dataframe, which to me indicated that some species have unique habitats and are not found in all of the parks.

One species is unique to Bryce National Park, six to the Great Smoky Mountains and three to Yosemite, so these are also the species with their observation maxima in these parks. There are also four species which are unique to Yellowstone. I would expect species that are present in only one park to be very localized, and therefore to be at a greater risk of extinction than others with a wider geographical spread. Next up was opening the species file and checking my theory…
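Species recorded in a single park show up as rows with exactly one non-missing value in the pivoted table. A minimal sketch, assuming pandas and placeholder names and counts:

```python
import numpy as np
import pandas as pd

# Hypothetical wide table with missing values where a species was not recorded
by_park = pd.DataFrame(
    {
        "Bryce": [110.0, 95.0, np.nan, np.nan],
        "Yosemite": [140.0, np.nan, 88.0, 61.0],
        "Yellowstone": [250.0, np.nan, np.nan, np.nan],
    },
    index=["Species A", "Species B", "Species C", "Species D"],
)

# Number of parks in which each species was recorded
parks_present = by_park.notna().sum(axis=1)

# Species found in exactly one park, and which park that is
single_park = by_park[parks_present == 1]
unique_per_park = single_park.idxmax(axis=1).value_counts()
print(unique_per_park)
```

Because each of these rows has only one value, idxmax simply returns the one park where the species was seen.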

Oopsie, choices of the past that come home to roost... or something. The number of entries in the species file was of course larger than the number of entries I kept from the previous data set, so I needed to be careful merging the two.

Another issue is that the conservation status is known or specified for only 191 entries, which is less than 5%. Not much information to go on here.

I decided to do an inner merge, in order to drop the data pertaining to the species with conflicting observation counts from this analysis step as well. By doing so, the number of species for which the conservation status is known was further reduced to 168.
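The effect of an inner merge can be sketched like this, assuming pandas. The column names (category, scientific_name, conservation_status, total_observations) and all values are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical per-species totals from the cleaned observations (the wolf was purged)
totals = pd.DataFrame({
    "scientific_name": ["Abies bifolia", "Vicia benghalensis"],
    "total_observations": [520, 580],
})

# Hypothetical species table, still containing the purged species
species = pd.DataFrame({
    "category": ["Vascular Plant", "Vascular Plant", "Mammal"],
    "scientific_name": ["Abies bifolia", "Vicia benghalensis", "Canis lupus"],
    "conservation_status": [np.nan, "Species of Concern", "Endangered"],
})

# An inner merge keeps only the species present in both tables
merged = species.merge(totals, on="scientific_name", how="inner")

# Number and share of species with a known conservation status
known_status = int(merged["conservation_status"].notna().sum())
share_known = known_status / len(merged)
print(merged)
print(known_status, share_known)
```

If the species table repeats a scientific name, the merge repeats the matching totals as well, so it is worth checking the row count afterwards.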

The predominance of plants is not unexpected, since they are at the base of any biological pyramid, but I would have expected more fish, amphibian and reptile species.

There are 14 species listed as endangered, all of them with total counts below 200. I played with a pie chart to show the distribution according to category. I am always insecure about the use of color in charts, since I know a lot of people with some form and degree of color blindness, so I experimented with textures. I have to say that when more than two or three different textures are used, the chart is not necessarily very easy to read at a glance either. I would avoid this kind of representation for a presentation, because it would divert the attention of the audience. I guess it is OK, though, for a technical paper or a report targeted at a group with an attention span longer than 30 s. This excludes the entire Gen Z population and management types.

Figure 4: pie chart of the endangered species by category

The pie chart indicates that most endangered species are mammals, and fewer than 10% of them (just one) are plants.

Figure 5: boxplot of the total number of observations per species as function of their conservation status

According to the boxplot, endangered species seem to be species with an observation count below 200, threatened species have a total count below 300, and species of concern seem to have counts below 600, with one outlier with a higher count. The outlier is Bazzania nudicaulis, which according to Google is some sort of moss-like plant with a very limited geographical distribution in the Appalachians. Funnily enough, the dataset lists it as being present in all parks, which means there might be other errors in the original data, beyond what was previously discussed.
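The ranges read off the boxplot can also be checked numerically with a grouped summary. A minimal sketch, assuming pandas; the column names, status labels and counts are placeholders, not values from the project data.

```python
import pandas as pd

# Hypothetical merged table with a total count per species
merged = pd.DataFrame({
    "scientific_name": ["Species A", "Species B", "Species C", "Species D", "Species E"],
    "conservation_status": ["Endangered", "Endangered", "Threatened",
                            "Species of Concern", "Species of Concern"],
    "total_observations": [120, 180, 260, 480, 590],
})

# Count, minimum, median and maximum of the totals for each status
status_summary = (
    merged.groupby("conservation_status")["total_observations"]
    .agg(["count", "min", "median", "max"])
)
print(status_summary)
```

The max column gives the upper limit per status directly, and any species sitting far above the rest of its group stands out as a candidate outlier.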

I do not know how the conservation status of a species is decided. It is a bit oversimplified to assume that the same numeric limit applies to all species categories. There might be other contributing factors as well, but the data only includes the observation numbers, so the analysis can only be based on the information at hand.

Figure 6: total number of observations per species for the different categories

The total number of observations per species seems to be quite consistent among categories. The IQRs are very similar and consistent with the range of the overall distribution, and as seen in Figure 3, there are a few outliers representing species with low counts, especially in the case of vascular plants. However, there is only one vascular plant species labeled as endangered in the data set.

Based on the assumption that a count of 200 observations or less is the threshold for considering a species endangered, I created an extended endangered species list.
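Such an extended list could be built along these lines, assuming pandas. The 200 cut-off comes from the text above; the column names, status labels and counts are hypothetical.

```python
import numpy as np
import pandas as pd

THRESHOLD = 200  # cut-off assumed from the boxplot reading

# Hypothetical merged table; most species have no recorded status
merged = pd.DataFrame({
    "scientific_name": ["Species A", "Species B", "Species C", "Species D"],
    "conservation_status": ["Endangered", np.nan, np.nan, "Species of Concern"],
    "total_observations": [150, 90, 560, 480],
})

# Officially listed as endangered (a missing status compares as False)
listed = merged["conservation_status"].eq("Endangered")
# Total count at or below the threshold
low_count = merged["total_observations"].le(THRESHOLD)

merged["extended_endangered"] = listed | low_count
extended_list = merged.loc[merged["extended_endangered"], "scientific_name"].tolist()
print(extended_list)
```

Combining the two conditions with "or" keeps every officially listed species on the list even if its count happens to be above the threshold.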

This brought up the total number of endangered species to 24, including all species with unique habitats I had previously encountered, which are all plants. Eleven vascular plant species would be considered endangered out of a total of over 4069 by this criterion, which is only a tiny fraction. At 3.5%, the percentage of endangered mammals is 10 times higher.
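The per-category percentages can be sketched as a grouped ratio, assuming pandas. The table below is a toy example with made-up flags, so its percentages are not the project's results.

```python
import pandas as pd

# Hypothetical flags: True marks a species on the extended endangered list
flagged = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Vascular Plant",
                 "Vascular Plant", "Vascular Plant", "Vascular Plant"],
    "extended_endangered": [True, False, True, False, False, False],
})

# Flagged species and total species per category
share = flagged.groupby("category")["extended_endangered"].agg(["sum", "count"])
share["percent"] = 100 * share["sum"] / share["count"]
print(share)
```

Looking at the share within each category, rather than raw counts, is what makes a small group like mammals comparable with a very large one like vascular plants.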

Figure 7: distribution of the endangered species revisited

Conclusion: the proportion of endangered species is very low, but the overall number of observations for any given species is not high to begin with, so climate change and the related natural disasters seen in the past few years can change the conservation status fairly quickly. If a count of 600 or less qualifies a species for concern, then almost two thirds of those listed in the dataset would qualify. Action would be necessary to improve the health and resilience of these protected ecosystems.

Written by Valentina