Fish Food Analysis: Part 2

10 min read · Mar 29, 2022

If you haven’t already, please check out Part 1 of this story; it explains a lot about the project. For those joining from Part 1, thanks for continuing on this journey!

Let’s jump right in! Now that we have our four sets of data cleaned and wrangled, we are ready to start exploring. This is known as Exploratory Data Analysis, or EDA for short. If we are being honest, this is one of my favorite parts of the Data Science pipeline. So much is gained from this initial investigation, sometimes enough to warrant entirely new studies!

1.0 Growth Data EDA

Since we’ve already wrangled the data and checked some very basic statistics using summary(), I want to jump in with some visualizations to see how our two feed groups’ distributions looked.

My visualization tool of choice in R is ggplot2. I use libraries that expand on ggplot2 as well. For example, ggridges creates amazing ridgeline density graphs (showcased in Part 1).

Density Distributions of all age categories grouped by feed type.

We can see Mass is right-skewed for both groups, which I expected given my experience with these fish. There are generally one or two dominant, large females per group. I noticed that the Mysis group has a higher density of fish under 2000 mg, with a steeper, yet longer, tail. This could show that the majority of fish in the tanks are smaller, with a few larger outlier fish. Both groups have roughly the same density peak; the Gemma group has more fish in the 2000–4500 mg range.

The Length shows a bimodal distribution for both Gemma and Mysis groups. Gemma had two roughly equal peaks, while Mysis had a higher peak around 40 mm. Both groups had peaks at 30 and 40 mm. This could indicate a quick stage of growth for the animal. Something we will look into during the age breakdown.

When we break down distribution by age category, there are two initial takeaways. The first is that the Gemma group has a gradual slope for mass and length after the peak. This starts at six months and continues through twelve months. The Mysis group does not have a gradual slope after the mode. Instead, in both mass and length, we see a sharp negative slope followed by a small bump. This gap in the data could further support our idea of one to two dominant females within each Mysis group.

The second takeaway is the lack of a normal distribution. While not shown here, this was confirmed through Q-Q plots. We may need to change tests to their nonparametric counterparts or transform the dataset later.

What I like most about EDA is that the more you explore, the more the data reveals its story to you. You can dive deeper and deeper and find no end to the insights you can uncover.

Earlier when we looked at the density distribution, we noted a bimodal distribution for length, with two peaks at 30 and 40 mm. When we look at five and six months here, we can see a sizable jump in both mass and length, indicating a large growth spurt during that time!

Next, we see a large difference in outliers: Mysis has more outliers for both the length and mass variables. This, again, helps answer our questions about the long tails and small secondary bumps in the distributions.

The story is beginning to unfold, and it looks like the Mysis feed produces one to two significantly larger fish in each group, unlike the Gemma feed.

Additionally, we can see the interquartile range (IQR) for Gemma is larger than for the Mysis group. This indicates that the central 50% of the data is more widely spread in the Gemma group than in the Mysis group.

Finally, it doesn’t look like we are seeing any significant difference in either mass or length based on the median and IQRs for each Age Category. We can confirm this later with some significance testing. But so far, it looks like our first measure may be a wash.

When we drill down the Feed Groups into their individual samples (tank locations), we actually get a very consistent picture within each group. As I thought, each Mysis sample contains one to three outliers, while the Gemma samples do not.

While we haven’t seen any strong evidence supporting a difference in growth, I plotted a Mass vs. Length Relationship to see what else I could learn. I mean, that’s a huge part of this study, right?

When we look at the data points split into Feed Groups, I don’t see any difference in their relationship. In fact, the data appears to be a pretty close fit for both groups.

I do notice the three Mysis points way out at the top of the graph. Those are our Godzilla fish! That being said, I am curious what the outliers look like on a plot like this. Will they be some of the points that seem to float above and below the expected mass for their length? Or will they follow the general trend, just higher up for their age?

Before I dive into the outliers, I took a look at the same plot, but rather than splitting it by Group, I split it by Group and Age. We can see that some of the five to six-month fish have dispersed farther up the chart. While most of these belong to the Mysis category, a fair few of these larger fish are Gemma points too.

1.1 Outlier Detection for Growth Data

I ended up becoming engrossed in the outlier data for some time. It’s not too often I find myself with such a robust data set that includes so many outliers. I wanted to understand as much as I could about them. Additionally, I wanted to understand as much as I could about outlier detection.

One bit of code I find super useful is deriving boxplot statistics using dplyr and boxplot.stats(). With the following snippet, I can create a tibble of all boxplot statistics, separated by their Age Categories.

You can also use boxplot.stats to list the boxplot outliers in a tibble! It's amazing for future analyses with outliers.
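The original snippets are not reproduced in this text, so here is a minimal sketch of the idea rather than the exact code from the study. The data frame `growth` and its columns `age_cat` and `mass_mg` are made-up stand-ins with simulated values; only `boxplot.stats()` and the dplyr verbs are real.

```r
library(dplyr)

# Hypothetical stand-in for the growth data (names and values are assumptions)
set.seed(42)
growth <- tibble(
  age_cat = rep(c("5 months", "6 months"), each = 20),
  mass_mg = c(rnorm(19, 1500, 200), 6000, rnorm(19, 2500, 300), 9000)
)

# boxplot.stats()$stats returns, in order: lower whisker, lower hinge,
# median, upper hinge, upper whisker. One row per age category.
box_stats <- growth %>%
  group_by(age_cat) %>%
  summarise(
    lower_whisker = boxplot.stats(mass_mg)$stats[1],
    lower_hinge   = boxplot.stats(mass_mg)$stats[2],
    med           = boxplot.stats(mass_mg)$stats[3],
    upper_hinge   = boxplot.stats(mass_mg)$stats[4],
    upper_whisker = boxplot.stats(mass_mg)$stats[5],
    n_outliers    = length(boxplot.stats(mass_mg)$out),
    .groups = "drop"
  )

# Keep only the rows that boxplot.stats() flags as outliers within their group
box_outliers <- growth %>%
  group_by(age_cat) %>%
  filter(mass_mg %in% boxplot.stats(mass_mg)$out) %>%
  ungroup()
```

Because `filter()` is evaluated per group, each fish is compared against the whiskers of its own age category rather than the whole dataset.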

I looked at outliers using four different methods: Boxplot, Percentile, Rosner, and Cook’s Distance. One thing to note about Rosner’s method: it assumes the data are approximately normally distributed once the outliers are removed. Based on our previous EDA, this may not be the best detection method for us.

This is where I went a little crazy, to be honest. I’m fascinated by the different methods of outlier detection. I was curious how each one operates and which outliers would be signaled by which method. Additionally, I was curious if the outliers for length were also outliers for mass, or if they were completely independent.

After calculating the outliers for mass and length, I created a summary table for each group that showed values detected by one or more methods. I then combined the counts of length and mass outliers for each group (Gemma and Mysis). This gave me a potential range from zero to eight.

If a value was detected by all four methods for both mass and length, it would receive an 8. If it was detected by one method for each feature, it would get a 2. This way I could see all the outliers, and which values were considered the most extreme of them all.
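To make the scoring concrete, here is a rough sketch under several assumptions of my own: the `fish` data is simulated, the percentile cutoffs (2.5% and 97.5%) and the Cook's distance cutoff (4/n on an intercept-only model) are common conventions rather than the thresholds used in the study, and Rosner's test is left out because it needs an extra package. With three methods the combined score runs from 0 to 6 instead of 0 to 8.

```r
library(dplyr)

# Simulated stand-in data with two deliberately extreme fish (assumption)
set.seed(1)
fish <- tibble(
  length_mm = c(rnorm(48, 40, 4), 58, 60),
  mass_mg   = c(rnorm(48, 2000, 300), 7000, 8000)
)

# Boxplot rule: values beyond the whiskers
flag_boxplot <- function(x) x %in% boxplot.stats(x)$out

# Percentile rule: values outside the chosen lower/upper quantiles
flag_percentile <- function(x, lower = 0.025, upper = 0.975) {
  bounds <- quantile(x, c(lower, upper), names = FALSE)
  x < bounds[1] | x > bounds[2]
}

# Cook's distance on an intercept-only model, flagged above 4/n
flag_cooks <- function(x) {
  d <- cooks.distance(lm(x ~ 1))
  unname(d > 4 / length(x))
}

# Each TRUE counts as 1, so the sums are "number of methods that flagged it"
scored <- fish %>%
  mutate(
    mass_score   = flag_boxplot(mass_mg) + flag_percentile(mass_mg) + flag_cooks(mass_mg),
    length_score = flag_boxplot(length_mm) + flag_percentile(length_mm) + flag_cooks(length_mm),
    total_score  = mass_score + length_score
  ) %>%
  arrange(desc(total_score))
```

Sorting by `total_score` puts the fish flagged by the most methods at the top, which is the same reading as the 0 to 8 scale described above.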

These graphs help to confirm that the outliers appear to be true examples of dominant and larger fish for each sample! One insight from the Mysis outliers: pretty much any fish over 50 mm was considered an outlier. This contrasts with Gemma, which had many non-outliers over 50 mm. Something about the Mysis feed seems to keep most fish from growing past roughly 50 mm. We see this in the distribution graphs as well.

2.0 Survival Data Exploration

Moving on to the next dataframe: I collected a census from each tank during the weigh and measure events.

First, I plotted the census for each tank during each of the measuring events. I can already see that the Mysis group has a higher survival per event than the Gemma group. Additionally, it looks like the Gemma group continues to decline, while the Mysis group appears to only lose a few at the start and end.

One thing I really enjoy about dplyr is the ability to use the pipe operator %>% even with ggplot2. For example, here I use it to create a graph of the average census for each group. This is something I would normally do in SQL, but since we are already in R, I can make adjustments on the fly, which is the nice thing about R for ad hoc analysis.

By adding standard error bars, I can see that the two groups’ error bars stop overlapping after the third census point, which suggests a real difference. Obviously, there shouldn’t be a difference at the first census point since the groups all start with the same number of fish.
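As an illustration of piping a dplyr summary straight into ggplot2, here is a small sketch. The `census` table, its column names, and every count in it are invented for the example; they are not the study's data.

```r
library(dplyr)
library(ggplot2)

# Invented census counts: 3 tanks per group, 3 measuring events (assumption)
census <- tibble(
  group = rep(c("Gemma", "Mysis"), each = 9),
  tank  = rep(paste0("T", 1:6), each = 3),
  event = rep(1:3, times = 6),
  count = c(30, 24, 20, 30, 25, 22, 30, 23, 19,
            30, 28, 28, 30, 29, 27, 30, 28, 27)
)

# Mean census and standard error (sd / sqrt(n)) per group and event
census_summary <- census %>%
  group_by(group, event) %>%
  summarise(
    mean_count = mean(count),
    se         = sd(count) / sqrt(n()),
    .groups    = "drop"
  )

# The summary can be piped directly into ggplot(); layers are then added with +
census_plot <- census_summary %>%
  ggplot(aes(x = event, y = mean_count, colour = group)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = mean_count - se, ymax = mean_count + se), width = 0.1)
```

Non-overlapping error bars are only a visual hint; a formal test is still needed before calling the difference significant.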

This is not looking good for our second metric. Survival appears to favor the Mysis fish. It’s possible the once-living organisms provide enrichment to the fish, evoking their predator drive. This could quell the “erratic viciousness” that tends to cause such high aggression-related deaths.

While there is definitely a difference in the average census for each group, I notice there are two very different slopes in survival: Event 1–2 vs. 2–6. This had me thinking that maybe, after the initial deaths, there were fewer deaths between each event.

The difference between Gemma and Mysis deaths per event does decrease over time. Unfortunately, we see that there is still a large difference for each of the first three events. While this is interesting, the decline in overall deaths could be due to age. A previous study of mine helped to identify the peak aggression timepoints for Surface Mexico at around 5–6 months old, which coincides with the two highest differences in deaths.

3.0 Sex Sort Data Exploration

Since we had different survival rates at the end of our study, I can’t compare the sex counts one-to-one. First, we need to normalize them as percent total. In order to do this, I need to mutate and pivot our current table. It’s important to normalize the data before we perform any summary statistics.
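Here is one way the normalization could look, as a sketch only. The `sex_counts` table and its numbers are hypothetical, and I am assuming one row per tank with a count column per sex.

```r
library(dplyr)
library(tidyr)

# Hypothetical sex-sort counts per tank (names and values are assumptions)
sex_counts <- tibble(
  group  = c("Gemma", "Gemma", "Mysis", "Mysis"),
  tank   = c("G1", "G2", "M1", "M2"),
  female = c(8, 7, 12, 13),
  male   = c(10, 9, 14, 15)
)

# Pivot to long format, then express each count as a percent of its tank total
sex_pct <- sex_counts %>%
  pivot_longer(c(female, male), names_to = "sex", values_to = "count") %>%
  group_by(tank) %>%
  mutate(pct = 100 * count / sum(count)) %>%
  ungroup()

# Summary statistics are computed on the normalized percentages, not raw counts
sex_summary <- sex_pct %>%
  group_by(group, sex) %>%
  summarise(
    mean_pct = mean(pct),
    se       = sd(pct) / sqrt(n()),
    .groups  = "drop"
  )
```

Normalizing within each tank first means a tank with more survivors does not dominate the group average.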

While there is a pretty large standard error, we can be relatively certain based on this chart that there isn’t any difference in sex skew between the two groups.

4.0 Fecundity Data Exploration

I was pretty certain of the unfortunate outcome based on the survival numbers, but I was still interested in the fecundity data. Knowing it took me a year to get to this point, I wanted to see the data through to the end, or at least for a few months longer…

From the total viable production below, you see three Mysis tanks and two Gemma tanks. 10C6 was an extra sample tank that had fewer males and females, but I kept it as a backup. I won’t use it in the calculations moving forward. However, it was interesting to see that it had fewer fish than the Gemma tanks and still contributed a higher viable total!

Frankly, from this chart alone, there is a pretty clear picture that Mysis has some sort of impact on the embryo production of A. mexicanus. However, I took a look at the average production per breeding event per Group to see how clear that difference actually was. The standard error will give us a visual indicator.

Yikes, I mean, it’s not even close. Whatever vitamins and minerals are in the Mysis shrimp really pump up the breeding events. I would be very interested to know whether this effect is behavioral or nutritional in nature.

5.0 Conclusion

These visuals tell a very clear story. Mysis is the superior feed when it comes to fecundity and survival. It also has no apparent detrimental effect on the growth of the fish.

Come check out Part 3 where I do some regression modeling, significance testing, and discuss the outcome and impact on the animals, facility, and team.

Thanks for hanging around! I hope you enjoyed and learned a bit from my work. If you aren’t thinking to yourself “I’ll never get those 10 minutes back”, I would appreciate any of the following:

👉🏽 Give this article a clap

👉🏽 Comment with a question, feedback, or improvement

👉🏽 Take a look at Part 3 where I go into significance testing, models, and conclusions