In the late 1960s, scientist Roger Payne popularized underwater recordings of humpback whales, with the goal of ending the extinction-level threat of commercial whaling.
“I felt that unless people got interested in whales there was no hope of saving them and I realized that I might be able to help change that. […] I spent two years recording whales and lecturing about them and going around playing whale songs for anyone who’d listen. My aim was to try to build whale songs into human culture.” — Roger Payne
In 1969 he released “Songs of the Humpback Whale” on a flexi-disc with Katy Payne and Frank Watlington, included in National Geographic and selling over ten million copies. For many people, this was the moment they learned that whales sing. A few years later the United Nations recommended an international moratorium on commercial whaling, adopted in 1982.
As an artist, I feel this represents one of the greatest cultural interventions in history: transporting ancient sound from the depths of the ocean, a team of researchers weaving alien music into the fabric of human culture. It’s akin to taking a photo of Earth from space. A radical new perspective that cannot be unseen, or unheard.
“Even wonderful films of marine life, often artificially illuminated, miss the fact that this is not a world of light and sight like ours but one of sound, where the primary sense of most denizens is hearing! In fact, to put it bluntly, we lack the capacity to listen and comprehend this world in anything like the way that the sophisticated sea creatures do.” — Mark Peter Simmonds, for Holoturian
I’ve recently been working with a decade of underwater recordings collected by NOAA to help Google Creative Lab visualize humpback whale songs for the Pattern Radio project. The scientists who regularly work with this data are intimately familiar with the nuances of animal vocalizations and the underwater soundscape. They use specialized software for processing and analyzing these recordings. But my ears are mostly trained for experimental music, and my tools come from machine learning or computer vision for interactive installations and other kinds of media art. My goal was not to answer a specific question or build a specific tool, but to explore new kinds of analysis that might help build deeper appreciation for what we are hearing when we listen to humpback whales. I’m writing to share my experience; not as a scientist writing a paper, or even a professional with a rigorous blog post, but as an amateur keeping a journal — having an encounter with an alien art form with only my ears, and code, and seven terabytes of data.
For this project I had access to uncompressed audio recorded between 2005 and 2016. Just moving this data around is incredibly resource intensive, and a lot of pre-processing was done by Matt Harvey. The audio comes from a “High-frequency Acoustic Recording Package” or HARP. “High-frequency” because the audio is originally recorded at more than 100kHz and downsampled to 10–16kHz for analysis. HARPs are deployed at depths around 400–900 meters for months at a time.
To save power a HARP will alternate between recording and sleeping. It might be recording for 75 seconds followed by sleeping for 15 minutes. To account for these gaps, audio is stored in a modified WAV format called XWAV that encodes the duration of those gaps (documented by the Triton acoustic analysis software). Each contiguous recording is called a “subchunk” (the file itself being one “chunk” of the deployment). XWAV also includes a small amount of additional metadata, including a timestamp, latitude/longitude, and the name of the deployment location.
Despite the gaps, and despite deployments at any single location not being continuous, I estimate nearly 12 years of audio in total, recorded over 15 years’ worth of deployments. This is far more than anyone could listen to, so researchers often view spectrograms, which can be quickly skimmed to guess whether a certain kind of sound is present.
For visualizing sound over longer periods of time, researchers create a long-term spectral average (LTSA), which downsamples over time and can visualize anywhere from minutes to days instead of seconds.
There are also more subjective, interpretive kinds of analysis. Artist Ariel Guzik imagines a kind of calligraphic writing system for all kinds of whale vocalizations.
Roger Payne and Scott McVay identified a hierarchical structure to humpback whale songs, describing individual sounds as “units” that combine into a “phrase”, repeated with variations as a “theme”, and with themes combined into a “song” that lasts in total around 10–15 minutes.
David Rothenberg and Mike Deal have identified, averaged, and color coded units to create evocative “sheet music” showing the evolution of a humpback whale song.
Researcher Ellen Garland analyzes humpback songs using first-order Markov models and lots of manually-annotated data.
Other work argues that Payne’s “hierarchical” structure is better described as heterarchical: that it cannot be reduced to a fixed order or simple set of rules, and in the most extreme interpretation might be best understood as a solution to physiological constraints rather than an intentional semantic gesture. This gives me mixed feelings. Humans have a history of projecting our very narrow perspective onto everything (in this case, our ideas about linguistics, or music). But at the same time, we also have a history of reducing intelligence in other creatures to mere instinct.
So before we get too much into the data, a quick reminder of the creature itself.
Humpback whales have been roaming the oceans for at least 11 million years. They feed on krill and small fish in the summer, sometimes teaming up to trap fish with bubble nets. They are generally friendly and appear to protect other species like gray whales and humans. They can grow up to 16m (52 feet), and live up to 100 years.
Only male humpbacks produce songs. All humpbacks in one area sing variations on the same song on a yearly cycle, and those songs spread from one region to another as shown in the diagram above. The purpose of these songs is unknown. Because the songs are typically performed during mating season, and in breeding grounds, there is some suspicion that the songs are meant to indicate reproductive fitness (longer songs demonstrating the whale can hold its breath longer). Others have proposed the songs are just sonar, or for demonstrating dominance. All of this is very speculative, because singing males are either solitary or sometimes with other males, and they don’t appear competitive when singing. I like the theory that the songs aren’t designed to impress females, but mostly for males to altruistically bond with and support each other. Maybe they’re just consoling each other.
Before doing any other kind of analysis, I wanted to build a good system for visualizing short snippets of what I was hearing. I did a quick survey of prepackaged software that uses spectrograms: Audition, Audacity, Sonic Visualiser, Triton, Raven, Sound Analysis Pro.
I tried to recreate my favorite results with librosa and wrote a notebook showing some variations on how to create spectrograms. We needed to make some decisions about our frequency range, volume range, whether to use a linear or logarithmic frequency axis, etc. The biggest decision was between using CQT or FFT.
We went with the CQT because it has the nice property of balancing time and frequency resolution at all frequencies, which avoids the usual “stretching” effect in lower frequencies.
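One way to see this property: a constant-Q analysis gives each frequency bin a window whose length shrinks as the center frequency rises, so every bin spans roughly the same number of cycles, whereas an FFT uses a single window length for all frequencies. The sketch below only computes those window lengths. The sample rate, lowest frequency, and bins per octave are illustrative assumptions, not the settings used for the project.

```python
import numpy as np

def cqt_window_lengths(sr, fmin, n_bins, bins_per_octave):
    """Center frequencies and window lengths (in samples) for a constant-Q analysis."""
    # Geometrically spaced center frequencies: each octave holds bins_per_octave bins.
    freqs = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)
    # Q is the ratio of center frequency to bandwidth, identical for every bin.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    # Window length that holds roughly Q cycles of each center frequency.
    lengths = np.ceil(q * sr / freqs).astype(int)
    return freqs, lengths, q

# Illustrative values only (assumptions, not the project's settings).
sr = 10_000
freqs, lengths, q = cqt_window_lengths(sr=sr, fmin=50.0, n_bins=72, bins_per_octave=12)
# Number of cycles of each center frequency that fit in its window.
cycles_per_window = lengths * freqs / sr
```

Low bins get long windows (fine frequency resolution) and high bins get short ones (fine time resolution), which is why the low end does not look smeared the way it does in a fixed-window FFT spectrogram.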
Because we wanted to ensure that the volume levels were consistent across the entire visualization, we needed to pick the right min and max volume levels. To do this we sampled ten thousand random subchunks across the entire dataset and sorted them by how loud they were, looking at curves of different loudness percentiles.
This allowed us to estimate a minimum and maximum loudness so we could encode all spectrograms as images in a portable format with minimal clipping.
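As a minimal sketch of that step, assuming the sampled subchunks have been pooled into one array of dB values: take a low and a high percentile as the global range, then map every spectrogram onto 8-bit pixels with that same range. The specific percentiles and the synthetic data are assumptions for illustration.

```python
import numpy as np

def loudness_range(db_samples, low_pct=1.0, high_pct=99.9):
    """Pick global min/max levels from percentiles of sampled dB values."""
    return np.percentile(db_samples, low_pct), np.percentile(db_samples, high_pct)

def encode_uint8(spec_db, db_min, db_max):
    """Map a dB spectrogram onto 0..255, clipping values outside the range."""
    scaled = (spec_db - db_min) / (db_max - db_min)
    return np.round(np.clip(scaled, 0.0, 1.0) * 255).astype(np.uint8)

rng = np.random.default_rng(0)
# Stand-in for dB values pooled from many randomly sampled subchunks.
pooled_db = rng.normal(loc=-60.0, scale=10.0, size=100_000)
db_min, db_max = loudness_range(pooled_db)
# Every spectrogram is encoded with the same range, so brightness is comparable.
image = encode_uint8(rng.normal(-60.0, 10.0, size=(128, 256)), db_min, db_max)
```

Using percentiles rather than the absolute min and max keeps a few extreme samples from squashing the contrast of everything else.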
One of the next challenges for clear visualization was removing mechanical noise artifacts from the spectrograms. While the HARP microphone is acoustically decoupled from the recorder, it still regularly picks up the sound of the hard drives spinning up at the beginning of each recording “subchunk” (after sleeping). Fortunately, this spin up has a consistent timing relative to the beginning of the subchunk. So if we take the median across hundreds of consecutive subchunks, we can extract the signature of the HARP spin-up noise.
Then we divide the original spectrogram by the HARP noise spectrogram to get an “equalized” version where every pixel is scaled by the expected noise level.
This has the added benefit of boosting the high frequencies and muting the lower frequencies to create an equal appearance of brightness across the entire spectrogram.
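The median-and-divide idea can be sketched on synthetic data. This assumes linear-magnitude spectrograms that have already been aligned to the start of each subchunk and stacked into one array; the array shapes, the fake "spin-up" burst, and the small `eps` guard are all assumptions for illustration.

```python
import numpy as np

def estimate_startup_noise(subchunks):
    """Median over subchunks of shape (n_subchunks, n_freq, n_frames).

    Sounds that occur at a fixed offset from the start of every subchunk
    survive the median; sounds that come and go (such as whale song) do not.
    """
    return np.median(subchunks, axis=0)

def equalize(spec, noise, eps=1e-8):
    """Scale every pixel by the expected noise level at that time and frequency."""
    return spec / (noise + eps)

rng = np.random.default_rng(0)
n_subchunks, n_freq, n_frames = 201, 16, 40
# Synthetic noise floor that is stronger in the low frequency bins...
floor = np.linspace(4.0, 1.0, n_freq)[:, None] * np.ones((1, n_frames))
# ...plus a "spin-up" burst in the first five frames of every subchunk.
floor[:, :5] *= 3.0
subchunks = floor * rng.uniform(0.9, 1.1, size=(n_subchunks, n_freq, n_frames))
# Add a loud one-off event to a single subchunk.
subchunks[7, 3, 20:25] += 50.0

noise = estimate_startup_noise(subchunks)
clean = equalize(subchunks[7], noise)
```

After the division the background sits near 1 everywhere, including the first five frames, while the one-off event still stands well above it.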
After creating clean spectrograms for 12 hours of audio, I created an LTSA. A typical LTSA would take the mean across each subchunk. But I noticed that it was possible to create an LTSA with more contrast for loud, narrow-band sounds like whale songs by instead taking the 99th percentile. Taking the 99th percentile is a more robust alternative to taking the max, telling us how loud the loudest sounds are over a given time period. The mean is large when there are quiet but persistent engine noises, but the 99th percentile is only large when there are very loud sounds (even if they are brief).
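A small synthetic comparison of the two reductions, as a sketch: each subchunk spectrogram is collapsed to one column, either by the mean or by the 99th percentile over time. The three fake subchunks (quiet, persistent "engine", brief "song") are assumptions made up for the example.

```python
import numpy as np

def ltsa(subchunks, reducer="p99"):
    """Collapse each subchunk (n_freq, n_frames) into one LTSA column."""
    columns = []
    for spec in subchunks:
        if reducer == "mean":
            columns.append(spec.mean(axis=1))
        else:
            columns.append(np.percentile(spec, 99, axis=1))
    return np.stack(columns, axis=1)  # shape: (n_freq, n_subchunks)

rng = np.random.default_rng(0)
n_freq, n_frames = 32, 500
quiet = rng.uniform(0.0, 0.1, size=(n_freq, n_frames))
# Quiet but persistent "engine" energy in every frame.
engine = quiet + 1.0
# Brief, loud, narrow-band "song": 10 of 500 frames in one frequency bin.
song = quiet.copy()
song[10, 100:110] = 20.0

mean_ltsa = ltsa([quiet, engine, song], reducer="mean")
p99_ltsa = ltsa([quiet, engine, song], reducer="p99")
```

In the mean version the engine column is brighter than the song column at bin 10; in the percentile version the order flips, because the brief song occupies more than 1% of the frames.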
I also explored the possibility of coloring the LTSA by the duration of repetition at that frequency.
In the image above, the red sounds are repeating slower and the green sounds are repeating faster. Because humpback whales seem to keep an even “tempo”, refining this visualization could make it a little easier to distinguish their songs from other repeating sounds (including other whales). In some cases I heard multiple whales singing at the same time, phasing in and out due to slightly different tempos, so a shorter-time version of this visualization might make it easier to visually distinguish multiple whales singing in the same recording.
In theory it should be possible to scale up the LTSA concept to average across days and see patterns in times of the year. I tried to visualize an entire year’s worth of audio, and without any alignment of day and night quickly hit the limits of information density.
A Brief Soundscape
Come into my home Murder my family and leave me alone Ceaseless hunger ran Until the sea is silent and deadly quiet But for an engine — Björk + Dirty Projectors, “Sharing Orb”
The background of every HARP recording is the boat engine. This broad-spectrum noise masks nearly everything else and creates big problems for sea life. There’s an incredible audio gallery of recordings from the Discovery of Sound in the Sea website, cataloging everything from animal sounds like blue whales to natural sounds like lightning to anthropogenic sounds. But I wanted to share a few weird sounds that stood out to me from my relatively brief time listening.
- “Oink” or “grunt” sound from unknown source. Possibly fish. Yes, fish make sounds. Listen to some recordings here. Saipan, October 2015
- The omnipresent engine. Hawaiʻi, May 2011
- Possible echosounder. Wake Island, April 2016
- Navy sonar. Kauaʻi, July 2010
- Dolphin whistles, aliased from higher frequencies. Ladd Seamount, May 2009
- More dolphin whistles with engine noise. Hawaiʻi, July 2012
- Humpback song with multiple clear echoes. Hawaiʻi, March 2014
- HARP microphone getting scratched by something. Wake Island, April 2011
- Sperm whale click used for echolocation while feeding/foraging. Cross Seamount, January 2006
- Likely fin whale or sei whale call. Tinian, December 2013
- Minke whale call (also described as a “boing”). Wake Island, March 2012
I don’t have statistics on the presence of various sounds throughout the data, but from listening to a lot of recordings from different years and locations my intuition is: the audio is mostly quiet at night, or with engine noise during the day, and one in twenty recordings has some kind of humpback song or another interesting feature. There’s also a very common “pulse”, “thumping”, or “heartbeat” sound that is as yet unidentified. It sometimes sounds like this “heartbeat” recording, but not always.
In 2017 I visualized bird songs by organizing them based on similarity. I used a technique called t-SNE which is a kind of nonlinear dimensionality reduction. In this case that just means making a 2D plot where similar sounds are closer together and dissimilar sounds are farther apart. I wanted to see what would happen if we tried that with humpback whale songs. My first idea was to try the same thing: extract 10,000 random snippets and sort them using UMAP, a newer algorithm very similar to t-SNE. Then I arranged them in a grid and visualized each point as a small spectrogram, a “fingerprint”.
What this showed me was less about the humpback sounds, and more about the kinds of noise across the dataset. At the bottom left there are a handful of almost completely silent recordings. Towards the bottom right there’s a cluster of what we’ve been calling “heartbeats”, unidentified regular low-frequency pulses that seem to show up very often but intermittently. The top center has some broad spectrum noise from engines.
I decided to try again on a shorter time scale. I used UMAP on each time frame from a single subchunk, and plotted the 3D embedding as RGB colors beneath the spectrogram.
Initially it didn’t work at all. I was expecting similar units to be color coded similarly, since they should all arrive in a similar 3D location in the UMAP embedding. This was based on intuition from trying something similar with text. I looked at the plot of the first two dimensions of the 3D embedding for hints.
My interpretation here was that the consecutive frames were too similar to each other for UMAP to find longer range similarity. So I decided to break this up by using chunks of consecutive frames instead of single frames, and by spacing out the chunks with a small stride.
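A minimal sketch of that chunking step, assuming a spectrogram stored as an array of shape (frequency bins, frames). The function name, window length, and stride are hypothetical choices for illustration.

```python
import numpy as np

def stack_frames(spec, window, stride):
    """Turn a spectrogram (n_freq, n_frames) into overlapping multi-frame chunks.

    Returns an array of shape (n_chunks, n_freq * window), one row per chunk,
    suitable as input to a dimensionality reduction such as UMAP.
    """
    n_freq, n_frames = spec.shape
    starts = np.arange(0, n_frames - window + 1, stride)
    return np.stack([spec[:, s:s + window].ravel() for s in starts])

# Tiny example: 4 frequency bins, 10 frames, chunks of 4 frames every 2 frames.
spec = np.arange(4 * 10, dtype=float).reshape(4, 10)
chunks = stack_frames(spec, window=4, stride=2)
```

With the umap-learn package installed, the rows could then be embedded with something like `umap.UMAP(n_components=3).fit_transform(chunks)` and the three output dimensions rescaled to RGB.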
This worked much better: the quick rising units all have a green/yellow-to-brown gradient, the growls are more bright green, with some other pink and lavender units getting their own classification. If we look at a simpler example, we can see that UMAP turns simple repetition into a loop through the embedding space:
For a segment with two different repeating units, UMAP creates two separate loops through embedding space:
I’d like to imagine that with enough work in this direction, and a robust metric for comparing two moving windows, UMAP might be able to discover some of the same structures that Ellen Garland’s Markov chain captures.
I also tried creating UMAP embeddings from the consecutive UMAP embeddings, but I’m not convinced this is a practical method for finding higher level structure without some additional processing to account for time warping.
One reason I find algorithms like t-SNE and UMAP so compelling is they have a “softness”: their output can’t be clearly judged as correct or incorrect, it’s more of a complex suggestion. Face detection is usually “hard”: when it draws a box around something that’s not a face, we say “that’s wrong”. But it’s not necessarily the algorithm itself that is “soft”. Showing a confidence score next to the box makes the output feel a little softer. Showing a heat map of “face-ness” across an image feels softest.
Most machine learning algorithms are designed for “hard” output. They’re designed to answer questions with decisive answers: are these two faces the same, what song is this, what route should I take? For this project we were more interested in soft answers that encouraged people to explore instead of hard answers that might mislead or distract.
One guiding question we asked from the beginning was: “where else can I find similar sounds to what I’m hearing now?” I was imagining something like Terrapattern: a tool for finding similar satellite images based on a location. To help understand what was needed I tried looking at intra-spectrogram distances on the scale of a few minutes.
My first idea was to take each frame from the spectrogram and compute its distance to each of the n frames, in a big n×n image. I’ve been calling this a “distance matrix” or “similarity matrix”, but I always present it so that brighter areas mean “more similar”. I was expecting to see a bright line along the diagonal, with some bright spots when two frames were similar, and dark areas everywhere else.
My intuition was betrayed when I saw the above image that was bright almost everywhere. This is because two frames of silence are much more similar to each other than two matching sounds. I tried a few kinds of normalization to account for this.
Using correlation and then covariance as distance functions instead of Euclidean distance partially handled the issue with silence matching silence. But there were still a lot of false positives, so I used a moving window of consecutive frames instead of a single frame.
Using a moving window with covariance helped, but I noticed the brightness didn’t match my intuition for the similarity of certain sounds. Applying a shaping function like a sigmoid or gamma curve wasn’t working across all recordings, so I tried something non-traditional: histogram equalization followed by a gamma curve. This is a way of preserving the relative ordering of similarity peaks while also compressing the bright values. Finally, instead of using covariance I switched to a custom distance function that seemed to work well: a per-sample standardized dot product. Like covariance, I subtract the mean from each sample, but I also divide by the standard deviation before taking the dot product. I also blur the spectrogram in the frequency axis a little before doing anything to create some leeway for whales that don’t sing the exact same pitch twice.
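Pieced together, the steps in this paragraph might look roughly like the sketch below: blur along frequency, cut moving windows, standardize each window, take dot products, then equalize and apply a gamma curve. The blur width, window length, gamma value, and the division by the feature count are assumptions, and the rank-based equalization is one simple way to do histogram equalization rather than a known detail of the original code.

```python
import numpy as np

def blur_frequency(spec, width=3):
    """Box-blur each frame along the frequency axis (axis 0)."""
    kernel = np.ones(width) / width
    return np.apply_along_axis(lambda col: np.convolve(col, kernel, mode="same"), 0, spec)

def standardized_similarity(features, eps=1e-8):
    """Per-sample standardized dot product between all pairs of rows."""
    z = features - features.mean(axis=1, keepdims=True)
    z = z / (features.std(axis=1, keepdims=True) + eps)
    # Dividing by the feature count keeps values in roughly [-1, 1].
    return z @ z.T / features.shape[1]

def equalize_then_gamma(sim, gamma=2.0):
    """Histogram-equalize by rank, then apply a gamma curve.

    The ordering of similarity values is preserved (ties are broken arbitrarily).
    """
    ranks = sim.ravel().argsort().argsort().reshape(sim.shape)
    return (ranks / (sim.size - 1)) ** gamma

rng = np.random.default_rng(0)
unit = rng.random((16, 6))              # made-up "unit": 16 bins x 6 frames
silence = 0.01 * rng.random((16, 6))    # near-silent gap
spec = blur_frequency(np.concatenate([unit, silence, unit, silence], axis=1))

# Moving windows of 6 consecutive frames, one row per window.
window = 6
windows = np.stack([spec[:, s:s + window].ravel()
                    for s in range(spec.shape[1] - window + 1)])
sim = standardized_similarity(windows)
display = equalize_then_gamma(sim)
```

In this toy case the window starting at frame 0 matches the window starting at frame 12 almost perfectly, while unit-versus-silence comparisons score much lower, because standardizing removes the overall level that made silence look similar to silence.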
One piece of information is missing from this visualization: at what frequency is the match happening? If we compute a similarity matrix for multiple frequency ranges, then color the pixels by which range has the best match, we can see some features more clearly.
The 3 minute similarity matrix in the center is particularly neat because it seems to show a lower frequency red sound overlaid at a much slower rate compared to more complex patterns. It’s hard for me to identify this sound in the audio itself, but easy to see in the visualization.
One way to extract units from recordings is to look for louder sounds over time. This is how I segmented bird sounds a few years ago. But due to background noise, the same approach doesn’t always work with humpback sounds. I experimented with a different approach based on the similarity matrices above.
First I find a threshold for the similarity matrix such that some percentage of the columns will have nonzero entries. Then I take the mean across the columns. This gives a kind of “repetitiveness” feature for every frame. Finally, I look for event boundaries by identifying local peaks in this “repetitiveness” feature.
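These three steps can be sketched on a toy similarity matrix. Two details are assumptions added for the example: a narrow band around the diagonal is ignored so frames do not simply match themselves, and the threshold is picked from a quantile of each column's best match.

```python
import numpy as np

def repetitiveness(sim, column_fraction=0.3, exclude=2):
    """Per-frame repetitiveness from a square similarity matrix.

    column_fraction: share of columns allowed to keep nonzero entries.
    exclude: half-width of the band around the diagonal that is ignored
             (an assumption here, so frames do not simply match themselves).
    """
    idx = np.arange(sim.shape[0])
    far = np.abs(idx[:, None] - idx[None, :]) > exclude
    off_diag = np.where(far, sim, -np.inf)
    # Threshold chosen so that about column_fraction of columns exceed it somewhere.
    col_max = off_diag.max(axis=0)
    threshold = np.quantile(col_max, 1.0 - column_fraction)
    kept = np.where(off_diag > threshold, off_diag, 0.0)
    return kept.mean(axis=0)

def local_peaks(x):
    """Indices of positive local maxima (left edge of any plateau)."""
    inner = (x[1:-1] > x[:-2]) & (x[1:-1] >= x[2:]) & (x[1:-1] > 0)
    return np.flatnonzero(inner) + 1

rng = np.random.default_rng(0)
n = 40
# Weak background similarity everywhere, perfect similarity on the diagonal.
sim = rng.uniform(0.0, 0.1, size=(n, n))
sim = (sim + sim.T) / 2
np.fill_diagonal(sim, 1.0)
# One repeated event: frames around 5 resemble frames around 25.
for offset, value in [(-1, 0.5), (0, 0.9), (1, 0.5)]:
    sim[5 + offset, 25 + offset] = sim[25 + offset, 5 + offset] = value

rep = repetitiveness(sim, column_fraction=0.15)
peaks = local_peaks(rep)
```

Only the two frames that resemble each other show up as peaks; the weak background never clears the threshold.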
This technique seems to help with noisy recordings. It doesn’t pick up loud persistent background noises or one-off bursts of noise, only repetitive sounds.
I also tried a more traditional computer vision approach: blur the spectrogram then threshold at two different levels with some morphological filtering, using the higher-level thresholded regions as seeds to extract the connected lower-level thresholded region with contour detection and bounding box merging. In other words: find the loudest sounds, then find the region around them. This is vaguely similar to how the watershed algorithm is applied.
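A simplified sketch of the two-threshold idea follows. It swaps in a plain technique for the contour detection and bounding box merging described above: the loud seed pixels are repeatedly dilated but only allowed to spread inside the quieter mask. It also skips the morphological filtering. The thresholds, blur size, and toy spectrogram are assumptions.

```python
import numpy as np

def box_blur(img, size=3):
    """Separable box blur with edge padding (size should be odd)."""
    pad = size // 2
    kernel = np.ones(size) / size
    padded = np.pad(img, pad, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def dilate(mask):
    """4-connected binary dilation by one pixel."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]
    out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]
    out[:, :-1] |= mask[:, 1:]
    return out

def detect_regions(spec, low, high, blur_size=3):
    """Blur, threshold at two levels, and grow the loud seeds into the quieter mask."""
    smooth = box_blur(spec, blur_size)
    candidates = smooth > low
    region = smooth > high
    while True:
        grown = dilate(region) & candidates
        if np.array_equal(grown, region):
            return region
        region = grown

spec = np.zeros((20, 30))
spec[5:9, 4:12] = 0.5      # faint halo...
spec[6:8, 6:9] = 1.0       # ...around a loud core: kept as one region
spec[14:17, 20:26] = 0.5   # faint blob with no loud core: rejected
mask = detect_regions(spec, low=0.3, high=0.8)
```

The faint halo is kept because it touches a loud core, while an equally faint blob elsewhere is dropped, which is the "find the loudest sounds, then find the region around them" behavior.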
I imagine either of these approaches to detection could serve as a decent heuristic to bootstrap a human-annotated corpus of humpback whale songs, which could then be fed to a supervised learning algorithm to build a detector or classifier.
Once we have some unit boundaries, we can run the units through UMAP to get a coloring that roughly groups the different sounds (I also tried some clustering algorithms like hdbscan, but UMAP gave “softer” results).
Another interesting side-effect: because we have contours for each unit we can do a kind of noise reduction on the spectrogram, highlighting the regions inside those contours.
Triplet Loss Embedding
With a unit detection heuristic in place, we can take a huge sequence of units and train a neural network with triplet loss to find a high dimensional embedding. Triplet loss takes two samples that are known to be the same, and one sample known to be different, and tries to find an embedding that minimizes the distance between the similar and maximizes the distance between the dissimilar. We can guess which units are the same by looking in the neighborhood of each unit for the most similar units based on another metric like the standardized dot product described above. The most dissimilar units are almost certainly from another class and can be used as negative examples. Triplet loss should be able to learn an embedding that is more robust to the actual variation in the dataset.
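For reference, a common form of the triplet loss, written for batches of already-computed embeddings. The squared Euclidean distance and the margin of 1.0 are standard choices assumed here; the text does not say which variant was used.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on batches of embeddings, shape (batch, dim)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # anchor to "same" sample
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # anchor to "different" sample
    # Zero once the negative is farther than the positive by at least the margin.
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

anchor = np.array([[0.0, 0.0]])
near = np.array([[0.1, 0.0]])
far = np.array([[3.0, 4.0]])
good = triplet_loss(anchor, near, far)   # positive is near, negative is far
bad = triplet_loss(anchor, far, near)    # roles swapped
```

A network trained on this objective is pushed until embeddings like `good` dominate, so guessed positives from the neighborhood and guessed negatives from far away only need to be right most of the time.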
Before training, I checked the ability of simple Euclidean distance based nearest neighbors to find similar sounds across the dataset. In some cases it worked, but in most cases it looked like this:
Given the unit on the far left, it mostly found other chunks of noise. So I did a quick experiment with a three layer convolutional network processing 32x128 pixel spectrogram images of individual units. I started from a triplet loss example I wrote in 2017. Strangely enough, it worked the first time. This was encouraging, because Matt Harvey at Google got similar results with triplet loss on this dataset.
Given the same unit, nearest neighbors in the embedding space appeared more similar. This technique seems to work well for finding long-distance matches, but it can’t be directly adapted for creating similarity images due to the translation invariance of the convolutional network making it harder to pinpoint the matches. I was hoping a UMAP of the triplet loss embeddings would create some clear clusters, but the result was much more interconnected than I expected.
This general direction is ripe for more work, especially looking closer at what the triplet loss embedding actually learned, and using embedding sequences to look up similar phrase sequences across different days.
Generating New “Songs”
Another way of learning an embedding is to use the state of an unsupervised sequence prediction algorithm like seq2seq or a recurrent neural network (RNN). I didn’t get very far with this one. From my previous experience working with RNNs, the network is more likely to learn an embedding that captures a shorter or longer duration depending on whether you provide a shorter or longer window for training.
To maximize my chances of getting something out of the RNN I simplified the data as much as possible to a small binarized representation with only 32 frequency bands (I’ve previously shared a similar notebook that predicts sequential handwritten digits). Binarized because I rarely have any luck training regression tasks, and because models like PixelRNN or WaveNet seem to suggest that discretized representations can work better for sequential generation. This allowed me to train to completion in 5 minutes and quickly test a few different hyperparameters.
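One plausible way to build such a representation, as a sketch: average neighboring frequency bins down to 32 bands, then threshold at a global percentile. The pooling scheme and the 90th-percentile threshold are assumptions; the text only says the data was binarized into 32 bands.

```python
import numpy as np

def binarize_bands(spec, n_bands=32, percentile=90):
    """Pool a spectrogram (n_freq, n_frames) into n_bands rows and binarize.

    Assumes n_freq >= n_bands so that every band contains at least one bin.
    """
    n_freq, n_frames = spec.shape
    # Split frequency bins into contiguous groups and average each group.
    edges = np.linspace(0, n_freq, n_bands + 1).astype(int)
    pooled = np.stack([spec[a:b].mean(axis=0) for a, b in zip(edges[:-1], edges[1:])])
    # One global threshold; the percentile is an illustrative choice.
    return (pooled > np.percentile(pooled, percentile)).astype(np.uint8)

rng = np.random.default_rng(0)
spec = rng.random((128, 200))
binary = binarize_bands(spec)
```

Each frame becomes a 32-bit on/off pattern, which turns next-frame prediction into 32 binary classification problems instead of a regression.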
In the above image the first 256 samples are a seed, then there is a vertical white line, and the rest is generated by the network. After struggling to get up to speed, the network falls into a regular phrase that appears frequently during the 12 hours of data that was provided.
The next things to try here might be:
- Sonifying the results using concatenative synthesis from the original recordings.
- Running a large amount of data through the neural network, saving the state of the network at each moment, and looking for other moments in the data with a similar state.
- Increasing the amount of training data, the number of frequency bands, and the levels of quantization.
- Using a mixture density network or discretized mixture of logistics for the output.
- Switching to a seq2seq-like model, which is explicitly designed to encode state.
While wrapping up work on this project, the big question tugging on me is one of the same questions that triggered the collaboration between NOAA and Google: what does it mean for two phrases to be similar to each other? It’s very tempting to just extend the window size of the chunks, but this mostly stretches the similarity matrix along the diagonal.
Another direction that seemed promising in theory but was incredibly slow and not super helpful in practice was dynamic time warping. DTW finds a time offset for each sample in a series such that it is best aligned with another series. Using DTW before comparing two spectrograms can help provide some temporal translation invariance when doing comparisons. I optimized some DTW code to produce whole similarity matrices (at a lower resolution than usual).
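For readers who have not seen it, a textbook version of DTW is short. This is a plain, unoptimized sketch over two feature sequences with Euclidean frame distances, not the optimized code mentioned above.

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping between sequences a (n, d) and b (m, d).

    Returns the accumulated cost and the alignment path as (i, j) pairs.
    """
    n, m = len(a), len(b)
    # Pairwise Euclidean distances between every frame of a and every frame of b.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
            cost[i, j] = dist[i - 1, j - 1] + best_prev
    # Backtrack from the end to recover the alignment.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[n, m], path[::-1]

# The second sequence is a time-stretched copy of the first.
a = np.array([[0.0], [1.0], [2.0], [3.0]])
b = np.array([[0.0], [0.0], [1.0], [2.0], [2.0], [3.0]])
total, path = dtw(a, b)
```

The nested loop is quadratic in sequence length for every pair being compared, which hints at why filling a whole similarity matrix this way gets slow.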
But each phrase section tends to bleed into the next. In retrospect it makes sense that DTW alone wouldn’t be able to tell when a phrase “begins” and “ends”.
Another difficulty with identifying repetition of phrases is that there are often multiple whale songs mixed together.
For me, this makes it seem unlikely that pure heuristics or mostly-unsupervised techniques will be able to correctly annotate NOAA’s multiple decades of recordings. But I think some combination of a few techniques could potentially provide useful unit annotations: heuristics for bootstrapping a corpus of units, manual labels for a complete but relatively small number of unit classes (possibly using audio annotator), and semi-supervised learning to make use of the rest of the data. Then again, this might not be holistic enough: a repeated unit from a single phrase can vary wildly in pitch over the course of a song, often starting at a higher pitch and descending.
This means that if some units are primarily identified by their pitch, any system operating on local context alone will fail often.
I imagine a “soft” way of showing phrasal similarity. Like the unit-level similarity matrices, but for entire sections. It might look something like this:
With the similarity matrix on the left, any row or column can be read to show similarity peaks at other moments in time. On the right, the current phrase would be highlighted as well as similar phrases at other times. It’s unclear to me how to create an image like this, especially when the kind of repetition in humpback songs isn’t generally as simple as the example above.
This work was all in service of Pattern Radio. With the launch of the website my exploration has wrapped up, but I’m still really interested in this kind of work! It’s hard for me to shake this thought that these creatures sing a new song every year, and they’ve been doing this for 11 million years. If you want to chat, please find me on Twitter or send me an email.
This work wouldn’t have been possible without help from a bunch of people. In alphabetical order:
- Alexander Chen (Google Creative Lab), Jonas Jongejan (Google Creative Lab), Lydia Holness (Google Creative Lab), Mohan Twine (Google Creative Lab), and Yotam Mann for holding the bigger project together, keeping me on track and sharing lots of helpful feedback, ideas, and questions 💪
- Ann Allen (NOAA Fisheries PIFSC) for entertaining my questions about weird sounds I heard in the course of listening to lots of hydrophone recordings 🐋
- Aren Jansen (Google Machine Hearing) for answering a bunch of my extremely poorly informed questions at the beginning of this project 🙏
- Kyle Kastner for a variety of suggestions, but especially for pointing me to this talk about detecting right whale calls, as well as Justin Salamon’s talk on self-supervised learning from weak labels 🙌
- Matt Harvey (Google AI Perception) for multiple discussions, answering a bunch of questions about his work with the same data, and especially for helping me understand the potential connection between the UMAP loops and Ellen Garland’s Markov chains, and that blurring the spectrogram before creating the distance matrix was equivalent to checking the max across multiple offsets 🤦‍♂️
- Nikhil Thorat (Google PAIR) who did some early work on unit segmentation and classification from a large chunk of manually annotated data. I learned a bunch from talking to him about what did and didn’t work 👏
- Parag Mital for reviewing this article, for suggesting useful directions for analysis, and for making recommendations for other audio visualization tools to check for inspiration 👌
Thank you! 🎶🐋🙏