Hi readers, thanks for coming back for another installment of DIY data science, featuring me: a job searcher who would rather spend a Friday night coding Python web scraping investigations like this one (I'm actually 27, but I am often mistaken for Bernie Sanders' older brother).
Why am I scraping data? Well, when I am not data sciencing (yes, I am making that a word), I am making music, so I have a thing for audio. Turns out NASA also has a thing for audio: they (well, the University of Iowa) host a massive library of electromagnetic waves that happen to fall in the audible human frequency range. According to Dr. Bounds from the University:
“electromagnetic waves in a frequency range from ~4Hz up to ~12kHz. It just so happens this is the audible frequency range so we can easily convert the waveform data into a sound file. What you are listening to is not a sound in space but an electromagnetic wave (radio wave).”
Thank you, Dr. Bounds! OK, I could go on about how cool this NASA Van Allen mission was, but you can Google around for that. What I immediately thought was:
Could it be possible to create an anomaly detection model (AKA an alien listening mechanism) based on the audio from the Van Allen mission?!
In diametric opposition to Fermi's Paradox, let's break that outlandish thought down into a three-part series that I will post over a few weeks. In this article I will focus on part 1 below, specifically: how I quickly web scraped hundreds of hours of audio from NASA (well, technically the University of Iowa) using Beautiful Soup. Fun!
1a. Web scraping step. Acquire the waves (audio data); must be automated due to the data volume
1b. Store the data and run analysis on an AWS EC2 instance (virtual machine)
2a. Feature engineering work. Identify multiple, distinctive features from each audio sample
2b. Use an application to keep the job alive even when the SSH tunnel closes
3a. Unsupervised learning. Cluster samples based on features (this is unsupervised, i.e. I have no labels for the different kinds of waves, so I will use k-means)
3b. Create a compelling presentation of the results
I'm going to provide a technical explanation to support step 1: how I was able to download thousands of mp3s by writing Python code, mostly leveraging Beautiful Soup.
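The setup step looks roughly like this. This is a minimal sketch: the base URL below is a placeholder, not the actual archive address, and `make_soup` is a hypothetical helper name.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; the real target is the University of Iowa's
# plasma-wave audio archive, whose address I'm not reproducing here.
BASE_URL = "https://example.edu/plasma-wave-audio/"

def make_soup(url):
    """Fetch a page and parse it into a Beautiful Soup object."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # confirm the endpoint actually responded OK
    return BeautifulSoup(resp.text, "html.parser")

# In a live run: soup = make_soup(BASE_URL)
```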
So, above, all I've done is import packages and make sure the endpoint where I want to get my data is good. Then I made the site into a Beautiful Soup "soup" object so that we can start to parse the website.
Now, this next block is not strictly necessary for the task at hand; however, it demonstrates some of the ways one might isolate the different objects/areas of a website.
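A few of those access patterns, run here against a stand-in for an Apache-style directory listing (the real archive's markup will differ, but the Beautiful Soup calls are the same):

```python
from bs4 import BeautifulSoup

# Stand-in HTML imitating a server directory listing; purely illustrative.
html = """
<html><head><title>Index of /plasma-wave-audio</title></head><body>
<h1>Index of /plasma-wave-audio</h1>
<a href="../">Parent Directory</a>
<a href="2012_10/">2012_10/</a>
<a href="wave_001.mp3">wave_001.mp3</a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

title = soup.title.text                              # page title text
first_link = soup.find("a")                          # first anchor tag only
all_hrefs = [a["href"] for a in soup.find_all("a")]  # every link target
mp3s = [h for h in all_hrefs if h.endswith(".mp3")]  # just the audio files
```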
If you run the code below, it will take a long time. You might not have adequate memory, and your kernel might die. If you run into this issue, refer to my guide on how to run your code on an AWS EC2 instance (virtual machine).
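Here is a sketch of the idea behind those two functions, not my exact code. I've folded the walk into a recursive `deeper` that descends through subfolders, with `crawler` doing the file check; the `fetch` parameter is an assumption I added so the walk can be exercised without hitting the network (in a live run you'd pass something like `lambda u: requests.get(u).text`).

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def crawler(href):
    """Check that a link points at what we're seeking: an .mp3 file."""
    return href.lower().endswith(".mp3")

def deeper(url, fetch, found=None):
    """Recursively walk subfolders to the 'end' of the file structure,
    collecting the full URLs of every .mp3 along the way."""
    if found is None:
        found = []
    soup = BeautifulSoup(fetch(url), "html.parser")
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.startswith((".", "/", "?")):  # skip parent-dir and sort links
            continue
        full = urljoin(url, href)
        if href.endswith("/"):
            deeper(full, fetch, found)        # descend into the subfolder
        elif crawler(href):
            found.append(full)                # it's an .mp3, keep it
    return found
```

In a test run you can fake the site with a dict of pages and pass `pages.get` as `fetch`, which is exactly why the fetcher is injected rather than hard-coded.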
There are two main functions in the block above: crawler and deeper. Basically, deeper is used to locate the "end" of the file structure; it navigates us from the parent folders down to the .mp3s. The crawler does the heavy lifting of checking each file to make sure it's what we're seeking, that is, files with the extension .mp3.
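Once the links are collected, each file gets pulled down one at a time. Streaming the response to disk in chunks, rather than holding whole files in memory, helps with the memory pressure mentioned earlier. A sketch, with hypothetical helper names:

```python
import os
import requests

def local_name(url, out_dir="waves"):
    """Derive a local file path from the last segment of the URL."""
    return os.path.join(out_dir, url.rsplit("/", 1)[-1])

def download_mp3(url, out_dir="waves"):
    """Stream one .mp3 to disk without loading it all into memory."""
    os.makedirs(out_dir, exist_ok=True)
    fname = local_name(url, out_dir)
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(fname, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 16):
                f.write(chunk)  # write 64 KB at a time
    return fname
```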
Thanks for reading! I will be posting steps 2 and 3 in between interviews. I'm on the market if you're hiring Data Scientists/Analysts.