Battle of the Vectors

shreya
5 min read · Jan 30, 2024


Phase Three

Code: https://github.com/shreyadsouza/featured_artist/tree/main/phase_3

I knew from the get-go that I wanted to do some kind of audio-visual project using a movie snippet. Mashups, Glee, and Pitch Perfect defined a lot of my teenage years, and I experimented with a lot of different clips before settling on my final idea. I realized the “riff-off” scene from Pitch Perfect, a scene where rival a cappella groups riff off each other, would fit right in with the mosaic nature of this assignment.

My final mosaic for Phase 3 is a system that takes user input to select a features file and a Pitch Perfect clip. It then layers audio onto the Pitch Perfect clip by running KNN over the extracted features (functionality implemented in the provided mosaic-synth-file.ck file). Pressing the “up” key adds extra layers of audio, fitting with the idea of adding harmonies in a cappella, and pressing the “down” key removes layers.
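
For anyone curious what the keyboard handling looks like, here is a rough ChucK sketch of the idea rather than my exact code: the file name is a placeholder, and the Gain toggle stands in for a real mosaic layer.

    // minimal sketch: toggle an extra "layer" with the up/down arrow keys
    // (here the layer is just a SndBuf through a Gain; in the real project
    // the layers come from the mosaic synthesis itself)
    Hid hid;
    HidMsg msg;

    // open the default keyboard
    if( !hid.openKeyboard( 0 ) ) me.exit();

    // an extra layer: silent until the "up" key is pressed
    SndBuf layer => Gain g => dac;
    "extra-layer.wav" => layer.read;  // placeholder file name
    1 => layer.loop;
    0 => g.gain;

    while( true )
    {
        // wait for a keyboard event
        hid => now;
        while( hid.recv( msg ) )
        {
            if( msg.isButtonDown() )
            {
                // 82 / 81 are the USB HID usages for the up / down arrows
                if( msg.key == 82 ) 1 => g.gain;  // "up": add the layer
                if( msg.key == 81 ) 0 => g.gain;  // "down": remove it
            }
        }
    }

Scaling a Gain up and down is the simplest way to add or remove a layer without stopping any running shreds.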

I improved upon my milestone in the following ways:

  • Added user interaction to make it a more interactive system
    ◦ Giving users the ability to select their feature file and audio clip means my implementation can be extended to any pre-extracted feature file/audio clip pair without needing to change any code.
    ◦ Added the keyboard interaction described above.
  • Reduced the choppiness of the feature audio
    ◦ I made the window size within the mosaic synthesis file slightly shorter than that of the features file (see the sketch after this list), which allowed sounds to blend more seamlessly. One limitation I ran into was using the song “Fancy” as one of my feature audio files; the mosaic seemed to pick up on the instrumental more than the vocals.
    ◦ While experimenting, I noticed a tradeoff between how mosaic-like a song felt and how easily listeners could identify the source music. I wanted listeners to be able to tell what the original song was, so I set my window length/hop size accordingly.
  • Added ambient sounds as another source file for each extraction to fill silences.
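
For reference, the sketch below shows the kind of window/hop machinery the bullets above refer to. It follows the general shape of the provided extraction files, but the values here are illustrative rather than the ones I settled on.

    // sketch of the window/hop setup I was tuning (values illustrative)
    adc => FFT fft =^ Centroid cent => blackhole;

    // window length: longer windows sound smoother and more recognizable,
    // shorter ones sound more "mosaic"-like but choppier
    4096 => fft.size;
    Windowing.hann( fft.size() ) => fft.window;

    // hop size: how far to advance between analysis frames
    fft.size()::samp / 2 => dur hop;

    while( true )
    {
        // compute one analysis frame, then advance by the hop
        cent.upchuck();
        hop => now;
    }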

Final Thoughts

I am pleased with how this turned out, but I wish I had more time to perfect it! I would have loved to create a DJ-like experience where I could switch the feature file while the program was executing to truly create a blend of different sounds. Andrew had a great idea of adding a video of me DJing to demo this system; unfortunately, Week 5 time constraints prevented me from refactoring the code to load multiple feature files so that I could switch between them while the program was running.

Acknowledgments

  • Andrew and Ge for helping me in office hours and on Discord
  • The provided mosaic synthesis and feature extraction files
  • Everyone for providing great feedback on the milestone

Phase Two

I honestly spent most of my time coming up with an idea of what I wanted my source for feature vectors to be and how to create a mosaic, and I’m still unsure if I’m completely sold on my idea.

I wanted to create feature vectors for one song and apply them to another to create a “remix” (initially a duet from High School Musical to one from La La Land). I quickly realized it sounded extremely choppy and nonsensical no matter how I extracted the vectors. I then experimented with using isolated vocals, to preserve the original instrumental and keep some semblance of the original song, but the vocals did not provide enough variation in feature types for the mosaic to sound good.
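
To spell out what applying one song’s feature vectors to another means here: each analysis frame of the target is matched to its closest pre-extracted frame from the source, and the source audio at that point gets spliced in. The provided mosaic-synth-file.ck does this lookup with KNN; the snippet below is just a hand-rolled stand-in for that matching step, with placeholder array sizes.

    // hand-rolled stand-in for the KNN lookup in the provided synth file:
    // given a query feature frame, find the closest source frame by
    // Euclidean distance (array sizes here are placeholders)
    float sourceFeatures[100][20];  // e.g. 100 source frames x 20 dims

    // Euclidean distance between two feature vectors
    fun float distance( float a[], float b[] )
    {
        0.0 => float sum;
        for( 0 => int i; i < a.size(); i++ )
        {
            a[i] - b[i] => float d;
            d * d +=> sum;
        }
        return Math.sqrt( sum );
    }

    // index of the source frame closest to the query frame;
    // the synth would then play the source audio starting at that frame
    fun int nearest( float query[] )
    {
        0 => int best;
        Math.FLOAT_MAX => float bestDist;
        for( 0 => int i; i < sourceFeatures.size(); i++ )
        {
            distance( query, sourceFeatures[i] ) => float d;
            if( d < bestDist ) { d => bestDist; i => best; }
        }
        return best;
    }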

My milestone is a scene from Pitch Perfect referred to as a “riff-off,” in which different a cappella groups compete by riffing off each other. I used different songs to extract features for each part.

I used both the file-input and mic-input extraction files to create the mosaic. I used the latter for some clips because the output audio seemed quite repetitive, owing to a lack of variation in the source audio. I also experimented with changing the window length to get longer snippets and make the audio less choppy.

In the extra two days we were given after I created this milestone, I experimented with video processing and brainstormed how to add more user input. I also added a “quiet” audio file to represent silences.

Code: https://github.com/shreyadsouza/featured_artist/tree/main/phase_2


There is a lot of room for improvement:

(I finished my video before Tuesday’s class and didn’t have time to edit and re-upload it to reflect the helpful advice I was given!)

  • Add video processing to include the extracted video clips alongside the audio.
  • Add some kind of keyboard input (maybe to switch between the original and mosaiced versions).
  • Add more user interaction in general.
  • Reduce the “choppiness” of the audio features.
  • Experiment with the sound some more (e.g., pitch shift; see the sketch after this list). I experimented with window length, but need to do more work to understand how that affects the sound.
  • Add more variability as the video goes on. At present, we see the same concept of replacing audio with a single other song for the entire duration.
  • Code-wise: generate clips all at once instead of having to record separately for different clips.
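
On the pitch-shift idea above: the simplest ChucK version just changes the SndBuf playback rate, which shifts pitch and tempo together (true pitch shifting without changing duration would need something fancier). A tiny sketch with a placeholder file name:

    // crude pitch shift: change playback rate (shifts pitch and tempo together)
    SndBuf buf => dac;
    "snippet.wav" => buf.read;  // placeholder file name

    // one semitone up: multiply the rate by 2^(1/12)
    Math.pow( 2, 1.0/12.0 ) => buf.rate;

    // let the (now slightly shorter) snippet play out
    buf.length() => now;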

Phase One

I conducted several types of experiments: testing single unit analyzers to see which one performed best on its own, using more of the same type of feature, and combining the best-performing analyzers to see if I could improve their accuracy using my findings.

Comparing results across experiments

Overall, the original configuration in feature-extract.ck achieved the best single validation performance, with fold 4 reaching 45% accuracy. Interestingly, using 5 MFCCs (and nothing else) also achieved around 40% accuracy. This could be attributed to the dimensionality of the feature vectors (75 dimensions each): the higher-dimensional extraction methods generally performed better, with chroma outperforming rolloff, for example.

To try to achieve even higher accuracy, I added more MFCCs, chroma, and rolloff to the first combination of analyzers. However, this gave a similar accuracy to the initial experiment, even though the vectors were of higher dimension, so for the milestone I kept the original feature configuration.
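
For context, these combinations are built by stacking unit analyzers into a single feature vector, following the structure of the provided feature-extract.ck. The sketch below shows the kind of chain I mean; the parameter values are illustrative rather than the exact counts I tested.

    // sketch: stack several unit analyzers into one combined feature vector,
    // following the structure of the provided feature-extract.ck
    adc => FFT fft;
    FeatureCollector combo => blackhole;

    // each analyzer contributes dimensions to the combined vector
    fft =^ MFCC mfcc =^ combo;
    fft =^ Chroma chroma =^ combo;
    fft =^ RollOff rolloff =^ combo;

    // analysis parameters (illustrative)
    4096 => fft.size;
    Windowing.hann( fft.size() ) => fft.window;
    20 => mfcc.numCoeffs;

    // let the input fill one window, then compute a frame and check its size
    fft.size()::samp => now;
    combo.upchuck() @=> UAnaBlob frame;
    <<< "feature vector dimensions:", frame.fvals().size() >>>;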

Work completed for Stanford’s MUSC 356 / CS 470 course
