I would like to create a program that uses a convolutional neural network to generate what famous quotes from movies or soundtracks would sound like, based on a representation learned from roughly 10,000 files (mp3s, or mp4s converted to mp3s). The output would be programmed with a “scalar” system that works across two levels of detection: frequency and time. The program would go through each individual layer of detection and rate each millisecond of playback on a 0 to 1 spectrum; the closer the number is to 1, the more “voice” that frequency carries in that millisecond.
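
To make the scalar idea concrete, here is a minimal sketch of the per-millisecond scoring, assuming librosa for decoding. Where the trained CNN would eventually go, I've dropped in a crude band-energy heuristic as a labeled stand-in; the function name and parameters are just invented for illustration:

```python
import numpy as np
import librosa

def voice_scores(path, sr=16000, n_fft=512):
    """Score every millisecond of a clip on the 0-to-1 'voice' spectrum."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    hop = sr // 1000                                  # 16 samples = exactly 1 ms at 16 kHz
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    voice_band = (freqs >= 100) & (freqs <= 4000)     # rough speech band
    # Stand-in for the trained CNN: the fraction of each column's energy
    # that falls in the speech band. The real model would replace this line.
    scores = spec[voice_band].sum(axis=0) / (spec.sum(axis=0) + 1e-9)
    return scores                                     # one value in [0, 1] per millisecond
```

Each column of the spectrogram corresponds to one millisecond of playback, so the returned array is exactly the 0-to-1 sequence described above.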

This would mean 5000 milliseconds of voice, in a pattern, would be needed from each given example analyzed. Why such a seemingly random number as 5000? There are 5000 milliseconds in 5 seconds, which is the clip length we analyze. Creating a visualization of the soundbite would be simple: illustrate it as a scrolling horizontal set of pixels representing the length of time, with the vertical axis representing the frequency of the pitch. In principle the frequency axis could extend indefinitely in either direction, but to make things simple, nearly all of the energy in a human voice sits below about 13,000 Hz, so that’s where we’d cap the boundary for positive frequency. A sketch of that visualization follows.
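Here is one way the visualization could be rendered, assuming librosa and matplotlib are available; "quote.mp3" stands in for any of the gathered files:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("quote.mp3", sr=32000, duration=5.0)  # 5 s = 5000 ms
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=32))     # 32 samples = 1 ms at 32 kHz
S_db = librosa.amplitude_to_db(S, ref=np.max)

fig, ax = plt.subplots()
img = librosa.display.specshow(S_db, sr=sr, hop_length=32,
                               x_axis="time", y_axis="hz", ax=ax)
ax.set_ylim(0, 13000)                                      # cap at the 13 kHz bound
fig.colorbar(img, ax=ax, format="%.0f dB")
plt.show()
```

Time scrolls along the x axis, frequency climbs the y axis, and the color at each point reflects how much energy that frequency holds at that millisecond.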

So taking all of that into account, we could create an image with the dimensions 1300 × time, where 1 millisecond is 1 pixel on the horizontal scrolling axis. The vertical axis is 1300 pixels rather than 13,000 because 13,000 rows is just a lot to render and would require a lot of computing power; condensing by a factor of 10 means each vertical pixel covers a 10 Hz band.
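
A sketch of building that 1300 × time array, again assuming librosa; the row mapping here simply sums STFT bins into 10 Hz buckets, and the function name and n_fft choice are assumptions for illustration:

```python
import numpy as np
import librosa

def quote_image(path, sr=32000, fmax=13000, n_rows=1300):
    """Return a (1300, duration-in-ms) array: 10 Hz per row, 1 ms per column."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    hop = sr // 1000                               # 32 samples = 1 ms at 32 kHz
    n_fft = 4096                                   # ~7.8 Hz per STFT bin, finer than the 10 Hz rows
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    rows = np.minimum((freqs // (fmax / n_rows)).astype(int), n_rows - 1)
    img = np.zeros((n_rows, S.shape[1]), dtype=np.float32)
    for stft_bin, row in enumerate(rows):
        if freqs[stft_bin] < fmax:                 # drop energy above the 13 kHz cap
            img[row] += S[stft_bin]                # sum STFT bins into 10 Hz buckets
    return img
```

For a 5-second quote this yields a 1300 × 5000 image, which is the fixed-size input a convolutional network could then train on.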