Beautifully Buffered Bytes

An exploration of HTMLMediaElement fueled with inspiration from YouTube and Google Chrome.

11 min read · Jul 23, 2015

This article has been written for educational purposes only. Leveraging any code provided within will not ensure your compliance with YouTube’s TOS.

Let me tell you a story. A story about a feature. A feature which was born out of desperation, creativity, and countless, extremely strong cups of coffee. Coffee which gave way to some fucking cool code.

First, a bit of backstory, to bring you up to speed.

For the better part of the past three years I’ve spent my waking hours mucking around in Chrome’s extension environment. Day by day, week by week, I found myself combining pieces of their browser with YouTube’s video API in a most haphazard manner. Over time, a rather sensible monstrosity was constructed. It was known as Streamus. Streamus the music player.

“And what of it?” You might inquire. “What do you have to show for all your efforts?”

“A glorious success!” I’d exclaim, a grin breaking out across my face. “300,000 loyal users and just one C&D from YouTube’s legal department!”

Oops.

Would you believe it? YouTube wants their videos to be visible. Go figure.

Fortunately, I was privy to this fact. Unfortunately, showing video is a technical nightmare. I longed to go over my numerous ideas with the powers that be prior to embarking on a prototype. I had high hopes of successfully navigating YouTube’s choppy, political waters and was expecting to soon find my extension moored on the shores of sensible middle ground.

A realistic depiction of where I would bask after the success of my software.

Unfortunately, it was not to be.

Through months of brainstorming and debate YouTube and I exhausted one potential option after another. Eventually, it became painfully clear: Streamus would need to present video within its pop-up window.

Not so bad, yeah? People have been embedding YouTube videos into tabs since man harnessed the power of fire, right?

Yep! You’re right. Show’s over. Go home.

…

No!

Oh how I would have loved to rub my proverbial code sticks together to spark a resolution. Alas, as you may or may not know, it’s never that simple in software development.

A more likely candidate for the fire I’d be developing.

Chrome extensions come with their own set of limitations:

Security concerns? No problem.
Permission requests? Got it.
Pop-up window destroying itself upon losing focus? Shit.

As a user, you’d expect to watch a video at your leisure and, on a whim, hide it in an effort to explore other avenues of the Internet. Naturally, the audio should continue even after the video has gone on its way. You know, like how navigating between tabs in a browser works. Basic stuff. Internet 101.

Unfortunately, that’s not a free feature in Chrome extension land. Hell, a sane developer wouldn’t even consider it a remote possibility.

What’s a guy to do? Is this project fated to sink into the murky depths of the Internet ocean?

Nope. It’s about to get awesome. Let’s jump into some code.

YouTube Video Splitting

Goals:

Display YouTube’s visual content detached from its audio content.
Visual content should be synchronized with its audio content.
Incur no additional bandwidth usage.
Incur no egregious performance penalties.
Support only HTML5 <video> on the stable Google Chrome channel.

Tasks:

Mitigate cross-origin resource sharing policies.
Intercept communications with video servers.
Capture necessary video information.
Render video on separate webpage.
Synchronize video with external audio source.

Ready to learn? Me too! Let’s get down to business!

Mitigate cross-origin resource sharing policies

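For those not familiar, browsers enforce a same-origin policy, which strives to prevent code from directly acting upon an external website’s information. CORS (cross-origin resource sharing) is, strictly speaking, the mechanism a server uses to opt out of that policy, but I’ll use “CORS” loosely to mean the restrictions themselves.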

Consider the following scenario:

Two websites exist: “Site A” and “Site B.”
“Site B” can show sensitive information inside of itself and has an interest in protecting said data.
“Site A” embeds an <iframe> whose URL points to “Site B.” In essence, “Site A” is hosting “Site B” within itself.

In theory, “Site A” should be able to programmatically access the information within “Site B” since “Site B” is a subset of “Site A” due to the embedded <iframe>. In practice, thanks to cross-origin restrictions, “Site A” is denied the ability to read the data.

This is all well and good for security, but, for our task at hand, it is an obstacle to overcome. As the name would imply, YouTube’s IFrame API provides an API to an embedded YouTube player which is housed inside of an <iframe>. This player is guarded by CORS policies. We are unable to directly act upon it.

One way of overcoming CORS is through asynchronous message passing. An external website, such as “Site A,” can call window.postMessage.

The window.postMessage method safely enables cross-origin communication.

“Site A” is able to request that “Site B” perform an action. If “Site B” is listening and wishes to honor the request then it may respond accordingly.
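A minimal sketch of that request-and-response exchange, seen from “Site B.” The message shape, the origins, and the `getTitle` action are all made up for illustration; only `window.postMessage` and the `message` event are real browser APIs.

```javascript
// Hypothetical message shape: { action: string, payload: any }.
// "Site B" side: only honor requests from origins it trusts.
function createRequestHandler(trustedOrigins, actions) {
  return function handleMessage(event) {
    if (!trustedOrigins.includes(event.origin)) {
      return false; // Ignore requests from unknown senders.
    }
    const request = event.data || {};
    const action = actions[request.action];
    if (typeof action !== 'function') {
      return false; // Unknown action: choose not to honor it.
    }
    const result = action(request.payload);
    // Reply to whoever asked, addressed to their origin only.
    event.source.postMessage({ action: request.action, result }, event.origin);
    return true;
  };
}

// Browser wiring (skipped outside a browser):
if (typeof window !== 'undefined' && typeof document !== 'undefined') {
  const handler = createRequestHandler(['https://site-a.example'], {
    getTitle: () => document.title,
  });
  window.addEventListener('message', handler);
  // "Site A" would ask with something like:
  // iframe.contentWindow.postMessage({ action: 'getTitle' }, 'https://site-b.example');
}
```

The important part is that “Site B” decides what to honor. If it never registers a listener, the request simply goes unanswered.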

Unfortunately, YouTube’s <iframe> doesn’t give a damn about us. We need to encourage it a bit.

Enter Chrome extension content scripts.

Content scripts are able to extend the functionality of any web page. The only caveat is that the user must grant permission to do so.

Let’s look at some code. This is manifest.json. Manifest files are used within Chrome extensions to declare software needs prior to installation.

Declaring our intent to inject arbitrary JavaScript into YouTube pages provides us with a mechanism for interacting upon their data more closely. We’ll still need to communicate through window.postMessage, but at least we’re talking!
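A rough sketch of what such a manifest could contain, assuming Manifest V2 (the format in use when this was written). It is written as a JavaScript object so it can be checked and serialized; the extension name, version, and match pattern are illustrative assumptions rather than Streamus’ real values.

```javascript
// Sketch of manifest.json contents, expressed as a JavaScript object.
// The name, version, and match pattern are illustrative assumptions.
const manifest = {
  manifest_version: 2,
  name: 'Video split example',
  version: '0.0.1',
  content_scripts: [
    {
      // Assumed pattern for YouTube's embedded player pages.
      matches: ['*://*.youtube.com/embed/*'],
      js: ['youTubeIFrameInject.js'],
      // Embedded players live inside iframes, so inject into every frame.
      all_frames: true,
      // Run before the page's own scripts get a chance to start.
      run_at: 'document_start',
    },
  ],
};

// Serialize to the text that would be saved as manifest.json.
const manifestJson = JSON.stringify(manifest, null, 2);
```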

Intercept communications with video servers

It’s great that we have a way of communicating with YouTube’s embedded website, but it’s not much use unless we have something to chat with it about!

How are we going to get the data we’re interested in?

YouTube’s API is a staggering ~50,000 lines of code. Yeah. Fifty thousand lines. Minified. Are we really going to try and read, digest, and modify their source? That would be crazy. Only someone really, really stupid would try to do that…

…Your humble author found it to be an interesting experience. I wouldn’t recommend it to anyone who values their time, but I did end up learning an immense amount regarding the inner workings of YouTube’s IFrame API.

Eventually, it dawned on me that there was a much simpler solution: override YouTube’s usage of XMLHttpRequest and provide our own, custom implementation.

First and foremost, we’re going to need permission to make this happen:

Web accessible resources will allow us to load arbitrary content from within our content script. Why is that useful? The injected code will run under a different sandbox policy than our content script.

Content scripts are sandboxed such that they have access to a subset of Chrome extension APIs, but are prevented from accessing variables scoped to their parent window. Conversely, scripts loaded into the page as web accessible resources do not have access to Chrome APIs, but they run in the page’s own JavaScript context and so are able to work more closely with their parent window.

Now, inject interceptor.js into YouTube’s iframe via youTubeIFrameInject.js:
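A minimal sketch of that injection step, assuming interceptor.js is listed under `web_accessible_resources` in the manifest. The helper name `injectPageScript` is hypothetical; `chrome.runtime.getURL` is the real extension API for resolving a packaged file’s URL.

```javascript
// Runs inside the content script. `doc` and `getUrl` are parameters so the
// function can be exercised without a browser.
function injectPageScript(doc, getUrl, fileName) {
  const script = doc.createElement('script');
  // Resolves to a chrome-extension:// URL; the file is assumed to be listed
  // under "web_accessible_resources" in manifest.json.
  script.src = getUrl(fileName);
  // Remove the tag once the script has executed in the page's context.
  script.onload = function () {
    script.remove();
  };
  (doc.head || doc.documentElement).appendChild(script);
  return script;
}

// Extension wiring (skipped when the chrome.* APIs are unavailable):
if (typeof chrome !== 'undefined' && chrome.runtime && chrome.runtime.getURL &&
    typeof document !== 'undefined') {
  injectPageScript(document, (name) => chrome.runtime.getURL(name), 'interceptor.js');
}
```

Because the script arrives through a `<script>` tag, it executes alongside YouTube’s own code rather than in the content script’s isolated world.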

Voila! We’ve magically given ourselves the ability to listen in on all XMLHttpRequest instances spawned by YouTube’s IFrame API.

Capture necessary video information

This is where things start to get a bit more technical. We’re going to need to do a few things in order to capture the data we’re interested in:

Parse responses from YouTube’s video server.
Find codec information inside the appropriate response.
Find video buffer data as it is passed to us in chunks.
Make video buffer data accessible from outside the <iframe>.

Here’s the code:

Holy moly! That’s some dense code. Fear not! I’ll break it down so that we can better understand what each piece contributes to the whole.

We’re interested in responses from YouTube’s server, not requests, but the easiest way to listen for a response is to set up an event handler beforehand. So, we start listening for the current request to finish loading.
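One way to hook every response is sketched below. It patches `send` on the prototype instead of swapping out the whole constructor, which is a simpler stand-in and not necessarily what interceptor.js does. The name `interceptResponses` is hypothetical.

```javascript
// Wraps `send` on an XMLHttpRequest-like class so every request reports back
// once it has loaded. Returns a function that undoes the patch.
function interceptResponses(XhrClass, onResponse) {
  const originalSend = XhrClass.prototype.send;

  XhrClass.prototype.send = function (...args) {
    // Register before sending so the response cannot be missed.
    this.addEventListener('load', () => onResponse(this));
    return originalSend.apply(this, args);
  };

  return function restore() {
    XhrClass.prototype.send = originalSend;
  };
}

// Page wiring (skipped where XMLHttpRequest does not exist):
if (typeof XMLHttpRequest !== 'undefined') {
  interceptResponses(XMLHttpRequest, (xhr) => {
    console.log('loaded', xhr.responseURL, xhr.responseType);
  });
}
```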

YouTube provides a plethora of codecs based on the quality, size, and encoding of a given video. Simply hard-coding a codec will result in a lot of black screens. We know that YouTube has coded this already. So, digging through their source code proves not only warranted, but fruitful.

We’re able to find and leverage YouTube’s algorithm for parsing their server’s responses. It has some questionable edge cases, but, if it’s good enough for YouTube then it’s good enough for us. Let’s go ahead and store our found codec information in a lookup table for future requests.
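The lookup table itself can be tiny. The sketch below makes no attempt to reproduce YouTube’s parsing. It only shows how to store and retrieve a codec string, and it assumes (my assumption, not something taken from the original code) that requests for the same stream share an `itag` query parameter.

```javascript
// Lookup table from a stream key to its MIME type and codec string.
const codecTable = new Map();

// Assumption: the request URL carries an `itag` query parameter that
// identifies the stream format; the real key may differ.
function streamKey(requestUrl) {
  return new URL(requestUrl).searchParams.get('itag');
}

// Stores the codec for later requests and returns the key that was used.
function rememberCodec(requestUrl, mimeCodec) {
  const key = streamKey(requestUrl);
  if (key !== null) {
    codecTable.set(key, mimeCodec);
  }
  return key;
}

// Returns the stored codec string, or undefined if none is known yet.
function lookupCodec(requestUrl) {
  return codecTable.get(streamKey(requestUrl));
}
```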

Additional responses should hopefully contain chunks of video buffer data. This is easy enough to detect via the responseType property, but the data itself isn’t much use to us unless we know how to interpret it. That’s where our codec lookup table comes in handy.

Finally, we find ourselves leveraging window.postMessage to pass the ill-gotten gains back to our home turf. However…

Be careful! There’s a major performance bottleneck to take into consideration.

Without transferable objects you’ll find yourself walking straight into Mordor if performance is “the precious.”

One does not simply walk out of an <iframe> with a reference to a huge buffer of video data!

No. You’ll need to pass a pointer. You might be thinking, “What the hell? This isn’t C. Pointers in JavaScript?” Yup. Modern browsers now support transferable objects. This allows for zero-copy transfers of data through window.postMessage.

Caveat: Once a transferable object has been transferred, it is no longer usable from its origin; the sender is left holding an empty, detached buffer. As such, instead of transferring the original buffer of data, make a copy of it and pass that around.
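A small sketch of the copy-then-transfer idea. The `post` parameter is a hypothetical adapter so the same function works with `window.postMessage` or a `MessagePort`, and the message shape is made up for illustration.

```javascript
// `post` is a hypothetical adapter with the shape (message, transferList).
// In a page it could wrap window.postMessage(message, targetOrigin, transferList).
function sendChunkCopy(responseBuffer, post) {
  // Copy first: transferring detaches the buffer on the sending side, and
  // the player still needs the original response.
  const copy = responseBuffer.slice(0);
  // Listing `copy` as a transferable hands over ownership instead of cloning.
  post({ type: 'videoChunk', byteLength: copy.byteLength, buffer: copy }, [copy]);
  return copy;
}
```

After a real transfer, `copy.byteLength` reads 0 on the sending side while the original response buffer stays intact for YouTube’s player.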

Finally, our long running background script squirrels away its newly found video data so that it’s ready for our pop-up when necessary.

Render video on separate webpage

Surprisingly, our once insurmountable problem now has a solution that’s starting to take shape. We’ve got an amorphous blob of video data ripe for presentation. What should we do with it? What can we do with it?

Well, for starters, we’re going to need to be able to show it to a user. So, let’s switch mindsets and think about the visible pop-up page. We need a web page and a video element:

A video element on its own doesn’t do us much good, though. We’ll need to spice it up with some JavaScript:

The above file, videoView.js, is a very basic view which is in charge of managing the video. Surprising, I know. The name of the file didn’t give it away at all.

videoView.js doesn’t do a whole lot on its own. It’s wholly reliant on having a child MediaSource. This child will be responsible for managing the video’s buffers of data.

Let’s take a look at what our wrappers for these Media Source Extensions objects, MediaSource and SourceBuffer, look like. This is going to feel like a lot of code, but I promise it’s not so bad:

You’re probably thinking, “Sean. What the hell. You lied. That’s a lot of code.” No it’s not. Breaking it down will make you feel better.

mediaSourceWrapper.js is mostly boilerplate. It’s in charge of managing sourceBufferWrapper.js by responding to interesting events. All you really need to understand is:

window.URL.createObjectURL: This method takes a MediaSource object and returns a URL which points to that MediaSource. This URL will be consumed by our <video> through its src attribute.
A MediaSource can reference multiple buffers of data — such as an audio and video buffer. For our purpose it will only reference one buffer — video.
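Those two points boil down to a few lines. This is a stripped-down sketch rather than the actual mediaSourceWrapper.js; the function name, the callback, and the codec string in the wiring are illustrative assumptions.

```javascript
// `createObjectURL` is passed in so the function can run without a browser;
// in a page it would be URL.createObjectURL.
function attachMediaSource(video, mediaSource, mimeCodec, createObjectURL, onBufferReady) {
  // The source only accepts buffers once the <video> has opened it.
  mediaSource.addEventListener('sourceopen', function () {
    const sourceBuffer = mediaSource.addSourceBuffer(mimeCodec);
    onBufferReady(sourceBuffer);
  });
  // Pointing the element at the object URL is what triggers 'sourceopen'.
  video.src = createObjectURL(mediaSource);
  return video.src;
}

// Page wiring (skipped where Media Source Extensions are unavailable):
if (typeof MediaSource !== 'undefined' && typeof document !== 'undefined') {
  const video = document.querySelector('video');
  if (video) {
    attachMediaSource(
      video,
      new MediaSource(),
      'video/mp4; codecs="avc1.4d401f"', // example codec string only
      (source) => URL.createObjectURL(source),
      (buffer) => console.log('source buffer ready', buffer)
    );
  }
}
```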

sourceBufferWrapper.js is just a glorified queue. It talks to our permanent background page, where we previously squirreled away video buffer responses, and feeds those chunks of data into its parent MediaSource.
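The “glorified queue” part can be sketched as follows. A SourceBuffer throws if `appendBuffer` is called while it is still `updating`, so chunks wait their turn and the `updateend` event drains them. This is a simplified stand-in for sourceBufferWrapper.js, with a hypothetical `createChunkQueue` name.

```javascript
// Feeds chunks to a SourceBuffer one at a time. A SourceBuffer rejects
// appendBuffer while `updating` is true, so pending chunks wait in a queue.
function createChunkQueue(sourceBuffer) {
  const pending = [];

  function appendNext() {
    if (sourceBuffer.updating || pending.length === 0) {
      return;
    }
    sourceBuffer.appendBuffer(pending.shift());
  }

  // 'updateend' fires after each append completes; use it to drain the queue.
  sourceBuffer.addEventListener('updateend', appendNext);

  return {
    push(chunk) {
      pending.push(chunk);
      appendNext();
    },
    size() {
      return pending.length;
    },
  };
}
```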

That’s it! 216 lines of code to create a URL which points to an open-ended queue containing chunks of video data. Simple.

Synchronize video with external audio source

The lights at the end of the tunnel are visible and, oddly enough, there seems to be some music accompanying them. Weird!

We’ve got audio on one page and visuals on another and we need to make our video’s time match YouTube’s. That’ll require interrogating their video, but we don’t have the luxury of being able to interact with it directly thanks to CORS. Our only mode of communication is asynchronous. That usually wouldn’t be a big deal, but, when we need precision within 6/100 of a second (60 ms), waiting for a response will introduce a notable delay.

Fortunately, the solution is trivial. We will continue to use asynchronous communication, but adapt by timestamping everything. This will allow us to offset our needed time by the duration we spent retrieving it.
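The arithmetic behind that offset is short. In this sketch the function names are hypothetical, the 0.06 second tolerance comes from the figure above, and both timestamps are assumed to come from clocks that share an origin.

```javascript
// Times are in seconds; the two timestamps are in milliseconds.
// Assumption: both timestamps come from clocks that share an origin.
function estimateCurrentTime(reportedTime, sentAtMs, receivedAtMs, playbackRate = 1) {
  const transitSeconds = Math.max(0, receivedAtMs - sentAtMs) / 1000;
  // The audio kept playing while the message was in flight.
  return reportedTime + transitSeconds * playbackRate;
}

// Returns the time to seek to, or null when the video is already close enough.
function correction(videoTime, expectedTime, toleranceSeconds = 0.06) {
  const drift = Math.abs(videoTime - expectedTime);
  return drift > toleranceSeconds ? expectedTime : null;
}
```

For example, a report of 10 seconds that spent 250 ms in transit means the audio is now at roughly 10.25 seconds, and the video only needs a seek if it has drifted further than the tolerance from that value.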

Inside of youTubeIFrameInject.js:

and within our extension:

This code uses a message passing technique similar to the one we’ve used previously, but is able to maintain a long lived connection.
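A long-lived connection in a Chrome extension is a port opened with `chrome.runtime.connect` and accepted through `chrome.runtime.onConnect`. The sketch below shows the shape of both ends. The port name `videoTime`, the message fields, and the function names are hypothetical.

```javascript
// Content-script side: open one named port and reuse it for every update.
// `runtime` is chrome.runtime in an extension; it is a parameter here so the
// sketch can be exercised with a stand-in. `now` supplies the timestamp.
function openTimeReporter(runtime, now) {
  const port = runtime.connect({ name: 'videoTime' }); // hypothetical port name
  return function report(currentTime) {
    port.postMessage({ currentTime, sentAt: now() });
  };
}

// Extension side: accept the port and handle each timestamped update.
function listenForTime(runtime, onTime) {
  runtime.onConnect.addListener(function (port) {
    if (port.name !== 'videoTime') {
      return; // Some other connection; not ours to handle.
    }
    port.onMessage.addListener(onTime);
  });
}
```

Opening the port once avoids paying connection setup on every time update, which matters when each message is being measured in milliseconds.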

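Something worth mentioning is the omission of Date.now in favor of performance.now. For our uses, both options would function similarly, but performance.now provides higher-precision values and is monotonic, so it is immune to modifications of the system clock. One caveat: its values are measured from each document’s own time origin, so timestamps taken in different pages must be normalized to a shared origin before they are compared. With that in mind, it’s definitely best practice to favor it.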

Code works. Ship it?

I’m not about to claim that the code I’ve shown you all should be taken to production, but…

WOW! It actually works. Can you believe it?

You might have to take my word for the audio synchronization, but, if you’re the endeavoring type, all of the code needed is at your disposal on GitHub:

MeoMix/YouTube-video-split-test on github.com: a proof of concept of detaching YouTube's video from audio without bandwidth or CPU increase.

Knock yourself out.

Final thoughts

The seed of an idea once thought impossible to grow now blooms as a testament to clever ideas and tenacious coding. An unfettered, glorious success.

Well, almost.

YouTube decided showing video in such a manner was a bit too crazy.

Damn. Well, we tried and, in the process, learned a lot about what a modern browser can accomplish! As they say, “I’d rather live a life of ‘Oh wells’ than a life of ‘What ifs’.”

Readers — it’s been fun. Until next time.

Acknowledgements

Software development is rarely a single person’s effort. I have a great deal of appreciation for the many people who helped make this code a reality:

Mozilla’s Developer Network: For providing me with the original idea through their “Manipulating video using canvas” example.
MarionetteJS: For listening to my incessant, idle ramblings on their Gitter chatroom and for providing an excellent framework with which to build my examples.
Rob W: For always inspiring and challenging me to write better code and for reviewing my code once written.

P.S.

This is my first Medium article. I’ve had a blast writing it for you all!

Feel similarly about reading it? Hit that recommend button! I’d love to have your support for future writings.