Fixing HTML Video on Mobile
How and why we built Whitewater, an open source video encoder and player for our site
The problem with mobile video
In early 2015 we started development on a redesign for This Also’s website. As the project evolved it became clear that video would be a major component of the design language. We had a large background video on the home page and footer, more large videos near the top of each project page and many smaller videos littered across most of the other pages. All of them were meant to either autoplay or play and pause themselves due to some scripted behavior. None of them required any user interface or direct interaction.
These requirements were already complex on desktop, but on mobile they were a nightmare. Common HTML5 Video features such as preloading and autoplay are completely missing in some browsers. The scripting APIs are limited compared to what’s available on desktop. Worst of all, Safari on the iPhone (the most popular mobile browser to visit our site) does not allow inline video playback at all. (Note: This limitation is being lifted in iOS 10, which is set to be released later this fall.)
GIFs weren’t an ideal workaround, and requiring users to open a video in fullscreen didn’t make sense for the kinds of videos we were using. We went looking for alternatives and discovered that other people had already devised some clever solutions.
Apple was doing it. So were a handful of news sites and other agencies. But none of them (that we could find) published their solutions or made them available to the community as a whole.
So we made our own
Instructions for using the Whitewater encoder and player are detailed elsewhere, so here we’ll explain how they work so that you can decide whether Whitewater is appropriate for your own projects.
Faking Video 101
The basics of simulating a video in the browser are fairly simple. If you draw a sequence of images inside a <canvas> tag fast enough, that’s a video. But that means creating those image assets up front. You could save each frame as a separate image, but that’s neither practical nor efficient for lengthy or high-resolution videos. To come up with a more efficient method for making those images, there were three main considerations that we balanced:
- Overall file size of all assets combined, which is an important factor for mobile devices with limited RAM
- The number of HTTP requests needed to load all of the required assets, which can have a huge impact on the performance of a page as it loads
- The amount of computational complexity on the browser required to process and recreate each video frame
Whitewater favors optimizing the first two considerations over the third. The complexity of the player might cause some slowdown on weaker devices, but loading oversized assets or too many assets is a more likely cause of browser crashes. As a result, the encoding (and therefore the reassembly) is necessarily more complex.
So how does encoding work?
The job of the encoder is twofold: it saves the visual assets needed to reassemble the video, and it provides a set of instructions for doing so. Because we want to minimize file sizes and asset count, our goal is to cut out as much duplicate visual information as possible and then condense what’s left. The encoder accomplishes this by comparing each frame to the frame preceding it and saving only the parts that have changed.
On the first frame, the encoder saves the entire image as first.jpg. This is the only time the encoder does this, but it is important because it gives the player a starting point to draw each subsequent frame on top of.
For the rest of the frames, the encoder takes both the current and previous frames and breaks them into grids of 8x8 pixel blocks. The encoder cycles through each row, block by block, comparing the current block against the corresponding one from the previous frame.
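As a rough illustration of the grid step, the sketch below splits a frame into blocks in row order. It is not Whitewater's own code: the frame is modeled as a plain list of pixel rows, `split_into_blocks` is a hypothetical helper name, and the frame dimensions are assumed to be exact multiples of the block size.

```python
def split_into_blocks(frame, block_size=8):
    """Return the frame's blocks in row-major order (left to right, top to bottom).

    `frame` is a list of pixel rows; each block is a list of `block_size` rows.
    Assumes width and height are exact multiples of `block_size`.
    """
    height, width = len(frame), len(frame[0])
    blocks = []
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            block = [row[left:left + block_size]
                     for row in frame[top:top + block_size]]
            blocks.append(block)
    return blocks


# A 16x16 grayscale frame in which every pixel stores its own x coordinate.
frame = [[x for x in range(16)] for _ in range(16)]
blocks = split_into_blocks(frame)
print(len(blocks))     # 4 blocks: a 2x2 grid
print(blocks[1][0])    # first pixel row of the top-right block: [8, 9, ..., 15]
```

Running the same function over the current and previous frames gives two lists whose entries line up by position, so each pair can be compared directly.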
It compares these blocks to determine not just whether they are different, but to what degree they are different. This gives us flexibility to set thresholds of allowable variability to account for things like film grain or small compression artifacts present in the source video. The method used for getting this value is to find the difference between two blocks, then calculate the Root-Mean-Square (RMS) from the histogram of the resulting image.
Setting a higher RMS threshold equates to more leniency, which reduces the number of blocks saved. This can reduce overall file size, but possibly at the expense of video quality. A lower threshold will have the opposite effect. Setting the threshold to zero tells the encoder to save every block (which is generally a bad idea).
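To make the RMS comparison concrete, here is a minimal sketch under some stated assumptions: blocks are single-channel (grayscale) lists of pixel rows with values from 0 to 255, and `block_rms` and `block_changed` are hypothetical names rather than functions from the encoder. A real encoder working on color images would need to account for each channel.

```python
import math


def block_rms(block_a, block_b):
    """RMS of the per-pixel difference between two equally sized grayscale blocks."""
    # Histogram of the difference image:
    # histogram[v] = number of pixels whose values differ by v.
    histogram = [0] * 256
    pixel_count = 0
    for row_a, row_b in zip(block_a, block_b):
        for a, b in zip(row_a, row_b):
            histogram[abs(a - b)] += 1
            pixel_count += 1
    # Weight each difference value by its square, then take the root of the mean.
    sum_of_squares = sum(count * value ** 2
                         for value, count in enumerate(histogram))
    return math.sqrt(sum_of_squares / pixel_count)


def block_changed(block_a, block_b, threshold):
    # A block counts as changed only when its RMS is above the threshold.
    return block_rms(block_a, block_b) > threshold


previous = [[100] * 8 for _ in range(8)]
noisy = [[102] * 8 for _ in range(8)]      # every pixel off by 2, e.g. film grain
print(block_rms(previous, noisy))           # 2.0
print(block_changed(previous, noisy, 5))    # False: within tolerance
print(block_changed(previous, noisy, 1))    # True: saved to a diffmap
```

The example shows the trade-off described above: the same slightly noisy block is skipped at a threshold of 5 but saved at a threshold of 1.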
If the RMS value is above our threshold, the encoder saves that block into an image called a diffmap. Diffmaps are a series of images that store the blocks needed by the player. Diffmaps fill up from left to right, top to bottom. Every diffmap for a single video is the same predefined size. When one fills up, it is saved and a new, blank diffmap is created to accommodate more blocks.
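The fill order means the position of any saved block can be worked out from its running index alone. The sketch below shows that arithmetic; `diffmap_slot` is a hypothetical helper, and the 64x32 pixel diffmap in the example is an arbitrary size chosen for illustration, not an encoder default.

```python
def diffmap_slot(n, map_width, map_height, block_size=8):
    """Locate the n-th saved block (0-based).

    Returns (diffmap index, x, y), with x and y in pixels inside that diffmap.
    """
    per_row = map_width // block_size
    per_map = per_row * (map_height // block_size)
    map_index, slot = divmod(n, per_map)   # which diffmap, and which slot in it
    row, col = divmod(slot, per_row)       # left to right, then top to bottom
    return map_index, col * block_size, row * block_size


# A 64x32 diffmap holds 8 blocks per row and 4 rows: 32 blocks in total.
print(diffmap_slot(0, 64, 32))    # (0, 0, 0)  first block, top-left corner
print(diffmap_slot(9, 64, 32))    # (0, 8, 8)  second row, second column
print(diffmap_slot(32, 64, 32))   # (1, 0, 0)  first block of the next diffmap
```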
The default output format is JPEG. Because JPEG compression works on 8x8 pixel blocks, storing our own blocks in the same dimensions prevents JPEG compression artifacts from spilling out into unrelated parts of the video. Both the format and the blocksize can be changed in the encoder settings, but it is recommended that the 8 pixel default is kept when using JPEGs.
There is one more important thing that happens when block pairs are determined to be different: An internal consecutive counter is incremented. If there are consecutive different blocks beside one another in a frame it is useful to know how many. Knowing that there are seven blocks in a row to be copied onto the <canvas> lets us reduce the number of copy-paste actions by six.
The counter continues to be incremented until either…
- The consecutive chain is broken by a block pair that is not different,
- The encoder reaches the end of the current row in the video frame,
- The current row of the diffmap being used fills up, or
- The entire diffmap fills up.
In all of these cases, the counter is reset to zero and some metadata is stored which will instruct the player on how to reassemble a video from the diffmaps. That metadata is a 5-character string which represents two things: the location on a frame where a block group originates and how many consecutive blocks that group contains.
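The grouping rules can be sketched as a single pass over one frame's cells. This is an illustration under assumptions, not the encoder's implementation: `group_runs` and its parameters are hypothetical names, cells are visited in row order, and a full diffmap is treated as a special case of a full diffmap row, since a diffmap can only fill up when its last row does.

```python
def group_runs(changed, frame_cols, map_cols, map_col=0):
    """Group changed cells into (first_cell, run_length) pairs.

    changed:    one boolean per cell of the frame, in row-major order
    frame_cols: number of blocks per row of the video frame
    map_cols:   number of blocks per row of the diffmap
    map_col:    diffmap column where the next saved block will be written
    """
    runs = []
    start, count = None, 0
    for cell, is_changed in enumerate(changed):
        if is_changed:
            if count == 0:
                start = cell
            count += 1
            map_col += 1
        end_of_frame_row = (cell + 1) % frame_cols == 0
        diffmap_row_full = map_col == map_cols
        # Any of the stop conditions closes the current run.
        if count and (not is_changed or end_of_frame_row or diffmap_row_full):
            runs.append((start, count))
            count = 0
        if diffmap_row_full:
            map_col = 0   # continue on the next diffmap row (or a new diffmap)
    if count:             # only reached if the input ends mid-row
        runs.append((start, count))
    return runs


# A frame that is 10 blocks wide and 3 blocks tall.
changed = [False] * 30
for cell in (2, 3, 4, 5, 6, 7, 8, 18, 19, 20):
    changed[cell] = True
print(group_runs(changed, frame_cols=10, map_cols=16))
# [(2, 7), (18, 2), (20, 1)]
```

Cells 2 through 8 become one run of seven, so the player can copy them in a single operation. Cells 18 to 20 are adjacent in numbering but split into two runs because cell 19 ends a row of the frame.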
When the encoder broke each frame up into a grid, it numbered each cell sequentially. The location we store is the cell number of the first block in a consecutive block group, converted to base 64 and padded to 3 digits in length. The number of consecutive blocks is similarly converted to base 64, and then padded to 2 digits. Both are converted to strings and then concatenated to make one 5-character string. After the encoder finishes an entire frame, all of that frame’s strings are concatenated into one long string and appended to an array called “frames.”
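A small sketch of how such a string could be produced follows. The digit alphabet is an assumption made for illustration (the text above does not say which 64 characters the encoder uses), and the function names are hypothetical.

```python
# Hypothetical 64-character digit set; the encoder's real alphabet may differ.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/"


def to_base64_digits(number, width):
    """Convert a non-negative integer to base 64, left-padded to `width` digits."""
    digits = ""
    while number > 0:
        number, remainder = divmod(number, 64)
        digits = ALPHABET[remainder] + digits
    if len(digits) > width:
        raise ValueError("number does not fit in the requested width")
    return digits.rjust(width, ALPHABET[0])


def encode_run(first_cell, run_length):
    """Build the 5-character string: 3 digits of location plus 2 digits of count."""
    return to_base64_digits(first_cell, 3) + to_base64_digits(run_length, 2)


def encode_frame(runs):
    """Concatenate every run of one frame into a single string."""
    return "".join(encode_run(cell, length) for cell, length in runs)


print(encode_run(2, 7))                      # 00207
print(encode_run(100, 12))                   # 01a0C
print(encode_frame([(2, 7), (100, 12)]))     # 0020701a0C
```

With these widths, three digits can address 64³ = 262,144 cells and two digits can count runs of up to 64² − 1 = 4,095 blocks, whatever alphabet is used.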
This process continues until the final frame. Once all of the images have been created, the encoder stores the metadata in a JSON file called manifest.json. The frames array becomes part of this, along with information about the source video (dimensions, FPS, frame count), the number of diffmaps created and their format, the block size used, the diffmap dimensions and the version of the encoder used.
The final output is a directory of files. When initializing a video with the player, you point to this folder as the video source.
Results may vary
Now that we understand the process itself, we have a better understanding of why we might want to update some of the encoder settings. You may want to reduce the number of assets by increasing the dimensions of the diffmaps or increasing the RMS threshold. Or you might find that your video only uses one diffmap and does not completely fill it, so reducing the dimensions might be desirable. Depending on the kind of video you’re encoding, you might even find that using GIF or PNG as the output filetype saves some space over JPEG.
The video player
From here, it’s up to the user or developer to execute the play() function. This function uses requestAnimationFrame() to handle the logic of drawing each frame and manage the interval between frames. In many cases, requestAnimationFrame fires faster than we need, so the FPS is used to calculate the minimum time that must pass before the next frame should be drawn.
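The timing check can be modeled outside the browser. The sketch below is written in Python purely to illustrate the logic; the real player runs in JavaScript, and `frames_to_draw` is a hypothetical name. It takes the timestamps at which a callback fired and reports which of them would draw a new frame.

```python
def frames_to_draw(callback_times_ms, fps):
    """Return the callback timestamps at which a new frame would be drawn."""
    min_interval = 1000 / fps      # minimum milliseconds between drawn frames
    drawn = []
    last_draw = None
    for now in callback_times_ms:
        # Draw on the first callback, then only once enough time has passed.
        if last_draw is None or now - last_draw >= min_interval:
            drawn.append(now)
            last_draw = now
    return drawn


# Made-up numbers: a callback every 10 ms and a 20 FPS video (50 ms per frame).
callbacks = list(range(0, 201, 10))
print(frames_to_draw(callbacks, fps=20))    # [0, 50, 100, 150, 200]
```

Callbacks that arrive too early are simply skipped. Note that when the frame interval is not a whole multiple of the callback interval, a simple check like this one draws slightly less often than the nominal FPS.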
The player works by layering the differences of the next frame on top of the previous one. For each frame, it takes what has changed and repaints that on top of the previous frame. In this way, the video is built up a layer at a time, which is why it needs that first frame to serve as the base.
For each frame, one entry of the frames array from manifest.json is decoded into base 10 and used to grab the necessary blocks from the current diffmap. Rather than drawing these blocks directly to the <canvas>, they’re first preassembled into a single image in memory and then drawn. This way, each frame is rendered as a whole, removing the possibility of a frame being only partially rendered.
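The decoding step might look something like the sketch below, which turns one frame's string into pixel coordinates for each block group. It is written in Python only to show the arithmetic; the real player is JavaScript. The digit alphabet, the function names, and the 320-pixel-wide video in the example are all assumptions made for illustration.

```python
# Hypothetical 64-character digit set; the real alphabet may differ.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/"


def from_base64_digits(digits):
    """Convert a base-64 digit string back to a base-10 integer."""
    value = 0
    for character in digits:
        value = value * 64 + ALPHABET.index(character)
    return value


def decode_frame(frame_string, frame_cols, block_size=8):
    """Turn one frames-array entry into a list of (x, y, width) in pixels.

    Each tuple says where on the frame a block group starts and how wide it is.
    `frame_cols` is the number of blocks per row of the video frame.
    """
    instructions = []
    for i in range(0, len(frame_string), 5):
        chunk = frame_string[i:i + 5]
        first_cell = from_base64_digits(chunk[:3])   # 3 digits: location
        run_length = from_base64_digits(chunk[3:])   # 2 digits: block count
        row, col = divmod(first_cell, frame_cols)
        instructions.append(
            (col * block_size, row * block_size, run_length * block_size)
        )
    return instructions


# A 320-pixel-wide video has 40 blocks per row when blocks are 8 pixels wide.
print(decode_frame("0020701a0C", frame_cols=40))
# [(16, 0, 56), (160, 16, 96)]
```

Each tuple corresponds to one copy operation: a strip one block tall and `width` pixels wide, taken from the next unread position in the diffmap and placed at (x, y) on the frame.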
The end result
Our goal for this project was to develop a workaround for some of the limitations of HTML5 Video. The Whitewater system provides developers with preloading, inline playback, some DOM events, and a set of scripting APIs that aren’t reliant on user action to trigger.
It should be noted that Whitewater is not meant to replace all video on mobile. There is no support for audio. We’ve also found that in many cases the aggregate file size of all the needed assets will be greater than a regular video file. In fact, for videos of any considerable length, the file size can be too large to be useful. However, in the circumstances in which you need to get around the limitations of mobile browsers and the drawbacks of Whitewater are not an issue, this system provides you with an option.