H265 on WebRTC without Using DataChannel (2/2)

Angcar
7 min read · Oct 24, 2023


In this post, I will demonstrate how to show H.265 frames through WebRTC, as mentioned in the previous post.


Media Source Extensions (MSE)

The Media Source API, formally known as Media Source Extensions (MSE), provides functionality enabling plugin-free web-based streaming media. Using MSE, media streams can be created via JavaScript, and played using <audio> and <video> elements.
https://developer.mozilla.org/en-US/docs/Web/API/Media_Source_Extensions_API

MSE supports fragmented MP4 (fMP4) containing H.265 frames. An fMP4 file consists of ftyp and moov boxes followed by multiple moof/mdat pairs.

https://bitmovin.com/container-formats-fun-1/

Playing a streaming H.265 video on MSE is like playing an endless fMP4 file: each chunk contains a moof/mdat pair, and the first chunk is prepended with ftyp and moov.

There are two checkpoints before playing a streaming video.
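For example, support for the H.265 codec string used below can be verified up front with MediaSource.isTypeSupported() (a quick sanity check, nothing more):

// returns false on browsers that cannot decode H.265 fMP4 via MSE
if (!MediaSource.isTypeSupported('video/mp4; codecs="hev1.1.6.L150"')) {
  console.warn("H.265 (HEVC) in fMP4 is not supported by this browser");
}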

Implement MSE

I wrote a WebSocket server to send fMP4 data to a web page. To generate sample fMP4 videos, I recommend downloading Bento4.

Bento4 MP4, DASH, HLS, CMAF SDK and Tools

A fast, modern, open source C++ toolkit for all your MP4 and DASH/HLS/CMAF media format needs.

# convert demo.mp4 to a fMP4 file fdemo.mp4
$ ./bento4/bin/mp4fragment demo.mp4 fdemo.mp4

# show fMP4 info
$ ./bento4/bin/mp4info fdemo.mp4

# dump fMP4 boxes
$ ./bento4/bin/mp4dump fdemo.mp4

[ftyp] size=8+24
...
[moov] size=8+732
...
[moof] size=8+216
...
[mdat] size=8+67484
[moof] size=8+216
...
[mdat] size=8+67584
[moof] size=8+216
...
[mdat] size=8+69004

If you want an interactive viewer online, you can browse the MP4Box.js website.

The frontend code is written in plain JavaScript, without React or Vue. There is a download function to inspect what has been received over the WebSocket, and the page can also be used to check support for common MIME types.

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <title>WebSocket Frontend</title>
  </head>
  <body>
    <div id="output"></div>
    <video
      id="video"
      width="640"
      height="480"
      autoplay
      style="border: 1px solid black"
    ></video>
    <br />
    <button id="play" onclick="play()">Play</button>
    <button id="stop" onclick="stop()">Stop</button>
    <button id="download" onclick="download()">Download</button>
    <script>
      let recordedChunks = [];
      let ws = null;

      function play() {
        const ms = new MediaSource();
        const player = document.getElementById("video");

        player.src = URL.createObjectURL(ms);

        player.addEventListener("error", () => {
          console.log("player error");
        });

        ms.addEventListener("sourceopen", () => {
          const sb = ms.addSourceBuffer('video/mp4; codecs="hev1.1.6.L150"');
          ws = new WebSocket("ws://localhost:8000");

          sb.mode = "sequence";

          sb.addEventListener("updateend", () => {
            console.log(
              sb.buffered,
              sb.buffered.start(0),
              "==>",
              sb.buffered.end(0),
              sb.mode
            );
          });
          sb.addEventListener("error", () => {
            console.log("source buffer error");
          });

          ws.addEventListener("open", () => {
            console.log("open");
            // clear previously recorded data
            recordedChunks = [];
          });

          ws.addEventListener("message", (event) => {
            // keep a copy of every chunk so download() can rebuild the file
            recordedChunks.push(event.data);
            event.data.arrayBuffer().then((buffer) => {
              sb.appendBuffer(buffer);
            });
          });
        });
        ms.addEventListener("error", () => {
          console.log("media source error");
        });
      }

      function stop() {
        if (ws) ws.close();
      }

      function download() {
        const blob = new Blob(recordedChunks, { type: "video/mp4" });
        const url = URL.createObjectURL(blob);
        const a = document.createElement("a");
        document.body.appendChild(a);
        a.style = "display: none";
        a.href = url;
        a.download = "test.mp4";
        a.click();
        window.URL.revokeObjectURL(url);
      }
    </script>
  </body>
</html>

The WebSocket server is written with Deno; it reads an fMP4 file and sends it to the frontend.


const demo = Deno.readFileSync("fdemo.mp4");

Deno.serve((req) => {
  if (req.headers.get("upgrade") != "websocket") {
    return new Response(null, { status: 501 });
  }
  const { socket, response } = Deno.upgradeWebSocket(req);
  socket.addEventListener("open", () => {
    console.log("a client connected!");
    setTimeout(() => {
      socket.send(demo);
    }, 1000);
  });
  socket.addEventListener("message", (event) => {
    console.log("message", event);
  });
  socket.addEventListener("close", () => {
    console.log("a client disconnected!");
  });
  return response;
});

Everything works well so far.

Write a Pure JavaScript Packetizer

Because it is impossible to call Bento4 in the browser to generate fMP4 files (without using WASM), I want to write a pure JavaScript fMP4 packetizer for H.265. I referenced src/remux/mp4-generator.ts in the hls.js repository, which predefines many boxes but only supports H.264 (avc1).
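For context, every snippet below leans on an hls.js-style MP4.box helper that prefixes a payload with its 32-bit size and 4-byte type. A simplified sketch of that method (the real generator pre-registers the type byte arrays in MP4.types):

static box(type: Uint8Array, ...payload: Uint8Array[]) {
  // total size = 8-byte header (size + type) plus all payload bytes
  const size = 8 + payload.reduce((sum, p) => sum + p.byteLength, 0)
  const result = new Uint8Array(size)

  // 32-bit big-endian box size
  result[0] = (size >> 24) & 0xff
  result[1] = (size >> 16) & 0xff
  result[2] = (size >> 8) & 0xff
  result[3] = size & 0xff

  // 4-byte box type, e.g. MP4.types.hev1
  result.set(type, 4)

  // concatenate payloads (child boxes or raw fields)
  let offset = 8
  for (const p of payload) {
    result.set(p, offset)
    offset += p.byteLength
  }
  return result
}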

Add hev1 and hvcC

To support H.265 from raw frames, I need to add hev1 and hvcC boxes. Two Chinese websites (1, 2) really helped me understand the structure of each box. hev1 and hvcC sit under stsd, replacing avc1 and avcC.

[moov] size=8+732
  [mvhd] size=12+96
  [trak] size=8+560
    [tkhd] size=12+80, flags=7
    [mdia] size=8+460
      [mdhd] size=12+20
      [hdlr] size=12+41
      [minf] size=8+367
        [vmhd] size=12+8, flags=1
        [dinf] size=8+28
          [dref] size=12+16
            [url ] size=12+0, flags=1
        [stbl] size=8+303
          [stsd] size=12+223
            [hev1] size=8+211 <-------------------
              [hvcC] size=8+105 <-----------------
              [btrt] size=8+12
          [stts] size=12+4
          [stsc] size=12+4
          [stsz] size=12+8
          [stco] size=12+4
  [mvex] size=8+48
    [mehd] size=12+4
    [trex] size=12+20

The hev1 and hvcC boxes are assembled as follows. track.vps/sps/pps are the parameter-set NALUs from H.265 keyframes.

Of note, track.vps/sps/pps do not include start codes! Payloads in mdat do not include start codes either.

static hev1(track: Track) {
  let vps: number[] = []
  let sps: number[] = []
  let pps: number[] = []
  let len

  // assemble the VPS
  vps.push(0x20)
  vps.push(0x00)
  vps.push(0x01) // vps count
  len = track.vps.byteLength
  vps.push((len >>> 8) & 0xff)
  vps.push(len & 0xff)
  vps = vps.concat(Array.prototype.slice.call(track.vps))

  // assemble the SPS
  sps.push(0x21)
  sps.push(0x00)
  sps.push(0x01) // sps count
  len = track.sps.byteLength
  sps.push((len >>> 8) & 0xff)
  sps.push(len & 0xff)
  sps = sps.concat(Array.prototype.slice.call(track.sps))

  // assemble the PPS
  pps.push(0x22)
  pps.push(0x00)
  pps.push(0x01) // pps count
  len = track.pps.byteLength
  pps.push((len >>> 8) & 0xff)
  pps.push(len & 0xff)
  pps = pps.concat(Array.prototype.slice.call(track.pps))

  const hvcc = MP4.box(
    MP4.types.hvcC,
    new Uint8Array(
      [
        // those magic bytes are copied from fdemo.mp4
        0x01, 0x01, 0x60, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x96, 0xf0, 0x00, 0xfc, 0xfd, 0xf8, 0xf8, 0x00, 0x00, 0x0f,
        0x03,
      ]
        .concat(vps)
        .concat(sps)
        .concat(pps),
    ),
  )

  return MP4.box(
    MP4.types.hev1,
    new Uint8Array([
      ...
    ]), // pre_defined = -1
    hvcc,
    MP4.box(
      MP4.types.btrt,
      new Uint8Array([
        ...
      ]),
    ), // avgBitrate
  )
}

static mdat(track: Track) {
  const vps = track.vps
  const sps = track.sps
  const pps = track.pps
  const frame = track.frame

  return MP4.box(
    MP4.types.mdat,
    new Uint8Array(
      [
        (vps.byteLength >>> 24) & 0xff,
        (vps.byteLength >>> 16) & 0xff,
        (vps.byteLength >>> 8) & 0xff,
        vps.byteLength & 0xff,
      ]
        .concat(Array.prototype.slice.call(vps))
        .concat([
          (sps.byteLength >>> 24) & 0xff,
          (sps.byteLength >>> 16) & 0xff,
          (sps.byteLength >>> 8) & 0xff,
          sps.byteLength & 0xff,
        ])
        .concat(Array.prototype.slice.call(sps))
        .concat([
          (pps.byteLength >>> 24) & 0xff,
          (pps.byteLength >>> 16) & 0xff,
          (pps.byteLength >>> 8) & 0xff,
          pps.byteLength & 0xff,
        ])
        .concat(Array.prototype.slice.call(pps))
        .concat([
          (frame.byteLength >>> 24) & 0xff,
          (frame.byteLength >>> 16) & 0xff,
          (frame.byteLength >>> 8) & 0xff,
          frame.byteLength & 0xff,
        ])
        .concat(Array.prototype.slice.call(frame)),
    ),
  )
}

NOTE: track here is not the same as the one in hls.js.
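For reference, the Track shape these snippets assume looks roughly like this (inferred from the fields used in this post, not the hls.js type):

interface Track {
  id: number       // track_ID written into tfhd
  fps: number      // used to derive DefaultSampleDuration
  dt: number       // running BaseMediaDecodeTime written into tfdt
  vps: Uint8Array  // parameter-set NALUs without start codes
  sps: Uint8Array
  pps: Uint8Array
  frame: Uint8Array // current frame payload without start codes
}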

BaseMediaDecodeTime & DefaultSampleDuration

I encountered an issue where the player froze on the first frame and took 20 or more seconds to update after the first chunk (which included a keyframe) was sent to MSE. Upon further investigation, I found that the problem was related to the default sample duration of a frame.

Relations among real time, BaseMediaDecodeTime in traf box (version 0), and buffered.end(0)

To play a live view smoothly, BaseMediaDecodeTime should be increased by DefaultSampleDuration. timeBase is an arbitrary value, but should not be less than 10 according to my tests.

Because BaseMediaDecodeTime may overflow after playing for a long time, it is possible to use a smaller frame interval. For example, if the fps is 25 and the real frame interval is 40 ms, I can use a minimum of 10 ms, so DefaultSampleDuration goes from 40 * 10 down to 10 * 10. Of course, using a smaller frame interval also slows down the growth of buffered.end(0).

const timeBase = 10

static traf(track: Track) {
  const baseMediaDecodeTime = track.dt
  const defaultSampleDuration = timeBase * Math.floor(1000 / track.fps)
  const id = track.id
  const size =
    track.vps.byteLength +
    track.sps.byteLength +
    track.pps.byteLength +
    track.frame.byteLength +
    16

  return MP4.box(
    MP4.types.traf,
    MP4.box(
      MP4.types.tfhd,
      new Uint8Array([
        0x00, // version 0
        0x02,
        0x00,
        0x38, // flags
        id >> 24,
        (id >> 16) & 0xff,
        (id >> 8) & 0xff,
        id & 0xff, // track_ID
        (defaultSampleDuration >>> 24) & 0xff,
        (defaultSampleDuration >>> 16) & 0xff,
        (defaultSampleDuration >>> 8) & 0xff,
        defaultSampleDuration & 0xff,
        (size >>> 24) & 0xff,
        (size >>> 16) & 0xff,
        (size >>> 8) & 0xff,
        size & 0xff,
        0x01,
        0x01,
        0x00,
        0x00,
      ]),
    ),
    MP4.box(
      MP4.types.tfdt,
      new Uint8Array([
        0x00, // version 0
        0x00,
        0x00,
        0x00, // flags
        baseMediaDecodeTime >> 24,
        (baseMediaDecodeTime >> 16) & 0xff,
        (baseMediaDecodeTime >> 8) & 0xff,
        baseMediaDecodeTime & 0xff, // sum of the decode durations of all earlier samples in the media
      ]),
    ),
    MP4.trun(
      28 + // tfhd
        16 + // tfdt
        8 + // traf header
        16 + // mfhd
        8 + // moof header
        8, // mdat header
    ),
  )
}
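The one piece not shown above is how track.dt advances. With timeBase = 10 and timescale = 1000, the mdhd timescale (below) is 10 * 1000 = 10000 ticks per second, so a 25 fps frame lasts 400 ticks. A rough sketch of the per-fragment bookkeeping, assuming track is the same object the packetizer mutates between fragments:

// 25 fps -> 1000/25 = 40 ms per frame -> 10 * 40 = 400 ticks
const defaultSampleDuration = timeBase * Math.floor(1000 / track.fps)

// after emitting each moof/mdat pair, advance the decode clock so the next
// fragment's tfdt (BaseMediaDecodeTime) equals the sum of all earlier durations
track.dt += defaultSampleDuration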

In the end, I changed mdhd to version 0.

const timeBase = 10
const timescale = 1000

static mdhd() {
  const ts = timeBase * timescale

  return MP4.box(
    MP4.types.mdhd,
    new Uint8Array([
      0x00, // version 0
      0x00,
      0x00,
      0x00, // flags
      0x00,
      0x00,
      0x00,
      0x02, // creation_time
      0x00,
      0x00,
      0x00,
      0x03, // modification_time
      (ts >> 24) & 0xff,
      (ts >> 16) & 0xff,
      (ts >> 8) & 0xff,
      ts & 0xff, // timescale
      0x00,
      0x00,
      0x00,
      0x00, // duration
      0x55,
      0xc4, // 'und' language (undetermined)
      0x00,
      0x00,
    ]),
  )
}

WebRTC Insertable Streams

Now I have a pure JavaScript H.265 fMP4 packetizer. But how can we process data from WebRTC? The answer is WebRTC Insertable Streams.

I can get encoded video frames (RTCEncodedVideoFrame) in the transform function; RTCEncodedVideoFrame.data contains the data received from WebRTC.
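A rough sketch of the receive-side hookup, assuming Chrome's encodedInsertableStreams flavor of the API and a hypothetical handleEncodedFrame() standing in for the parsing and packetizing steps described below:

// pc must be created with new RTCPeerConnection({ encodedInsertableStreams: true })
pc.ontrack = (event) => {
  const { readable, writable } = event.receiver.createEncodedStreams();

  const transform = new TransformStream({
    transform(encodedFrame, controller) {
      // encodedFrame is an RTCEncodedVideoFrame; .data is an ArrayBuffer
      handleEncodedFrame(encodedFrame.data);
      controller.enqueue(encodedFrame);
    },
  });

  readable.pipeThrough(transform).pipeTo(writable);
};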

Parse the top 8 bytes to check whether the frame is an H.265 keyframe, even though the packets arrive as H.264 P frames (the packing trick from the previous post). If it is a keyframe, read the NALU-position string size and extract the string with TextDecoder.decode(). Finally, the NALU position map separates the VPS/SPS/PPS.

Be sure to cache the VPS, SPS, and PPS so they can be packetized together with P frames.
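A minimal sketch of that caching, where parseKeyframe() and packetize() are hypothetical stand-ins for the splitting step above and the fMP4 packetizer:

// keep the most recent parameter sets so delta frames can still be packetized
let cached = { vps: null, sps: null, pps: null };

function onFrame(data, isKeyframe) {
  if (isKeyframe) {
    const { vps, sps, pps, frame } = parseKeyframe(data); // hypothetical splitter
    cached = { vps, sps, pps };
    return packetize({ ...cached, frame });
  }
  // P frame: reuse the VPS/SPS/PPS cached from the last keyframe
  return packetize({ ...cached, frame: data });
}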

Again

Of note, track.vps/sps/pps do not include start codes! Payloads in mdat do not include start codes either.

Use a Queue

Frames cannot be appended to the SourceBuffer while it is still updating, that is, before the updateend event fires. Use a queue to buffer incoming frames and append one frame at a time to the SourceBuffer from the updateend callback.
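A minimal sketch of such a queue, assuming sb is the SourceBuffer created in the frontend code above:

const queue = [];

function enqueue(buffer) {
  queue.push(buffer);
  flush();
}

function flush() {
  // appendBuffer() throws if the SourceBuffer is still updating
  if (!sb.updating && queue.length > 0) {
    sb.appendBuffer(queue.shift());
  }
}

// append the next queued frame as soon as the previous one has been processed
sb.addEventListener("updateend", flush);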
