Building Myntra’s Video Platform: Part 2
This is the second article in the series ‘Building Myntra’s Video Platform’. In case you haven’t read the first part, you may refer to the article here.
In the previous article, we touched upon the types of streaming, the various technologies that exist, and the ones best suited today. In this article, we will delve deeper into aspects like encoders and decoders, the kinds of client players you can have, glass-to-glass latency, stream security and the protocols you can use for ingestion, among others.
Video On Demand (VOD) vs Live Streaming
It is important to understand the distinction between Video On Demand (VOD) and Live Streaming. In layman's terms, VOD is pre-recorded content that users can access whenever and wherever they want. Live Streaming allows users to view content from the creator in real time.
Both VOD and Live Stream experiences can be powered through HLS. The manifest files differ slightly: the Master Playlist remains the same, but the Media Playlists differ.
The Media Playlist for VOD has a #EXT-X-ENDLIST tag. For a Live Stream, on the other hand, the Media Playlist keeps changing continuously and describes the currently active video segments. It has a property known as Playlist Length, defined as the maximum number of segments or entries a Media Playlist can contain. As the live stream continues, new segments get added; once the Playlist Length is breached, the oldest segments are removed. Playlist Length is not applicable to VOD, where the playlist contains entries for all segments.
The #EXT-X-MEDIA-SEQUENCE attribute keeps changing in a live stream. It conveys the sequence number of the first segment currently in the playlist, counted from the start of the stream.
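To make this concrete, here is a minimal, hypothetical pair of Media Playlists (representative values, not from our production system). The live playlist is a sliding window whose media sequence keeps increasing, while the VOD playlist lists every segment and ends with #EXT-X-ENDLIST:

#EXTM3U
# Live Media Playlist: a sliding window of the 3 most recent segments
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:4
#EXT-X-MEDIA-SEQUENCE:1043
#EXTINF:4.0,
segment1043.ts
#EXTINF:4.0,
segment1044.ts
#EXTINF:4.0,
segment1045.ts

#EXTM3U
# VOD Media Playlist: all segments, terminated by ENDLIST
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:4
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:4.0,
segment0.ts
#EXTINF:4.0,
segment1.ts
#EXT-X-ENDLIST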
Segment Alignment
In order for the live stream to work properly, it is important to make sure that segments align perfectly across the different renditions, so that players can switch between the streams seamlessly.
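A common way to achieve this is to force keyframes at fixed timestamps and disable scene-cut keyframes, so every rendition cuts segments at identical points. A minimal FFmpeg sketch, assuming 4-second segments (file names and settings are illustrative):

# Keyframe exactly every 4 seconds; no extra keyframes on scene changes,
# so all renditions segment at the same timestamps.
ffmpeg -i input.mp4 -c:v libx264 \
  -force_key_frames "expr:gte(t,n_forced*4)" \
  -sc_threshold 0 \
  -c:a aac -f hls -hls_time 4 out.m3u8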
Bitrate Ladder
As we discussed in the previous article, in ABR clients switch between different bitrates depending on network quality, device type, the size of the video they need, etc. A bitrate ladder corresponds to the Master Playlist in the case of HLS: each bitrate entry declares the bandwidth, video size, etc. based on which a client can choose it. Please refer to the Master Playlist section in the previous article for more details.
There are a lot of sample bitrate ladders available online which can be used right away. While these ladders work fairly well, the experience for your users may not be ideal.
Some of the most important things to keep in mind are:
Content being offered
Content plays a very important role when defining the bitrate ladder. For example, sports content has fast motion (and hence many key frames) and may require higher bitrates, while talk shows, where there is not a lot of motion, may work well with lower bitrates too. So for a given video dimension, say 480p, 720p or 1080p, you may be able to pack the video at a much lower bitrate in the latter case.
Target audience
It is important to understand the devices and the type of internet your users have access to. You may offer a 4K quality, but if most of your users are on slow 4G connections, they will not benefit much from it, and CPU and storage will be wasted generating and keeping that bitrate. If the target audience has slow internet, you may instead want more bitrates in the lower speed range so that they benefit from switching.
Content space
If the video will only be shown in a small part of your app or website, with no option to go full screen, you may want to stick to lower resolutions in your ladder, since clients do not benefit from higher resolutions in a small space beyond a point. While some clients are smart enough to pick the video quality based on the space the video plays in, others are not, and those will end up wasting bandwidth on the client side. So being diligent helps.
Apart from the above, a few points to note:
- Too many bitrates don't help either: the more bitrates you have, the more processing is needed on the transcoder side. Depending on the requirements, a detailed evaluation may be needed, but typically 4–5 bitrates suffice.
- Audio and video parameters need to remain consistent throughout a rendition, i.e. the 480p stream cannot suddenly change its audio format mid-stream; it needs to remain the same throughout.
Based on the above, we defined custom bitrate ladders for the different kinds of content we offer on the platform. A hypothetical ladder is sketched below.
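Expressed as Master Playlist entries, such a 4-rung ladder might look like this (the bandwidth and resolution values are illustrative, not our production numbers):

#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1400000,RESOLUTION=842x480
480p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720
720p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/index.m3u8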
Encoders and Decoders
Encoding is the process of compressing raw video, which contains a large amount of data, into a smaller format, thus reducing its size. Decoding is essentially the reverse: a decoder takes an encoded video and converts it into a form which client players can consume.
Codecs
A video codec is software or hardware that compresses and decompresses digital video.
Video Codecs
In the context of VOD and Live Streams, the most popular codecs include:
- H264 / AVC
- H265 / HEVC
- VP8
- VP9
- AV1
- VVC
H264 is the most widely supported codec out there. H265 takes almost 4 times as long as H264 to encode a video, and the same holds for VP9; because of the higher CPU load, both are less suitable for Live Streams.
AV1 is a new codec from the Alliance for Open Media (of which Google is a member). Its CPU load is much higher than VP9 and H265, almost 10x. The encoder is still maturing, and this number may come down over time, especially once hardware encoders start appearing in devices. With the current numbers it is definitely not meant for Live Streams, as encoding time may end up being much more than the video segment duration. Netflix has been trying the AV1 codec for some titles on a limited scale. VVC is another new codec under development.
Audio codecs
The most popular ones include:
- AAC
- Opus
- Vorbis
- MP3
AAC is the most popular one, with the largest support. Opus is the most efficient but lacks wide support.
The choice of codec for encoding is dictated by the decoders available on the client side.
HLS supports H264 and H265. Since H265, while providing higher compression, is not completely free, we chose the former. For the audio codec we chose AAC.
Hardware vs Software
Encoders
Hardware encoders can be optimised to run blazing fast, but good ones are usually very expensive. Any change in the choice of codec would mean buying new hardware, and a change in the specification or a newer version of a codec may also require new hardware if the changes are not supported. Software encoders, in contrast, can be updated frequently. Powerful hardware encoders make sense in specialised use cases where performance is extremely critical and price is not a concern.
We used a software encoder; we plan to try out hardware encoders in the future.
Decoders
Mobile devices, laptops and desktop machines pack in hardware decoders. If a hardware decoder for a codec is present, it should always be preferred over a software-based decoder: hardware decoders take the load off the device's CPU and offer better performance. Most Android devices pack in AVC, HEVC, VP8 and VP9 hardware decoders. For a codec as new as AV1, however, hardware decoders are not present in the current set of devices; Android 10 onwards ships a software decoder, so for AV1 the CPU does the decoding in software.
Software Tools
FFmpeg
FFmpeg is a free and open-source tool licensed under the LGPL/GPL. It is the most popular tool out there.
Bento4
- Another alternative
- Lacks the community support that FFmpeg enjoys
- More limited in functionality
We use FFmpeg. A representative packaging command is sketched below.
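As a minimal sketch with the codecs we chose earlier (H264 + AAC), here is one rung of the ladder being transcoded and packaged into HLS; the bitrate, resolution and segment duration are hypothetical, and a real pipeline would repeat this per rendition with tuned encoder settings:

# Transcode to H264 video + AAC audio and package as 4-second HLS segments.
ffmpeg -i input.mp4 \
  -c:v libx264 -b:v 1400k -s 842x480 \
  -c:a aac -b:a 128k \
  -f hls -hls_time 4 -hls_playlist_type vod \
  480p/index.m3u8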
Clients
Players
- On Android ExoPlayer is the de facto player. https://exoplayer.dev/
- On iOS AVPlayer is the de facto standard. https://developer.apple.com/documentation/avfoundation/avplayer
On the web, there are multiple players:
- Safari offers native HLS playback on iOS and macOS.
- Windows offers HLS support too.
- Video.js, hls.js and Shaka Player are some of the popular ones, with Video.js being the most popular.
How does a client play a playlist?
A client is provided with the link to the Master Playlist. The first media playlist entry in the Master Playlist is the default one to be played; putting a low-quality playlist first means faster load times for the video, with the compromise that the initial video quality may not be the best.
Clients start playback at the beginning of the media playlist.
While fetching the resources, clients continuously measure the network speed, which helps in deciding which playlist to keep fetching content from. If the client realises that the network speed has gone down and is lower than the BANDWIDTH parameter mentioned in the playlist, it switches to another playlist, and it keeps switching as conditions change. Clients also ensure that they fill up some buffer, which allows the video to keep playing even through a temporary network disruption.
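On the web, for instance, hooking a player up takes only a few lines with hls.js, which implements the bandwidth measurement and playlist switching described above (a sketch; the URL is hypothetical):

import Hls from "hls.js";

const video = document.querySelector("video")!;
const src = "https://cdn.example.com/stream/master.m3u8"; // hypothetical Master Playlist URL

if (Hls.isSupported()) {
  const hls = new Hls();   // ABR logic (bandwidth estimation, level switching) lives here
  hls.loadSource(src);     // fetch and parse the Master Playlist
  hls.attachMedia(video);  // feed segments to the <video> element via Media Source Extensions
} else if (video.canPlayType("application/vnd.apple.mpegurl")) {
  video.src = src;         // Safari: native HLS playback
}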
Platform-specific nuances
For Live Streams, platform-specific nuances exist; these may vary with player versions.
- On Android, ExoPlayer picks the 3rd-last entry in the mutating media playlist. This follows Apple's old recommendation of keeping 3 segments' worth of buffer.
- AVPlayer, on the other hand, picks the 2nd-last entry.
- ExoPlayer and AVPlayer both pick a playlist based on the media segment target duration.
On the backend side, every effort should be made to make sure that there is resilience and failover happens properly. This will be covered in subsequent articles.
Clients may see errors like the following:
Playlist not getting updated
- Every media playlist is expected to be updated frequently, depending on the media segment duration defined.
- Client players poll and populate their cache; the polling frequency varies depending on the above.
- If the player realises that the media playlist has not been updated, it retries a few times and then throws an error indicating that the playlist is not updating.
Playlist resets
- HLS recommends that media sequence numbers be kept strictly increasing; the playlist should never let the media sequence number go down.
- If, due to some failure or any other reason, the media playlist is regenerated or the media sequence number is accidentally reset to a lower value, the player may throw an error.
Livestream pipeline
In a typical Live Stream the different steps include Input, Stream segmentation, Transcoding, Distribution and finally client playback.
The sequence of steps is as follows:
- The first step is where the video from an input source is picked up. This video is fed into the stream segmentation step.
- The stream segmentation step breaks the continuous video stream into smaller video segments. These small video segments are then fed into the transcoder.
- Transcoder converts these small video segments into different video qualities depending on the bitrate ladder that has been defined in the system.
- The transcoded video segments are then pushed onto CDN ingest servers.
- Depending on the CDN infrastructure, ingest servers push content to multiple servers all the way to edge servers.
- Edge servers serve the content to clients.
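Putting it together, the flow looks like this:

Input source → Segmenter → Transcoder (ABR renditions) → CDN ingest → Edge servers → Client players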
Glass to Glass Latency
Glass-to-glass latency is defined as the total latency from the time content is captured to the time the viewer sees it on their device. The first glass is the camera lens, the second is the viewer's screen, hence the term glass to glass.
Input
A minimum delay of one frame will always be present; however, this number is quite small. For example, in a 30fps video, a single frame resolves to 1/30th of a second (~33ms). Any software or hardware used for capturing the video may introduce its own delay; for instance, a chip running image optimisation or video stabilisation filters adds some.
Segment Generation
If the input is a continuous stream, it may be broken into segments before being transcoded. The delay here equals the duration of a segment: the larger the segment, the larger the delay. Sub-second segment sizes need a different kind of technology, which we will talk about in the ultra-low latency section.
Transcoding
This is the process of converting a single video segment into multiple bitrates for ABR streaming. The delay depends on how fast or slow the systems are. If the delay exceeds the segment duration, issues will arise: the playlist will not be updated fast enough on the client side, which subsequently throws errors or shows a buffering icon as the player waits for content.
Network
Post the transcoding step the transcoded segments need to be sent to the CDN primary ingest server. This adds to the latency.
CDN Propagation: Primary to Edge
The primary CDN server propagates the segments to all the edge servers. Clients interact with the edge servers.
Player Buffers
Players buffer 2–3 segments' worth of content, and this step contributes the most latency. Earlier, for HLS, a lot of players would pick the 3rd-last segment in the playlist, introducing a latency of 3 × segment duration. These days some players pick the 2nd-last segment, i.e. 2 × segment duration. A rough overall budget is worked out below.
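With hypothetical but typical numbers, assuming 4-second segments:

Segment generation          4s   (one full segment must be cut first)
Transcoding                ~2s   (must stay below the segment duration)
Network + CDN propagation  ~1s
Player buffer              12s   (3 × 4s segments; 8s if the player picks the 2nd-last)
-----------------------------------------------------------------
Glass to glass            ~19s   (~15s with a 2-segment buffer)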
Ultra Low latency technologies
For most users conventional HLS works fine, and the various aspects mentioned above can be tuned to control the lag. However, in certain cases very low glass-to-glass latency may be needed; the technologies below address that. Our use case had no such requirement, so we stayed with conventional HLS.
CMAF
Common Media Application Format (CMAF) came into the picture in 2017 when Apple and Microsoft joined hands. CMAF relies on Fragmented MP4, which MPEG-DASH used and which later versions of HLS added support for. CMAF builds on top of HLS and DASH, and CMAF segments can be served through both, i.e. transmuxing is possible.
There are 3 constructs:
- CMAF Tracks
- CMAF Fragments
- CMAF Chunk
A CMAF Fragment is a part of a CMAF Track, and a CMAF Chunk is a part of a CMAF Fragment. A continuous stream of CMAF chunks is sent over the wire to the client player. In this mode, unlike the regular HLS or DASH approach where delays are a multiple of segment lengths, the delays are a multiple of chunk durations. The typical chunk duration varies from 0.5 to 1 second. Theoretically, a chunk can be as small as a single frame, i.e. 33ms worth of chunk at 30fps.
Apple Low Latency HLS (ALLHS)
Partial Segments
Players can now fetch partial segments even before the full segment is done. These partial segments can be around 250–300ms in duration and contain a number of video and audio frames. For this to work, HLS spec version 9 needs to be supported.
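In the media playlist, partial segments are advertised alongside the full segments with the #EXT-X-PART tag. A representative snippet, with hypothetical names and durations:

#EXTINF:4.0,
segment1042.mp4
#EXT-X-PART:DURATION=0.333,INDEPENDENT=YES,URI="segment1043.part0.mp4"
#EXT-X-PART:DURATION=0.333,URI="segment1043.part1.mp4"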
HTTP/2 pushed segments
In HLS and other ABR technologies, players have always polled for playlists and media segments: first the player fetches the playlist, and then, upon realising that content has been updated, a second network call fetches the media segment. Apple has suggested that whenever a client fetches a media playlist, any newly added content can be pushed along with the updated playlist over HTTP/2.
Blocking request /Delayed responses
A client can choose to request a playlist and content (and wait for it) even before the content is generated. The HTTP request is held at the server side: if a client playing a media segment fetches the playlist to see whether there is updated content, the server holds the request until new content is generated and stored, and only then sends the playlist and content.
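In Apple's spec this is expressed through query parameters on the playlist request: the server blocks the response until the requested media sequence number (and, optionally, part) exists. A representative request:

GET /media/playlist.m3u8?_HLS_msn=1044&_HLS_part=2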
Delta updates
HLS playlists can become huge, and fetching the complete playlist over and over just to get the latest content is wasteful. To mitigate this, Apple introduced Delta Updates: the client fetches the complete playlist once, after which it can request Delta Playlists, which contain only the most recent segments and parts of the complete playlist.
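A delta playlist is requested with the _HLS_skip parameter, and the skipped-over entries are replaced by a single #EXT-X-SKIP tag. A representative exchange, with hypothetical numbers:

GET /media/playlist.m3u8?_HLS_msn=1050&_HLS_skip=YES

#EXTM3U
#EXT-X-TARGETDURATION:4
#EXT-X-MEDIA-SEQUENCE:1000
#EXT-X-SKIP:SKIPPED-SEGMENTS=44
#EXTINF:4.0,
segment1044.mp4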
Faster bitrate Switching
Clients can ask for other media playlists too while fetching a playlist. This allows the client to switch to another rendition without issuing a separate request for the other playlist, which is especially useful in these ultra-low latency scenarios.
Low Latency HLS (LHLS)
This technology existed before ALLHS. However, it was an informal specification with very little adoption and was not supported everywhere; iOS in particular had no support for it. It was similar in spirit to CMAF.
Stream Security
Origin blocking (Origin, CORS, Referrer)
In order to make sure that our videos are not put on other websites, we need to verify that the Referrer or Origin of the request is one we expect. A CORS policy helps here: only the domains where we believe videos will be shown or shared are whitelisted.
iFrame blocking
Another possible issue is people embedding our videos in an iFrame to circumvent Origin Blocking. JavaScript-based framebusting code and header options like X-Frame-Options can be used to block this. A sketch combining both protections follows.
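A minimal nginx-style sketch (the domain list is hypothetical; in practice a CDN often enforces these rules instead):

location /videos/ {
    # Origin blocking: only whitelisted referrers may fetch the streams
    valid_referers blocked www.example.com *.example.com;
    if ($invalid_referer) {
        return 403;
    }
    # CORS: whitelist only the domains where videos will be shown
    add_header Access-Control-Allow-Origin "https://www.example.com";
    # iFrame blocking: forbid embedding on other sites
    add_header X-Frame-Options "SAMEORIGIN";
}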
Basic Token
The idea is to not allow direct access to the resource. A token in the URL determines whether the client has access to it:
index.m3u8?token=SOME_TOKEN_HERE&expiry=EXPIRY_TIME
User Specific Tokens
Each user is provided with a token specific to them, which is used for validation. The token can be provided in headers, in the query string or as a path param. Some client players (like iOS's AVPlayer) don't provide easy ways to add custom headers, so the other two options may be preferred. A minimal signing sketch follows.
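In TypeScript (Node.js), assuming an HMAC-based scheme; the parameter names mirror the URL format above, and the secret handling is simplified:

import { createHmac } from "crypto";

const SECRET = process.env.STREAM_TOKEN_SECRET!; // hypothetical secret management

// Issue a URL binding the user, the resource and an expiry time.
function signUrl(path: string, userId: string, ttlSeconds: number): string {
  const expiry = Math.floor(Date.now() / 1000) + ttlSeconds;
  const token = createHmac("sha256", SECRET)
    .update(`${path}:${userId}:${expiry}`)
    .digest("hex");
  return `${path}?token=${token}&expiry=${expiry}`;
}

// The server recomputes the HMAC and rejects expired or tampered URLs.
function isValid(path: string, userId: string, token: string, expiry: number): boolean {
  if (expiry < Math.floor(Date.now() / 1000)) return false;
  const expected = createHmac("sha256", SECRET)
    .update(`${path}:${userId}:${expiry}`)
    .digest("hex");
  return expected === token;
}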
Stream encryption
Media segments are encrypted using AES encryption. In the general scenario, the decryption key is referenced in the playlist itself. One change which makes this more secure is to serve the decryption key separately through a secure API; that way, getting access to just the Media Playlist does not help, and there is an additional layer of security.
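In the Media Playlist this shows up as an #EXT-X-KEY tag; pointing its URI at an authenticated key API (the URL below is hypothetical) provides the extra layer described above:

#EXT-X-KEY:METHOD=AES-128,URI="https://keys.example.com/v1/key?stream=42",IV=0x1234567890ABCDEF1234567890ABCDEF
#EXTINF:4.0,
segment1043.ts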
DRM (Digital Rights Management)
DRM is maintained by external systems, with the decryption key kept separate. Some of the popular solutions include Widevine and FairPlay. DRM solutions are generally costly. Android's ExoPlayer has good support for Widevine and currently lacks support for FairPlay; for iOS the reverse is true. Depending on the use case you may or may not need it.
Platform support comparison of DRM is available at https://castlabs.com/resources/drm-comparison/
Video-Input for Encoder
Two popular protocols in vogue are WebRTC and RTMP.
RTMP is a streaming protocol that maintains a persistent TCP connection between the player and the server for the whole broadcast. Unlike HLS, RTMP uses a push model: instead of the player requesting each segment, the server continuously sends video and audio data. The client can still issue pause and resume commands, for example when the user requests it or when the player is not visible.
In RTMP, the broadcast is split into two streams, one for video and one for audio. The streams are split into chunks of 4 KB, which are multiplexed on the TCP connection, i.e. video and audio chunks are interleaved. At a video bitrate of 500 Kbps, each chunk is only 64 ms long, which, compared with HLS segments of 3 seconds each, produces smoother streaming across all components. The broadcaster can send data as soon as it has encoded 64 ms of video; the transcoding server can process that chunk and produce multiple output bitrates, and the chunk is then forwarded through proxies until it reaches the player.
The push model plus small chunks reduce the lag between broadcaster and viewer by around 5x, producing a smooth and interactive experience. Even so, most live stream products use HLS for delivery because it is HTTP-based and easy to integrate with all existing CDNs.
RTMP
Pros:
- Has a lot of tools, implementations and ecosystem built around it
- Wide Platform support
- Used for livestream ingestion
Cons:
- It's a very old protocol, developed by Macromedia (later acquired by Adobe), and was mostly used in Flash applications
- Latency is around 4 seconds
- Dated tech and is being phased out
- RTMP doesn’t have modern browser support
WebRTC
Pros
- It is a very new protocol that is gaining popularity
- Multiple participants are possible
- Has been implemented in the main browsers in the past 2–3 years
- UDP & TCP both supported
- Also implements a low-latency messaging system (data channels)
- We can actually call it real-time or ultra-low latency: less than 200ms
- Supports encryption by default
- Native official SDKs for Android, web and iOS exist.
Cons:
- Each browser used to have its own API which, unfortunately, did not follow the W3C standards, though things are getting standardised now
WebRTC also allows us to host video conferences with multiple speakers, making live-streamed talk shows possible, which RTMP can never offer.
On the basis of the above, we decided to go ahead with WebRTC.
That concludes Part 2 of the series. In the next article we will touch upon WebRTC in more detail. You can read the next one by clicking here.
We hope you enjoyed reading this article. We would love to hear your thoughts and suggestions in the comments.