STARTtrek: Features of Encoding and Streaming

Kirill Evseenko
STARTteam
Dec 3, 2023

The streaming service START was founded in 2017 by the producers of the Yellow, Black and White studio as a platform for online distribution of their own original content. By 2023, over 100 originals had been released, and the library of additional purchased content had grown significantly. Each episode of a TV series or a movie is a large file that needs to be delivered and shown to the streaming service user in the best possible way. In this article, Kirill Evseenko, Director of Technology and Products and CTO at START for the past five years, will explain how we do it.

From the beginning, we decided to develop the streaming service as a globally accessible platform with an emphasis on flexible distribution. Initially it was a monolith, which gradually transformed into microservices. This article covers the services that are directly related to streaming.

There are two approaches to implementing streaming: in-house or third-party solutions. In-house solutions are used by major players who have the infrastructure and extensive capabilities. Third-party solutions from different vendors are a good fit for startups, and this is how we started out. As we grew, we ran into the disadvantages of this approach: high cost, insufficient manageability, inadequate quality, and vendor lock-in, a constant dependence on the chosen solutions. So we gradually started to develop our own.

Encoding

We encode on the CPU using FFmpeg with the "fast" preset and the Main and High profiles. When creating the encoding profile, we focused on the HLS specification, since at the time HLS was used for all clients. We select the bit rate taking into account the narrowband channels of Internet providers. To assess encoding quality when selecting a profile, we use Media Stream Validator and the PSNR, SSIM, and VMAF metrics. Let us consider them in more detail.
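As a rough illustration of such an encode (the preset and profile follow the text above, but the bitrate, resolution, and file names are placeholders rather than our actual production profile):

```python
import subprocess

# Illustrative single-rendition H.264 encode on the CPU.
# Bitrate, resolution, and file names are placeholders, not the real profile.
subprocess.run([
    "ffmpeg", "-y", "-i", "source.mp4",
    "-c:v", "libx264",
    "-preset", "fast",          # the "fast" preset mentioned above
    "-profile:v", "high",       # Main or High, depending on the rendition
    "-vf", "scale=-2:1080",     # target resolution of this rendition
    "-b:v", "4000k", "-maxrate", "4400k", "-bufsize", "8000k",
    "-c:a", "aac", "-b:a", "128k",
    "output_1080p.mp4",
], check=True)
```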

Media Stream Validator is Apple's tool that analyzes HLS and verifies a segment's compliance with the standard. HLS Report is another utility that generates human-readable reports for the validated stream. We use HLS wherever possible, and for video on demand the standard specifies tolerances for the metrics that are analyzed. At the moment we are not fully compliant with the standard; our deviation is about 20%, which we consider acceptable. Below is an example of a report generated by HLS Report.
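Generating such a report boils down to running the two tools in sequence; a sketch, assuming the usual command-line invocation from Apple's HTTP Live Streaming Tools (option names and output locations may differ between tool versions):

```python
import subprocess

# Validate the master playlist; mediastreamvalidator writes its results
# (validation_data.json) to the current directory by default.
subprocess.run(
    ["mediastreamvalidator", "https://example.com/master.m3u8"],
    check=True,
)

# Turn the JSON results into a human-readable HTML report.
subprocess.run(["hlsreport", "validation_data.json"], check=True)
```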

The FFmpeg quality metrics library calculates various metrics: per-frame values and aggregate statistics (maximum, minimum, mean, and standard deviation). By selecting different encoding profile parameters, it is possible to significantly improve the video quality for the user.

For example, PSNR (peak signal-to-noise ratio) is a metric that is used to measure the level of distortion in images subject to lossy compression. And through profile selection, we increased the metric significantly. The graph shows how our performance improved.

SSIM (structure similarity index measure) is a metric that measures the similarity between two images. This metric was not bad anyway, but it got even better, as you can see in the graph.

VMAF (Video Multimethod Assessment Fusion) is a video quality assessment algorithm developed by Netflix. We improved this metric by almost 10% for the videos we encode.
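All three metrics can also be computed with plain FFmpeg filters by comparing the encoded file against the source; a sketch (file names are placeholders, and the libvmaf filter requires an FFmpeg build with VMAF support):

```python
import subprocess

# Compare the encoded rendition (first input) against the source (second input)
# with FFmpeg's psnr, ssim, and libvmaf filters; summaries are printed to stderr.
for quality_filter in ("psnr", "ssim", "libvmaf"):
    subprocess.run([
        "ffmpeg", "-i", "output_1080p.mp4", "-i", "source.mp4",
        "-lavfi", quality_filter,
        "-f", "null", "-",
    ], check=True)
```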

The results of this work: by selecting different parameters, we were able to reduce traffic by 28%, improve the visual quality of video both by the metrics and by eye, and shorten encoding time by 15%, while also reducing the storage footprint of the encoded videos, all under the same conditions and on the same hardware.

We use parallel encoding and do single-pass encoding: loading the source, analyzing it, encoding, packaging, slicing, and publishing it. In line with business requirements, we encode per track. We use the standard H.264 and H.265 codecs; this is necessary to keep working on very old devices, such as Smart TVs from 2012, which do not support newer codecs. Moreover, we keep each video in only one codec to avoid multiplying the storage space required per video. Below is an example of what it looks like in the admin panel: we can display progress and statuses and view logs for each of the encoding stages. We do not have infinite resources, so we made it possible to prioritize encoding in case there is a queue.

Take, for example, the movie "The Challenge" in 4K with 5.1 sound: it is almost 3 hours long, while the encoding stage takes just over 1.5 hours. And we do not encode the original ProRes, only the prepared MP4s.

Not Everyone Needs Great Video and Sound Quality

We encountered an interesting case when mobile operators came to us and asked us to stream video of very poor quality so that it would work for their customers on very weak devices with poor internet. Our platform provides an API for embedding our service on third-party sites, so, among other things, we actively work with mobile operators, and in 2018 we added a 240p profile, which did not look great. But we removed this profile later on as the quality of mobile networks improved.

Encoding Audio and Subtitles

We encode audio and subtitles in parallel and independently of each other; there are dedicated encoding profiles for this. We make subtitles in two formats, .srt and .vtt, for different devices. Audio is encoded using the AAC and AC3 codecs. Our service allows for uploading multiple audio tracks and subtitle tracks for a video and streaming different sets of audio and subtitles depending on geo-restrictions and distribution requirements for different partners, sites, and device types. In our admin panel, it looks like this:
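On the encoding side, producing these tracks is straightforward FFmpeg work; a minimal sketch (the codecs match the text above, but bitrates and file names are placeholders, not our actual profiles):

```python
import subprocess

# Extract and encode a stereo AAC track and a 5.1 AC3 track from the source;
# bitrates and file names are illustrative only.
subprocess.run(["ffmpeg", "-y", "-i", "source.mp4", "-vn",
                "-c:a", "aac", "-b:a", "128k", "audio_aac.mp4"], check=True)
subprocess.run(["ffmpeg", "-y", "-i", "source.mp4", "-vn",
                "-c:a", "ac3", "-b:a", "384k", "audio_ac3.mp4"], check=True)

# Convert subtitles delivered as .srt into .vtt for devices that need WebVTT.
subprocess.run(["ffmpeg", "-y", "-i", "subs_ru.srt", "subs_ru.vtt"], check=True)
```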

As for license attributes, we can select a geolocation, a platform, and a variable set of audio and subtitles that are available for them.

Audio and Video Track Duration Case

We found that the audio track duration in the MP4 file metadata comes out a few thousandths of a second longer than the video track. It looks like this:

The problem was discovered when playing the DASH playlist on a smart TV. And eventually it turned out that it was not a problem, but a feature of FFmpeg, which writes the track duration in the MP4 metadata after encoding.

This duration gets out of sync around the last segment: one track ends up longer than the segment boundary, while another may be shorter than or equal to it, and then the situation described above arises. One solution, which we use, is to trim the audio track to the length of the video track.
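A sketch of that fix with ffprobe and FFmpeg (file names are placeholders):

```python
import subprocess

def stream_duration(path: str, stream: str) -> float:
    """Read a stream's duration from the container metadata via ffprobe."""
    out = subprocess.run([
        "ffprobe", "-v", "error",
        "-select_streams", stream,
        "-show_entries", "stream=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ], capture_output=True, text=True, check=True).stdout.strip()
    return float(out)

video_dur = stream_duration("encoded.mp4", "v:0")
audio_dur = stream_duration("encoded.mp4", "a:0")
print(f"audio is longer by {audio_dur - video_dur:.6f} s")

# Remux with the output cut at the video duration, which effectively trims
# the audio track; with stream copy the cut lands on packet boundaries.
subprocess.run([
    "ffmpeg", "-y", "-i", "encoded.mp4",
    "-c", "copy", "-t", f"{video_dur:.6f}",
    "trimmed.mp4",
], check=True)
```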

Packaging

We cut into segments via nginx-vod-module (internally, we call it Kaltura after the company that develops the module); it is an Nginx module designed to stream video in different formats. We currently provide HLS and DASH, with DRM if necessary. Segmentation and encryption are carried out on the fly on the storage servers, and we cache the final files on CDN nodes. This allows us to cache static content more easily and efficiently, and it reduces the space required on storage by a factor of 4.

Interesting features of Kaltura operation: segment duration, splicing, and language handling.

The duration of the segments is determined by parameters set in the Kaltura configuration. There are two main parameters: the first one sets the duration of the first segment, and the second one sets the duration of the remaining segments. Using the HLS playlist as an example, here is what this looks like internally:
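A simplified media playlist with the relevant tags (segment names and exact values are illustrative):

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:1
#EXTINF:6.000,
seg-1-v1-a1.ts
#EXTINF:6.000,
seg-2-v1-a1.ts
#EXTINF:6.000,
seg-3-v1-a1.ts
#EXT-X-ENDLIST
```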

The playlist contains the target duration for all segments and the duration of each particular chunk. When we first launched, the duration of the first segment was half a second, while the others were 10 seconds long. The reason is that the player measured the speed of the Internet connection on the first segment at startup, which gave it the opportunity to switch to a higher quality earlier when using the adaptive playlist. But now all our segments are 6 seconds long, in line with the HLS standard. For the same reason, the splash screen is 6 seconds long, so that by the time the user starts watching the main content, they have already switched to the quality appropriate for their Internet connection.
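On the Kaltura side, these two parameters presumably map to the module's segment-duration directives; a sketch based on the nginx-vod-module documentation (directive names should be checked against the module's README; the 500 ms bootstrap value reflects the original launch setup described above, while today everything is 6 seconds):

```nginx
location /hls/ {
    vod hls;                               # serve this location as HLS
    vod_segment_duration 6000;             # duration of the other segments, ms
    vod_bootstrap_segment_durations 500;   # duration of the first segment(s), ms
    vod_align_segments_to_key_frames on;   # cut segments on keyframe boundaries
}
```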

For segments to be cut to exactly the size specified in the config, a keyframe must fall on these boundaries. So when creating segments, we rely, among other things, on the keyframe settings in the FFmpeg parameters.

Of the FFmpeg keyframe parameters, we use the first one: a forced keyframe every 2 seconds. This is because we have videos with different frame rates, and a time-based setting allows us to slice everything correctly.
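In FFmpeg terms this is most likely the force_key_frames expression; a sketch (the remaining encode options are placeholders):

```python
import subprocess

# Force a keyframe every 2 seconds regardless of the source frame rate,
# so that segment boundaries always land on a keyframe.
subprocess.run([
    "ffmpeg", "-y", "-i", "source.mp4",
    "-c:v", "libx264", "-preset", "fast",
    "-force_key_frames", "expr:gte(t,n_forced*2)",
    "-c:a", "copy",
    "output.mp4",
], check=True)
```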

The second interesting case is related to the law on labeling age-restricted content. The input constraints were the lack of explanatory documents on implementation, the approaching new year, and the need for labeling on the partners' side as well, so we had to do it all in one month. In the end, we chose the following solution: splicing several video files into one, which Kaltura supports. A dedicated tag is used for this purpose, one for HLS and one for DASH.
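On the Kaltura side, splicing is expressed by listing several clips in one sequence of the mapped-mode mapping JSON; a sketch along the lines of the nginx-vod-module documentation (paths are placeholders):

```json
{
  "sequences": [
    {
      "clips": [
        { "type": "source", "path": "/storage/splash_12plus.mp4" },
        { "type": "source", "path": "/storage/the_movie.mp4" }
      ]
    }
  ]
}
```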

Here is how it looks in HLS: two videos are spliced together through this tag.
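In HLS this is most likely the standard EXT-X-DISCONTINUITY tag, which warns the player that stream parameters may change at that point; a simplified playlist (segment names are placeholders):

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXTINF:6.000,
splash-seg-1.ts
#EXT-X-DISCONTINUITY
#EXTINF:6.000,
movie-seg-1.ts
#EXTINF:6.000,
movie-seg-2.ts
#EXT-X-ENDLIST
```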

An example of what this looks like in DASH:
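In DASH, the usual mechanism for splicing is separate Periods in the manifest; a heavily simplified skeleton (attributes and representations are illustrative):

```xml
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static">
  <!-- The age-rating splash screen -->
  <Period id="0" duration="PT6S">
    <AdaptationSet mimeType="video/mp4">...</AdaptationSet>
  </Period>
  <!-- The main content -->
  <Period id="1">
    <AdaptationSet mimeType="video/mp4">...</AdaptationSet>
  </Period>
</MPD>
```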

But there are certain nuances. We did it quickly, prepared only one variant of the splash screen for the different age ratings (0+, 6+, 12+, and so on), and only had time to test on a limited number of devices. Later, we started to encounter problems on end devices: because of differences between the video files (different color spaces, bit rates, and frame rates), some devices could not handle the transition and stuttered or misbehaved in other ways. Having fulfilled the requirements of the law, we started looking for a better solution, for example, labeling in the player itself. And once all the clients had rolled out, we quickly added a hack that lets us remove these splash screens for problem clients remotely by controlling Kaltura from the back end. Still, if you need to use such functionality, you should prepare the videos for splicing with the same parameters, so that the spliced video files are identical in format.
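In practice that means re-encoding the splash screen to match the main content before splicing; a sketch (the target parameters here are examples and should be taken from the actual main video):

```python
import subprocess

# Re-encode the splash screen so its resolution, frame rate, pixel format,
# and codec settings match the main content it will be spliced with.
# The concrete values below are examples; take them from the main video.
subprocess.run([
    "ffmpeg", "-y", "-i", "splash_12plus_src.mov",
    "-vf", "scale=1920:1080,fps=25,format=yuv420p",
    "-c:v", "libx264", "-preset", "fast", "-profile:v", "high",
    "-b:v", "4000k",
    "-c:a", "aac", "-b:a", "128k", "-ar", "48000", "-ac", "2",
    "splash_12plus.mp4",
], check=True)
```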

Another case is working with audio tracks when adding multiple audio tracks to a single video. Kaltura requires a three-letter language code to be passed for each track. The list of languages is limited, and Kaltura itself converts them into two-letter codes, according to the RFC. We did not encounter any problems when adding different languages, but a problem arose when we needed to add several audio tracks with the same language: in our case, for example, Russian, Russian explicit, and Russian 5.1. We always find all the major problems on Smart TVs.

Some smart TVs, when receiving such a playlist with two audio tracks in the same language, stopped working correctly, froze, or did not play audio at all. We found that everything worked correctly if we made the language labels differ from one another in some way. One solution to this problem is to fork Kaltura and pass custom parameters into it. We decided not to do that, and instead use other standard but rare languages, like Zulu or Xhosa. We reserved these languages within our platform for specific audio variants, and after this conversion the problem was solved. Here is an example of how we pass audio tracks now:
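A hypothetical sketch of such a remapping (the actual reserved codes and track names on our platform may differ):

```python
# Map internal audio variants to ISO 639-2 codes that Kaltura accepts.
# Variants sharing a real language get distinct "reserved" rare codes.
AUDIO_LANGUAGE_CODES = {
    "russian":          "rus",  # the primary Russian track keeps its real code
    "russian_explicit": "zul",  # reserved: Zulu stands in for the explicit track
    "russian_5_1":      "xho",  # reserved: Xhosa stands in for the 5.1 track
    "english":          "eng",
}

def kaltura_language(track_variant: str) -> str:
    """Return the three-letter code passed to Kaltura for a given track."""
    return AUDIO_LANGUAGE_CODES[track_variant]
```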

How We Approach the Issue of Balancing and Delivering Video

To decide where to send users and which server should deliver the content, we have our own balancer, LIBR (like a librarian who knows what lies where and what to deliver from where).

LIBR is controlled by a config.
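Schematically, such a config could look something like this (all field names, values, and the network range are purely illustrative, not our actual format):

```json
{
  "sites": {
    "msk-dc1": {
      "rules": [
        { "platform": "smart_tv", "country": "RU", "weight": 50 },
        { "network": "203.0.113.0/24", "weight": 100 }
      ]
    },
    "msk-dc2": {
      "rules": [
        { "country": "RU", "city": "Moscow", "weight": 50 }
      ]
    },
    "eu-dc1": {
      "rules": [
        { "country": "*", "weight": 100 }
      ]
    }
  }
}
```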

That is, we have a set of sites and a rule for each of them. The rules allow us to manage delivery by platform, by country, by city, and even by network block. The config also specifies the set of conditions under which this balancing occurs.

Content delivery via Content Delivery Network (CDN)

We have moved away from using third-party CDNs in favor of developing our own.

External solutions become very expensive as the amount of delivered content grows. We have certain requirements for our CDN: flexible delivery of super-popular content, efficient management of outgoing traffic for each site, linear scaling of caching servers when necessary, and control of uplink load. Today we have three distribution sites in Moscow in different data centers, as well as two sites outside the Russian Federation. All sites are connected to the main operator junctions and to IXs (Internet eXchanges, traffic exchange points). All this gives us the necessary fault tolerance, high availability, and flexible load management of both channels and sites.

Within each site we use consistent hashing, with data sharded across the cache servers, so each piece of content is delivered from only one server at a site. For super-popular content, however, there is no binding to a host; such content is distributed from all hosts at the site. And by adding servers, we can linearly scale both the cache capacity and the output bandwidth of a site.
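A minimal sketch of the consistent-hashing idea, not our actual implementation: each content ID maps to one cache server on a hash ring, and adding a server remaps only a small fraction of the content.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps content IDs to cache servers; adding or removing a server
    only remaps a small fraction of the content."""

    def __init__(self, servers, vnodes=100):
        # Place several virtual nodes per server on the ring for even spread.
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def server_for(self, content_id: str) -> str:
        # First ring position clockwise from the content hash.
        idx = bisect.bisect(self._keys, self._hash(content_id)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-01", "cache-02", "cache-03"])
print(ring.server_for("originals/s01e01/1080p.mp4"))
```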

Conclusion

We have not stopped with the current tools and continue to improve them. Our plans in terms of encoding include pre-cutting sources into smaller parts: the amount of released content keeps growing, and in the future we will need content to appear on the service faster. We also plan to add new features to the balancer, such as automatic balancing of traffic between sites, and we will expand our CDN network to deliver content to our users even better.
