Through the eyes of the viewer: Attention-aware video compression for improved low-bitrate video
While the majority of the video content consumed on SVT Play is enjoyed through high-bitrate streams, SVT, as a public service streaming platform, also has a responsibility to be accessible to viewers with unreliable or slow internet connections. As the average bitrates used by streaming services continue to increase, it is a matter of inclusion to deliver enjoyable viewing experiences at lower bitrates as well. However, given the high expectations viewers have for video quality, this has proven to be a tricky task, and a perfect solution has yet to be found. In my master's thesis, conducted at SVT, I aimed to tackle the challenge of improving video quality at low bitrates, thus making SVT Play a more inclusive streaming platform for those with poor connectivity. My solution: AI-based attention-aware video compression.
When compressing video to a low bitrate, how bits are used and distributed across the video becomes increasingly important. Here, it is worth remembering that the most crucial measure of quality in streaming is how well the video is perceived by the viewer, also known as subjective video quality. Since the human eye is more sensitive to certain types of visual information, some information can be removed from a video without any loss in subjective video quality [1], and this can be taken advantage of in video compression. Video compression techniques commonly used for streaming make use of the fact that certain colors and frequencies in video are visually redundant to us in order to compress more efficiently. The question is: can we take this even further? What if we could find more visual redundancy in the frames based on the content of the video and what catches the viewer's attention, and then adapt the compression accordingly? Would this increase the overall perceived quality of the video?
With artificial intelligence it is possible to locate which areas within a video frame will attract the viewer's attention, so-called regions of interest (ROIs). Current AI models for visual attention prediction have some known limitations, but research has overall shown quite promising results. These models could potentially be used in video compression to adapt the compression rate based on visual attention, so that regions of interest are compressed less while the surrounding areas receive a higher compression, thereby improving subjective video quality while maintaining a low bitrate. Previous research on attention-aware video compression has shown promising results [2]; however, the inclusion of AI is a much less explored topic that lacks subjective evaluation and discussion of the solution's application areas [3]. I therefore wanted to explore how these kinds of solutions could be used in the video transcoding done at SVT, and particularly whether they could be useful when compressing to low bitrates.
In practice, this meant that I extended the open source video encoder x265, which is used at SVT, with a new Adaptive Quantization (AQ) mode. An AQ mode is a tool in x265 that adapts the quantization, i.e. the degree of compression, of each coding unit in the video frame based on certain criteria. While the existing AQ modes in x265 are based on, for instance, the complexity within the coding unit, the new AQ mode instead used visual attention to adapt the quantization: the higher a block's interest value to the viewer, the lower the level of quantization applied to it. The interest value of each block was calculated from a region of interest (ROI) map for each frame. The map was generated by the deep learning model ACLNet [4], which takes in a batch of frames and uses a CNN-LSTM architecture to identify the region of each frame that would be most interesting to the viewer and draw their attention. The model has been trained on three different datasets with recorded visual attention data for video. As output, the model generates one ROI map per frame, where each pixel has an interest value between 0 and 1. This map was in turn used to decide the level of quantization for each coding unit based on the average interest value of that block.
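To make the idea concrete, here is a minimal Python sketch of the block-level calculation. The function names, the block size and the linear mapping from interest to quantization offset are my own illustration, not the exact formula used in the thesis:

```python
import numpy as np

def block_interest(roi_map: np.ndarray, cu_size: int = 32) -> np.ndarray:
    """Average the per-pixel interest values over each coding unit.

    roi_map: 2D array of shape (height, width) with values in [0, 1],
    e.g. one frame's output from a saliency model such as ACLNet.
    """
    h, w = roi_map.shape
    rows, cols = h // cu_size, w // cu_size
    interest = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            block = roi_map[r * cu_size:(r + 1) * cu_size,
                            c * cu_size:(c + 1) * cu_size]
            interest[r, c] = block.mean()
    return interest

def qp_offsets(interest: np.ndarray, max_offset: float = 6.0) -> np.ndarray:
    """Map interest to a per-CU QP offset: high interest gives a negative
    offset (finer quantization), low interest a positive one (coarser)."""
    return max_offset * (1.0 - 2.0 * interest)
```

In x265 terms, per-block offsets like these would play the same role as the offsets produced by the built-in complexity-based AQ modes, applied on top of the frame-level QP chosen by the rate control.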
The attention-aware video compression solution was evaluated in terms of subjective video quality through a double-blind AB test, in order to determine whether viewers perceived the attention-aware compression as giving better quality. Ten source files were selected from SVT Play as test material. The videos covered a variety of genres, such as drama, sports, news and nature documentaries, to include different types of content and movement. In the evaluation, the attention-aware compression was compared to a version of the video encoded with the same settings but without the ROI-based AQ mode. A low target bitrate of 200 kbps was used for both versions. The evaluation was set up on a website, which was sent out to 33% of all SVT Play users who accessed the platform through a browser, on both desktop and phone. Participants were presented with two variants of the same video and asked to select which variant looked best to them; this was done for all ten clips. The evaluation was available for a three-week period, and by the end of the study, 317 people had participated, resulting in a total of 3170 answers for the ten videos.
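The double-blind aspect simply means that which encoding appears as variant A or B is randomized per trial, so neither the participant nor the test page gives it away. A tiny sketch, with hypothetical field names of my own:

```python
import random

def make_trial(clip_id: str) -> dict:
    """Randomly assign the two encodings of a clip to positions A and B,
    so nothing on the page reveals which variant is which."""
    variants = ["roi_aq", "baseline"]  # the two encodings of the same clip
    random.shuffle(variants)
    return {"clip": clip_id, "A": variants[0], "B": variants[1]}
```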
The results from the evaluation were analyzed using one-way analysis of variance (ANOVA) to determine whether there were any statistically significant differences in perceived video quality. The one-way ANOVA revealed a statistically significant preference for the attention-aware compression in one out of ten test cases at a p-value below 0.05. In four cases, a preference was shown for the original encoding. For the remaining cases, no statistically significant preference in perceived video quality was found. Given these mixed results, it was more informative to analyze each of the encoded test videos in further detail.
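As an illustration of how such a per-clip test can be run, here is a sketch using made-up counts rather than the actual thesis data, coding each answer as 1 if the participant preferred the attention-aware variant and 0 otherwise (with only two groups, a one-way ANOVA is equivalent to a t-test):

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical tally for one clip: 180 of 317 preferred the ROI-based encoding.
n_total, n_roi = 317, 180
roi_scores = np.concatenate([np.ones(n_roi), np.zeros(n_total - n_roi)])
baseline_scores = 1.0 - roi_scores  # the complementary preference

f_stat, p_value = f_oneway(roi_scores, baseline_scores)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 -> significant preference
```

For binary data like this, a binomial test against a 50/50 split, e.g. scipy.stats.binomtest(n_roi, n_total, 0.5), would be a simpler alternative.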
After viewing the encoded evaluation videos and analyzing their bit plots, it becomes clear that there are certain types of video material that the solution struggles with. Scene cuts and fast movement seem to be the biggest causes of poor compression performance, which the evaluation results confirm. The most likely reason is that the calculation of the interest value for each coding unit was somewhat too naive. It caused the first frames of each video to receive a disproportionately large share of the bits compared to the following frames, which naturally makes it hard to encode a video that changes a lot. Due to the scope of the project, the evaluated solution unfortunately did not go through many iterations, but with some adjustments to the formula for setting the interest value, it would be possible to make the compression perform more evenly throughout the video. Another improvement would be to take the interest value into consideration in other parts of the rate control as well.
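One such adjustment, sketched below under my own assumptions rather than taken from the thesis, is to center the offsets around zero within each frame, so that the frame's overall bit budget stays roughly unchanged while bits are still shifted from low-interest to high-interest blocks:

```python
import numpy as np

def normalized_qp_offsets(interest: np.ndarray, max_offset: float = 6.0) -> np.ndarray:
    """Center the per-CU offsets around zero for each frame, so the frame's
    average QP (and thus its share of the bit budget) stays roughly the same
    regardless of how strong that frame's ROI map happens to be."""
    raw = max_offset * (1.0 - 2.0 * interest)
    return raw - raw.mean()
```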
Another reason why scene cuts in particular had an impact on the compression is that the visual attention model uses information from previous frames for inference. This made the predictions more precise, since temporal information was also used, but it also made them inaccurate immediately after a scene cut.
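A straightforward mitigation would be to detect scene cuts and reset the model's temporal state at each one. The sketch below uses a crude luma-difference heuristic and a hypothetical model interface; reset_state and predict are assumptions of mine, not ACLNet's actual API:

```python
import numpy as np

def is_scene_cut(prev: np.ndarray, curr: np.ndarray, threshold: float = 30.0) -> bool:
    """Crude scene-cut detector: mean absolute luma difference between frames."""
    diff = np.abs(curr.astype(np.float32) - prev.astype(np.float32))
    return float(diff.mean()) > threshold

def predict_roi_maps(frames, model):
    """Run a stateful saliency model frame by frame, resetting its temporal
    (e.g. LSTM) state at scene cuts so attention is not carried over from
    the previous scene."""
    prev, maps = None, []
    for frame in frames:
        if prev is not None and is_scene_cut(prev, frame):
            model.reset_state()
        maps.append(model.predict(frame))
        prev = frame
    return maps
```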
In order to meet the target bitrate, it was necessary to compensate for the lower compression in the ROIs by compressing the areas outside them more heavily. The hypothesis was that this would go unnoticed by the viewer, since that is not where their attention is. However, when a video was complex and had a lot of movement even outside the ROI, the high compression rate required to meet the low bitrate sometimes caused artifacts to appear in these areas. It is possible that this drew the viewer's attention away from the actual ROI, with the artifact instead creating a new, unintended "ROI". Since motion is a common attention cue, this issue may have emerged from the fact that some lower-level ROIs were filtered out for simplicity, meaning that areas that were partly interesting to the viewer were not handled as ROIs. By expanding what counts as a ROI, and by balancing the compression rate against the size of the ROI, it would probably be possible to achieve better performance, as in the sketch below.
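One way to read these two ideas as code, again under my own assumptions: lower the ROI threshold so that partly interesting blocks are kept, and let the background penalty depend on how large the ROI is, so that the redistribution stays roughly bit-neutral per frame:

```python
import numpy as np

def size_aware_offsets(interest: np.ndarray, max_offset: float = 6.0,
                       roi_level: float = 0.25) -> np.ndarray:
    """Per-CU QP offsets with a low ROI threshold (partly interesting blocks
    count as ROI too) and a background penalty balanced against the ROI's
    size, so the frame's total bit budget stays roughly unchanged."""
    roi = interest >= roi_level
    roi_fraction = float(roi.mean())
    if roi_fraction == 0.0 or roi_fraction == 1.0:
        return np.zeros_like(interest, dtype=float)  # nothing to redistribute
    offsets = np.zeros_like(interest, dtype=float)
    offsets[roi] = -max_offset * interest[roi]  # finer quantization in the ROI
    # Spread the ROI's extra bit cost over the background: the larger the ROI,
    # the more each background block must be coarsened to stay on budget.
    offsets[~roi] = -offsets[roi].sum() / (~roi).sum()
    return offsets
```

A small ROI then only needs a mild background penalty, which should make the sudden, attention-grabbing background artifacts less likely.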
However, there were some situations where the attention-aware compression did seem to perform well despite these areas for improvement. Material typically considered easier to compress, such as interviews with a fixed camera and a shallow depth of field, were the situations where people could not see a difference between the original video and the video encoded with the attention-aware compression. This suggests there could be potential to make the difference in compression rate between the ROI and its surroundings even bigger. Also, the one video where the attention-aware compression did give better perceived quality was a sequence from a rally. What is interesting about this result is that for this particular video there is a clear decrease in video quality outside the ROI, whereas the ROI, the rally car, does get more detail. This case confirms the hypothesis that it is, to some extent, possible to achieve improved subjective video quality by focusing the data on areas of interest.
In conclusion, I think this project shows that there is much more to research on the topic of AI-based attention-aware compression and its use in low bitrate encoding. Region of interest based compression could become a new tool for making low bitrate encoding even more effective, improving perceived video quality and making lower quality streams more enjoyable to watch. As of now, the solution seems to have the most potential for videos with a clear region of interest and a background that is not too busy, but there are many possible improvements to explore in order to achieve better results for more complex video sequences. Lastly, I want to mention that when exploring AI solutions, it is important to take into account that they are computationally heavy and require dedicated hardware, which can make them quite inaccessible. It also means they can be questioned from a sustainability perspective. With that said, I do believe that attention-aware video compression is something we will see more of in the future.
Sources:
[1] Yun Zhang, Linwei Zhu, Gangyi Jiang, Sam Kwong, and C.-C. Jay Kuo. 2021. A Survey on Perceptually Optimized Video Coding. CoRR abs/2112.12284 (2021). arXiv:2112.12284 https://arxiv.org/abs/2112.12284
[2] Zhicheng Li, Shiyin Qin, and Laurent Itti. 2011. Visual attention guided bit allocation in video compression. Image and Vision Computing 29, 1 (2011), 1–14. https://doi.org/10.1016/j.imavis.2010.07.001
[3] Xuebin Sun, XiaoFei Yang, Sukai Wang, and Ming Liu. 2020. Content-aware rate control scheme for HEVC based on static and dynamic saliency detection. Neurocomputing 411 (2020), 393–405. https://doi.org/10.1016/j.neucom.2020.06.003
[4] Wenguan Wang, Jianbing Shen, Jianwen Xie, Ming-Ming Cheng, Haibin Ling, and Ali Borji. 2019. Revisiting Video Saliency Prediction in the Deep Learning Era. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).