Video transcoding is a real beast: in addition to being hard to get right, it’s also very CPU-intensive, and probably the most important daily subject here for us in Dailymotion’s video architecture team.
A while ago, we asked ourselves this: what if we could save time and power while still retaining a decent output quality for the massive number of transcodings we perform every day? Sounds appealing, right? If you want to know more, then fasten your seatbelt and read on about hardware-assisted video transcoding.
Dailymotion receives about 150 000 new videos every day, each of them transcoded into 4 to 8 qualities to provide the best possible streaming experience for our users. You do the math: that’s about 20 million transcoding tasks per month, so any optimization you can get in the process is more than welcome.
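For readers who do want to do the math, here is a minimal sketch of the estimate. The 30-day month is an assumption made here for illustration; the video count and the 4-to-8 range come from the paragraph above.

```python
# Rough estimate of monthly transcoding tasks.
# VIDEOS_PER_DAY and the 4..8 range come from the post;
# DAYS_PER_MONTH = 30 is an assumption made for this sketch.
VIDEOS_PER_DAY = 150_000
DAYS_PER_MONTH = 30


def monthly_tasks(qualities_per_video: int) -> int:
    """Number of transcoding tasks per month for a given number of qualities."""
    return VIDEOS_PER_DAY * DAYS_PER_MONTH * qualities_per_video


low = monthly_tasks(4)
high = monthly_tasks(8)
print(f"between {low:,} and {high:,} tasks per month")
```

Under these assumptions the range runs from 18 to 36 million tasks, so the quoted 20 million sits near its low end.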
There are many solutions for video transcoding: you can either use off-the-shelf software, or buy dedicated hardware that will do the job for you (“blackbox”-style). Software transcoding often rhymes with flexibility, whilst dedicated hardware usually brings speed.
The legacy transcoding farm at Dailymotion is a set of powerful CPU-based servers (systems with up to 56 logical Xeon cores) running a specially-crafted version of FFmpeg. Two years ago, whilst planning for higher transcoding capacity to accommodate our ever-increasing audience, we started investigating innovative solutions, bearing in mind the need to keep the flexibility we’ve had so far with the pure software solutions. For instance, moving to a proprietary hardware solution was not really an option because of the vendor lock-in it would have imposed, and because these boxes often come with their own task scheduling workflow, something we already had and were very happy with (this part is probably for another blog post).
Instead, we decided to take a chance on a “hardware-assisted” solution offered by Intel called “QuickSync Video” (QSV). We downloaded the nice community SDK (“Intel Media Server SDK”) for Linux, had a quick glance at the documentation, and the game was on!
The (pretty big) Intel Media SDK comprises several components, of which we only needed two:
- a set of user-space libraries, open-source for most of them, except the ones that are talking directly to the GPU.
- a set of kernel patches (up to v4.4 for now), to adapt the “i915.ko” Intel display driver to work with the SDK.
I won’t go into detail about how we integrated this on our servers (it’s pretty straightforward if you follow Intel’s documentation). Instead, I’ll focus on how we actually used it and what it did or did not bring us.
Since our workflow is entirely based on FFmpeg, we enabled QSV support (partially available since version 2.8), added a bunch of patches to fix some early-stage bugs and enabled hardware-acceleration for more steps of the transcoding pipeline (such as scaling, trans-rating or de-interlacing).
And voilà: we were able to seamlessly compare the software-only and the hardware-assisted solutions head-to-head.
We used the following two systems for our tests:
- a CPU-only system based on dual-socket Xeon E5-2683 v3 CPUs @3GHz (14 cores MT each, 56 logical threads total).
- a GPU-assisted system based on a Xeon E3-1585L v5 (4 cores MT, 8 logical threads total). The embedded GPU is the best Intel can do as of today (with 72 video processing offloading units). We are using an HPE Moonshot 1500 chassis (see picture above), that can handle as many as 45 Xeon cartridges simultaneously.
It’s interesting to highlight that the dual Xeon E5-2683 setup has a combined 240W TDP, while the Xeon E3-1585L tops out at 45W. It means the power gain factor is already at least 5 times (provided the transcoding horsepower is equivalent).
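The factor of 5 follows directly from the two TDP figures. A minimal sketch of the arithmetic, bearing in mind that TDP is a thermal design figure and only a rough proxy for actual power draw under load:

```python
# Ratio of the two TDP figures quoted in the post.
# TDP is used here as a rough proxy for power draw; real consumption
# depends on the workload and on the rest of the system.
CPU_SYSTEM_TDP_WATTS = 240  # CPU-only system
QSV_SYSTEM_TDP_WATTS = 45   # Xeon E3-1585L


def tdp_ratio(reference_watts: float, candidate_watts: float) -> float:
    """How many times less power the candidate is rated for."""
    return reference_watts / candidate_watts


ratio = tdp_ratio(CPU_SYSTEM_TDP_WATTS, QSV_SYSTEM_TDP_WATTS)
print(f"power gain factor: {ratio:.1f}x")
```

The ratio only translates into a real saving if one QSV cartridge delivers throughput comparable to one CPU server, which is exactly what the tests set out to check.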
We then ran 2 typical transcoding scenarios to compare both solutions:
- from a 1080p24 video file, we generated 6 AVC qualities with resolutions ranging between 1920x1080@6Mb/s and 176x144@93Kb/s.
- from a 480p25 video file, we generated 3 AVC qualities with resolutions ranging between 848x480@700Kb/s and 176x144@93Kb/s.
Speaking of quality, software transcoding uses a 2-pass profile, while QSV can only run in single-pass mode. To cope with the resulting inevitable quality loss, we used a special feature of the QSV encoder called “Look-Ahead”: it basically tells the encoder to analyse a few seconds forward in the stream, to compute the best bitrate/bitbucket to use for the next few frames. The visual improvement was definitely noticeable and the resulting quality was very close to software 2-pass encoding. On the downside, it led to a ~20% performance drop compared to the regular VBR algorithm. Nevertheless, we chose to keep this option activated for the best visual experience possible.
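To give an idea of what enabling Look-Ahead can look like, here is a minimal sketch that builds an FFmpeg command line. It assumes the `look_ahead` and `look_ahead_depth` options of the `h264_qsv` encoder found in mainline FFmpeg; our patched 2.8 build may have exposed them differently. The file names and the depth of 40 frames are placeholders chosen for illustration, not the values we used. Note that the depth is expressed in frames, not seconds.

```python
import shlex


def qsv_lookahead_args(src: str, dst: str, bitrate_kbps: int,
                       depth_frames: int = 40) -> list[str]:
    """Build an ffmpeg argument list for single-pass QSV encoding with look-ahead.

    Assumes the mainline h264_qsv encoder options; depth_frames is an
    illustrative default, not a recommendation from the post.
    """
    return [
        "ffmpeg", "-i", src,
        "-c:v", "h264_qsv",                      # Intel QSV H.264 encoder
        "-look_ahead", "1",                      # enable look-ahead rate control
        "-look_ahead_depth", str(depth_frames),  # frames analysed ahead
        "-b:v", f"{bitrate_kbps}k",              # target video bitrate
        "-c:a", "aac",
        dst,
    ]


# Hypothetical file names; 6000 kb/s matches the 1080p rendition above.
cmd = qsv_lookahead_args("input.mp4", "output_1080p.mp4", 6000)
print(shlex.join(cmd))
```

Building the list first, rather than a shell string, keeps the arguments safe to hand to `subprocess.run` without quoting issues.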
Another very nice QSV feature is the option to constrain the output bitrate to a maximal value, within a user-selectable time-window. It is fundamental to get this right for ABR protocols like HLS or DASH, where the current quality level is based on the client’s available bandwidth: large fluctuations in the bitrate lead to excessive back-and-forth switches and potential re-buffering. The x264 software encoding library has a similar option, but the results are far more volatile.
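One way to check whether an encoded stream actually respects such a cap is to measure its peak bitrate over a sliding time-window. The sketch below is not the encoder's own algorithm, just a hypothetical verification helper; the per-frame sizes are assumed to come from whatever probing tool you prefer.

```python
from collections import deque


def peak_windowed_bitrate(frame_sizes_bytes, fps: float,
                          window_seconds: float) -> float:
    """Peak bitrate (bits/s) over any full sliding window of the stream.

    Returns 0.0 if the stream is shorter than one window.
    """
    window_frames = max(1, round(fps * window_seconds))
    window = deque()
    window_bytes = 0
    peak = 0.0
    for size in frame_sizes_bytes:
        window.append(size)
        window_bytes += size
        if len(window) > window_frames:
            window_bytes -= window.popleft()  # drop the oldest frame
        if len(window) == window_frames:
            duration = window_frames / fps
            peak = max(peak, window_bytes * 8 / duration)
    return peak


# Synthetic stream at 25 fps: calm, then a one-second burst, then calm again.
sizes = [1000] * 50 + [5000] * 25 + [1000] * 50
print(peak_windowed_bitrate(sizes, fps=25, window_seconds=1.0))
```

Comparing this peak against the cap for each rendition shows how much headroom a player would need before deciding to switch quality levels.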
The following FFmpeg versions were used during the tests:
- FFmpeg v2.8 (with patches from Intel) for hardware-assisted transcoding.
- FFmpeg v3.2.2 (with all possible SIMD optimizations enabled) for software transcoding.
We also tried to use the same parameters, when possible:
- AVC profile and level (depending on the output resolution and encoding bandwidth).
- Keyframe interval (forced every 10 seconds).
- Frame-rate (left untouched, no trans-rating nor pull-down).
- AAC audio.
- MP4 container.
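As an illustration of the keyframe setting, a 10-second interval translates into a GOP length that depends on the source frame-rate. The sketch below computes it for the two test sources; such a value could be passed to FFmpeg's generic `-g` option, although that is only one possible way to force keyframes and not necessarily the one used here.

```python
def gop_size_frames(fps: float, keyframe_interval_seconds: float = 10.0) -> int:
    """Number of frames between two keyframes for a given frame-rate."""
    return round(fps * keyframe_interval_seconds)


# The two test sources are 1080p24 and 480p25.
for source_fps in (24, 25):
    print(f"{source_fps} fps -> keyframe every {gop_size_frames(source_fps)} frames")
```

Since the frame-rate is left untouched, every rendition of a given source shares the same GOP length, which keeps keyframes aligned across qualities.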
QSV was used for the source material decoding, frame-processing (scaling, de-interlacing, …) and encoding steps. For the software counterpart, we used the maximum thread count recommended by the libx264 author (16). The results are expressed in Frames transcoded Per Second (FPS), aggregated among all concurrent tasks in tests 2 and 3.
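A minimal sketch of one way to aggregate FPS across concurrent tasks: total frames produced by all tasks, divided by the wall-clock time of the whole batch. This is an assumption about the aggregation, and the numbers in the example are made up for illustration; they are not measurements from our tests.

```python
def aggregate_fps(frames_per_task, wall_clock_seconds: float) -> float:
    """Frames transcoded per second, summed over all concurrent tasks."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return sum(frames_per_task) / wall_clock_seconds


# Hypothetical batch: 10 concurrent tasks of 2400 frames each,
# all finished after 120 seconds of wall-clock time.
example = aggregate_fps([2400] * 10, 120.0)
print(example)
```

Measuring against wall-clock time for the whole batch makes single-task and concurrent runs directly comparable.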
1st test: single 1080p transcoding
2nd test: concurrent 10x1080p transcodings
3rd test: concurrent 20x480p transcodings
What we learned from these figures:
- QSV is much… much faster for single transcoding workloads (more than 12 times faster than software when transcoding 1080p to 1080p), which means shorter publishing times for our users.
- Unlike software transcoding, you don’t have to schedule concurrent tasks on the same box to get the maximum performance (at least for high resolutions): the complete GPU horsepower is available for a single task (and divided among the concurrent tasks you may run otherwise), so there is no “ceiling” effect.
- Software is better for multiple tasks with low-resolution source material (and in some cases with low-resolution output): the more tasks you do concurrently, the better it gets (up to 2 times faster than QSV).
- 1-pass encoding is 30–50% faster than 2-pass (of course with a toll on quality).
At the end of the day, if you consider that QSV is about 5 times more energy-efficient than CPU-based solutions and also reduces the time it takes for the final videos to become available, it looks pretty darn good. Of course, the number of CPU-only systems that you’ll be able to replace with QSV will largely depend on the kind of videos you get in your workflow and the encoding profiles/steps you choose to have. But overall, QSV encoding really makes a huge difference for high-resolution profiles, and this is even more noticeable on 2K/4K videos (not depicted in this post).
We switched completely to the hardware-assisted transcoding solution presented in this post a few months ago (replacing about 160 blade servers with 90 GPU cartridges), and haven’t looked back since!
Happy transcoding to everyone and may the GPU be with you!