PFClean Restoration Workstation and Performance Profiling
How we profiled performance and improved the latest versions of PFClean on our test-bed workstation
In 2016 we developed a workstation specifically to create a performance test bench for PFClean, with a focus on improving GPU processing. In the first part of this article we will share the component choices we made and why those decisions are still relevant now.
The second part of the article presents results from testing our latest release, PFClean 2018, against PFClean 2016 and 2017. During the development of PFClean 2016, 2017 and 2018, this stable platform has been instrumental in our ability to radically improve processing performance across many aspects of the application.
Our Component Choices
The aim was to build a platform that represented a typical user-level workstation: one that didn't cost an arm and a leg but, in our opinion, provided the best bang for the buck for the day-in, day-out high-resolution digital restoration projects that have become the norm rather than the exception.
The component choices have been broken down into sections covering GPU, processor, chassis, drives and RAM, and each section outlines the goals we strove to achieve with each choice. If you are building a workstation yourself, the B.O.M. is supplied at the end of the article.
GPUs for PFClean and Why You Need Them
Traditionally in many applications, processing has taken place on the CPU as a series of discrete actions performed sequentially. With the low core counts, memory bandwidth and clock speeds of older processors, this led to bottlenecks in processing times, especially in tools that require multiple layers of motion estimation, detection and fixing, as many of our algorithms do.
The Solution? General Processing on The GPU
PFClean was the first commercially available application in any market to use a technique called General-Purpose computing on the GPU (GPGPU). By offloading compute-intensive tasks to the Graphics Processing Unit, PFClean is able to utilise the massively parallel architecture of the GPU to dramatically increase the speed of your restoration. In PFClean you can employ multiple GPUs for both display and batch processing, as well as multiple GPUs over a network to rapidly export your restored material. PFClean continues to be at the forefront of this technology with its Digital Wet Gate & Telerack, which are designed from the ground up as a GPU-only processing engine.
What Do You Need to Boost Your Throughput?
PFClean uses OpenCL and hence requires an OpenCL 1.2 compliant card to run. For the purposes of this blog post, AMD's Radeon Pro Duo was chosen; this card has the benefit of two GPUs on one PCB, each with 8GB of GDDR5 memory for a total of 16GB. It is the sweet spot in AMD's line of professional cards and offers great performance, thermal characteristics, power consumption and price.
Please note: when using a dual-GPU setup, the Above 4G Decoding option must be enabled in the BIOS.
In terms of performance, you would have to spend a great deal more on multiple dual-GPU cards for fairly modest improvements. As a rule for 4K content, a minimum of 8GB of VRAM per card is recommended for acceptable performance; the Pro Duo with its 16GB fulfils this requirement, although for 4K, AMD's newer Vega 64, with greater memory bandwidth and more stream processors, might be a much better option.
Gaming cards are typically non-reference designs intended for short stints at full load, so stable performance over time cannot be guaranteed. They also run hot, carry less RAM, and their drivers are not tested as thoroughly as those of their workstation siblings. In short, it could mean the difference between a large project rendering, or not.
How Many Processors? How Many Cores? How Many GHz?
It used to be the case that the processor with the most cores and the highest clock speed did the best job. Nowadays this is simply no longer the case; Intel has even started to segregate its range into what it terms workstation processors and server processors. With portions of the processing farmed out to the GPU, the CPU is dedicated to managing tasks rather than doing the heavy lifting itself. Although PFClean is massively multithreaded, dual 28-core Xeons won't yield high performance gains; lower core counts with very high clock speeds and large memory bandwidth, on the other hand, will balance the GPU processing nicely. 6, 8 and 10 core CPUs are considered the ideal sweet spot for workstation purposes.
The Choice of CPU
It's recommended that 8 physical processing cores are available for good performance in PFClean. We have chosen dual 6-core Xeon E5 v3 processors (12 cores total) with a 2.4GHz base clock and 3.2GHz boost; they are a good balance between multi-threaded performance, PCIe lanes and cost. Additionally, their relatively low power draw and heat output make them ideally suited to running for extended periods without taxing either the power supply or the thermal management of the system. Performance increases can be had with dual 8-core Xeons, but the money might be better spent on a more powerful GPU solution and faster storage. The advantage of a dual-socket design is more PCIe lanes, and thereby more expansion room for GPUs, large PCIe SSD cards and Thunderbolt I/O.
Off The Shelf
Can you use desktop-oriented CPUs with PFClean? The short answer is yes. The long answer is that it may not be such a good idea, as many lower-end desktop CPUs, such as Core i7s and i5s, won't necessarily have the multi-threaded performance or the memory bandwidth for high-resolution restoration work. And when factoring in a dual-GPU configuration and a PCIe SSD, you will quickly find your PCIe lanes becoming saturated and overall system performance diminished. In addition, desktop CPUs won't have the level of stability and longevity expected from the professional line of Xeons. While many of the high-end i7 processors work extremely well in practice, they can't match the level of support offered by their Xeon counterparts. However, the newer enthusiast-level Intel Core i9 desktop processors, while untested by us, do on paper meet the requirements for bandwidth, threads and speed.
Too Many Choices
PC chassis is a loose term for a box to put all your bits in; there are myriad options, and the term can take on different definitions from manufacturer to manufacturer. They mainly fall into two camps: pre-configured barebones systems into which you just install your selected components, and completely empty boxes that allow you to scratch-build a system, leaving you to decide every single component. Pre-configured systems with a motherboard and power supply built in offer a speedy, reliable, warranty-supported way to customise a system to your needs, whilst eliminating some of the more fiddly, time-consuming installation and management processes.
From Firm Foundations
The main goal here is to select a chassis with room to expand your PFClean workstation and maintain a level of future-proofing. The Supermicro SYS-7038A-I is a pre-configured chassis with motherboard and power supply that offers the flexibility of a modern dual-socket board, with enough space in the chassis and overhead in the power supply to run multiple GPUs/expansion cards, and support for up to 18 processing cores per chip. It also has a relatively low entry cost and is easy to build compared with other comparable systems. I/O is another big factor, and the system offers USB 3.0 front headers and a Thunderbolt 2 upgrade path in the form of a PCIe card in the PCH lane, giving you high transfer speeds when moving footage onto the system. Supermicro also has BIOS updates available to allow the socket to take Xeon E5 v4 processors; something that a domestic desktop motherboard in many cases will not be able to do. With this particular chassis it is possible to get your system up and running within a couple of hours.
Every CPU has a maximum number of PCIe lanes it uses for sending packets of information back and forth to your GPU and other PCIe devices. A modern Xeon such as the one we have chosen for our workstation build (Xeon E5-2620 v3) has up to 40 PCIe lanes at its disposal per CPU, providing more than adequate lanes for two high-end GPUs and PCIe storage. Typically, a dual-GPU workstation with PCIe storage and a network adapter will use up to 40 lanes, 20 per CPU on a dual-socket Xeon system.
A high-end consumer-grade CPU such as the Intel Core i7-7700K has a maximum of only 16 PCIe lanes, meaning that once a GPU such as the AMD FirePro W9100 is installed, there are no lanes left for other devices such as network adapters and storage; adding them forces the card to run with only 8 PCIe lanes, hampering performance. This is one area that demonstrates the need for a professional-grade solution, as PCIe lanes can become a serious bottleneck for restoration work.
Some manufacturers use PCIe switches, via an extra chipset on the motherboard, to enable more PCIe lanes. Although this may allow two 16-lane devices to be connected through a single 16-lane connection to the CPU, it will not provide full sustained throughput from both devices at the same time.
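The lane arithmetic above can be sketched as a quick budget check. Note the per-device lane counts below are illustrative assumptions, not measured values from our build:

```python
def lane_budget(cpu_lanes, devices):
    """Return (lanes_used, lanes_free); negative free means lane contention."""
    used = sum(devices.values())
    return used, cpu_lanes - used

# Dual-socket Xeon E5 v3: up to 40 lanes per CPU, 80 in total.
workstation = {"gpu_1": 16, "gpu_2": 16, "nvme_ssd": 4, "network_adapter": 8}
used, free = lane_budget(80, workstation)
print(used, free)   # 44 lanes used, 36 to spare

# Consumer Core i7-7700K: only 16 CPU lanes available.
desktop = {"gpu_1": 16, "nvme_ssd": 4}
used, free = lane_budget(16, desktop)
print(used, free)   # 20 used, -4 free: the GPU is forced down to x8
```

A negative "free" count is exactly the situation described above, where the consumer platform silently drops the GPU to 8 lanes.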
How Fast and How Much Storage?
When building a system for PFClean, it is important to keep in mind the most common formats you are working with and how much material you have. A 2K DPX feature-length film will take up roughly 1-1.5TB of storage and will require a minimum of 350MB/s plus overhead to play back in real time, whereas 4K will be four times this amount in both storage and speed. Other factors to consider are space for renders and a 30% throughput overhead, so that you are not putting too much strain on the drives. As a rule, never put more storage into your system than you can safely and regularly back up.
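As a rough sanity check, these figures can be estimated from first principles, assuming 10-bit RGB DPX packed at 4 bytes per pixel, 24fps playback and Full Aperture frame sizes (exact numbers vary with format and file headers):

```python
def dpx_bandwidth_mb_s(width, height, fps=24, bytes_per_pixel=4):
    """Approximate sustained read rate (MB/s) for real-time DPX playback.
    Assumes 10-bit RGB packed into 32 bits per pixel; ignores file headers."""
    return width * height * bytes_per_pixel * fps / 1e6

def feature_storage_tb(width, height, minutes=90, fps=24, bytes_per_pixel=4):
    """Approximate raw scan size (TB) for a feature-length film."""
    frames = minutes * 60 * fps
    return width * height * bytes_per_pixel * frames / 1e12

print(round(dpx_bandwidth_mb_s(2048, 1556)))     # ~306 MB/s for 2K Full Aperture
print(round(dpx_bandwidth_mb_s(4096, 3112)))     # ~1224 MB/s for 4K (4x the 2K rate)
print(round(feature_storage_tb(2048, 1556), 2))  # ~1.65 TB for a 90-minute 2K scan
```

The ~306MB/s raw playback rate, plus headroom, is consistent with the 350MB/s minimum quoted above, and the 4K figure confirms the four-times scaling.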
What About Caching?
Caching is a big part of the PFClean workflow and involves storing large amounts of temporary processing data for very quick reuse; the final output (export) is considered the render. What makes an SSD perfect for these operations versus a traditional HDD is that its access times are up to 100+ times quicker. We have chosen an Intel PCIe NVMe 1.2TB SSD. Its high-performance, low-latency design makes it ideal for handling lots of small files, and with sequential writes at 1400MB/s and read performance of over 2400MB/s it will help when handling 4K restoration projects.
In the system we have 4 x Seagate 3TB SATA HDDs. In a RAID 0 configuration this gives a theoretical storage capacity of 12TB and an actual capacity of 10.8TB. Disk speeds will be in excess of 800MB/s read and 700MB/s write, which provides plenty of throughput for the 350MB/s required for 2K DPX playback. This is ideal for two full-length feature films (120 mins each) including raw scans, cached files and renders, with plenty of overhead. HDDs make great storage drives for your footage due to their high capacity and low cost, whereas building an equivalent storage array using SSDs would cost thousands. It is possible to work with 4K on these drives with the assistance of the NVMe SSD for caching and more RAM, but greater performance can be had by replacing them with SSDs. We have also included a professional-grade Samsung boot drive for the OS; these drives have a proven track record and are recommended by most machine builders.
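The gap between the marketed 12TB and the capacity you actually see is mostly the decimal-versus-binary unit conversion; a quick sketch (filesystem overhead, not modelled here, shaves off a little more, which is presumably where the 10.8TB figure comes from):

```python
def raid0_capacity(drive_tb, n_drives):
    """RAID 0 capacity: marketed decimal TB vs the binary (TiB-style)
    figure most operating systems report."""
    total_bytes = drive_tb * 1e12 * n_drives
    return total_bytes / 1e12, total_bytes / 2**40

marketed, reported = raid0_capacity(3, 4)
print(marketed, round(reported, 1))   # 12.0 marketed, ~10.9 as reported by the OS
```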
We have chosen 32GB of DDR4 system RAM, which provides each of the 12 processing cores with around 2.7GB and will easily cache 2K shots of normal length. As a guide, there should be at least 2GB per physical core. The board can take up to 2TB of ECC 3DS LRDIMM, and for 4K restoration it would ideally be upgraded to an absolute minimum of 64-128GB.
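The per-core guideline is easy to check; a minimal sketch of the sizing rule of thumb, not a PFClean requirement check:

```python
def ram_per_core_gb(total_ram_gb, physical_cores):
    """RAM available to each physical core, for the 2GB-per-core guideline."""
    return total_ram_gb / physical_cores

print(round(ram_per_core_gb(32, 12), 2))   # ~2.67 GB/core for our build
print(ram_per_core_gb(32, 12) >= 2.0)      # True: meets the 2GB-per-core guide
print(round(ram_per_core_gb(128, 12), 1))  # ~10.7 GB/core after a 4K upgrade
```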
Building your own system
If you want to have a go at building a system yourself, I have provided some links below to get you started. Don't worry, it's actually very easy to do, and in a short space of time you can have a 4K-capable system. Our test system was built for under £3000 (not including GPUs).
Bill of Materials (BOM):
- Chassis Supermicro SYS-7038A-I (x1) Manufacturer Link
- Processors Intel Xeon E5-2620 v3 (x2) + Heatsinks for E5 v3 Socket Manufacturer Link
- GPU AMD Radeon Pro Duo Manufacturer Link
- Boot Drive Samsung 850 EVO 500GB (x1) Manufacturer Link
- 12TB RAID 3TB Seagate ST3000DM001 (x4) Manufacturer Link
- Cache Disk Intel DC P3500 1.2TB NVMe AIC SSD (x1) Manufacturer Link
Since building this system in 2016, we have used it as our primary test system, both to try new features such as the Digital Wet Gate and to test different approaches to finding and eliminating performance bottlenecks in the existing toolset. For this, we devised several test cases varying in the number of frames, number of clips, image resolution and the number of nodes in the Workflow Manager. For our longer-term comparison, these numbers always reflect a real-world user scenario; more specifically, the time it took to export the project, with the original footage located on the RAID and the exported files written to the cache disk, without any intermediate caches.
One of these test projects consists of 4K Full Aperture DPX footage, processed in a Workbench with a stack consisting of Auto De-Flicker, Auto Dirt Fix and Fix Scratch. The Workbench is connected to a File Out node. The first graph illustrates the time it takes to export 500 frames in a single clip. Since this project consists of a single clip, we only use 1 of the 2 available GPUs for processing.
For PFClean 2017, we introduced the GPU-optimised Digital Wet Gate, which is not part of this particular project, as it didn't exist in 2016. However, using many lessons learned while developing the Digital Wet Gate and evaluating its performance on this test machine, it was possible to reduce average export times in PFClean 2017 over 2016 for this particular test project by a tenth.
While developing 2018, the focus shifted firmly to fundamental architectural modifications in areas of the application, such as the Workbench, that stood to benefit. As the graph above clearly shows, this effort has paid off, with this particular test exporting 7 times faster in PFClean 2018 than in the 2017 version.
In an extension of the above test, we scale up to 10,000 4K frames, divided into 43 clips, which loosely represents a 35mm film reel. As there are multiple clips to process, both older versions of PFClean profit from the additional GPU in the test system, so all versions were configured to use 2 GPUs for processing.
PFClean 2018's improvements are clear: in Test 2, the performance gains accrued over the last 3 years amount to exporting (rendering) work seven times faster than the 2016 release. Here are some of the test results:
- 2017 > 2018: 5.7 times faster (570% of 2017 throughput)
- 2016 > 2018: 7.1 times faster (710% of 2016 throughput)
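For clarity, "N times faster" here means the ratio of old to new export time, so 7.1x corresponds to 710% of the original throughput. The export times below are hypothetical placeholders chosen only to reproduce the ratios, not our measured results:

```python
def speedup(old_seconds, new_seconds):
    """How many times faster the new version is (ratio of export times)."""
    return old_seconds / new_seconds

print(round(speedup(710, 100), 1))   # 7.1x, as in the 2016 > 2018 result
print(round(speedup(570, 100), 1))   # 5.7x, as in the 2017 > 2018 result
```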
In many cases, these impressive performance gains were achieved through hard work by the development team. Other enabling factors are improvements to GPU architectures and their respective OpenCL drivers. It is important to note that all tests are run on identical hardware, i.e. if the GPUs are upgraded, then all tests are re-run across all versions. (The GPUs are the only components that have been changed or upgraded since building the test platform.)
The above system, in combination with PFClean 2018, is capable of outputting close to a feature-length film in the time it took to output a single reel in 2017. High-volume commercial film and video restoration is possible on affordable hardware. A well-considered, balanced workstation can provide the throughput required to complete projects on time and on budget; in many cases, over-specifying components yields diminishing returns on investment.