@cyanreg
Last active February 23, 2026 18:01
Video encoding and decoding with Vulkan Compute shaders

Video on the internet has largely become a solved problem. Most devices with a video output ship with decoding and encoding accelerator chips, and APIs like the Vulkan® Video set of APIs allow direct access to them. Newer codecs are royalty-free with open specifications, or become royalty-free thanks to time, making support for codec standards accessible to everyone.

Few, however, remember how much decoding 720p H.264 strained the majority of CPUs 18 years ago, or the optimizations, the competition between software implementations, and the long road it took before real-time decoding was possible, until decoding hardware arrived and APIs were written.

Yet, in a different area, this exact problem is what users are facing today. Editors scrubbing over days of raw camera footage, colorists working with 16K 16-bit RGB, VFX artists encoding 32-bit floating-point Log video, archivists working with extreme resolution lossless film scans. Users back then may have tolerated a few frame drops here and there, but these days, liquid-cooled CPUs with 64 cores and hundreds of gigabytes of RAM are often the only thing standing between users enjoying what they do and hating it.

This post covers how Vulkan Compute is used in FFmpeg to transparently implement decoding and encoding of such video on consumer GPUs, thanks to their parallelization capabilities.

Codecs

Codecs are algorithms designed to take advantage of similarities in a signal and compress it to a much smaller size, for storage or transmission. For example, compressing JPEG, the C. elegans of compression, requires transforming the pixels into a 2D frequency representation (somewhat parallelizable, where each row can be done first, and then each column), DC value prediction (serialized), quantization to remove perceptually irrelevant information (fully parallelizable), and finally Huffman coding (extremely serialized).

[jpeg coding/dc prediction diagram]

Decoding is essentially the same process, but in reverse.

The process has a number of fully serial bottlenecks. GPUs, by contrast, are designed around executing work which is independent and uncorrelated.
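The partial parallelism in the transform stage comes from its separability: every row can be transformed independently, then every column. A minimal Python sketch of a naive DCT-II, purely illustrative (real codecs use fast fixed-point factorizations):

```python
import math

def dct_1d(v):
    # Naive, unscaled DCT-II of a 1D signal.
    n = len(v)
    return [sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
            for k in range(n)]

def dct_2d(block):
    # Transform every row, then every column. Each row (and afterwards each
    # column) is independent, which is what a GPU can execute in parallel.
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

coeffs = dct_2d([[128] * 8 for _ in range(8)])
# A flat 8x8 block concentrates all its energy in the DC (top-left) coefficient.
```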

Compromises

In the past, the obvious solution has been hybrid decoding. This is where the serial steps, such as coefficient decoding, are performed on the CPU, with the results then uploaded to the GPU, so that the parallelizable steps are offloaded and executed efficiently.

Unfortunately, not only do GPUs prefer independent work, they're also usually quite far away from where the system RAM is. Even with DMA and huge amounts of data throughput, latency becomes the limiting factor, and usually renders decoding or encoding slower than executing the parallel steps on the CPU, particularly with modern CPUs featuring SIMD.

Most experiments in hybrid encoders or decoders have resulted in unsatisfactory speedups. dav1d, for example, tried to parallelize the final filters (themselves complex yet parallelizable pieces of code), unfortunately with no gain over CPU-only execution, even on mobile devices. x264 had basic OpenCL integration, but the latency drawback of uploading frames made it too inefficient to improve performance, and the code eventually bitrotted.

This has resulted in hybrid implementations being somewhat stigmatized in the multimedia community. If compute-based codec implementations were to become consistently performant, easily accessible and maintained, they would need to be full GPU-only implementations.

Where there's a will...

Most codecs are designed to run on special ASIC hardware. We have released a set of extensions (the Vulkan Video set of APIs) to use such hardware, which is often featured on modern GPUs. Even ASICs, however, don't have infinitely fast logic to decode video. That means most codecs make a compromise, allowing some work to be parallelized, down to a slice or a block.

Most popular codecs were designed decades ago, and while the minimum unit size has stayed constant, video resolution has only gone up. Also, while GPUs are mostly all about performing many parallel computations, these days they do come with features which allow cross-invocation communication.

This means that nowadays, using modern features, it's usually possible, and worthwhile, to implement codecs entirely in compute shaders.

Accessibility

FFmpeg is a free and open source collection of libraries and tools for working with video or audio streams, regardless of the format or codec used. While it's famous for its codec implementations with handwritten assembly optimizations for multiple platforms, it also provides easy access to hardware accelerators, amongst which are the Vulkan Video set of APIs.

Crucially, hardware acceleration in FFmpeg is built on top of the software codecs. Parsing of headers, scheduling of frames and slices, and error correction/handling all happen in software; only the decoding of the actual video data is offloaded. This allows very robust and well-tested code to be combined with hardware acceleration. It also allows users to switch between software and hardware implementations dynamically via a toggle, with no differentiation as to whether hardware decoding is implemented using the Vulkan Video APIs or via Vulkan Compute shaders.

The widespread usage of FFmpeg in editing software, media players and browsers, combined with the ability to add hardware accelerator support to any software implementation, was an ideal starting point for making compute-based codec implementations accessible, rather than dedicated library implementations.

FFv1

The FFmpeg Video Codec version 1 has become a staple of the archival community and of applications where lossless compression is required. It's royalty-free, with open specifications, and is an official IETF standard.

The work of implementing codecs in compute shaders in FFmpeg began here. The FFv1 encoder and decoder are very slow to run on a CPU, despite supporting up to 1024 slices. This is partly due to the huge bandwidth needed for high resolution RGB video, and partly due to the somewhat bottlenecked codec design.

FFv1 version 3 was designed over 10 years ago, and it was thanks to the archival community, who adopted it, that it gained wide usage. However, the bottlenecks were making encoding and decoding of high resolution archival film scans prohibitively time consuming.

Thus, compute-shader versions of the FFv1 encoder and decoder were written. They started out as conversions of the software encoder and decoder, but were gradually optimized more and more with GPU-specific functions.

The biggest challenge when encoding FFv1 is working with the range coder system, which lacks the optimizations that AV1's range coder has. Each bit of every symbol (pixel difference value) has its own 8-bit adaptation value, which requires randomly looking up 32 contiguous values from a set of thousands when encoding or decoding each symbol. We speed this up with a workgroup size of 32, where each local invocation performs its lookup and adaptation in parallel, while a single invocation performs the actual encoding or decoding.
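A rough model of that access pattern (with a hypothetical state layout and adaptation rule, not FFv1's actual tables): all 32 lanes gather the per-bit states at once, a single lane consumes them serially, and every lane then adapts its own state in parallel.

```python
# Hypothetical sketch of parallel context prefetch around a serial coder.
# 'states' stands in for the thousands of 8-bit adaptation values; each
# context owns 32 contiguous entries, one per bit of a symbol.
NUM_CONTEXTS = 1024
states = [[128] * 32 for _ in range(NUM_CONTEXTS)]

def encode_symbol(ctx, bits):
    # Step 1 (parallel on a GPU): all 32 lanes gather their state at once.
    lane_states = [states[ctx][lane] for lane in range(32)]
    out = []
    for i, bit in enumerate(bits):
        # Step 2 (serial, single lane): feed each bit and its state to the
        # coder; the tuple below stands in for actual range-coder output.
        out.append((bit, lane_states[i]))
        # Step 3 (parallel): each lane adapts its own 8-bit state.
        delta = 8 if bit else -8
        states[ctx][i] = min(255, max(1, lane_states[i] + delta))
    return out

coded = encode_symbol(3, [1, 0, 1, 1] + [0] * 28)
```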

[diagram of RCT -> prediction -> contexts -> coding]

For RGB, a Reversible Color Transform is performed to decorrelate values further. Originally, a separate shader was used for this, which encoded to a separate image. However, the bandwidth required to do this for very high resolution images outweighed the advantages. Since only 2 lines are needed to decode or encode, we allocate width × horizontal_slices*2 images, and perform the RCT ahead of encoding each line with the help of the 32 helper invocations.
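As an illustration of what such a transform looks like, here is YCoCg-R, a well-known integer-reversible color transform; FFv1's actual RCT differs in its details, so treat this purely as a stand-in:

```python
def rct_forward(r, g, b):
    # YCoCg-R lifting steps: integer-only, hence exactly reversible.
    co = r - b
    t = b + (co >> 1)
    cg = g - t
    y = t + (cg >> 1)
    return y, co, cg

def rct_inverse(y, co, cg):
    # Undo the lifting steps in reverse order.
    t = y - (cg >> 1)
    g = cg + t
    b = t - (co >> 1)
    r = b + co
    return r, g, b

# Round-trips exactly for any integer RGB triple.
assert rct_inverse(*rct_forward(200, 100, 50)) == (200, 100, 50)
```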

APV

APV is a new codec designed by Samsung to serve as a royalty-free, open alternative for mezzanine video compression. Recently, it too became an IETF standard. It is currently gaining traction with the VFX and professional media production communities.

Unlike most codecs mentioned in this article, APV was designed for parallelism from the ground up. Similar to JPEG, each frame is subdivided into components, and each component is subdivided into tiles, with each tile featuring multiple blocks. Each block is simply transformed, quantized via a scalar quantizer (a simple division), and encoded via variable length codes.

To implement it as a compute shader, we first handle entropy decoding of each tile in one shader, then run a second shader which transforms a single block row per invocation.
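Scalar quantization is what makes that stage embarrassingly parallel: every coefficient is divided independently, with no dependency on its neighbours. A sketch (generic division toward zero; APV's exact scaling rules are not reproduced here):

```python
def quantize(coeffs, q):
    # Scalar quantizer: plain integer division toward zero, one value at a
    # time, with no dependency between coefficients.
    return [int(c / q) for c in coeffs]

def dequantize(levels, q):
    # Reconstruction multiplies back; the rounding loss is the lossy part.
    return [level * q for level in levels]

levels = quantize([100, -37, 12, 3, -1], 8)
# levels == [12, -4, 1, 0, 0]
```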

[rectangles diagram]

ProRes

ProRes is the de-facto mezzanine codec used for editing, camera footage, and mastering. It's a relatively simple codec, similar to JPEG and APV, which made it possible to implement a decoder and, due to popular demand, an encoder.

For decoding, we follow essentially the same process as with APV. For encoding, however, we do proper rate control and estimation by running a shader that finds which quantizer makes a block fit within its budget in bits.
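The quantizer search can be sketched as a binary search over a monotonic size function. The cost model below is hypothetical, not ProRes's actual bit accounting:

```python
def bits_for_block(coeffs, q):
    # Hypothetical cost model: roughly the bit-length of each quantized
    # level plus one sign/terminator bit. Monotonically non-increasing in q.
    return sum(abs(int(c / q)).bit_length() + 1 for c in coeffs)

def pick_quantizer(coeffs, budget, qmin=1, qmax=224):
    # Binary search for the smallest quantizer whose coded size fits the
    # bit budget; on a GPU, many blocks run this search concurrently.
    while qmin < qmax:
        mid = (qmin + qmax) // 2
        if bits_for_block(coeffs, mid) <= budget:
            qmax = mid
        else:
            qmin = mid + 1
    return qmin

q = pick_quantizer([200, -150, 90, 40, -12, 5, 2, 1], budget=24)
```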

Unfortunately, unlike the other codecs on this list, the ProRes codecs are neither royalty-free nor openly specified. The implementations in FFmpeg are unofficial. But due to their sheer popularity, such implementations are necessary for interoperability with much of the professional world. Nevertheless, the developers dogfood the decoders, and their output is monitored to match the official implementations.

ProRes RAW

ProRes RAW features a bitstream which shares little to no similarity with ProRes, made for compressing raw (not debayered) sensor data. It uses a DCT performed on each component, and a coefficient coder which predicts DC values across components and efficiently encodes AC values from multiple components in the usual zigzag order. The entropy coding system is not exactly a traditional variable length code (it does not satisfy the Kraft–McMillan inequality), but closer to exponential coding.
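For flavor, plain unsigned exp-Golomb coding is sketched below. ProRes RAW's actual codes differ, but share the "exponentially longer codes for larger values" shape:

```python
def expgolomb_encode(n):
    # Unsigned exp-Golomb: (bit_length(n+1) - 1) zeros, then n+1 in binary.
    v = n + 1
    return '0' * (v.bit_length() - 1) + format(v, 'b')

def expgolomb_decode(bits, pos=0):
    # Count leading zeros, then read that many + 1 bits as the value.
    zeros = 0
    while bits[pos] == '0':
        zeros += 1
        pos += 1
    v = int(bits[pos:pos + zeros + 1], 2)
    return v - 1, pos + zeros + 1

assert expgolomb_encode(4) == '00101'
assert expgolomb_decode('00101')[0] == 4
```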

Slices feature multiple blocks, with each component decodable in parallel. Unlike FFv1, there is no limit on the number of tiles per image, potentially leaving hundreds of thousands of independent blocks to decode. This is great for parallelism, leading to efficient implementations.

The decoder was implemented as a 2-pass approach, with the first shader decoding each tile, and the second shader transforming all blocks within each tile with row/column parallelism (referred to as a shred configuration, as it is able to fully saturate a GPU's workgroup size limit).

DPX

DPX is not a codec per se, but rather a packed-pixel container with a header. It's an official SMPTE standard, and rather popular with film scanners. Rather than being optimally laid out with tightly packed pixels, it packs pixels in 32-bit chunks, padded if needed. Or it can... not pack pixels, depending on a header switch.
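The common 10-bit RGB case puts three components into each 32-bit word with 2 bits of padding. A sketch of one such layout (MSB-aligned; actual files vary by vendor and header flags):

```python
def unpack_10bit_rgb(word):
    # Three 10-bit components in a 32-bit word, 2 padding bits at the
    # bottom; one common (but not universal) DPX packing.
    r = (word >> 22) & 0x3FF
    g = (word >> 12) & 0x3FF
    b = (word >> 2) & 0x3FF
    return r, g, b

def pack_10bit_rgb(r, g, b):
    return (r << 22) | (g << 12) | (b << 2)

assert unpack_10bit_rgb(pack_10bit_rgb(1023, 512, 1)) == (1023, 512, 1)
```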

Being an uncompressed format with loose regulations made decades ago means it's rife with vendors getting creative when interpreting the specifications, in ways that completely break decoding. Thankfully, there's a text "producer" field in the header for such implementations to sign their artistry with, which we use to figure out how to correctly unpack without seeing alien rainbows.

All of this comes down to just writing heuristics in shaders. The overhead is never the calculations needed to locate a collection of pixels, but actually pulling data from memory and writing it elsewhere.

[picture of packing]

VC-2

VC-2 is another mezzanine codec. Authored by the BBC and based on its Dirac codec, it's royalty-free, with an official SMPTE specification. Its primary use case was real-time streaming, particularly fitting high resolution video over a gigabit connection with sub-frame latency. Unlike APV or ProRes, it's based on wavelet transforms. Each frame is subdivided into power-of-two sized slices.

[image of slices and transforms]

Wavelets are rather interesting as transforms. They subdivide a frame into a quarter-resolution image, plus 3 more quarter-resolution images as residuals. Unlike DCTs, they are highly localized, which means they can be performed individually on each slice, yet when assembled they behave as if the entire frame had been transformed. This eliminates the blocking artifacts that all DCT-based codecs suffer from.

This also means they're less efficient to encode with, as their frequency decomposition is compromised. Their distortion characteristics are also substantially less visually appealing than the blurring of DCTs. This was one of the main reasons they failed to gain traction in post-2000s codecs.
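To make the idea concrete, here is one lifting level of the integer Haar (S-) transform, the simplest lossless wavelet. VC-2 itself uses other filters, such as LeGall 5/3, so this is only illustrative:

```python
def haar_forward(signal):
    # Split into a half-resolution approximation plus a detail band,
    # using integer lifting so the transform is exactly invertible.
    approx, detail = [], []
    for a, b in zip(signal[0::2], signal[1::2]):
        d = a - b              # detail: difference of the pair
        s = (a + b) >> 1       # approximation: truncated average
        approx.append(s)
        detail.append(d)
    return approx, detail

def haar_inverse(approx, detail):
    out = []
    for s, d in zip(approx, detail):
        a = s + ((d + 1) >> 1)  # recover the first sample
        b = a - d               # recover the second sample
        out += [a, b]
    return out

sig = [12, 14, 200, 180, 3, 3, 90, 64]
assert haar_inverse(*haar_forward(sig)) == sig
```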

The resulting coefficients are encoded via simple interleaved exp-Golomb codes, which, while not parallelizable, can be beautifully simplified in a decoder to remove all bit-parsing and instead operate on whole bytes.

JPEG

The codec given at the start as an example of serialized encoding turns out to have a very interesting attack that not only opens the door to its parallelization, but also towards parallelizing arbitrary data compression standards such as DEFLATE.

The idea is that although VLC streams lack any built-in way to parallelize, VLC decoders, and in fact decoders of any codes satisfying the Kraft–McMillan inequality, can spuriously resynchronize. After a surprisingly short delay, VLC decoders tend to output valid data.

All that's needed is to run 4 shaders which gradually synchronize the starting points within each JPEG stream. JPEG has multiple variants too, such as the progressive and lossless profiles, which can also be parallelized to the same extent.
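The self-synchronization property is easy to demonstrate with a toy prefix code: a decoder started one bit late quickly falls back onto the same symbol boundaries as the correctly aligned one (toy tables, not JPEG's):

```python
# A complete prefix code (Kraft sum = 1/2 + 1/4 + 1/4 = 1).
CODE = {'a': '0', 'b': '10', 'c': '11'}
DECODE = {v: k for k, v in CODE.items()}

def boundaries(bits, start):
    # Greedily decode from 'start'; return bit positions of symbol ends.
    pos, ends, buf = start, set(), ''
    while pos < len(bits):
        buf += bits[pos]
        pos += 1
        if buf in DECODE:
            ends.add(pos)
            buf = ''
    return ends

bits = ''.join(CODE[s] for s in 'bcabcaacba')
# Boundaries shared by the aligned decoder and one started 1 bit late:
common = boundaries(bits, 0) & boundaries(bits, 1)
assert common  # the misaligned decoder resynchronized
```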

[diagram]

DC prediction can be done via a parallel prefix sum, amongst the most common operations performed in compute shaders. DCTs can be done via a shred configuration, as with the other codecs.
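The prefix-sum step looks like this in miniature, using the Hillis-Steele scan pattern (the textbook parallel scan, sketched serially here):

```python
def prefix_sum(diffs):
    # Hillis-Steele inclusive scan: log2(n) rounds, each of which a GPU
    # executes as one parallel step across all invocations.
    out = list(diffs)
    step = 1
    while step < len(out):
        out = [out[i] + (out[i - step] if i >= step else 0)
               for i in range(len(out))]
        step *= 2
    return out

# Per-block DC differences from the bitstream -> absolute DC values.
dc_values = prefix_sum([100, -3, 5, 0, -20, 4])
# dc_values == [100, 97, 102, 102, 82, 86]
```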

Future

With the release of FFmpeg 8.1, we've implemented FFv1 encoding and decoding, ProRes and ProRes RAW decoding, and DPX unpacking. They are automatically used if Vulkan decoding is enabled.

The VC-2 encoder and decoder implementations need additional work to be merged, as well as the JPEG decoder and APV decoder.

As far as new codecs go, apart from those with few practical applications or reasons to implement on a GPU, only JPEG2000 and PNG are left.

Unfortunately, JPEG2000 and JPEG2000HT are unlike anything else on the list, combining the worst features of other codecs with a semi-serialized coding system that requires extensive understanding, and a bitstream complex enough to make most modern bureaucracies blush. Decoding JPEG2000 in software is amongst the slowest of all popular codecs due to its ASIC-centric design and underengineered arithmetic coder. Yet it is the primary codec used in digital cinema, medicine, and forensics.

PNG would depend on how well DEFLATE could be parallelized.

Vulkan Compute

Vulkan is sometimes looked at as a 3D API with compute capabilities. But its compute capabilities have come a long way, and can now compete with and even outperform its dedicated-compute API friends and rivals. These days, it features pointers, extensive subgroup operations, shared memory aliasing, native bitwise operations, 64-bit addressing, and direct access to GPU matrix units, which lets programmers optimize at a lower level than abstraction APIs allow.

The compute portion of the API is nowhere near its full potential either. SPIR-V is a much more expressive intermediate representation, and the current Vulkan Compute API only targets a subset of its features. Support for the ever-expanding set of SPIR-V features continues to progress, with untyped pointers, 64-bit addressing, and, soon, bitwise operations on data types of lengths other than 32 bits.

OpenCL?

For portable computation on specialized hardware, OpenCL is a much older and more established API. Amongst the reasons FFmpeg chose Vulkan Compute, besides being able to reuse code written for the Vulkan Video APIs, was the difference in support between implementations. OpenCL is much more specialized than Vulkan, which has led vendors to treat it differently. There is sometimes internal conflict and a shortage of time, as most vendors tend to prefer their own set of specialized and optimized compute APIs. As a result, updates are rarer, and open source implementations are uncommon, given the amount of work and the interest in such a niche needed to maintain them. Vulkan, on the other hand, is much more ubiquitous - from tiny SoCs, to tablets, embedded GPUs, dedicated GPUs, and professional server GPUs. Support for new extensions is incentivized thanks to the industry-led approach. Constant automated testing is performed via a very extensive test suite. Utilities to debug and optimize code exist. And a very extensive network of developers ensures that whatever weird trick you find, someone else has found it first and made sure it works, or has directly contributed it to the specifications.
