Fast convolutions (II)

I will analyze the algorithmic complexity of the convolution algorithms described in my previous posts.

To make things simpler, let’s assume that the dimensions of the image are >= the dimensions of the convolution kernel, and that both are square, with dimensions SxS and sxs, respectively.

1- Naive algorithm — O(n^4)

“w·h operations for each of the W·H image pixels”.

i.e., S·S·s·s operations. This is quadratic with a heavy constant for tiny kernels, but quickly becomes quartic for medium-to-large kernels. For example, a 1024x1024 image convolved with a 65x65 kernel takes 1024·1024·65·65 ≈ 4.4·10^9 operations.

The auxiliary buffer can be identical in size and bit-depth to the original image. So the memory usage factor is 2x.

2- Separable convolution — O(n^3)

“One 1D convolution for each row + One 1D convolution for each column”.

i.e., 2·S·S·s operations. This is quadratic with a bad constant for small kernels, but becomes cubic for large kernels. For the same 1024x1024 image and 65x65 kernel as above, this is 2·1024·1024·65 ≈ 1.4·10^8 operations, i.e., s/2 ≈ 32x fewer than the naive method.

Remarkably, the total running time depends on the dimensions of the image -and- the dimensions of the kernel.

Again, the auxiliary buffer can be identical in size and bit-depth to the original image. So the memory usage factor is 2x.

3- Convolution theorem — O(n^2·log(n))

“Two FFTs + One point-wise product + One IFFT”.

Let’s call S’ the smallest power-of-2 such that S’>=S. Then a proper implementation of the FFT/IFFT does (approx.) 2·S’·S’·log(S’) operations per 2D transform, while the point-wise product does S’·S’ operations. This makes the algorithm O(S’·S’·log(S’)), with some heavy (quadratic) overhead due to the memory copying, padding, and the point-wise product.

Remarkably, the total running time is independent of the size of the kernel.

This algorithm is quite memory-hungry, though, because two S’xS’ complex-number buffers are required. This means two floating-point numbers per entry (i.e., the real/imaginary coefficients). The algorithm starts by copying the image/kernel into the real part of the corresponding complex-number buffer, leaving the imaginary coefficients and the padded excess filled with zeroes. Then the FFTs/product/IFFT happen in-place.

So the auxiliary memory required is 4·S’·S’ floating-point numbers.

In the worst case, where S is a power-of-2 plus 1, S’ gets nearly twice as large as S. If the original image is 8-bit and we are using single-precision floating-point math for the FFT/IFFT, this means a memory usage factor of 64x. In the case of an HDR (single-precision floating-point) grayscale image, we are speaking of a worst-case scenario of 16x. On average, however, the memory usage factor is around 8x, and if S is a power-of-2, it goes down to 4x (speaking of the HDR case).
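
To put numbers on those factors, here is a minimal sketch (mine, with illustrative sizes; the buffer arithmetic is exactly the one described above):

```cpp
// Quick sanity check of the memory factors in the worst case
// (S = power-of-2 plus 1, single-precision FFT math).
#include <cstdio>

int main()
{
    const double S   = 1025.0;              // worst case: power-of-2 plus 1
    const double Sp  = 2048.0;              // smallest power-of-2 >= S
    const double aux = 4.0 * Sp * Sp * 4.0; // 4*S'*S' single-precision floats, in bytes

    const double img8bit = S * S * 1.0;     // 8-bit grayscale image, in bytes
    const double imgHDR  = S * S * 4.0;     // single-precision grayscale image, in bytes

    // Prints approximately: 8-bit: 64x -- HDR: 16x
    std::printf("8-bit: %.0fx -- HDR: %.0fx\n", aux / img8bit, aux / imgHDR);
    return 0;
}
```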

[Figure: Heavy glare using the FFT-based method in an HDR image by Paul Debevec]

This image, with heavy glare applied, was produced with some Arion-related experimental tonemapping code I am working on these days.

Conclusions:

Assuming that we are only interested in sheer performance:

1- The FFT-based method is (by far) the fastest for large images/kernels. Interestingly, the algorithm is not affected by the size of the kernel, which can be as large as the (padded) image itself without a penalty.
2- The FFT-based method becomes even faster if the same kernel is applied multiple times. The kernel FFT can be calculated just once, and then be re-used.
3- Due to the heavy setup overhead in the FFT-based method, the separable method can be faster for small (separable) kernels where s is in the range of log(S’). The sketch below puts some numbers on this.
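
Here is a minimal sketch that evaluates the three operation-count formulas from the sections above for a few kernel sizes. These are raw counts only; real-world timings will of course depend on the constants and the memory traffic involved:

```cpp
// Side-by-side operation counts for the three methods, using the
// formulas derived in the sections above.
#include <cmath>
#include <cstdio>

int main()
{
    const double S  = 1024.0;   // image size (SxS)
    const double Sp = 2048.0;   // padded power-of-2 size (S'xS')

    for (double s : { 3.0, 17.0, 65.0, 257.0 }) { // kernel sizes (sxs)
        const double naive     = S * S * s * s;
        const double separable = 2.0 * S * S * s;
        // Two FFTs + one IFFT, at ~2*S'*S'*log2(S') operations each,
        // plus the S'*S' point-wise product:
        const double fft = 3.0 * (2.0 * Sp * Sp * std::log2(Sp)) + Sp * Sp;
        std::printf("s = %3.0f | naive: %9.3g | separable: %9.3g | fft: %9.3g\n",
                    s, naive, separable, fft);
    }
    return 0;
}
```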

Last, but certainly not least, there is a much faster and more light-weight algorithm for the special case of Box/Gaussian Blur. I will talk about this in a separate blog entry.

Fast convolutions (I)

In my previous post it was stated that the convolution of a WxH image with a wxh kernel is a new WxH image where each pixel is the sum of wxh products, obtained as the center of the kernel slides across each of the pixels in the original image. This double-double loop (two loops over the image pixels, times two loops over the kernel entries) leads to an impractical O(n^4) algorithmic complexity.
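
As a reference, here is a minimal sketch of that double-double loop (assuming a grayscale image stored per rows, with clamping at the boundaries):

```cpp
// Naive 2D convolution: w*h products for each of the W*H pixels.
// Note that a true convolution reads the kernel flipped; for the usual
// symmetric kernels (Gaussian, box, ...) the flip is a no-op.
#include <algorithm>
#include <vector>

using Image = std::vector<std::vector<float>>;

Image convolveNaive(const Image& src, const Image& kernel)
{
    const int H = int(src.size()),    W = int(src[0].size());
    const int h = int(kernel.size()), w = int(kernel[0].size());

    Image dst(H, std::vector<float>(W, 0.0f)); // the auxiliary WxH buffer

    for (int y = 0; y < H; ++y)
    for (int x = 0; x < W; ++x) {
        float acc = 0.0f;
        for (int ky = 0; ky < h; ++ky)     // slide the kernel, centered
        for (int kx = 0; kx < w; ++kx) {   // at pixel (x, y)
            const int yy = std::clamp(y + ky - h / 2, 0, H - 1);
            const int xx = std::clamp(x + kx - w / 2, 0, W - 1);
            acc += src[yy][xx] * kernel[h - 1 - ky][w - 1 - kx];
        }
        dst[y][x] = acc;
    }
    return dst;
}
```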

Fortunately, we can do better, but the key here is not in optimizing the code, but in making use of some mathematical weaponry. Let’s analyze the options that we have:

1- Naive implementation.
2- Separable convolution kernels.
3- The convolution theorem.
4- Can we do EVEN better?

1- Naive implementation.

Pros:

– Trivial implementation in just a few lines of code.
– Works for any input size, and for any type of kernel.
– Trivial clamping at the boundaries.
– Allows for multi-threading.

Cons:

– Ridiculously inefficient: O(n^4).
– Needs an auxiliary WxH buffer.
– By default, the output of the operation is returned in the auxiliary buffer (not in-place).
– Not very cache friendly due to the constant jumps across rows.
– Applying the same kernel multiple times has the same cost, every time.

When should I use this method?

– Never, unless the input data is tiny and clean code is more important than sheer performance.

2- Separable convolution kernels.

A separable convolution kernel is one that can be broken into two 1D (vertical and horizontal) projections. For these projections, the (matrix) product of the hx1 vertical kernel by the 1xw horizontal kernel must restore the original wxh 2D kernel.
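
For illustration, here is a minimal sketch of that idea (the function and variable names are mine): a normalized 1D Gaussian projection, and the outer product that restores the full 2D kernel.

```cpp
// Separability illustrated with a Gaussian kernel.
#include <cmath>
#include <vector>

// Normalized 1D Gaussian taps for a given radius and sigma.
std::vector<double> gaussian1D(int radius, double sigma)
{
    std::vector<double> k(2 * radius + 1);
    double sum = 0.0;
    for (int i = -radius; i <= radius; ++i) {
        k[i + radius] = std::exp(-double(i * i) / (2.0 * sigma * sigma));
        sum += k[i + radius];
    }
    for (double& v : k) v /= sum; // normalize so the taps add up to 1
    return k;
}

// kernel2D[y][x] = vertical[y] * horizontal[x]: the product of the
// hx1 and 1xw projections restores the full wxh kernel.
std::vector<std::vector<double>> outerProduct(const std::vector<double>& vertical,
                                              const std::vector<double>& horizontal)
{
    std::vector<std::vector<double>> k(vertical.size(),
                                       std::vector<double>(horizontal.size()));
    for (std::size_t y = 0; y < vertical.size(); ++y)
        for (std::size_t x = 0; x < horizontal.size(); ++x)
            k[y][x] = vertical[y] * horizontal[x];
    return k;
}
```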

[Figure: 1D vertical and horizontal Gaussian convolution]

Conveniently enough, the most usual convolution kernels (e.g., Gaussian blur, box blur, …) happen to be separable.

The convolution of an image by a separable convolution kernel becomes the following:

1- Convolve the rows of the original image with the horizontal kernel projection.
2- Convolve the columns of the resulting image with the vertical kernel projection.

Note: These two steps are commutative.
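
A minimal sketch of these two steps (same assumptions as the naive sketch above: grayscale image stored per rows, clamping at the boundaries):

```cpp
// Separable 2D convolution: one 1D pass per row, then one per column.
#include <algorithm>
#include <vector>

using Image = std::vector<std::vector<float>>;

Image convolveSeparable(const Image&              src,
                        const std::vector<float>& horizontal,
                        const std::vector<float>& vertical)
{
    const int H  = int(src.size()), W = int(src[0].size());
    const int rh = int(horizontal.size()) / 2;
    const int rv = int(vertical.size())   / 2;

    Image tmp(H, std::vector<float>(W, 0.0f)); // the auxiliary WxH buffer
    Image dst(H, std::vector<float>(W, 0.0f));

    for (int y = 0; y < H; ++y)        // 1- rows x horizontal projection
    for (int x = 0; x < W; ++x) {
        float acc = 0.0f;
        for (int i = -rh; i <= rh; ++i)
            acc += src[y][std::clamp(x + i, 0, W - 1)] * horizontal[i + rh];
        tmp[y][x] = acc;
    }

    for (int y = 0; y < H; ++y)        // 2- columns x vertical projection
    for (int x = 0; x < W; ++x) {
        float acc = 0.0f;
        for (int i = -rv; i <= rv; ++i)
            acc += tmp[std::clamp(y + i, 0, H - 1)][x] * vertical[i + rv];
        dst[y][x] = acc;
    }
    return dst;
}
```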

[Figure: 2-step separable vs. brute-force 2D Gaussian convolution]

Pros:

– More efficient than the naive implementation: O(n^3).
– Trivial implementation in just a few lines of code.
– Trivial clamping at the boundaries.
– Works for any input size.
– Since this is a two-step process, the convolution can be returned in-place.
– Cache-friendly.
– Allows for multi-threading.

Cons:

– Only works with separable kernels.
– Needs an auxiliary WxH buffer.
– Applying the same kernel multiple times has the same cost, every time.

When should I use this method?

– I do not use this method anywhere in RCSDK. You will understand why soon.

3- The convolution theorem.

“The convolution in the spatial domain is equivalent to a point-wise product in the frequency domain, and vice-versa.”

This method relies on the (Fast) Fourier Transform, which is one of the most beautiful mathematical constructs, ever. Seriously!

The convolution of an image by a generic kernel becomes the following:

1- Compute the Fourier Transform of the image.
2- Compute the Fourier Transform of the kernel.
3- Multiply both Fourier Transforms, point-wise.
4- Compute the Inverse Fourier Transform of the result.
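
Here is a minimal, self-contained sketch of these four steps, built around a textbook recursive radix-2 FFT (a toy transform for illustration, not the FFT module used in Arion). Both buffers are assumed to be already zero-padded to the same power-of-2 size:

```cpp
// FFT-based convolution: FFT(image), FFT(kernel), point-wise product, IFFT.
#include <complex>
#include <vector>

using cpx  = std::complex<double>;
using Grid = std::vector<std::vector<cpx>>;

// Textbook recursive Cooley-Tukey FFT on a power-of-2-sized vector.
// invert = true computes the (unscaled) inverse transform.
void fft1D(std::vector<cpx>& a, bool invert)
{
    const std::size_t n = a.size();
    if (n == 1) return;

    std::vector<cpx> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    fft1D(even, invert);
    fft1D(odd,  invert);

    const double ang = (invert ? 2.0 : -2.0) * 3.141592653589793 / double(n);
    for (std::size_t k = 0; k < n / 2; ++k) {
        const cpx w  = std::polar(1.0, ang * double(k)) * odd[k];
        a[k]         = even[k] + w;
        a[k + n / 2] = even[k] - w;
    }
}

// 2D FFT = 1D FFT on every row, then on every column.
void fft2D(Grid& m, bool invert)
{
    const std::size_t n = m.size();
    for (auto& row : m) fft1D(row, invert);
    std::vector<cpx> col(n);
    for (std::size_t x = 0; x < n; ++x) {
        for (std::size_t y = 0; y < n; ++y) col[y] = m[y][x];
        fft1D(col, invert);
        for (std::size_t y = 0; y < n; ++y) m[y][x] = col[y];
    }
}

// Steps 1-4; the result is left in the real part of 'image'.
void convolveFFT(Grid& image, Grid& kernel)
{
    const std::size_t n = image.size();
    fft2D(image,  false);                      // 1- FFT of the image
    fft2D(kernel, false);                      // 2- FFT of the kernel
    for (std::size_t y = 0; y < n; ++y)        // 3- point-wise product
        for (std::size_t x = 0; x < n; ++x)
            image[y][x] *= kernel[y][x];
    fft2D(image, true);                        // 4- inverse FFT
    const double scale = 1.0 / double(n * n);  // normalize the IFFT
    for (auto& row : image)
        for (auto& v : row) v *= scale;
}
```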

[Figure: Brute-force 2D Gaussian vs. the convolution theorem]

[Figure: Magic]

Pros:

– Even more efficient: O(n^2·log(n)).
– Works with any convolution kernel, separable or not.
– Should the kernel be applied multiple times, the FFT of the kernel can be computed just once, and then be re-used.
– The FFT/IFFT and the convolution product are cache-friendly.
– The FFT/IFFT and the convolution product allow for multi-threading.

Cons:

– Definitely not easy to implement, unless you own an FFT module that suits your needs.
– The (radix-2) FFT operates on buffers with power-of-2 dimensions. This means that the input image (and the kernel) must be padded with zeroes to a larger size (i.e., extra setup time + memory).
– Both the image and the kernel are transformed, which requires two auxiliary buffers instead of one.
– Each FFT produces complex-number values, which doubles the memory usage of each auxiliary buffer.
– The dynamic range of the FFT values is generally wilder than that of the original image. This requires a careful implementation, and the use of floating-point math regardless of the bit-depth and range of the input image.
– This method doesn’t do any clamping at the boundaries of the image, producing a very annoying wrap-around effect that may need special handling (e.g., more padding).

When should I use this method?

– Every time that a large generic convolution kernel (i.e., not a simple blur) is involved.
– The best examples I can think of in RCSDK/Arion are glare and bloom.

4- Can we do EVEN better?

Oh yes. We can do much better, at least in the very particular case of blur. I will talk about this in detail in a dedicated blog entry, at some point.

Some implementation remarks:

All the algorithms presented above allow for multi-threading. Naturally, MT does not change the algorithm complexity, since the maximum number of threads is fixed, but you can get very decent speed-ups in practical cases if you combine sleek code with proper multi-threading. In the separable case (2), MT must be used twice: first for the rows, and then for the cols. In the FFT case (3), the FFT itself can be multi-threaded (in a rows/cols fashion as well). The huge point-wise product in frequency space can be MT’ed too.
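
For example, the rows pass of the separable method can be MT’ed with something as simple as this (a sketch assuming OpenMP, i.e., compiling with -fopenmp; any thread pool would do the job just as well):

```cpp
// Rows pass, multi-threaded: each thread owns a disjoint set of rows,
// so no locking is needed. The columns pass is parallelized the same way.
#include <algorithm>
#include <vector>

using Image = std::vector<std::vector<float>>;

void convolveRowsMT(const Image& src, Image& dst, const std::vector<float>& k)
{
    const int H = int(src.size()), W = int(src[0].size());
    const int r = int(k.size()) / 2;

    #pragma omp parallel for schedule(static)
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            float acc = 0.0f;
            for (int i = -r; i <= r; ++i)
                acc += src[y][std::clamp(x + i, 0, W - 1)] * k[i + r];
            dst[y][x] = acc;
        }
}
```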

Since convolutions are usually applied on very large images, writing cache-friendly code can make a big difference. Assuming that the memory layout of your image/kernel is per rows, make sure to arrange your loops so memory accesses are as consecutive as possible. This is immediate for the loops that do a 1D convolution on each row. However, for the loops that do a 1D convolution on each column, it may help to copy each column into a contiguous local buffer, convolve it there, and then write the result back.
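
A minimal sketch of that column trick (same assumptions as the earlier sketches):

```cpp
// Columns pass with a local cache: gather each column into a contiguous
// buffer once, convolve it with sequential memory accesses, then scatter
// the result back.
#include <algorithm>
#include <vector>

using Image = std::vector<std::vector<float>>;

void convolveColsCached(Image& img, const std::vector<float>& k)
{
    const int H = int(img.size()), W = int(img[0].size());
    const int r = int(k.size()) / 2;

    std::vector<float> col(H), out(H);
    for (int x = 0; x < W; ++x) {
        for (int y = 0; y < H; ++y) col[y] = img[y][x]; // column -> local row
        for (int y = 0; y < H; ++y) {
            float acc = 0.0f;
            for (int i = -r; i <= r; ++i)
                acc += col[std::clamp(y + i, 0, H - 1)] * k[i + r];
            out[y] = acc;
        }
        for (int y = 0; y < H; ++y) img[y][x] = out[y];  // local row -> column
    }
}
```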