Downsampling and Gaussian blur

I talked about several strategies to optimize convolutions in some of my previous posts. I still have to talk about how to approximate a Gaussian blur with a multi-step Box blur in a future post. However, there is yet another good technique to optimize a Gaussian blur that may come in handy in some cases.

This post is inspired by a need that I had some days ago: Say that you need to do a 3D Gaussian blur on a potentially humongous 3D data buffer. Working with downsampled data sounds ideal in terms of storage and performance. So that’s what I am going to talk about here:

What happens when downsampling is mixed with a Gaussian blur?

The idea:

Here’s the 0-centered and un-normalized 1D Gaussian function:

G( x , \sigma ) = e^\frac{ - x^2 }{ 2 \sigma ^2 }

The sigma parameter in the Gaussian function stretches the bell shape along the x axis. So it is quite straightforward to understand that if one downsamples the input dataset by a scale factor k, then applies a (smaller) Gaussian where sigma’=s/k, and finally upscales the result by the same scale factor k, the result will approximate a true Gaussian on the original dataset where sigma=s.
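This is just the scaling property of the Gaussian: evaluating the bell at a downsampled coordinate x/k with the reduced sigma gives back exactly the original value,

G( \frac{ x }{ k } , \frac{ \sigma }{ k } ) = e^\frac{ - (x/k)^2 }{ 2 (\sigma/k) ^2 } = e^\frac{ - x^2 }{ 2 \sigma ^2 } = G( x , \sigma )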

In cleaner terms: if one has an input dataset (e.g., an image) I and wants to have it blurred by a Gaussian where sigma=s:

1- I’ <= I downsampled by a certain scale factor k.
2- I” <= I’ blurred by a small Gaussian where sigma’=s/k.
3- I”’ <= I” upscaled by the same scale factor k.
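Here is a minimal sketch of the three steps above in Python (the function name is mine, and SciPy’s linear zoom stands in for the pixel-average downsampling and bilinear upscaling discussed below):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def approx_gaussian_blur(img, sigma, k):
    """Approximate a Gaussian blur of strength sigma on a single-channel
    2D or 3D buffer by blurring a k-times downsampled copy with sigma/k."""
    small = zoom(img, 1.0 / k, order=1)            # 1- downsample by k
    small = gaussian_filter(small, sigma / k)      # 2- blur with sigma' = sigma/k
    factors = np.array(img.shape) / np.array(small.shape)
    return zoom(small, factors, order=1)           # 3- upscale back to the original size
```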

How good is this approximation?

The Sampling Theorem states that sampling a signal at (at least) twice its highest frequency is enough to reconstruct it exactly. This means that downsampling discards the frequencies above the new Nyquist limit (half the new sampling rate). In other words: downsampling means less data to process, but at the expense of introducing an error.

Fortunately, a Gaussian blur is a form of low-pass filter. This means that blurring is quite tolerant to alterations in the high end of the frequency spectrum, which is precisely where downsampling introduces its error.

Visual evaluation:

In the examples below I am downsampling with a simple pixel average, and I am upscaling with a simple bilinear filter. The 2×2 grids below compare:

1- Top-left – The original image I.
2- Top-right – I downsampled and upscaled by k (note the blocky bilinear filter look).
3- Bottom-left – The resulting image I”’.
4- Bottom-right – I blurred by a true Gaussian where sigma=s.

In these examples, I chose k=sigma for simplicity. This means that the small Gaussian uses sigma’=1.

Gaussian blur where sigma=4

Gaussian blur where sigma=16

Gaussian blur where sigma=64

Conclusion:

As shown, the approximation (bottom-left vs. bottom-right) is pretty good.

The gain in speed depends on multiple implementation factors. However, as I explained above, this post was inspired by the need to cope with a cubic memory-storage problem when doing Gaussian blurs on a 3D buffer. Working with a heavily downsampled buffer clearly helps in that sense. Needless to say, decreasing the amount of data to process by a factor of k^3 also brings a dramatic speed boost, making it possible to use tiny separable convolutions along the 3 (downsampled) axes.

Note that one might pick any downsampling scale factor k>=1. The higher the value of k, the higher the approximation error and the smaller and faster the convolution.

The choice k=sigma offers a good trade-off between approximation error and speed gain as shown above.

CIE, XYZ, Yxy, RGB, and gamuts

Although digital images are usually given in RGB, RGB is not always a convenient space for image processing. Splitting RGB colors into their luminance and chrominance components, and recombining them afterwards, are very common operations when it comes to image filtering.

The Yxy encoding is a very good solution due to its strong physical/perceptual background. One can go from RGB to XYZ (selecting a certain color-space transform matrix), and then go from XYZ to Yxy using the following formulas:

x = \frac{ X }{ X + Y + Z }

y = \frac{ Y }{ X + Y + Z }

The inverse transform to go from Yxy to XYZ is given by the following formulas:

X = Y \cdot \frac{ x }{ y }

Z = \frac{ X }{ x } - X - Y
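For reference, a minimal Python sketch of both transforms (with no guards against the degenerate X+Y+Z=0 or y=0 cases) could look like this:

```python
def XYZ_to_Yxy(X, Y, Z):
    """Keep the luminance Y and normalize the chrominance to (x, y)."""
    s = X + Y + Z
    return Y, X / s, Y / s

def Yxy_to_XYZ(Y, x, y):
    """Inverse transform, following the formulas above."""
    X = Y * x / y
    Z = X / x - X - Y
    return X, Y, Z
```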

The color-space transform matrix that turns RGB values into XYZ values (and its inverse) is a simple 3×3 linear transform matrix. The exact values of the matrix depend on the color space you are working in. For example, the matrix for sRGB can be found in the Wikipedia sRGB page. Actually, one can build a custom RGB-to-XYZ matrix by defining the xy coordinates of the three primary colors (R, G, and B) and the position of the whitepoint. There is an excellent explanation of this on this page by Ryan Juckett.
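As an illustration, the usual linear-sRGB (D65) matrix from the Wikipedia sRGB page, rounded to four decimals, can be wired up like this (note that it operates on linear RGB, i.e., after removing the display gamma):

```python
import numpy as np

# Linear sRGB (D65) -> XYZ, rounded to 4 decimals.
RGB_TO_XYZ = np.array([[0.4124, 0.3576, 0.1805],
                       [0.2126, 0.7152, 0.0722],
                       [0.0193, 0.1192, 0.9505]])

def rgb_to_xyz(rgb):
    """rgb is a linear-sRGB triplet, or an (..., 3) array of them."""
    return np.asarray(rgb) @ RGB_TO_XYZ.T

def xyz_to_rgb(xyz):
    return np.asarray(xyz) @ np.linalg.inv(RGB_TO_XYZ).T
```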

XYZ colors and the Yxy encoding have some very interesting practical properties. Below are some of them.

1- All human-visible colors have positive X, Y, and Z values.

This means that matrices to go from RGB to XYZ can only have positive coefficients, and that a valid (positive) RGB color can only map to a positive XYZ triplet. The opposite is generally not true: a positive XYZ triplet may map back to an RGB triplet with negative components.

2- The Y value of an XYZ color represents the relative luminance of the color as perceived by the human eye.

So if one computes the Y (luminance) field of an RGB image, one gets a grayscale version of the original image.

3- Y (luminance) and xy (chrominance) are fully independent.

This means that if one encodes an RGB color in Yxy, then amplifies or diminishes the luminance Y’=k·Y, and then goes back to RGB from Y’xy, the result is k·RGB.

This is a fundamental property in tonemapping algorithms, which typically do range compression on the luminance, and then simply recombine the original chrominance field with the compressed luminance field.

This property also means that one can do any alteration on the chrominance values (xy) and then go back to RGB while preserving the original luminance (Y). This comes in handy when implementing image filters such as Tint (xy rotation around the whitepoint) or Saturation (xy amplification away from the whitepoint).
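As a toy example along those lines, a naive Saturation filter could push the chrominance away from the whitepoint while leaving the luminance untouched (the D65 whitepoint coordinates below are my assumption; a real pipeline may use a different one):

```python
def saturate_Yxy(Y, x, y, amount, wx=0.3127, wy=0.3290):
    """Amplify the chrominance away from the (assumed D65) whitepoint.
    amount > 1 saturates, amount < 1 desaturates; Y is preserved."""
    return Y, wx + (x - wx) * amount, wy + (y - wy) * amount
```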

4- The range of Y is [0..INF), but the range of xy is constrained to [0..1]x[0..1].

Since the chrominance values (xy) are normalized, they are independent of the luminance value and hence have a limited range. Here’s what happens when one selects a fixed luminance (Y=1, for example) and draws the RGB colors corresponding to all the chrominances in x=[0..1] and y=[0..1]:

sRGB triangle (gamut)

The darkened colors are invalid (non-realizable) results where at least one of the RGB components is negative. So the only values that xy can actually take are constrained to this triangle, which is, in fact, much smaller than the unit square. As a matter of fact, the maximum distance between two chrominance values in Yxy (in sRGB, as displayed above) is around 0.6, given by the length of the longest edge of the gamut triangle.
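That figure can be sanity-checked with a quick back-of-the-envelope computation, assuming the usual published sRGB primary chromaticities:

```python
import numpy as np

# Published sRGB primary chromaticities (xy).
primaries = {'R': (0.64, 0.33), 'G': (0.30, 0.60), 'B': (0.15, 0.06)}

for a, b in [('R', 'G'), ('G', 'B'), ('B', 'R')]:
    (ax, ay), (bx, by) = primaries[a], primaries[b]
    print(a, b, round(float(np.hypot(ax - bx, ay - by)), 3))
# The longest edges (G-B and B-R) both come out near 0.56.
```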

Different color-space transforms have larger or smaller gamut triangles, as depicted in this CIE 1931 xy gamut comparison.

Note that the gamut triangle is the same regardless of the color luminance.
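For completeness, the gamut figure above can be reproduced with a few lines of NumPy. A rough sketch (using the published linear-sRGB XYZ-to-RGB matrix and skipping the display gamma):

```python
import numpy as np

# Linear sRGB (D65) XYZ -> RGB matrix, rounded to 4 decimals.
XYZ_TO_RGB = np.array([[ 3.2406, -1.5372, -0.4986],
                       [-0.9689,  1.8758,  0.0415],
                       [ 0.0557, -0.2040,  1.0570]])

n = 512
yy, xx = np.mgrid[1e-6:1:n*1j, 1e-6:1:n*1j]  # chrominance grid, avoiding divisions by 0
Y = np.ones_like(xx)                         # fixed luminance Y = 1
X = Y * xx / yy
Z = X / xx - X - Y
rgb = np.stack([X, Y, Z], axis=-1) @ XYZ_TO_RGB.T
inside_gamut = np.all(rgb >= 0.0, axis=-1)   # True only inside the sRGB triangle
```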

Glare patterns

Glare in photography is due to Fraunhofer diffraction as light from distant objects passes through the camera diaphragm.

There is a magical connection between Fraunhofer diffraction (physics) and the Fourier Transform (math). As a matter of fact, the intensity of the Fraunhofer diffraction pattern of a certain aperture is given by the squared modulus of the Fourier Transform of said aperture.
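A minimal NumPy sketch of that relationship, rasterizing a hypothetical hexagonal diaphragm and taking the squared modulus of its FT:

```python
import numpy as np

def glare_pattern(aperture):
    """Fraunhofer diffraction intensity: squared modulus of the aperture's FT."""
    return np.abs(np.fft.fftshift(np.fft.fft2(aperture))) ** 2

# Rasterize a regular hexagon (apothem 0.5) on a 256x256 buffer.
n, sides = 256, 6
yy, xx = np.mgrid[-1:1:n*1j, -1:1:n*1j]
angle, radius = np.arctan2(yy, xx), np.hypot(xx, yy)
wedge = np.pi / sides
aperture = (radius * np.cos((angle % (2 * wedge)) - wedge) <= 0.5).astype(float)

intensity = glare_pattern(aperture)
```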

Assuming a clean and unobstructed camera, the aperture is simply the diaphragm shape. Below are the diffraction patterns that correspond to some basic straight-blade (polygonal) diaphragms.

Glare patterns

Interestingly, the Fourier Transform produces one infinite decaying streak perpendicular to each polygon edge. When the number of edges is even, the streaks overlap in pairs. That is why a hexagonal diaphragm produces 6 streaks, and a heptagonal diaphragm produces 14.

The leftmost pattern happens to be the Airy disk, which is the limit case where the number of polygon edges/streaks becomes infinite (i.e., a circular aperture).

The examples above were generated at 256×256. The visual definition of the pattern naturally depends on the resolution of the buffers involved in the computation of the Fourier Transform. However, note that the FT has an infinite range. This means that for ideal polygonal shapes, the streaks are infinitely long.

In the practical case, buffers are far from infinite, and you hit one property of the Fourier Transform that is often nothing but an annoyance: the FT is cyclic. The image below depicts what happens when one pumps up the intensity of one of the glare patterns obtained above: the (infinite) streaks wrap around the (finite) FT buffer.

Cyclic glare pattern

Bonus: Here’s some real-life glare I captured this evening at the European Athletics Championships.

Real-life glare

Fast convolutions (III)

These are some more remarks about the performance of the convolution methods described so far. I will be leaving the brute-force algorithm out for obvious reasons.

The plot below represents input size (in pixels) on the x axis, and convolution time (in seconds) on the y axis. So, if you go to x=4096, that means “a convolution of a 4096×4096 image by a 4096×4096 kernel”.

Even competition between FFT-based vs. separable convolution methods

Two conclusions can be drawn from the plot, which confirm what was explained in my previous post:

1- For large kernels (as large as the image itself) the separable convolution method is O(n^3), and the running times reach absurd levels very quickly. If you are dealing with large generic images/kernels, the FFT-based method is the way to go.

2- The FFT-based method uses the Fast Fourier Transform, which is O(n^2·log(n)) thanks to a decomposition technique that requires the size of the input data to be (padded to) a power of 2. For this reason, it takes the same amount of time to do a convolution on a 257×257 image/kernel as on a 512×512 image/kernel, because both cases operate on 512×512 buffers after all. This is why the graph for the FFT method is stepped: when x crosses a power of 2, the running time jumps all of a sudden and stays stable until the next power of 2 (see the snippet right after this list).
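As a trivial illustration of that padding rule (not the actual RCSDK code, just the idea behind the staircase):

```python
def fft_buffer_size(n):
    """Smallest power of 2 that is >= n, i.e. the padded FFT working size."""
    p = 1
    while p < n:
        p *= 2
    return p

# 257..512 all land on 512x512 buffers, hence the flat step between those sizes.
print(fft_buffer_size(257), fft_buffer_size(512), fft_buffer_size(513))  # 512 512 1024
```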

The plot was generated with my current implementation of both convolution methods in RCSDK. My FFT uses the Cooley-Tukey algorithm, and everything (i.e., FFT, IFFT, point-wise products, and 1D separable convolutions) makes use of multi-threading. There’s always room for improvement, but the running times seem pretty decent, as we’re speaking of <2s for images up to 4096×4096 on a 16-thread CPU. An implementation in CUDA would be (orders of magnitude) faster, though. :-P

A couple more remarks:

1- The graph of the FFT-based method wouldn’t change if smaller kernels were used, as the algorithm requires the kernel to be padded to the size of the image. However, the graph of the separable method becomes less steep when very small kernels are used:

Separable convolution with a fixed 11×11 kernel (in orange)

2- The running time of the FFT/IFFT is not exactly a trivial subject. Speed in FFT/IFFT algorithms depends not only on the size of the data, but also on the data itself. While generating the above plots, I came across some ‘anomalies’ in the time measurements. Some kernels or some images produced faster or slower results. Some combinations would even produce non-flat steps in the staircase-looking graph. That’s normal, but worth mentioning.