To CUDA 5.0 from CUDA 4.x

Shortly after a new version of CUDA is released, it is time for me to upgrade/adapt Arion.

One would expect that once the new toolkit/compiler/drivers are installed, the source code would simply compile and run successfully. I am sure that this is the case for small projects, but it has certainly never been the case for Arion, which is a fairly large and complicated CUDA kernel where I make use of almost every feature provided by the CUDA API.

I remember the agony of moving from CUDA 2.4 to CUDA 3.0, then from CUDA 3.0 to CUDA 4.0, and, a few days ago, from CUDA 4.2 to CUDA 5.0. For minor point releases the transition has been trivial, but for major version changes it has been hell:

– The NVCC compiler crashes, yielding an opaque Internal Compiler Error.
– The NVCC compiler gets stuck forever.
– The code compiles, but then it crashes on execution.
– The code compiles and runs, but with wrong behavior.

Since I have suffered these symptoms and have had to overcome them, I would like to leave a record here, in the hope that maybe my personal experience will somewhat help other CUDA developers out there.

1- The compiler crashes or freezes.

Function inlining is common practice in CUDA: you either use __forceinline__ to hint the compiler, or let the compiler decide by itself whether to inline each function call. This is good for the sake of performance, but inlining too much has two obvious effects:

– The total size of the code becomes much larger (code bloat).
– Optimizations, which are ‘local to the parent function’, become much harder/slower.

For some reason (and this can only be due to bugs in the nVidia CUDA compiler), code that compiles and runs in one version of the NVCC compiler may not compile at all in the next version… due to (apparently) ‘excessive’ code size/complexity.

Nothing is more discouraging than seeing the compiler crash giving you no hint as to what the problem is. Once you get an Internal Compiler Error you’re in total despair, not knowing when or if your code will ever run again without dismantling it all and then re-adding its pieces one at a time (which may take days of ugly unconstructive work…).

... Internal Compiler Error (0xC0000005 at address ...)

This type of problem has happened to me with each and every major version change of CUDA. And, fortunately, the solution has always been more or less the same: play with __noinline__ and __forceinline__.

Actually, since a few versions ago, all my CUDA functions use some prefixes which I can re-define anytime to __noinline__ in case of emergency. By default, these prefixes are defined to __forceinline__ for those functions where inlining brings a significant performance boost, and to __noinline__ for those functions where it makes sense to split compilation. Re-defining these prefixes to __noinline__ makes Arion compile much faster and run much slower, but, importantly, this trick has always (so far) made NVCC swallow my code, and has given me a start point to re-think my inlining policy to maximize performance without making the compiler explode.
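As an illustration, here is a minimal sketch of that kind of redefinable prefix (the macro and function names are mine for this example, not Arion’s):

// Flipping one switch re-defines every prefix to __noinline__ in an emergency.
//#define EMERGENCY_NOINLINE

#ifdef EMERGENCY_NOINLINE
    #define FUNC_HOT  __device__ __noinline__     // normally inlined for speed
    #define FUNC_COLD __device__ __noinline__     // deliberate split-compilation point
#else
    #define FUNC_HOT  __device__ __forceinline__
    #define FUNC_COLD __device__ __noinline__
#endif

FUNC_HOT float dot3(const float3& a, const float3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

FUNC_COLD float lambert(const float3& n, const float3& l)
{
    return fmaxf(0.0f, dot3(n, l));
}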

Note that using __forceinline__ (or letting the compiler decide by itself whether inlining should be used) will always lead to much longer build times and fatter executables, although in general performance will be better (or even much better). It may take quite a bit of trial and error to figure out which functions really benefit from inlining without risking a compiler crash, or without increasing your build times beyond reason.

2- Memory alignments.

This does not qualify as a compiler bug at all, but as a developer, this is the type of thing that may leave you stranded and stuck all of a sudden.

In CUDA 5.0 it has become necessary to keep structure fields aligned, which was not required in previous versions. At least the compiler emits a warning telling you where the problem is, so all in all this is not a big deal. Actually, I think that the compiler silently fixes the problem for you, but then, if you make assumptions about the offsets of data in memory, your code will not work correctly.

In my case, I have some structures with many fields, some of which store 3D or 4D vectors. In CUDA 5.0 it is mandatory that those vector fields start at an offset that is a multiple of 16 bytes from the beginning of the structure. You can fix that by re-arranging your fields or by adding some 4-byte ints as padding.
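For example, this is a sketch of the kind of manual padding involved (a made-up struct, not one of Arion’s):

// The float4 field must start at an offset that is a multiple of 16 bytes,
// so we either re-arrange the fields or pad explicitly with 4-byte ints.
struct Material
{
    float  roughness;  // offset 0
    int    flags;      // offset 4
    int    pad0;       // offset 8  -- explicit padding so that
    int    pad1;       // offset 12 -- 'albedo' lands on a 16-byte boundary
    float4 albedo;     // offset 16 (16-byte aligned)
};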

In general, it is a good idea to keep data padded and aligned anyway (although it is easy to fail to do so, as well).

3- cudaMemcpyToSymbol

nVidia has discontinued some functionality that was tagged as deprecated in CUDA 4.x. Of course, it is strictly my fault for having been unaware of such deprecation. But then again, this is the kind of thing that you may easily overlook, and which may drive you crazy for hours until you figure out where your code is not doing what it should.

In my case, cudaMemcpyToSymbol, which used to accept the names of constants as C-strings, now needs you to pass the symbol (the constant itself) directly. Again, this is not a big deal… once you know what to do. It took me some hours of code tracing until I hit the problem and found out in the docs that what I was doing was, in fact, no longer supported.
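For reference, a minimal sketch of the change (the constant name is made up):

#include <cstdio>
#include <cuda_runtime.h>

__constant__ float g_exposure;

void setExposure(float value)
{
    // CUDA 4.x style (C-string symbol name) -- no longer supported in CUDA 5.0:
    //   cudaMemcpyToSymbol("g_exposure", &value, sizeof(value));

    // CUDA 5.0 style: pass the symbol itself.
    cudaError_t err = cudaMemcpyToSymbol(g_exposure, &value, sizeof(value));
    if (err != cudaSuccess)
    {
        printf("cudaMemcpyToSymbol failed: %s\n", cudaGetErrorString(err));
    }
}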

One piece of advice, which I have adopted myself: make sure that you check the cudaError_t value returned by the CUDA calls in your C/C++ code. In my particular case, cudaMemcpyToSymbol was returning cudaErrorInvalidSymbol = 13.
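A minimal sketch of the kind of check I mean (the macro is my own helper, not part of the CUDA API):

// Wrap CUDA runtime calls so a failing cudaError_t is reported immediately
// with file/line information instead of going unnoticed.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        const cudaError_t err_ = (call);                            \
        if (err_ != cudaSuccess)                                    \
            printf("CUDA error %d (%s) at %s:%d\n", (int)err_,      \
                   cudaGetErrorString(err_), __FILE__, __LINE__);   \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMemcpyToSymbol(g_exposure, &value, sizeof(value)));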

Somebody else commented on this on stackoverflow.com.

Some final notes (about older CUDA versions).

In addition to these problems, I remember that previous versions of the NVCC compiler were clearly buggy. I remember how code would compile or not before/after refactoring (just by moving lines of code up and down). Sometimes, using a loop break instead of a loop termination condition would make the code crash, and sometimes assigning compound values (such as 3D vectors) would crash, while assigning them per component would work. These compiler bugs were happening to me all the time in the CUDA 2.x/3.x days, although I must say that the CUDA compiler seems to be significantly more consistent in this regard since 4.x/5.0.