Is Running Code on a 9800 GT Different than on a GTX 580? GPU Upgrade Problem
Hi all,

I recently upgraded my system from a GeForce 9800 GT to a GeForce GTX 580. Everything seemed fine: I installed version 4.1 of the CUDA Toolkit and the SDK samples run just fine.

However, when I run my own code its behavior is completely anomalous. Some of the operations in my kernels produce inconsistent results, and in the end I get a completely unexpected result.

Does code have to be written differently for these (GTX) GPUs? Do I have to rewrite mine from scratch?

If anyone has had a similar problem or knows how to solve this one, please help me.

Greetings!

Carlos Alejandro Trujillo Anaya

Physics Engineer

M.eng. Student

Faculty of Mines, Medellín, Antioquia, Colombia.

National University of Colombia (Universidad Nacional de Colombia).

calelo36@gmail.com

catrujila@unal.edu.co



Windows 7 64-bit. CUDA Toolkit 4.1, 32-bit. Visual Studio 2008 with SP1.

Intel® Core™ i7 running at 3.6 GHz with 8 GB of RAM.

Graphics card: GeForce GTX 580, 512 CUDA cores, 1536 MB, 772 MHz.

#1
Posted 02/28/2012 11:57 AM   
Run your code under [font="Courier New"]cuda-memcheck[/font] to make sure you have no stray memory accesses. Fermi GPUs detect more of the out-of-bounds accesses that would previously go unnoticed, and the offending kernel's execution then fails. Do you check return codes everywhere?

Check that all of your shared memory variables are either declared [font="Courier New"]volatile[/font] or that their accesses are properly guarded by [font="Courier New"]__syncthreads()[/font]. Because Fermi is a proper load-store architecture even with regard to shared memory, it relies much more on this; improperly synchronized code would often still work on compute capability 1.x devices. Even NVIDIA's original SDK examples were sloppy in this regard.
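
A minimal sketch of what this looks like in practice (a generic block-sum reduction, assuming a power-of-two block size of at least 64; not code from the original application). The tree phase is guarded by [font="Courier New"]__syncthreads()[/font], and the warp-synchronous tail is only correct on Fermi because it goes through a volatile pointer:

[code]
// Block-wide sum reduction, launched as
//   reduceSum<<<gridDim, blockDim, blockDim * sizeof(float)>>>(in, out, n);
// with blockDim a power of two and at least 64.
__global__ void reduceSum(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < (unsigned int)n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction down to the last warp, each step guarded by a barrier.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Warp-synchronous tail: on Fermi this is only correct because the
    // accesses go through a volatile pointer, which forces every partial
    // sum back to shared memory instead of letting it stay in a register.
    if (tid < 32) {
        volatile float *smem = sdata;
        smem[tid] += smem[tid + 32];
        smem[tid] += smem[tid + 16];
        smem[tid] += smem[tid +  8];
        smem[tid] += smem[tid +  4];
        smem[tid] += smem[tid +  2];
        smem[tid] += smem[tid +  1];
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
[/code]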

There's also the [url="http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/Fermi_Compatibility_Guide.pdf"]Fermi Compatibility Guide[/url] but I don't think it contains much more info than this.
Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.
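
To make the point about [font="Courier New"]__syncthreads()[/font] in conditional code concrete, here is a minimal hypothetical sketch (the kernels just rotate values within each 256-thread block; the operation itself is arbitrary):

[code]
// Hypothetical kernels, each launched with 256 threads per block.

// WRONG: threads with i >= n skip the barrier, so behavior is undefined --
// the block may hang or read stale shared memory.
__global__ void shiftBroken(const float *in, float *out, int n)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) {
        tile[threadIdx.x] = in[i];
        __syncthreads();                        // divergent barrier
        out[i] = tile[(threadIdx.x + 1) % blockDim.x];
    }
}

// CORRECT: the barrier is executed by every thread of the block;
// only the global loads and stores themselves are guarded.
__global__ void shiftFixed(const float *in, float *out, int n)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                            // uniform barrier

    if (i < n)
        out[i] = tile[(threadIdx.x + 1) % blockDim.x];
}
[/code]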

#2
Posted 02/28/2012 12:26 PM   
[quote name='tera' date='28 February 2012 - 12:26 PM' timestamp='1330431962' post='1376109']
Run your code under [font="Courier New"]cuda-memcheck[/font] to make sure you have no stray memory accesses. Fermi GPUs detect more of the out-of-bounds accesses that would previously go unnoticed, and the offending kernel's execution then fails. Do you check return codes everywhere?

Check that all of your shared memory variables are either declared [font="Courier New"]volatile[/font] or that their accesses are properly guarded by [font="Courier New"]__syncthreads()[/font]. Because Fermi is a proper load-store architecture even with regard to shared memory, it relies much more on this; improperly synchronized code would often still work on compute capability 1.x devices. Even NVIDIA's original SDK examples were sloppy in this regard.

There's also the [url="http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/Fermi_Compatibility_Guide.pdf"]Fermi Compatibility Guide[/url] but I don't think it contains much more info than this.
[/quote]

Thanks tera, your advice has been very helpful to me.


#3
Posted 05/04/2012 02:34 AM   
You should also have an error-checking mode in which you check the result of every cuda* call and call cudaDeviceSynchronize() after every kernel launch. Obviously the sync will slow things down, but enable it as a debug check.

It is possible that one of your kernels is failing to launch at all because its block size is too large.
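
A minimal sketch of such a debug mode (the CUDA_CHECK macro and the kernel are made up for illustration; only the cuda* calls themselves are real runtime API functions):

[code]
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file and line information if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err__ = (call);                                   \
        if (err__ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err__), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1024;
    float *d_data;
    CUDA_CHECK(cudaMalloc((void **)&d_data, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    // If the block size were too large (e.g. 2048 threads on a GTX 580,
    // whose limit is 1024), the launch itself would fail and the kernel
    // would never run; cudaGetLastError() reports that immediately.
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    CUDA_CHECK(cudaGetLastError());

    // Errors that occur during kernel execution only surface after a
    // synchronization point, so in debug builds sync after every launch.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
[/code]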

#4
Posted 05/04/2012 02:02 PM   