Is it possible to have be.exe compiled as 64-bit? Issue with CUDA Toolkit 4.0 64-bit
Hi to All!

This is a topic that I created on the NVIDIA developer forums but got no solution there...
So sorry for this kind of double post :)

Original post:
http://forums.developer.nvidia.com/devforum/discussion/6646/is-it-possible-to-have-be-exe-compiled-has-64bits

[quote]
Hi All,

Before anything else, here are my test machine and CUDA version:
Win7 64-bit with 6 GB of RAM.
CUDA Toolkit 4.0, also 64-bit.

Now my problem:
I have a pretty big kernel to compile and it stops at be.exe.
After some investigation I found that be.exe stops with an out-of-memory error once it uses more than 2 GB of RAM, and that makes sense, because be.exe is a 32-bit application (even though the toolkit is the 64-bit version).
Hoping that my kernel compile wouldn't exceed 3 GB and that be.exe would be compatible with the /LARGEADDRESSAWARE flag, I set that flag on it.
But alas... The compilation quickly consumed the 3 GB and the same error occurred...
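
(For reference, I flipped the flag on be.exe with the editbin tool that ships with Visual Studio, roughly like this; the path to be.exe is just a placeholder for wherever your CUDA 4.0 install puts it:)

[code]
rem Sketch only: mark be.exe as large-address-aware (the path is a placeholder).
editbin /LARGEADDRESSAWARE "C:\CUDA\v4.0\bin\be.exe"

rem Optional: confirm the header flag was actually set.
dumpbin /headers "C:\CUDA\v4.0\bin\be.exe" | findstr /i "large"
[/code]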

Now I am stuck... The only option seems to be having be.exe compiled as a 64-bit executable, so that it can use all the RAM available (unfortunately, cutting stuff out of the kernel is not a possibility)...

Or is there a workaround that I am missing?

Best regards to all!
Marco Silva

Addendum:
I need to compile sm_1x code, so the new LLVM-based compiler won't be used...

[/quote]

Thanks in advance for any kind of help!

Best regards
Marco Silva

#1
Posted 04/10/2012 03:05 PM   
A typical situation where Open64 runs out of memory is when it tries to optimize really large kernels (even kernels that look small-ish in source code can balloon in size due to inlining of all called functions). In order to get the code to compile, you could try lowering the optimization level for Open64. The default is -Xopencc -O3. I would suggest working backwards to -Xopencc -O2, -Xopencc -O1, -Xopencc -O0. Another workaround may be to split the large kernel into two or more smaller kernels.
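
For example (assuming a single-file build here; the architecture and file names are just placeholders), the invocation could look something like this:

[code]
rem Sketch only: lower the Open64 optimization level until be.exe fits in memory.
nvcc -arch=sm_13 -Xopencc -O1 -cubin -o mykernel.cubin mykernel.cu
[/code]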

[Later:]

The compiler team reminds me that you can also use the __noinline__ attribute with device functions to limit the amount of inlining performed in the kernel. Note that due to architectural restrictions with sm_1x, __noinline__ cannot be used with all device functions; the compiler will warn about functions that it must inline and ignores the __noinline__ attribute for these.
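
As a rough sketch (the function names here are made up, and as noted, on sm_1x the compiler may still force some signatures to be inlined):

[code]
// Sketch only: keep a helper out of line so the kernel body stays smaller.
__device__ __noinline__ float saxpy_elem(float a, float x, float y)
{
    return a * x + y;
}

__global__ void bigKernel(const float *x, float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = saxpy_elem(a, x[i], y[i]);
}
[/code]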

#2
Posted 04/10/2012 06:45 PM   
Hi njuffa!

Thank you for your reply!

On the original discussion, the user Tera found the solution!
If I use the -nvvm switch in nvcc, I can use the new LLVM compiler even on sm_1x code!
So after almost 2 days of compiling, the cubin file was created and worked successfully!
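
For anyone who lands here later, the invocation was roughly this (architecture and file names are just placeholders; -nvvm is the undocumented switch Tera pointed out):

[code]
rem Sketch only: force the NVVM/LLVM frontend even for an sm_1x target.
nvcc -nvvm -arch=sm_11 -cubin -o mykernel.cubin mykernel.cu
[/code]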

Best regards,
Marco Silva

#3
Posted 04/12/2012 02:26 PM   
It is good to hear that you found an approach that works for you, but please note that to the best of my knowledge, this is not a supported configuration, meaning use of the NVVM frontend with sm_1x is not covered by our internal testing, and it may or may not work.

#4
Posted 04/12/2012 05:00 PM   
That's not good at all...

In my project I already have many kernels to work around the inlining issue.
But each kernel grows too quickly, so even multiple kernels won't do the trick...

I will also try using the __noinline__ attribute on some key functions and see if this lowers the memory use.

With the -nvvm switch at least the sm_1x 32-bit build worked. Or so it seems :)
I will do some more testing (the 64-bit version is compiling ATM) to see if anything is wrong.

When you say that it may or may not work, do you mean that the compiler can crash, or that it may produce bad code?

#5
Posted 04/12/2012 06:47 PM   
As there are occasional bugs that cause tested paths to terminate abnormally or generate incorrect code, that is obviously a distinct possibility for any path that is not tested. In general I tend to steer people away from unsupported compiler flags for production code. Proceed at your own risk.

#6
Posted 04/12/2012 07:17 PM   
OK, so this is probably not the best solution...

Are there any plans for be.exe and inline.exe to become 64-bit :)? That would be perfect!

If not, I will have to go with the __noinline__ attributes and see what I can get...

#7
Posted 04/12/2012 08:55 PM   