nvcc compiling with Tesla K80

Dear all,

I want to compile a .cu file in order to get a .ptx file that I will use in Matlab. The procedure works with a Quadro K2100M but fails with a Tesla K80. I have CUDA Toolkit 6.5 and I am using the following command to compile:

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\bin\nvcc" -ptx -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin" -arch=sm_37 -o GPU_test_kernel.ptx GPU_test_kernel.cu

The compilation is successful, but I cannot use the PTX in Matlab because I get an error. I suspect the problem comes from the compilation itself for the Tesla K80 GPU.

Any help or comment would be greatly appreciated!
Thank you in advance,
Regards

Ewen

-arch=sm_37 is the correct architecture specification for K80. What is the exact nature of the problem with Matlab? “I got an error” is not very specific. What error is reported, by which Matlab component? What does the Matlab documentation say about possible causes for such an error? Since the issue seems to be a problem with Matlab, rather than CUDA, have you tried getting help from a discussion board or Q&A site targeted at Matlab users? If so, what diagnosis or advice did you receive?

Thanks for your reply.
You were right; I have actually solved the problem by using another version of Matlab … (R2014).
Regards

Hello,

Sorry, I had some account troubles. I created a new one.

I have since run some single-precision benchmark tests using Matlab R2014a. I used the same .ptx file, compiled with nvcc without an architecture specification. It worked fine with my four available GPUs of different compute capabilities: Quadro K2100M, GeForce GTX 970, GeForce Titan X and Tesla K80. As expected, the Quadro is the slowest, followed by the GTX 970. I am nevertheless surprised that the Titan X is about 30% faster than the Tesla, which is supposed to be more powerful…

I therefore have three questions:

  • Is it normal for the Titan X to beat the Tesla in my results?
  • Would specifying the architecture when compiling with nvcc change anything?
  • I know there is a GPU boost mode for the Tesla. Is enabling it the way to get the fastest computations?

Thanks a lot!

Ewen

It’s possible for a Titan X to beat a Tesla GPU. It depends a lot on the code. A Tesla K80 should be a lot faster than a Titan X on code that is dominated by double-precision calculations.
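For context, the double-precision gap comes from the hardware's DP:SP throughput ratio: the K80's GK210 chips execute double precision at 1/3 of the single-precision rate, while the Maxwell-based Titan X does so at only 1/32 (these ratios come from the published specifications, not from anything in this thread). A quick sketch of the relative difference:

```shell
# GK210 (Tesla K80): DP runs at 1/3 of the SP rate; GM200 (Titan X): 1/32.
# Ratio of the two DP:SP ratios, i.e. the K80's per-SP-FLOP DP advantage.
awk 'BEGIN { printf "%.1fx\n", (1.0/3.0) / (1.0/32.0) }'
# prints "10.7x"
```

So for DP-dominated code the K80 has a built-in order-of-magnitude edge, while for SP code the Titan X's raw throughput can win.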

Specifying the architecture probably won't make a huge difference: your PTX code is JIT-compiled at runtime into compatible machine code for whichever device you are running on.
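To illustrate (assuming the same CUDA 6.5 toolchain as the original command): with -ptx, the -arch flag only fixes the virtual architecture the PTX targets, and the driver can JIT-compile PTX only for devices whose compute capability is at least that virtual architecture. That is presumably why the build without an architecture specification ran on all four GPUs, while compute_37 PTX would not load on the compute capability 3.0 Quadro K2100M.

```shell
# Default virtual architecture: the resulting PTX can be JIT-compiled
# for any of the four devices mentioned above.
nvcc -ptx -o GPU_test_kernel.ptx GPU_test_kernel.cu

# Target compute capability 3.7: the PTX may exploit GK210 features,
# but will only load on devices of compute capability 3.7 or higher
# (of the four GPUs above, only the Tesla K80).
nvcc -ptx -arch=sm_37 -o GPU_test_kernel.ptx GPU_test_kernel.cu
```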

A Tesla K40 or K80 is likely to pick up an additional 10-20% (or possibly more) performance with fully boosted clocks. This will also depend on the code being run. Highly dense computation codes (such as matrix multiply - eg. SGEMM/DGEMM) are not likely to benefit much from boost mode.

I am a bit puzzled by this statement: “Highly dense computation codes (such as matrix multiply - eg. SGEMM/DGEMM) are not likely to benefit much from boost mode.” Could you explain the reasoning behind it? After all, GEMM is computationally limited, and therefore benefits from an increase in core clock in a linear fashion.
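To put a rough number on the linear-scaling claim (the clock values below are the K80's published base and maximum boost SM clocks, used here as assumptions for a back-of-the-envelope estimate):

```shell
# Compute-bound kernels such as GEMM scale roughly linearly with SM clock.
# K80 base SM clock ~562 MHz, maximum boost clock ~875 MHz.
awk 'BEGIN { base = 562; boost = 875; printf "expected speedup: %.2fx\n", boost / base }'
# prints "expected speedup: 1.56x"
```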

If the thought is that GEMM running at the fastest available boost clock would cause the GPU to exceed the thermal or power envelope and lead to clock throttling, my experience with K40 is that this does not necessarily occur. In fact, I had a difficult time approaching these limits no matter what kind of prolonged GEMM computation I tried. Your mileage may vary, and I cannot speak for the K80, as I have never used one. Obviously, there are also computational kernels that cause the GPU to draw more power than when running GEMM.

There are many different flavors of GEMM under the hood, based not only on data type but also on the sizes and aspect ratios of the matrices. There are some differences in power consumption between these flavors. Other differences in power consumption exist because of natural variations in the power characteristics of each individual card, and because of differences in cooling and thus operating temperature. Lastly, different applications exhibit different “duty cycles” when they call GEMM interspersed with other kernels.

I always encourage CUDA users to try running at the highest available boost clock on a Tesla; there is a high probability that the card will be able to sustain it, assuming proper cooling.
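For anyone who wants to try this: on Tesla cards the application clocks can be raised with nvidia-smi. The 2505,875 memory/SM pair below is the maximum advertised for the K80; verify it against the SUPPORTED_CLOCKS output on your own card (administrator rights are typically required):

```shell
# List the memory/SM clock pairs the card supports
nvidia-smi -q -d SUPPORTED_CLOCKS

# Pin the application clocks to <memory MHz>,<SM MHz>
nvidia-smi -ac 2505,875

# Reset to the default application clocks
nvidia-smi -rac
```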

My previous comments were probably not well worded in this respect.

I withdraw my previous comments and refer you to typical nvidia published information on the subject:

[url]http://www.nvidia.com/content/PDF/kepler/nvidia-gpu-boost-tesla-k40-06767-001-v02.pdf[/url]

[url]http://devblogs.nvidia.com/parallelforall/increase-performance-gpu-boost-k80-autoboost/[/url]

That is a great blog post on the use of boost clocks. Boost, baby, boost! :-)

Thank you for your replies!

And can the Tesla beat the GeForce in single precision if I use the GPU boost mode?

Njuffa, as you are a Tesla user, which versions of the toolkit and driver do you recommend? I have seen conversations mentioning a lot of discrepancies between versions… Do you use Matlab too? If so, which version? :)

Thanks in advance!

I am a former Tesla user. I have never used Matlab. I have used neither a K80 nor a Titan, and know nothing about the performance characteristics of your application. So I cannot predict what the relative performance on your application will be, in particular since both K80 and Titan have auto-boost as far as I am aware.

In my view, the reasons for using a Tesla GPU would be either the need for the highest double-precision performance, a need for integration into rack-mounted servers, or a requirement for the most robust operation under 24/7 conditions.