Preferred language for learning CUDA language concepts?
Would you rather learn CUDA programming concepts (threads, blocks, grids, streams, shared memory, and so on) in CUDA C/C++ or the newly released CUDA Python?

You can read more about CUDA Python in this great Anandtech article: http://www.anandtech.com/show/6839/nvidia-and-continuum-analytics-announce-numbapro-a-python-cuda-compiler

#1
Posted 04/19/2013 04:07 PM   
For performance critical code I just cannot understand why anyone would want to use Python. I know once the GPU takes over there probably is (I am assuming) little difference for the device operations, but just copying over the data and the general CPU operations will still be slowed down by Python.

Take n-body on the cpu for example;

http://benchmarksgame.alioth.debian.org/u32/performance.php?test=nbody

C++ took 10 seconds and Python 3 took 18 minutes. There are many more examples of the speed difference on that site; only Ruby was outperformed by Python.
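To make the point concrete, here is a minimal pure-Python sketch of the kind of O(n^2) pairwise inner loop an n-body step performs; it is illustrative only (not the benchmark's actual implementation, and the `g` constant and function name are made up for the example), but it is exactly the sort of tight numeric loop where interpreter overhead dominates:

```python
def accelerations(positions, masses, g=1.0):
    """Return the Newtonian gravitational acceleration on each body.

    positions: list of (x, y, z) tuples; masses: list of floats.
    The doubly-nested loop over all pairs is the hot spot that an
    interpreter executes orders of magnitude slower than compiled code.
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz
            inv_r3 = 1.0 / (r2 ** 1.5)  # 1 / r^3, so a = G*m * d / r^3
            acc[i][0] += g * masses[j] * dx * inv_r3
            acc[i][1] += g * masses[j] * dy * inv_r3
            acc[i][2] += g * masses[j] * dz * inv_r3
    return acc
```

Every arithmetic operation here goes through dynamic dispatch and boxed floats, which is where the minutes-vs-seconds gap comes from.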

People buy Nvidia GPUs to reduce overall computing time, not be dragged down by the slow link in the chain.

Hell, if you want the high-level abstraction, just use Matlab, which even on the CPU is hundreds of times faster than Python. And one can use GPUs from Matlab as well.

Also, is Python really that much easier than C?

I appreciate that Nvidia is trying to be helpful. But my experience so far shows that the CPU time always affects the overall time, and much of the code I have written (using CUDA) does some portion of the work on the CPU and other tasks on the GPU.

Nothing is faster than using raw pointers, as I learned from the huge time difference between using thrust::device_vector&lt;T&gt; versus cudaMalloc plus Thrust pointers for sorting.

#2
Posted 04/19/2013 07:55 PM   
Thanks for the response CudaaduC! While typical Python is indeed a performance bottleneck, using the NumbaPro Python compiler (yes, compiler) that's part of the Anaconda Accelerate suite from Continuum Analytics, you can get close to CUDA C application speeds!

See a great blog post on this here: http://continuum.io/blog/monte-carlo-pricer

While debating the "efficiency" of programming in Python vs. C is worthwhile, eventually it just comes down to personal preference. However, you do have to admit that Python is a very rapidly growing language. It's also, IMHO, an easier language than C to teach a new concept in.

There are really two sides to the debate: which is the "better" language for teaching a programming concept, and which is the better language for writing production code. Until Accelerate with CUDA support arrived from Continuum Analytics, the latter was clearly C/C++. I'm excited that this is no longer such a simple question to answer!

#3
Posted 04/22/2013 03:28 PM   
As someone working in scientific computing using both Python and CUDA (via PyCUDA):

The attractiveness of Python (especially with numpy) is the ability to shorten my development time, and the attractiveness of CUDA is to shorten my run time. You are correct that these two goals have some interference, and depending on the number of times I need to run a particular program, the acceptable tradeoff between development and run time will vary. Improved integration of Python and CUDA helps bend that curve so that I can keep using Python for a broader class of programs before I need to switch part or all of the implementation to statically-compiled host languages.

These days I'm using C++ only for a few programs, and Python + PyCUDA is good enough for the rest.

#4
Posted 04/22/2013 07:49 PM   
Well, the work I have been doing was originally prototyped in Matlab and Python. I think Matlab is easier than Python, and the speed difference is huge.

I understand that Python is the trendy language now, but for my work seconds do matter. Some applications which took 8 hours in Python now take a couple of minutes in C/CUDA.

Let's compare some C/C++ CUDA code with the Python equivalent. My version of the Floyd-Warshall all-pairs shortest path algorithm (O(n^3)) is a good choice because it does the outer loop on the CPU and the rest on a GTX 680.

This implementation has a running time of 163 seconds for a dense random 10,000 x 10,000 adjacency matrix, which includes all memory allocation times and copies. The CPU version in C++ runs in about 3700 seconds, and I can only imagine how long the Python version takes.

This is a really simple algorithm, and here is the source code:

https://github.com/OlegKonings/CUDA_Floyd_Warshall_/blob/master/WikiGraphCuda/WikiGraphCuda/WGCmain.cu

It also stores the optimal paths, so the memory requirements are large, but should work with a decent GPU.

If anybody can even get close to that ~163-second C++/CUDA time I will be amazed. This is not even the best example, just a good simple test. Please post your Python CPU results and the PyCUDA results.
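For anyone who wants to try the comparison, here is a minimal pure-Python CPU baseline of Floyd-Warshall with path reconstruction, written from the standard algorithm rather than taken from the linked CUDA source; it uses the same triple loop, where the k-loop stays serial and the i/j updates are the part a GPU kernel would parallelize:

```python
INF = float("inf")

def floyd_warshall(dist):
    """All-pairs shortest paths on an n x n distance matrix (in place).

    dist[i][j] is the edge weight from i to j, INF if absent, 0 on the
    diagonal. Returns (dist, nxt) where nxt[i][j] is the next hop on the
    shortest path from i to j, for path reconstruction.
    """
    n = len(dist)
    nxt = [[j if dist[i][j] < INF else None for j in range(n)]
           for i in range(n)]
    for k in range(n):          # serial outer loop, as in the CUDA version
        dk = dist[k]
        for i in range(n):      # these two loops are the GPU-parallel part
            dik = dist[i][k]
            if dik == INF:
                continue
            di = dist[i]
            for j in range(n):
                alt = dik + dk[j]
                if alt < di[j]:
                    di[j] = alt
                    nxt[i][j] = nxt[i][k]
    return dist, nxt
```

Timing this on a dense random 10,000 x 10,000 matrix should give a concrete pure-Python number to set against the 163-second CUDA result.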

The linear algebra stuff has an even bigger time difference.

Maybe it is just as fast, but I doubt it.

#5
Posted 04/22/2013 08:16 PM   
CUDA Python does not have the same performance issues that Python alone has. You are writing kernels in Python syntax that get translated to the same machine-code kernels that CUDA C produces. CUDA Python can also interact with Numba, which is a compiler that translates Python syntax to machine code as well. This all uses the LLVM technology stack, which is the same backend that the Clang C/C++ compiler uses. So, in the end you are producing the same code from Python as from C/C++.
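The CPU side of this workflow can be sketched in a few lines: decorate a numeric function with Numba's `@jit` and LLVM compiles the loop to machine code. The `dot` function below is an illustrative example (not from Numba's docs), and the sketch falls back to plain Python when Numba is not installed, so the results are identical either way; only the speed differs:

```python
# Sketch of the Numba workflow: if Numba is available, the decorated
# function is compiled to machine code via LLVM; otherwise a no-op
# decorator keeps the code running as ordinary (slow) Python.
try:
    from numba import jit
except ImportError:
    def jit(*args, **kwargs):       # pure-Python fallback decorator
        def wrap(f):
            return f
        return wrap

@jit(nopython=True)
def dot(a, b):
    """Dot product over two equal-length numeric sequences."""
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s
```

The point the post makes is visible here: the loop body is the same Python source either way, but under Numba it compiles down to the same kind of machine code a C compiler would emit.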

That's why it should be possible to get the same kinds of speeds. Thank you for the example to try.

Numba and NumbaPro are new products and are improving regularly. Disclaimer: I work for Continuum Analytics, Inc. which sponsors the open source Numba project and sells NumbaPro.

#6
Posted 04/24/2013 06:38 PM   
Well, if you are truly able to get similar CUDA performance, that is impressive. I am happy there is work being done to make GPU programming easier.

C just seems like a good fit for GPU programming, because so much of it deals with managing memory. It is kind of like having the choice between driving a 6-speed manual Ferrari and an easier-to-drive automatic Lexus.

The Ferrari is harder to drive and easier to crash, but you have more control and speed. The Lexus is more comfortable, but cannot corner as well or go 0-60 as fast.

I do think that in the future much code will be a combination of portions better suited for the CPU and portions better suited for the GPU. In such a situation, anything other than C will be a drag. Much of dynamic programming and many graph algorithms are best handled by such a mix, and that is my focus.

#7
Posted 04/24/2013 08:03 PM   
I'm a C programmer, and at my university we use C for data structures, compilers, etc., so teaching CUDA C/C++ is best for us.
That said, I love Python too, so it's good that Nvidia is working so hard on CUDA Python.

#8
Posted 05/14/2013 04:19 AM   