This is the Maxwell facts thread. Let's end the speculation and rumours. It's just one hour before the press embargo is lifted. Let's start collecting deviceQuery dumps, performance numbers, benchmarks, instruction throughput figures, etc., all in one place.
Also: what hardware features are new? How can we make use of them?
So there's now 64 KB of shared memory per (full) SM unit, a step up from the previous 48 KB. Does anyone know whether shared memory bandwidth has been improved as well?
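Once cards are in hand, the runtime API can confirm the number directly. A minimal sketch (note that the per-multiprocessor field may need a toolkit newer than 6.0; per-block shared memory is expected to stay capped at 48 KB even if the SM has 64 KB):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Per-block limit (48 KB on Kepler) vs. the per-SM total we want to verify.
    printf("%s: %zu KB shared memory per block\n",
           prop.name, prop.sharedMemPerBlock / 1024);
    printf("%s: %zu KB shared memory per multiprocessor\n",
           prop.name, prop.sharedMemPerMultiprocessor / 1024);
    return 0;
}
```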
I’m curious to see how Maxwell improves interaction with CUDA 6 managed memory.
(BTW, I’ve submitted a pull request for managed memory support in PyCUDA. Hopefully that will be merged soon.)
Edit: I have an EVGA 750 Ti en route now. :) Interesting that I had to go direct to the manufacturer’s page to order, rather than the usual computer parts vendors I buy from.
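For anyone who hasn't tried it yet, the CUDA 6 managed memory model I'm hoping Maxwell improves boils down to a single allocation visible to both host and device. A minimal sketch (kernel name and sizes are mine):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float *data;
    // One allocation, no explicit cudaMemcpy in either direction.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    // The host must not touch managed data until the kernel has finished.
    cudaDeviceSynchronize();
    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```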
I’d love to have some information about how to make use of the ARM processor that is supposed to be inside this Maxwell chip. Is it only used internally by the driver to offload some things (like dynamic parallelism) or is it also accessible to the programmer?
I assume Dynamic Parallelism & Hyper-Q will be found in sm_50. The last page of the AnandTech article states that they are “baseline” features in Maxwell.
Now that I have to pay for CUDA devices out of my pocket, I very much appreciate that NVIDIA decided to lead the Maxwell architecture release with a low-end desktop card. :)
A friend of mine will be getting 10. His first mining farm.
Because of some product expectations I have, I'll hold off on buying single-GPU cards. I recently bought an Asus MARS, and I want that kind of device for mining, but definitely Maxwell-based. We need more hash power density, right up to the power limit a single PCI Express card can draw (would that be 250 watts?).
Does anyone know whether the GTX 750 Ti has dynamic parallelism? At CUDA GPUs - Compute Capability | NVIDIA Developer it appears with compute capability 3.0…
I want to try the new architecture but I need this feature.
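Dynamic parallelism requires compute capability 3.5 or higher (and compiling with `-arch=sm_35 -rdc=true`), so the quickest check once you have the card is to print the capability. A small sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Dynamic parallelism needs compute capability >= 3.5.
    if (prop.major > 3 || (prop.major == 3 && prop.minor >= 5))
        printf("%s (sm_%d%d): dynamic parallelism supported\n",
               prop.name, prop.major, prop.minor);
    else
        printf("%s (sm_%d%d): dynamic parallelism NOT supported\n",
               prop.name, prop.major, prop.minor);
    return 0;
}
```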
Not directly related to Maxwell, but I'm pleased to see improved code generation in CUDA 6.0. After recompiling my image processing codes, the instruction count dropped by 12% and kernel time by 22%!
One thing that has always bothered me is the very inefficient array indexing code. Unlike x86, which can compute
index * scale + offset + constOffset within a single load/store instruction, CUDA uses separate multiply and add instructions for it (you can turn the array index into an induction variable, but that increases register use). 64-bit addressing makes it worse by doubling the instruction count.
It took me a while to realize why my simple code had two multiplies for each memory load: the 32-bit index has to be widened into a 64-bit byte offset, and that wide multiply is emitted as a pair of instructions.
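A hypothetical pair of kernels showing the trade-off (names and the strided access pattern are mine, not from any real codebase):

```cuda
__global__ void indexed(const float *a, float *b, int n, int stride) {
    // Each load recomputes a[i * stride]: a multiply (two instructions
    // for the 64-bit byte offset) plus an add, every iteration.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        b[i] = a[i * stride];
}

__global__ void induction(const float *a, float *b, int n, int stride) {
    // Strength-reduced: the index math becomes a pointer increment,
    // trading the per-load multiplies for an extra live register.
    const float *p = a + threadIdx.x * stride;
    for (int i = threadIdx.x; i < n; i += blockDim.x, p += blockDim.x * stride)
        b[i] = *p;
}
```

Comparing the SASS of the two (cuobjdump -sass) should make the difference in instruction count visible.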
We should know as soon as someone gets one and prints the device capabilities. It's likely sm_35 or the (new) sm_32. It is not the sm_37 buried in the CUDA 6.0 headers, which provides more shared memory than the 64K GM107 is known to have.
Even GK208 is sm_35.
One (small) clue is from the GM107 white paper, which says "our first-generation Maxwell GPUs offer the same API functionality as Kepler GPUs". That doesn't really tell us anything except that it's sm_3x.