shared memory vs registers
What would really increase your speed is switching to one of those Tesla cards, or the original Kepler-based GeForce Titan (GK110) or Titan Z (dual GPU), which featured very high double precision throughput at 1/3 of the single precision rate.

Christian

#16
Posted 07/27/2017 08:31 AM   
The amount of shared memory on a GPU is only about 1/4 of the register pool, so it can hardly help without sharing data between threads. Instead, you can try to offload constants to constant memory.
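
A minimal sketch of that idea, assuming a hypothetical coefficient table d_coeff (the name, size, and computation are placeholders for whatever per-model constants every thread reads):

[code]
// Hypothetical coefficient table in constant memory; every thread reads the
// same entries, so the constant cache can broadcast them across the warp.
__constant__ double d_coeff[64];

__global__ void propagate(const double *state_in, double *state_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    double acc = 0.0;
    // The index k is uniform across the warp in each iteration, which is the
    // access pattern constant memory is designed for: one broadcast per read,
    // no serialization.
    for (int k = 0; k < 64; ++k)
        acc += d_coeff[k] * state_in[i];

    state_out[i] = acc;
}

// Host side, once before the kernel launch:
//   cudaMemcpyToSymbol(d_coeff, h_coeff, sizeof(h_coeff));
[/code]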

#17
Posted 07/27/2017 04:06 PM   
The idea of distributing the information across threads is actually a good one. I've used this strategy to accelerate some hash functions on consumer Kepler GPUs that had trouble keeping the full hash state in a single thread due to register pressure. So I spread the state across 4 consecutive threads and got a decent speedup.

The unfortunate part is that your arrays have dimensions of 3 and 7, which are not powers of 2, so a bit of padding (losing some efficiency) is required to make them compatible with the warp shuffle instructions.
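
For illustration only (not your code; the names and the placeholder computation are made up), here is roughly how a 7-element state could be padded to 8 and spread across groups of 8 consecutive lanes, using __shfl_sync to fetch another lane's element:

[code]
// Illustration: a 7-element state padded to 8 and distributed across groups
// of 8 consecutive lanes, one element per lane, each held in a register.
__global__ void distributed_state_demo(const double *in, double *out)
{
    const unsigned full_mask = 0xffffffffu;      // assumes the full warp is active
    int slot  = threadIdx.x & 7;                 // position 0..7 within the group
    int group = (blockIdx.x * blockDim.x + threadIdx.x) >> 3;  // one state per 8 lanes

    // Lanes 0..6 of each group hold real data; lane 7 is padding.
    double elem = (slot < 7) ? in[group * 7 + slot] : 0.0;

    // Every lane fetches element 3 of its own group. The width argument of 8
    // (a power of 2 - hence the padding) confines the shuffle to the group.
    double e3 = __shfl_sync(full_mask, elem, 3, 8);

    elem += 0.5 * e3;                            // placeholder computation

    if (slot < 7)
        out[group * 7 + slot] = elem;
}
[/code]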

If you could post your latest code I could have a closer look.

#18
Posted 07/27/2017 04:32 PM   
[quote]instead, you can try to offload constants to the constant memory[/quote]

The constant memory is 64K, so not significantly larger than shared memory. And since it is designed for broadcast access across an entire warp, efficient use requires uniform access (or at most 2-3 different addresses across a warp, i.e. minor serialization), which basically amounts to inter-thread sharing.

The compiler moves compile-time constants (user specified as well as compiler generated ones) into constant memory on its own. So I am not sure what specific data on top of that we could / would manually move into constant memory in this code?

I note that register pressure is significantly lower (100 registers) with the code as originally written (all arrays in local memory) plus complete loop unrolling (as suggested by cbuchner in #4, so we have come full circle on this :-), instead of the incorrect version using arrays in shared memory that was posted. So that would seem the way to go.
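
To make the contrast concrete, a sketch of that pattern (placeholder derivative, not the actual kernel): with the trip count known at compile time and the loop fully unrolled, every array index becomes a compile-time constant, so the compiler can keep the small local arrays entirely in registers.

[code]
// Sketch: small fixed-size local arrays plus full unrolling. After unrolling,
// every k1[i] / y[i] access has a constant index, so nothing needs to spill
// to local memory and the arrays can live in registers.
__device__ void rk_step_sketch(double y[7], double h)
{
    double k1[7];

    #pragma unroll
    for (int i = 0; i < 7; ++i)
        k1[i] = -0.1 * y[i];     // placeholder derivative (stands in for cent_body etc.)

    #pragma unroll
    for (int i = 0; i < 7; ++i)
        y[i] += h * k1[i];
}
[/code]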

#19
Posted 07/27/2017 04:46 PM   
I think about constant memory because the real version of the code contains not such a simple device function as cent_body but more complicated ones (accelerations from the Earth's gravitational field with n x n harmonics, accelerations from atmospheric drag, etc.). So additional arrays of constant values appear in the code. Each thread uses these arrays, and it would be better to allocate them as constant variables in constant memory. The idea of cbuchner1 to combine threads for the calculation is interesting too, but it requires completely changing the code.
Thanks

#20
Posted 07/28/2017 06:17 AM   