Constant memory 64KB only 8KB usable?

Appendix G of the CUDA 3.1 C programming guide says there is 64KB of constant memory but only an 8KB constant cache per multiprocessor.

My largest kernel uses 55878 bytes of constant memory in one large array and two small ones (32 ints each). It runs very slowly. I can make performance worse by moving where the arrays are declared. I am using a mixture of short int and unsigned int.
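
Roughly, the layout is like this (the names and element counts below are only illustrative, not my actual code):

    __constant__ short int    big_table[27000];   // roughly 54 KB of short ints - the one large array
    __constant__ unsigned int small_a[32];         // 128 bytes
    __constant__ unsigned int small_b[32];         // 128 bytes

    __global__ void my_kernel(const unsigned int *in, unsigned int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // the index depends on the data, so threads in a warp tend to
            // land on different parts of the large table
            unsigned int idx = in[i] % 27000;
            out[i] = (unsigned int)big_table[idx] * small_a[i & 31] + small_b[i & 31];
        }
    }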

I am not sure of the significance of the 8KB cache, but I am beginning to suspect that (despite lots of effort with shared memory) the kernel is held up by random access to off-chip memory for “constant” data as the 8KB cache is overwhelmed. On the other hand perhaps the 295 GTX does not like short int constants.

As always any help, comments or hints would be most welcome.

Bill

Do all threads of the warp access the same array elements? As far as I remember, constant cache accesses to different elements get serialized.
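
Roughly (a made-up sketch, not your code):

    __constant__ float table[2048];

    __global__ void example(float *out)
    {
        // broadcast: every thread in the warp reads the same element -> one constant cache access
        float a = table[blockIdx.x % 2048];

        // divergent: threads in the warp read different elements -> the accesses are serialized
        float b = table[(threadIdx.x * 37) % 2048];

        out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
    }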

Dear tera,

Whilst there is some structure, each thread tends to read at random from the array. My initial assumption was that this would be good enough since I had struggled to fit all the data into 64KB. Now my plan is to split the kernel in two and force each half to limit itself to reading < 8KB.

How big a deal is serialised access? I am hoping that using less than the cache size will ensure no off-chip reads and that will be a big enough win. Any thoughts on how to check this?

Once again many thanks

Bill

My guess is that using textured reads (using e.g. cudaArrays) will give better performance.
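
A sketch of one way to do it, binding the big table to a texture over linear device memory with the CUDA 3.x texture reference API (the cudaArray route is similar); all names and sizes are illustrative:

    #define TABLE_WORDS 16384   // illustrative table size

    texture<unsigned int, 1, cudaReadModeElementType> tex_table;

    __global__ void lookup_kernel(const unsigned int *in, unsigned int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(tex_table, in[i] % TABLE_WORDS);  // goes through the texture cache
    }

    void setup_and_run(const unsigned int *h_table, const unsigned int *d_in,
                       unsigned int *d_out, int n)
    {
        unsigned int *d_table;
        cudaMalloc((void **)&d_table, TABLE_WORDS * sizeof(unsigned int));
        cudaMemcpy(d_table, h_table, TABLE_WORDS * sizeof(unsigned int), cudaMemcpyHostToDevice);
        // bind the linear buffer to the texture reference
        cudaBindTexture(0, tex_table, d_table, TABLE_WORDS * sizeof(unsigned int));
        lookup_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    }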

I agree.

If you can split your kernel operation on ~64kb of data into two operating on ~8kb each, that would indeed be a perfect solution (is the constant data kind of separable, or why is that possible?)

Cached serialized access should indeed still be faster than uncached access to the same address.

Dear tera and cbuchner1,

Many thanks for your suggestions. I am indeed trying to restructure the code so that it works on each column (row) one at a time. The hope is that this will limit the volume of data read by each multi-processor and so fit inside the 8Kbyte limit.
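
Something along these lines (all the names and sizes below are just placeholders to show the idea: copy the ~8 KB slice a column needs into constant memory, then launch):

    #define SLICE_WORDS 2048   // 8 KB of unsigned int per slice (illustrative)

    __constant__ unsigned int slice_table[SLICE_WORDS];

    __global__ void process_column(const unsigned int *in, unsigned int *out, int column, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[column * n + i] = slice_table[in[i] % SLICE_WORDS];
    }

    void run_all_columns(const unsigned int *h_table, const unsigned int *d_in,
                         unsigned int *d_out, int n_columns, int n)
    {
        dim3 block(256);
        dim3 grid((n + block.x - 1) / block.x);
        for (int col = 0; col < n_columns; ++col) {
            // reload the 8 KB slice this column needs before the launch
            cudaMemcpyToSymbol(slice_table, h_table + col * SLICE_WORDS,
                               SLICE_WORDS * sizeof(unsigned int));
            process_column<<<grid, block>>>(d_in, d_out, col, n);
        }
    }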

Is there an easy way to see how effective each constant cache is? E.g. hit or miss rates?

Many thanks

Bill

Still working on this :-(
However, people may be interested in a recent article in which they probe the 280 GTX’s caches in huge detail.
Bill

Wong, H.; Papadopoulou, M.-M.; Sadooghi-Alvandi, M.; Moshovos, A.; , “Demystifying GPU microarchitecture through microbenchmarking,” Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on , vol., no., pp.235-246, 28-30 March 2010

Dear Dr.Langdon,

Have you evaluated Textures for your requirement?

Warp serialization can bring down performance dearly, especially for FLOP-intensive apps which can’t hide this latency.

Best Regards,
Sarnath