Texture Unit in Pascal architecture
Hi all,

I found out from nvprof that the texture unit is still used even though I did not use any texture memory on my Pascal Titan X. I also found that on Pascal the texture cache can serve as a unified L1 cache for the SM. Will the texture unit also be in use whenever the L1/texture cache is in use?

Thanks,

#1
Posted 06/08/2017 08:28 AM   
The Maxwell and Pascal architectures combined the TEX and L1 caches into a single unified cache. All global, local, surface, and texture operations go through this cache.

http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf p.6

"Pairs of processing blocks share four texture filtering units and a texture cache. The compute L1 cache
function has now also been combined with the texture cache, and shared memory is a separate unit
(similar to the approach used on G80, the first CUDA capable GPU), that is shared across all four blocks."


I do not understand the question "Will the texture unit also be in use when L1/texture cache is in use?". All local, global, surface, and texture operations go through the texture cache; local and global operations simply do not perform LOD calculation, sampling, filtering, or data-type conversion.

#2
Posted 06/10/2017 05:31 PM   
Thank you for your post, it may save me some coding time.

I have two global arrays.
1. Small and heavily read
2. The second big and read only once.
Both are read in order rather than at random.

My thought is that there is no point caching the second, but the compiler will not know this;
instead both arrays will be cached, and so the first (small, <12 Kbytes) will often be pushed
out of the cache by the second (<16 Mbytes).

I was thinking of putting the small, heavily used array in a 1D int texture
and using the texture cache. But I read your post to mean that on my
GeForce GTX 745 the L1 and texture caches are combined, so there is no separate texture cache
and this approach will not work.
Have I understood correctly?

Is there another way?

How can I tell whether the feared thrashing of the cache is happening or not?

As always, any help or guidance would be most welcome
Bill

#3
Posted 01/02/2018 08:55 PM   
12 KB can be placed into the shared memory of each SM

another possibility is to use L2-only caching for the large array and L1-enabled caching for the small one:

https://nvlabs.github.io/cub/classcub_1_1_cache_modified_input_iterator.html#details

https://nvlabs.github.io/cub/group___util_io.html#gac5f2805ad56fdd0f2860a5421d76d9b9

Well, LOAD_CS looks like a good fit for the large table, and LOAD_CA for the small one; you can experiment to find the best modifiers. Also, note that results may differ on other GPU generations, since this is a part of the architecture that changes especially often :)
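
For illustration, a minimal sketch of the cache-modified loads might look like this (the array and size names small_arr, big_arr, n_small, n_big are assumptions, not taken from any code in this thread):

[code]
// A minimal sketch of cache-modified loads using CUB's CacheModifiedInputIterator.
// All names here are placeholders, not from the poster's code.
#include <cub/cub.cuh>

__global__ void sum_kernel(int *small_arr, int *big_arr, int *out,
                           int n_small, int n_big)
{
    // LOAD_CA: cache at all levels -- keep the small, reused array resident.
    cub::CacheModifiedInputIterator<cub::LOAD_CA, int> a(small_arr);
    // LOAD_CS: cache-streaming hint -- the big array is only read once.
    cub::CacheModifiedInputIterator<cub::LOAD_CS, int> b(big_arr);

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n_big)
        out[idx] = a[idx % n_small] + b[idx];
}
[/code]

The iterator approach leaves the body of the kernel untouched; whether the hints actually pay off is architecture-dependent, as noted above, so it is worth profiling both variants.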

#4
Posted 01/02/2018 09:08 PM   
"instead both arrays will be cached, and so the first (small, <12 Kbytes) will often be pushed out of the cache by the second (<16 Mbytes)."

Slightly simplifying what will happen: Given just these two arrays, elements from the small array will be read into the cache at most twice. According to what you stated, elements from the large array will be read only once, and may kick elements of the small array out of the cache at that time. The next access to an element from the small array will place it in the cache again, and it will be read from there many times. The loss in efficiency compared to the ideal case (the large array is kept out of the cache altogether) seems minuscule.

As BulatZiganshin points out, manually caching the small array in shared memory seems like the way to go for ultimate performance.
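
For illustration, manually staging the small array in shared memory could look roughly like this (a sketch under the assumption that the small array is at most about 3000 ints; all names are placeholders, not from this thread):

[code]
// Minimal sketch of staging a small, heavily-read array in shared memory
// once per block. SMALL_N and the array names are illustrative assumptions;
// ~3000 ints (12 KB) stays well under the per-block shared memory limit.
#define SMALL_N 3000

__global__ void staged_kernel(const int *small_arr, const int *big_arr,
                              int *out, int n)
{
    __shared__ int s_small[SMALL_N];

    // Cooperative copy: each thread loads a strided subset of the small array.
    for (int k = threadIdx.x; k < SMALL_N; k += blockDim.x)
        s_small[k] = small_arr[k];
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = s_small[idx % SMALL_N] + big_arr[idx];
}
[/code]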

#5
Posted 01/02/2018 09:14 PM   
I think your GTX 745 is compute capability 5.0 (GM107), so I would agree with your assessment. On cc 5.0, L1/Tex/RO are unified. A few possibilities:

- read the 2nd array using uncached loads

https://stackoverflow.com/questions/12553086/cuda-disable-l1-cache-only-for-one-variable

- put the first array in __constant__ memory if read patterns are uniform (same address across warp)
- put the first array in shared memory (might be best/most performant option)

The profiler can help with assessing cache behavior. Take a look at the cache-related metrics such as hit rate.
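
For the uncached-load option, one possibility on cc 3.5+ devices (which includes the GTX 745) is to use the load intrinsics rather than inline PTX; a hedged sketch with placeholder names:

[code]
// Hedged sketch of the uncached/streaming-load option (cc >= 3.5): route the
// small array through the read-only data path with __ldg() and stream the
// big, read-once array with __ldcs(). All names are placeholders.
__global__ void sum_kernel_intrinsics(const int *small_arr, const int *big_arr,
                                      int *out, int n_small, int n_big)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n_big) {
        int a = __ldg(&small_arr[idx % n_small]);  // read-only (tex) cache path
        int b = __ldcs(&big_arr[idx]);             // cache-streaming load
        out[idx] = a + b;
    }
}
[/code]

For the profiler, nvprof metrics such as global_hit_rate and tex_cache_hit_rate (exact names vary between CUDA versions) are a reasonable starting point for judging whether the small array stays resident in the unified L1/tex cache.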

#6
Posted 01/02/2018 09:19 PM   
njuffa, your analysis is incorrect - the outcome depends on the ratio of accesses. If we read A[i] once per 1000 reads of B[j], then a cache with 1000 cells will drop the A[i] value before it is requested again.

Although there is a good chance that this will not be the case, and anyway: if there are so many accesses to B[] that the 12 KB A[] may be swapped out of the 1-4 MB cache of a modern GPU, we probably don't have enough accesses to A[] to gain any significant speedup by caching it better.

#7
Posted 01/02/2018 09:49 PM   
"If we read A[i] once per 1000 reads of B[j], then a cache with 1000 cells will drop the A[i] value before it is requested again."

That's not how I interpreted the usage pattern stated by the OP. To first order, the single reading of the large array B[] serves to flush the cache completely, once. Accesses to the small array A[] thereafter will repopulate the cache with data from A[], and any subsequent accesses (many, according to the OP) will now hit the cache.

Maybe the access pattern in the OP's code is more complicated than the original description suggests. Maybe there is more data than just A[] and B[]. Maybe accesses to A[] and B[] are interspersed in interesting ways. Even so, given that A[] is the most frequently accessed data around, and by itself completely fits into the cache, caching should work very well for this situation.

It will be interesting to see how performance changes from reading both A[] and B[] through the cache, vs. caching A[] in shared memory manually. If the code does not do so yet, I would suggest use of '__restrict__' and 'const __restrict__' pointers to allow the compiler the maximum freedom in arranging an "optimal" sequence of load instructions.
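
For reference, the __restrict__ suggestion amounts to qualifying the pointer parameters, roughly as follows (a generic illustration with assumed names, not the OP's actual kernel):

[code]
// Illustration of the __restrict__ suggestion; kernel name and parameters
// are assumed for the example, not taken from the OP's code.
__global__ void kernel(const int * __restrict__ small_arr,  // read-only, no aliasing
                       const int * __restrict__ big_arr,    // read-only, no aliasing
                       int       * __restrict__ out,        // output, no aliasing
                       int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = small_arr[idx] + big_arr[idx];
}
[/code]

The combination of const and __restrict__ tells the compiler the inputs are read-only and non-aliased, which gives it the most freedom to reorder and batch the loads.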

#8
Posted 01/02/2018 10:17 PM   
My bad, I was sure that my interpretation of this English "code" was the only one possible. But English is an awfully ambiguous language; we should ban its use for any serious conversations! :D

#9
Posted 01/02/2018 11:21 PM   
At this point I am actually not at all sure that my interpretation of the OP's description is correct. In general, it is better to look at actual code than read high-level descriptions of it.

#10
Posted 01/02/2018 11:26 PM   
Many thanks for all your kind thoughts.

In the hope that this will help, I am going to try to post the code of
my kernel:
[code]
//WBL 3 Jan 2018 clean up for https://devtalk.nvidia.com/default/topic/1012969/cuda-programming-and-performance/texture-unit-in-pascal-architecture/

//Was r1.65

#define INF 10000000 /* (INT_MAX/10) */
#define MIN2(A, B) ((A) < (B) ? (A) : (B))

//BLOCK_SIZE 128 only slightly lower performance on GeForce GTX 745 (4GB compute capability 5.0)
#define BLOCK_SIZE 64

/*
kernel, excepting data which is set to INF, set the output to the smallest sum of fml_i and fml_j
note data in fml_i (especially at the top end) are repeatedly read
but data in fml_j are only read once.
Each block calculates the smallest sum for one chunk and writes it (one int) to dml.
The outputs do not overlap.

The volume of work starts tiny (one addition) but grows quadratically as i reduces to 1.

typical use:
i 2909 down to 1
turn = 4
length = 2913
fml_i 2914 int
fml_j 4247155 int
dml 2914 int
I.e. arrays are of fixed size but part used grows linearly or quadratically (fml_j)
as i decreases to 1.
Identically the number of blocks grows linearly as i decreases to 1.
*/

//tried __restrict__ and got slightly lower performance on GeForce GTX 745
__global__ void
kernel(
    const int i, const int turn, const int length,
    const int* fml_i, const int* fml_j, //In
    int* dml) {                         //Out

    const int x = blockIdx.x;
    const int j = x + (i + 2*(turn+1)) + 1;
    int y = threadIdx.x;
    int thread = j*(j-1)/2 + threadIdx.x + i + (turn+1) + 1;
    int decomp = INF;
    __shared__ int en[BLOCK_SIZE];
    for(; y <= x; thread+=blockDim.x, y+=blockDim.x) {
        //assert(x>=0 && x<=length);
        //assert(y>=0 && y<=length);
        //assert(y<=x);

        en[threadIdx.x] = ((fml_i[y] != INF ) && (fml_j[thread] != INF))? fml_i[y] + fml_j[thread] : INF;

        //Use reduction, require power of two block size
#if BLOCK_SIZE > 32
#define SYNC32 __syncthreads()
#else
#define SYNC32
#endif
        int ix = threadIdx.x;
        const int ix_stop = MIN2(x-y+threadIdx.x, blockDim.x - 1);
        //assert(ix_stop >= 0 && ix_stop < blockDim.x);
        //assert(en[ix] > -INF && en[ix] <= INF);
        //assert(en[ix] != 0); //for testing only
#if BLOCK_SIZE >=1024
        __syncthreads(); if(ix+512 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+512]);
#endif
#if BLOCK_SIZE >=512
        __syncthreads(); if(ix+256 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+256]);
#endif
#if BLOCK_SIZE >=256
        __syncthreads(); if(ix+128 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+128]);
#endif
#if BLOCK_SIZE >=128
        __syncthreads(); if(ix+ 64 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+ 64]);
#endif
#if BLOCK_SIZE >=64
        __syncthreads(); if(ix+ 32 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+ 32]);
#endif
        SYNC32; if(ix+ 16 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+ 16]);
        SYNC32; if(ix+ 8 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+ 8]);
        SYNC32; if(ix+ 4 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+ 4]);
        SYNC32; if(ix+ 2 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+ 2]);
        SYNC32; if(ix+ 1 <= ix_stop) en[ix] = MIN2(en[ix], en[ix+ 1]);
        SYNC32;
        decomp = MIN2(decomp,en[ix]);
    }//endfor

    if(threadIdx.x==0){
        dml[j] = decomp;
    }
}
[/code]

#11
Posted 01/03/2018 07:40 PM   
Thank you everyone.

The same volume of data is read from the small array and the large array:
essentially sum = A[i] + B[j]
but i takes many of the same small (<3000) values over and over again,
whereas j (<4500000) is never repeated.


My understanding may be out of date but:
1. I had a bad experience with constant memory early on (CUDA 3.1).
Each thread in a warp will access an adjacent int in constant memory,
and I suspect this will cause the reads to be serialised.
My feeling is that constant memory is only being retained for compatibility.
I seem to remember (ages back) that constant memory was actually implemented by
a tiny (2 KB??) read-only cache.
2. Again I may be out of date, but reading data into shared memory means it
can only be used by the block that read it?

I will re-read your many helpful suggestions.
Many thanks
Bill

#12
Posted 01/03/2018 08:05 PM   
I am having an impossible time trying to assess the access pattern for fml_i[] in my head. Is this some sort of triangular structure? So I wouldn't want to predict what happens to the cache here. The code certainly looks very different from what I had envisioned based on the original description; I withdraw my earlier comments, they don't seem to apply at all.

Manually caching fml_i[] in shared memory seems advisable and trivially possible (unless BLOCK_SIZE > 1024, but a block size that large would not normally be recommended anyway, since one would want to have at least two thread blocks active per SM for good performance).

By benchmarking the original code, as posted, against a version using shared memory for fml_i[], we would get a reasonable idea about the amount of destructive interference in the cache, but only indirectly. At this point I am actually quite curious what the result would be, it should help build intuition for similar usage patterns.

BTW, does the comment about __restrict__ being counter-productive in terms of performance still apply? That may be owed to compiler artifacts related to load ordering / batching, or maybe register pressure effects (as pulling out loads to be performed early may increase the live range of variables).

#13
Posted 01/03/2018 08:05 PM   
Shared memory is shared by all threads in a thread block, correct.


You are correct, constant memory is built for broadcast access across a warp. If different threads in the warp present different addresses, the hardware will serialize the access and replay the load for as many times as different addresses occur across the warp. This is an obvious performance penalty. Last I did experiments (Kepler architecture), use of constant memory would still be a win if the average number of unique addresses presented is < 2.5 (across the small number of scenarios I examined).

Constant memory is still needed for literal constants, both those introduced by programmers and compiler-generated ones (the latter go to a different constant bank than the former). Constant memory (yet another bank) is also used to pass kernel arguments. So constant memory is not obsolete. With the broadcast access, an access to constant memory is basically the same cost as a register access.

If you implement a function by table lookup, constant memory might still be a good place to consider. I looked at a real-life use case for that not too long ago, where I examined the trade-off between on-the-fly computation and table lookup. Normally, I push in the direction of using computation, as "FLOPs are too cheap to meter" and computation is energetically advantageous compared with memory access. However, in this particular case it turned out (contrary to my expectation) that table access won in terms of performance, because the serialization had only a minor impact.
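
For illustration, the table-lookup-in-constant-memory pattern could be sketched as follows (the symbol names and table size are assumptions, not from any real code):

[code]
// Hedged sketch of a lookup table in __constant__ memory. TABLE_N and the
// names are assumptions; ~3000 ints (12 KB) fits within the 64 KB constant
// memory limit. Best when all threads of a warp read the same element
// (broadcast); divergent indices are serialized as described above.
#define TABLE_N 3000

__constant__ int c_table[TABLE_N];

__global__ void lookup_kernel(const int *keys, int *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = c_table[keys[idx] % TABLE_N];  // uniform key across a warp -> one access
}

// Host side, before launching the kernel:
//   cudaMemcpyToSymbol(c_table, h_table, TABLE_N * sizeof(int));
[/code]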

#14
Posted 01/03/2018 08:19 PM   
1. If I understood the kernel right, a thread block doesn't reuse any values, so forget about caching data in shared memory.

2. So the old suggestions apply: direct the compiler to use non-caching reads for the large array and try various forms of caching for the small one.

3. Most of your code is a block-reduce algorithm; you may try the ready-to-use, super-optimized implementation from https://nvlabs.github.io/cub/classcub_1_1_block_reduce.html#details (a sketch follows after this list).

4. Maybe the code can be optimized further by prefetching data in parallel with computation. The simplest way to implement this is to use CUB's BlockLoad algorithm (https://nvlabs.github.io/cub/classcub_1_1_block_load.html), followed by a multi-item BlockReduce (https://nvlabs.github.io/cub/classcub_1_1_block_reduce.html#a81878a614ef3b39de654918fc1f6144d).
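
For illustration, a block-wide minimum with cub::BlockReduce (item 3) might be sketched as follows; BLOCK_SIZE and INF are assumed to mirror the posted kernel, and the array names are placeholders:

[code]
// Sketch of a block-wide minimum with cub::BlockReduce. BLOCK_SIZE and INF
// mirror the posted kernel; the array names are placeholders.
#include <cub/cub.cuh>

#define BLOCK_SIZE 64
#define INF 10000000

__global__ void block_min_kernel(const int *vals, int *block_min, int n)
{
    typedef cub::BlockReduce<int, BLOCK_SIZE> BlockReduceT;
    __shared__ typename BlockReduceT::TempStorage temp_storage;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int v = (idx < n) ? vals[idx] : INF;   // pad out-of-range threads with INF

    // All threads of the block must take part in the collective call.
    int result = BlockReduceT(temp_storage).Reduce(v, cub::Min());

    if (threadIdx.x == 0)
        block_min[blockIdx.x] = result;    // the reduced value is only valid in thread 0
}
[/code]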

#15
Posted 01/03/2018 09:06 PM   