Even more Fermi Fun: Uncoalesced writes
My last post got me wondering whether the L2 cache is of any benefit for writes that cannot be coalesced at the warp level but could be approximately coalesced at a higher level. To test the idea, I wrote a little kernel in which each thread (16M total) writes an integer to output index = threadIdx.x * gridDim.x + blockIdx.x. This requires 16M memory transactions to process, but it also offers plenty of opportunities to combine writes from different blocks if you get lucky and consecutive blocks write to the same cache line at the same time. Results:

[code]Device name: GeForce GTX 275
BogoGFLOPS: 699.8

Size of array: 16776960 elements
No coalesce: 21.145 ms, 3026.6 MB/sec[/code]

(Edit: fixed the device name and BogoGFLOPS value, which I retyped incorrectly; avidday spotted it below.)
[code]Device name: GeForce GTX 470
BogoGFLOPS: 1088.6

Size of array: 16776960 elements
No coalesce: 6.492 ms, 9857.6 MB/sec[/code]

Not bad. A 3.25x improvement between devices with approximately the same theoretical memory bandwidth. Looks like writes can be combined in the L2 cache.

(Code at: [url="http://bitbucket.org/seibert/fermi_test/src/tip/no_coalesce.cu"]http://bitbucket.org/seibert/fermi_test/sr.../no_coalesce.cu[/url] )
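For reference, the write pattern described above boils down to a kernel along these lines (a minimal sketch based on the version quoted further down the thread; the full test harness is at the link):

[code]#define THREADS_PER_BLOCK 256

__global__ void no_coalesce(float *output)
{
    unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
    // Swap the roles of thread and block ID: adjacent threads in a warp
    // write to addresses gridDim.x elements apart (defeating warp-level
    // coalescing), while consecutive blocks still write to neighbouring
    // addresses and hence the same cache lines.
    unsigned int write_index = threadIdx.x * gridDim.x + blockIdx.x;
    output[write_index] = index * 1.1f;
}[/code]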

#1
Posted 05/31/2010 07:49 PM   
I presume the second one is Fermi, despite the GTX 275 title and "BogoGFLOPS" value?

#2
Posted 05/31/2010 08:42 PM   
[quote name='avidday' post='1065493' date='May 31 2010, 03:42 PM']I presume the second one is Fermi, despite the GTX 275 title and "BogoGFLOPS" value?[/quote]

Oops, right, sorry. I have to run jobs from home on the GTX 470 through a piece of junk Java terminal emulator that doesn't support copy/paste.

#3
Posted 05/31/2010 11:09 PM   
[quote name='seibert' post='1065470' date='May 31 2010, 10:49 PM'][code]Device name: GeForce GTX 275
BogoGFLOPS: 699.8

Size of array: 16776960 elements
No coalesce: 21.145 ms, 3026.6 MB/sec[/code]

(Edit: fixed the device name and BogoGFLOPS value, which I retyped incorrectly; avidday spotted it below.)
[code]Device name: GeForce GTX 470
BogoGFLOPS: 1088.6

Size of array: 16776960 elements
No coalesce: 6.492 ms, 9857.6 MB/sec[/code]

Not bad. A 3.25x improvement between devices with approximately the same theoretical memory bandwidth. Looks like writes can be combined in the L2 cache.

(Code at: [url="http://bitbucket.org/seibert/fermi_test/src/tip/no_coalesce.cu"]http://bitbucket.org/seibert/fermi_test/sr.../no_coalesce.cu[/url] )[/quote]
This is on Linux with a GTX 480:
[code]-bash-3.2$ ./NoCoalesce
Device name: GeForce GTX 480
BogoGFLOPS: 1345.0

Size of array: 16776960 elements
No coalesce: 5.605 ms, 11418.9 MB/sec[/code]

These results bring me back to my original question at [url="http://forums.nvidia.com/index.php?showtopic=168798&st=0"]http://forums.nvidia.com/index.php?showtopic=168798&st=0[/url] ....
seibert - I hope it's not too rude to ask, but maybe you could put together some test cases (as you did here) that stress-test a bandwidth-bound kernel,
a compute-bound kernel, and a 50% bandwidth / 50% compute-bound kernel, and see the results?

I'll try to do this myself as well....

thanks
eyal
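
A compute-bound variant of the same test might look roughly like this (a hypothetical sketch; the kernel name and constants are made up, not code from the thread). A 50/50 version could use the same body with a much shorter loop:

[code]__global__ void compute_bound(const float *input, float *output)
{
    unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
    float a = input[index];
    // Many dependent FLOPs per 8 bytes of memory traffic, so arithmetic
    // throughput rather than bandwidth should dominate the run time.
    for (int i = 0; i < 1000; i++)
        a = a * 1.0000001f + 0.0000001f;
    output[index] = a;
}[/code]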

#4
Posted 06/01/2010 06:30 AM   
[quote name='eyalhir74' post='1065685' date='Jun 1 2010, 09:30 AM']These results bring me back to my original question at [url="http://forums.nvidia.com/index.php?showtopic=168798&st=0"]http://forums.nvidia.com/index.php?showtopic=168798&st=0[/url] ....
seibert - I hope it's not too rude to ask, but maybe you could put together some test cases (as you did here) that stress-test a bandwidth-bound kernel,
a compute-bound kernel, and a 50% bandwidth / 50% compute-bound kernel, and see the results?

I'll try to do this myself as well....[/quote]
I changed the original code a bit to do a simple bandwidth test - I hope that's OK with you, seibert :). The results are somewhat interesting and disappointing at the same time :) (at least for me)...

[code]/* -*- c++ -*- */

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

/* Yes, I'm going to hell. */
#define CUDA_CHECK_ERROR( call) do {                                        \
        cudaError err = call;                                               \
        if( cudaSuccess != err) {                                           \
            fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",   \
                    __FILE__, __LINE__, cudaGetErrorString( err) );         \
            exit(EXIT_FAILURE);                                             \
        } } while (0)

///////////////// Kernels ///////////////////////

#define THREADS_PER_BLOCK 256

__global__ void no_coalesce(float *output)
{
    unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
    // Transpose the block and thread ID to break coalescing maximally
    unsigned int write_index = threadIdx.x * gridDim.x + blockIdx.x;
    output[write_index] = index * 1.1f;
}

__global__ void bw_test_straight( float *input, float *output )
{
    unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
    output[index] = input[index] * 1.1f;
}

__global__ void bw_test_transpose( float *input, float *output )
{
    unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
    // Transpose the block and thread ID to break coalescing maximally
    unsigned int write_index = threadIdx.x * gridDim.x + blockIdx.x;
    output[write_index] = input[ index ] * 1.1f;
}


////////////////// Host code ///////////

int main()
{
    const unsigned int blocks = 65535;
    const unsigned int n = blocks * THREADS_PER_BLOCK;

    // Device information
    int device;
    CUDA_CHECK_ERROR(cudaGetDevice(&device));
    struct cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);  // query the current device, not device 0
    int proc_per_multiproc = 8;
    if (prop.major == 2) proc_per_multiproc = 32;
    printf("Device name: %s\n", prop.name);
    // Bogus normalization metric
    float bogogflops = 2 * prop.clockRate * prop.multiProcessorCount * proc_per_multiproc / 1e6;
    printf("BogoGFLOPS: %1.1f\n\n", bogogflops);

    printf("Size of array: %d elements\n", n);

    // Allocate arrays
    float *d_output, *d_input;
    float *h_input = new float[n];
    for ( int i = 0; i < n; i++ ) h_input[ i ] = i / 5.555f;
    CUDA_CHECK_ERROR(cudaMalloc( ( void ** )&d_input, sizeof(float) * n ));
    CUDA_CHECK_ERROR(cudaMemcpy( d_input, h_input, sizeof(float) * n, cudaMemcpyHostToDevice ));

    CUDA_CHECK_ERROR(cudaMalloc((void **) &d_output, sizeof(float) * n));

    // Warmup
    no_coalesce<<<blocks, THREADS_PER_BLOCK>>>(d_output);
    bw_test_straight<<<blocks, THREADS_PER_BLOCK>>>(d_input, d_output);
    bw_test_transpose<<<blocks, THREADS_PER_BLOCK>>>(d_input, d_output);

    CUDA_CHECK_ERROR(cudaThreadSynchronize());

    cudaEvent_t start, stop, start1, stop1, start2, stop2;
    float elapsedTime, elapsedTime1, elapsedTime2;
    CUDA_CHECK_ERROR(cudaEventCreate(&start));
    CUDA_CHECK_ERROR(cudaEventCreate(&stop));
    CUDA_CHECK_ERROR(cudaEventCreate(&start1));
    CUDA_CHECK_ERROR(cudaEventCreate(&stop1));
    CUDA_CHECK_ERROR(cudaEventCreate(&start2));
    CUDA_CHECK_ERROR(cudaEventCreate(&stop2));

    CUDA_CHECK_ERROR(cudaEventRecord(start, 0));

    no_coalesce<<<blocks, THREADS_PER_BLOCK>>>(d_output);

    CUDA_CHECK_ERROR(cudaEventRecord(stop, 0));
    CUDA_CHECK_ERROR(cudaEventSynchronize(stop));
    CUDA_CHECK_ERROR(cudaEventElapsedTime(&elapsedTime, start, stop));

    // Compute effective memory bandwidth (counts only the 4 output bytes per element).
    printf("No coalesce: %1.3f ms, %1.1f MB/sec\n", elapsedTime,
           ((float) n / elapsedTime * 1e3 / 1024.0 / 1024.0 * 4));

    CUDA_CHECK_ERROR(cudaEventRecord(start1, 0));

    bw_test_straight<<<blocks, THREADS_PER_BLOCK>>>(d_input, d_output);

    CUDA_CHECK_ERROR(cudaEventRecord(stop1, 0));
    CUDA_CHECK_ERROR(cudaEventSynchronize(stop1));
    CUDA_CHECK_ERROR(cudaEventElapsedTime(&elapsedTime1, start1, stop1));

    printf("bw_test_straight: %1.3f ms, %1.1f MB/sec\n", elapsedTime1,
           ((float) n / elapsedTime1 * 1e3 / 1024.0 / 1024.0 * 4));

    CUDA_CHECK_ERROR(cudaEventRecord(start2, 0));

    bw_test_transpose<<<blocks, THREADS_PER_BLOCK>>>(d_input, d_output);

    CUDA_CHECK_ERROR(cudaEventRecord(stop2, 0));
    CUDA_CHECK_ERROR(cudaEventSynchronize(stop2));
    CUDA_CHECK_ERROR(cudaEventElapsedTime(&elapsedTime2, start2, stop2));

    printf("bw_test_transpose: %1.3f ms, %1.1f MB/sec\n", elapsedTime2,
           ((float) n / elapsedTime2 * 1e3 / 1024.0 / 1024.0 * 4));

    return 0;
}[/code]


These are the results for a GTX 480 on Linux:
[code]Device name: GeForce GTX 480
BogoGFLOPS: 1345.0

Size of array: 16776960 elements
No coalesce: 5.585 ms, 11458.5 MB/sec
bw_test_straight: 0.947 ms, 67571.1 MB/sec
bw_test_transpose: 5.386 ms, 11882.8 MB/sec[/code]

And these are the results for a GTX 295 (one half of it) on Linux:
[code]Device name: GeForce GTX 295
BogoGFLOPS: 596.2

Size of array: 16776960 elements
No coalesce: 16.828 ms, 3803.1 MB/sec
bw_test_straight: 1.378 ms, 46437.5 MB/sec
bw_test_transpose: 19.622 ms, 3261.5 MB/sec[/code]

So it seems that the original "No coalesce" test shows the same performance gain that seibert saw.
The "bw_test_transpose" test also shows a large performance boost in favour of the GTX 480 (about the same as the "No coalesce" test).
However, the "bw_test_straight" test shows only a ~30-40% boost on the GTX 480. This is consistent with the boost in my real application and is discussed in the thread I mentioned above.
So for me, and maybe for all conventional bandwidth-bound kernels, this is the top boost I can expect from Fermi.

eyal

#5
Posted 06/01/2010 07:39 AM   
[quote name='eyalhir74' post='1065700' date='Jun 1 2010, 02:39 AM']So it seems that the original "No coalesce" test shows the same performance gain that seibert saw.
The "bw_test_transpose" test also shows a large performance boost in favour of the GTX 480 (about the same as the "No coalesce" test).
However, the "bw_test_straight" test shows only a ~30-40% boost on the GTX 480. This is consistent with the boost in my real application and is discussed in the thread I mentioned above.
So for me, and maybe for all conventional bandwidth-bound kernels, this is the top boost I can expect from Fermi.[/quote]

Yes, a bandwidth-bound kernel that is already reaching near-peak memory bandwidth is not going to see an enormous boost from Fermi. In fact, the compute-to-memory-bandwidth ratio in Fermi cards seems to be tipped even further toward compute compared to previous cards.

#6
Posted 06/01/2010 02:30 PM   
How about the L1 cache?

If, for example, you have a kernel where you do something like

[code]for (int i = 0; i < 100; i++)
{
    a += global_memory[i];
}[/code]

I guess that the L1 cache will result in a big speedup?

(I know that this is extremely bad code and that shared memory should be used, but that's not always possible.)

#7
Posted 06/02/2010 08:33 PM   
[quote name='wanderine' post='1066840' date='Jun 2 2010, 11:33 PM']How about the L1 cache?
If you for example have a kernel where you do something like
I guess that the L1 cache will result in a big speedup?

(I know that this is extremely bad code and that shared memory should be used, but it's not possible at all times).[/quote]
This was run with default L1 settings:

[code]__global__ void bw_test_straight_with_loop( float *input, float *output )
{
    unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.0f;
    for ( int i = 0; i < 1000; i++ )
    {
        a += input[index + i];
    }
    output[index] = a * 1.1f;
}[/code]
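
(As an aside, "default L1 settings" here means no per-kernel cache preference was set; on Fermi the larger 48 KB L1 split can be requested before the launch with something like the call below, which is a sketch and not something that was done for these numbers.)

[code]// Hypothetical tweak: prefer 48 KB L1 / 16 KB shared memory for this kernel.
CUDA_CHECK_ERROR(cudaFuncSetCacheConfig(bw_test_straight_with_loop,
                                        cudaFuncCachePreferL1));[/code]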



GTX295:
[code]Device name: GeForce GTX 295
Size of array: 16776960 elements
bw_test_straight_with_loop: 1035.089 ms, 61.8 MB/sec[/code]

GTX480:
[code]Device name: GeForce GTX 480
Size of array: 16776960 elements
bw_test_straight_with_loop: 269.733 ms, 237.3 MB/sec[/code]

Nice :)


I've tested it with #pragma unroll on the loop and these are the results... I wonder which Fermi feature is responsible for this??? :)

[code]Device name: GeForce GTX 295
Size of array: 16776960 elements
bw_test_straight_with_loop: 1037.054 ms, 61.7 MB/sec[/code]

vs

[code]Device name: GeForce GTX 480
Size of array: 16776960 elements
bw_test_straight_with_loop: 174.261 ms, 367.3 MB/sec[/code]
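
Presumably the unrolled variant just adds the directive to the loop in the kernel above, along these lines (a sketch, assuming the whole 1000-iteration loop is unrolled):

[code]__global__ void bw_test_straight_with_loop( float *input, float *output )
{
    unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.0f;
    // Same kernel as above; the only change is asking the compiler to
    // unroll the loop.
    #pragma unroll
    for ( int i = 0; i < 1000; i++ )
    {
        a += input[index + i];
    }
    output[index] = a * 1.1f;
}[/code]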

#8
Posted 06/03/2010 07:38 AM   
[quote name='eyalhir74' post='1067124' date='Jun 3 2010, 07:38 AM']I've tested it with #pragma unroll on the loop and these are the results... I wonder which Fermi feature is responsible for this??? :)[/quote]
Maybe I am wrong, but... the L1 cache uses the same hardware access path as shared memory:
[quote]Shared memory has 32 banks that are organized such that successive 32-bit words are assigned to successive banks, i.e. interleaved. Each bank has a bandwidth of 32 bits per two clock cycles. Therefore, unlike for devices of lower compute capability, there may be bank conflicts between a thread belonging to the first half of a warp and a thread belonging to the second half of the same warp.
A bank conflict only occurs if two or more threads access any bytes within different 32-bit words belonging to the same bank.[/quote]
and
[quote]If the size of the words accessed by each thread is more than 4 bytes, a memory request by a warp is first split into separate 128-byte memory requests that are issued independently:
- Two memory requests, one for each half-warp, if the size is 8 bytes,
- Four memory requests, one for each quarter-warp, if the size is 16 bytes.
Each memory request is then broken down into cache line requests that are issued independently. A cache line request is serviced at the throughput of L1 or L2 cache in case of a cache hit, or at the throughput of device memory, otherwise.[/quote]
If the compiler is smart enough, then after loop unrolling it can organize the memory/cache accesses in a much more optimal way.
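
For example, after unrolling by four the compiler is free to issue several independent loads before any of the dependent adds, roughly as if the loop had been written like this (an illustrative sketch only; the actual generated code may differ):

[code]__global__ void bw_test_loop_unrolled_by_hand( float *input, float *output )
{
    unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.0f;
    for ( int i = 0; i < 1000; i += 4 )
    {
        // Four independent loads can be in flight at once, and within a
        // thread consecutive i values fall in the same 128-byte cache
        // line, so the L1/L2 hits come back to back.
        float v0 = input[index + i    ];
        float v1 = input[index + i + 1];
        float v2 = input[index + i + 2];
        float v3 = input[index + i + 3];
        a += v0; a += v1; a += v2; a += v3;
    }
    output[index] = a * 1.1f;
}[/code]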

#9
Posted 06/05/2010 08:47 AM   