Visual Profiler Output
Hi guys, I have run my code through the Visual Profiler and I don't know what changes I can make to make it run faster. Here is the relevant output from the profiler, and below that is my kernel code, which is launched 11 times from the CPU with size 2048, block size (32,32,1), and grid size (64,64):


Limiting Factor Identification:

[code]
Analysis for kernel apsp_shared_kernel on device NVS 4200M


Summary profiling information for the kernel:
Number of calls: 1
GPU time(us): 1757651.12
GPU time (%): 99.72

Limiting Factor
Achieved Occupancy: 0.05 ( Theoretical Occupancy: 0.67 )
Achieved global memory throughput: 1.23 ( Peak global memory throughput(GB/s): 12.80 )


Hint(s)
The kernel occupancy is low. For details, click on Occupancy Analysis.
[/code]

Occupancy Analysis:

[code]
Occupancy Analysis for kernel apsp_shared_kernel on device NVS 4200M

Kernel details: Grid size: [64 64 1], Block size: [32 32 1]
Register Ratio: 0.5625 ( 18432 / 32768 ) [17 registers per thread]
Shared Memory Ratio: 0.166667 ( 8192 / 49152 ) [8192 bytes per Block]
Active Blocks per SM: 1 (Maximum Active Blocks per SM: 8)
Active threads per SM: 1024 (Maximum Active threads per SM: 1536)
Potential Occupancy: 0.666667 ( 32 / 48 )
Occupancy limiting factor: Block-Size
[/code]


Instruction Throughput Analysis:

[code]
Instruction Throughput Analysis for kernel apsp_shared_kernel on device NVS 4200M

Maximum IPC: 4
Divergent branches(%): 0.00
Control flow divergence(%): 94.15

[/code]

Memory Throughput:

[code]

Memory Throughput Analysis for kernel apsp_shared_kernel on device NVS 4200M

Kernel requested global memory read throughput(GB/s): 0.00
Kernel requested global memory write throughput(GB/s): 0.00
Kernel requested global memory throughput(GB/s): 0.00
L2 cache texture memory read throughput(GB/s): 0.00
L2 cache global memory read throughput(GB/s): 1.22
L2 cache global memory write throughput(GB/s): 0.01
L2 cache global memory throughput(GB/s): 1.23
Global memory excess load(%): 100.00
Global memory excess store(%): 100.00
Achieved global memory read throughput(GB/s): 1.22
Achieved global memory write throughput(GB/s): 0.01
Achieved global memory throughput(GB/s): 1.23
Peak global memory throughput(GB/s): 12.80
[/code]



The kernel code is below:

[code]


#include <limits.h>

/*
The algorithm solves the APSP problem in the same way as tiled matrix
multiplication, but over the (min, +) semiring instead of (+, *).
*/

__global__ void apsp_shared_kernel(int* A, int size)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int bx = blockIdx.x;
    int by = blockIdx.y;

    __shared__ int As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ int Bs[BLOCK_SIZE][BLOCK_SIZE];

    int iterations = size / BLOCK_SIZE;

    int i = by * BLOCK_SIZE + ty;
    int j = bx * BLOCK_SIZE + tx;

    // Large int sentinel instead of the float macro INFINITY (assigning
    // INFINITY to an int is a bug); INT_MAX / 2 leaves headroom so that
    // sentinel + weight cannot overflow a signed int.
    int value = INT_MAX / 2;

    #pragma unroll 1
    for (int n = 0; n < iterations; ++n)
    {
        // Every thread loads one element into As and Bs - no bank conflicts
        As[ty][tx] = A[by * size * BLOCK_SIZE + n * BLOCK_SIZE + ty * size + tx];
        Bs[ty][tx] = A[n * size * BLOCK_SIZE + bx * BLOCK_SIZE + ty * size + tx];

        // Wait until both tiles are fully loaded
        __syncthreads();

        #pragma unroll 1
        for (int k = 0; k < BLOCK_SIZE; ++k)
        {
            // (min, +) inner product over the tile - no bank conflicts
            if (value > As[ty][k] + Bs[k][tx])
            {
                value = As[ty][k] + Bs[k][tx];
            }
        }

        // Wait before the tiles are overwritten in the next iteration
        __syncthreads();
    }

    // Save the result back to global memory
    A[i * size + j] = value;
}
[/code]
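For reference, here is a minimal CPU sketch of the same (min, +) step the kernel computes, one tile at a time. The names `min_plus_square` and `INF` are my own, not from the original code. Repeating this step ⌈log2 n⌉ times turns an adjacency matrix into all-pairs shortest-path distances, which is presumably why the kernel is launched 11 times for size 2048 (log2(2048) = 11):

```c
#include <limits.h>

/* Large "no path" sentinel: INT_MAX / 2 leaves headroom so that
   sentinel + weight cannot overflow a signed int. */
#define INF (INT_MAX / 2)

/* One (min, +) "squaring" step: out[i][j] = min_k (a[i][k] + a[k][j]),
   where a is an n x n matrix stored row-major. */
static void min_plus_square(const int *a, int *out, int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            int best = a[i * n + j];
            for (int k = 0; k < n; ++k) {
                int cand = a[i * n + k] + a[k * n + j];
                if (cand < best)
                    best = cand;
            }
            out[i * n + j] = best;
        }
}
```

A CPU reference like this is also handy for verifying the GPU result on small inputs before profiling.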


Any ideas how I can improve occupancy and address the other limiting factors you can read off this data?

Thanks a lot!

#1
Posted 04/19/2012 04:07 PM   
Hi

You should start with:
"Hint(s)
The kernel occupancy is low. For details, click on Occupancy Analysis."

From there you should go to the Occupancy Analysis and see that:
"Occupancy limiting factor: Block-Size".

This means that your block size is the problem: it is too big.
Try running with smaller blocks (fewer threads per block), such as block size (16,16,1). This will of course require more blocks (grid size (128,128) in this case).
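To see why the smaller block helps, here is a back-of-envelope occupancy estimate in plain C using the SM limits reported in the profiler output above (1536 threads/SM, 8 blocks/SM, 32768 registers/SM, 49152 bytes of shared memory/SM). It ignores register and shared-memory allocation granularity and assumes the 17-registers-per-thread figure carries over to the smaller block, so treat the numbers as estimates only:

```c
/* SM limits taken from the profiler output above (NVS 4200M). */
enum { MAX_THREADS = 1536, MAX_BLOCKS = 8,
       MAX_REGS = 32768, MAX_SMEM = 49152 };

static int min_int(int a, int b) { return a < b ? a : b; }

/* Estimated occupancy = active threads / max threads per SM, where the
   active block count is the tightest of the four per-SM limits. */
static double est_occupancy(int block_threads, int regs_per_thread,
                            int smem_per_block)
{
    int by_threads = MAX_THREADS / block_threads;
    int by_regs    = MAX_REGS / (regs_per_thread * block_threads);
    int by_smem    = smem_per_block > 0 ? MAX_SMEM / smem_per_block : MAX_BLOCKS;
    int blocks     = min_int(min_int(by_threads, by_regs),
                             min_int(by_smem, MAX_BLOCKS));
    return (double)(blocks * block_threads) / MAX_THREADS;
}
```

With these inputs, a (32,32,1) block (1024 threads, 8192 bytes of shared memory) fits only 1 block per SM, giving ~0.67 - matching the profiler's potential occupancy - while a (16,16,1) block (256 threads, 2048 bytes) fits 6 blocks per SM, giving 1.0.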
#2
Posted 05/02/2012 01:45 PM   