Hi guys, I have run my code in the Visual Profiler and I don't know what changes I can make to make it run faster. Below is the relevant profiler output, followed by my kernel code. The kernel is launched 11 times from the CPU with size 2048, block size (32,32,1) and grid (64,64):
Limiting Factor Identification:
Analysis for kernel apsp_shared_kernel on device NVS 4200M
Summary profiling information for the kernel:
Number of calls: 1
GPU time(us): 1757651.12
GPU time (%): 99.72
Limiting Factor
Achieved Occupancy: 0.05 ( Theoretical Occupancy: 0.67 )
Achieved global memory throughput: 1.23 ( Peak global memory throughput(GB/s): 12.80 )
Hint(s)
The kernel occupancy is low. For details, click on Occupancy Analysis.
Occupancy Analysis:
Occupancy Analysis for kernel apsp_shared_kernel on device NVS 4200M
Kernel details: Grid size: [64 64 1], Block size: [32 32 1]
Register Ratio: 0.5625 ( 18432 / 32768 ) [17 registers per thread]
Shared Memory Ratio: 0.166667 ( 8192 / 49152 ) [8192 bytes per Block]
Active Blocks per SM: 1 (Maximum Active Blocks per SM: 8)
Active threads per SM: 1024 (Maximum Active threads per SM: 1536)
Potential Occupancy: 0.666667 ( 32 / 48 )
Occupancy limiting factor: Block-Size
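To sanity-check the occupancy figures above, here is a rough host-side sketch of the arithmetic. It uses only the per-SM limits printed by the profiler. The names `SmLimits`, `activeBlocksPerSm` and `theoreticalOccupancy` are made up for illustration, and the model is simplified: it ignores register allocation granularity, so treat it as an approximation rather than what the profiler actually computes.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits, copied from the profiler output above (hypothetical struct).
struct SmLimits {
    int maxThreads;   // 1536 on this device
    int maxBlocks;    // 8
    int registers;    // 32768
    int sharedBytes;  // 49152
};

constexpr int kWarpSize = 32;

// Number of blocks that fit on one SM at the same time: the tightest of the
// four limits. Simplification: register allocation granularity is ignored.
int activeBlocksPerSm(const SmLimits& sm, int threadsPerBlock,
                      int regsPerThread, int sharedPerBlock) {
    assert(threadsPerBlock > 0 && regsPerThread > 0 && sharedPerBlock > 0);
    int byThreads = sm.maxThreads / threadsPerBlock;
    int byRegs    = sm.registers / (regsPerThread * threadsPerBlock);
    int byShared  = sm.sharedBytes / sharedPerBlock;
    return std::min({sm.maxBlocks, byThreads, byRegs, byShared});
}

// Active warps divided by the maximum number of warps per SM.
// Assumes threadsPerBlock is a multiple of the warp size.
double theoreticalOccupancy(const SmLimits& sm, int threadsPerBlock,
                            int regsPerThread, int sharedPerBlock) {
    int blocks      = activeBlocksPerSm(sm, threadsPerBlock,
                                        regsPerThread, sharedPerBlock);
    int activeWarps = blocks * (threadsPerBlock / kWarpSize);
    int maxWarps    = sm.maxThreads / kWarpSize;
    return static_cast<double>(activeWarps) / maxWarps;
}
```

With the values from the report (1024 threads, 17 registers per thread, 8192 bytes of shared memory per block) this gives 1 active block and 32 / 48 = 0.667, the same as the profiler. For a hypothetical 16x16 block (256 threads, 2048 bytes of shared memory, and assuming the kernel still needs 17 registers per thread) the same model gives 6 active blocks and a theoretical occupancy of 1.0. This is only the theoretical value; the achieved occupancy can still be much lower.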
Instruction Throughput Analysis:
Instruction Throughput Analysis for kernel apsp_shared_kernel on device NVS 4200M
Maximum IPC: 4
Divergent branches(%): 0.00
Control flow divergence(%): 94.15
Memory Throughput:
Memory Throughput Analysis for kernel apsp_shared_kernel on device NVS 4200M
Kernel requested global memory read throughput(GB/s): 0.00
Kernel requested global memory write throughput(GB/s): 0.00
Kernel requested global memory throughput(GB/s): 0.00
L2 cache texture memory read throughput(GB/s): 0.00
L2 cache global memory read throughput(GB/s): 1.22
L2 cache global memory write throughput(GB/s): 0.01
L2 cache global memory throughput(GB/s): 1.23
Global memory excess load(%): 100.00
Global memory excess store(%): 100.00
Achieved global memory read throughput(GB/s): 1.22
Achieved global memory write throughput(GB/s): 0.01
Achieved global memory throughput(GB/s): 1.23
Peak global memory throughput(GB/s): 12.80
The kernel code is below:
/*
The Algorithm solves the APSP problem in the same way as Matrix Multiplication but
instead of using (+,*) we have to use (min, +)
*/
__global__ void apsp_shared_kernel(int* A, int size)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int bx = blockIdx.x;
    int by = blockIdx.y;

    __shared__ int As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ int Bs[BLOCK_SIZE][BLOCK_SIZE];

    // number of tiles along one dimension (size must be a multiple of BLOCK_SIZE)
    int iterations = size / BLOCK_SIZE;

    // row and column of the element this thread computes
    int i = by * BLOCK_SIZE + ty;
    int j = bx * BLOCK_SIZE + tx;

    int value = INFINITY;

    #pragma unroll 1
    for (int n = 0; n < iterations; ++n)
    {
        // every thread loads one element into As and Bs - no bank conflicts
        As[ty][tx] = A[by * size * BLOCK_SIZE + n * BLOCK_SIZE + tx + ty * size];
        Bs[ty][tx] = A[bx * BLOCK_SIZE + n * size * BLOCK_SIZE + tx + ty * size];

        // stalling - threads do nothing here until both tiles are loaded
        __syncthreads();

        // nothing is done between the barrier and the inner loop
        #pragma unroll 1
        for (int k = 0; k < BLOCK_SIZE; ++k)
        {
            // no bank conflicts
            if (value > (As[ty][k] + Bs[k][tx]))
            {
                value = As[ty][k] + Bs[k][tx];
            }
        }

        // make sure every thread is done with the tiles before they are overwritten
        __syncthreads();
    }

    // save the result back to global memory
    A[i * size + j] = value;
}
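For checking the kernel's output, a plain CPU version of one (min, +) step could look roughly like the sketch below. The names `minPlusSquare` and `kNoPath` are hypothetical and not part of my actual code. `kNoPath` is an assumed "no edge" marker, chosen so that adding two of them does not overflow an `int`. Unlike the kernel, the sketch writes into a separate output buffer, so every read sees the matrix as it was before the step.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical "no path" marker: kNoPath + kNoPath still fits in an int.
constexpr int kNoPath = INT_MAX / 2;

// One (min, +) "squaring" step on a row-major size x size distance matrix:
// out[i][j] = min over k of (a[i][k] + a[k][j]).
std::vector<int> minPlusSquare(const std::vector<int>& a, int size) {
    assert(size > 0);
    assert(a.size() == static_cast<std::size_t>(size) * size);

    std::vector<int> out(a.size(), kNoPath);
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            int best = kNoPath;
            for (int k = 0; k < size; ++k) {
                // same comparison as the inner loop of the kernel
                best = std::min(best, a[i * size + k] + a[k * size + j]);
            }
            out[i * size + j] = best;
        }
    }
    return out;
}
```

Calling this repeatedly (out becomes the next input) is the CPU equivalent of launching the kernel several times; 2^11 = 2048, which is where the 11 launches for size 2048 come from. It is O(size^3) per step, so it is only practical for checking small matrices.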
Any ideas on how I can improve the occupancy and the other limiting factors that you can read off the data?
Thanks a lot!