Hey,
I’m currently trying to figure out why my kernel runs slowly. It processes a stereo pair of 1280x960 images and calculates the 3D cost volume: every pixel in the left image is compared against DISPARITY_LEVELS (128 in this test case) candidate pixels from the right image. To get good performance, the right-image pixels are cached in shared memory, so global memory should not be the bottleneck. The kernel is based on the ideas of a paper that reported ~2ms execution time for these settings — with my kernel I only reach 8ms…
They used a Tesla C2050 for this; I’m working on a GTX 460, which should be roughly comparable, right?
The Visual Profiler reports only ~3GB/s of global memory load bandwidth used — but I don’t understand why, because to my understanding the load operations should all be coalesced and therefore quite fast.
Here is the kernel (I know it’s hacky, but it’s currently only for testing).
It’s currently launched with a block size of [32, 1, 4]:
__global__ void costEstimation(unsigned char* imageLeft, unsigned char* imageRight, unsigned char* costs) {
    const uint3 id = make_uint3(blockIdx.x * blockDim.x + threadIdx.x,
                                blockIdx.y * blockDim.y + threadIdx.y,
                                blockIdx.z * blockDim.z + threadIdx.z);
    const uint3 localId = threadIdx;
    const uint3 localSize = make_uint3(blockDim.x, blockDim.y, blockDim.z);
    const int2 size = make_int2(1280, 960);

    // Cache one row segment of the right image: the blockDim.x pixels covered
    // by this block plus DISPARITY_LEVELS pixels to the left of them.
    __shared__ uchar4 rightPixelCache[DISPARITY_LEVELS + 32];

    const int cacheSize = DISPARITY_LEVELS + localSize.x;
    if (localId.z == 0) {
        const int cacheBasePosX = id.x - localId.x - DISPARITY_LEVELS;
        int cacheId = localId.x;
        // Assumes cacheSize is a multiple of localSize.x (160 / 32 = 5 here).
        const int fetchesPerItem = cacheSize / localSize.x;
        for (int i = 0; i < fetchesPerItem; i++) {
            const int2 rightPos = make_int2(cacheBasePosX + cacheId, id.y);
            if (cacheId < cacheSize) {
                rightPixelCache[cacheId] = readPixel(imageRight, rightPos, size);
                cacheId += localSize.x;
            }
        }
    }
    __syncthreads();

    const int2 leftPos = make_int2(id.x, id.y);
    const uchar4 leftPixel = readPixel(imageLeft, leftPos, size);

    // Each thread handles DISPARITY_LEVELS / blockDim.z disparity levels.
    const int disparitiesPerItem = DISPARITY_LEVELS / localSize.z;
    const int cacheBaseIndex = localId.x + DISPARITY_LEVELS;
    for (int i = 0; i < disparitiesPerItem; i++) {
        const int disparity = i * localSize.z + localId.z;
        //const uchar4 rightPixel = readPixel(imageRight, make_int2(id.x - disparity, id.y), size);
        const uchar4 rightPixel = rightPixelCache[cacheBaseIndex - disparity];
        const unsigned char cost = abs(leftPixel.y - rightPixel.y);
        costs[(id.y * size.x + id.x) + disparity * size.x * size.y] = cost;
    }
}
Any help would be appreciated.
Thanks
[edit]
On a GTX 570 I get 5.7ms.
Load efficiency is ~83%.