I want to understand shared memory better and how to use it for my application. I have read through some of the documentation and it's still not clear to me how to put the variables I need into shared memory and access them for faster performance.

I have written a basic generic kernel based on my application that doesn't use shared memory, and one that uses shared memory as I understand it.

non shared memory
[code]
#define twoPi 6.283185307179586

__global__ void delay(const float *array1,
                      const double *array2,
                      const float scalar1,
                      const float scalar2,
                      const float scalar3,
                      const float scalar4,
                      const float offset,
                      const int integer1,
                      float *RealOutput,
                      float *ImagOutput)
{
    int idx = threadIdx.x;

    float fractional;
    double step;
    double delay;
    double time;
    float real, imag;
    float sumReal = 0.0f, sumImag = 0.0f;
    double st;

    step = (double)scalar1 + (double)scalar2 * ((double)idx / (double)integer1);
    time = (double)scalar3 + (double)scalar4 * (double)idx + (double)offset;

    for (int i = 0; i < 999; i++)
    {
        delay = step * (1.0 - array2[i]);
        fractional = (float)(twoPi * (delay - trunc(delay)));

        /* calculate the exponent */
        if (fractional < 0.0)
        {
            __sincosf(-fractional, &imag, &real);
            imag = -imag;
        } else {
            __sincosf(fractional, &imag, &real);
        }

        sumReal = sumReal + array1[i] * real;
        sumImag = sumImag + array1[i] * imag;
    }

    st = step * time;
    fractional = -(float)(twoPi * (st - trunc(st)));

    /* calculate the exponent */
    if (fractional < 0.0)
    {
        __sincosf(-fractional, &imag, &real);
        imag = -imag;
    } else {
        __sincosf(fractional, &imag, &real);
    }

    RealOutput[idx] = sumReal * real - sumImag * imag;
    ImagOutput[idx] = sumReal * imag + sumImag * real;
}
[/code]

shared memory
[code]
#define twoPi 6.283185307179586

__global__ void delay(const float *array1,
                      const double *array2,
                      const float scalar1,
                      const float scalar2,
                      const float scalar3,
                      const float scalar4,
                      const float offset,
                      const int integer1,
                      float *RealOutput,
                      float *ImagOutput)
{
    int idx = threadIdx.x;

    __shared__ float SharedArr1[999];
    __shared__ double SharedArr2[999];
    __shared__ float Shared1, Shared2, Shared3, Shared4, SharedOffset;
    __shared__ int SharedInt;

    /* populate the shared memory */
    if (idx == 0)
    {
        for (int i = 0; i < 999; i++)
        {
            SharedArr1[i] = array1[i];
            SharedArr2[i] = array2[i];
        }
        Shared1 = scalar1;
        Shared2 = scalar2;
        Shared3 = scalar3;
        Shared4 = scalar4;
        SharedOffset = offset;
        SharedInt = integer1;
    }
    __syncthreads();

    float fractional;
    double step;
    double delay;
    double time;
    float real, imag;
    float sumReal = 0.0f, sumImag = 0.0f;
    double st;

    step = (double)Shared1 + (double)Shared2 * ((double)idx / (double)SharedInt);
    time = (double)Shared3 + (double)Shared4 * (double)idx + (double)SharedOffset;

    for (int i = 0; i < 999; i++)
    {
        delay = step * (1.0 - array2[i]);
        fractional = (float)(twoPi * (delay - trunc(delay)));

        /* calculate the exponent */
        if (fractional < 0.0)
        {
            __sincosf(-fractional, &imag, &real);
            imag = -imag;
        } else {
            __sincosf(fractional, &imag, &real);
        }

        sumReal = sumReal + array1[i] * real;
        sumImag = sumImag + array1[i] * imag;
    }

    st = step * time;
    fractional = -(float)(twoPi * (st - trunc(st)));

    /* calculate the exponent */
    if (fractional < 0.0)
    {
        __sincosf(-fractional, &imag, &real);
        imag = -imag;
    } else {
        __sincosf(fractional, &imag, &real);
    }

    RealOutput[idx] = sumReal * real - sumImag * imag;
    ImagOutput[idx] = sumReal * imag + sumImag * real;
}
[/code]

The kernel call will be:
delay<<<1,512>>>(array1,..... realOut, imagOut);

I have tried this implementation out in my application and it doesn't work; I know it's got something to do with my understanding. My question is: do I have the basic syntax right for distributing the variables into shared memory and accessing them appropriately? I know there is a limit to the amount of shared memory; I think this is 49152 bytes for my Tesla C2050 card. This should be enough for what I'm using here, 12010 bytes. Also, since each thread accesses every element in both arrays, is there a better way to do this?


Physics is like sex: sure, it has practical applications, but that's not why we do it.
- R. Feynman.

1) The way you load the data into shared memory is very inefficient. Take a look at slides 48-57 from http://corsi.cineca.it/courses/scuolaAvanzata/Massimiliano%20Fatica/03-CUDA_C_Basics.pptx.pdf for a more efficient way (basically each thread should load a different element; in your case, with a block size of 512 and 999 elements, each thread should read up to two elements).

2) There is no need to load the scalars into shared memory.

3) You are loading the data into SharedArr1 and SharedArr2, but afterwards you are still using the original array1 and array2.

These 3 points should only affect performance; the __syncthreads() is in the right place.


Thanks mfatica,
1) The slides helped me realise how to populate the arrays using the thread idx.
2) Can you explain why you don't need to transfer the scalars to shared memory? Is it because the kernel places them there at launch?
3) I corrected the use of the shared array.

I did get a minor improvement in the performance, but I was hoping for more. Can you suggest any other options, with this algorithm in mind, to increase performance? I should note that this kernel gets executed 512 times via a for loop outside of the kernel.

This is what it looks like now.
[code]
#define twoPi 6.283185307179586

__global__ void delay(const float *array1,
                      const double *array2,
                      const float scalar1,
                      const float scalar2,
                      const float scalar3,
                      const float scalar4,
                      const float offset,
                      const int integer1,
                      float *RealOutput,
                      float *ImagOutput)
{
    int idx = threadIdx.x;

    __shared__ float SharedArr1[999];
    __shared__ double SharedArr2[999];

    /* populate the shared memory: first 512 elements */
    SharedArr1[idx] = array1[idx];
    SharedArr2[idx] = array2[idx];

    /* load the remaining elements into the shared arrays */
    if (idx + blockDim.x < 999) {
        SharedArr1[idx + blockDim.x] = array1[idx + blockDim.x];
        SharedArr2[idx + blockDim.x] = array2[idx + blockDim.x];
    }
    __syncthreads();

    float fractional;
    double step;
    double delay;
    double time;
    float real, imag;
    float sumReal = 0.0f, sumImag = 0.0f;
    double st;

    step = (double)scalar1 + (double)scalar2 * ((double)idx / (double)integer1);
    time = (double)scalar3 + (double)scalar4 * (double)idx + (double)offset;

    for (int i = 0; i < 999; i++)
    {
        delay = step * (1.0 - SharedArr2[i]);
        fractional = (float)(twoPi * (delay - trunc(delay)));

        /* calculate the exponent */
        if (fractional < 0.0)
        {
            __sincosf(-fractional, &imag, &real);
            imag = -imag;
        } else {
            __sincosf(fractional, &imag, &real);
        }

        sumReal = sumReal + SharedArr1[i] * real;
        sumImag = sumImag + SharedArr1[i] * imag;
    }

    st = step * time;
    fractional = -(float)(twoPi * (st - trunc(st)));

    /* calculate the exponent */
    if (fractional < 0.0)
    {
        __sincosf(-fractional, &imag, &real);
        imag = -imag;
    } else {
        __sincosf(fractional, &imag, &real);
    }

    RealOutput[idx] = sumReal * real - sumImag * imag;
    ImagOutput[idx] = sumReal * imag + sumImag * real;
}
[/code]


[quote name='theotheraussie' date='24 April 2012 - 06:31 AM' timestamp='1335245495' post='1400177']
I did get a minor improvement in the performance, but I was hoping for more.
[/quote]
Without having looked at your code: global memory is cached on the C2050, so the advantage of shared memory is much smaller than on a compute capability 1.x device.


Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.



