Different performance from different GPUs with Identical Code
I have a Linux box with three Tesla C1060s running CUDA 4 and the latest drivers. I've got a piece of code that runs fine on GPU0; however, when I attempt to run the same code with the same input (except for one input variable that changes between 0, 1, and 2 to indicate which GPU to use) on GPUs 1 and 2, it crashes with an unknown error. I thought the behavior should be identical. Literally the only thing that changes is the input to cudaSetDevice at the beginning of the code. Am I missing something? Has anyone else experienced this?
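
The device selection at the top of the code is essentially just this (simplified sketch; how deviceID gets its value isn't important, it's the 0/1/2 input I mentioned):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv){
    int deviceID = (argc > 1) ? atoi(argv[1]) : 0;   // 0, 1, or 2
    cudaError_t err = cudaSetDevice(deviceID);       // the only line that differs between runs
    if(err != cudaSuccess){
        printf("cudaSetDevice(%d) failed: %s\n", deviceID, cudaGetErrorString(err));
        return 1;
    }
    // ... allocations, copies, and kernel launches all happen after this point ...
    return 0;
}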

Thanks,
B

#1
Posted 04/06/2012 05:55 PM   
Is the deviceQuery program working?

#2
Posted 04/06/2012 06:11 PM   
[quote name='pasoleatis' date='06 April 2012 - 11:11 AM' timestamp='1333735910' post='1392745']
Is the deviceQuery program working?
[/quote]

Yes, deviceQuery registers all three devices GPU0, GPU1, and GPU2 as Tesla C1060. I can even run nvidia-smi when I first call the function and I can watch the memory and processor of the device being utilized until it crashes.

#3
Posted 04/06/2012 06:14 PM   
How are each of the GPUs powered? Your power supply unit (PSU) may have multiple rails where each of them may not exceed a certain wattage (for a PSU able to power three Teslas this is quite likely). So even if supplied from one PSU, one GPU might be underpowered while another isn't.

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.
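
A minimal error-checking wrapper along these lines is what I mean by checking return codes (the macro and the names in the usage comments are just illustrative):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) do { \
        cudaError_t err_ = (call); \
        if(err_ != cudaSuccess){ \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(err_)); \
            exit(1); \
        } \
    } while(0)

// usage around an allocation and a kernel launch:
// CUDA_CHECK(cudaMalloc((void**)&d_buf, bytes));
// myKernel<<<grid, block>>>(d_buf);
// CUDA_CHECK(cudaGetLastError());         // catches launch failures
// CUDA_CHECK(cudaDeviceSynchronize());    // catches errors during kernel execution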

#4
Posted 04/06/2012 09:41 PM   
[quote name='tera' date='06 April 2012 - 02:41 PM' timestamp='1333748509' post='1392810']
How are each of the GPUs powered? Your power supply unit (PSU) may have multiple rails where each of them may not exceed a certain wattage (for a PSU able to power three Teslas this is quite likely). So even if supplied from one PSU, one GPU might be underpowered while another isn't.
[/quote]

That's possible. I've left the office, so I won't be able to check until Monday.

#5
Posted 04/06/2012 10:17 PM   
So I've nailed down exactly what line of code is causing the crash; however, I have no idea why it's causing the crash on certain GPUs and not on others. To me it shouldn't cause a crash at all.

I'm performing some dense-sparse matrix multiplication using a kernel function I wrote myself. Each thread calculates one row of the result matrix. I've created a single device array that is twice as large as the matrix. Since I'm performing multiple iterations of the multiplication, I begin with the dense matrix in the first half of the array and store the result in the second half of the array. On the second iteration I multiply the result from the first multiplication, which resides in the second half of the array, by another sparse matrix stored elsewhere, and store the result of this multiplication in the first half of the array. The result just keeps bouncing back between the first and second half of the array. The host function, which issues the kernel call repeatedly, just switches between feeding in the first half as the input/second half as the output and feeding in the first half as the output/second half as the input.
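
Stripped down to just the pointer handling, the host side looks something like this (a sketch only; the kernel and variable names here are placeholders, not my actual code):

double *buf;                                              // one device array, twice the matrix size
cudaMalloc((void**)&buf, (size_t)2 * nRows * nCols * sizeof(double));
double *in  = buf;                                        // dense matrix starts in the first half
double *out = buf + (size_t)nRows * nCols;                // results go into the second half
for(int iter = 0; iter < nIters; iter++){
    multiplyKernel<<<grid, block>>>(out, in /*, sparse matrix arguments */);
    double *tmp = in; in = out; out = tmp;                // ping-pong: swap input and output halves
}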

If I comment out the line in the kernel function where the result is written to the output portion of the array the code runs without any errors (except that I'm now no longer saving my results).

Now I've run a bunch of tests and it always boils down to writing the result somewhere in the array. I've verified that nowhere in the code is an input declared to be constant, so I have no idea what is going on. Also, keep in mind that the code runs perfectly fine on GPU0 but crashes on GPU1 and GPU2. I've also run the code on another Linux box with two GeForce GTX 560s. It crashes on GPU1 unless the same line of code is commented out. I can't test it on GPU0 because that is currently running other code.

1. Switch output and input in the kernel function: the code crashes. Comment out the line of code where I'm now writing the result to the input portion of the array, and the code runs without issue.
2. Alter the host function calling the device function so that it doesn't alternate which half of the array is the input and which half is the output. Both scenarios (first half is input/second half is output, and first half is output/second half is input) cause the code to crash unless the line of code which saves the result to the array is commented out.

Has anybody experienced anything of this sort? It's completely bizarre and frustrating to me.

#6
Posted 04/07/2012 04:57 PM   
I am doing the same type of iterative process with ping-pong storage as you and have never experienced any problem. It's very difficult to say anything without seeing the code. A blind guess: are you sure you're not writing beyond the end of the array?

#7
Posted 04/09/2012 01:13 AM   
I have some experience using CUDA with multiple GPUs (I'm running 14 in total, distributed across several machines). Of the 14 cards I originally ordered (GTX 570, from EVGA), 4 tested positive for faulty memory. I used memtestG80 for this. These cards with bad memory were also yielding erroneous results in computations. I recommend checking the memory if you can.

#8
Posted 04/09/2012 06:46 PM   
[quote name='tera' date='06 April 2012 - 02:41 PM' timestamp='1333748509' post='1392810']
How are each of the GPUs powered? Your power supply unit (PSU) may have multiple rails where each of them may not exceed a certain wattage (for a PSU able to power three Teslas this is quite likely). So even if supplied from one PSU, one GPU might be underpowered while another isn't.
[/quote]

Going to test this tomorrow morning...

[quote name='alexish' date='08 April 2012 - 06:13 PM' timestamp='1333934015' post='1393510']
I am doing the same type of iterative process with ping-pong storage like you and never experienced any problem. It's very difficult to say anything without seeing the code. A blind guess: are you sure to not write beyond the and of the array ?
[/quote]

Here's the code in question. Like I said, one thread is launched per row of the matrix. The dense matrix right-multiplies the sparse matrix, which is stored in compressed sparse row format (sparse_values, sparse_rows, sparse_cols). Each thread calculates a row of the output matrix. nRows = leadDim = 55316, nCols = 600. A single 66379200-entry double array is declared, which for the purposes of this post let's call Matrix. output and dense are set to either Matrix or Matrix + 33189600, and are swapped to accomplish the ping-pong type data shuffling I described above. grid = (217,1,1), block = (256,1,1).

__global__ void sdMult(double *output, double *dense, double *sparse_values, unsigned int *sparse_rows, unsigned int *sparse_cols, unsigned int nRows, unsigned int nCols, unsigned int leadDim){
    unsigned int n = blockIdx.x*blockDim.x + threadIdx.x;
    if(n < nRows){
        long int offset = sparse_rows[n];           // offset for reading from sparse_values and sparse_cols
        int nValues = sparse_rows[n+1] - offset;    // number of non-zero entries in this row
        double temp;
        double *input = dense;
        for(int j = 0; j < nCols; j++){             // loops through each column
            temp = 0;
            for(int k = 0; k < nValues; k++){
                // temp += sparse_values[offset+k]*input[sparse_cols[offset+k]];
                temp = fma(sparse_values[offset+k], input[sparse_cols[offset+k]], temp);
            }
            output[j*leadDim + n] = temp;
            input += leadDim;
        }
    }
}
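
For completeness, the launch itself looks roughly like this, with the return-code checks tera suggested bolted on (the pointer names here mirror the kernel parameters above):

dim3 grid(217, 1, 1), block(256, 1, 1);                   // 217*256 = 55552 threads, which covers nRows = 55316
sdMult<<<grid, block>>>(output, dense, sparse_values, sparse_rows, sparse_cols, nRows, nCols, leadDim);
cudaError_t err = cudaGetLastError();                     // launch errors
if(err == cudaSuccess) err = cudaDeviceSynchronize();     // errors during execution ("unknown error" shows up here)
if(err != cudaSuccess) printf("sdMult failed: %s\n", cudaGetErrorString(err));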


[quote name='zakaryah' date='09 April 2012 - 11:46 AM' timestamp='1333997199' post='1393805']
I have some experience using CUDA with multiple GPUs (I'm running 14 in total, distributed across several machines). Of the 14 cards I originally ordered (GTX 570, from EVGA), 4 tested positive for faulty memory. I used memtestG80 for this. These cards with bad memory were also yielding erroneous results in computations. I recommend checking the memory if you can.
[/quote]

I downloaded memtestG80 and ran it on all three Tesla C1060s. I didn't run multiple iterations of the test because at 4096 MBs the tests take quite a while, but there were no errors in the tests I did run.

#9
Posted 04/10/2012 03:11 AM   
Have you run the code in question under cuda-memcheck? Results from both devices might be interesting.
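
No rebuild is needed; just prefix the normal command line (program name and argument are placeholders):

cuda-memcheck ./myprogram 1        # run on device 1 and report stray memory accesses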

#10
Posted 04/10/2012 10:39 AM   
As far as the power supply is concerned, I switched the cables powering the devices and got the same pattern of failures. Also, we specifically purchased this power supply because we blew out another one prior to this. This one should be supplying enough power to run all three simultaneously; we have done so in the past.

[quote name='tera' date='10 April 2012 - 03:39 AM' timestamp='1334054365' post='1394176']
Have you run the code in question under cuda-memcheck? Results from both devices might be interesting.
[/quote]

I ran cuda-memcheck. I get no errors when I select to run the code on device 0; however, on device 1 and device 2 I get "Invalid __global__ read of size 4" errors. The errors don't appear to follow any pattern as far as the thread and block that trigger them: on device 1 I got errors in t32b98, t64b1, t192b76, t32b77, t32b15, t96b16, and on device 2 I got errors in t224b27, t32b14, t32b70, t32b0, t128b59, t32b15. The addresses that cuda-memcheck complains are out of bounds are also curious, as they seem too small to be out of bounds. On device 1: 0x00219880, 0x00201500, 0x00214300, 0x00214480, 0x00204c80, 0x00205180, and on device 2: 0x0020a780, 0x000001c8, 0x00212880, 0x00201080, 0x0020fe00, 0x000001c8.

Also, how can two Tesla C1060s run the exact same code with the exact same input, and one thinks it's accessing memory that is out of bounds while the other runs fine without any issues!?!

Thanks to everyone who's taken a stab at this. I really appreciate your input. Hopefully someone out there can figure this out for me.

#11
Posted 04/10/2012 09:37 PM   
Have you tried using CUDA_VISIBLE_DEVICES to enable just one card at a time?
It will appear as device 0.
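
For example (the program name is a placeholder):

CUDA_VISIBLE_DEVICES=1 ./myprogram    # only the second C1060 is visible, and it appears as device 0
CUDA_VISIBLE_DEVICES=2 ./myprogram    # only the third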

#12
Posted 04/10/2012 10:13 PM   
[quote name='mfatica' date='10 April 2012 - 03:13 PM' timestamp='1334096024' post='1394509']
Have you tried to use CUDA_VISIBLE_DEVICES to enable just one card at the time?
It will appear as device 0.
[/quote]

I don't know why I hadn't thought of that. I just did it, and yes, the code runs on the other two GPUs now. So what do I have to do differently so that when those GPUs are indexed 1 and 2 the code works? I thought all I had to do was "cudaSetDevice(deviceID);" with deviceID either 0, 1, or 2. It only seems to be working when deviceID is 0 though.

Thanks,
B

#13
Posted 04/10/2012 10:22 PM   
You may have allocations done only on the first device. Are you using constant memory?
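
If anything is allocated before the cudaSetDevice call, it ends up on device 0, and dereferencing such a pointer from a kernel running on device 1 or 2 would produce exactly the kind of invalid reads cuda-memcheck is reporting. Roughly (sketch only, names are placeholders):

cudaSetDevice(deviceID);                       // select the card first
cudaMalloc((void**)&d_matrix, matrixBytes);    // this allocation now lives on that card
// every other cudaMalloc, cudaMemcpy, and kernel launch follows here,
// in the same host thread, after the cudaSetDevice above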

#14
Posted 04/10/2012 10:26 PM   
[quote name='mfatica' date='10 April 2012 - 03:26 PM' timestamp='1334096780' post='1394517']
You may have allocations done only on the first device. Are you using constant memory?
[/quote]

No, I am not using constant memory. I am only using global memory, which I thought would be allocated on the device specified in cudaSetDevice(deviceID).

#15
Posted 04/10/2012 10:27 PM   