cl-mad-enable: discussing its effects
I am working on operations of the form

a + b * c + d * e

where all the variables are doubles. This is evaluated for very large problem sizes.

To optimize it, I rewrote the expression as:

mad(d,e,mad(b,c,a));

and compiled with cl-mad-enable. I was expecting a large performance improvement, but the performance was exactly the same. Moreover, the loss of precision was far too large.

Most of the available literature is full of praise for this option, so why hasn't it shown any improvement in my results?

Other combinations I have tried, which also do not help, are:

var = mad(b,c,a);
var = mad(d,e,var);

#1
Posted 02/24/2012 12:38 AM   
cl-mad-enable should enable mad for the regular a * b + c notation. AFAIK, an explicit mad(b, c, a) should produce a mad in all cases.

From my tests, though, NVIDIA enables mad even if you don't specify cl-mad-enable, since the precision is the same, at least for float (for a 32-bit multiply-add, the intermediate storage is 32 bits). A CPU uses a higher intermediate precision and would therefore disable mad by default to make sure that you get consistent results.

#2
Posted 02/26/2012 11:50 AM   
I think there may be a dependency on the result latency. Which GPU do you use for your tests?
I would try inserting at least one other computation between the two MADs, or even interleaving two series of computations (the way Intel ispc does for SSE or AVX with its -x2 versions), to avoid register or result dependencies.

Parallelis.com, Parallel-computing technologies and benchmarks. Current Projects: OpenCL Chess & OpenCL Benchmark

#3
Posted 03/09/2012 03:48 PM   