I'm working on a CUDA version of a lossless image encoder. Overall accuracy is not as important as you'd think, but what is important is that the same output is returned whether a function is run on the CPU or the GPU. I managed to achieve this by doing all calculations in double precision, plus a few limitations on some functions (pow and tan, for example, caused problems when the output approached infinity, but since very large numbers aren't helpful to our algorithm, we capped them at 10) and some funky rounding magic. While this works, the algorithm is evolutionary (it runs until you stop it, searching for the best result), so speed is key. I'm running on two different systems, one with GTX 480s and one with Tesla C2050s. The Tesla cards are about 15% faster when using DP, but the 480s with SP are about twice as fast as that. Unfortunately, if I use the float versions on both, there are some small differences in value (the final value is cast to an int, so the value just has to stay close). The programming guide says the maximum error is around 2-4 ULP for the SP trig functions (8 for powf), so am I essentially stuck with using DP?


I think you are trying to do the impossible. Strictly speaking, you could never even recompile your code, as the compiler might rearrange the floating-point operations and change the results.
If you really need bit-for-bit identical results, convert your code to fixed-point arithmetic (i.e., scaled integers).

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.


[quote name='gthazmatt' post='1126922' date='Oct 6 2010, 05:44 AM']I'm working on a CUDA version of a lossless image encoder. [...] The programming guide says the maximum error is around 2-4 ULP for the SP trig functions (8 for powf), so am I essentially stuck with using DP?[/quote]
Hi,
I've also seen differences in floating-point results between the GTX 280 and the C1060 - either a bug on my part or some precision issue. So using doubles should indeed solve it.
It's a bit odd that the C2050 with DP is only 15% faster, though. There has been a lot of discussion here about how badly NVIDIA crippled DP in the GTX line, and the C2050 should be much faster than a Fermi GTX at DP. Maybe there is other overhead in your kernel that prevents you from reaching the full DP throughput on the C2050?

my one cent :)

eyal


Depending on your CPU compiler, bit-for-bit accuracy may be achievable, even on floating-point numbers.

* For basic operations:
On the GPU side, you need to use __fadd_rn and __fmul_rn to prevent the compiler from emitting (more accurate) fused multiply-adds.
On the CPU side, you need to make sure that the compiler uses the SSE instruction set exclusively and avoids any unsafe math optimization.

* For transcendentals, you have two options:
1. Use the exact same implementation on both the CPU and the GPU. Performance will be suboptimal on at least one platform. But results should be the same, as long as the implementation only uses basic arithmetic operations.
2. Enforce precise rounding rules that define the result of transcendental functions unambiguously. Then make sure that both the CPU and the GPU implementations follow these rules. The most reasonable set of rules is "return the FP number closest to the exact result", or correct rounding.
Unfortunately, implementations of correctly-rounded transcendentals are quite involved, even on CPUs. So I suggest option 1 instead.

Using fixed point as tera suggests is a good solution for the basic operations, but you will still need to roll your own transcendentals...

Barra, a CUDA-capable GPU simulator


[quote name='Sylvain Collange' post='1127132' date='Oct 6 2010, 09:02 AM']Depending on your CPU compiler, bit-for-bit accuracy may be achievable, even on floating-point numbers.[/quote]

I think tera was envisioning a scenario where, for the expression d = a + b + c, the CPU compiler decides to compute d = (a + b) + c while the GPU compiler decides on d = a + (b + c), which can give different results even for identical a, b, and c. I guess the takeaway is that to enforce bit accuracy you would need to specify the exact order of operations for every arithmetic expression.



