Division problem (weird behavior)

Mange · November 1, 2010, 2:28pm

I have a weird behaviour in my kernel that I can’t understand and hope someone here can put some light on it for me =)

The basic problem is that a float number divided by the same float number does not give 1 as an answer. I’ll show the code part with some debugging code I’ve added.

[codebox]else if (codop == OP_DIV) {

		op2 = stack[--stackPointer];

		op1 = stack[--stackPointer];

		if (op2 == 0.0) stack[stackPointer++] = 1.0;

		else stack[stackPointer++] = op1/op2;//__fdividef(op1, op2);

                    //my debugging

		if (debugProgram >= 0 && (*bar2).date == 20050510 && (*bar2).time >= 145500 && (*bar2).time < 145600) {

			cuPrintf("Debugging inside : op1 %.8f , op2 , %.8f , kvot %.8f \n", op1, op2, stack[stackPointer-1]);

		}[/codebox]

On this particular date and time 20050510 145500 op1 and op2 is the same, namely 4626.52783203. So I get the output line.

Debugging inside : op1 4626.52783203 , op2 , 4626.52783203 , kvot 0.99999994 . Before I used __fdividef(op1, op2) that is why it is commented out, I just wanted to try if I got another result if I just used / - operator but I get the same result whichever I use. In another part of the kernel I just hardcoded the following to test if I got the same weird result there:

[codebox] float aa = 4626.52783203f;

		float bb = 4626.52783203f;

		cuPrintf("divisionenkernel %.8f \n", aa/bb);[/codebox]

But there I get the output “divisionenkernel 1.00000000”.

Another weird thing, which I don’t know if it does matter, is that I read in this value at an earlier stage from a file and it only has 8 decimals in the file but if I print the value with

“cuPrintf(“Debugging inside : op1 %.25f , op2 , %.25f , kvot %.25f \n”, op1, op2, stack[stackPointer-1]);” I get the output

“Debugging inside : op1 4626.5278320312500000000000000 , op2 , 4626.5278320312500000000000000 , kvot 0.9999999403953552246093750”. I have no idea where the 125 in the end comes from.

If I use the long value at the other debugging place I still get 1 as a result,

[codebox] float aa = 4626.5278320312500000000000000f;

		float bb = 4626.5278320312500000000000000f;

		cuPrintf("divisionenkernel %.8f \n", aa/bb);[/codebox] still gives "divisionenkernel 1.00000000".

So in my opinion I am doing the exact same calculation at two different places in the kernel but I get two different results?

I’m totally confused here and would really appreciate any input. First I thought it was because of some floating precision issue but it seems weird that the same calculation gives different results.

Mange · November 1, 2010, 2:28pm

I have a weird behaviour in my kernel that I can’t understand and hope someone here can put some light on it for me =)

The basic problem is that a float number divided by the same float number does not give 1 as an answer. I’ll show the code part with some debugging code I’ve added.

[codebox]else if (codop == OP_DIV) {

		op2 = stack[--stackPointer];

		op1 = stack[--stackPointer];

		if (op2 == 0.0) stack[stackPointer++] = 1.0;

		else stack[stackPointer++] = op1/op2;//__fdividef(op1, op2);

                    //my debugging

		if (debugProgram >= 0 && (*bar2).date == 20050510 && (*bar2).time >= 145500 && (*bar2).time < 145600) {

			cuPrintf("Debugging inside : op1 %.8f , op2 , %.8f , kvot %.8f \n", op1, op2, stack[stackPointer-1]);

		}[/codebox]

On this particular date and time 20050510 145500 op1 and op2 is the same, namely 4626.52783203. So I get the output line.

Debugging inside : op1 4626.52783203 , op2 , 4626.52783203 , kvot 0.99999994 . Before I used __fdividef(op1, op2) that is why it is commented out, I just wanted to try if I got another result if I just used / - operator but I get the same result whichever I use. In another part of the kernel I just hardcoded the following to test if I got the same weird result there:

[codebox] float aa = 4626.52783203f;

		float bb = 4626.52783203f;

		cuPrintf("divisionenkernel %.8f \n", aa/bb);[/codebox]

But there I get the output “divisionenkernel 1.00000000”.

Another weird thing, which I don’t know if it does matter, is that I read in this value at an earlier stage from a file and it only has 8 decimals in the file but if I print the value with

“cuPrintf(“Debugging inside : op1 %.25f , op2 , %.25f , kvot %.25f \n”, op1, op2, stack[stackPointer-1]);” I get the output

“Debugging inside : op1 4626.5278320312500000000000000 , op2 , 4626.5278320312500000000000000 , kvot 0.9999999403953552246093750”. I have no idea where the 125 in the end comes from.

If I use the long value at the other debugging place I still get 1 as a result,

[codebox] float aa = 4626.5278320312500000000000000f;

		float bb = 4626.5278320312500000000000000f;

		cuPrintf("divisionenkernel %.8f \n", aa/bb);[/codebox] still gives "divisionenkernel 1.00000000".

So in my opinion I am doing the exact same calculation at two different places in the kernel but I get two different results?

I’m totally confused here and would really appreciate any input. First I thought it was because of some floating precision issue but it seems weird that the same calculation gives different results.

seibert · November 1, 2010, 4:17pm

I think it is a floating point precision issue still. One thing to keep in mind is that floating numbers are stored in a base-2 fractional representation. When you write “4626.52783203f” in a source file, you don’t necessarily get that number. Instead you get the nearest floating point number. (The most common example of this is 1/10 = 0.1 cannot be represented as an exact base-2 floating point number with any number of bits. It is the binary equivalent of 1/3 = 0.3333… in decimal. Rounding when printing the result in base 10 can hide the problem from you, though.)

If op1 and op2 in your first case are computed from different operation sequences (even if mathematically they should be equivalent for real numbers), they might have rounded to adjacent floating point numbers. Depending on how rounding is handled in the printf implementation, they might even look identical when printed in base 10. Try printing the raw bit pattern with __float_as_int() and %x in your printf.

In your second case, you are creating identical numbers by using floating point literals in the source code. As long as you type the same things on both lines, you are guaranteed to get the same numbers.

seibert · November 1, 2010, 4:17pm

I think it is a floating point precision issue still. One thing to keep in mind is that floating numbers are stored in a base-2 fractional representation. When you write “4626.52783203f” in a source file, you don’t necessarily get that number. Instead you get the nearest floating point number. (The most common example of this is 1/10 = 0.1 cannot be represented as an exact base-2 floating point number with any number of bits. It is the binary equivalent of 1/3 = 0.3333… in decimal. Rounding when printing the result in base 10 can hide the problem from you, though.)

If op1 and op2 in your first case are computed from different operation sequences (even if mathematically they should be equivalent for real numbers), they might have rounded to adjacent floating point numbers. Depending on how rounding is handled in the printf implementation, they might even look identical when printed in base 10. Try printing the raw bit pattern with __float_as_int() and %x in your printf.

In your second case, you are creating identical numbers by using floating point literals in the source code. As long as you type the same things on both lines, you are guaranteed to get the same numbers.

Mange · November 1, 2010, 4:46pm

I think it is a floating point precision issue still. One thing to keep in mind is that floating numbers are stored in a base-2 fractional representation. When you write “4626.52783203f” in a source file, you don’t necessarily get that number. Instead you get the nearest floating point number. (The most common example of this is 1/10 = 0.1 cannot be represented as an exact base-2 floating point number with any number of bits. It is the binary equivalent of 1/3 = 0.3333… in decimal. Rounding when printing the result in base 10 can hide the problem from you, though.)

If op1 and op2 in your first case are computed from different operation sequences (even if mathematically they should be equivalent for real numbers), they might have rounded to adjacent floating point numbers. Depending on how rounding is handled in the printf implementation, they might even look identical when printed in base 10. Try printing the raw bit pattern with __float_as_int() and %x in your printf.

In your second case, you are creating identical numbers by using floating point literals in the source code. As long as you type the same things on both lines, you are guaranteed to get the same numbers.

Thanks a lot for the answer. I tried what you suggested and added a new “debugline” so the debugging looks like this;

[codebox] if (debugProgram >= 0 && (*bar2).date == 20050510 && (*bar2).time >= 145500 && (*bar2).time < 145600) {

			cuPrintf("Debugging inside : op1 %.25f , op2 , %.25f , kvot %.25f \n", op1, op2, stack[stackPointer-1]);

			cuPrintf("Casting op1 %x , op2 %x", float_as_int(op1), float_as_int(op2) );

		}[/codebox]

I get the output “Casting op1 45909439 , op2 45909439”. They also looks the same =( Or did I misunderstood anything you meant?

Mange · November 1, 2010, 4:46pm

I think it is a floating point precision issue still. One thing to keep in mind is that floating numbers are stored in a base-2 fractional representation. When you write “4626.52783203f” in a source file, you don’t necessarily get that number. Instead you get the nearest floating point number. (The most common example of this is 1/10 = 0.1 cannot be represented as an exact base-2 floating point number with any number of bits. It is the binary equivalent of 1/3 = 0.3333… in decimal. Rounding when printing the result in base 10 can hide the problem from you, though.)

If op1 and op2 in your first case are computed from different operation sequences (even if mathematically they should be equivalent for real numbers), they might have rounded to adjacent floating point numbers. Depending on how rounding is handled in the printf implementation, they might even look identical when printed in base 10. Try printing the raw bit pattern with __float_as_int() and %x in your printf.

In your second case, you are creating identical numbers by using floating point literals in the source code. As long as you type the same things on both lines, you are guaranteed to get the same numbers.

Thanks a lot for the answer. I tried what you suggested and added a new “debugline” so the debugging looks like this;

[codebox] if (debugProgram >= 0 && (*bar2).date == 20050510 && (*bar2).time >= 145500 && (*bar2).time < 145600) {

			cuPrintf("Debugging inside : op1 %.25f , op2 , %.25f , kvot %.25f \n", op1, op2, stack[stackPointer-1]);

			cuPrintf("Casting op1 %x , op2 %x", float_as_int(op1), float_as_int(op2) );

		}[/codebox]

I get the output “Casting op1 45909439 , op2 45909439”. They also looks the same =( Or did I misunderstood anything you meant?

njuffa · November 1, 2010, 4:52pm

With an IEEE-compliant division, a number divided by itself will return 1.

(1) For double-precision, on architectures >= sm_13, the division operator ‘/’ always maps to IEEE round-to-nearest-even division.
(2) For single precision, on architectures sm_1x, the division operator ‘/’ always maps to an approximate division. To get an IEEE round-to-nearest-even division, use the __fdiv_rn() device function. Be advised that this is very slow.
(3) For single precision, on architectures sm_2x, the division operator ‘/’ maps to IEEE round-to-nearest-even when -prec-div=true is specified for the compilation; this is the compiler default for sm_2x. When compiling with -prec-div=false, the division operator ‘/’ is mapped to approximate division; one can use __fdiv_rn() device function as on sm_1x in that case.

njuffa · November 1, 2010, 4:52pm

With an IEEE-compliant division, a number divided by itself will return 1.

(1) For double-precision, on architectures >= sm_13, the division operator ‘/’ always maps to IEEE round-to-nearest-even division.
(2) For single precision, on architectures sm_1x, the division operator ‘/’ always maps to an approximate division. To get an IEEE round-to-nearest-even division, use the __fdiv_rn() device function. Be advised that this is very slow.
(3) For single precision, on architectures sm_2x, the division operator ‘/’ maps to IEEE round-to-nearest-even when -prec-div=true is specified for the compilation; this is the compiler default for sm_2x. When compiling with -prec-div=false, the division operator ‘/’ is mapped to approximate division; one can use __fdiv_rn() device function as on sm_1x in that case.

Mange · November 1, 2010, 4:59pm

With an IEEE-compliant division, a number divided by itself will return 1.

(1) For double-precision, on architectures >= sm_13, the division operator ‘/’ always maps to IEEE round-to-nearest-even division.

(2) For single precision, on architectures sm_1x, the division operator ‘/’ always maps to an approximate division. To get an IEEE round-to-nearest-even division, use the __fdiv_rn() device function. Be advised that this is very slow.

(3) For single precision, on architectures sm_2x, the division operator ‘/’ maps to IEEE round-to-nearest-even when -prec-div=true is specified for the compilation; this is the compiler default for sm_2x. When compiling with -prec-div=false, the division operator ‘/’ is mapped to approximate division; one can use __fdiv_rn() device function as on sm_1x in that case.

Thanks for the answer. I was under the impression that the division in cuda was IEEE-compliant. Doesn’t that make my problem even more strange?

Mange · November 1, 2010, 4:59pm

With an IEEE-compliant division, a number divided by itself will return 1.

(1) For double-precision, on architectures >= sm_13, the division operator ‘/’ always maps to IEEE round-to-nearest-even division.

(2) For single precision, on architectures sm_1x, the division operator ‘/’ always maps to an approximate division. To get an IEEE round-to-nearest-even division, use the __fdiv_rn() device function. Be advised that this is very slow.

(3) For single precision, on architectures sm_2x, the division operator ‘/’ maps to IEEE round-to-nearest-even when -prec-div=true is specified for the compilation; this is the compiler default for sm_2x. When compiling with -prec-div=false, the division operator ‘/’ is mapped to approximate division; one can use __fdiv_rn() device function as on sm_1x in that case.

Thanks for the answer. I was under the impression that the division in cuda was IEEE-compliant. Doesn’t that make my problem even more strange?

njuffa · November 1, 2010, 5:08pm

The CUDA behavior is as described above. With an approximate division, an number divided by itself is not guaranteed to be 1. You could check the PTX to see what kind of division operator is being generated for your code:

(1) div.approx.f32 // corresponds to __fdividef(), i.e. an approximate division with restricted range
(2) div.full.f32 // approximate division with full range; maps sm_1x ‘/’ operator, sm_2x ‘/’ operator when -prec-div=false
(3) div.rn.f32 // IEEE round to nearest division; maps sm_2x ‘/’ operator when -prec-div=true (compiler default)

If I recall correctly, for sm_1x the __fdiv_rn() device function is expanded at source level, so you wouldn’t see a div instruction at PTX level for that, while for sm_2x it should map to div.rn.f32 at PTX level.

njuffa · November 1, 2010, 5:08pm

The CUDA behavior is as described above. With an approximate division, an number divided by itself is not guaranteed to be 1. You could check the PTX to see what kind of division operator is being generated for your code:

(1) div.approx.f32 // corresponds to __fdividef(), i.e. an approximate division with restricted range
(2) div.full.f32 // approximate division with full range; maps sm_1x ‘/’ operator, sm_2x ‘/’ operator when -prec-div=false
(3) div.rn.f32 // IEEE round to nearest division; maps sm_2x ‘/’ operator when -prec-div=true (compiler default)

If I recall correctly, for sm_1x the __fdiv_rn() device function is expanded at source level, so you wouldn’t see a div instruction at PTX level for that, while for sm_2x it should map to div.rn.f32 at PTX level.

Mange · November 2, 2010, 4:16pm

With an IEEE-compliant division, a number divided by itself will return 1.

(1) For double-precision, on architectures >= sm_13, the division operator ‘/’ always maps to IEEE round-to-nearest-even division.

(2) For single precision, on architectures sm_1x, the division operator ‘/’ always maps to an approximate division. To get an IEEE round-to-nearest-even division, use the __fdiv_rn() device function. Be advised that this is very slow.

(3) For single precision, on architectures sm_2x, the division operator ‘/’ maps to IEEE round-to-nearest-even when -prec-div=true is specified for the compilation; this is the compiler default for sm_2x. When compiling with -prec-div=false, the division operator ‘/’ is mapped to approximate division; one can use __fdiv_rn() device function as on sm_1x in that case.

Finally the forum is up and running again =) . Thanks a lot for the answers both njuffa and seibert. The problem disappeared when I used the “__fdiv_rn()” device function instead of “/” or “__fdividef” so the problem was related to approximate division.

I have some followup questions which I would really appreciate if someone could give some input on.

njuffa mentioned that “__fdiv_rn()” was very slow, how much slower is it compared to just “/”. I know I’ve read somewhere about how many clock cycles different functions took but I can’t find it anymore.
Where can I find the documentation about, for example, __fdiv_rn() and similar functions? I checked the reference manual but couldn’t find any info there. I assume they are documented somewhere.
I actually tried on a sm_2x machine where it should default to round-to-nearest-even but after a while I noticed that we had the flag “-arch=sm_13” in our makefile so it maybe did not default to round-to-nearest-even. The reason we have that even thou we have sm_2x machines is that we also have sm_1x machines and were lazy and wanted to use the same Makefile. To my question, do we “lose” something else in terms of effiency or functionality or similar when we compile our code with “-arch=sm_13” and use it on machines with sm_2x?
Is there similar functions that __fdiv_rn() for other operations like multiplication? My program is quite sensitive for that kind of weird behavior.

Mange · November 2, 2010, 4:16pm

With an IEEE-compliant division, a number divided by itself will return 1.

(1) For double-precision, on architectures >= sm_13, the division operator ‘/’ always maps to IEEE round-to-nearest-even division.

(2) For single precision, on architectures sm_1x, the division operator ‘/’ always maps to an approximate division. To get an IEEE round-to-nearest-even division, use the __fdiv_rn() device function. Be advised that this is very slow.

(3) For single precision, on architectures sm_2x, the division operator ‘/’ maps to IEEE round-to-nearest-even when -prec-div=true is specified for the compilation; this is the compiler default for sm_2x. When compiling with -prec-div=false, the division operator ‘/’ is mapped to approximate division; one can use __fdiv_rn() device function as on sm_1x in that case.

Finally the forum is up and running again =) . Thanks a lot for the answers both njuffa and seibert. The problem disappeared when I used the “__fdiv_rn()” device function instead of “/” or “__fdividef” so the problem was related to approximate division.

I have some followup questions which I would really appreciate if someone could give some input on.

njuffa mentioned that “__fdiv_rn()” was very slow, how much slower is it compared to just “/”. I know I’ve read somewhere about how many clock cycles different functions took but I can’t find it anymore.
Where can I find the documentation about, for example, __fdiv_rn() and similar functions? I checked the reference manual but couldn’t find any info there. I assume they are documented somewhere.
I actually tried on a sm_2x machine where it should default to round-to-nearest-even but after a while I noticed that we had the flag “-arch=sm_13” in our makefile so it maybe did not default to round-to-nearest-even. The reason we have that even thou we have sm_2x machines is that we also have sm_1x machines and were lazy and wanted to use the same Makefile. To my question, do we “lose” something else in terms of effiency or functionality or similar when we compile our code with “-arch=sm_13” and use it on machines with sm_2x?
Is there similar functions that __fdiv_rn() for other operations like multiplication? My program is quite sensitive for that kind of weird behavior.

njuffa · November 2, 2010, 4:52pm

(1) From memory, __fdiv_rn() on sm_13 hardware is about one order of magnitude slower than the approximate division. It also needs more registers, which can also impact performance. To gauge the impact on a particular piece of code, it would be easiest to simply compile the code both ways and measure the performance at kernel level or app level.
(2) __fdiv_rn and related functions are documented in one of the appendices of the CUDA C Programming Guide. Search for “__fdiv” in the PDF.
(3) If you compile for sm_13 and use the single-precision division operator, an div.full.f32 is emitted into the PTX code. If you run this on an sm_2x device, the div.full.f32 from the PTX is JITted to the appropriate approximate division instruction sequence. I would recommend changing the build so it creates a fat binary containing both sm_1x and sm_2x machine code, along with the PTX.
(4) Yes, there are __fadd_r[n,u,d,z], __fmul_r[n,u,d,z], etc device functions. Check the CUDA C Programming Guide for the full list. Floating-point adds and multiplies are hardware instructions with proper rounding so it is rarely necessary to use these device functions to access them, at least in my experience. Pretty much the only place I have used them is to prevent the compiler from merging adds and multiplies into FMADs/FMAs.

njuffa · November 2, 2010, 4:52pm

(1) From memory, __fdiv_rn() on sm_13 hardware is about one order of magnitude slower than the approximate division. It also needs more registers, which can also impact performance. To gauge the impact on a particular piece of code, it would be easiest to simply compile the code both ways and measure the performance at kernel level or app level.
(2) __fdiv_rn and related functions are documented in one of the appendices of the CUDA C Programming Guide. Search for “__fdiv” in the PDF.
(3) If you compile for sm_13 and use the single-precision division operator, an div.full.f32 is emitted into the PTX code. If you run this on an sm_2x device, the div.full.f32 from the PTX is JITted to the appropriate approximate division instruction sequence. I would recommend changing the build so it creates a fat binary containing both sm_1x and sm_2x machine code, along with the PTX.
(4) Yes, there are __fadd_r[n,u,d,z], __fmul_r[n,u,d,z], etc device functions. Check the CUDA C Programming Guide for the full list. Floating-point adds and multiplies are hardware instructions with proper rounding so it is rarely necessary to use these device functions to access them, at least in my experience. Pretty much the only place I have used them is to prevent the compiler from merging adds and multiplies into FMADs/FMAs.

Mange · November 3, 2010, 8:33am

(1) From memory, __fdiv_rn() on sm_13 hardware is about one order of magnitude slower than the approximate division. It also needs more registers, which can also impact performance. To gauge the impact on a particular piece of code, it would be easiest to simply compile the code both ways and measure the performance at kernel level or app level.

(2) __fdiv_rn and related functions are documented in one of the appendices of the CUDA C Programming Guide. Search for “__fdiv” in the PDF.

(3) If you compile for sm_13 and use the single-precision division operator, an div.full.f32 is emitted into the PTX code. If you run this on an sm_2x device, the div.full.f32 from the PTX is JITted to the appropriate approximate division instruction sequence. I would recommend changing the build so it creates a fat binary containing both sm_1x and sm_2x machine code, along with the PTX.

(4) Yes, there are __fadd_r[n,u,d,z], __fmul_r[n,u,d,z], etc device functions. Check the CUDA C Programming Guide for the full list. Floating-point adds and multiplies are hardware instructions with proper rounding so it is rarely necessary to use these device functions to access them, at least in my experience. Pretty much the only place I have used them is to prevent the compiler from merging adds and multiplies into FMADs/FMAs.

Thanks a lot for your help, it was really helpful!

Mange · November 3, 2010, 8:33am

(1) From memory, __fdiv_rn() on sm_13 hardware is about one order of magnitude slower than the approximate division. It also needs more registers, which can also impact performance. To gauge the impact on a particular piece of code, it would be easiest to simply compile the code both ways and measure the performance at kernel level or app level.

(2) __fdiv_rn and related functions are documented in one of the appendices of the CUDA C Programming Guide. Search for “__fdiv” in the PDF.

(3) If you compile for sm_13 and use the single-precision division operator, an div.full.f32 is emitted into the PTX code. If you run this on an sm_2x device, the div.full.f32 from the PTX is JITted to the appropriate approximate division instruction sequence. I would recommend changing the build so it creates a fat binary containing both sm_1x and sm_2x machine code, along with the PTX.

(4) Yes, there are __fadd_r[n,u,d,z], __fmul_r[n,u,d,z], etc device functions. Check the CUDA C Programming Guide for the full list. Floating-point adds and multiplies are hardware instructions with proper rounding so it is rarely necessary to use these device functions to access them, at least in my experience. Pretty much the only place I have used them is to prevent the compiler from merging adds and multiplies into FMADs/FMAs.

Thanks a lot for your help, it was really helpful!

Mange · November 15, 2010, 2:00pm

I have a followup question here again after a little while =) . It turned out I got similar problems with addition in some case and I changed to __fadd_rn and that solved the problem. I noticed that there are similar functions for division, multiplication and addition but not for subtraction, why is that the case? Is subtraction basically an addition on instruction level? ( slightly modified ) Or how does it work? How would I do if I would like to use a “safe” subtraction in cuda?

Mange · November 15, 2010, 2:00pm

I have a followup question here again after a little while =) . It turned out I got similar problems with addition in some case and I changed to __fadd_rn and that solved the problem. I noticed that there are similar functions for division, multiplication and addition but not for subtraction, why is that the case? Is subtraction basically an addition on instruction level? ( slightly modified ) Or how does it work? How would I do if I would like to use a “safe” subtraction in cuda?