Inverse of a square matrix using Gauss-Jordan elimination

Hello there,
I am trying to compute the inverse of a 3x3 matrix on the GPU using Gauss-Jordan elimination; the result is needed by another kernel already running on the GPU. However, it seems I am not getting the right answer.

My original matrix is:
 204.00000401   -204.00000001   -2e-06
-204.00000001    204.00000401   -2e-06
  -2e-06          -2e-06         4e-06

If I run my algorithm on the CPU as simple C code, or in MATLAB, it gives me the right answer:

      250000.001138622   249999.998687641   249999.999913132
      249999.998687641   250000.001138622   249999.999913132
      249999.999913132   249999.999913132   499999.999913132

However, if I try this on the GPU, I get the following:
nan   nan   nan
nan   nan   nan
nan   nan   nan

I suspect this issue is related to the IEEE 754 standard for floating-point operations on the GPU. Is there any way to work around it? I am pretty sure someone on this forum has run into a similar issue. I would appreciate it if someone could share a few pointers here. Thanks in advance.
-Jayanta

I doubt this has anything to do with the IEEE 754 standard. More likely it is a bug in your code. Here is a rudimentary GJE example:

algorithm - cuda matrix inverse gaussian jordan - Stack Overflow
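
If it helps, a minimal single-thread version can look like the sketch below. To be clear, this is my own illustration, not the code from that link, and gjInverse plus the <<<1,1>>> launch are just placeholders for a single 3x3 problem. Note the partial-pivoting row swap: without it, a nearly singular matrix like your first one can hit a zero pivot, and the division by it produces exactly the all-NaN pattern you reported:

#include <cstdio>
#include <cmath>

#define N 3

// Single-thread Gauss-Jordan inverse of an N x N matrix: reduces a to the
// identity in place while applying the same row operations to inv, which
// starts as the identity and ends as the inverse.
__global__ void gjInverse(float *a, float *inv)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            inv[i * N + j] = (i == j) ? 1.0f : 0.0f;

    for (int col = 0; col < N; ++col) {
        // Partial pivoting: bring the largest |entry| on or below the
        // diagonal into the pivot position before dividing by it.
        int piv = col;
        for (int r = col + 1; r < N; ++r)
            if (fabsf(a[r * N + col]) > fabsf(a[piv * N + col]))
                piv = r;
        if (piv != col)
            for (int j = 0; j < N; ++j) {
                float t;
                t = a[col * N + j];   a[col * N + j]   = a[piv * N + j];   a[piv * N + j]   = t;
                t = inv[col * N + j]; inv[col * N + j] = inv[piv * N + j]; inv[piv * N + j] = t;
            }

        // Normalize the pivot row, then clear the column in all other rows.
        float p = a[col * N + col];
        for (int j = 0; j < N; ++j) {
            a[col * N + j]   /= p;
            inv[col * N + j] /= p;
        }
        for (int r = 0; r < N; ++r) {
            if (r == col) continue;
            float f = a[r * N + col];
            for (int j = 0; j < N; ++j) {
                a[r * N + j]   -= f * a[col * N + j];
                inv[r * N + j] -= f * inv[col * N + j];
            }
        }
    }
}

int main()
{
    float h_a[N * N] = { 204.00000401f, -204.00000001f, -2e-06f,
                        -204.00000001f,  204.00000401f, -2e-06f,
                          -2e-06f,        -2e-06f,       4e-06f };
    float h_inv[N * N];
    float *d_a, *d_inv;
    cudaMalloc(&d_a,   sizeof(h_a));
    cudaMalloc(&d_inv, sizeof(h_inv));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    gjInverse<<<1, 1>>>(d_a, d_inv);   // one thread is enough for a 3x3 problem
    cudaMemcpy(h_inv, d_inv, sizeof(h_inv), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i)
        printf("%15.6g %15.6g %15.6g\n", h_inv[i * N], h_inv[i * N + 1], h_inv[i * N + 2]);
    cudaFree(d_a);
    cudaFree(d_inv);
    return 0;
}

Even with the pivoting, float will still lose most of the significant digits on that particular matrix, but it avoids the division by a zero pivot that turns the whole result into NaNs.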

Thank you for the comment. I suspect IEEE 754 because the matrix is only 3x3, and the same code works when I try a different 3x3 matrix, as shown below:

Matrix before inversion:
 6.000000   -2.500000   -1.000000
-2.500000   78.320000   -5.247000
-1.000000   -5.247000   11.259000

Matrix after inversion:
0.172555506     0.00674538361   0.0184695385
0.00674538361   0.0134432986    0.0068640532
0.0184695385    0.0068640532    0.0936570987

These answers are from the GPU version of the code and are a perfect match with the CPU-based algorithm. I tried my code on other, bigger matrices as well, and the results match perfectly. So a bug is probably not the issue here.

Which is your kernel using, float or double? (MATLAB uses double.)

Thank you, Txbob. I am using the float type on both the CPU and the GPU. Sorry for the delay in replying.

Try using double. GJE involves division, and some of your divisors for the badly behaving matrix are, relatively speaking, quite small. The exact details probably matter here, i.e. the exact code probably matters. I don't think this is actually specific to the GPU: in my experience, properly constrained float behavior on the CPU is basically identical to float behavior on the GPU. The notion that there are differences is pretty much a misconception floating around.
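
To make the small-divisor point concrete, here is a quick host-side check (a sketch of mine; the variable names are made up) that computes the second Gauss-Jordan pivot of your first matrix in float and in double:

#include <cstdio>

int main()
{
    // a11 and a21 of the badly behaving matrix (it is symmetric, so a12 == a21)
    float  af = 204.00000401f, bf = -204.00000001f;
    double ad = 204.00000401,  bd = -204.00000001;

    // Second pivot after one elimination step: a22 - a21 * a12 / a11
    float  pf = af - (bf * bf) / af;
    double pd = ad - (bd * bd) / ad;

    printf("float  pivot: %.9g\n",  pf);   // exactly 0
    printf("double pivot: %.17g\n", pd);   // ~8e-06, the true small pivot
    return 0;
}

In double the pivot comes out around 8e-6. In float, 204.00000401 and 204.00000001 are closer together than one ULP at that magnitude (about 1.5e-5), so both literals round to 204.0f, the computed pivot is exactly zero, and the division that follows produces inf and then the NaNs you are seeing.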

Thank you, Txbob, for your effort. No luck for me yet!
However, I found an interesting point in the following post, which says:

"For single precision…

If you want an accuracy of +/-0.5 (or 2^-1), the maximum size that the number can be is 2^23. Any larger than this and the distance between floating point numbers is greater than 0.5.

If you want an accuracy of +/-0.0005 (about 2^-11), the maximum size that the number can be is 2^13. Any larger than this and the distance between floating point numbers is greater than 0.0005." – During my calculations some of the intermediate values go beyond 10^6 (approaching 2^23, which is about 8.4x10^6), so could the spacing between adjacent floats at that magnitude be what produces the wrong values? What do you think? Thanks in advance. (The small check below the link illustrates this spacing.)

https://stackoverflow.com/questions/872544/what-range-of-numbers-can-be-represented-in-a-16-32-and-64-bit-ieee-754-syste
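
For what it's worth, the spacing the quoted post describes is easy to check directly. Below is a small stand-alone sketch of mine (plain host code; it compiles with nvcc or any C++ compiler) that prints the gap between adjacent single-precision floats at a few magnitudes, including ~2.5x10^5, the size of the entries in the correct inverse above:

#include <cstdio>
#include <cmath>

int main()
{
    // Gap between x and the next representable float above it (one ULP).
    float xs[] = { 8192.0f,        // 2^13: gap ~0.001, i.e. accuracy ~ +/-0.0005
                   250000.0f,      // size of the inverse entries: gap 0.015625
                   8388608.0f,     // 2^23: gap 1.0, i.e. accuracy ~ +/-0.5
                   16777216.0f };  // 2^24: gap 2.0
    for (float x : xs)
        printf("ULP at %10.1f is %g\n", x, nextafterf(x, INFINITY) - x);
    return 0;
}

At 2.5x10^5 adjacent floats are already about 0.016 apart, and past 2^23 the gap is a full 1.0, so intermediate values of that size simply cannot carry the number of digits shown in the MATLAB answer above.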