Is there any source code available for the benchmark? Otherwise, trying to run a comparable test under CUDA might be really difficult...

Also, is there any price for a CUDA implementation that beats the CPU?

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

Prize of course... where has the edit function gone?

Can you beat the ease of use of its parallel library?

Take any compiled test you want from http://www.equation.com/servlet/equation.cmd?fa=laipebenchmark and compare your CUDA speed with Intel/AMD multi-core CPUs.

Test: solution of a sparse banded system of equations.

CPUs  Time (s)
1     2.46
2     1.22
3     0.83
4     0.67
5     0.58
6     0.50
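For anyone wanting to compare a CUDA run against these figures, the timings above can be turned into speedup and parallel efficiency relative to the single-CPU run. A minimal sketch in Python, using only the numbers posted here; the function and variable names are my own and not part of the benchmark:

```python
# Wall-clock times from the post above: number of CPUs -> seconds.
timings = {1: 2.46, 2: 1.22, 3: 0.83, 4: 0.67, 5: 0.58, 6: 0.50}

def scaling(timings):
    """Return {cpus: (speedup, efficiency)} relative to the 1-CPU run."""
    base = timings[1]
    result = {}
    for cpus, seconds in sorted(timings.items()):
        speedup = base / seconds           # how many times faster than 1 CPU
        result[cpus] = (speedup, speedup / cpus)  # efficiency = speedup per CPU
    return result

for cpus, (speedup, efficiency) in scaling(timings).items():
    print(f"{cpus} cpu  speedup {speedup:.2f}x  efficiency {efficiency:.0%}")
```

On these numbers the 6-CPU run is about 4.9 times faster than the 1-CPU run, which works out to roughly 82% parallel efficiency.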

constant * variable + constant * variable + constant * variable <= 1000;

If so, can you give an example of what the input would look like?

x = 2y

y = x + 4

Or even

x/y = 2

y - x = 4

Of course, there can be more variables...

However, because they are linear, you will never see

x = y^2
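To make the first pair of equations concrete, here is a rough sketch of how they might be put into coefficient form and solved. It assumes a plain 2x2 system and uses Cramer's rule; `solve_2x2` is a helper name made up for this post, not anything from a library. The second pair (x/y = 2, y - x = 4) reduces to the same system as long as y is not zero.

```python
from fractions import Fraction

def solve_2x2(a11, a12, b1, a21, a22, b2):
    """Solve a11*x + a12*y = b1 and a21*x + a22*y = b2 by Cramer's rule."""
    det = Fraction(a11 * a22 - a12 * a21)
    if det == 0:
        raise ValueError("no unique solution (determinant is zero)")
    x = Fraction(b1 * a22 - a12 * b2) / det
    y = Fraction(a11 * b2 - b1 * a21) / det
    return x, y

# Move every variable to the left-hand side first:
#   x = 2y      ->   1*x - 2*y = 0
#   y = x + 4   ->  -1*x + 1*y = 4
x, y = solve_2x2(1, -2, 0, -1, 1, 4)
print(x, y)  # -8 -4
```

With more variables the idea is the same, only the coefficient table grows and you would use Gaussian elimination or a library solver instead of Cramer's rule.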

