CUDA 7.5 on Maxwell 980Ti drops performance by 10x versus CUDA 7.0 and 6.5
Yes, I had figured out that the do loop executes only once (~610x per thread), and I had connected the 610 number with 10,000,000 photons divided by 256*64 threads.

I had also discovered that the timing of the launchnewphoton() function measured from within the function did not match the timing measured by wrapping the function call. This is a curious result; function inlining doesn't help when trying to dissect code this way. I'm not really sure how to connect all the dots yet, but your data point about changing the if condition certainly suggests the compiler is doing something pretty odd.
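
For reference, here is a minimal sketch of what I mean by timing from within the function versus wrapping the call (the accumulator variables and the use of clock64()/atomicAdd() are illustrative, not the actual MCX instrumentation):

[code]
// Illustrative only -- not MCX code. Two ways to time launchnewphoton():
// once from inside the function body, once by wrapping the call site.
__device__ unsigned long long t_inside;    // cycles measured inside the function
__device__ unsigned long long t_wrapped;   // cycles measured around the call

__device__ void launchnewphoton(float *weight)
{
    unsigned long long t0 = clock64();
    *weight = 1.0f;                        // ... real initialization work here ...
    atomicAdd(&t_inside, (unsigned long long)(clock64() - t0));
}

__global__ void kernel(float *weights)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long long t0 = clock64();
    launchnewphoton(&weights[idx]);
    atomicAdd(&t_wrapped, (unsigned long long)(clock64() - t0));
}
[/code]

In principle the two accumulators should agree closely; in this case they did not.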

Anyway, your data point probably narrows things down enough that the compiler team should be able to chew on it with reasonable efficiency.

I'll file a bug at some point soon. If you want continued feedback/communication about status, I suggest you file your own bug as well.

#31
Posted 03/25/2016 03:36 AM   
[quote=""]I had also discovered that timing of the launchnewphoton() function from within the function did not match the timing when wrapping the function call.[/quote] that's very curious. something must have been added to account for 10x more instructions executed. [quote=""]I'll file a bug at some point soon. If you want continued feedback/communication about status, I suggest you file your own bug as well.[/quote] ok, I will file one from my side as well. thanks a lot for looking into this. On the other hand, if you see anything we could optimize for making the code more efficient, we'd love to hear. This project is actually funded by the NIH and I am obligated to make the code more efficient.
[quote]I had also discovered that the timing of the launchnewphoton() function measured from within the function did not match the timing measured by wrapping the function call.[/quote]

That's very curious. Something must have been added to account for the 10x increase in executed instructions.


[quote]I'll file a bug at some point soon. If you want continued feedback/communication about status, I suggest you file your own bug as well.[/quote]

OK, I will file one from my side as well.

Thanks a lot for looking into this. On the other hand, if you see anything we could optimize to make the code more efficient, we'd love to hear it. This project is actually funded by the NIH, and I am obligated to make the code more efficient.

#32
Posted 03/25/2016 04:09 AM   
As stated above, my guess at this point is a problem with placing merge / convergence points (insertion of SSY and .S). This is not visible at the PTX level; it is a low-level mechanism inserted into SASS by PTXAS, and it is basically an optimization.

While the use of merge / convergence points is not required for functional correctness, it is important for performance to avoid the potential effect of divergence causing a single thread to run all the way to completion before the thread-mask stack is finally popped and other threads get to run. I recall at least one bug related to SSY/.S placement in the past that affected code with a structure similar to the code considered here (nested loops with plenty of conditionals inside). The performance degradation in that case was also dramatic, although not 10x.
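
To illustrate the kind of structure I mean, here is a schematic sketch (not MCX code) of nested loops with data-dependent branches; where ptxas places the reconvergence point (SSY and the .S flag) determines how quickly the threads of a warp get back together after each divergent branch:

[code]
// Schematic sketch only, not MCX code.
__global__ void walk(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float w = 1.0f;
    for (int step = 0; step < n; ++step) {        // outer loop
        float v = in[(i + step) % n];
        for (int k = 0; k < 64; ++k) {            // inner loop
            if (v > 0.5f) { w -= v; continue; }   // divergent path 1
            if (w < 0.1f) break;                  // divergent path 2
            w *= 0.5f;                            // divergent path 3
        }
    }
    out[i] = w;
}
[/code]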

It sure would be interesting to get feedback eventually as to what the underlying reason for this issue was.

#33
Posted 03/25/2016 04:21 AM   
The internal bug I filed is 1747451

You may wish to reference that bug number in the bug you file.

#34
Posted 03/25/2016 04:22 AM   
I would also like to report a related finding regarding the OpenCL version of the code. Maybe I should start a new thread, but let me briefly describe the problem here first.

I wrote an OpenCL version of mcx, called mcxcl. On Maxwell, it has the same issue as the CUDA version: the running speed is much slower than it used to be, and even slower than the Fermi/Kepler cards in my system.

However, we found that turning on a flag (-d 1) from the command line improves the mcxcl simulation speed by 10x.

This is quite puzzling, because when a user sets "-d 1", mcxcl appends "-D MCX_SAVE_DETECTORS" to the JIT compiler options; see

https://github.com/fangq/mcxcl/blob/1a499869462b72760163d96975a34d48cc966d6f/src/mcx_host.cpp#L358-L359

If you inspect the CL kernel, you can see that defining the MCX_SAVE_DETECTORS macro enables 5 to 6 additional code blocks:

https://github.com/fangq/mcxcl/blob/1a499869462b72760163d96975a34d48cc966d6f/src/mcx_core.cl

This means the CL kernel is more complex, and more computation is needed for the additional photon-detection calculations/storage. So I expected the code to be slower, and could not imagine it could become 10x faster!
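
Schematically, the guarded blocks look like this (a simplified sketch; savedetphoton() and the detector variables are made-up placeholders, the real blocks are in mcx_core.cl):

[code]
/* simplified sketch -- the real blocks are in mcx_core.cl */
#ifdef MCX_SAVE_DETECTORS          /* compiled only when "-d 1" adds
                                      -D MCX_SAVE_DETECTORS to the JIT options */
      if (mediaid == 0 && isdet)   /* photon escaped through a detector */
            savedetphoton(ppath, detid);
#endif
[/code]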

If you want to test this, here is the test sequence (you can use any version of CUDA, but you need a Maxwell card to reproduce):

[code]git clone https://github.com/fangq/mcxcl.git
cd mcxcl
git checkout 1a499869462b72760163d96975a34d48cc966d6f .
cd src
make clean
make                  # compile the mcxcl binary
cd ../example/quicktest
./listgpu.sh          # list all available GPUs
./run_qtest.sh        # run the code on the 1st GPU (-G 1); use a 01 mask string to select a GPU
[/code]


For my GTX 980Ti, the simulation speed is pretty low, similar to the CUDA case: around 1,400 photon/ms.

However, if you append "-d 1" to the command in run_qtest.sh, i.e. run this command instead:

[code]../../bin/mcxcl -t 16384 -T 64 -g 10 -n 1e7 -f qtest.inp -s qtest -r 1 -a 0 -b 0 -k ../../src/mcx_core.cl -d 1[/code]


On my 980Ti, the speed increases to ~17,000 photon/ms, also similar to the CUDA case.

<edit>
Just to provide an additional speed reference: I compiled mcxcl on an Ubuntu box running CUDA 6.5 and ran the same benchmark on a 980 (not Ti); I got 22,000 photon/ms with "-d 0" and 16,000 with "-d 1". This result makes perfect sense. However, comparing with the 980Ti numbers, I would expect the 980Ti to outperform the 980 by ~10%. So the 17,000 photon/ms with -d 1 on the 980Ti looks proportional to the 980's speed (16,000); the broken case is "-d 0" on the 980Ti (1,400 photon/ms, likely a 17x drop from ~24,000 photon/ms).
</edit>

However, this time it appears that something inside the "#ifdef MCX_SAVE_DETECTORS ... #endif" blocks influenced the branch predication, except that here the more complex code seems to help the heuristics generate better code.

txbob, do you want to include this additional finding in your bug report as well? Or do you think a separate bug report is more appropriate?

#35
Posted 03/25/2016 04:41 AM   
[quote=""]txbob, do you want to include this additional finding to your bug report as well? or you think a separate bug report is more appropriate? [/quote] I don't see enough data to connect the two. The only thing I can surmise is that there is what I call "fragile" code generation going on (in both cases). Essentially a repeat of what njuffa has said a couple times now. I think njuffa has more experience in this area than I do. I've seen maybe one or 2 examples of "fragile" code generation in my experience. This might be a third case. But that's merely conjecture. I cannot connect the two cases based on what you've said. I would suggest filing a separate bug for the OpenCL case, and if you wish, reference your CUDA bug for the issue discussed in this thread (or mine). That will provide enough context to connect the two issues if need be. It should result in less cluttered reports anyway, as the code bases and exact repro steps are separate anyway (although, perhaps similar). Thanks for your patience with us and with this issue. Thanks for taking the time to help unravel it.
[quote]txbob, do you want to include this additional finding in your bug report as well? Or do you think a separate bug report is more appropriate?[/quote]

I don't see enough data to connect the two. The only thing I can surmise is that what I call "fragile" code generation is going on (in both cases), which is essentially a repeat of what njuffa has said a couple of times now; I think njuffa has more experience in this area than I do. I've seen maybe one or two examples of "fragile" code generation in my experience, and this might be a third case, but that's merely conjecture. I cannot connect the two cases based on what you've said.

I would suggest filing a separate bug for the OpenCL case and, if you wish, referencing your CUDA bug for the issue discussed in this thread (or mine). That will provide enough context to connect the two issues if need be. It should also result in less cluttered reports, as the code bases and exact repro steps are separate (though perhaps similar).

Thanks for your patience with us and with this issue. Thanks for taking the time to help unravel it.

#36
Posted 03/25/2016 04:53 AM   
While I do have extensive experience diving into the details of SASS code to pinpoint compiler bugs (or else prove that the issue is with the source code), I no longer enjoy the benefit of discussing such issues at length with the CUDA compiler engineers. I'd say much of my knowledge in this area is "dated" at this point, possibly even "outdated".

If the issue in the MCX code is one of SSY/.S placement (a mere conjecture at this point), it is probably not an issue of "fragile" code generation, just a very hard problem for the compiler to solve, and it may just so happen that adding one more branch triggers a performance cliff in very rare cases.

The placement issue is hard because once the compiler starts placing convergence points (as tightly as possible, to avoid lengthy divergent flows), it also needs to traverse all the possible call graphs to make sure the thread-mask stack comes out correctly on all possible paths to that convergence point. If the code has instances of 'break' and 'continue' (really 'goto' in disguise) that can make it extra hard.
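
To make the 'goto in disguise' point concrete, here is a trivial sketch (skip(), done(), and work() are placeholders): the first loop is what the programmer writes; the second is the control-flow graph the compiler actually has to analyze.

[code]
/* trivial sketch; skip(), done(), and work() are placeholder functions */
__device__ bool skip(int i) { return (i & 1) != 0; }
__device__ bool done(int i) { return i > 100; }
__device__ void work(int i) { (void)i; }

__device__ void with_break_continue(int n)
{
    for (int i = 0; i < n; ++i) {
        if (skip(i)) continue;          /* == goto next_iteration; */
        if (done(i)) break;             /* == goto loop_exit;      */
        work(i);
    }
}

/* ...which presents the compiler with the same control flow as: */
__device__ void with_explicit_gotos(int n)
{
    for (int i = 0; i < n; ++i) {
        if (skip(i)) goto next_iteration;
        if (done(i)) goto loop_exit;
        work(i);
next_iteration: ;
    }
loop_exit: ;
}
[/code]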

#37
Posted 03/25/2016 07:25 AM   
[quote=""]If the code has instances of 'break' and 'continue' (really 'goto' in disguise) that can make it extra hard.[/quote] Yes, this code has break and continue in various places, as well as the use of return statements from various points conditionally within a function. The two cases I can remember previously where I argued with the compiler engineers were just such cases as well.
[quote]If the code has instances of 'break' and 'continue' (really 'goto' in disguise) that can make it extra hard.[/quote]

Yes, this code has break and continue in various places, as well as conditional return statements at various points within functions.

The two previous cases I can remember, where I argued with the compiler engineers, were just such cases as well.

#38
Posted 03/25/2016 10:37 AM   
[quote=""]The placement issue is hard because once the compiler starts placing convergence points (as tightly as possible, to avoid lengthy divergent flows), it also needs to traverse all the possible call graphs to make sure the thread-mask stack comes out correctly on all possible paths to that convergence point. If the code has instances of 'break' and 'continue' (really 'goto' in disguise) that can make it extra hard.[/quote] I am totally open to the idea of optimizing MCX coding styles so that the compiler heuristics can easily generate highly efficient instructions (of course, the compiler team may use MCX as a benchmark to enhance robustness of handling large monolithic complex kernels). I am also wiling to be educated to learn techniques that can ease the predication process for the compiler (if I can't understand, I am sure Fanny will). During the early development cycles of this software (circa 2009), I found using a while-loop construct and a for-loop construct made huge difference in terms of speed. [url]https://github.com/fangq/mcx/blob/fc7963a53c7d918de65e484242ffa54ae358a61f/src/mcx_core.cu#L150-L157[/url] but this difference diminished in newer versions of the toolkit. I was under the impression that the code complexities presented in MCX was well taken care of by the compiler. A "fragile code generation" issue has no longer been an issue until this Maxwell/cuda 7.5 issue showed up. The first thing I would like to learn from you guys is that: through what mechanism can an inefficient code generation negatively impact the speed? does it impact through consuming more registers? does it impact though increasing the instruction size and instruction loading overhead? My second question, is there a fundamental difference between a for-loop and while-loop in code generation? what about variable-limit for-loops? can a while-loop be unrolled? how does break/continue influence the loop code generation? My other question is related to JIT. I understand OpenCL uses JIT, and I heard that part of nvcc code generation also uses JIT. However, the JIT compilation happens before users initialize the constant variables, which may contain crucial information to signify the enabling and disabling of large code blocks (which can make a substantial impact to complexity and code generation). So, in either the case of OpenCL or CUDA compilation, does the compiler use the constant memory values at all to simplify code generation? if not, what was the difficulty? or is there a way we can hint the heuristics?
[quote]The placement issue is hard because once the compiler starts placing convergence points (as tightly as possible, to avoid lengthy divergent flows), it also needs to traverse all the possible call graphs to make sure the thread-mask stack comes out correctly on all possible paths to that convergence point. If the code has instances of 'break' and 'continue' (really 'goto' in disguise) that can make it extra hard.[/quote]

I am totally open to the idea of optimizing MCX's coding style so that the compiler heuristics can easily generate highly efficient instructions (and of course, the compiler team may use MCX as a benchmark to improve the robustness of handling large, monolithic, complex kernels). I am also willing to be educated on techniques that can ease the predication process for the compiler (if I can't understand them, I am sure Fanny will).

During the early development cycles of this software (circa 2009), I found that using a while-loop construct versus a for-loop construct made a huge difference in speed:

https://github.com/fangq/mcx/blob/fc7963a53c7d918de65e484242ffa54ae358a61f/src/mcx_core.cu#L150-L157

But this difference diminished in newer versions of the toolkit, and I was under the impression that the code complexity present in MCX was well handled by the compiler; "fragile code generation" had not been an issue until this Maxwell/CUDA 7.5 problem showed up.

The first thing I would like to learn from you guys is: through what mechanism does inefficient code generation hurt speed? Does it consume more registers? Does it increase the instruction count and the instruction-loading overhead?

My second question: is there a fundamental difference between a for-loop and a while-loop in code generation? What about variable-limit for-loops? Can a while-loop be unrolled? How do break/continue influence loop code generation?

My other question is related to JIT. I understand OpenCL uses JIT, and I heard that part of nvcc code generation can also use JIT. However, the JIT compilation happens before users initialize the constant variables, which may contain crucial information that enables or disables large code blocks (and thus can have a substantial impact on complexity and code generation). So, for either OpenCL or CUDA compilation, does the compiler use the constant-memory values at all to simplify code generation? If not, what is the difficulty? Or is there a way we can hint the heuristics?

#39
Posted 03/25/2016 05:09 PM   
In general, I advise against massaging code to be more palatable to any particular compiler (CPU or GPU), because the resulting optimization is very fragile. Every new version of the compiler will change some of the interactions between the heaps of heuristics inside the code-generating phases, and a previously favored idiom may become disadvantageous. Instead, I advocate writing code in a clear, straightforward manner and reporting any resulting inefficiencies to the compiler vendor.

This isn't just a theoretical consideration: For example, to get the best performance for the CUDA math library, I would often massage the source code to be more palatable to the compiler, which required a lot of time for dissecting and studying the generated SASS code to come up with the winning combination, as there is no general recipe. Overall, for investigating functional bugs and performance issues (both for in-house and customer code), I would claim (without boasting) that I probably looked at more SASS code in detail than any particular CUDA compiler engineer.

However, my approach to the math library source code required numerous re-writes over the years and made the code difficult to read in places. If someone is desperate for performance (and cannot wait for the compiler to improve), they should by all means look into massaging source code, using inline PTX assembly, maybe even writing SASS code with Scott Gray's Maxwell assembler. But it is not sound software engineering to my mind, as the loss of code readability and the increased maintenance burden have a definite long-term cost.

For general optimization strategies, the CUDA Best Practices Guide is an excellent starting point. In terms of overall code structure, prefer "single-entry, single-exit" constructs. This means avoiding use of 'break', 'continue', multiple 'return', which are all hidden uses of 'goto'. Such irregular control flow interferes with many compiler optimizations, independent of the platform. One frequent source of execution inefficiencies (could be major or minor) in CUDA is not taking full advantage of CUDA's extended set of math functions and device intrinsics (e.g. some programmers do not realize they have rsqrt(), sincos(), rhypot() etc. at their disposal).
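
To make the "single-entry, single-exit" point concrete, here is a minimal sketch (not MCX code; the names are made up). The two device functions compute the same thing, but the second gives the compiler a simpler control-flow graph and also uses a device intrinsic:

[code]
// minimal sketch, not MCX code; 'weight' and 'mu' are made-up names
__device__ float step_multi_exit(float weight, float mu)
{
    if (weight < 1e-4f) return 0.0f;        // early exit 1
    if (mu <= 0.0f)     return weight;      // early exit 2
    return weight * expf(-mu);
}

__device__ float step_single_exit(float weight, float mu)
{
    float result = weight;                  // single entry ...
    if (weight >= 1e-4f) {
        if (mu > 0.0f)
            result = weight * __expf(-mu);  // fast device intrinsic
    } else {
        result = 0.0f;
    }
    return result;                          // ... single exit
}
[/code]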

As for loops, I am not aware of any particular pros and cons for the three basic loop types, other than that integer-counted for-loops probably make unrolling easier and more likely. I say "probably" because I have not actually researched this, it has never come up in my CUDA performance work.
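
For example, a sketch of the kind of integer-counted loop that is easy to unroll (and where unrolling can be requested explicitly):

[code]
// sketch only: a compile-time trip count makes unrolling straightforward
__device__ float dot4(const float *a, const float *b)
{
    float s = 0.0f;
#pragma unroll
    for (int k = 0; k < 4; ++k)   // fixed, integer-counted trip count
        s += a[k] * b[k];
    return s;                     // typically compiles to four FMAs, no loop overhead
}
[/code]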

As for JIT compilation: CUDA can now JIT compile from source code. Since the beginning of CUDA, it can JIT compile from PTX representation. I usually advise against using this, unless dynamic code generation is a crucial technique for a particular use case. My advice is to use offline compilation in such a way that SASS (machine code) for all architectures of interest is embedded in the object file, plus one copy of PTX for the most recent architecture. The latter serves as an insurance policy for future GPU architectures on which the PTX code can be JIT compiled when it first arrives.
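
As an illustration of that build strategy (file and binary names here are placeholders, and the architecture list should match the GPUs you care about), SASS is embedded for Fermi, Kepler, and Maxwell, plus PTX for the most recent architecture:

[code]
# illustration only; adjust the -gencode lines to the GPUs of interest
nvcc -O3 -o app app.cu \
     -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_52,code=compute_52   # embed PTX for JIT on future GPUs
[/code]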

#40
Posted 03/25/2016 05:53 PM   
Thanks, njuffa, for the helpful feedback. I will read the Best Practices Guide in more detail. The last time I read the guide carefully was about 5 years ago ...

By disabling the offending if() block we found earlier, I was able to run the PC sampling profiler again, and I now see some new findings. I would like to get some help interpreting the assembly code.

Over the last week, I've implemented a new RNG (xorshift128+) in the hope of getting better speed. I now see some different patterns in the PC sampling profiler output. Memory dependency, which previously accounted for only 2-3% of the latency, is now back in the picture, even though the overall running speed is pretty decent (24k photon/ms on the 980Ti, higher than it was before).

I notice that almost 100% of this memory dependency comes from a single line of code (line #622),

[code]if(idx1d!=idx1dold && idx1dold>0 && mediaidold){[/code]


which now accounts for 1/3 of the total run time. In the assembly, almost 100% of the memory dependency comes from this single instruction:

[code]I2I.S32.S16 R57, R6;[/code]



I am attaching a screenshot of the PC sampling profiler output. The hotspots in both the source code (top-left) and the assembly code (top-right) are highlighted:

http://www.nmr.mgh.harvard.edu/~fangq/temp/mcx_memory_dependency.png

The variable mediaidold is a char (the label of the medium), read from the global memory array media[] on line #609. I suspect the I2I.S32.S16 instruction is for retrieving the value of mediaidold? Is there a document where I can read more about the assembly instructions?



PS: I just changed line #609 from mediaidold=media[idx1d]; to mediaidold=mediaid;, and MCX got a nice 40% speed improvement (jumping from 24k photon/ms to 34k photon/ms on Maxwell)! I guess this confirms my suspicion.

On the other hand, the improvement on Fermi and Kepler was not as exciting, only about 10%. For comparison, my 980Ti is now about 10x faster than one core of a 590 (Fermi). I wish I could see what happened on the older GPU architectures; unfortunately, the PC sampling profiler only runs on Maxwell.

#41
Posted 03/28/2016 04:39 PM   
I2I is an integer type conversion; the mnemonic means "integer to integer". Here, it converts a signed 16-bit integer to a signed 32-bit integer. It makes sense that you would see this instruction as part of a 'char' to 'int' conversion. This is not a particularly slow instruction, but if the source data comes directly from a load instruction, it may be stalled due to a memory dependency.

Best practice: every integer in a C/C++ program should be 'int', unless there is an exceedingly good reason for it to be of some other type.

C and C++ semantics require that in an expression, all integer data with a type narrower than 'int' is widened to 'int' before being incorporated into the computation. So use of narrow integer types can often decrease efficiency (the compiler may be able to work around some of that under the "as-if" rule).
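
A tiny sketch of what this looks like in practice (the kernels are made up; only the element type differs between them): with a 'char' array the loaded value typically has to be sign-extended to 'int' (the I2I seen in your profile) before it is used, whereas an 'int' array needs no conversion.

[code]
// made-up kernels; only the element type differs
__global__ void tally_char(const char *media, int idx1d, int *count)
{
    char mediaidold = media[idx1d];   // byte load ...
    if (mediaidold)                   // ... typically widened to int (I2I) here
        atomicAdd(count, 1);
}

__global__ void tally_int(const int *media, int idx1d, int *count)
{
    int mediaidold = media[idx1d];    // 32-bit load, no conversion needed
    if (mediaidold)
        atomicAdd(count, 1);
}
[/code]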

I don't see how a minor issue like this could contribute 1/3 of the runtime; that could be an artifact of the sampling profiler, which is a common risk of using a sampling approach.

You may want to look into the general efficiency of your global memory accesses. txbob already pointed out the general low efficiency of that, I think. The use of 'const __restrict__' pointer arguments may also allow for more aggressive re-ordering of loads, leading to better tolerance to high global memory access latency.
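
For the last point, the suggestion is simply to declare read-only kernel arguments like this (a generic sketch, not the actual MCX signature):

[code]
// generic sketch: 'in' is promised to be read-only and non-aliasing, which
// lets the compiler reorder/batch the loads (and may enable the read-only
// data cache path on sm_35 and later)
__global__ void scale(const float * __restrict__ in,
                      float * __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}
[/code]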

#42
Posted 03/28/2016 04:57 PM   
Usually when I see "trivial" code changes: [code]mediaidold=media[idx1d]; to mediaidold=mediaid;[/code] resulting in large speed improvements, I think about the effect on optimization. The classic example is when people try to debug/optimize by commenting things out. Eventually they comment out a "trivial" write to global memory and suddenly their function gets 1000x faster. "Why does this one line of code take 343ms ??" You can find questions like that all over the place. So I haven't studied this case, but I would also consider whether the code change in question allowed the compiler to optimize away some significant chunk of code, which no longer has any impact on global state. For example, does this change eliminate the dependency on a previous computation involving either idx1d or media[idx1d]? If so, this code change could result in that section of code/computation being dropped/skipped.
Usually when I see "trivial" code changes:

[code]mediaidold=media[idx1d];   -->   mediaidold=mediaid;[/code]

resulting in large speed improvements, I think about the effect on optimization. The classic example is when people try to debug/optimize by commenting things out. Eventually they comment out a "trivial" write to global memory and suddenly their function gets 1000x faster. "Why does this one line of code take 343ms ??" You can find questions like that all over the place.

So I haven't studied this case, but I would also consider whether the code change in question allowed the compiler to optimize away some significant chunk of code, which no longer has any impact on global state. For example, does this change eliminate the dependency on a previous computation involving either idx1d or media[idx1d]? If so, this code change could result in that section of code/computation being dropped/skipped.
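
A trivial sketch of that effect (placeholder code, not MCX):

[code]
// placeholder code, not MCX
__device__ float expensive(float x)          // stand-in for a long computation
{
    for (int k = 0; k < 1000; ++k)
        x = sinf(x) * 0.5f + 0.25f;
    return x;
}

__global__ void demo(const float *in, float *out, int i)
{
    float v = expensive(in[i]);   // kept only because of the store below
    out[i] = v;                   // remove this store (or make it unreachable)
                                  // and the whole call becomes dead code
}
[/code]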

#43
Posted 03/28/2016 05:23 PM   
For the original observation (the 10x perf difference from CUDA 6.5 to 7.5) discussed up through about comment 30 in this thread, the dev team seems to have narrowed it down to a particular compiler behavior. As njuffa previously surmised, the modification would be related to ptxas. I am not at liberty to describe it in detail at the moment, and confirmation (AFAIC) cannot be discussed until an actual updated driver appears (see below).

I can't discuss schedule for an updated ptxas with the proposed change at this time.

With respect to the related components in the driver, a future version of the r361 driver branch may appear that incorporates the proposed change. The proposed change has already been tested internally to demonstrate that it restores the performance that is currently "lost" with CUDA 7.5.

Thus it may be possible to test a future r361 driver by eliminating the SASS portion of the fatbinary, and allowing JIT to create the necessary SASS.
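
For example (file names are placeholders), a PTX-only build that forces the driver to JIT the SASS would look like:

[code]
# placeholder build line: embed PTX only (no SASS), so the installed driver's
# JIT generates the Maxwell machine code at load time
nvcc -O3 -o app app.cu -gencode arch=compute_52,code=compute_52
[/code]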

I'll update the thread when I have more details, but probably not until said r361 driver appears.

#44
Posted 03/28/2016 10:02 PM   
Thank you, txbob, for the update; I also appreciate the dev team's effort to quickly identify and fix the issue. I look forward to the new driver appearing.

#45
Posted 03/29/2016 02:40 AM   