CUDA Cycle-Accurate Simulator: Is there one?
I'm writing some highly optimised code for a 4000-word essay for school, but I don't own a CUDA-enabled card (and the one I need to test on is a $500 GTX 580, a bit expensive for a student).

I was wondering if there is a cycle-accurate simulator for CUDA that would let me debug, test and benchmark my code (obviously in 'slow motion', as it were).

While I expect such a thing to be 100-500 times slower than the real deal, it would be really handy if one existed. My code should only take 10 real seconds to run, so waiting an hour for that may be boring, but it sure beats $500.

Ultimate Gaming Rig:

Dell Latitude XT2

Windows 7 64bit

Intel Core 2 Duo U9600 1.6 GHz

3GB DDR3 1200MHz underclocked to 800 MHz (YAY DELL!)

Intel GMA4500MHD

156GB SATAII 5400RPM HDD

Cold Boot Time: 12 Seconds to desktop (Take that Lenovo with i5 + SSD & 40 second boot)

#1
Posted 01/30/2012 03:26 PM   
gpuocelot simulates at the PTX code level.

Barra simulates at the G80 (i.e. NVIDIA GeForce 8800 GTX) instruction level.

Both are open-source projects, but Barra seems more incomplete (GPU-feature-wise) and a bit orphaned.

#2
Posted 01/30/2012 04:28 PM   
Neither one is cycle-accurate in any way. (Also, a cycle-accurate simulator would be at least thousands of times slower than hardware, probably a lot more than that.)

#3
Posted 01/30/2012 06:34 PM   
[quote name='cbuchner1' date='30 January 2012 - 06:28 PM' timestamp='1327940913' post='1362596']
Both are open source projects, but Barra seems more incomplete (GPU feature-wise) and is a bit orphaned.
[/quote]

Sorry to hear that. ;)

Actually, it is still actively being developed, even though we do not publicize it much. We now have a decent timing model of the SM. It is by no means cycle-accurate (that would require far more reverse engineering than is reasonable), but it is already usable for hardware-feature experimentation and design-space exploration.
We focus on improving the accuracy of the simulation and supporting more micro-architectural features, rather than keeping up with the latest CUDA versions and supporting new programmer-visible features (read: no CUDA Toolkit >3.0, no CC 2.x).

If somebody is interested in beta-testing the new timing model, please drop me a line. We always appreciate feedback.

As for simulation time, an instruction-level functional simulator like Barra or Ocelot's emulator is ~1000 times slower than hardware. For cycle-accurate simulation, add 2 or 3 more orders of magnitude...
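To put those slowdown factors in concrete terms for the 10-second workload mentioned in the original question, here is a back-of-the-envelope sketch (the slowdown factors are the rough estimates quoted in this thread, not measurements):

```python
# Back-of-the-envelope: wall-clock cost of simulating a 10 s GPU workload,
# using the slowdown estimates quoted in this thread
gpu_seconds = 10

functional_slowdown = 1_000            # instruction-level emulation: ~10^3x
cycle_accurate_slowdown = 1_000 * 100  # add ~2 more orders of magnitude

functional_hours = gpu_seconds * functional_slowdown / 3600
cycle_accurate_days = gpu_seconds * cycle_accurate_slowdown / 86400

print(f"functional: ~{functional_hours:.1f} hours")        # ~2.8 hours
print(f"cycle-accurate: ~{cycle_accurate_days:.1f} days")  # ~11.6 days
```

So functional simulation of this workload is an overnight job, while true cycle-accurate simulation would run for well over a week.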
#4
Posted 01/30/2012 07:09 PM   
Thanks for the information :) The assignment I'm doing can afford a one-week simulation time if needed.

I'll post back after giving both of these a try.


#5
Posted 02/01/2012 02:47 PM   