Speedy CUDA tool to win EngineYard SHA1 contest
Engineyard is a company that deals with Ruby web apps, often with distributed cloud computing.

[url="http://www.engineyard.com/blog/2009/programming-contest-win-iphone-3gs-2k-cloud-credit/"]They announced a contest yesterday[/url], starting on the morning of July 20, and it lasts one day.
The basic idea is they give a 160 bit SHA1 hash value, and you have a limited dictionary and mixing rules to try to make a phrase which hashes as close to that target as possible. An exact match is likely impossible, but you just want to match more bits than anyone else.
The prizes are several new iPhones, and some $2000 of cloud compute credit.

The only practical attack to win is raw brute force... compute trillions of hash candidates and keep the best one.

Gee, I wonder if anyone on this forum knows of an accessible technology which has a lot of horsepower? [b]CUDA to the rescue![/b]

So I decided to spend this Saturday afternoon putting together a CUDA program to attack this exact contest. I could keep it private and try to win the contest myself, but it will be a lot more fun to let everyone use it! And we still have over a day to tweak the code if you want to start hacking.

I'm attaching source and compiled code here for a ready-to-run contest cracker! If you win the contest, the prize is yours to keep! (But please, tell us on the forum!)
This code is crude but it does power through over 50M hashes per second on a Quadro.. I bet a GTX 285 would do 70M.

It's run from the command line, so if you have multiple GPUs, you can run one instance on each GPU. (No time for threading niceties!) No time for a pretty GUI, either.

[b]The technical strategy[/b]

If you read the SHA1 algorithm, you'll see it expands the message into an 80-word schedule. Ideally we'd like every thread to compute its own hash, but there's not enough shared memory to give each thread a table that size. Luckily, the computation can be reordered so each thread only needs a rolling array of 16 words. At 192 threads per block (for register stall efficiency), that's 192*16*4 = 12KB.. which fits perfectly into our 16KB of shared memory.
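
Here's a minimal sketch of that trick (NOT the actual kernel; the THREADS constant and the interleaved shared-memory layout are just illustrative assumptions). W[t] only depends on W[t-3], W[t-8], W[t-14] and W[t-16], so a 16-word window indexed by (t & 15) is enough:
[code]
// Sketch only: one SHA-1 block compress with a rolling 16-word schedule per thread.
#define THREADS 192
__device__ unsigned int rotl(unsigned int x, int n) { return (x << n) | (x >> (32 - n)); }

__device__ void sha1_block_sketch(unsigned int state[5], const unsigned int msg[16])
{
    __shared__ unsigned int Wbuf[THREADS * 16];        // 192*16*4 = 12KB of the 16KB
    unsigned int *W = &Wbuf[threadIdx.x];              // this thread's slice, stride THREADS

    unsigned int a = state[0], b = state[1], c = state[2], d = state[3], e = state[4];
    #pragma unroll
    for (int t = 0; t < 80; ++t) {
        unsigned int wt;
        if (t < 16) {
            wt = msg[t];
            W[t * THREADS] = wt;
        } else {                                        // rolling recurrence overwrites slot t & 15
            wt = rotl(W[((t - 3) & 15) * THREADS] ^ W[((t - 8) & 15) * THREADS] ^
                      W[((t - 14) & 15) * THREADS] ^ W[(t & 15) * THREADS], 1);
            W[(t & 15) * THREADS] = wt;
        }
        unsigned int f, k;
        if      (t < 20) { f = (b & c) | (~b & d);          k = 0x5A827999; }
        else if (t < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1; }
        else if (t < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDC; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6; }
        unsigned int tmp = rotl(a, 5) + f + e + k + wt;
        e = d; d = c; c = rotl(b, 30); b = a; a = tmp;
    }
    state[0] += a; state[1] += b; state[2] += c; state[3] += d; state[4] += e;
}
[/code]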

The strategy is to have the CPU pick some initial words like "tokyo memcached" and then the GPU takes over and powers through 93^5 variations of that phrase by appending 5 characters to the end. (There are 93 printable ASCII characters, and we can append 5 of them.) Each block fixes the first three appended characters, and inside the block the threads cover the 93*93 combinations of the last two.
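
As an illustration of that partition (a made-up helper, not the real kernel code), the suffix characters decode from the block index and a per-thread pair index like this:
[code]
// Sketch: blockId picks the first three appended chars, pairIdx (built from
// threadIdx.x plus a per-thread loop) picks the last two. The '!'-based
// 93-character range is an assumption for illustration.
__device__ void decode_suffix(unsigned int blockId, unsigned int pairIdx, char suffix[5])
{
    suffix[0] = '!' + (blockId / (93 * 93)) % 93;   // blockId in 0 .. 93^3-1
    suffix[1] = '!' + (blockId / 93) % 93;
    suffix[2] = '!' +  blockId % 93;
    suffix[3] = '!' +  pairIdx / 93;                // pairIdx in 0 .. 93*93-1
    suffix[4] = '!' +  pairIdx % 93;
}
[/code]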

The host app doesn't do any work after setting up the base string (which takes a millisecond). The GPU compute takes about 2 minutes to try the 7 billion hashes. After the GPU finishes, the CPU picks a new base string and the GPU is launched again.
To keep the GPU from locking up the display, each kernel launch is timed. A launch does perhaps 400 blocks; if that takes a short time like 10 ms, the number of blocks is increased for the next launch. The goal is kernels of about 50 ms.. enough to keep your machine responsive, but not so short that you get efficiency losses due to idle SMs.
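
A rough, self-contained sketch of that self-tuning loop (stub kernel and made-up names, not the shipped code):
[code]
// Time each launch and scale the block count toward a ~50 ms kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void searchKernelStub(unsigned int firstBlock) { /* real hash search goes here */ }

int main()
{
    int blocks = 400;                                    // starting guess
    unsigned int firstBlock = 0;
    for (int launch = 0; launch < 100; ++launch) {       // the real tool loops until stopped
        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);

        cudaEventRecord(start);
        searchKernelStub<<<blocks, 192>>>(firstBlock);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start); cudaEventDestroy(stop);

        if (ms < 0.1f) ms = 0.1f;                        // guard against a near-zero timing
        blocks = (int)(blocks * 50.0f / ms);             // aim for ~50 ms per launch
        if (blocks < 60) blocks = 60;
        if (blocks > 93 * 93 * 93) blocks = 93 * 93 * 93; // can't exceed the 93^3 block space

        firstBlock += blocks;                            // advance through the search space
        printf("launch %d: %.1f ms, next size %d blocks\n", launch, ms, blocks);
    }
    return 0;
}
[/code]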


[b]The contest[/b]

This code here is working right now.. it's ready to use in the contest! But hopefully we can still tweak it before the start, so I invite all hackers to speed it up and find problems before then.

Even if you're not interested in hacking, you can still enter the contest! And if you win, the prize is yours. This is your chance to show off your 4-board GTX 295 monster! I'd like to show how much power our GPUs can harness. Please also let other people know that there's a tool for them to use.. hopefully we can make it easy enough for even non-hackers.


I am NOT involved with the contest or the company; I just found it via Twitter and thought it would be great to brute force on a GPU.
[url="http://forums.nvidia.com/index.php?showtopic=102228"]I see now that Profquail noticed the same contest![/url]

[b]Compiling and running[/b]

Easy! This is Windows, but it should work on Linux and Mac. I'm on my laptop now, so I haven't had a chance to try Linux or multi-GPU.

To compile, just:
[code] nvcc -I "E:\CUDA\common\inc" gpusha1.cu e:\cuda\common\lib\cutil32.lib -o gpusha1search.exe[/code]

To run, just open a shell and use a command like:
[code] gpusha1search.exe -device 0 6cac827bae250971a8b1fb6e2a96676f7a077b60 Cloud Ruby DHH one eight six active record controller data rspec mongrel MySQL postgresSQL tokyo MRI jruby rubinius memcached exception metaprogramming reflection[/code]


Attachments in the next post, let me zip them up with a reference to this thread.

#1
Posted 07/19/2009 07:28 AM   
Here is source and executable for Windows. I should be back home tomorrow to compile for Linux.
(A nice tip of the hat to CUDA.. this is evidence you can put together quite powerful supercomputing apps in ONE DAY.... and on a laptop!)


EDIT: updated to version 0.11
3X faster, from loop unrolling, smaller block work sizes.
Now actually sets the device for multi GPU like it's supposed to.
std:: prefix on vector and strings for gcc


EDIT: updated to version 0.12
Now uses two-block hashes (no speed cost) so it can have full 12 word strings
displayed hash speed now based on proper block work size of 64*64 not 93*93
small speed boost from final round simplification

EDIT: version 0.13
Minor bugfixes, compiler warnings, 1% speed boost by using a final reduction

EDIT: version 0.14
Minor tweaks. 1% speed boost (inner loop optimization).
Now prints link to forum for users to find discussion
192 thread variable now documented and at top of source

EDIT: version 0.15
Hack, for now, to dynamically switch to fewer threads on G80/G90. Using 128 threads on G80/G90 drops performance 25% though!

EDIT: version 0.16
Trivial changes.. cutDeleteTimer fixes leaked timer handle
-blocksize option to turn off dynamic block sizing and hardwire specific block size to optimize a specific machine/GPU performance

#2
Posted 07/19/2009 07:33 AM   
OK, past midnight here and time for sleep. I'll be traveling tomorrow morning so likely no updates until the afternoon.
Remember the contest starts on Monday, July 20 (I think it's at noon) and lasts until 6pm on Tuesday, July 21. So that's 30 hours to let [b]everyone's[/b] GPUs cook! And you may get a free iPhone if your GPU is lucky!

If someone could compile a Linux version (and/or Mac) that'd be great!

Also, please, I totally encourage hacking of the code. There's more that can be done. Several potential optimizations are listed in a comment at the end of the source code. The code is also BSD licensed so you could even use it later in any of your own projects, maybe if you're doing SHA1 computes.


I admit I want to show these Ruby cloud guys that they underestimate GPU power...

#3
Posted 07/19/2009 07:40 AM   
Just found this when googling for the contest. Thanks for posting it.

I had figured gpu computing would be perfect for this contest, but I don't know enough to implement it myself though, so this is great.

Two things:
- To compile on the mac I needed to add 'using namespace std;' to the top of the file.
- My interpretation of the rules is that you must select 12 words per trial string. This seems to use less than that.

On my MacBookPro4,1 (GeForce 8600M GT) I get about 6 M hashes/s, compared to about 2.5 M per core from my C++ implementation using the OpenSSL libraries on the 2.6 GHz CPU.

#4
Posted 07/19/2009 02:40 PM   
Here are some benchmarks of version 0.1 on my mishmash of different GPUs:

(This is peak. There is some fluctuation as the block sizing heuristic varies things)

GTX 275: 73.338 M/sec
GTX 280: 63.870 M/sec
GTX 285: 63.890 M/sec
GTX 295 (1 half): 62.642 M/sec
GTX 295 (both halves): (24.134 + 26.658) = 50.792 M/sec

(275 is in the Phenom 9900 2.6 GHz, 295 is in the Phenom II 3 GHz, and the 280 and the 285 are in a Core i7 2.67 GHz motherboard)


Definitely some peculiar performance results. Based on the GTX 295, it looks like this code is PCI-Express bandwidth bound. I'm not sure why the GTX 275 comes out the highest, because that computer ranks bottom of the pack in device-to-host bandwidth.

I haven't looked at the code in-depth yet, but I also agree with the previous poster that the contest spec says 12 words.

#5
Posted 07/19/2009 03:26 PM   
[quote name='crmoore' post='567173' date='Jul 19 2009, 07:40 AM']- To compile on the mac I needed to add 'using namespace std;' to the top of the file.
- My interpretation of the rules is that you must select 12 words per trial string. This seems to use less than that.

On my MacBookPro4,1 (GeForce 8600M GT) I get about 6 M hashes/s. Compared to my c++ implementation using openssl libraries getting 2.5 M. per core on the 2.6 GHz cpu.[/quote]

Thanks for the catch!

The std:: qualifier just needs to be added to the vector and string types.. I'll post the code updates later. I wonder why Windows nvcc didn't complain.

The 12 words per string rule: Wow! I totally missed that since it's not mentioned in the contest rule list, just up in the intro.
This is an annoyance.. not because picking exactly 12 words is hard, but because a 12-word string will almost certainly push the hash beyond one hash block, halving compute speed!

But the solution is easy.. we make SURE that the string is two blocks long. We compute the hash of the first block, which will be CONSTANT, and use it as the initializer for all 6 billion CUDA-computed hashes. Search speed should be unaffected.
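
Something like this (the symbol and helper names are just for illustration, not the tool's actual code):
[code]
// Hash the constant first 64-byte block once on the CPU, then give that
// 5-word midstate to every GPU thread as the starting point for block two.
#include <cuda_runtime.h>

void sha1_compress_host(unsigned int state[5], const unsigned char block[64]); // assumed CPU-side helper

__constant__ unsigned int d_midstate[5];                  // read by the search kernel

void uploadMidstate(const unsigned char firstBlock[64])
{
    unsigned int state[5] = { 0x67452301, 0xEFCDAB89,     // standard SHA-1 initial values
                              0x98BADCFE, 0x10325476, 0xC3D2E1F0 };
    sha1_compress_host(state, firstBlock);                // midstate after the constant block
    cudaMemcpyToSymbol(d_midstate, state, sizeof(state)); // kernel starts here instead of the IV
}
[/code]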

I'll post code (and windows/linux binaries) later today. crmoore, I hope you can post a Mac compile too.

6M hashes/sec is quite nice.. that's almost identical to the 6.5 Mhashes/sec I get on my laptop's old Quadro 570M, which has very similar specs.
The search is compute limited, not bandwidth limited, so it's mostly a product of core count and shader frequency.

crmoore, can you talk about your CPU version? Just curious what strategy you use for enumerating trials.

#6
Posted 07/19/2009 03:39 PM   
[quote]Definitely some peculiar performance results. Based on the GTX 295, it looks like this code is PCI-Express bandwidth bound. I'm not sure why the GTX 275 comes out the highest, because that computer ranks bottom of the pack in device-to-host bandwidth.[/quote]

EXCELLENT, thanks for the reports. I was especially interested in a 295's performance!

Wow! That's surprising.
The PCIe communication is pretty small; I never expected it to be a bottleneck!
It's [b]not[/b] PCIe bandwidth! Only one word per block gets sent back from the GPU to the CPU, so even for a fast 2 minute compute, that's only 200 MB/sec.

But it's likely easily solvable: we can have the GPU do more of the work on the card and not even report the best results every call.
We'll just accumulate a few million results on the device, then run a reduction kernel and send just one byte back.
This solves any bandwidth problem, any transaction overhead problem, and even any CPU overhead while scanning results.
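
Something like this hypothetical reduction kernel (the score encoding and names are assumptions, not the current code):
[code]
// Keep per-block best scores in device memory across many launches, then run one
// small reduction and copy a single winner back over PCIe.
__global__ void reduceBest(const unsigned int *blockBest, int n, unsigned int *winner)
{
    __shared__ unsigned int best[256];
    unsigned int myBest = 0;                        // higher score = more matching bits
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        myBest = max(myBest, blockBest[i]);
    best[threadIdx.x] = myBest;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // standard shared-memory tree reduction
        if (threadIdx.x < s)
            best[threadIdx.x] = max(best[threadIdx.x], best[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0) *winner = best[0];        // one value crosses the bus, not millions
}
// launched as: reduceBest<<<1, 256>>>(d_blockBest, numResults, d_winner);
[/code]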

#7
Posted 07/19/2009 03:46 PM   
[quote name='seibert' post='567179' date='Jul 19 2009, 08:26 AM'](This is peak. There is some fluctuation as the block sizing heuristic varies things)[/quote]

That might be another easy optimization. Maybe once we get to a rough block size, we can vary the block count within a range to find the highest performance.
For example, if your card has 30 SMs, it's likely that 117, 118, 119, or 120 blocks will be a lot more efficient than 121 blocks, since in that last case one SM gets an extra block to do while all the other SMs sit idle. It's only a rough estimate (CUDA scheduling is deliberately opaque!).
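
A tiny sketch of that heuristic (just the idea, not something in the current code):
[code]
// Round the adaptive block count down to a multiple of the SM count so the
// last scheduling wave doesn't leave most SMs idle.
#include <cuda_runtime.h>

int roundBlocksToSMs(int blocks, int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    int sm = prop.multiProcessorCount;   // e.g. 30 on GT200-class cards
    blocks -= blocks % sm;               // 121 -> 120 on a 30-SM card
    return blocks < sm ? sm : blocks;
}
[/code]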

I could also redesign the kernel to do less work per block to minimize the effect, but that's actually pretty annoying because of my choice to partition the work based on the 5 appended characters. One character per block is only 93 hashes. Two characters is what I use now, but it'd be nice if the per-block quantum were about 10 times smaller than 93*93 to hide that work-size granularity.

Fixable, but maybe not a big deal to attack in the 1 day remaining.

#8
Posted 07/19/2009 03:58 PM   
[quote name='SPWorley' post='567187' date='Jul 19 2009, 09:46 AM']Wow! That's surprising.
The PCIe communication is pretty small, I never expected it to be a bottleneck!
It's [b]not [/b] PCIe bandwidth! There's a result of just one word per block that's sent over from the GPU to the CPU.
so even for a fast 2 minute compute, that's only 200 MB/sec.[/quote]

Yeah, this sounded wrong to me after I wrote it. I just changed h_Best to pinned memory, and that didn't help. I shouldn't be surprised, since the Core i7 has pageable memory transfers that are nearly the same speed as pinned memory, yet gets the same speed as the rest of the devices.

My next theory was that it was actually transfer latency rather than bandwidth, so I tried switching h_Best and d_Best to use the CUDA 2.2 zero-copy feature. Also no change in performance on my main test system (the GTX 275). I extended the target kernel time from 40 ms to 100 ms, and also no change.

Some bottleneck is holding things back that has no correlation with shader clock rate, device memory bandwidth, or pageable memory performance... Very weird. :)

#9
Posted 07/19/2009 04:16 PM   
Well, 73M/sec is pretty low for a GTX 275. I get about 165M/s on a GTX 260/192SP. My CUDA code wasn't made for the EngineYard contest; it's a general SHA1/MD4/MD5 hash brute-forcer. You can get a win32 executable for tests from [url="http://www.golubev.com/hashgpu.htm"]http://www.golubev.com/hashgpu.htm[/url].

I have several suggestions for you though:
1. You don't need any shared memory for SHA1. It uses only 16 DWORDs (+5 for the hash state), not 80.
2. Unroll everything in the SHA1_Transform function.
3. In fact, better to just use the OpenSSL SHA1 implementation, as it's already heavily optimized :).
4. Remove all divisions. There's no need for the full 93-symbol charset (and btw, with 93 random chars it's possible to produce some "bad" word which isn't allowed by the contest's rules ;)); reduce it to 64/32 chars, so it's possible to just shift & bitwise-and instead of dividing (see the sketch below).

5. ATI's GPUs are still better for such (32-bit ints only) computations. Atm I get about 600M hashes/sec with an HD4850+HD4770 config, while the (estimated) peak for a GTX 295 is only 415M hashes/sec.
6. It's still pure luck to find the winning hash -- people are getting length == 34 on a CPU while I'm stuck at 36 with SHA1 code that's about 100 times faster. It's really a lottery: having 100 tickets is better than having just one, but do you really believe you can win with 100 tickets when you'd need 2^80 of them to cover everything? ;)
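
A quick sketch of what suggestion 4's shift-and-mask decode could look like (the charset contents are illustrative, not from the tool):
[code]
// With a 64-entry charset each appended character is 6 bits of the linear
// index, so the decode is shifts and masks instead of integer divisions.
__constant__ char charset64[64];           // e.g. A-Z a-z 0-9 '.' '-', filled by the host

__device__ void decode_suffix64(unsigned int idx, char suffix[5])
{
    #pragma unroll
    for (int i = 0; i < 5; ++i) {
        suffix[i] = charset64[idx & 63];   // low 6 bits pick a character
        idx >>= 6;                         // no division needed
    }
}
[/code]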

#10
Posted 07/19/2009 04:26 PM   
Hey, I'm glad to see that someone else is going to take advantage of CUDA for this contest! I was actually just getting ready to sit down and write my kernel for the contest tomorrow. If I can get something good going, I will post my code up here so we can compare and contrast before tomorrow morning and get the best performance that we can (in a day, anyway).

#11
Posted 07/19/2009 04:31 PM   
Hah! OK, found the bug which invalidates some of my previous benchmarks. This code never calls cudaSetDevice(). So it always uses device 0 regardless of the command line option. That's why the GTX 295 benchmark came out weird.

Here are the new results for ver 0.1 + cudaSetDevice:

GTX 275: 73 M/sec
GTX 280: 64 M/sec
GTX 285: 75 M/sec
GTX 295 (1 half): 63 M/sec
GTX 295 (both halves): (62.624 + 62.627) = 125 M/sec

This totally changes my conclusions. This looks like it is definitely shader bound, like you would expect. All of these devices have 30 SMs, so if you divide by the shader clock in each case:

GTX 275: 73 / 1404 = 0.052
GTX 280: 64 / 1296 = 0.049
GTX 285: 75 / 1476 = 0.051
GTX 295: 63 / 1242 = 0.051

You see the shader clock is a near perfect predictor of performance. So I would stop fussing with on-card reductions. :)

#12
Posted 07/19/2009 04:34 PM   
Wow, empty_knapsack is right about the loop unrolling. nvcc isn't doing it by default (?!) despite the loops all having fixed # of iterations.

Now with ver 0.1 + cudaSetDevice + #pragma unroll:

GTX 285 = 180 M/sec (2.4x faster!)

#13
Posted 07/19/2009 04:47 PM   
OK, I've doubled the speed by a small tweak to the scheduling!
Still more to investigate.

#14
Posted 07/19/2009 04:48 PM   
[quote name='seibert' post='567205' date='Jul 19 2009, 09:47 AM']Wow, empty_knapsack is right about the loop unrolling. nvcc isn't doing it by default (?!) despite the loops all having fixed # of iterations.

Now with ver 0.1 + cudaSetDevice + #pragma unroll:

GTX 285 = 180 M/sec (2.4x faster!)[/quote]
Wow! That's surprising; I also expected it to be unrolled.

My scheduling tweak plus the unroll are independent.. now at 24 Mhashes/sec, up from 7.0. So over 3 times faster than version 0.10.

#15
Posted 07/19/2009 04:51 PM   