Speedy CUDA tool to win EngineYard SHA1 contest
Engineyard is a company that deals with Ruby web apps, often with distributed cloud computing.

[url="http://www.engineyard.com/blog/2009/programming-contest-win-iphone-3gs-2k-cloud-credit/"]They announced a contest yesterday[/url], starting on the morning of July 20, and it lasts one day.
The basic idea is they give a 160 bit SHA1 hash value, and you have a limited dictionary and mixing rules to try to make a phrase which hashes as close to that target as possible. An exact match is likely impossible, but you just want to match more bits than anyone else.
The prizes are several new iPhones, and some $2000 of cloud compute credit.

The only practical attack to win is raw brute force... compute trillions of hash candidates and keep the best one.

Gee, I wonder if anyone on this forum knows of an accessible technology which has a lot of horsepower? [b]CUDA to the rescue![/b]

So I decided to spend this Saturday afternoon putting together a CUDA program to attack this exact contest. I could keep it private and try to win the contest myself, but it will be a lot more fun to let everyone use it! And we still have over a day to tweak the code if you want to start hacking.

I'm attaching source and compiled code here for a ready-to-run contest cracker! If you win the contest, the prize is yours to keep! (But please, tell us on the forum!)
This code is crude but it does power through over 50M hashes per second on a Quadro.. I bet a GTX 285 would do 70M.

It's run from the command line, so if you have multiple GPUs, you can run one instance on each GPU. (No time for threading niceties!) No time for a pretty GUI, either.

[b]The technical strategy[/b]

If you read the SHA1 algorithm, you'll see it expands the message into an 80-word schedule. Ideally we'd like every thread to compute its own hash, but there's not enough shared memory to give each thread a table that size. Luckily, the computation can be reordered so each thread only needs a rolling array of 16 words. At 192 threads per block (for register stall efficiency), that's 192*16*4 = 12KB.. which fits perfectly into our 16KB of shared memory.
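
Here's a minimal sketch of that trick (NOT the actual kernel; the THREADS constant and the interleaved shared-memory layout are just illustrative assumptions). W[t] only depends on W[t-3], W[t-8], W[t-14] and W[t-16], so a 16-word window indexed by (t & 15) is enough:
[code]
// Sketch only: one SHA-1 block compress with a rolling 16-word schedule per thread.
#define THREADS 192
__device__ unsigned int rotl(unsigned int x, int n) { return (x << n) | (x >> (32 - n)); }

__device__ void sha1_block_sketch(unsigned int state[5], const unsigned int msg[16])
{
    __shared__ unsigned int Wbuf[THREADS * 16];        // 192*16*4 = 12KB of the 16KB
    unsigned int *W = &Wbuf[threadIdx.x];              // this thread's slice, stride THREADS

    unsigned int a = state[0], b = state[1], c = state[2], d = state[3], e = state[4];
    #pragma unroll
    for (int t = 0; t < 80; ++t) {
        unsigned int wt;
        if (t < 16) {
            wt = msg[t];
            W[t * THREADS] = wt;
        } else {                                        // rolling recurrence overwrites slot t & 15
            wt = rotl(W[((t - 3) & 15) * THREADS] ^ W[((t - 8) & 15) * THREADS] ^
                      W[((t - 14) & 15) * THREADS] ^ W[(t & 15) * THREADS], 1);
            W[(t & 15) * THREADS] = wt;
        }
        unsigned int f, k;
        if      (t < 20) { f = (b & c) | (~b & d);          k = 0x5A827999; }
        else if (t < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1; }
        else if (t < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDC; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6; }
        unsigned int tmp = rotl(a, 5) + f + e + k + wt;
        e = d; d = c; c = rotl(b, 30); b = a; a = tmp;
    }
    state[0] += a; state[1] += b; state[2] += c; state[3] += d; state[4] += e;
}
[/code]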

The strategy is to have the CPU pick some initial words like "tokyo memcached" and then the GPU takes over and powers through 93^5 variations of that phrase by appending 5 characters to the end. (There are 93 printable ASCII characters, and we can append 5 of them.) Each block fixes the first three appended characters, and inside the block the threads cover the 93*93 combinations of the last two.
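
As an illustration of that partition (a made-up helper, not the real kernel code), the suffix characters decode from the block index and a per-thread pair index like this:
[code]
// Sketch: blockId picks the first three appended chars, pairIdx (built from
// threadIdx.x plus a per-thread loop) picks the last two. The '!'-based
// 93-character range is an assumption for illustration.
__device__ void decode_suffix(unsigned int blockId, unsigned int pairIdx, char suffix[5])
{
    suffix[0] = '!' + (blockId / (93 * 93)) % 93;   // blockId in 0 .. 93^3-1
    suffix[1] = '!' + (blockId / 93) % 93;
    suffix[2] = '!' +  blockId % 93;
    suffix[3] = '!' +  pairIdx / 93;                // pairIdx in 0 .. 93*93-1
    suffix[4] = '!' +  pairIdx % 93;
}
[/code]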

The host app doesn't do any work after setting up the base string (which takes a millisecond). The GPU compute takes about 2 minutes to try the 7 billion hashes. After the GPU finishes, the CPU picks a new base string and the GPU is launched again.
To keep the GPU from locking up the display, each kernel launch is timed. A launch does perhaps 400 blocks; if that takes a short time like 10 ms, the number of blocks is increased for the next launch. The goal is kernels of about 50 ms.. enough to keep your machine responsive, but not so short that you get efficiency losses due to idle SMs.
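
A rough, self-contained sketch of that self-tuning loop (stub kernel and made-up names, not the shipped code):
[code]
// Time each launch and scale the block count toward a ~50 ms kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void searchKernelStub(unsigned int firstBlock) { /* real hash search goes here */ }

int main()
{
    int blocks = 400;                                    // starting guess
    unsigned int firstBlock = 0;
    for (int launch = 0; launch < 100; ++launch) {       // the real tool loops until stopped
        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);

        cudaEventRecord(start);
        searchKernelStub<<<blocks, 192>>>(firstBlock);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start); cudaEventDestroy(stop);

        if (ms < 0.1f) ms = 0.1f;                        // guard against a near-zero timing
        blocks = (int)(blocks * 50.0f / ms);             // aim for ~50 ms per launch
        if (blocks < 60) blocks = 60;
        if (blocks > 93 * 93 * 93) blocks = 93 * 93 * 93; // can't exceed the 93^3 block space

        firstBlock += blocks;                            // advance through the search space
        printf("launch %d: %.1f ms, next size %d blocks\n", launch, ms, blocks);
    }
    return 0;
}
[/code]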


[b]The contest[/b]

This code here is working right now.. it's ready to use in the contest! But hopefully we can still tweak it before the start, so I invite all hackers to speed it up and find problems before then.

Even if you're not interested in hacking, you can still enter the contest! And if you win, the prize is yours. This is your chance to show off your 4-board GTX 295 monster! I'd like to show how much power our GPUs can harness. Please also let other people know that there's a tool for them to use.. hopefully we can make it easy enough for even non-hackers.


I am NOT involved with the contest or the company; I just found it via Twitter and thought it would be great to brute force on a GPU.
[url="http://forums.nvidia.com/index.php?showtopic=102228"]I see now that Profquail noticed the same contest![/url]

[b]Compiling and running[/b]

Easy! This is Windows, but it should work on Linux and Mac. I'm on my laptop now, so I haven't had a chance to try Linux or multi-GPU.

To compile, just:
[code] nvcc -I "E:\CUDA\common\inc" gpusha1.cu e:\cuda\common\lib\cutil32.lib -o gpusha1search.exe[/code]

To run, just open a shell and use a command like:
[code] gpusha1search.exe -device 0 6cac827bae250971a8b1fb6e2a96676f7a077b60 Cloud Ruby DHH one eight six active record controller data rspec mongrel MySQL postgresSQL tokyo MRI jruby rubinius memcached exception metaprogramming reflection[/code]


Attachments in the next post, let me zip them up with a reference to this thread.

#1
Posted 07/19/2009 07:28 AM   
Here is source and executable for Windows. I should be back home tomorrow to compile for Linux.
(A nice tip of the hat to CUDA.. this is evidence you can put together quite powerful supercomputing apps in ONE DAY.... and on a laptop!)


EDIT: updated to version 0.11
3X faster, from loop unrolling, smaller block work sizes.
Now actually sets the device for multi GPU like it's supposed to.
std:: prefix on vector and strings for gcc


EDIT: updated to version 0.12
Now uses two-block hashes (no speed cost) so it can have full 12 word strings
displayed hash speed now based on proper block work size of 64*64 not 93*93
small speed boost from final round simplification

EDIT: version 0.13
Minor bugfixes, compiler warnings, 1% speed boost by using a final reduction

EDIT: version 0.14
Minor tweaks. 1% speed boost (inner loop optimization).
Now prints link to forum for users to find discussion
192 thread variable now documented and at top of source

EDIT: version 0.15
Hack, for now, to dynamically switch to fewer threads on G80/G90. Using 128 threads on G80/G90 drops performance 25% though!

EDIT: version 0.16
Trivial changes.. cutDeleteTimer fixes leaked timer handle
-blocksize option to turn off dynamic block sizing and hardwire specific block size to optimize a specific machine/GPU performance

#2
Posted 07/19/2009 07:33 AM   
OK, past midnight here and time for sleep. I'll be traveling tomorrow morning so likely no updates until the afternoon.
Remember the contest starts on Monday, July 20 (I think it's at noon) and lasts until 6pm on Tuesday, July 21. So that's 30 hours to let [b]everyone's[/b] GPUs cook! And you may get a free iPhone if your GPU is lucky!

If someone could compile a Linux version (and/or Mac) that'd be great!

Also, please, I totally encourage hacking of the code. There's more that can be done. Several potential optimizations are listed in a comment at the end of the source code. The code is also BSD licensed so you could even use it later in any of your own projects, maybe if you're doing SHA1 computes.


I admit I want to show these Ruby cloud guys that they underestimate GPU power...

#3
Posted 07/19/2009 07:40 AM   
Just found this when googling for the contest. Thanks for posting it.

I had figured gpu computing would be perfect for this contest, but I don't know enough to implement it myself though, so this is great.

Two things:
- To compile on the mac I needed to add 'using namespace std;' to the top of the file.
- My interpretation of the rules is that you must select 12 words per trial string. This seems to use less than that.

On my MacBookPro4,1 (GeForce 8600M GT) I get about 6 M hashes/s, compared to about 2.5 M per core from my C++ implementation using the OpenSSL libraries on the 2.6 GHz CPU.

#4
Posted 07/19/2009 02:40 PM   
Here are some benchmarks of version 0.1 on my mishmash of different GPUs:

(This is peak. There is some fluctuation as the block sizing heuristic varies things)

GTX 275: 73.338 M/sec
GTX 280: 63.870 M/sec
GTX 285: 63.890 M/sec
GTX 295 (1 half): 62.642 M/sec
GTX 295 (both halves): (24.134 + 26.658) = 50.792 M/sec

(275 is in the Phenom 9900 2.6 GHz, 295 is in the Phenom II 3 GHz, and the 280 and the 285 are in a Core i7 2.67 GHz motherboard)


Definitely some peculiar performance results. Based on the GTX 295, it looks like this code is PCI-Express bandwidth bound. I'm not sure why the GTX 275 comes out the highest, because that computer ranks bottom of the pack in device-to-host bandwidth.

I haven't looked at the code in-depth yet, but I also agree with the previous poster that the contest spec says 12 words.

#5
Posted 07/19/2009 03:26 PM   
[quote name='crmoore' post='567173' date='Jul 19 2009, 07:40 AM']- To compile on the mac I needed to add 'using namespace std;' to the top of the file.
- My interpretation of the rules is that you must select 12 words per trial string. This seems to use less than that.

On my MacBookPro4,1 (GeForce 8600M GT) I get about 6 M hashes/s. Compared to my c++ implementation using openssl libraries getting 2.5 M. per core on the 2.6 GHz cpu.[/quote]

Thanks for the catch!

The std:: qualifier just needs to be added to the vector and string types.. I'll post the code updates later. I wonder why Windows nvcc didn't complain.

The 12 words per string rule: Wow! I totally missed that since it's not mentioned in the contest rule list, just up in the intro.
This is an annoyance.. not because picking exactly 12 words is hard, but because a 12-word string will almost certainly push the hash beyond one hash block, halving compute speed!

But the solution is easy.. we make SURE that the string is two blocks long. We compute the hash of the first block, which will be CONSTANT, and use it as the initializer for all 6 billion CUDA-computed hashes. Search speed should be unaffected.
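
Something like this (the symbol and helper names are just for illustration, not the tool's actual code):
[code]
// Hash the constant first 64-byte block once on the CPU, then give that
// 5-word midstate to every GPU thread as the starting point for block two.
#include <cuda_runtime.h>

void sha1_compress_host(unsigned int state[5], const unsigned char block[64]); // assumed CPU-side helper

__constant__ unsigned int d_midstate[5];                  // read by the search kernel

void uploadMidstate(const unsigned char firstBlock[64])
{
    unsigned int state[5] = { 0x67452301, 0xEFCDAB89,     // standard SHA-1 initial values
                              0x98BADCFE, 0x10325476, 0xC3D2E1F0 };
    sha1_compress_host(state, firstBlock);                // midstate after the constant block
    cudaMemcpyToSymbol(d_midstate, state, sizeof(state)); // kernel starts here instead of the IV
}
[/code]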

I'll post code (and windows/linux binaries) later today. crmoore, I hope you can post a Mac compile too.

6M hashes/sec is quite nice.. that's almost identical to the 6.5 Mhashes/sec I get on my laptop's old Quadro 570M, which has very similar specs.
The search is compute limited, not bandwidth limited, so it's mostly a product of core count and shader frequency.

crmoore, can you talk about your CPU version? Just curious what strategy you use for enumerating trials.

#6
Posted 07/19/2009 03:39 PM   
[quote]Definitely some peculiar performance results. Based on the GTX 295, it looks like this code is PCI-Express bandwidth bound. I'm not sure why the GTX 275 comes out the highest, because that computer ranks bottom of the pack in device-to-host bandwidth.[/quote]

EXCELLENT, thanks for the reports. I was especially interested in a 295's performance!

Wow! That's surprising.
The PCIe communication is pretty small; I never expected it to be a bottleneck!
It's [b]not[/b] PCIe bandwidth! Only one word per block gets sent back from the GPU to the CPU, so even for a fast 2 minute compute, that's only 200 MB/sec.

But it's likely easily solvable: we can have the GPU do more of the work on the card and not even report the best results every call.
We'll just accumulate a few million results on the device, then run a reduction kernel and send just one byte back.
This solves any bandwidth problem, any transaction overhead problem, and even any CPU overhead while scanning results.
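
Something like this hypothetical reduction kernel (the score encoding and names are assumptions, not the current code):
[code]
// Keep per-block best scores in device memory across many launches, then run one
// small reduction and copy a single winner back over PCIe.
__global__ void reduceBest(const unsigned int *blockBest, int n, unsigned int *winner)
{
    __shared__ unsigned int best[256];
    unsigned int myBest = 0;                        // higher score = more matching bits
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        myBest = max(myBest, blockBest[i]);
    best[threadIdx.x] = myBest;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // standard shared-memory tree reduction
        if (threadIdx.x < s)
            best[threadIdx.x] = max(best[threadIdx.x], best[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0) *winner = best[0];        // one value crosses the bus, not millions
}
// launched as: reduceBest<<<1, 256>>>(d_blockBest, numResults, d_winner);
[/code]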

#7
Posted 07/19/2009 03:46 PM   
[quote name='seibert' post='567179' date='Jul 19 2009, 08:26 AM'](This is peak. There is some fluctuation as the block sizing heuristic varies things)[/quote]

That might be another easy optimization. Maybe once we get to a rough block size, we can vary the block count within a range to find the highest performance.
For example, if your card has 30 SMs, it's likely that 117, 118, 119, or 120 blocks will be a lot more efficient than 121 blocks, since in that last case one SM gets an extra block to do while all the other SMs sit idle. It's only a rough estimate (CUDA scheduling is deliberately opaque!).
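
A tiny sketch of that heuristic (just the idea, not something in the current code):
[code]
// Round the adaptive block count down to a multiple of the SM count so the
// last scheduling wave doesn't leave most SMs idle.
#include <cuda_runtime.h>

int roundBlocksToSMs(int blocks, int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    int sm = prop.multiProcessorCount;   // e.g. 30 on GT200-class cards
    blocks -= blocks % sm;               // 121 -> 120 on a 30-SM card
    return blocks < sm ? sm : blocks;
}
[/code]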

I could also redesign the kernel to do less work per block to minimize the effect, but that's actually pretty annoying because of my choice to partition the work based on the 5 appended characters. One character per block is only 93 hashes. Two characters is what I use now, but it'd be nice if the per-block quantum were about 10 times smaller than 93*93 to hide that work-size granularity.

Fixable, but maybe not a big deal to attack in the 1 day remaining.

#8
Posted 07/19/2009 03:58 PM   
[quote name='SPWorley' post='567187' date='Jul 19 2009, 09:46 AM']Wow! That's surprising.
The PCIe communication is pretty small, I never expected it to be a bottleneck!
It's [b]not [/b] PCIe bandwidth! There's a result of just one word per block that's sent over from the GPU to the CPU.
so even for a fast 2 minute compute, that's only 200 MB/sec.[/quote]

Yeah, this sounded wrong to me after I wrote it. I just changed h_Best to pinned memory, and that didn't help. I shouldn't be surprised, since the Core i7 has pageable memory transfers that are nearly the same speed as pinned memory, yet gets the same speed as the rest of the devices.

My next theory was that it was actually transfer latency rather than bandwidth, so I tried switching h_Best and d_Best to use the CUDA 2.2 zero-copy feature. Also no change in performance on my main test system (the GTX 275). I extended the target kernel time from 40 ms to 100 ms, and also no change.

Some bottleneck is holding things back that has no correlation with shader clock rate, device memory bandwidth, or pageable memory performance... Very weird. :)

#9
Posted 07/19/2009 04:16 PM   
Well, 73M/sec is pretty low for a GTX 275. I get about 165M/s on a GTX 260/192SP. My CUDA code wasn't made for the EngineYard contest; it's a general SHA1/MD4/MD5 hash brute-forcer. You can get a win32 executable for tests from [url="http://www.golubev.com/hashgpu.htm"]http://www.golubev.com/hashgpu.htm[/url].

I have several suggestions for you though:
1. You don't need any shared memory for SHA1. It uses only 16 DWORDs (+5 for the hash state), not 80.
2. Unroll everything in the SHA1_Transform function.
3. In fact, better to just use the OpenSSL SHA1 implementation, as it's already heavily optimized :).
4. Remove all divisions. There's no need for the full 93-symbol charset (and btw, with 93 random chars it's possible to produce some "bad" word which isn't allowed by the contest's rules ;)); reduce it to 64/32 chars, so it's possible to just shift & bitwise-and instead of dividing (see the sketch below).

5. ATI's GPUs are still better for such (32-bit ints only) computations. Atm I get about 600M hashes/sec with an HD4850+HD4770 config, while the (estimated) peak for a GTX 295 is only 415M hashes/sec.
6. It's still pure luck to find the winning hash -- people are getting length == 34 on a CPU while I'm stuck at 36 with SHA1 code that's about 100 times faster. It's really a lottery: having 100 tickets is better than having just one, but do you really believe you can win with 100 tickets when you'd need 2^80 of them to cover everything? ;)
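
A quick sketch of what suggestion 4's shift-and-mask decode could look like (the charset contents are illustrative, not from the tool):
[code]
// With a 64-entry charset each appended character is 6 bits of the linear
// index, so the decode is shifts and masks instead of integer divisions.
__constant__ char charset64[64];           // e.g. A-Z a-z 0-9 '.' '-', filled by the host

__device__ void decode_suffix64(unsigned int idx, char suffix[5])
{
    #pragma unroll
    for (int i = 0; i < 5; ++i) {
        suffix[i] = charset64[idx & 63];   // low 6 bits pick a character
        idx >>= 6;                         // no division needed
    }
}
[/code]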

#10
Posted 07/19/2009 04:26 PM   
Hey, I'm glad to see that someone else is going to take advantage of CUDA for this contest! I was actually just getting ready to sit down and write my kernel for the contest tomorrow. If I can get something good going, I will post my code up here so we can compare and contrast before tomorrow morning and get the best performance that we can (in a day, anyway).

#11
Posted 07/19/2009 04:31 PM   
Hah! OK, found the bug which invalidates some of my previous benchmarks. This code never calls cudaSetDevice(). So it always uses device 0 regardless of the command line option. That's why the GTX 295 benchmark came out weird.

Here are the new results for ver 0.1 + cudaSetDevice:

GTX 275: 73 M/sec
GTX 280: 64 M/sec
GTX 285: 75 M/sec
GTX 295 (1 half): 63 M/sec
GTX 295 (both halves): (62.624 + 62.627) = 125 M/sec

This totally changes my conclusions. This looks like it is definitely shader bound, like you would expect. All of these devices have 30 SMs, so if you divide by the shader clock in each case:

GTX 275: 73 / 1404 = 0.052
GTX 280: 64 / 1296 = 0.049
GTX 285: 75 / 1476 = 0.051
GTX 295: 63 / 1242 = 0.051

You see the shader clock is a near perfect predictor of performance. So I would stop fussing with on-card reductions. :)

#12
Posted 07/19/2009 04:34 PM   
Wow, empty_knapsack is right about the loop unrolling. nvcc isn't doing it by default (?!) despite the loops all having fixed # of iterations.

Now with ver 0.1 + cudaSetDevice + #pragma unroll:

GTX 285 = 180 M/sec (2.4x faster!)

#13
Posted 07/19/2009 04:47 PM   
OK, I've doubled the speed by a small tweak to the scheduling!
Still more to investigate.

#14
Posted 07/19/2009 04:48 PM   
[quote name='seibert' post='567205' date='Jul 19 2009, 09:47 AM']Wow, empty_knapsack is right about the loop unrolling. nvcc isn't doing it by default (?!) despite the loops all having fixed # of iterations.

Now with ver 0.1 + cudaSetDevice + #pragma unroll:

GTX 285 = 180 M/sec (2.4x faster!)[/quote]
Wow! That's surprising; I also expected it to be unrolled.

My scheduling tweak plus the unroll are independent.. now at 24 Mhashes/sec, up from 7.0. So over 3 times faster than version 0.10.

#15
Posted 07/19/2009 04:51 PM   