MD5 hash on a GPU: Implementation and benchmarks

Hi all,
We’ve recently had an interesting discussion at a local HPC Users Group meeting, on how (in theory) GPUs should be perfect for a range of cryptanalysis/cryptography applications. So last weekend I wrote a simple CUDA MD5 hash computation routine (based on RSA’s Md5.c source) to test it out in practice.

The results were pretty good, with speedups (when run on 8800 Ultra) of ~36x over a single core of a Q6700 Core Duo (or ~9x over all four cores, assuming linear scaling). This is a pretty nice gain for an unoptimized piece of code written in 1.5 days by a curious astrophysicist :-).

In absolute terms, the GPU was able to compute ~110 million MD5 hashes per second, while it was even faster (~160 Mhash/sec) when the results weren’t stored to global memory but instead compared against a “target” hash and discarded if no match was found (“search mode”). In laymen terms, this means a single 8800 Ultra can brute-force break an MD5 hashed password of eight or less alphanumerics+numbers (A-Z, a-z, 0-9) in about ~16 days. Kind of scary.

A short writeup, together with an exploration (and pretty plots!) of how the performance depends on the block size, number of blocks, threads, etc., can be found at [url=“http://majuric.org/software/cudamd5/”]http://majuric.org/software/cudamd5/[/url]. The source code is there as well. Feel free to download/use/modify it.

PS: If you download the code, I’d be especially interested to hear how it fares on other GPUs (especially the G92 series). Just run ‘make benchmark’, and e-mail me the results!

Cheers,

isn’t the 8800GTS a G92 core?

Very interesting btw, that 63 threads per block can be optimal.

I am afraid I have quite some code that is coded for a specific number of threads per block, seeing this I think I will have try to make some of my kernels that use shared memory based on the number of threads more generic in this respect and do some extra benchmarking.

mjuric,
Nice job!

However, it’s all been done before, check this for little more information and performance figures.

Not if it is the 8800 GTS with 320 or 640 MB of memory. Those cards came out shortly after the 8800 GTX and use the G80 core. Someone at NVIDIA decided it would be OK to reuse the model name for the revised card. You have to either write “8800 GTS (G92)” or “8800 GTS 512 MB” (or 1 GB) to be clear you are talking about the newer card.

AndreiB: Yes, I became aware of Elcomsoft’s work halfway through coding mine. But I finished it out of academic curiosity & for tinkering value (I couldn’t any actual CUDA MD5 sourcecode that I could compile and play with myself).
seibert: My GTS is the old one (it’s actually a QuadroFX 4600).

PS: I made performance vs. threads per block plots for 8800 GTS 512 (a G92 card; data courtesy of Ales Koval). I’m surprised by substantial the change in ‘calc’ test results, compared to the old 8800 GTS (and little difference in the ‘search’ test). Given that the main difference between the two modes is reading/writing to global memory, the G92s seem to be much better at memory access management than the old G80ies.

Hey mjuric, thanks for sharing your code!

haven’t tested yet but looking forward to it!

btw, i have written multi-threaded SIMD version of MD5 for CORE2.

On a Q6600 unclocked 2.4ghz, my MD5 program with salts crunches 88,000,000 keys a second, which is 2x faster than MDCrack.

Hopefully the algorithm will be ported to GPU at some point, when i get hang of programming them of course.

I have made some benchmarks of both CUDA-based and CPU-based md5 bruteforcers, you may find it here: [url=“MD5 bruteforce benchmark : Svarichevsky Mikhail”]MD5 bruteforce benchmark : Svarichevsky Mikhail

Hi BarsMonster,

Your work are great!

Why don"t you add cudamd5-v1.2.1 (http://majuric.org/software/cudamd5/) in the benchmark?

Could you please provide BarsWF in 32-bit version because other competitors are in 32-bit version?

With best regards,

Deguang

32-bit versions for both CUDA based and SSE based versions are available at [url=“World Fastest MD5 cracker BarsWF : Svarichevsky Mikhail”]http://3.14.by/en/md5[/url] .

Unfortunately cudamd5 is Linux only, and I do not feel that I can recompile it on windows quickly.
Based on speed benchmark in the beginning of the thread (110 million on 8800 ultra) it is going to be twice slower than CPU (9600GT C2D 3Ghz).

Interesting thread. I started looking at using a NVS 295 card in my workstation for some password auditing and came across this thread. Given a hash that is known to come from a maximum of 8-chars, it might practically take longer than 16 days. Here’s why, please correct me if I am wrong.

The possible number of characters that can be used are 26 uppercase, 26 lowercase, 10 digits, and 32 special characters. Total 94. An 8-char password would yield 94^8 permutations. But the password you are looking to crack could be less than 8 characters unless you have enforced some password policy that requires no less than 8-chars. So the number of permutations is 94^1 + 94^2 … + 94^8 = 6161234432565770. At 110 Mhash/sec, it would take ~648 days to scan all possible permutations.

This thread is year+ old. Since then new GPUs arrived, some optimizations made, one GTX295 able to do 1400M MD5 hashes/sec, one 4870x2 even more – 2300M, fresh 5870 – (my estimations) is around 2665M. So cracking 8 symbols long MD5 password containing even all printable characters can be done in less than 14 days with system costs less than $1500.

If you’re interested in MD5 hashing speed on different systems (including estimations for GT300 & HD5870) you can read this:

http://golubev.com/about_cpu_and_gpu_2_en.htm

Hello,

I know this is an old post but does anyone still have the source code? I tried to download it but the site ( http://majuric.org/software/cudamd5/) is no longer available. I would really appreciate it a lot if some can help me out.

mak

i’ve seen a lot of cryptohash implementations on github. not sure whether they are complete though