My program currently does 2-4 MH/s.
I wrote the same program for the CPU and it does over 6 million hashes per second.
I'm pretty sure I can optimize this, but I don't know how.
What I do is run a for loop that launches a kernel to process hashes. One thread computes one hash, since a single hash can't be parallelised. If one thread finds the right nonce, a global variable is set to 1 and the other threads return. After the kernel call I run cudaDeviceSynchronize and check whether the global variable is set to 1. If not, I relaunch the kernel with an updated nonce offset.
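In simplified form the structure is roughly this (names are illustrative, and tryNonce is a stand-in for the real SHA-256 check):

#include <cuda_runtime.h>

__device__ bool tryNonce(unsigned long long nonce)
{
    return nonce == 42ULL; // stand-in for the real hash test
}

__global__ void search(unsigned long long offset, int *found)
{
    unsigned long long nonce = offset + (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (*found) return;      // another thread already found the nonce
    if (tryNonce(nonce))
        *found = 1;
}

int main()
{
    const int blocks = 4, threads = 1024;
    int *d_found, h_found = 0;
    cudaMalloc(&d_found, sizeof(int));
    cudaMemset(d_found, 0, sizeof(int));
    unsigned long long offset = 0;
    while (!h_found) {
        search<<<blocks, threads>>>(offset, d_found);
        cudaDeviceSynchronize();
        cudaMemcpy(&h_found, d_found, sizeof(int), cudaMemcpyDeviceToHost);
        offset += (unsigned long long)blocks * threads; // next nonce range
    }
    cudaFree(d_found);
}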
Is calling the kernel multiple times the right approach?
Or should I change my code's design?
Please tell me what I should change to optimize this :)
I did as you said, and I can now compute up to 45 million hashes/s (I can hear my GPU whistling), but my program crashes after ~30 seconds.
How many blocks should I run in parallel in a kernel? I set 1024 threads per block but don't know how many blocks to use.
Also, what's the difference between cudaSetDevice(0) and cudaSetDevice(1)?
Can you give me a quick explanation of what can be achieved with CUDA streams?
A reasonable minimum target is to launch a total of at least (number of SMs) * 2048 threads.
These should be split between blocks, usually with something in the range of 128, 256, or 512 threads per block. It might be that 1024 threads per block is "OK"; it just requires some analysis to confirm.
Your GTX 970 has 13 SMs, so target 13 * 2048 = ~26K threads, ballpark, minimum. If you put 1024 threads per block, that would be 26 blocks. I'm not saying I know how to transform your code from 4 blocks to 26, but that is a reasonable performance goal, to maximize throughput.
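If you don't want to hardcode the 2048, a minimal sketch that computes this floor from the device properties:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int threadsPerBlock = 256; // something in the 128-512 range
    // 13 * 2048 = 26624 on a GTX 970
    int minThreads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    int minBlocks  = (minThreads + threadsPerBlock - 1) / threadsPerBlock;
    printf("%d SMs, %d max threads/SM -> at least %d threads = %d blocks of %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor,
           minThreads, minBlocks, threadsPerBlock);
}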
I'm not going to try to teach you CUDA here on this forum. Use a Google search to answer basic questions. There is documentation that defines what cudaSetDevice does.
Can you just start with existing SHA-256 GPU calculation code? There are lots of mining algo sources on github.
And anyway, you need to learn CUDA to get an idea of how the GPU works and how code should be optimized. Otherwise, you can easily end up with worse performance than on the CPU.
If you don't believe me, print out those quantities.
No, not correct. A kernel with enough blocks will fill all the SMs on a GPU, and this is generally desirable. Please avail yourself of an organized introduction to CUDA. I know you’d like to learn it in 5 minutes, but in my experience it takes longer than that. Piecemeal Q+A “Socratic” method might seem “efficient” but is actually quite inefficient for learning a body of knowledge like this, IMO.
The github repo is the only thing I have to look at. It still says 3200 as of this posting. The code I excerpted was pulled from your github repo, and still reflects that.
Obviously I can’t see what code you are actually running.
Roughly speaking, the “part1” I posted teaches you how to write syntactically correct CUDA, that will give you the correct answer (arguably the first step/requirement for any programmer). The “part2” teaches the basics of how to write CUDA code that runs “fast”. Your questions here mostly revolve around the latter, not the former.
With a lack of understanding of the latter, you can easily write code that gives you the correct answer but runs slow. You may then arrive at the conclusion that GPUs are worthless. Presumably they are not.
Since more than one thread can find a correct answer to return to the host, how do I make only one thread pass through the if statement that copies the data back to the host, and avoid race conditions?
Example:

bool isOk = check(someVar); // many different values of someVar can work
if (isOk) {
    // I want this executed by only one thread, if possible the one with the lowest id
    memcpy(h_someVar, someVar, x);
}
Edit: Never mind, I did this:

bool isOk = check(someVar); // many different values of someVar can work
if (isOk && atomicExch(g_found, 1) == 0) {
    // atomicExch returns the old value, so only the first
    // thread to set g_found to 1 enters this branch
    memcpy(h_someVar, someVar, x);
}
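This works because atomicExch returns the old value, so only the first thread to swap a 1 into g_found takes the branch; note that it is the first thread to arrive, not necessarily the one with the lowest id. As an alternative to the device-side memcpy, here is a sketch of the same pattern that writes the winner to a device result slot the host copies back afterwards (check() again stands in for the real hash test):

__device__ bool check(unsigned long long nonce)
{
    return nonce == 42ULL; // stand-in for the real hash test
}

__global__ void search(unsigned long long offset, int *g_found, unsigned long long *g_result)
{
    unsigned long long nonce = offset + (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (*g_found) return;                            // someone already won
    if (check(nonce) && atomicExch(g_found, 1) == 0) {
        *g_result = nonce;                           // only the first winner writes here
    }
}

// on the host, after cudaDeviceSynchronize():
// cudaMemcpy(&h_result, g_result, sizeof(h_result), cudaMemcpyDeviceToHost);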
Can you help me with this kind of problem, but using Python?
I mean, is there any example of a SHA256 calculation implemented with CUDA from Python?
I don't know of any examples. They may exist somewhere. The common CUDA-Python language bindings I am familiar with are pycuda, numba, and cupy. It should be possible to write a sha256 calculation using any of those. For example, if you had an already-written CUDA C++ kernel (such as the one on github linked in this thread, sha256_kernel) you could use that directly in pycuda. The process is straightforward: you use the kernel code directly, and the conversion of the host code from C++ to Python suitable for pycuda is basically mechanical. I won't be able to give a tutorial on how to use pycuda here. That info is available elsewhere, though.