Calculating a SHA-256 hash of 10 MB of data on a GTX 1070 takes just under 1 minute.
Is this normal? On the CPU (using the sha256sum command) I get the result almost instantly for the same 10 MB input file.
I understand that the SHA-256 algorithm itself cannot be parallelized; only the number of concurrent threads can be increased. I have also tested that running the program on the GPU with two 10 MB files takes the same time as running it with one file.
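As a minimal illustration of that constraint (a Python sketch using the standard `hashlib` module, not the CUDA code from this post): within one message each chunk is fed into the state left by the previous chunk, so the work is sequential, while separate messages are independent and can be hashed side by side. The helper names `sha256_chunked` and `sha256_many` and the sample inputs are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def sha256_chunked(data, chunk_size=64):
    """Hash one message chunk by chunk.

    Each update() continues from the state left by the previous chunk,
    so the chunks of a single message must be processed in order.
    """
    h = hashlib.sha256()
    for start in range(0, len(data), chunk_size):
        h.update(data[start:start + chunk_size])
    return h.hexdigest()


def sha256_many(messages):
    """Hash independent messages concurrently, one worker per message."""
    with ThreadPoolExecutor(max_workers=max(1, len(messages))) as pool:
        return list(pool.map(sha256_chunked, messages))


# Stand-ins for two input files (hypothetical data, not the 10 MB files).
messages = [b"a" * 1000, b"b" * 1000]
digests = sha256_many(messages)
for digest in digests:
    print(digest)
```

The chunked digest equals the one-shot `hashlib.sha256(data).hexdigest()`, which is what makes per-message concurrency the only easy source of parallelism here.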
What I am not sure about is the very large time difference (almost 1 minute).
If interested, here is my code, nvprof output and specs. I'm running on Ubuntu 16.04 x64 with the latest CUDA.
==22686== Profiling result:
           Type  Time(%)      Time  Calls       Avg       Min       Max  Name
GPU activities:  100.00%  57.8597s      1  57.8597s  57.8597s  57.8597s  sha256_cuda(JOB**, int)
                   0.00%     736ns      1     736ns     736ns     736ns  [CUDA memcpy HtoD]
     API calls:   99.54%  57.8597s      1  57.8597s  57.8597s  57.8597s  cudaDeviceSynchronize
                   0.31%  180.73ms     10  18.073ms  45.257us  179.20ms  cudaMallocManaged
                   0.15%  87.775ms      1  87.775ms  87.775ms  87.775ms  cudaDeviceReset
                   0.00%  428.18us     94  4.5550us     628ns  161.54us  cuDeviceGetAttribute
                   0.00%  118.66us      1  118.66us  118.66us  118.66us  cuDeviceTotalMem
                   0.00%  86.322us      1  86.322us  86.322us  86.322us  cudaLaunch
                   0.00%  50.983us      1  50.983us  50.983us  50.983us  cudaMemcpyToSymbol
                   0.00%  40.856us      1  40.856us  40.856us  40.856us  cuDeviceGetName
                   0.00%  34.007us     22  1.5450us     768ns  5.0980us  cudaGetLastError
                   0.00%  5.5870us      2  2.7930us     838ns  4.7490us  cudaSetupArgument
                   0.00%  3.8410us      3  1.2800us     768ns  2.0250us  cuDeviceGetCount
                   0.00%  2.0950us      2  1.0470us     908ns  1.1870us  cuDeviceGet
                   0.00%  1.8160us      1  1.8160us  1.8160us  1.8160us  cudaConfigureCall
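To put the profile in perspective, the kernel time can be turned into an effective throughput. This is only a rough Python calculation on the numbers reported above; it assumes "10MB" means 10 * 1024 * 1024 bytes and ignores the final padding block that SHA-256 appends.

```python
# Figures taken from the post and the nvprof output.
# ASSUMPTION: "10MB" is read as 10 * 1024 * 1024 bytes.
INPUT_BYTES = 10 * 1024 * 1024
KERNEL_SECONDS = 57.8597  # time reported for sha256_cuda(JOB**, int)
BLOCK_BYTES = 64          # SHA-256 consumes the message in 64-byte blocks

throughput_kib_s = INPUT_BYTES / KERNEL_SECONDS / 1024
blocks = INPUT_BYTES // BLOCK_BYTES  # padding block not counted
micros_per_block = KERNEL_SECONDS / blocks * 1e6

print(f"throughput:     {throughput_kib_s:.0f} KiB/s")
print(f"64-byte blocks: {blocks}")
print(f"time per block: {micros_per_block:.0f} us")
```

Under those assumptions this works out to roughly 177 KiB/s, or about 353 microseconds for each 64-byte block.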
My specs:
./deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1070"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8114 MBytes (8507752448 bytes)
(15) Multiprocessors, (128) CUDA Cores/MP: 1920 CUDA Cores
GPU Max Clock rate: 1785 MHz (1.78 GHz)
Memory Clock rate: 4004 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS
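For scale, the deviceQuery figures above can be combined into a quick estimate of how much of the device a launch with one thread per input file would occupy. This is a hedged Python sketch: the one-thread-per-file mapping is an assumption (the kernel source is not reproduced here), and `resident_fraction` is a hypothetical helper.

```python
# Values copied from the deviceQuery output.
MULTIPROCESSORS = 15
CORES_PER_MP = 128
MAX_THREADS_PER_MP = 2048

total_cores = MULTIPROCESSORS * CORES_PER_MP
max_resident_threads = MULTIPROCESSORS * MAX_THREADS_PER_MP


def resident_fraction(num_files):
    """Share of resident thread slots in use.

    ASSUMPTION: exactly one GPU thread is launched per input file.
    """
    return num_files / max_resident_threads


print(f"CUDA cores: {total_cores}")
for n in (1, 2):
    print(f"{n} file(s): {resident_fraction(n):.4%} of {max_resident_threads} resident threads")
```

If that assumption holds, one or two files would keep only a tiny fraction of the GPU's thread slots busy, which would be consistent with two files taking the same time as one.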