How can I speed up plain elementwise multiplication on the GPU? Thanks in advance!

Vitalybn · July 22, 2017, 9:30am

Hello!

I have some code in MATLAB. The snippet below is plain elementwise multiplication, but it is a bit chaotic. The rest of the code is in the same spirit. Is there a way to speed it up on the GPU? If there is one, of course. Thanks in advance!

Y(:,506)=X.*V19;
X=Y(:,188).*V18;
Y(:,507)=X.*V18;
Y(:,508)=X.*V19;
Y(:,509)=Y(:,190).*X18;
X=Y(:,191).*V18;
Y(:,510)=X.*V18;
Y(:,511)=X.*V19;
X=Y(:,194).*V14;
Y(:,512)=X.*V14;
Y(:,513)=X.*V19;
X=Y(:,7).*X18;
Y(:,514)=X.*V18;
Y(:,515)=Y(:,198).*Y(:,9);
Y(:,516)=Y(:,198).*X18;
X=Y(:,201).*V18;
Y(:,517)=X.*V18;
Y(:,518)=X.*V19;
X=X14.*X14;
Y(:,519)=X.*V14;
Y(:,520)=X.*V15;
X=X15.*X15;
Y(:,521)=X.*V15;
X=X16.*X16;
Y(:,522)=X.*V16;
X=X17.*X17;
Y(:,523)=X.*V17;
Y(:,524)=X.*V22;
X=X18.*X18;
Y(:,525)=X.*V18;
Y(:,526)=X.*V23;
X=X19.*X19;
Y(:,527)=X.*V19;
X=X22.*X22;
Y(:,528)=X.*V22;

BulatZiganshin · July 22, 2017, 12:10pm

btw, "banal" translates as "святой" ("holy") :)

Vitalybn · July 22, 2017, 3:30pm

I take it that was a funny joke… in that case, I laughed. Do you have anything to say on the substance?

pasoleatis · July 24, 2017, 2:46pm

The cuBLAS library has this implemented.

Vitalybn · July 26, 2017, 5:00am

Thanks for the answer!

I use MATLAB and CUDA 8, and I can't understand why times() on the GPU is 3 times slower than on the CPU. My program:

A=double(rand(100000,1));
B=double(rand(100000,1));
tic
C=times(A,B);
toc

gpu=gpuDevice();
A=gpuArray(A);
B=gpuArray(B);

wait(gpu)
tic
C=times(A,B);
wait(gpu)
toc

0.00101 sec
0.00303 sec

Why???

Is cuBLAS for C?
I don't use C or CUDA C.

BulatZiganshin · July 26, 2017, 6:26am

if this is element-wise multiplication, the speed is limited by memory bandwidth for CPU code, and PCI-E bandwidth for GPU code. you need to have more complex algo to get speedup with GPU
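To put rough numbers on this, the sketch below estimates the memory traffic of the 100000-element test and the effective throughput implied by the two timings posted above. It assumes exactly two reads and one write of 8-byte doubles per element and ignores caches, so treat it as back-of-the-envelope arithmetic only.

```matlab
% Back-of-the-envelope estimate using the timings posted above.
n          = 100000;          % elements per vector
bytesMoved = 3 * n * 8;       % read A, read B, write C (assumed; 8-byte doubles)

tCpu = 0.00101;               % seconds, CPU timing from the post
tGpu = 0.00303;               % seconds, GPU timing from the post

bwCpu = bytesMoved / tCpu / 1e9;   % effective GB/s on the CPU
bwGpu = bytesMoved / tGpu / 1e9;   % effective GB/s on the GPU

fprintf('traffic: %.1f MB, CPU: %.2f GB/s, GPU: %.2f GB/s\n', ...
        bytesMoved/1e6, bwCpu, bwGpu);
```

Both figures are well below the peak bandwidth such hardware typically advertises, which suggests that at this vector size a fixed per-call overhead may weigh on the measurement as much as bandwidth does.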

pasoleatis · July 26, 2017, 9:56am

He only measured the execution time. But at the same time, I have no idea how MATLAB works with the GPU.

Hello,

Yes, cuBLAS is a library for CUDA C. I do not have experience with MATLAB, so I cannot comment on your result. What kind of GPU do you have? A laptop GPU is quite slow.

Elementwise multiplication is very simple to implement in CUDA C.

Code from here https://www.olcf.ornl.gov/tutorials/cuda-vector-addition/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
 
// CUDA kernel. Each thread computes one element of c.
// (The name vecAdd comes from the linked tutorial; the body here multiplies.)
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;
 
    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] * b[id];
}
 
int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;
 
    // Host input vectors
    double *h_a;
    double *h_b;
    //Host output vector
    double *h_c;
 
    // Device input vectors
    double *d_a;
    double *d_b;
    //Device output vector
    double *d_c;
 
    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);
 
    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);
 
    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
 
    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i);
    }
 
    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);
 
    int blockSize, gridSize;
 
    // Number of threads in each thread block
    blockSize = 512;
 
    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);
 
    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
 
    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );
 
    // Check the results here
 
    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
 
    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);
 
    return 0;
}

As BulatZiganshin mentioned, in practice you spend time on allocation, on copying the data to the GPU, and on copying the results back, which may make the GPU inefficient for your problem. It would pay off if you did A.*B thousands of times. If you do it only once, you will see no benefit.
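A minimal MATLAB sketch of that idea: copy the inputs to the GPU once, repeat the work there, and copy the result back once. The loop count of 1000 and the accumulator `acc` are hypothetical, chosen only for illustration. The code falls back to the CPU when no GPU or no Parallel Computing Toolbox is available.

```matlab
% Sketch: copy to the GPU once, do many operations there, copy back once.
try
    useGpu = gpuDeviceCount > 0;   % errors if the toolbox is missing
catch
    useGpu = false;
end

n = 100000;
A = rand(n, 1);
B = rand(n, 1);

if useGpu
    A = gpuArray(A);               % one host-to-device copy per input
    B = gpuArray(B);
end

acc = zeros(n, 1, 'like', A);      % accumulator lives where A lives
for k = 1:1000
    acc = acc + A .* B;            % repeated work, no transfers inside the loop
end

if useGpu
    acc = gather(acc);             % one device-to-host copy at the end
end
```

Whether this beats the CPU still depends on the card and on how much work each iteration does, so it is worth timing both paths on the actual data.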

P.S.: I can't believe Google Translate is so bad! I use it for Spanish and most of the time it works very well.

Vitalybn · July 26, 2017, 5:49pm

Hello! Thanks for the replies, they’re very important for me!

I have a GTX 750 with 512 CUDA cores and 1 GB of memory.
It's neither the fastest nor a professional video card, but its performance for elementwise multiplication of two vectors is very low. Even if I have to move the vectors from graphics card memory to main memory and the result back to the card, multiplying on the CPU is still 2.5 times faster than on the GPU.

At the same time, plain matrix multiplication (A*B) is very fast on the GPU.

Strangely, it feels as if the multiplications are not being parallelized.

I need to understand: is this a limitation of the hardware, or do I just not know how to do it properly?

pasoleatis · July 26, 2017, 6:05pm

Yes. The GTX 750 is very slow; it is slower than my laptop card, a GTX 765M.
Use the code I suggested. With a little work you can implement the CPU operation and take proper measurements. I do not know how MATLAB works with the GPU.

In CUDA C you can measure the time using events.

Here is some sample code. Put the code you want to measure between (start) and (stop) (How to measure total time for CPU and GPU - CUDA Programming and Performance - NVIDIA Developer Forums):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

cudaEventRecord(start);
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop);

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);

Vitalybn · July 28, 2017, 8:38pm

My problem is that the m-function called from my main program runs very slowly on the GPU. See my previous post.

The function uses a lot of vector multiplications. None of its input variables are global, and it runs very slowly on the GPU. The question is: how can I increase the speed? Maybe by using global variables?

pasoleatis · July 28, 2017, 9:06pm

Try a better computer.

Vitalybn · July 29, 2017, 6:05pm

I figured out what the problem is.

The problem is a loss of performance when indexing. Please see the code below. You can see that as soon as indexing is involved, GPU performance drops.

If I just multiply the matrices, the GPU is about 2.4 times faster (elapsed time is 0.027648 seconds on the CPU and 0.011477 seconds on the GPU).

But as soon as indexing is involved, the GPU is about 50 times slower than the CPU (elapsed time is 0.002495 seconds on the CPU and 0.127313 seconds on the GPU).

And the less indexing there is, the smaller the problem. So the GPU doesn't like indexing. Why is this happening?

M=rand(1000,500,'double');
N=rand(1000,500,'double');

tic
for i=1:500
M(:,i)=N(:,501-i).*N(:,i);
end
toc

%--------------

gpu=gpuDevice();

V=rand(1000,500,'gpuArray');
Y=rand(1000,500,'gpuArray');

wait(gpu)
tic
for i=1:500
Y(:,i)=V(:,501-i).*V(:,i);
end
wait(gpu)
toc

wait(gpu)
tic
for i=1:500
C=V(:,501-i).*V(:,i);
end
wait(gpu)
toc

A=V(:,1);
B=V(:,2);
wait(gpu)
tic
for i=1:500
D=A.*B;
end
wait(gpu)
toc

%--------------

tic
E=M'*N;
toc

wait(gpu)
tic
F=V'*Y;
wait(gpu)
toc

Elapsed time is 0.002495 seconds.
Elapsed time is 0.127313 seconds.
Elapsed time is 0.068272 seconds.
Elapsed time is 0.009520 seconds.
Elapsed time is 0.027648 seconds.
Elapsed time is 0.011477 seconds.
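Dividing each GPU loop timing above by the 500 iterations gives a rough per-iteration cost. The labels are an assumption: they take the output lines to appear in the same order as the tic/toc blocks in the script.

```matlab
% Per-iteration cost of the three GPU loops, from the timings printed above.
iters = 500;
tGpuAssign   = 0.127313;  % Y(:,i) = V(:,501-i).*V(:,i)  (indexed read and write)
tGpuReadOnly = 0.068272;  % C = V(:,501-i).*V(:,i)       (indexed read only)
tGpuNoIndex  = 0.009520;  % D = A.*B                     (no indexing)

perIter = [tGpuAssign, tGpuReadOnly, tGpuNoIndex] / iters * 1e6;  % microseconds
fprintf('per iteration: %.1f us, %.1f us, %.1f us\n', perIter);
```

Each column holds only 1000 doubles, so the multiplication itself is tiny. Under this reading, most of the time goes to the fixed cost of each indexing call rather than to arithmetic, which would explain why fewer, larger operations help.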

So how can I solve this? The code above is just an example. My actual code, where I see the problem, looks like this:

Y(:,26)=V1.*V2.*V12;
X=V1.*V3;
Y(:,27)=X.*V7;
Y(:,28)=X.*V8;
X=V1.*V6;
Y(:,29)=X.*V7;
Y(:,30)=X.*V8;
Y(:,31)=V1.*V7.*V11;
Y(:,32)=X8.*V1;
Y(:,33)=V1.*V11.*V12;
Y(:,34)=X2.*V7;
Y(:,35)=X2.*V9;
Y(:,36)=X2.*V11;
Y(:,37)=X2.*V12;
X=V2.*V3;
Y(:,38)=X.*V7;
Y(:,39)=X.*V12;
Y(:,40)=X.*V13;
X=V2.*V4;
Y(:,41)=X.*V7;
Y(:,42)=X.*V8;
X=V2.*V6;
Y(:,43)=X.*V8;
Y(:,44)=X.*V12;
X=V2.*V7;
Y(:,45)=X.*V7;
Y(:,46)=X.*V8;
Y(:,47)=X.*V9;
Y(:,48)=X.*V12;
X=V2.*V8;
Y(:,49)=X.*V8;
Y(:,50)=X.*V12;
Y(:,51)=X9.*V2;
X=V2.*V11;
Y(:,52)=X.*V11;
Y(:,53)=X.*V12;
X=V2.*V12;
Y(:,54)=X.*V12;
Y(:,55)=X.*V13;
Y(:,56)=X3.*V8;
Y(:,57)=X3.*V12;
Y(:,58)=X3.*V13;
X=V3.*V4;
Y(:,59)=X.*V8;
Y(:,60)=X.*V13;
Y(:,61)=V3.*V6.*V8;
X=V3.*V7;
Y(:,62)=X.*V7;
Y(:,63)=X.*V8;
Y(:,64)=X.*V9;
Y(:,65)=X.*V13;
X=V3.*V8;
Y(:,66)=X.*V8;
Y(:,67)=X.*V13;
Y(:,68)=Y(:,14).*V3;
X=V3.*V12;
Y(:,69)=X.*V12;
Y(:,70)=X.*V13;
X=V3.*V13;
Y(:,71)=X.*V13;
Y(:,72)=X.*V14;
Y(:,73)=V4.*V7.*V9;
Y(:,74)=V4.*V8.*V14;
X=V4.*V13;
Y(:,75)=X.*V13;
Y(:,76)=X.*V14;
Y(:,77)=X6.*V7;
Y(:,78)=X6.*V8;
Y(:,79)=X6.*V13;
X=V6.*V7;
Y(:,80)=X.*V7;
Y(:,81)=X.*V8;
Y(:,82)=X.*V12;
Y(:,83)=X.*V16;
Y(:,84)=X.*V17;
X=V6.*V8;
Y(:,85)=X.*V8;
Y(:,86)=X.*V11;
Y(:,87)=X.*V12;
Y(:,88)=X.*V13;
X=V6.*V11;
Y(:,89)=X.*V12;
Y(:,90)=X.*V13;
X=V6.*V12;
Y(:,91)=X.*V12;
Y(:,92)=X.*V16;
Y(:,93)=X13.*V6;
Y(:,94)=X7.*V8;
Y(:,95)=X7.*V9;
Y(:,96)=X7.*V11;
Y(:,97)=X7.*V12;
Y(:,98)=X7.*V14;
Y(:,99)=X7.*V16;
Y(:,100)=X7.*V17;
Y(:,101)=Y(:,1).*V8;
Y(:,102)=Y(:,1).*V9;
Y(:,103)=Y(:,1).*V17;
Y(:,104)=Y(:,1).*V18;

pasoleatis · July 29, 2017, 6:24pm

MATLAB uses column-major order, which means A(i,j) sits next to A(i+1,j) in memory. C uses row-major order, so there A(i,j) and A(i+1,j) are far apart: the elements of a column Y(:,i) are spread widely across memory, and reading them carries a penalty. If MATLAB's CUDA backend also uses row-major order, just make a new variable and use the transpose of the matrix.
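A small sketch of the column-major rule for ordinary (CPU) MATLAB arrays, using `sub2ind` to print linear indices. The 4-by-3 size and the chosen element are arbitrary. It says nothing about how gpuArray data is laid out internally, which is the open question here.

```matlab
% Column-major layout: element (i,j) of an m-by-n matrix has linear index
% i + (j-1)*m, so consecutive rows of one column are adjacent in memory.
m = 4; n = 3;
A = reshape(1:m*n, m, n);        % A(k) == k in linear (memory) order

i = 2; j = 3;
k1 = sub2ind(size(A), i,   j);   % linear index of A(i,j)
k2 = sub2ind(size(A), i+1, j);   % linear index of A(i+1,j): next element
k3 = sub2ind(size(A), i, j-1);   % linear index of A(i,j-1): m elements away

fprintf('A(i,j) -> %d, A(i+1,j) -> %d, A(i,j-1) -> %d\n', k1, k2, k3);
```

So on the CPU side a column slice such as Y(:,i) is one contiguous run, while a row slice Y(i,:) is strided by the number of rows.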

Vitalybn · July 30, 2017, 6:13am

You are right, it does matter which dimension is indexed.

I looked into your idea, and it turns out that indexing the way I do, A(:,i), is the best variant, because a column taken from the array is contiguous in memory.

Indeed, I noticed that if I change the indexing to A(i,:), performance drops by a factor of 3 on the CPU and 1.5 on the GPU.

Your comment is interesting and I will keep it in mind, because I had not thought about this before. But in my example everything is already indexed properly, so perhaps the problem is elsewhere?

Any ideas? Maybe I should ask the MATLAB developers directly?

pasoleatis · July 30, 2017, 12:36pm

You need access to the MATLAB GPU function. There is not much else you can do otherwise. Maybe split the matrix into something like an array of pointers, but I do not know whether that is possible in MATLAB the way it is in C.
Like: double **Y;

Vitalybn · August 5, 2017, 10:23am

Thank you for your help. This code really solves the problem!

V1=V(:,[1,3,6,3,7,8,3,2,1]);
V2=V(:,[5,8,1,4,6,2,1,9,3]);
V3=V(:,[9,2,7,1,3,6,4,8,5]);
Y=V1.*V2.*V3;

It is much faster than this code:

Y(:,1)=V(:,1).*V(:,5).*V(:,9);
Y(:,2)=V(:,3).*V(:,8).*V(:,2);
Y(:,3)=V(:,6).*V(:,1).*V(:,7);
Y(:,4)=V(:,3).*V(:,4).*V(:,1);
Y(:,5)=V(:,7).*V(:,6).*V(:,3);
Y(:,6)=V(:,8).*V(:,2).*V(:,6);
Y(:,7)=V(:,3).*V(:,1).*V(:,4);
Y(:,8)=V(:,2).*V(:,9).*V(:,8);
Y(:,9)=V(:,1).*V(:,3).*V(:,5);

The new code now runs 2 times faster on the GPU than on the CPU, as it should. The program has also become much shorter and clearer!

Additionally, I now see how important it is to write code the right way! I had tried to reduce the number of multiplications to gain speed, and for that I used the intermediate variable X. In the new version of the code there are more multiplications, yet it runs faster! It is amazing!
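The same pattern can be written once for any list of triple products: put the column numbers in a table, gather the columns with three indexing operations, and multiply once. The sketch below uses a hypothetical index table `idx` holding the same nine products as above. It uses plain CPU arrays so it runs anywhere; with the Parallel Computing Toolbox, `V` could be created as a gpuArray instead and the vectorized line should work unchanged.

```matlab
rng(0);                          % reproducible random data
V = rand(1000, 9);               % on a GPU: rand(1000, 9, 'gpuArray')

% Each row of idx lists the three columns of V that form one output column.
idx = [1 5 9; 3 8 2; 6 1 7; 3 4 1; 7 6 3; 8 2 6; 3 1 4; 2 9 8; 1 3 5];

% Vectorized: three column gathers and one elementwise product expression.
Y = V(:, idx(:,1)) .* V(:, idx(:,2)) .* V(:, idx(:,3));

% Reference: one small indexed operation per output column.
Yref = zeros(size(Y));
for c = 1:size(idx, 1)
    Yref(:, c) = V(:, idx(c,1)) .* V(:, idx(c,2)) .* V(:, idx(c,3));
end

maxDiff = max(abs(Y(:) - Yref(:)));
fprintf('max difference: %g\n', maxDiff);
```

The vectorized form does some redundant multiplications compared with reusing an intermediate X, but it replaces many tiny indexed operations with a few large ones, which matches the speedup reported here.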