CUDA Memory Transpose

Hi, I’m running a matrix transpose program on my PC with a GTX850M. I used the transpose from this blog: http://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/ But I want to handle huge matrix sizes, up to 20000 and 30000, and I get an out-of-memory error. Is it related to my GPU thread space (it’s 1024 threads/block)? What do you recommend I do to solve this problem?

A 20000x20000 square matrix of float will use about 1.6GB of storage. The method in that blog requires storage for both an input and an output matrix, meaning a total of around 3.2GB. If your GTX850M has 2GB of memory, it won’t work. This has nothing to do with threads or “thread space”.
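For reference, the arithmetic behind those numbers (just a host-side estimate, nothing GPU-specific):

// Back-of-the-envelope storage estimate for an N x N matrix of float.
#include <cstdio>

int main()
{
    size_t N = 20000;
    size_t bytesPerMatrix = N * N * sizeof(float);   // 20000 * 20000 * 4 = 1.6e9 bytes
    std::printf("one matrix     : %.2f GB\n", bytesPerMatrix / 1e9);        // ~1.60 GB
    std::printf("input + output : %.2f GB\n", 2.0 * bytesPerMatrix / 1e9);  // ~3.20 GB
    return 0;
}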

For square matrices, you might try an in-place transpose method:

Efficient in-place transpose of multiple square float matrices - CUDA Programming and Performance - NVIDIA Developer Forums

but you will still run out of memory, perhaps somewhere around a size of 20000x20000.

The other solution would be to run on a GPU with more memory.
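For illustration only, a minimal sketch of why the in-place approach needs just one copy of the matrix: each thread above the diagonal swaps its element with the mirrored one. This is not the optimized kernel from that thread, and the names and launch configuration here are illustrative; it is also not coalesced on one side of the swap.

// Naive in-place transpose of an n x n row-major matrix held in d_data.
// Each thread handles one (row, col) element with col > row and swaps it
// with its mirror; the diagonal is left untouched.
__global__ void transposeInPlaceNaive(float *d_data, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n && col > row) {
        float tmp             = d_data[row * n + col];
        d_data[row * n + col] = d_data[col * n + row];
        d_data[col * n + row] = tmp;
    }
}

// illustrative launch:
//   dim3 block(32, 32);
//   dim3 grid((n + 31) / 32, (n + 31) / 32);
//   transposeInPlaceNaive<<<grid, block>>>(d_data, n);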

I would like to think that, given that the problem is essentially broken up into thread blocks, coupling the use of streams with the transpose method could limit the amount of memory required.
If any kernel executes as thread blocks, you only need to provide sufficient memory for the thread blocks running at a given time, not for the kernel as a whole.

I also would like to think that, with the use of proper offsets, the mega matrix can be broken down into multiple, smaller matrices, just as the transpose method breaks the (input) matrix up into blocks.
One could argue that, in these cases, the code assumes an input offset of 0.

I’m not talking about temporary storage used by the transpose algorithm here.

If the input matrix is resident on the device, storage required for the input matrix is not in any way dependent on how you transpose it, thread block behavior, streams, or any other aspect of your subsequent algorithm.

Are you referring to having the input be resident on the host, and the output be resident on the host, and transferring pieces to the device, and having the device transpose those pieces, before returning them to the host (result)?

Hello txbob, many thanks for your update! Actually I’m implementing the in-place transpose, so it might take around 1.5GB, i.e. half, but as you said, huge sizes will not work, so it might still not fit. As for the solution you mentioned above:

“Are you referring to having the input be resident on the host, and the output be resident on the host, and transferring pieces to the device, and having the device transpose those pieces, before returning them to the host (result)?”

Will it solve the memory problem? If so, how will the kernel transpose the off-diagonal blocks (unlike the diagonal ones) this way, so that the transpose is still done in-place?

A couple of other questions, which I think are kind of stupid, sorry:

1\ How did you calculate that the storage will take up to 3.2GB?
2\ How did you know that my memory size is up to 2GB? Is it related to the NVIDIA model’s memory?

My questions were directed to little_jimmy; I thought that was clear. I’m perplexed by the comments by little_jimmy in general, so I’d rather not delve further into it. I’m not proposing that as a sensible solution to anything. It does not appear to me to be a sensible suggestion or solution in this case.

3.2GB is just 1.6GB * 2, one for the input matrix and one for the output matrix, because, as I stated already, the blog post method works that way. If you use the in-place approach, then you only need one copy of the storage, i.e. 1.6GB.

I don’t know for sure that your memory size is 2GB, but obviously, given this discussion, the memory size of your GPU is an important factor, so I did a Google search. It suggests that notebooks containing the GTX850M GPU typically have 2GB of GPU memory. Your system may be different; I don’t know. (Why wouldn’t you just tell people how much memory is on your GPU rather than going through this protracted discussion?)
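For what it’s worth, you can also query it directly from code rather than searching; a minimal sketch using the CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // properties of device 0

    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);    // current free/total device memory

    std::printf("%s: %.2f GB total, %.2f GB currently free\n",
                prop.name, totalBytes / 1e9, freeBytes / 1e9);
    return 0;
}

(The deviceQuery sample that ships with the CUDA toolkit, or nvidia-smi, will report the same information.)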

Hello txbob, thanks for your reply. I’m sorry I didn’t state my question clearly enough or give you the information you needed; it was because I was having trouble finding my GPU memory size. But now I have it: yes, as you mentioned, it’s 2GB. Sorry, and thank you again.

“If the input matrix is resident on the device”

That is indeed one of the assumptions - or premises - of the design.

“I’m perplexed by the comments by little_jimmy in general, so I’d rather not delve further into it”

I am now perplexed by you being perplexed about my comments.
‘In general’, as well as ‘not delve further into it’, are truly convenient, as there is simply no response possible to such a statement.
If you said “because of A”, I could address A; but I cannot address ‘in general’.
You are most welcome to attack, question, interrogate, critique, and illuminate my remarks, statements, points, etc.; however, that is where it stops.

I’d rather not delve further into it with Tasim. Those comments were directed to Tasim. I’ve already stated my question to little_jimmy.

I didn’t think it made sense to have a meaningful discussion, with Tasim, about little_jimmy’s comments that I did not understand. I’m happy to discuss that with little_jimmy however, as my response (to little_jimmy) indicated.

Perhaps you have already answered my question:

In that case, I assume we should be in agreement that the in-place transpose method should require storage for the input matrix, which at 20000x20000 floats should be about 1.6GB.

I’m not sure how to interpret these comments then:

Is this a statement about the expected memory required? I think it should still be (at least) 1.6GB for this example. Do you think it should/could be something different (i.e. lower than 1.6GB)? The in-place transpose method does not use any “extra” memory, other than what is required for the input matrix storage (excepting shared memory, which I think is not in view in this problem, and would not be contributing to the OP’s “out of memory” error message).

It’s not clear to me that breaking up the work in any way affects the basic storage required, which is (only) the storage necessary for the input matrix (in the case of an in-place transpose).

Innocent until proven guilty, versus guilty until proven innocent.
Similarly: possible until proven impossible, versus impossible until proven possible.

My initial points seemed very possible at the time of writing, and I still think they are possible.
The particular case is similar to a conventional transpose; however, what sets it apart is the notion of limited memory: doing a transpose given limited/constrained memory.

To address such a design constraint, I would apply the premise that the minimum memory footprint of a kernel should be equal to the memory footprint of the kernel’s thread blocks ‘in flight’ - resident at a point in time - and also that the number of kernel thread blocks resident at a time is directly or indirectly under the programmer’s control.

Look at the distilled functionality of the conventional transpose as pseudocode:

for the input matrix, for the output matrix, for the thread blocks of the kernel:
    shared_memory[block_id] = input_matrix[input_start_address + block_id];
    flip/shuffle shared memory;
    output_matrix[output_start_address + block_id] = shared_memory[block_id];
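Concretely, the blog’s tiled shared-memory kernel is roughly the following (assuming a square matrix whose side is a multiple of TILE_DIM); note that it reads from idata and writes to odata, which is exactly where the 2x storage in the earlier discussion comes from:

#define TILE_DIM   32
#define BLOCK_ROWS 8

// Each block loads a TILE_DIM x TILE_DIM tile through shared memory and
// writes it back to the mirrored tile position in the output matrix.
__global__ void transposeCoalesced(float *odata, const float *idata, int width)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;          // block offsets are swapped here,
    y = blockIdx.x * TILE_DIM + threadIdx.y;          // i.e. the "output offset"

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y + j) * width + x] = tile[threadIdx.x][threadIdx.y + j];
}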

Clearly, the input/output matrix reading/writing offsets - or simply the starting addresses - are a) fixed, b) known, and c) can be extracted and ‘manipulated’.

In order to conserve device memory, over and above keeping the input/output matrices on the host, one could intervene by exploiting these reading/writing offsets.
The matrix can then be transposed by means of multiple, smaller kernels, as opposed to a single, large kernel, in order to have greater control over the resident thread blocks and their working arrays.
Coupling this with streams might further streamline the process.

The pseudocode would become:

for the input matrix and the output matrix stored in host memory,
for the matrix divided into smaller blocks, depicted by certain starting addresses,
for the thread block of a kernel corresponding to such a smaller block:

    input_working_array = input_matrix[input_start_address + x];

    launch kernel in stream s, feeding it input_working_array and output_working_array

    output_matrix[output_start_address + y] = output_working_array[y];

<device kernel>
    shared_memory[block_id] = input_working_array[input_start_address + block_id];
    flip/shuffle shared memory;
    output_working_array[output_start_address + block_id] = shared_memory[block_id];
</device kernel>

Now one should be in full control of the device memory footprint, and one may even push the device footprint as low as 1MB (by controlling the number of resident thread blocks and their sizes).
There may or may not be a performance degradation - the extent thereof I would not know - as this could be seen as a performance/memory-footprint trade-off.
But if one is memory constrained, one may have no choice but to participate in such a trade.
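A host-side sketch of this tile-by-tile idea; the function and parameter names here (transposeChunked, chunk, h_in, h_out) are made up for illustration. It assumes n is a multiple of chunk, keeps both full matrices in host memory, and uses a simple per-element tile kernel as a placeholder for the shared-memory one sketched above:

#include <cuda_runtime.h>

// Transposes one chunk x chunk tile resident on the device (placeholder for
// the tiled shared-memory kernel shown earlier).
__global__ void transposeTile(float *odata, const float *idata, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < width)
        odata[x * width + y] = idata[y * width + x];
}

// Full n x n transpose with only two chunk x chunk tiles ever resident on the
// device: copy a sub-block in, transpose it, copy it out to the mirrored
// position. Device footprint is 2 * chunk * chunk * sizeof(float), e.g. ~8MB
// for chunk = 1024, regardless of n.
void transposeChunked(float *h_out, const float *h_in, int n, int chunk)
{
    float *d_in, *d_out;
    cudaMalloc(&d_in,  (size_t)chunk * chunk * sizeof(float));
    cudaMalloc(&d_out, (size_t)chunk * chunk * sizeof(float));

    dim3 block(32, 32);
    dim3 grid((chunk + 31) / 32, (chunk + 31) / 32);

    for (int bi = 0; bi < n; bi += chunk) {
        for (int bj = 0; bj < n; bj += chunk) {
            // copy the chunk x chunk sub-block starting at (bi, bj) to the device
            cudaMemcpy2D(d_in, chunk * sizeof(float),
                         h_in + (size_t)bi * n + bj, (size_t)n * sizeof(float),
                         chunk * sizeof(float), chunk, cudaMemcpyHostToDevice);

            transposeTile<<<grid, block>>>(d_out, d_in, chunk);

            // write the transposed sub-block to the mirrored position (bj, bi)
            cudaMemcpy2D(h_out + (size_t)bj * n + bi, (size_t)n * sizeof(float),
                         d_out, chunk * sizeof(float),
                         chunk * sizeof(float), chunk, cudaMemcpyDeviceToHost);
        }
    }
    cudaFree(d_in);
    cudaFree(d_out);
}

Overlapping the copies and kernels with multiple streams and pinned host memory would be the next refinement; whether this beats a single full-size transpose on a GPU with more memory is exactly the performance/footprint trade-off mentioned above.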