memory allocation problem

Hi,
I have a problem with memory allocation in CUDA: when I allocate memory, it takes more than I ask for.
For example, when the first pointer allocates 4096 bytes and the second one also allocates 4096 bytes, the first allocation uses 4096 bytes of device memory but the second one uses 65536.

Code snippet:

mem_size = (4096);//sizeof(float);
cutilSafeCall( cudaMalloc( (void**) &d_idata[i], mem_size));
used_mem = free_mem;
cuMemGetInfo (&free_mem,&total_mem);
used_mem = used_mem-free_mem;
printf("%d_ free: %u , mem_size: %d , used : %u\n",i,free_mem, mem_size,used_mem);

i++;
mem_size = (4096);//*sizeof(float);
cutilSafeCall( cudaMalloc( (void**) &d_idata[i], mem_size));
used_mem =free_mem;
cuMemGetInfo (&free_mem,&total_mem);
used_mem = used_mem-free_mem;
printf("%d_ free: %u , mem_size: %d , used : %u\n",i,free_mem, mem_size,used_mem);

and the output:
1_ free: 1025608448 , mem_size: 4096 , used : 4096
2_ free: 1025542912 , mem_size: 4096 , used : 65536

Could anyone help me, please? Why does each allocation take so much more memory than I ask for? I only want 4 KB, but it takes 64 KB, which is terrible.
My program does a lot of allocation and deallocation, and I need the memory. It gives me an "out of memory" error because after a number of allocations all of the memory is dedicated to those pointers and there is nothing left for the other pointers, even though by my count there should still be memory available.
Is there any way to handle this and stop it from taking so much extra memory?
I know I could make a single allocation through one pointer and manage the memory myself, but I want to know whether CUDA can handle it by itself.

Regards,
Marjan

It looks like the driver or device maintains a set of allocatable memory pages of differing sizes. There appears to be a preallocated pool of 16 MB from which you can make up to 65536 mallocs of 4096 bytes or less, then a small set of 4096-byte pages, and then a much larger set of 65536-byte pages. Once you fill the pool, the memory manager seems to move to individual pages, first 4096 bytes, then 65536 bytes. This seems to be the source of the fragmentation you are seeing. If you run this:

#include <stdio.h>
#include "cuda.h"

#define N (65536)

int main()
{
	CUresult status;
	CUdeviceptr *mem_d = (CUdeviceptr *)malloc(sizeof(CUdeviceptr) * (size_t)N);
	CUdevice device;
	CUcontext ctext;
	const size_t memsize = 1;   /* size of each allocation, in bytes */
	const int handle = 0;       /* device ordinal */

	if ((status = cuInit(0)) != CUDA_SUCCESS) {
		fprintf(stderr, "Cuda Error : %d %s %d\n", status, __FILE__, __LINE__);
		return 0;
	}
	if ((status = cuDeviceGet(&device, handle)) != CUDA_SUCCESS) {
		fprintf(stderr, "Cuda Error : %d %s %d\n", status, __FILE__, __LINE__);
		return 0;
	}
	if ((status = cuCtxCreate(&ctext, 0, device)) != CUDA_SUCCESS) {
		fprintf(stderr, "Cuda Error : %d %s %d\n", status, __FILE__, __LINE__);
		return 0;
	}

	unsigned int memtotal, memfree, memfreelast = 0;

	/* Make N tiny allocations and report how much free memory each one costs. */
	for (unsigned int i = 0; i < N; i++) {
		if ((status = cuMemAlloc(&mem_d[i], memsize)) != CUDA_SUCCESS) {
			fprintf(stderr, "Cuda Error : %d %s %d\n", status, __FILE__, __LINE__);
			return 0;
		}
		if ((status = cuMemGetInfo(&memfree, &memtotal)) != CUDA_SUCCESS) {
			fprintf(stderr, "Cuda Error : %d %s %d\n", status, __FILE__, __LINE__);
			return 0;
		}
		/* index, free memory, total used, and the cost of this allocation alone */
		fprintf(stdout, "%u %u %u %u\n", i, memfree, memtotal - memfree,
		        (i > 0) ? (memfreelast - memfree) : 0);
		memfreelast = memfree;
	}

	for (unsigned int i = 0; i < N; i++)
		cuMemFree(mem_d[i]);
	cuCtxDestroy(ctext);
	free(mem_d);
	return 0;
}

and change N and memsize, you can see the effect of differing numbers of different-sized allocations. To avoid this sort of memory fragmentation, the best strategy seems to be to allocate memory in large, 64 kB-aligned blocks, so in your case allocate your 4096-byte structures in multiples of sixteen rather than one at a time.
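
For example, a minimal sketch of that approach using the runtime API (the names CHUNK_BYTES, CHUNKS_PER_BLOCK and the sixteen-buffer layout are just illustrative, assuming your structures stay 4096 bytes each) might look like this:

#include <stdio.h>
#include <cuda_runtime.h>

#define CHUNK_BYTES      4096
#define CHUNKS_PER_BLOCK 16    /* 16 * 4096 = 65536, one 64 kB block */

int main()
{
	/* One cudaMalloc per 64 kB block instead of one per 4 kB structure. */
	char *block = 0;
	if (cudaMalloc((void **)&block, (size_t)CHUNK_BYTES * CHUNKS_PER_BLOCK) != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed\n");
		return 1;
	}

	/* Hand out sub-pointers into the block; each starts on a 4096-byte boundary. */
	float *d_idata[CHUNKS_PER_BLOCK];
	for (int i = 0; i < CHUNKS_PER_BLOCK; i++)
		d_idata[i] = (float *)(block + (size_t)i * CHUNK_BYTES);

	/* ... use d_idata[0..15] exactly as before ... */

	cudaFree(block);   /* one free releases all sixteen sub-buffers */
	return 0;
}

With one cudaMalloc per 64 kB block the allocator hands out exactly the memory you consume, and a single cudaFree releases all sixteen sub-buffers at once.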

Yes, it seems to behave just as you've explained.

Thanks a lot.
Marjan