CUDA Memory Usage on TX1

I’m new to these forums, but am struggling with an issue and could use getting pointed in the right direction.

I am using a TX1 to perform some fairly memory intensive CUDA calculations.

I’ve noticed that deviceQuery from the CUDA 7 samples reports that the GPU only has visibility into 2 GB of RAM (plenty for my needs). The OS reports 4 GB available, which is great. The trouble I’m having is that when I try to request memory for my CUDA task, it appears that all the GPU can allocate is the free memory minus the kernel’s disk cache, which usually amounts to somewhere around 100 to 200 MB.
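
For reference, the numbers I’m looking at come from a query along these lines (a minimal sketch, not my actual application):

[code]
// query_free.cu -- minimal sketch: print what the CUDA runtime reports as
// free vs. total device memory. On the TX1 the "free" figure appears to
// track only memory the Linux kernel considers free (page cache excluded).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("free: %zu MiB, total: %zu MiB\n", freeBytes >> 20, totalBytes >> 20);
    return 0;
}
[/code]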

Could someone please help me understand how the memory is structured here, and what I can try to do to ensure that there is still plenty of memory available for the GPU?

I experience a similar issue on the K1, but never pursued it. I am hoping that there is something simple in the OS that I can configure to get this working.

Thank you!

I’ve been doing additional testing, and I have learned that by dropping the disk cache prior to starting my process, the process can get all the memory it needs.

I used the following command, which I found in a post on Ask Ubuntu while searching for how to drop the disk cache.

sync && echo 3 > /proc/sys/vm/drop_caches
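
In case it helps anyone else, the same workaround can be done from inside the process right before the big allocations (a sketch; it needs root, and it just writes the same procfs knob the shell one-liner does):

[code]
// drop_caches.cpp -- sketch of doing the workaround programmatically.
// Equivalent to "sync && echo 3 > /proc/sys/vm/drop_caches"; requires root.
#include <cstdio>
#include <unistd.h>

int drop_page_cache()
{
    sync();  // flush dirty pages first, as the shell one-liner does

    FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
    if (!f) {
        perror("open /proc/sys/vm/drop_caches");
        return -1;
    }
    // "3" drops the page cache plus dentries/inodes; "1" drops only the
    // page cache.
    int rc = (fputs("3\n", f) == EOF) ? -1 : 0;
    fclose(f);
    return rc;
}

int main()
{
    return drop_page_cache() == 0 ? 0 : 1;
}
[/code]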

This is kind of terrible, in a way. For normal Linux applications the kernel reclaims the disk cache as soon as an application asks for memory, but for CUDA applications on the TX1 it appears the GPU can only request what is literally free, and the kernel doesn’t release the disk cache for it. Has anyone else experienced this?

I will check it out, thanks for reporting the behavior and the work-around.

I do know that even though deviceQuery only reports ~2 GB of available memory (a known issue), during our testing CUDA was able to allocate beyond that (up to 4 GB), so I’ll investigate why dropping the caches is needed to enable it.

What type of memory are you allocating? I’ve written a small test app that allocates managed memory in a loop, and it gets to 3312 MiB before failing, regardless of how much is cached. I suspect I am limited by address space rather than physical memory.
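
The test is essentially this (a rough sketch of the idea, not the exact app):

[code]
// managed_loop.cu -- sketch: allocate managed memory in fixed-size chunks
// until the first failure, then report how far it got.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main()
{
    const size_t chunk = 64ull << 20;   // 64 MiB per allocation (arbitrary)
    std::vector<void*> blocks;
    size_t totalMiB = 0;

    for (;;) {
        void *p = nullptr;
        if (cudaMallocManaged(&p, chunk) != cudaSuccess)
            break;                      // stop at the first failed allocation
        blocks.push_back(p);
        totalMiB += chunk >> 20;
    }

    printf("managed allocations succeeded up to %zu MiB\n", totalMiB);

    for (void *p : blocks)
        cudaFree(p);
    return 0;
}
[/code]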

The behavior is that I prepare a rather large context of data that I accumulate in a Boost iostream. The application was originally written to support desktop GPU usage without shared host memory, and that has not been changed.

The application will prepare a lot (2 GB or so) of data, and then pass it through the GPU in ~100 to 800 MB chunks. The chunk sizes are bound by the algorithm in question and can’t be changed without sacrificing some aspects of the algorithm.

The application memory allocations go just fine, and the total available system memory (assuming the disk caches are freed correctly) is more than sufficient. The issue actually arises in some error checking around the cuMalloc call, where we check the memory available to the GPU to determine whether or not it is safe to proceed. The reported value is only what is literally free, not counting anything held in the disk cache.
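
Roughly, the guard looks like this (a simplified sketch with made-up names, not our actual code):

[code]
// Simplified sketch of the pre-allocation guard described above: refuse to
// allocate a chunk unless the runtime says enough memory is free right now.
#include <cuda_runtime.h>

bool safeChunkAlloc(void **devPtr, size_t chunkBytes)
{
    size_t freeBytes = 0, totalBytes = 0;
    if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess)
        return false;

    // On the TX1, freeBytes only counts memory that is literally free; pages
    // sitting in the disk cache are not included, so this check can fail even
    // though the kernel could reclaim that cache for a normal host allocation.
    if (freeBytes < chunkBytes)
        return false;

    return cudaMalloc(devPtr, chunkBytes) == cudaSuccess;
}
[/code]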

It could just be that the error check is no longer valid, but I do find it interesting that the GPU is reporting that it can only ask for what isn’t in the disk cache at the moment.

I’m less worried about the total memory size; my allocations aren’t massive, and I could rework the io stream to hold less data in memory. What worries me is that I have to manually drop clean caches to get this to work. Since it is a lot of data, in my benchmark tests the cache dropping appears to have a large negative impact on the workflow.

cuMalloc() allocates CUDA device memory specific to the GPU, but on Jetson the memory and memory controller are physically shared. Using zero copy / CUDA mapped memory, might it be possible to accumulate the data directly into a buffer that CUDA can access without penalty?
See [url]http://arrayfire.com/zero-copy-on-tegra-k1/[/url] for example.
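
As a rough illustration of the idea (a minimal sketch, not taken from the linked article, with error checking omitted):

[code]
// zero_copy_sketch.cu -- sketch: accumulate data in pinned, mapped host
// memory and hand the device pointer straight to a kernel, so the chunk
// never needs a separate cudaMalloc + cudaMemcpy.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   // stand-in for the real processing
}

int main()
{
    const size_t n = 1 << 20;

    // Allow pinned host allocations to be mapped into the device address space.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *hostBuf = nullptr;
    cudaHostAlloc((void**)&hostBuf, n * sizeof(float), cudaHostAllocMapped);

    // Fill the buffer directly; this is where the accumulated stream data
    // would be written instead of into a separate host-side container.
    for (size_t i = 0; i < n; ++i)
        hostBuf[i] = (float)i;

    float *devView = nullptr;
    cudaHostGetDevicePointer((void**)&devView, hostBuf, 0);

    scale<<<(unsigned)((n + 255) / 256), 256>>>(devView, n);
    cudaDeviceSynchronize();

    printf("hostBuf[1] = %f\n", hostBuf[1]);   // updated in place by the GPU

    cudaFreeHost(hostBuf);
    return 0;
}
[/code]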

Thank you for the idea, I will give that a try and see if that resolves the issue. I’ll report back what I find.

What happens if you just go ahead and call cuMalloc, without first checking how much memory the driver thinks is free? I wouldn’t be surprised if the query for free memory returns how much is available right now (without dropping any caches) but that it’s actually possible to allocate more.
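
In other words, something like this (a rough sketch):

[code]
// Sketch of the "just try it" approach: attempt the allocation and treat
// failure as the signal, rather than trusting the free-memory query.
#include <cstddef>
#include <cuda_runtime.h>

void *tryChunkAlloc(size_t chunkBytes)
{
    void *p = nullptr;
    if (cudaMalloc(&p, chunkBytes) != cudaSuccess) {
        cudaGetLastError();   // reset the last-error state after the failure
        return nullptr;       // caller can retry with a smaller chunk, etc.
    }
    return p;
}
[/code]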

Sorry for taking so long between replies. The project was shifted to another individual on the team that I work on.

The transition to Zero Copy looks like it will solve the problems that I’m seeing. The codebase is pretty mature though, so it is taking some time to see the fruits.

@bmerry, I wouldn’t be surprised if removing the check for available memory would let me allocate more memory; however, we learned in the past that allocating more memory than was actually available to the GPU caused some pretty terrible side effects, frequently requiring a reboot to recover the ability to use the GPU. This is considered a major failure criterion for our application. After I discussed it with our team, we decided that for that particular reason we weren’t willing to go that route.