cuBLAS kernels always run serially despite streams and AsyncMemCpy?!?

I’m having some difficulty getting cuBLAS kernels to run in parallel. I’m using sgemm and hgemm kernels on relatively small matrices (say 800x400). NVVP reports that the kernels run serially and have a theoretical occupancy of 50%. No problem, I figured I’d just use streams to parallelize, since I have several such matrices to compute. However, no matter what I try, the kernels that cuBLAS generates are serialized.

I figured a simple setup like the one below would get parallel execution, but alas it runs serially both under CUDA 6.5 on the Tegra K1 and CUDA 7.0 on the Tegra X1.

    // Get cuBLAS handle, etc.
    float *source, *firstMat, *secondMat, *result1, *result2;
    cudaStream_t stream1, stream2;
    const float ALPHA = 1.0f, BETA = 0.0f;

    check_error(cudaMallocHost(&source, 500*500*3*sizeof(float)));
    check_error(cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking));
    check_error(cudaStreamCreateWithFlags(&stream2, cudaStreamNonBlocking));

    // fill it with some data
    // ...

    // Space on the GPU. Avoid complications of zero copy for now.
    check_error(cudaMalloc(&firstMat,  500*500*3*sizeof(float)));
    check_error(cudaMalloc(&secondMat, 500*500*3*sizeof(float)));
    check_error(cudaMalloc(&result1,   500*500*3*sizeof(float)));
    check_error(cudaMalloc(&result2,   500*500*3*sizeof(float)));

    // Do some async memcopies (source is pinned, so these can overlap)
    check_error(cudaMemcpyAsync(firstMat,  source, 500*500*3*sizeof(float), cudaMemcpyHostToDevice, stream1));
    check_error(cudaMemcpyAsync(secondMat, source, 500*500*3*sizeof(float), cudaMemcpyHostToDevice, stream2));

    // Each gemm reads both matrices, so wait for both uploads to finish
    check_error(cudaDeviceSynchronize());

    // sgemm shown here since the buffers are float; hgemm would need __half data
    for (int i = 0; i < 20; i++) {
        cublasSetStream(handle, stream1);
        check_error(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 500, 500, 500,
                                &ALPHA, firstMat, 500, secondMat, 500, &BETA, result1, 500));
        cublasSetStream(handle, stream2);
        check_error(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 500, 500, 500,
                                &ALPHA, firstMat, 500, secondMat, 500, &BETA, result2, 500));
    }

Instead of parallel execution, I get kernels running on different streams but serialized. I even tried a variation of this example using threads and the --default-stream per-thread option, but I get serial execution there too. Oddly, in that case it uses the default stream as well as newly created streams rather than solely using the per-thread streams. :-(
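For what it’s worth, the threaded variant looked roughly like this (a simplified sketch compiled with nvcc --std=c++11 --default-stream per-thread; the sizes are placeholders and error checking is omitted):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <thread>

    // Each host thread gets its own handle; with --default-stream per-thread,
    // cuBLAS work issued without an explicit cublasSetStream should land in
    // that thread's private default stream.
    static void gemm_worker(const float *A, const float *B, float *C, int n)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;
        for (int i = 0; i < 20; i++) {
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &alpha, A, n, B, n, &beta, C, n);
        }
        cudaStreamSynchronize(cudaStreamPerThread);
        cublasDestroy(handle);
    }

    int main()
    {
        const int n = 500;
        float *A, *B, *C1, *C2;
        cudaMalloc(&A,  n * n * sizeof(float));
        cudaMalloc(&B,  n * n * sizeof(float));
        cudaMalloc(&C1, n * n * sizeof(float));
        cudaMalloc(&C2, n * n * sizeof(float));
        // ... fill A and B ...

        std::thread t1(gemm_worker, A, B, C1, n);
        std::thread t2(gemm_worker, A, B, C2, n);
        t1.join();
        t2.join();
        return 0;
    }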

I must be missing something obvious? Please help!

What makes you think 2 cublas kernels can run concurrently? This will be almost impossible to realize on a TK1 (1 SM) or a TX1 (2 SMs). Even on a device with many SMs, 400x800 is, to a first-order approximation, a load of 120,000 threads. The TK1 can process at most about 2048 threads at a time, so those will generally fully utilize the device before your second cublas call comes along. And 50% occupancy doesn’t necessarily mean that the device is 50% utilized: shared memory usage and other factors (which you have essentially no control over in CUBLAS) can drive occupancy down.

If your small matrices are of uniform size, what you want is to run the operations in a batched mode with an additional block index. I have kernels that do this for Maxwell, though not for Kepler. I’m not sure what support cublas has for this type of batching (none, I think). See the discussion between Maddy and me here:
https://devtalk.nvidia.com/default/topic/776043/cuda-programming-and-performance/whats-new-in-maxwell-sm_52-gtx-9xx-/post/4339547/#4339547

Here’s my code if you want to take a crack at using it… I don’t yet have a fully operational C API, but you should be able to extend what Baidu has provided. These kernels are across the board faster than cublas, and in some dimensions 2-3x faster.
https://github.com/NervanaSystems/nervanagpu

cublas has a batched gemm function for small matrices (~100x100) of uniform size:

[url]http://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-gemmbatched[/url]

I’m not sure if it does what you’re referring to by “this type of batching”; I don’t know if that’s just referring to the z-dimension comment or something else.
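For reference, the calling pattern looks roughly like this (an untested sketch; note that the arrays of per-matrix pointers must themselves reside in device memory):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    // batchCount independent C = A*B products, all m x n = (m x k)(k x n),
    // launched by a single cublasSgemmBatched call. Error checks omitted.
    void batched_gemm(cublasHandle_t handle,
                      const std::vector<float*>& dA,   // device matrices
                      const std::vector<float*>& dB,
                      const std::vector<float*>& dC,
                      int m, int n, int k)
    {
        int batchCount = (int)dA.size();
        const float alpha = 1.0f, beta = 0.0f;

        // The API wants the pointer arrays themselves in device memory.
        const float **dAarray; const float **dBarray; float **dCarray;
        cudaMalloc(&dAarray, batchCount * sizeof(float*));
        cudaMalloc(&dBarray, batchCount * sizeof(float*));
        cudaMalloc(&dCarray, batchCount * sizeof(float*));
        cudaMemcpy(dAarray, dA.data(), batchCount * sizeof(float*), cudaMemcpyHostToDevice);
        cudaMemcpy(dBarray, dB.data(), batchCount * sizeof(float*), cudaMemcpyHostToDevice);
        cudaMemcpy(dCarray, dC.data(), batchCount * sizeof(float*), cudaMemcpyHostToDevice);

        cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                           &alpha, dAarray, m, dBarray, k, &beta, dCarray, m,
                           batchCount);

        cudaDeviceSynchronize();   // let the batch finish before freeing
        cudaFree(dAarray); cudaFree(dBarray); cudaFree(dCarray);
    }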

I hadn’t seen that API call before now. I’m not sure how it’s implemented or why they say this: “For small sizes, typically smaller than 100x100”. The batched gemm I implemented has no such restriction. It looks like those kernels come from the templated C versions of cublas. Those are typically designed for small tile sizes and have very high bandwidth/power requirements. If you use a larger tile size like 64x128 (and preferably hand-assembled), you very quickly hit full utilization and stay there no matter how much additional data you throw at it.

Very interesting… I assumed the Theoretical Occupancy in NVVP expressed how many of the cores were being used, but it sounds like that’s not the case, since threads should fully occupy the device as long as there aren’t dependencies… So what are Theoretical and Achieved Occupancy trying to express to me? Is there a good metric for determining other dependencies?

Does the number of SMXs really limit things in this case? I’ve seen straight CUDA kernels (<<<>>>) run in parallel on separate streams on the TK1 hardware.

ScottGray: Yes, in fact I downloaded and played with your libraries last week when I was trying to overcome cuBLAS’s limitations. Alas, I couldn’t get an sgemm to run via C. Why are the kernels loaded at run time rather than just compiled in?

Theoretical occupancy may be dictated by the maximum number of threads (warps) that can be resident on an SM and by the number of threads (warps) in a kernel threadblock. The actual definition, as far as I know, is the maximum number of warps that can be resident as limited by the execution configuration and machine limits. Execution configuration includes threads per block and shared memory, but not necessarily registers per thread (I don’t think).

If your SM supports 2048 threads (true for Kepler and Maxwell) and you are launching kernels of 512 threads per block (for example), then the theoretical occupancy would be 4 blocks (ignoring other factors - just using threads or warps as an example).

The achieved occupancy is just how many blocks (or what that implies for threads or warps, depending on which measure you want to use) were actually running concurrently on an SM. But let’s be clear that a TK1, with a single SM, is at most only going to have “a few” blocks running at a time. And if your 400x800 matrix multiply causes cublas to create a kernel of, let’s say, 15x30 (i.e. 450) blocks, then one possible (and likely) threadblock scheduling would be to send all 450 of those blocks through the SM before the blocks of any other kernel are scheduled.
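If you want to make that concrete for a kernel of your own, the runtime has an occupancy query that does roughly the calculation the profiler reports as theoretical occupancy (a sketch; you can’t apply it to the CUBLAS kernels themselves, since you don’t have their function handles):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Stand-in kernel, just something to query.
    __global__ void dummy_kernel(float *data) { data[threadIdx.x] += 1.0f; }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        const int blockSize = 512;
        int maxBlocksPerSM = 0;
        // Max resident blocks of this kernel per SM, accounting for its
        // block size, register, and shared memory usage.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM,
                                                      dummy_kernel, blockSize, 0);

        float occupancy = (float)(maxBlocksPerSM * blockSize) /
                          (float)prop.maxThreadsPerMultiProcessor;
        printf("%d blocks of %d threads resident per SM -> %.0f%% theoretical occupancy\n",
               maxBlocksPerSM, blockSize, occupancy * 100.0f);
        return 0;
    }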

This gives a crisper definition of theoretical occupancy, achieved occupancy, and some reasons achieved occupancy may be lower than theoretical:

[url]http://http.developer.nvidia.com/NsightVisualStudio/2.2/Documentation/UserGuide/HTML/Content/Profile_CUDA_Settings.htm[/url]

In any event, for some reason this is a common question. Streams don’t provide a magical way to increase performance. They have their uses, certainly, but they don’t guarantee additional performance. If the machine is continuously, fully utilized, then exposing additional parallelism through stream usage is unlikely to have any significant performance benefit.

If you want to attack the reported 50% occupancy number, then that will come down to threadblock design for resource utilization. With CUBLAS, you have essentially no control over this.

CUBLAS kernels, for reasonably large matrix sizes, are likely to fully utilize the machine. This would be especially true for small machines like TK1 and TX1 that have 1 or 2 SMs.

The Kepler SM in your TK1 has 192 cores. Yet we’re talking about 512 threads per block (maybe) and 2048 threads per SM (maybe), so it’s clear you’re going to have to reset your thinking about how CUDA “cores were getting used”.

The number of SMs doesn’t limit how many streams you can launch, whether we are talking CUBLAS or “straight CUDA kernels”. What it does limit is the machine’s maximum throughput. And the only way you’re going to witness actual threadblock-level concurrency between two different kernels on a TK1 is to launch very small kernels with very small threadblock resource requirements. In practice, you will almost never witness it unless you work hard to construct an artificial case.

Even on a larger GPU, it’s uncommon to witness concurrent kernel execution unless the kernels themselves have a relatively small number of blocks, and the resource utilization is also limited.

I load cubins with the driver API because cubins are what my assembler produces. It’s also the format that is easiest to import into pycuda, which I used to build a fairly complete numpy-compatible API on top of.
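The driver-API side is only a handful of calls, something like this (a rough sketch; the cubin name, kernel name, launch shape, and parameter order are placeholders for whatever the assembler actually produced):

    #include <cuda.h>

    int main()
    {
        cuInit(0);
        CUdevice  dev;  cuDeviceGet(&dev, 0);
        CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

        // Load the hand-assembled cubin and look up the kernel by name.
        CUmodule module;   cuModuleLoad(&module, "sgemm.cubin");
        CUfunction kernel; cuModuleGetFunction(&kernel, module, "sgemm_nn_128x128");

        // Device buffers and scalar arguments (placeholder sizes/order).
        int m = 64, n = 64, k = 64;
        CUdeviceptr dA, dB, dC;
        cuMemAlloc(&dA, m * k * sizeof(float));
        cuMemAlloc(&dB, k * n * sizeof(float));
        cuMemAlloc(&dC, m * n * sizeof(float));
        void *params[] = { &dC, &dA, &dB, &m, &n, &k };

        // Grid/block shape must match what the sass kernel was written for.
        cuLaunchKernel(kernel, 1, 1, 1,    // grid
                               256, 1, 1,  // block
                               0, 0,       // shared mem, stream
                               params, 0);
        cuCtxSynchronize();

        cuModuleUnload(module);
        cuCtxDestroy(ctx);
        return 0;
    }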

It is entirely possible to compile those cubins into a program and use them with the runtime API; nvcc does this for you. You can see this if you study the steps output with the --verbose and --keep options when compiling a normal runtime kernel. There’s actually a hidden option in nvcc that lets you do it in one step:

nvcc -lib -arch sm_50 -use-cubin code=sm_50,cubin=sgemm.cubin -o sgemm.lib sgemm.cu

That produces a linkable lib from which you can call the kernel with the <<<>>> syntax. The .cu file must contain a single kernel with an empty function body, so you’d have to use a modified version of one of my .cu files. I use the .cu files to generate the initial cubin for the assembler to inject sass into.
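The stub would look something like this (hypothetical; the name and signature have to match the kernel inside the cubin exactly):

    // sgemm.cu -- stub whose only job is to give nvcc a symbol to bind
    // the hand-assembled cubin to. The body stays empty; the real code is
    // the sass injected into sgemm.cubin.
    extern "C" __global__ void sgemm_nn_128x128(float *C, const float *A,
                                                const float *B, int m, int n, int k)
    {
        // intentionally empty
    }

Once it’s built into the lib with the -use-cubin command above, the kernel can be launched from runtime-API code with the usual sgemm_nn_128x128<<<grid, block>>>(dC, dA, dB, m, n, k) syntax.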

I wish I had the time to build a nice, easy-to-use C API to import all my kernels with, but I’m too swamped with other tasks to spend time on this. Baidu did some work toward this (they are heavy users of my kernels) but only exposed the features they needed (batched mode not being one of them).

As far as streams go I’ve yet to find a use for them in my work. Nearly all the kernels I call do a pretty good job of filling the SMs with work, or are serially dependent on previous operations.

Many thanks, both of you… That’s corrected my thinking on the ‘not very helpful’ occupancy stats from NVVP and given me some great new pointers on using the NervanaGPU kernels.

I’m late to this thread. Concurrent kernel support on the TK1 was broken as of L4T 21.2.

The bug was reported back in 2014.

The existence of a fix, and the fact that the TK1 supports 4 concurrent streams, were stated in this GTC 2015 session:

Maximizing Face Detection Performance

I don’t know if the fix ever made it into L4T 21.3 or .4.

But, as others have commented here, occupying a single TK1 sm_32 SMX is easily accomplished with high-intensity kernels like those found in cuBLAS or Scott’s kernels. :)

Here’s a great post from Baidu comparing my gemm kernels to cublas over a wide parameter range:
http://svail.github.io/

The key difference is that I spent a lot of time trying to get decent performance out of small minibatch dimensions. My 32x128 and 128x32 tiles would probably make a worthy addition to cublas.

Are you offering your code under a BSD license?

It is Apache 2. That license is pretty permissive.

Ideally I’d like to see Nvidia be a little more open with their libraries. There’s just so much more you can do with an open source implementation compared to being forced to use a closed source API.

For example, a lot of the poor performance of cublas is just the optimizer making bad choices about which tile size (or combination of tiles) to use. A way to manually override that selection would be pretty useful. So would a way to embed some custom code inside the gemm kernels (that is essentially what my convolution kernels do).

I asked because as you can see from the CUBLAS documentation, third-party code under a BSD license has been incorporated into CUBLAS, while last I checked I saw no mention of code used under any other license. License terms are a serious matter to corporate lawyers. In my experience, a 2-clause or 3-clause BSD license is most palatable to open source projects and corporate entities alike, so to increase the likelihood of others picking up one’s code, that is a good way to go.

Having spent a significant portion of my professional career providing library functionality, I would point out that all libraries are necessarily compromises: they will never deliver top performance for every possible use case. The same applies to compiler code generation. At some point one has to weigh the implementation and maintenance cost of additional code against economic benefits; simply creating more specialized code paths stops being economical.

That said, much of library development everywhere is driven by market forces, so improvements are most likely to occur in areas with the highest demand. Clearly, your field is such an area at this point in time :-)

As for open sourcing code produced by corporate entities at significant expense, the minimum pre-requisite is that there needs to be an economic incentive to do so. In my experience, those incentives rarely exist, but it depends on a company’s business model. If the business model is service contracts, using open source may work great (see Red Hat). For most other business models, usually not so much. I observe that quite a bit of code is turned into open source simply to save a company further expenses for maintaining some project, i.e. they “dump” the project but hope to retain some residual benefits. Sometimes this works, but often such projects go nowhere.

Seems to me nvidia does just fine selling hardware. Having a more open and flexible software platform would only foster increased community support and innovation in parallel computing.

The question, as I see it (and this may sound cynical), is: does more open software drive additional revenues and profits for the corporate entity providing the software, rather than for the entities using it?

I am very skeptical about open sourcing fostering increased community involvement. I was the original maintainer of CUBLAS (the code for which was made available once, around CUDA 1.1), I was the maintainer of the CUDA standard math library for many years (the source code for which was completely available for inspection the entire time, though not under an open-source license), and I also provided BSD-licensed code for download from the CUDA developer website. While a couple of contributions were incorporated, overall the feedback received over nine years was absolutely minimal.

If open sourcing pretty much requires providing already polished projects on the repository site du jour as a prerequisite for any community involvement, that seems like a cost factor more than anything else. I would be interested to learn how much useful feedback and how many contributions you have received for your open-source efforts, such as the Maxwell assembler.

Well, I’ve had lots of interest in Maxas, but mostly inquiries about support for Kepler. I haven’t had any help adding assembler features, but I suspect that is mainly due to my poor choice of implementing it in a mostly dead language (Perl). I’ll be porting it to Python in the not-too-distant future (probably prior to the Pascal release, which should have a largely binary-compatible ISA with Maxwell). I also plan to add some features that will make it vastly easier to write assembly (the big one being full-kernel register liveness evaluation and automatic optimal register allocation).

With deep learning as the new killer app for GPUs, it’s hard to imagine a tool designed to optimize dense linear algebra operations not attracting more interest. The amount of investment pouring into this field is only going to grow, which means plenty of opportunity for clever engineers (both inside and outside of nvidia) to innovate. They’re going to pick the tools that give them the most flexibility to do so.

In the case of neon, our open source deep learning Python framework built on top of my kernels, we’ve had huge interest from the community. There we more often have to reject submissions in order to keep things organized.

So maybe your experience was just bad timing: there weren’t enough skilled engineers who knew what to do with the technology, and no application big enough to draw more people into it.

Your point about chances for community involvement being significantly higher for “hot” technologies probably applies. And it applies to open source projects as well. I recall bugs in the math library portion of glibc that sat unfixed for about ten years (2004 to 2014) until someone volunteered fixes.

I probably should not have used the word “feedback” in what I wrote above, that was poor terminology. There always was a fair amount of feedback in the form of requests for new features or higher performance. That seems akin to the situation with your Maxwell assembler from what you write above.

What rarely happened was someone pointing out a better way to accomplish some task, pointing to a relevant paper, or providing demonstrably faster code. That is what I would consider community involvement. Such non-collaborative community involvement is possible, and does occur, even with entirely closed-source software, as I know from personal experience.