How can I run cuBLAS sgemm concurrently with streams?

I am doing some experiments on a GTX 1080, using multiple threads with different streams and compiling with "--default-stream per-thread". My hand-written kernels run concurrently just fine, but when I call cublas gemm() it runs sequentially, even for small matrix sizes.
I found a similar report to mine here:
https://devtalk.nvidia.com/default/topic/880560/?comment=4679441

A cublas gemm call is likely to “fill up” your GPU, so that actually witnessing concurrency is difficult or impossible.

In any event, there are other possibilities (e.g. an implementation error in your code) which would not be possible to determine without seeing a full test case. If you want to see concurrent gemm operations, you would need to use very small matrix sizes, and I'm not sure that multithreading with --default-stream per-thread is the right way to do it.

Hi txbob
My test code mainly refers to "A Multi-threading Example" here: https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

#include <pthread.h>
#include <stdio.h>

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159,i));
    }
}

void *launch_kernel(void *dummy)
{
    float *data;
    cudaMalloc(&data, N * sizeof(float));

    kernel<<<1, 64>>>(data, N);

    cudaStreamSynchronize(0);

    return NULL;
}

int main()
{
    const int num_threads = 8;

    pthread_t threads[num_threads];

    for (int i = 0; i < num_threads; i++) {
        if (pthread_create(&threads[i], NULL, launch_kernel, 0)) {
            fprintf(stderr, "Error creating thread\n");
            return 1;
        }
    }

    for (int i = 0; i < num_threads; i++) {
        if(pthread_join(threads[i], NULL)) {
            fprintf(stderr, "Error joining thread\n");
            return 2;
        }
    }

    cudaDeviceReset();

    return 0;
}
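For reference, the example above only shows kernel concurrency when it is built with per-thread default streams. A typical build and profile invocation (the source file name here is my assumption) would be:

```shell
# assuming the code above is saved as mt_test.cu
nvcc --default-stream per-thread mt_test.cu -o mt_test -lpthread
nvprof ./mt_test        # or open the timeline in nvvp to check for overlap
```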

I also ran it in nvvp. When I replace the kernel launch in the launch_kernel function with a call to sgemm, the calls do not run concurrently, even for a 2x2 square matrix. But the kernel below runs concurrently without problems.

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159,i));
    }
}

Thanks all.
I have read the documentation at http://docs.nvidia.com/cuda/cublas/index.html#parallelism-with-streams
I moved all of the initialization work into the thread function, so each thread only calls sgemm, and I added a cublasSetStream() call in each thread with that thread's own stream. With that change, a few of the sgemm calls do run concurrently.
I created 16 threads and tested a small matrix size (M=512, N=1024, K=1320); in the end there were three groups of two sgemm calls executing in parallel.
Here is my thread function code:

void *launch_kernel(void *dummy)
{
  struct Matrix *mat = (struct Matrix *)dummy;
  cudaStream_t stream;
  cublasHandle_t handle;
  cudaStreamCreate(&stream);
  cublasCreate(&handle);
  cublasSetStream(handle, stream);    // bind this thread's sgemm to its own stream
  float alpha = 1.0f;
  float beta  = 0.0f;
  // C = B^T * A in column-major order; m, n, k are the gemm dimensions
  // (assumed here to be carried in struct Matrix alongside the device pointers)
  cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
              mat->n, mat->m, mat->k,
              &alpha, mat->b, mat->k,
              mat->a, mat->k,
              &beta, mat->c, mat->n);
  cudaStreamSynchronize(stream);      // wait on this thread's stream, not the default stream
  cublasDestroy(handle);
  cudaStreamDestroy(stream);
  return NULL;
}
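For completeness, the driver side of this test would look roughly like the sketch below. The struct Matrix layout and the setup code are my assumptions based on the description above (16 threads, M=512, N=1024, K=1320, per-thread device buffers); they are not from the original post:

```cpp
#include <pthread.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Assumed layout: device pointers plus the gemm dimensions used by launch_kernel().
struct Matrix {
    float *a, *b, *c;   // device buffers
    int m, n, k;        // gemm dimensions
};

void *launch_kernel(void *dummy);   // the thread function shown above

int main()
{
    const int num_threads = 16;
    pthread_t threads[num_threads];
    struct Matrix mats[num_threads];

    for (int i = 0; i < num_threads; i++) {
        mats[i].m = 512; mats[i].n = 1024; mats[i].k = 1320;
        cudaMalloc(&mats[i].a, mats[i].m * mats[i].k * sizeof(float));
        cudaMalloc(&mats[i].b, mats[i].k * mats[i].n * sizeof(float));
        cudaMalloc(&mats[i].c, mats[i].m * mats[i].n * sizeof(float));
        pthread_create(&threads[i], NULL, launch_kernel, &mats[i]);
    }
    for (int i = 0; i < num_threads; i++)
        pthread_join(threads[i], NULL);

    cudaDeviceReset();
    return 0;
}
```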

The nvvp profile result here: