CUB Library

Hi all,

I’m trying to compute a prefix sum using the CUB library, but I get a bunch of errors. Does anyone know how to get this to run?

My errors:
: error: class “cub::BlockScan<int, 1024, cub::BLOCK_SCAN_RAKING, 1, 1, 200>” has no member “SmemStorage”
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=1024, ITEMS_PER_THREAD=1]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=1024, ITEMS_PER_THREAD=1]”
(250): here

inclusive.cu(82): error: identifier “BlockLoadVectorized” is undefined
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=1024, ITEMS_PER_THREAD=1]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=1024, ITEMS_PER_THREAD=1]”
(250): here

inclusive.cu(89): error: no instance of overloaded function “cub::BlockScan<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::ExclusiveSum [with T=int, BLOCK_DIM_X=1024, ALGORITHM=cub::BLOCK_SCAN_RAKING, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=200]” matches the argument list
argument types are: (, int [1], int [1], int)
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=1024, ITEMS_PER_THREAD=1]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=1024, ITEMS_PER_THREAD=1]”
(250): here

inclusive.cu(95): error: identifier “BlockStoreVectorized” is undefined
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=1024, ITEMS_PER_THREAD=1]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=1024, ITEMS_PER_THREAD=1]”
(250): here

inclusive.cu(78): error: class “cub::BlockScan<int, 512, cub::BLOCK_SCAN_RAKING, 1, 1, 200>” has no member “SmemStorage”
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=512, ITEMS_PER_THREAD=2]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=512, ITEMS_PER_THREAD=2]”
(251): here

inclusive.cu(82): error: identifier “BlockLoadVectorized” is undefined
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=512, ITEMS_PER_THREAD=2]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=512, ITEMS_PER_THREAD=2]”
(251): here

inclusive.cu(89): error: no instance of overloaded function “cub::BlockScan<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::ExclusiveSum [with T=int, BLOCK_DIM_X=512, ALGORITHM=cub::BLOCK_SCAN_RAKING, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=200]” matches the argument list
argument types are: (, int [2], int [2], int)
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=512, ITEMS_PER_THREAD=2]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=512, ITEMS_PER_THREAD=2]”
(251): here

inclusive.cu(95): error: identifier “BlockStoreVectorized” is undefined
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=512, ITEMS_PER_THREAD=2]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=512, ITEMS_PER_THREAD=2]”
(251): here

inclusive.cu(78): error: class “cub::BlockScan<int, 256, cub::BLOCK_SCAN_RAKING, 1, 1, 200>” has no member “SmemStorage”
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=256, ITEMS_PER_THREAD=4]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=256, ITEMS_PER_THREAD=4]”
(252): here

inclusive.cu(82): error: identifier “BlockLoadVectorized” is undefined
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=256, ITEMS_PER_THREAD=4]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=256, ITEMS_PER_THREAD=4]”
(252): here

inclusive.cu(89): error: no instance of overloaded function “cub::BlockScan<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::ExclusiveSum [with T=int, BLOCK_DIM_X=256, ALGORITHM=cub::BLOCK_SCAN_RAKING, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=200]” matches the argument list
argument types are: (, int [4], int [4], int)
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=256, ITEMS_PER_THREAD=4]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=256, ITEMS_PER_THREAD=4]”
(252): here

inclusive.cu(95): error: identifier “BlockStoreVectorized” is undefined
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=256, ITEMS_PER_THREAD=4]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=256, ITEMS_PER_THREAD=4]”
(252): here

inclusive.cu(78): error: class “cub::BlockScan<int, 128, cub::BLOCK_SCAN_RAKING, 1, 1, 200>” has no member “SmemStorage”
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=128, ITEMS_PER_THREAD=8]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=128, ITEMS_PER_THREAD=8]”
(253): here

inclusive.cu(82): error: identifier “BlockLoadVectorized” is undefined
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=128, ITEMS_PER_THREAD=8]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=128, ITEMS_PER_THREAD=8]”
(253): here

inclusive.cu(89): error: no instance of overloaded function “cub::BlockScan<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::ExclusiveSum [with T=int, BLOCK_DIM_X=128, ALGORITHM=cub::BLOCK_SCAN_RAKING, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=200]” matches the argument list
argument types are: (, int [8], int [8], int)
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=128, ITEMS_PER_THREAD=8]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=128, ITEMS_PER_THREAD=8]”
(253): here

inclusive.cu(95): error: identifier “BlockStoreVectorized” is undefined
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=128, ITEMS_PER_THREAD=8]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=128, ITEMS_PER_THREAD=8]”
(253): here

inclusive.cu(78): error: class “cub::BlockScan<int, 64, cub::BLOCK_SCAN_RAKING, 1, 1, 200>” has no member “SmemStorage”
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=64, ITEMS_PER_THREAD=16]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=64, ITEMS_PER_THREAD=16]”
(254): here

inclusive.cu(82): error: identifier “BlockLoadVectorized” is undefined
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=64, ITEMS_PER_THREAD=16]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=64, ITEMS_PER_THREAD=16]”
(254): here

inclusive.cu(89): error: no instance of overloaded function “cub::BlockScan<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::ExclusiveSum [with T=int, BLOCK_DIM_X=64, ALGORITHM=cub::BLOCK_SCAN_RAKING, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=200]” matches the argument list
argument types are: (, int [16], int [16], int)
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=64, ITEMS_PER_THREAD=16]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=64, ITEMS_PER_THREAD=16]”
(254): here

inclusive.cu(95): error: identifier “BlockStoreVectorized” is undefined
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=64, ITEMS_PER_THREAD=16]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=64, ITEMS_PER_THREAD=16]”
(254): here

inclusive.cu(78): error: class “cub::BlockScan<int, 32, cub::BLOCK_SCAN_RAKING, 1, 1, 200>” has no member “SmemStorage”
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=32, ITEMS_PER_THREAD=32]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=32, ITEMS_PER_THREAD=32]”
(255): here

inclusive.cu(82): error: identifier “BlockLoadVectorized” is undefined
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=32, ITEMS_PER_THREAD=32]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=32, ITEMS_PER_THREAD=32]”
(255): here

inclusive.cu(89): error: no instance of overloaded function “cub::BlockScan<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::ExclusiveSum [with T=int, BLOCK_DIM_X=32, ALGORITHM=cub::BLOCK_SCAN_RAKING, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=200]” matches the argument list
argument types are: (, int [32], int [32], int)
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=32, ITEMS_PER_THREAD=32]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=32, ITEMS_PER_THREAD=32]”
(255): here

inclusive.cu(95): error: identifier “BlockStoreVectorized” is undefined
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=32, ITEMS_PER_THREAD=32]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=32, ITEMS_PER_THREAD=32]”
(255): here

inclusive.cu(78): error: class “cub::BlockScan<int, 16, cub::BLOCK_SCAN_RAKING, 1, 1, 200>” has no member “SmemStorage”
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=16, ITEMS_PER_THREAD=64]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=16, ITEMS_PER_THREAD=64]”
(256): here

inclusive.cu(82): error: identifier “BlockLoadVectorized” is undefined
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=16, ITEMS_PER_THREAD=64]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=16, ITEMS_PER_THREAD=64]”
(256): here

inclusive.cu(89): error: no instance of overloaded function “cub::BlockScan<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::ExclusiveSum [with T=int, BLOCK_DIM_X=16, ALGORITHM=cub::BLOCK_SCAN_RAKING, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=200]” matches the argument list
argument types are: (, int [64], int [64], int)
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=16, ITEMS_PER_THREAD=64]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=16, ITEMS_PER_THREAD=64]”
(256): here

inclusive.cu(95): error: identifier “BlockStoreVectorized” is undefined
detected during:
instantiation of “void BlockPrefixSumKernel<BLOCK_THREADS,ITEMS_PER_THREAD>(int *, int *, clock_t *) [with BLOCK_THREADS=16, ITEMS_PER_THREAD=64]”
(183): here
instantiation of “void Test<BLOCK_THREADS,ITEMS_PER_THREAD>() [with BLOCK_THREADS=16, ITEMS_PER_THREAD=64]”
(256): here

28 errors detected in the compilation of “/tmp/tmpxft_0000d7bc_00000000-8_inclusive.cpp1.ii”.

Without showing any code, or describing your setup, it’s difficult to make specific suggestions.

The following code runs correctly for me on CUDA 7, Fedora 20, a Quadro 5000 GPU, and the latest cub master (1.4.1):

http://nvlabs.github.io/cub/

downloaded and properly installed in /usr/local/cuda/include/cub:

$ cat t736.cu
#include <cub/cub.cuh>
#include <stdio.h>

int main(){

  // Declare, allocate, and initialize device pointers for input and output
  int num_items = 7;
  int *d_in;
  int h_in[]  = {8, 6, 7, 5, 3, 0, 9};
  int sz = sizeof(h_in)/sizeof(h_in[0]);
  int *d_out; // e.g., [ , , , , , , ]
  cudaMalloc(&d_in,  sz*sizeof(h_in[0]));
  cudaMalloc(&d_out, sz*sizeof(h_in[0]));
  cudaMemcpy(d_in, h_in, sz*sizeof(h_in[0]), cudaMemcpyHostToDevice);
  printf("\nInput:\n");
  for (int i = 0; i < sz; i++) printf("%d ", h_in[i]);
  // Determine temporary device storage requirements
  void *d_temp_storage = NULL;
  size_t temp_storage_bytes = 0;
  cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
  // Allocate temporary storage
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  // Run inclusive prefix sum
  cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
  // d_out <-- [8, 14, 21, 26, 29, 29, 38]
  cudaMemcpy(h_in, d_out, sz*sizeof(h_in[0]), cudaMemcpyDeviceToHost);
  printf("\nOutput:\n");
  for (int i = 0; i < sz; i++) printf("%d ", h_in[i]);
  printf("\n");
  return 0;
}
$ nvcc -o t736 t736.cu
$ cuda-memcheck ./t736
========= CUDA-MEMCHECK

Input:
8 6 7 5 3 0 9
Output:
8 14 21 26 29 29 38
========= ERROR SUMMARY: 0 errors
$

The above code is basically a straightforward completion of the code shown here:

http://nvlabs.github.io/cub/structcub_1_1_device_scan.html#a9416ac1ea26f9fde669d83ddc883795a

I want to run a prefix scan on a large array, around 35000000 elements. My code only works for 1024 elements, and I don't know why!

Here is my code:

#define CUB_STDERR

#include <stdio.h>
#include <iostream>
#include <sys/time.h>
#include "cub/cub.cuh"
#include <stdlib.h>
using namespace cub;

//---------------------------------------------------------------------
// Globals, constants and typedefs
//---------------------------------------------------------------------

bool g_verbose      = false;
unsigned long g_iterations    = 100;
cudaEvent_t cstart, stop;
unsigned long TILE_SIZE;
//#define  dim  = 2;
//---------------------------------------------------------------------
// Kernels
//---------------------------------------------------------------------

/**
 * Simple kernel for performing a block-wide inclusive prefix sum over unsigned long integers
 */
template <
    unsigned long         BLOCK_THREADS,
    unsigned long         ITEMS_PER_THREAD,
	BlockScanAlgorithm  ALGORITHM>
__global__ void BlockPrefixSumKernel(
    unsigned long         *d_in,          // Tile of input
    unsigned long         *d_out,         // Tile of output
    clock_t     *d_elapsed)     // Elapsed cycle count of block scan
{
	
	// Specialize BlockLoad type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
    typedef BlockLoad<unsigned long*, BLOCK_THREADS, ITEMS_PER_THREAD> BlockLoadT;

    // Specialize BlockStore type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
    typedef BlockStore<unsigned long*, BLOCK_THREADS, ITEMS_PER_THREAD> BlockStoreT;
    // Parameterize BlockScan type for our thread block
    typedef BlockScan<unsigned long, BLOCK_THREADS,ALGORITHM> BlockScanT;
	 __shared__ union
    {
        typename BlockLoadT::TempStorage    load;
        typename BlockStoreT::TempStorage   store;
        typename BlockScanT::TempStorage    scan;
    } temp_storage;

    // Per-thread tile data
    unsigned long data[ITEMS_PER_THREAD];
    BlockLoadT(temp_storage.load).Load(d_in, data);
	
 // Barrier for smem reuse
    __syncthreads();

    // Start cycle timer
    clock_t start = clock();

    // Compute inclusive prefix sum
    unsigned long aggregate;
	
	
	BlockScanT(temp_storage.scan).InclusiveSum(data, data, aggregate);


    // Stop cycle timer
    clock_t stop = clock();
	
	 // Barrier for smem reuse
    __syncthreads();

    // Store output
   BlockStoreT(temp_storage.store).Store(d_out, data);

    // Store aggregate and elapsed clocks
    if (threadIdx.x == 0)
    {
        *d_elapsed = (start > stop) ? start - stop : stop - start;
        d_out[BLOCK_THREADS * ITEMS_PER_THREAD] = aggregate;
    }
}



//---------------------------------------------------------------------
// Host utilities
//---------------------------------------------------------------------

/**
 * Initialize exclusive prefix sum problem (and solution).
 * Returns the aggregate
 */
unsigned long Initialize(
    unsigned long *h_in,
    unsigned long *h_reference,
    unsigned long num_elements)
{
    unsigned long inclusive = 0;
	unsigned long dim=2; 
	
	
	//unsigned long  y[num_elements];
    unsigned long *y= (unsigned long *) malloc(sizeof(unsigned long) *num_elements);

  for (unsigned long i = 0; i < num_elements; i++) {
    y[i] = i ;
     h_in[i] = y[i];

        h_reference[i] = inclusive;
        inclusive += h_in[i];
  }
if(g_verbose){
  printf("Input data: ");
        for (unsigned long i = 0; i < num_elements; i++)
            printf("%lu, ", y[i]);
        printf("\n\n");
}
 
    return inclusive;
}


/**
 * Test thread block scan
 */
template <
    unsigned long BLOCK_THREADS,
    unsigned long ITEMS_PER_THREAD,
	BlockScanAlgorithm  ALGORITHM>
void Test()
{
	
    //unsigned long TILE_SIZE = 35000000;//BLOCK_THREADS * ITEMS_PER_THREAD;
//fprintf(stdout, "I am here\n");

    // Allocate host arrays
    unsigned long *h_in           = new unsigned long[TILE_SIZE];
    unsigned long *h_reference    = new unsigned long[TILE_SIZE];
    unsigned long *h_gpu          = new unsigned long[TILE_SIZE + 1];

    // Initialize problem and reference output on host
    unsigned long h_aggregate = Initialize(h_in, h_reference, TILE_SIZE);

    // Initialize device arrays
    unsigned long *d_in           = NULL;
    unsigned long *d_out          = NULL;
    clock_t *d_elapsed  = NULL;
   if(cudaSuccess!= cudaMalloc((void**)&d_in, sizeof(unsigned long) * TILE_SIZE)) fprintf(stderr,"could not allocate array\n");
    if(cudaSuccess!= cudaMalloc((void**)&d_out,         sizeof(unsigned long) * (TILE_SIZE + 1))) fprintf(stderr,"could not allocate array\n");
    if(cudaSuccess!= cudaMalloc((void**)&d_elapsed,     sizeof(clock_t))) fprintf(stderr,"could not work");
	
    // Display input problem data
    if (g_verbose)
    {
        printf("Residuals: ");
        for (unsigned long i = 0; i < TILE_SIZE; i++)
            printf("%lu, ", h_in[i]);
        printf("\n\n");
    }

    // Copy problem to device
    cudaMemcpy(d_in, h_in, sizeof(unsigned long) * TILE_SIZE, cudaMemcpyHostToDevice);
	cudaEventCreate(&cstart);
    cudaEventCreate(&stop);
	float milliseconds;

    printf("BlockScan %lu items (%lu threads, %lu items per thread): ",
        TILE_SIZE, BLOCK_THREADS, ITEMS_PER_THREAD);

    // Run this several times and average the performance results
    clock_t elapsed_scan_clocks     = 0;
    cudaEventRecord(cstart, 0);
        // Run aggregate/prefix kernel
        BlockPrefixSumKernel<BLOCK_THREADS, ITEMS_PER_THREAD,ALGORITHM><<<1, BLOCK_THREADS>>>(
            d_in,
            d_out,
            d_elapsed);
            
            cudaEventRecord(stop, 0);
   cudaEventSynchronize(stop);
   cudaEventElapsedTime(&milliseconds, cstart, stop);
        // Copy results from device
        clock_t scan_clocks;
		

  

  cudaEventSynchronize(stop);
  
  printf("compute time :  %.4lf\n\n ", milliseconds);
        cudaMemcpy(h_gpu, d_out, sizeof(unsigned long) * (TILE_SIZE + 1), cudaMemcpyDeviceToHost);
        cudaMemcpy(&scan_clocks, d_elapsed, sizeof(clock_t), cudaMemcpyDeviceToHost);
        elapsed_scan_clocks += scan_clocks;
    

   

    
    // Display results problem data
    if (g_verbose)
    {
        printf("GPU output (reference output): ");
        for (unsigned long i = 0; i < TILE_SIZE; i++)
            printf("%lu, ", h_gpu[i]);
       
        printf("\n\n");
    }

    // Display timing results
    printf("Average clocks per 32-bit unsigned long scanned: %.3f\n\n", float(elapsed_scan_clocks) / TILE_SIZE / g_iterations);

    // Cleanup
    if (h_in) delete[] h_in;
    if (h_reference) delete[] h_reference;
    if (h_gpu) delete[] h_gpu;
    if (d_in) cudaFree(d_in);
    if (d_out) cudaFree(d_out);
    if (d_elapsed) cudaFree(d_elapsed);
}



int main(int argc, char** argv)
{if (argc != 2) {
		printf("not enough inputs\n\n");
		exit(0);
    	}
    // Display GPU name
    
    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, 0);
    printf("Using device %s\n", props.name);
    TILE_SIZE = strtoull(argv[1], NULL, 10);
      
    Test<1024, 1, BLOCK_SCAN_RAKING>();
    Test<512, 2, BLOCK_SCAN_RAKING>();
    Test<256, 4, BLOCK_SCAN_RAKING>();
    Test<128, 8, BLOCK_SCAN_RAKING>();
  
    printf("-------------\n");

    Test<1024, 1, BLOCK_SCAN_RAKING_MEMOIZE>();
    Test<512, 2, BLOCK_SCAN_RAKING_MEMOIZE>();
    Test<256, 4, BLOCK_SCAN_RAKING_MEMOIZE>();
    Test<128, 8, BLOCK_SCAN_RAKING_MEMOIZE>();
   

    printf("-------------\n");

    Test<1024, 1, BLOCK_SCAN_WARP_SCANS>();
    Test<512, 2, BLOCK_SCAN_WARP_SCANS>();
    Test<256, 4, BLOCK_SCAN_WARP_SCANS>();
    Test<128, 8, BLOCK_SCAN_WARP_SCANS>();
 
    



    return 0;
}

Why are you using the block-level scan? Why not use the device-wide scan?

Your first kernel template parameter is block threads. You will not be able to use more than 1024 for that, because the underlying CUDA kernels have a limit of 1024 threads per block.

You were already sent a sample code demonstrating a cub scan on a large array in your email to the cub mailing list:

[url]Redirecting to Google Groups

@DIBID, to address the original question: I’m guessing DIBID is running the Udacity example code for parallel programming, since I got the same errors when compiling Udacity’s cs344 Lesson 7 code snippet example_block_scan_cum.cu. From a Google search, others have had the same problem, and the error messages themselves point to the cause.

cf. original: https://github.com/udacity/cs344/blob/master/Lesson%20Code%20Snippets/Lesson%207%20Code%20Snippets/cub/example_block_scan_cum.cu, my correction: https://github.com/ernestyalumni/cs344/blob/master/Lesson%20Code%20Snippets/Lesson%207%20Code%20Snippets/cub/example_block_scan_cum.cu

I’ve searched the CUB documentation page and its modules doc page for

SmemStorage, BlockLoadVectorized

and I don’t think they exist anymore! When I do a Google search for them, I only find presentations from as early as 2007 using them.

So I think they’ve been superseded.

Looking at the latest CUB (12-2016) and its examples subdirectory, I copied the implementation of the BlockPrefixSumKernel template from example_block_scan.cu. Of note is the use of

  • BlockLoad
  • BlockStore

and the declaration of a union placed in `__shared__` memory:

 __shared__ union
    {
        typename BlockLoadT::TempStorage    load;
        typename BlockStoreT::TempStorage   store;
        typename BlockScanT::TempStorage    scan;
    } temp_storage;  

So that's new and, from what I found, has probably superseded `SmemStorage` and `BlockLoadVectorized`.

An update of the example code would be nice, and an explanation of what this `__shared__` union does would help me in particular.