Will this cause trouble?

I tried to implement a class that automatically allocates and releases device memory. It follows the rule of 5. I have been getting errors (error code = 30) since upgrading to CUDA 8 RC (and with 1080 cards). My class looks like this; is it OK?

template<typename T>
    class Mat {
    public:
        Mat() {};
        Mat(const T * data, int width, int height, int depth) {
            set_size(width, height, depth);
            checkCudaErrors( cudaMemcpy(data_dev_, data, width_ * height_ * depth_ * sizeof(T), cudaMemcpyHostToDevice) );    
        }
        
        Mat(int width, int height, int depth) {
            set_size(width, height, depth);
        }
        
        //copy constructor
        Mat(const Mat & other) {
            if (other.width_ > 0 && other.height_ > 0 && other.depth_ > 0) {
                width_ = other.width_;
                height_ = other.height_;
                depth_ = other.depth_;
                if(other.data_dev_ != nullptr) {
                    checkCudaErrors( cudaMalloc((void **)&data_dev_, other.width_ * other.height_ * other.depth_ * sizeof(T)) );
                    checkCudaErrors( cudaMemcpy(data_dev_, other.data_dev_, width_ * height_ * depth_ * sizeof(T), cudaMemcpyDeviceToDevice) ); 
                }
            }
        }
        //move constructor
        Mat(Mat&& other) {
            if (other.width_ > 0 && other.height_ > 0 && other.depth_ > 0) {
                width_ = other.width_;
                height_ = other.height_;
                depth_ = other.depth_;
                data_dev_ = other.data_dev_;
                other.data_dev_ = nullptr;
            }
        }
        //destructor
        ~Mat() {
            clear();
        }
        
        //copy assignment operator
        Mat& operator= (const Mat& other) {
            Mat mat(other); // re-use copy-constructor
            *this = std::move(mat); // re-use move-assignment
            return *this;
        }
        
        //move assignment operator
        Mat& operator= (Mat&& other) {
            // simplified move-constructor that also protects against move-to-self.
            std::swap(width_, other.width_); // repeat for all elements
            std::swap(height_, other.height_);
            std::swap(depth_, other.depth_);
            std::swap(data_dev_, other.data_dev_);
            return *this;
        }

        int clear() {
            if (data_dev_ != nullptr) {
                checkCudaErrors( cudaFree(data_dev_) );
                data_dev_ = nullptr;
            }
            width_ = 0;
            height_ = 0;
            depth_ = 0;
            
            return 0;
        }
        int set_size(int width, int height, int depth) {
            clear();
            
            width_ = width;
            height_ = height;
            depth_ = depth;
            
            checkCudaErrors( cudaMalloc((void **)&data_dev_, width_ * height_ * depth_ * sizeof(T)) );
            
            return 0;
        }
        int width_ = 0;
        int height_ = 0;
        int depth_ = 0;
        T * data_dev_ = nullptr;
        
    };

If your Mat object goes out of scope at program termination, the call of cudaFree in clear() in the destructor could be problematic. Didn’t we cover this already?

If you want help, why not at least indicate the circumstances under which you get error 30? For example, is it upon calling the constructor? The destructor? One of the other methods?

Even better, provide a short, simple reproducer code that uses this class/object and produces the indicated error.
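One way to find out which call reports the error is to have the error check record the call text and its file and line. The following is only a minimal sketch under stated assumptions: `Status`, `check_status`, `CHECK_STATUS`, `fake_alloc_ok` and `fake_free_fails` are hypothetical names, and a plain `int` stands in for `cudaError_t` so that the example compiles without the CUDA toolkit. It is not the `checkCudaErrors` helper used in the class above.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Stand-in status type: 0 means success, as with cudaSuccess.
// In real CUDA code this would be cudaError_t (assumption for this sketch).
using Status = int;

// Builds a message naming the failing expression and its location,
// then throws so the failure cannot be silently ignored.
inline void check_status(Status status, const char* expr,
                         const char* file, int line) {
    if (status == 0) return;
    std::ostringstream msg;
    msg << file << ":" << line << " '" << expr
        << "' failed with error code " << status;
    throw std::runtime_error(msg.str());
}

// Hypothetical macro name; captures the call text and the call site.
#define CHECK_STATUS(call) check_status((call), #call, __FILE__, __LINE__)

// Hypothetical calls used only to exercise the macro.
inline Status fake_alloc_ok() { return 0; }
inline Status fake_free_fails() { return 30; }
```

Wrapping every runtime call in the constructors, the destructor and the other methods this way would show whether the error first appears in an allocation, a copy or a free.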

@txbob, thanks for the reply! I have been trying to narrow down where the error occurs and to provide code that reproduces it. But it's not easy: the error does not occur under any particular circumstance. It just has a high chance of appearing every time the program runs. I am trying my best, and I will upload everything as soon as I can reproduce it reliably.

Please excuse my limited debugging ability; this error is too hard for me to locate. It's like a ‘ghost’ and can appear anywhere in my code (or in third-party libraries) where cudaMallocHost is called.

But after replacing all cudaMallocHost/cudaFreeHost calls with malloc/free, the errors seem to have disappeared. Now my code runs as expected, with no more annoying errors. May I say that there is PROBABLY an issue with cudaMallocHost?

Just about any sort of defect is possible, in just about any piece of software.

Certainly the work to locate and fix software defects will be significantly accelerated when a reliable reproducer can be created.

And if you have a broken design practice, such as using CUDA calls in destructors that get called after application tear-down begins, then the fact that you have massaged such an application to not throw errors by judicious use of the CUDA runtime API does not mean that there is a defect in a particular portion of the API. It means that you have a flawed design but have found a way to hide the flaw.
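A common way to keep such destructors from running during tear-down is to avoid global or static owning objects and to confine them to a scope that ends before the program starts shutting down. The sketch below illustrates only the ordering. It makes loud assumptions: `ScopedBuffer`, `event_log`, `run_work` and `app_main` are hypothetical names, and `std::malloc`/`std::free` stand in for `cudaMalloc`/`cudaFree` so that it compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical event log so the order of operations can be inspected.
inline std::vector<std::string>& event_log() {
    static std::vector<std::string> log;
    return log;
}

// Stand-in for a device buffer; std::malloc/std::free take the place of
// cudaMalloc/cudaFree (assumption for this sketch).
struct ScopedBuffer {
    explicit ScopedBuffer(std::size_t bytes) : ptr_(std::malloc(bytes)) {
        event_log().push_back("alloc");
    }
    ~ScopedBuffer() {
        std::free(ptr_);
        event_log().push_back("free");
    }
    ScopedBuffer(const ScopedBuffer&) = delete;
    ScopedBuffer& operator=(const ScopedBuffer&) = delete;
    void* ptr_;
};

// All buffer work lives in this function, so the destructor runs when the
// function returns, well before the process begins to shut down.
inline void run_work() {
    ScopedBuffer buf(1024);
    event_log().push_back("work");
}  // buf is released here

// Stand-in for main(): nothing that owns a buffer outlives run_work().
inline int app_main() {
    run_work();
    event_log().push_back("shutdown");
    return 0;
}
```

The same ordering would not be guaranteed if `ScopedBuffer` were a global or a static member, because such objects are destroyed only after `main` returns.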

Yes, you are right. But my program is a FastCGI application, which means it is supposed to keep running all the time. The errors occurred while the application was running, not at shutdown. And the issue only happens when two GPUs are used simultaneously, so I guess it must be an issue with two cards (it is fine with only one GPU).
I’ve reported a bug along with some necessary logs and hope they could confirm it soon.
Tons of thanks to you @txbob, you always help me a lot!

@cysin, here is an RAII (Resource Acquisition Is Initialization) implementation that follows the Rule of 5 and wraps CUDA linear memory, in a header file called LinearMemory.h:

Two things differ from your version, and neither was obvious to me. Both concern the move constructor: remove the nullptr initialization from the member declaration of the device data pointer, and instead be sure to member-initialize data_dev_ to nullptr in the move constructor.
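One possible reading of this advice is sketched below: no default member initializers, every constructor sets all members in its initializer list, and the move constructor leaves the source empty. This is not the code from LinearMemory.h. `DeviceBufferSketch` is a hypothetical name, and `std::malloc`/`std::free` stand in for `cudaMalloc`/`cudaFree` so that the example compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>

// Stand-in for a device buffer (hypothetical class). std::malloc/std::free
// take the place of cudaMalloc/cudaFree in this sketch.
template <typename T>
class DeviceBufferSketch {
public:
    // No default member initializers: each constructor sets every member.
    DeviceBufferSketch() : count_(0), data_(nullptr) {}

    explicit DeviceBufferSketch(std::size_t count)
        : count_(count),
          data_(static_cast<T*>(std::malloc(count * sizeof(T)))) {}

    // Move constructor: take over the source's allocation and reset the
    // source to the empty state so its destructor frees nothing.
    DeviceBufferSketch(DeviceBufferSketch&& other) noexcept
        : count_(std::exchange(other.count_, 0)),
          data_(std::exchange(other.data_, nullptr)) {}

    // Copying and assignment are left out to keep the sketch short.
    DeviceBufferSketch(const DeviceBufferSketch&) = delete;
    DeviceBufferSketch& operator=(const DeviceBufferSketch&) = delete;
    DeviceBufferSketch& operator=(DeviceBufferSketch&&) = delete;

    ~DeviceBufferSketch() { std::free(data_); }  // free(nullptr) is a no-op

    std::size_t count() const { return count_; }
    T* data() const { return data_; }

private:
    std::size_t count_;
    T* data_;
};
```

Unlike the move constructor in the question, this version transfers ownership unconditionally, so the members are never left unset when the source has a zero dimension.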

Also, in the move assignment operator, consider calling cudaFree to release the target object's resources first and then swapping the device data pointers, so that only one resource remains and not a copy.
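A move assignment along those lines might look like the sketch below, which releases the target's allocation before taking over the source's. Again, this is an illustration under assumptions rather than the LinearMemory.h code: `OwnedBufferSketch` and `release_count` are hypothetical names, and `std::malloc`/`std::free` stand in for `cudaMalloc`/`cudaFree`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>

// Counts releases so the behaviour can be checked (hypothetical helper).
inline int& release_count() {
    static int count = 0;
    return count;
}

// Stand-in buffer (hypothetical class): std::malloc/std::free replace
// cudaMalloc/cudaFree in this sketch.
class OwnedBufferSketch {
public:
    OwnedBufferSketch() = default;
    explicit OwnedBufferSketch(std::size_t bytes)
        : bytes_(bytes), data_(std::malloc(bytes)) {}

    OwnedBufferSketch(OwnedBufferSketch&& other) noexcept
        : bytes_(std::exchange(other.bytes_, 0)),
          data_(std::exchange(other.data_, nullptr)) {}

    // Move assignment: release what the target owns, then take over the
    // source's allocation and leave the source empty.
    OwnedBufferSketch& operator=(OwnedBufferSketch&& other) noexcept {
        if (this != &other) {  // guard against self-move
            release();
            bytes_ = std::exchange(other.bytes_, 0);
            data_ = std::exchange(other.data_, nullptr);
        }
        return *this;
    }

    OwnedBufferSketch(const OwnedBufferSketch&) = delete;
    OwnedBufferSketch& operator=(const OwnedBufferSketch&) = delete;

    ~OwnedBufferSketch() { release(); }

    void* data() const { return data_; }
    std::size_t bytes() const { return bytes_; }

private:
    // Frees the allocation, if any, and resets to the empty state.
    void release() {
        if (data_ != nullptr) {
            std::free(data_);
            data_ = nullptr;
            ++release_count();
        }
        bytes_ = 0;
    }

    std::size_t bytes_ = 0;
    void* data_ = nullptr;
};
```

Compared with the swap-only move assignment in the question, the target's old allocation is freed immediately instead of living on inside the moved-from object until that object is destroyed.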

It wasn’t obvious, but the code examples in LinearMemory.h (and in the associated directories and subdirectories) should have a good implementation of the rule of 5.

has a main driver function that has unit tests for these classes.

Hope this helps, cysin!