Hi,
I am training a convolutional network using Torch. If I use cuDNN, my system freezes at the beginning of training and eventually shuts down. This happens when I add more convolutional layers, increase the number of filters in a convolutional layer, or increase the batch size. The same kind of freeze happens with plain CUDA (without cuDNN), but only at larger parameter settings. I monitor the GPU with nvidia-smi, but I do not see any abnormal values.
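Since the machine freezes before I can read anything off the screen, here is a minimal sketch of how the nvidia-smi readings could be logged to disk so that the last samples survive the freeze. It assumes nvidia-smi is on the PATH; the field list, the file name gpu_log.csv and the one-second interval are my own illustrative choices, not something I have tuned.

```python
import os
import subprocess
import time

# Fields passed to `nvidia-smi --query-gpu=...` (illustrative selection).
QUERY_FIELDS = ["timestamp", "temperature.gpu", "power.draw",
                "memory.used", "utilization.gpu"]


def parse_row(line):
    """Turn one CSV line from nvidia-smi into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(QUERY_FIELDS, values))


def sample_gpu():
    """Query every GPU once and return one dict per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        universal_newlines=True)
    return [parse_row(l) for l in out.splitlines() if l.strip()]


def log_forever(path="gpu_log.csv", interval_s=1.0):
    """Append samples to `path`, forcing each one to disk before sleeping."""
    with open(path, "a") as f:
        while True:
            for row in sample_gpu():
                f.write(",".join(row[k] for k in QUERY_FIELDS) + "\n")
            f.flush()
            os.fsync(f.fileno())  # so the last line survives a hard freeze
            time.sleep(interval_s)

# Usage (in a separate terminal, before starting training):
#   python -c "import gpu_log; gpu_log.log_forever()"
# where gpu_log.py is a hypothetical file containing this block.
```

After a reboot, the tail of the log would show the temperature, power draw and memory use in the last second or so before the freeze.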
I use Ubuntu 14.04 with a GTX Titan X. My driver version is 361.28. I use CUDA Toolkit 7.5, cuDNN 4, and Torch 7. My colleague, who uses Caffe, experiences the same problem.
Here is a network I have problems with when training on the CIFAR-10 dataset. If I take out the last convolutional layer with 512 filters, things work. Reducing the number of filters from 512 to 128 makes it work on the first run but crashes the system on the second run, so I cannot find a boundary set of parameters that reliably separates working from crashing.
Your help is greatly appreciated!
nn.Sequential {
[input → (1) → (2) → (3) → (4) → (5) → (6) → (7) → (8) → (9) → (10) → (11) → (12) → (13) → (14) → (15) → (16) → (17) → (18) → (19) → output]
(1): cudnn.SpatialConvolution(3 → 128, 3x3, 1,1, 1,1)
(2): cudnn.ReLU
(3): cudnn.SpatialConvolution(128 → 128, 3x3, 1,1, 1,1)
(4): cudnn.ReLU
(5): cudnn.SpatialMaxPooling(2,2,2,2)
(6): cudnn.SpatialConvolution(128 → 256, 3x3, 1,1, 1,1)
(7): cudnn.ReLU
(8): cudnn.SpatialConvolution(256 → 256, 3x3, 1,1, 1,1)
(9): cudnn.ReLU
(10): cudnn.SpatialMaxPooling(2,2,2,2)
(11): cudnn.SpatialConvolution(256 → 512, 3x3, 1,1, 1,1)
(12): cudnn.ReLU
(13): cudnn.SpatialConvolution(512 → 512, 3x3, 1,1, 1,1)
(14): cudnn.ReLU
(15): cudnn.SpatialMaxPooling(2,2,2,2)
(16): nn.Reshape(8192)
(17): nn.Linear(8192 → 1024)
(18): nn.Linear(1024 → 1024)
(19): nn.Linear(1024 → 10)
}
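
As a sanity check on the listing above, here is a minimal sketch in plain Python (no Torch needed) that traces the tensor shape through the network and counts the learnable parameters. It assumes the standard 3x32x32 CIFAR-10 input and hard-codes what the printout shows: 3x3 convolutions with stride 1 and padding 1, and 2x2 max pooling with stride 2. The LAYERS list and the trace function are my own illustrative names.

```python
# Layer list transcribed from the printout; ReLU layers are omitted because
# they change neither the shape nor the parameter count.
LAYERS = [
    ("conv", 3, 128), ("conv", 128, 128), ("pool",),
    ("conv", 128, 256), ("conv", 256, 256), ("pool",),
    ("conv", 256, 512), ("conv", 512, 512), ("pool",),
    ("linear", 8192, 1024), ("linear", 1024, 1024), ("linear", 1024, 10),
]


def trace(layers, channels=3, height=32, width=32):
    """Return (list of per-layer output shapes, total parameter count)."""
    shape = (channels, height, width)
    shapes = []
    total_params = 0
    for layer in layers:
        kind = layer[0]
        if kind == "conv":
            _, n_in, n_out = layer
            assert shape[0] == n_in, "channel mismatch"
            # 3x3 kernel, stride 1, padding 1: spatial size is unchanged.
            total_params += n_in * n_out * 3 * 3 + n_out  # weights + biases
            shape = (n_out, shape[1], shape[2])
        elif kind == "pool":
            # 2x2 window, stride 2: height and width are halved.
            shape = (shape[0], shape[1] // 2, shape[2] // 2)
        elif kind == "linear":
            _, n_in, n_out = layer
            flat = 1
            for d in shape:
                flat *= d
            assert flat == n_in, "flattened size mismatch"
            total_params += n_in * n_out + n_out  # weights + biases
            shape = (n_out,)
        shapes.append(shape)
    return shapes, total_params


shapes, total_params = trace(LAYERS)
for layer, shape in zip(LAYERS, shapes):
    print(layer, "->", shape)
print("total parameters:", total_params)
```

The trace confirms that the output of the last pooling layer is 512x4x4, which matches nn.Reshape(8192), so the layer sizes themselves are consistent.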