Hi,
I'm trying to train with FP16 on a Tesla V100, using the latest NGC TensorFlow Docker image.
My code works fine in FP32, but when I set export TF_ENABLE_AUTO_MIXED_PRECISION=1
I get the following error:
File "trainer_od.py", line 252, in train_op
    apply_gradient_op = self.apply_gradients(grads)
File "trainer_od.py", line 307, in apply_gradients
    train_op = opt_conv.apply_gradients(zip(grads, all_trainable), name="train_op")
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 591, in apply_gradients
    shift_update_op = self._update_gradient_shift(all_finite)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 876, in _update_gradient_shift
    return control_flow_ops.cond(all_finite, finite_branch, overflow_branch)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2097, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1941, in BuildCondBranch
    original_result = fn()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 874, in finite_branch
    return control_flow_ops.cond(should_update, boost_branch, incr_branch)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2097, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1941, in BuildCondBranch
    original_result = fn()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 862, in boost_branch
    new_scale_val = clip_ops.clip_by_value(scalar * 2.0, scale_min, scale_max)
TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'
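For reference, this is exactly how I enable AMP before launching training (a shell sketch; the commented-out second variable is, if I understand the NGC docs correctly, the variant that applies only the FP16 op rewrite without automatic loss scaling):

```shell
# Set before the Python process starts; the AMP graph rewrite is applied
# when the session builds the graph.
export TF_ENABLE_AUTO_MIXED_PRECISION=1
# Rewrite-only variant (no automatic loss scaling), per the NGC AMP docs:
# export TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE=1
echo "TF_ENABLE_AUTO_MIXED_PRECISION=$TF_ENABLE_AUTO_MIXED_PRECISION"
# then: python trainer_od.py
```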
My code is:
all_trainable = [v for v in tf.trainable_variables()]
opt_conv = tf.train.MomentumOptimizer(self.learning_rate, self.config.train.momentum, name="MomentumOptimizer")
# Run batch-norm moving-average updates before the train step.
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(extra_update_ops):
    train_op = opt_conv.apply_gradients(zip(grads, all_trainable), name="train_op")
Do I have to use a specific optimizer for AMP? Any clue would be appreciated.
Thanks