AMP error from TensorFlow

Hi,
I’m trying to train with FP16 on a Tesla V100, using the latest NGC TensorFlow Docker image.
My code works fine in FP32, but when I set export TF_ENABLE_AUTO_MIXED_PRECISION=1
I get the following error:

File "trainer_od.py", line 252, in train_op
    apply_gradient_op = self.apply_gradients(grads)
  File "trainer_od.py", line 307, in apply_gradients
    train_op = opt_conv.apply_gradients(zip(grads, all_trainable),name="train_op")
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 591, in apply_gradients
    shift_update_op = self._update_gradient_shift(all_finite)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 876, in _update_gradient_shift
    return control_flow_ops.cond(all_finite, finite_branch, overflow_branch)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2097, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1941, in BuildCondBranch
    original_result = fn()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 874, in finite_branch
    return control_flow_ops.cond(should_update, boost_branch, incr_branch)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2097, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1941, in BuildCondBranch
    original_result = fn()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 862, in boost_branch
    new_scale_val = clip_ops.clip_by_value(scalar * 2.0, scale_min, scale_max)
TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'

My code is:

all_trainable = [v for v in tf.trainable_variables()]
opt_conv = tf.train.MomentumOptimizer(self.learning_rate, self.config.train.momentum, name="MomentumOptimizer")
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(extra_update_ops):
    train_op = opt_conv.apply_gradients(zip(grads, all_trainable), name="train_op")

Do I have to use a specific optimizer? Any clue?
Thanks

It is not clear from the code posted how grads is being computed. The AMP loss scale wrapper requires that you use both opt_conv.compute_gradients() and opt_conv.apply_gradients() (or equivalently opt_conv.minimize()) in order for loss scaling to be handled automatically.
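For example, a minimal sketch along these lines keeps both steps under the same optimizer so loss scaling can be handled automatically (variable names such as losses[0] and all_trainable mirror your snippet and are assumptions about your training loop):

opt_conv = tf.train.MomentumOptimizer(self.learning_rate, self.config.train.momentum, name="MomentumOptimizer")
# Compute gradients through the optimizer so the AMP loss-scale wrapper sees them
grads_and_vars = opt_conv.compute_gradients(losses[0], var_list=all_trainable)
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(extra_update_ops):
    # Apply the same (grad, var) pairs; scaling/unscaling is inserted for you
    train_op = opt_conv.apply_gradients(grads_and_vars, name="train_op")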

If for some reason you prefer not to use a tf.Optimizer for gradient computation, you can implement loss scaling manually and then export TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE=1 to enable the AMP graph optimizer without affecting the behavior of tf.Optimizer.
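As a rough illustration of what manual (static) loss scaling can look like in that case (loss, variables, and the scale factor 128.0 here are placeholders, not taken from your code):

loss_scale = 128.0  # illustrative static scale; dynamic scaling needs extra overflow handling
# Scale the loss before differentiation so small FP16 gradients don't underflow
scaled_grads = tf.gradients(loss * loss_scale, variables)
# Unscale the gradients before handing them to the optimizer
grads = [g / loss_scale if g is not None else None for g in scaled_grads]
train_op = opt_conv.apply_gradients(zip(grads, variables), name="train_op")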

Hi,

Thanks for the quick response.

I changed my code to

self.restore_var = tf.global_variables()
all_trainable = [v for v in tf.trainable_variables()
                 if ('beta' not in v.name and 'gamma' not in v.name)]

opt_conv = tf.train.MomentumOptimizer(self.learning_rate, self.config.train.momentum, name="MomentumOptimizer")
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(extra_update_ops):
    train_op = opt_conv.minimize(losses[0], var_list=all_trainable, name="train_op")

and I still get an error (not the same one, though):

2019-04-11 09:14:49.344732: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-11 09:14:49.378327: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
2019-04-11 09:14:56.941674: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-11 09:14:56.942671: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
2019-04-11 09:14:59.067222: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-11 09:14:59.166386: F tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:820] Check failed: attr_def 
Aborted (core dumped)

This time there is not much indication of the cause, but the code still runs fine without AMP enabled.

Best Regards

Hi,

Since this issue now appears to be the same as the “Check failed: attr_def” thread (Deep Learning (Training & Inference) - NVIDIA Developer Forums), I’ll keep the conversation over on that thread. Two comments:

  1. Are you able to provide the full script to reproduce?
  2. If not, could you take a look at my most recent comment on that thread? It explains how to dump a serialized version of the graph that triggers the issue – if we can take a look at that, then I expect we can root-cause the failure.
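If it helps, one generic way to dump a graph for inspection (an assumption on my part; the linked thread may describe a different mechanism) is:

# Write the current GraphDef out as a text protobuf that can be shared
tf.train.write_graph(tf.get_default_graph().as_graph_def(), "/tmp", "amp_failure_graph.pbtxt", as_text=True)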

Thanks!