AMP error from TensorFlow

Hi,
I’m trying to train with FP16 on a Tesla V100, using the latest NGC TensorFlow Docker image.
My code works fine in FP32, but when I set export TF_ENABLE_AUTO_MIXED_PRECISION=1
I get the following error:

File "trainer_od.py", line 252, in train_op
    apply_gradient_op = self.apply_gradients(grads)
  File "trainer_od.py", line 307, in apply_gradients
    train_op = opt_conv.apply_gradients(zip(grads, all_trainable),name="train_op")
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 591, in apply_gradients
    shift_update_op = self._update_gradient_shift(all_finite)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 876, in _update_gradient_shift
    return control_flow_ops.cond(all_finite, finite_branch, overflow_branch)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2097, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1941, in BuildCondBranch
    original_result = fn()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 874, in finite_branch
    return control_flow_ops.cond(should_update, boost_branch, incr_branch)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2097, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1941, in BuildCondBranch
    original_result = fn()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 862, in boost_branch
    new_scale_val = clip_ops.clip_by_value(scalar * 2.0, scale_min, scale_max)
TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'

My code is:

all_trainable = [v for v in tf.trainable_variables()]
opt_conv = tf.train.MomentumOptimizer(self.learning_rate, self.config.train.momentum, name="MomentumOptimizer")
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(extra_update_ops):
    train_op = opt_conv.apply_gradients(zip(grads, all_trainable), name="train_op")

Do I have to use a specific optimizer? Any clue?
Thanks

It is not clear from the code posted how grads is being computed. The AMP loss scale wrapper requires that you use both opt_conv.compute_gradients() and opt_conv.apply_gradients() (or equivalently opt_conv.minimize()) in order for loss scaling to be handled automatically.
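For example, a minimal sketch along these lines keeps both steps under the same optimizer so loss scaling can be handled automatically (variable names such as losses[0] and all_trainable mirror your snippet and are assumptions about your training loop):

opt_conv = tf.train.MomentumOptimizer(self.learning_rate, self.config.train.momentum, name="MomentumOptimizer")
# Compute gradients through the optimizer so the AMP loss-scale wrapper sees them
grads_and_vars = opt_conv.compute_gradients(losses[0], var_list=all_trainable)
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(extra_update_ops):
    # Apply the same (grad, var) pairs; scaling/unscaling is inserted for you
    train_op = opt_conv.apply_gradients(grads_and_vars, name="train_op")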

If for some reason you prefer not to use a tf.Optimizer for gradient computation, you can implement loss scaling manually and then export TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE=1 to enable the AMP graph optimizer without affecting the behavior of tf.Optimizer.
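As a rough illustration of what manual (static) loss scaling can look like in that case (loss, variables, and the scale factor 128.0 here are placeholders, not taken from your code):

loss_scale = 128.0  # illustrative static scale; dynamic scaling needs extra overflow handling
# Scale the loss before differentiation so small FP16 gradients don't underflow
scaled_grads = tf.gradients(loss * loss_scale, variables)
# Unscale the gradients before handing them to the optimizer
grads = [g / loss_scale if g is not None else None for g in scaled_grads]
train_op = opt_conv.apply_gradients(zip(grads, variables), name="train_op")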

Hi,

Thanks for the quick response.

I changed my code to

self.restore_var = tf.global_variables()
all_trainable = [v for v in tf.trainable_variables()
                 if ('beta' not in v.name and 'gamma' not in v.name)]

opt_conv = tf.train.MomentumOptimizer(self.learning_rate, self.config.train.momentum, name="MomentumOptimizer")
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(extra_update_ops):
    train_op = opt_conv.minimize(losses[0], var_list=all_trainable, name="train_op")

and I still get an error (not the same one, though):

2019-04-11 09:14:49.344732: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-11 09:14:49.378327: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
2019-04-11 09:14:56.941674: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-11 09:14:56.942671: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
2019-04-11 09:14:59.067222: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-11 09:14:59.166386: F tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:820] Check failed: attr_def 
Aborted (core dumped)

This time there is not much indication of the cause, but the code still runs fine without AMP enabled.

Best Regards

Hi,

Since this issue now appears to be the same as the “Check failed: attr_def” thread (Deep Learning (Training & Inference) - NVIDIA Developer Forums), I’ll keep the conversation over on that thread. Two comments:

  1. Are you able to provide the full script to reproduce?
  2. If not, could you take a look at my most recent comment on that thread? It explains how to dump a serialized version of the graph that triggers the issue – if we can take a look at that, then I expect we can root-cause the failure.
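If it helps, one generic way to dump a graph for inspection (an assumption on my part; the linked thread may describe a different mechanism) is:

# Write the current GraphDef out as a text protobuf that can be shared
tf.train.write_graph(tf.get_default_graph().as_graph_def(), "/tmp", "amp_failure_graph.pbtxt", as_text=True)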

Thanks!