Execute a DIGITS-trained TensorFlow model on TX2 using Python

Hi !

I’m using DIGITS to train my TensorFlow models, currently a LeNet network with a grayscale 28x28 input, on my own classified images.
I prepared a dataset with two labels, 0 and 1, which stand for :

  • 0 => not a ball (~ 6000 images)
  • 1 => a ball (~ 1000 images)

When I train it using DIGITS, I get a model with an accuracy of ~94% and a loss of 0.27.
When I classify one image using DIGITS, it classifies it correctly, as the screenshot showed :
[image not available]
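
For context on the ~94% figure: with roughly 6000 negatives and 1000 positives, a classifier that always answers 0 would already be right most of the time. A quick check of that baseline, treating the approximate counts above as exact for the sake of the arithmetic:

```python
# Approximate class counts quoted above (assumed exact for this calculation).
n_not_ball = 6000
n_ball = 1000

# Accuracy of a trivial classifier that always predicts the majority class "0".
baseline_accuracy = n_not_ball / float(n_not_ball + n_ball)

print("majority-class baseline: {:.1%}".format(baseline_accuracy))
```

The trained model's accuracy is best read against this baseline rather than against 50%.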

Very well, so now I want to use this model in one of my Python scripts. So I defined the model, derived from the network.py provided with DIGITS :

import tensorflow as tf
import tensorflow.contrib.slim as slim

class LeNetModel():

    # A placeholder version, allowing to load an image from a numpy array (OpenCV in my case)
    def placeholder_gray28(self, nclasses):
        x = tf.placeholder(tf.float32, shape=[28, 28, 1], name="x")
        return x, self.gray28(x, nclasses)

    def gray28(self, x, nclasses, is_training=False):
        rs = tf.reshape(x, shape=[-1, 28, 28, 1])
        # scale (divide by MNIST std)
        rs = rs * 0.0125
        with slim.arg_scope([slim.conv2d, slim.fully_connected],
                            weights_initializer=tf.contrib.layers.xavier_initializer(),
                            weights_regularizer=slim.l2_regularizer(0.0005)):
            model = slim.conv2d(rs, 20, [5, 5], padding='VALID', scope='conv1')
            model = slim.max_pool2d(model, [2, 2], padding='VALID', scope='pool1')
            model = slim.conv2d(model, 50, [5, 5], padding='VALID', scope='conv2')
            model = slim.max_pool2d(model, [2, 2], padding='VALID', scope='pool2')
            model = slim.flatten(model)
            model = slim.fully_connected(model, 500, scope='fc1')
            model = slim.dropout(model, 0.5, is_training=is_training, scope='do1')
            model = slim.fully_connected(model, nclasses, activation_fn=None, scope='fc2')
            
            # I only append this softmax, that changes output tensor values but doesn't change the classification
            model = tf.nn.softmax(model)

            return model

Except for the final softmax, this is the same network as the one trained by DIGITS.
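
The claim that the appended softmax changes the output values but not the predicted class can be checked without TensorFlow. This is a minimal pure-Python sketch; the two logit values are hypothetical and not taken from the trained model:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical fc2 outputs for the two classes (illustrative values only).
logits = [1.2, -0.4]
probs = softmax(logits)

print(probs)
print(argmax(logits) == argmax(probs))
```

Softmax is monotonic, so the index of the largest logit is also the index of the largest probability.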

I use this model by shaping and feeding it a tensor obtained from the same JPG image that I use in DIGITS :

def name_in_checkpoint(var):
    return 'model/' + var.op.name

TF_INTRA_OP_THREADS = 0
TF_INTER_OP_THREADS = 0
MIN_LOGS_PER_TRAIN_EPOCH = 8  # torch default: 8
FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_boolean('log_device_placement', False, """Whether to log device placement.""")
tf.app.flags.DEFINE_boolean('serving_export', False, """Flag for exporting an Tensorflow Serving model""")

if __name__ == '__main__':

    filename_queue = tf.train.string_input_producer(tf.train.match_filenames_once("img-535.jpg"))

    image_reader = tf.WholeFileReader()

    key, image_file = image_reader.read(filename_queue)

    ball = tf.image.decode_jpeg(image_file)
    ball = tf.to_float(ball)
    ball = tf.image.resize_bicubic([ball],(28,28))
    ball = tf.image.rgb_to_grayscale([ball])
    ball = tf.divide(ball, 255)

    single_batch = [key, ball]
    
    inference_op = LeNetModel().gray28(ball,2,False)

    sess = tf.Session(config=tf.ConfigProto(
                          allow_soft_placement=True, 
                          inter_op_parallelism_threads=TF_INTER_OP_THREADS,
                          intra_op_parallelism_threads=TF_INTRA_OP_THREADS,
                          log_device_placement=FLAGS.log_device_placement))
    
    variables_to_restore = slim.get_variables_to_restore(exclude=["is_training"])
    variables_to_restore = {name_in_checkpoint(var):var for var in variables_to_restore}
    saver = tf.train.Saver(variables_to_restore)

    # Initialize variables
    init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
    sess.run(init_op)

    saver.restore(sess, "snapshot_30.ckpt")    

    tf.train.start_queue_runners(sess)
    print sess.run(inference_op * 100)

    sess.close()
    exit(0)

But I am not able to get the same results. Executing this script, I get this result :

[[ 82.83679962  17.16320229]]

Neither the score nor the classification is right, and I can’t understand what I am doing wrong. I took a look at the DIGITS source code and couldn’t find significant differences with my code. Has anybody encountered this problem ?

You can download the full use case here : http://vps166675.ovh.net/digits-issue.tar.gz

Thank you in advance.
Damien.

Hi,

I guess that something differs in the image preprocessing.
Could you check whether your workflow is identical to the DIGITS inference here:
DIGITS/tensorflow_train.py at master · NVIDIA/DIGITS · GitHub

Thanks.

Thank you.

The processing steps that I can identify in the DIGITS source code are :

_float_array_feature(image.flatten())

This is a no-op from the image’s point of view.

with tf.name_scope(digits.STAGE_INF) as stage_scope:
    inf_model = Model(digits.STAGE_INF, FLAGS.croplen, nclasses)
    ...

The dataloader is instantiated. It is a TFRecordsLoader :

The interesting parameters are :

self.float_data = False  # For now only strings
self.unencoded_data_format = 'hwc'
self.unencoded_channel_scheme = 'rgb'
self.image_dtype = tf.uint8

So it does :

Then

It returns a FixedLenFeature :

tf.FixedLenFeature([self.height, self.width, self.channels], tf.float32)

Then :

that does :

data = tf.image.decode_jpeg(data, name='image_decoder')
...
data = tf.to_float(data)

It adds :

single_data = tf.image.resize_image_with_crop_or_pad(single_data, self.croplen, self.croplen)

And to finish, it creates a batch and launches it :

single_batch = [single_key, single_data]
...
batch = tf.train.batch(
    single_batch,
    batch_size=self.batch_size,
    dynamic_pad=True,  # Allows us to not supply fixed shape a priori
    enqueue_many=False,  # Each tensor is a single example
    # set number of threads to 1 for tfrecords (used for inference)
    num_threads=NUM_THREADS_DATA_LOADER if not self.is_inference else 1,
    capacity=max_queue_capacity,  # Max amount that will be loaded and queued
    allow_smaller_final_batch=True,  # Happens if total%batch_size!=0
    name='batcher')

So there are some differences :

  • The image doesn’t seem to be converted to grayscale by the DIGITS inference tool
  • The resize is done by resize_image_with_crop_or_pad, whereas I use resize_bicubic
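
Another difference can be read off the excerpts above: the DIGITS loader casts the decoded image to float without rescaling, so pixels stay in 0..255 before the network multiplies them by 0.0125, whereas the custom script divides by 255 first. Whether this explains the mismatch is not established in this thread; the sketch below only shows how far apart the two input ranges are. The helper name is illustrative, and any mean subtraction is ignored here.

```python
SCALE = 0.0125  # factor applied inside gray28(): rs = rs * 0.0125

def network_input_range(max_pixel):
    """Range seen by conv1 for pixels in [0, max_pixel] after the in-graph scaling."""
    return (0.0, max_pixel * SCALE)

# Pixels left in 0..255, as in the DIGITS excerpt (decode_jpeg then to_float).
range_0_255 = network_input_range(255.0)

# Pixels divided by 255 first, as in the custom script.
range_0_1 = network_input_range(1.0)

print(range_0_255)
print(range_0_1)
print(range_0_255[1] / range_0_1[1])  # ratio between the two upper bounds
```

If the network was trained on the first range, feeding it the second means every input looks almost black to it, which could plausibly push predictions toward one class.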

So my questions are :

I will adapt my sample to use reshape and resize_image_with_crop_or_pad and give it a try, but this seems counterintuitive to me, so I guess that I am missing a step done by DIGITS…

I tried this image pre-processing :

ball = tf.image.decode_jpeg(image_file)
ball = tf.to_float(ball)
# ball = tf.image.resize_bicubic([ball],(28,28))
# ball = tf.image.rgb_to_grayscale([ball])
# ball = tf.reshape(ball,(49,49,1))
ball = tf.image.resize_image_with_crop_or_pad(ball, 28, 28)
ball = tf.divide(ball, 255)

The result is :

[[ 82.9730835   17.02691078]
 [ 82.78138733  17.21861267]
 [ 82.86641693  17.13358116]]

So it didn’t change anything…

Given the DIGITS pre-processing I analysed, it tries to classify this image, which is the output of resize_image_with_crop_or_pad :
[image not available]

Instead of :
[image not available]

The DIGITS pre-processing steps are not really obvious to me…
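
To make concrete what a crop to 28x28 discards, here is a minimal pure-Python sketch of a central crop. It assumes a 49x49 source, as the commented-out reshape above suggests, and only covers the cropping branch of crop-or-pad, not the padding one; the function name is illustrative.

```python
def center_crop(image, target_h, target_w):
    """Keep the central target_h x target_w window of a 2-D list-of-rows image.

    Assumes the image is at least as large as the target in both dimensions
    (the padding branch of crop-or-pad is not reproduced here).
    """
    height, width = len(image), len(image[0])
    top = (height - target_h) // 2
    left = (width - target_w) // 2
    return [row[left:left + target_w] for row in image[top:top + target_h]]

# Synthetic 49x49 image where each pixel stores its own (row, col) position.
source = [[(r, c) for c in range(49)] for r in range(49)]
cropped = center_crop(source, 28, 28)

print(len(cropped), len(cropped[0]))   # size of the result
print(cropped[0][0], cropped[-1][-1])  # source pixels that end up in the corners
```

Only the central 28x28 window survives, about a third of the original pixels, so an object that fills the whole 49x49 frame is seen only in part.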

Hi,

Here is the control of crop function:

Thanks.

Thank you.

But what if I don’t choose to crop images, but to squash them ? I can’t find where DIGITS does the “squash” in the case of a “classify one image”.

In the inference.py tool, I can see that the resize is done by :

image = utils.image.resize_image(
                        image,
                        height,
                        width,
                        channels=channels,
                        resize_mode=resize_mode)

That leads to :

scipy.misc.imresize(image, (height, width), interp=interp)

(The documentation doesn’t say which resize algorithm is used : https://docs.scipy.org/doc/scipy/reference/generated/scipy.misc.imresize.html)

But it seems that this tool is used by the REST API, not the “classify one image”.
I will try with the REST API to see the result in this case.
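
For comparison with cropping, a squash maps the whole source image onto the target size regardless of aspect ratio. This is a minimal pure-Python sketch on a synthetic 49x49 image; nearest-neighbour sampling is used only as a simple stand-in and is not claimed to be the interpolation that resize_image or scipy applies.

```python
def squash_nearest(image, target_h, target_w):
    """Resize a 2-D list-of-rows image to the target size, ignoring aspect ratio.

    Nearest-neighbour sampling keeps the sketch short; it is an assumption,
    not the interpolation DIGITS uses.
    """
    height, width = len(image), len(image[0])
    return [
        [image[(r * height) // target_h][(c * width) // target_w]
         for c in range(target_w)]
        for r in range(target_h)
    ]

# Synthetic 49x49 image where each pixel stores its own (row, col) position.
source = [[(r, c) for c in range(49)] for r in range(49)]
squashed = squash_nearest(source, 28, 28)

print(len(squashed), len(squashed[0]))   # size of the result
print(squashed[0][0], squashed[-1][-1])  # source pixels sampled at the corners
```

The corners of the result come from near the corners of the source, so the whole image is represented at lower resolution.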

Hi,

Have you tried the REST API? If yes, does it fix this issue?
Any feedback will be appreciated.

Thanks.

Hi,

I haven’t had time to test the REST API yet, but I will keep you posted when I do.
In fact, I would like to install DIGITS outside of Docker to add some logs.

Thank you.

Hi,

Thanks for your feedback.
Feel free to let us know if you need help.

Thanks.

Hi,

Using the REST API, I get the same result :

curl localhost:5000/models/images/classification/classify_one.json -XPOST -F job_id=20171204-192410-f734 -F image_file=@img-535.jpg

{
  "predictions": [
    [
      "1", 
      74.85
    ], 
    [
      "0", 
      25.15
    ]
  ]
}
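
When scripting such comparisons, the response can be reduced to its top prediction with the standard json module. A small sketch, with the response body copied from the output above:

```python
import json

# Response body copied from the curl call above.
response_body = """
{
  "predictions": [
    ["1", 74.85],
    ["0", 25.15]
  ]
}
"""

predictions = json.loads(response_body)["predictions"]

# Pick the [label, score] pair with the highest score.
top_label, top_score = max(predictions, key=lambda pair: pair[1])

print(top_label, top_score)
```

Here DIGITS favours label 1, while the custom script above gave about 83% to the first class.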

Here are some relevant logs from this classification :

2017-12-17 08:40:38 [20171217-084037-6f48] [INFO ] Infer Model task started.
2017-12-17 08:40:38 [20171217-084037-6f48] [INFO ] Task subprocess args: "/usr/bin/python /usr/local/lib/python2.7/dist-packages/digits/tools/inference.py /jobs/20171217-084037-6f48/tmpOhvSZr.txt /jobs/20171217-084037-6f48 20171204-192410-f734 --jobs_dir=/jobs --layers=none --gpu=0"
...
2017-12-17 08:40:40 [20171217-084037-6f48] [WARNING] Infer Model unrecognized output: 2017-12-17 08:40:39 [20171204-192410-f734] [INFO ] tensorflow classify one task started.
2017-12-17 08:40:41 [20171217-084037-6f48] [WARNING] Infer Model unrecognized output: 2017-12-17 08:40:41 [20171204-192410-f734] [DEBUG] tensorflow classify one task : Train batch size is 1 and validation batch size is 1
2017-12-17 08:40:41 [20171217-084037-6f48] [WARNING] Infer Model unrecognized output: 2017-12-17 08:40:41 [20171204-192410-f734] [DEBUG] tensorflow classify one task : Training epochs to be completed for each validation : 1
2017-12-17 08:40:41 [20171217-084037-6f48] [WARNING] Infer Model unrecognized output: 2017-12-17 08:40:41 [20171204-192410-f734] [DEBUG] tensorflow classify one task : Training epochs to be completed before taking a snapshot : 1.0
2017-12-17 08:40:41 [20171217-084037-6f48] [WARNING] Infer Model unrecognized output: 2017-12-17 08:40:41 [20171204-192410-f734] [DEBUG] tensorflow classify one task : Model weights will be saved as network_<EPOCH>_Model.ckpt
2017-12-17 08:40:41 [20171217-084037-6f48] [WARNING] Infer Model unrecognized output: 2017-12-17 08:40:41 [20171204-192410-f734] [DEBUG] tensorflow classify one task : Loading mean tensor from /jobs/20171204-190738-edf5/mean.binaryproto file
2017-12-17 08:40:41 [20171217-084037-6f48] [WARNING] Infer Model unrecognized output: 2017-12-17 08:40:41 [20171204-192410-f734] [DEBUG] tensorflow classify one task : Loading label definitions from /jobs/20171204-190738-edf5/labels.txt file
2017-12-17 08:40:41 [20171217-084037-6f48] [WARNING] Infer Model unrecognized output: 2017-12-17 08:40:41 [20171204-192410-f734] [DEBUG] tensorflow classify one task : Found 2 classes
2017-12-17 08:40:41 [20171217-084037-6f48] [WARNING] Infer Model unrecognized output: 2017-12-17 08:40:41 [20171204-192410-f734] [DEBUG] tensorflow classify one task : Found 1 images in db /tmp/tmpcoLYeu.tfrecords

Particularly :

Loading mean tensor from /jobs/20171204-190738-edf5/mean.binaryproto file

Could “Loading” this mean tensor change the result ?

Hi,

‘Loading mean tensor…’ is a preprocessing step that subtracts the mean image stored in the given binaryproto file from the input images.
Is your issue solved after using the REST API?

Thanks.

Thank you.

Given

  • A = [[1,1,1],[0,1,0],[1,0,0]] the image I want to classify,
  • B = [[1,0,0],[0,1,0],[0,0,0]] the image described by mean.binaryproto file,

The classify step of Digits does :

  • a subtraction C = A-B = [[0,1,1],[0,0,0],[1,0,0]]
  • then feeds the C matrix as the input of the TensorFlow network
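
The subtraction in this toy example can be checked with a few lines of Python, using the A and B matrices given above:

```python
A = [[1, 1, 1], [0, 1, 0], [1, 0, 0]]  # image to classify
B = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]  # image described by mean.binaryproto

# Element-wise subtraction, pixel by pixel.
C = [[a - b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

print(C)
```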

Is this what DIGITS actually does ? If so, I can do the same, but I’m not sure I understand how this is relevant for classification. In fact, I can easily see how it could pollute the results…

My issue is not resolved, because I am trying to classify images with a DIGITS-trained TensorFlow network using custom Python code (Python in a first phase, C++ later) and I’m not able to. I think that using TensorRT would resolve these problems, but I really need to understand why I can’t classify as DIGITS does before going further.

Currently I am training networks using tf-learn and I can predict results properly. But Digits adds some interesting stuff, so I will not surrender :)

Hi,

You can set a different mean-subtraction approach in DIGITS.

Data Transformations > Subtract Mean

  • None
  • Image
  • Pixel

This setting leads to different preprocessing.
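
As a rough illustration of how the three options could differ, here is a pure-Python sketch for a single-channel image. It assumes that Image subtracts the mean image element by element and that Pixel subtracts one scalar, the average of the mean image; this is an interpretation of the option names and should be checked against the DIGITS source. The function name and the sample matrices are illustrative.

```python
def subtract_mean(image, mean_image, mode):
    """Apply a mean-subtraction mode to a single-channel image (list of rows)."""
    if mode == "none":
        return [list(row) for row in image]
    if mode == "image":
        # Subtract the mean image element by element.
        return [[p - m for p, m in zip(row, mean_row)]
                for row, mean_row in zip(image, mean_image)]
    if mode == "pixel":
        # Subtract one scalar: the average value of the mean image.
        flat = [m for mean_row in mean_image for m in mean_row]
        mean_pixel = sum(flat) / float(len(flat))
        return [[p - mean_pixel for p in row] for row in image]
    raise ValueError("unknown mode: {}".format(mode))

image = [[1, 1, 1], [0, 1, 0], [1, 0, 0]]
mean_image = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]

for mode in ("none", "image", "pixel"):
    print(mode, subtract_mean(image, mean_image, mode))
```

Whichever mode was selected at training time has to be reproduced in custom inference code, otherwise the network sees inputs shifted relative to what it was trained on.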

For using TensorRT on TX2, there are two things we want to share with you first:
1. The Python API is not available on Jetson. You need to export the TensorFlow model to UFF on an x86 machine and run inference on it via the C++ interface on Jetson.
2. Currently, we don’t have an interface for TensorFlow/UFF users to set their own custom layer implementations.

Let us know if you need help.
Thanks.