Lazy summary adding in TensorFlow

When writing a training script in TensorFlow, the need sometimes arises to add summary items (summary protobufs, to be exact) later on in the same step.

For example, let's say a training session is in progress with a metric calculation step included. Periodically, I want to run a prediction on validation/test data and record the metrics for these predictions with the same summary writer used to log the training steps. In other words, a TensorBoard image like the following is desired:

[A TensorBoard screen capture where the loss/metric tab items are logged for every training step and the test tab items are logged only occasionally.]

In the above capture, loss and metric are recorded for every training step. On the other hand, the tab named “test”, which logs the accuracy metric on the given test data, is only logged every 5 training steps. That is why only three points are saved, on the same x-axis scale (number of steps).

When I was a TensorFlow noob, I thought this was something hard to do. I presumed that something like this would require a deeper knowledge of how the summary writer works and would need to be hacked.

However, after reading the docs more carefully and getting to know a little bit more about how the session, graph, saver, and summary FileWriter work with each other, this turned out to be very easy to do.

The example demonstrated here is the one that produced the TensorBoard capture above. I will use the inception_v3 model provided by default in the tensorflow package. The code is available in this gist link.

I will only go through the core parts.

When building the model, notice how I have two summary merging operations. The first one, named train_summary_op, will point to the loss_summary and train_accuracy_summary summary operations only. Don't let the merge_all() function call fool you: merge_all() only collects the summaries that have been defined up to the point where it is called.

After calling merge_all(), I then create another summary operation named test_accuracy_summary, which is practically identical to train_accuracy_summary but with a different tag name. I then create a separate merged summary, test_summary_op, which contains only the newly created test_accuracy_summary operation.
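Here is a minimal sketch of that setup. The dummy constants stand in for real model tensors; the names are illustrative, not the gist's exact code:

```python
import tensorflow as tf

# dummy scalar tensors standing in for the real model outputs
loss = tf.constant(0.5)
train_accuracy = tf.constant(0.8)
test_accuracy = tf.constant(0.75)

# summaries intended for every training step
loss_summary = tf.summary.scalar('loss', loss)
train_accuracy_summary = tf.summary.scalar('metric/accuracy', train_accuracy)

# merge_all() only collects the summaries defined up to this point,
# i.e. loss_summary and train_accuracy_summary
train_summary_op = tf.summary.merge_all()

# defined AFTER merge_all(), so it is not part of train_summary_op
test_accuracy_summary = tf.summary.scalar('test/accuracy', test_accuracy)
# merge it explicitly so the test summary stays separate
test_summary_op = tf.summary.merge([test_accuracy_summary])
```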

With this strategy, I have separated the summary operations that I need in training steps from those needed in test/validation predictions.

See how these two merged summary operations work in different situations:

While looping through the training steps, the script obtains a summary protobuf by executing train_summary_op. The summary FileWriter saves the summary protobuf with the global_step explicitly set to the current training step value.

Every 5 training steps, it makes a prediction with the test data, and in this case test_summary_op is executed in the graph. The resulting summary protobuf is added to the summary event file with the same writer. In order to retain the step number, notice that I have also set the global_step parameter to the current training step when calling add_summary.
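Continuing the sketch above, the loop below shows both cases. The train_op, the placeholders x and y, and the data arrays are hypothetical stand-ins, not the gist's exact code:

```python
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter('/tmp/logs', sess.graph)

    for step in range(1, 101):
        # regular training step: run the train op together with the
        # merged training summaries, then log them at the current step
        _, summary_proto = sess.run([train_op, train_summary_op],
                                    feed_dict={x: batch_x, y: batch_y})
        writer.add_summary(summary_proto, global_step=step)

        # every 5 steps, predict on the test data and log the test
        # summary with the SAME writer and the SAME step value
        if step % 5 == 0:
            test_proto = sess.run(test_summary_op,
                                  feed_dict={x: test_x, y: test_y})
            writer.add_summary(test_proto, global_step=step)
```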

This is the trick I used to record the interval validation/test prediction metric values alongside the training ones. The same trick can also be applied to save custom metrics that cannot be calculated inside the computational graph.

To demonstrate such situations, I will modify the example training code a little bit. The modified version is available here.

The changes made compared to the first example are:

  • Moved the accuracy-calculating code outside of the TF computational graph. In the new example code, it is calculated in the main Python thread using numpy and scikit-learn functions (see the sketch after this list).
  • After the metric (accuracy in this case) is calculated manually, we want it recorded in the TF summary event file. To do this we need to create a summary protobuf, and in order to do this, we need to utilize the session. But we don't need to flow through the entire model graph, only three operations:
    - a placeholder that will receive the pre-calculated metric value
    - an operation that converts the placeholder input into a normal tensor
    - a summary operation that will produce the summary protobuf

The related code lines for these three operations are also covered in the sketch after this list.

  • Once we get the summary protobuf, all that needs to be done is to add it to the summary writer with the appropriate global step value.
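Putting the list above together, here is a minimal self-contained sketch. The metric is plain accuracy computed with scikit-learn, and the dummy label/prediction arrays are illustrative, not the modified gist's exact code:

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import accuracy_score

# the small logging sub-graph: three operations, independent of the model
acc_input = tf.placeholder(tf.float32, shape=(), name='acc_input')  # receives the value
acc_tensor = tf.identity(acc_input)  # converts the fed value into a normal tensor
acc_summary = tf.summary.scalar('test/accuracy', acc_tensor)  # produces the protobuf

with tf.Session() as sess:
    writer = tf.summary.FileWriter('/tmp/logs', sess.graph)

    # metric computed in the main Python thread, outside the graph
    y_true = np.array([0, 1, 1, 0])  # dummy ground-truth labels
    y_pred = np.array([0, 1, 0, 0])  # dummy predictions
    accuracy = accuracy_score(y_true, y_pred)

    # only the three small ops run here, not the whole model graph
    summary_proto = sess.run(acc_summary, feed_dict={acc_input: accuracy})
    writer.add_summary(summary_proto, global_step=5)
    writer.flush()
```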

The resulting TensorBoard screen capture is shown below.

Overall, complicated metric calculations that are simply impossible (or extremely confusing and complicated) to do inside the TensorFlow computational graph can simply be moved out of the graph and executed in a normal Python thread. An example case where I faced such a need was applying a custom NMS to the predicted bounding boxes. At the time I didn't know that TensorFlow offered its own NMS function, so I wrote a Python function that did the job. Using this Python function on the predictions naturally led me to calculate the metric outside of the computational graph.

I suppose that if the metric calculation can be done on the GPU, that may be more efficient. Check whether any metric/loss calculations that you thought needed to be customized are already supported by TensorFlow. If not, and the calculation you seek just seems too overwhelming to implement with matrix computations alone, then I advise you to try out this trick. There could be a loss in computation speed, but at least you can get the job done 🙂

BTW, these complicated considerations all disappear if you use TensorFlow in eager mode. Eager mode doesn't use the computational graph and instead computes operations at run-time. This was a feature that PyTorch had as an advantage over TensorFlow in the past, but TensorFlow now supports run-time execution as well. Since there is no computational graph in eager mode, this “late summary adding” trick does not apply there. Instead, it is much easier, since you can just add the summary without any consideration of the computational graph whatsoever!
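For reference, here is a minimal sketch of eager-mode logging with the tf.contrib.summary API that shipped alongside eager execution in TF 1.x; the logdir and the loss values are placeholders:

```python
import tensorflow as tf

tf.enable_eager_execution()

# no graph, no session: compute a value and log it immediately
writer = tf.contrib.summary.create_file_writer('/tmp/eager_logs')
with writer.as_default(), tf.contrib.summary.always_record_summaries():
    for step in range(1, 11):
        loss = 1.0 / step  # any value computed at run-time
        tf.contrib.summary.scalar('loss', loss, step=step)
    tf.contrib.summary.flush()
```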


Originally published at kwagjj.wordpress.com on September 9, 2018.