Google Machine Learning Models for Image Captioning Ported to TensorFlow and Open-Sourced

by Dylan Raithel on Oct 28, 2016. Estimated reading time: 2 minutes

Google recounted its work on image captioning over the past few years in its announcement open-sourcing a TensorFlow model for the task, along with tests comparing the accuracy and performance of the new approach against existing implementations. The 2014 Inception V1, the 2015 Inception V2, and the more recent Inception V3 image classification models each improve on their predecessor, reaching 89.6, 91.8 and 93.9 percent top-5 accuracy, respectively, on the ImageNet 2012 image classification task. The quality of the machine-generated captions is measured with BLEU-4, a metric originally developed for machine translation that scores how closely generated sentences match human-written reference sentences. The TensorFlow-based approach gained 2 points on that metric over the previous leading system, which was implemented in DistBelief.
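To make the BLEU-4 metric more concrete, here is a minimal sketch of a sentence-level BLEU-4 score in plain Python. It assumes uniform weights over 1- to 4-grams, no smoothing, and whitespace tokenization; the function names `ngrams` and `bleu4` are illustrative and are not taken from Google's release, which may compute the metric differently.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, references):
    """Sentence-level BLEU-4: uniform weights, no smoothing (illustrative sketch)."""
    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        # For each n-gram, find its highest count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip candidate counts so repeated n-grams are not over-rewarded.
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives a zero score
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty, using the reference whose length is closest to the candidate's.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)

    # Geometric mean of the four modified precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / 4)


reference = "a blue and yellow train travelling down the tracks".split()
generated = "a yellow and blue train travelling down the tracks".split()
print(round(bleu4(generated, [reference]), 3))
```

A generated caption identical to a reference scores 1.0, and one sharing no words with any reference scores 0.0. Reported BLEU figures are usually computed over a whole test set and multiplied by 100, which is the scale on which a "2 point" gain is normally expressed.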

One of the problems noted in porting and improving on the previous implementation is the difference between classifying the objects in an image and describing those objects and how they relate to one another. The model appears to address this with a fine-tuning phase that lets it extract information useful for describing details of objects, separate from the classification phase. In effect, it splits image classification, which identifies objects, from a phase that adds adjectives and prepositional phrases, and from a phase that structures the caption to make it more syntactically correct and humanlike.

An example would be recognizing an image as a train sitting on train tracks, and then recognizing that the train is also yellow and blue. The synthesized result is a caption describing a blue and yellow train travelling down the tracks. In this case, the question is not whether the model can determine from a still image that the object is in motion, but whether the captions in the training data described similar images as being in motion.

The model can combine components of previously learned image captions to create new captions for images whose particular combination of classified objects does not appear in any single training example, but whose individual components do appear across the training data. In this example, the model comes up with a caption that had not previously existed.

Benchmarks comparing the previous DistBelief implementation with the new TensorFlow-based implementation using Inception V3 showed TensorFlow taking 25% of the training time that DistBelief required on an Nvidia K20 GPU, at 0.7 versus 3.0 seconds per training step. In addition to the TensorFlow-based Inception V3 image classification model, Google mentioned the release of the Inception-ResNet-v2 model, but did not provide any benchmarks of its performance. The training data sets were not provided; the training data is based on human-generated captions for images.
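As a quick check on the arithmetic, the sketch below computes the ratio of the two reported timings. It assumes both figures are times for one training step and that both implementations need the same number of steps, so the per-step ratio carries over to total training time; the function name `step_time_ratio` is hypothetical.

```python
def step_time_ratio(new_seconds, old_seconds):
    """Fraction of the old per-step time that the new implementation needs."""
    return new_seconds / old_seconds


# Reported timings on an Nvidia K20 GPU: 0.7 s (TensorFlow) and 3.0 s (DistBelief).
ratio = step_time_ratio(0.7, 3.0)
print(f"{ratio:.0%} of the previous training time")
```

The ratio works out to about 23%, which is consistent with the roughly one-quarter figure quoted in the benchmark.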
