
Review for Image Captioning


Image Captioning

Image captioning aims to describe an image in natural language. It is a challenging task in computer vision: to generate a reasonable caption automatically, a model has to capture both global and local features and recognize objects along with their relationships, attributes, activities, etc. In addition, the language model has to do a lot of work to generate grammatically correct sentences.

Image captioning is useful for precise image retrieval, and may also help blind people perceive the world. For example, Facebook uses this technique to help blind people interact with visual content on its social network. A blind Microsoft employee built Seeing AI, cognitive glasses for the blind. All of these significant applications rely on image captioning in some respect.

Conventional Approaches

For the reasons mentioned above, image captioning is a very difficult task. Before the rise of deep learning, conventional approaches were therefore quite straightforward: they simply stitched together existing solutions to the related sub-problems. For example, one approach generates sentences from triplets $\langle object, action, scene \rangle$ modeled as a Markov Random Field (MRF). A specific example is shown below:

object-action-scene triplet

In this approach, the sentence potentials are generated and evaluated using complicated natural language processing techniques. A sentence is scored by computing its similarity to the triplets, which are generated while mapping the image space to the meaning space. One notable limitation is that the elements of a triplet may not fit together; for example, $\langle bottle, walk, street \rangle$ makes no sense at all. Extracting consistent triplets was not easy at the time.

Deep Learning Approaches


To my knowledge, the Multimodal Recurrent Neural Network (m-RNN) is the first work that incorporates a recurrent neural network into a deep multimodal architecture to generate novel image captions. m-RNN models the probability distribution of the next word given the previous words and an image. In its framework, a deep recurrent neural network for sentences and a deep convolutional network for images interact in a multimodal layer, as shown in the following figure:


In m-RNN, the two word embedding layers embed the one-hot input into a dense word representation. They are randomly initialized and learn to encode both the syntactic and semantic meaning of the words. The calculation in the recurrent layer differs slightly from that of a simple RNN: m-RNN re-maps the previous recurrent layer activation and adds it to the current word representation (executed in the red box) instead of concatenating them. m-RNN also uses the Rectified Linear Unit (ReLU) (executed at the fuchsia line) instead of the sigmoid. ReLU is harder to saturate or overfit the data, allows longer stages, and leads to better utilization of the data. The multimodal layer accepts three inputs: the word embedding layer II, the recurrent layer, and the image representation (from the 7th layer of AlexNet or the 15th layer of VGGNet). These three inputs are re-mapped into the same multimodal feature space and added together, and the sum is activated by an element-wise scaled hyperbolic tangent function for fast training. Finally, a softmax layer generates the probability distribution of the next word.
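The multimodal layer and the softmax described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the dimensions, the random weights, and the names `V_w`, `V_r`, `V_i`, `U`, `b`, and `multimodal_step` are hypothetical, and the constants in `scaled_tanh` are the commonly used scaled-tanh values, $1.7159 \tanh(\frac{2}{3}x)$.

```python
import numpy as np

def scaled_tanh(x):
    # Element-wise scaled hyperbolic tangent: 1.7159 * tanh(2x / 3).
    return 1.7159 * np.tanh(2.0 * x / 3.0)

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def multimodal_step(w, r, img, V_w, V_r, V_i, U, b):
    # Re-map the three inputs into the same multimodal space and add them.
    m = scaled_tanh(V_w @ w + V_r @ r + V_i @ img)
    # Softmax layer: probability distribution over the next word.
    return softmax(U @ m + b)

# Toy sizes and random weights (assumptions, not the values used by m-RNN).
rng = np.random.default_rng(0)
d_w, d_r, d_img, d_m, vocab_size = 8, 8, 16, 12, 20
w = rng.normal(size=d_w)        # output of word embedding layer II
r = rng.normal(size=d_r)        # recurrent layer activation
img = rng.normal(size=d_img)    # CNN image representation
V_w = rng.normal(size=(d_m, d_w))
V_r = rng.normal(size=(d_m, d_r))
V_i = rng.normal(size=(d_m, d_img))
U = rng.normal(size=(vocab_size, d_m))
b = np.zeros(vocab_size)

probs = multimodal_step(w, r, img, V_w, V_r, V_i, U, b)
```

Because the three inputs are added rather than concatenated, each needs its own projection matrix into the shared multimodal space, which is why the sketch carries three separate `V_*` matrices.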

m-RNN achieved record-breaking results on image captioning. However, the ReLU activation only slightly reduces the effect of vanishing or exploding gradients, so the long-term dependency issue still needs a better solution. Moreover, m-RNN is trained using maximum likelihood estimation (MLE), so it suffers from exposure bias as well.


Long-term Recurrent Convolutional Networks (LRCN) is an end-to-end trainable model suited to visual understanding tasks. It has been used for activity recognition, image captioning, and video description, thanks to its ability to learn compositional representations in space and time. LRCN treats the three tasks as different kinds of sequential learning tasks and proposes three task-specific instantiations of the LRCN model.

LRCN task-specific instantiations

For the image captioning task, input words are encoded as one-hot vectors and then projected into an embedding space, i.e. each word is represented by its embedding. The visual feature representation also comes from a deep layer of AlexNet. There are several variants of the LRCN image captioning architecture. Basically, it consists of a CNN and a stack of $L$ LSTMs. If the $L - 1$ LSTMs are separated from and independent of the visual input, i.e. they only represent the partial caption (as shown in the right panel), the model is called “factored”; otherwise, it is called “unfactored” (as shown in the middle panel).

LRCN image captioning

The outputs of the final LSTM in the stack are the inputs to a learned linear prediction layer with a softmax, which produces a distribution $P(y_t \mid y_{1:t-1}, \phi_V(x))$ over the words $y_t$ in the vocabulary. A special token $\langle EOS \rangle$ is introduced to mark the end of the caption. At training time, the previous words $y_{1:t-1}$ fed in at time $t$ come from the ground-truth caption. At inference time, $y_{1:t-1}$ are the words predicted at the previous time steps.
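The difference between training-time and inference-time inputs can be illustrated with a toy decoder. This is a minimal sketch, not LRCN itself: the transition table built by the hypothetical `make_toy_model` stands in for the CNN, the LSTM stack, and the softmax, and it conditions only on the previous word (it ignores the image feature and the rest of the prefix). Greedy argmax decoding is used here only to keep the example deterministic; sampling from the distribution would be another option.

```python
import numpy as np

VOCAB = ["<BOS>", "a", "dog", "runs", "<EOS>"]
BOS, EOS = 0, 4

def make_toy_model():
    # Hypothetical stand-in for CNN + LSTM stack + softmax: row `prev`
    # holds P(y_t | y_{t-1}). A real model conditions on the whole prefix
    # y_{1:t-1} and on the image feature phi_V(x); this toy ignores both.
    n = len(VOCAB)
    T = np.full((n, n), 0.1 / (n - 1))
    for prev, nxt in [(0, 1), (1, 2), (2, 3), (3, 4)]:
        T[prev, nxt] = 0.9
    T[EOS] = 1.0 / n  # row is never used in practice, kept normalized
    return T

def teacher_forced_nll(T, caption):
    # Training: the input at step t is the ground-truth word y_{t-1}.
    inputs = [BOS] + caption[:-1]
    return -sum(np.log(T[prev, target]) for prev, target in zip(inputs, caption))

def greedy_decode(T, max_len=10):
    # Inference: the input at step t is the word predicted at step t-1.
    prev, out = BOS, []
    for _ in range(max_len):
        nxt = int(np.argmax(T[prev]))
        out.append(nxt)
        if nxt == EOS:  # stop once the end-of-caption token is produced
            break
        prev = nxt
    return out

T = make_toy_model()
caption = [1, 2, 3, 4]  # ground truth: "a dog runs <EOS>"
nll = teacher_forced_nll(T, caption)
decoded = [VOCAB[i] for i in greedy_decode(T)]
```

In `teacher_forced_nll` the model never sees its own predictions, while in `greedy_decode` every input is a previous prediction, so an early mistake at inference time propagates through the rest of the caption.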

Show and Tell (NIC)

Neural Image Caption (NIC) is another significant work, from Google, for the Microsoft COCO 2014 image captioning challenge. Like LRCN, NIC is end-to-end for both training and inference. NIC directly maximizes the probability of the correct description given the image and also uses an LSTM-based sentence generator. The LSTM makes LRCN and NIC more robust to vanishing or exploding gradients. However, in NIC the visual input is used only at the first LSTM update, and NIC uses a more computationally expensive CNN. The architecture is illustrated in the following figure:


At inference time, besides the sampling approach used by LRCN, a second option is beam search, which NIC uses. Beam search iteratively considers the set of the $k$ best sentences up to time $t$ as candidates to generate sentences of length $t+1$, and keeps only the best $k$ of the resulting sentences.
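A minimal sketch of beam search, assuming the model is exposed as a function that maps a prefix to next-token log-probabilities. The names `beam_search` and `toy_step` and the toy probability table are hypothetical and chosen so that greedy decoding ($k = 1$) and a wider beam ($k = 2$) give different captions. Scores are raw summed log-probabilities with no length normalization, which is a simplification.

```python
import math

def beam_search(step_log_probs, bos, eos, k=3, max_len=20):
    # step_log_probs(prefix) -> {token: log P(token | prefix)}
    beams = [(0.0, [bos])]  # (summed log-probability, token sequence)
    finished = []
    for _ in range(max_len):
        # Extend every sentence of length t by one token.
        candidates = []
        for score, seq in beams:
            for tok, lp in step_log_probs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        # Keep only the k best sentences of length t + 1.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:k]:
            if seq[-1] == eos:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # sentences cut off at max_len
    return max(finished, key=lambda c: c[0])

def toy_step(prefix):
    # Hypothetical model: the next-token distribution depends only on the last token.
    table = {
        "<BOS>": {"a": 0.6, "the": 0.4},
        "a": {"dog": 0.55, "cat": 0.45},
        "the": {"dog": 0.9, "cat": 0.1},
        "dog": {"<EOS>": 1.0},
        "cat": {"<EOS>": 1.0},
    }
    return {tok: math.log(p) for tok, p in table[prefix[-1]].items()}

greedy_score, greedy_seq = beam_search(toy_step, "<BOS>", "<EOS>", k=1)
beam_score, beam_seq = beam_search(toy_step, "<BOS>", "<EOS>", k=2)
```

With $k = 1$ the search commits to "a" because it is the most probable first word, and ends with a sentence of probability $0.6 \times 0.55 = 0.33$. With $k = 2$ the prefix "the" survives the first step and leads to a sentence of probability $0.4 \times 0.9 = 0.36$.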

Show, Attend and Tell

Like region proposals in R-CNN, attention mechanisms can improve performance on object detection and other visual tasks. This paper introduces an attention-based model into image captioning. The model automatically learns to fix its gaze on salient objects while generating the corresponding words in the output sequence. In the introduction, the authors argue for the advantages of an attention model: attention allows salient features to dynamically come to the forefront as needed, which matters more when there is a lot of clutter in an image. Moreover, the widely used visual input representation (from the top layer of a CNN) distills the information in an image effectively, but it may lose information that is useful for richer, more descriptive captions. This model therefore extracts features from a lower convolutional layer as the encoder output, which preserves more information than the top layer. This allows the decoder to selectively focus on certain parts of an image by weighting a subset of all the feature vectors. The visualized attention maps are shown in the following figure:

attention mechanism in image captioning

Mathematically, the attention mechanism works in the following steps, where $\mathbf{a}_i$ is the annotation vector extracted for one part of the given image, $y_t$ is the $t$-th word represented as a one-hot vector, and $\hat{z}_t$ is a dynamic representation of the relevant part of the image input at time $t$.

attention mechanism workflow

The positive weight $\alpha_i$ of each annotation vector $\mathbf{a}_i$ is computed by an attention model $f$, for which the authors use a multilayer perceptron conditioned on the previous hidden state. The function $\phi$ then computes $\hat{z}_t$ from the annotation vectors and their weights. The authors propose two different attention mechanisms that determine $f$ and $\phi$. One is the stochastic attention mechanism, in which the positive weight $\alpha_i$ is interpreted as the probability that location $i$ is the right place to focus on for producing the next word. The other is the deterministic attention mechanism, in which $\alpha_i$ is the relative importance given to location $i$ when blending the $\mathbf{a}_i$ together, i.e. $\hat{z}_t$ is the weighted sum of the $\mathbf{a}_i$.
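The two mechanisms can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' implementation: the single-hidden-layer tanh form of the perceptron, the parameter names `W_a`, `W_h`, and `v`, the function names, and all sizes are hypothetical.

```python
import numpy as np

def attention_weights(a, h_prev, W_a, W_h, v):
    # a: (L, D) annotation vectors, h_prev: (H,) previous hidden state.
    # Assumed MLP form: e_i = v^T tanh(W_a a_i + W_h h_prev).
    e = np.tanh(a @ W_a.T + h_prev @ W_h.T) @ v  # (L,) unnormalized scores
    e = np.exp(e - e.max())                      # stable softmax
    return e / e.sum()                           # positive weights, sum to 1

def deterministic_context(a, alpha):
    # Deterministic ("soft") attention: z_hat is the weighted sum of the a_i.
    return alpha @ a

def stochastic_context(a, alpha, rng):
    # Stochastic ("hard") attention: sample one location i with probability
    # alpha_i and use its annotation vector as z_hat.
    i = int(rng.choice(len(alpha), p=alpha))
    return i, a[i]

# Toy sizes (assumptions): L locations, D feature dims, H hidden dims, K MLP units.
rng = np.random.default_rng(0)
L, D, H, K = 6, 4, 5, 3
a = rng.normal(size=(L, D))
h_prev = rng.normal(size=H)
W_a = rng.normal(size=(K, D))
W_h = rng.normal(size=(K, H))
v = rng.normal(size=K)

alpha = attention_weights(a, h_prev, W_a, W_h, v)
z_soft = deterministic_context(a, alpha)
idx, z_hard = stochastic_context(a, alpha, rng)
```

The deterministic variant is a smooth function of the weights and can be trained with ordinary backpropagation, whereas the stochastic variant involves a sampling step that is not differentiable and requires a different training procedure.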