Paper summary

TIME: Text and Image Mutual-Translation Adversarial Networks

1. What is this paper about?

Under the GAN framework, the paper shows that G can be boosted substantially by training it with D acting as a language model, which adapts the Transformer to connect image features with word embeddings. Furthermore, it introduces an annealing conditional hinge loss that keeps the adversarial learning balanced. The resulting model can not only generate images from text but also produce image captions.
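To make the mutual-translation idea concrete, here is a heavily simplified toy sketch (all module names, shapes, and the averaging "generator" are assumptions for illustration, not the paper's architecture): the same word-embedding table is shared between G and D, and D doubles as a language model by mapping an image feature back to a word distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 10, 8

# Shared word embeddings and a projection from image features to the vocabulary.
W_embed = rng.normal(size=(VOCAB, DIM))
W_out = rng.normal(size=(DIM, VOCAB))

def generator(text_tokens):
    """Toy G: average word embeddings as a stand-in 'image feature'."""
    return W_embed[text_tokens].mean(axis=0)

def discriminator_caption(image_feat):
    """Toy D as a language model: image feature -> word distribution (softmax)."""
    logits = image_feat @ W_out
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

tokens = np.array([1, 3, 5])
fake_image_feat = generator(tokens)                  # text -> image direction
word_probs = discriminator_caption(fake_image_feat)  # image -> text direction
```

Because both directions share the same embedding table, improving the captioning side of D directly shapes the signal G is trained against.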

2. What's better than previous papers?

Previous models depend on:

  1. a pre-trained text encoder for word and sentence embeddings.
  2. an additional image encoder to ascertain image-text consistency.

To remove these dependencies, the paper trains G with D acting as a language model and introduces a novel attention model using the Transformer. This is also the first work to handle both text-to-image generation and image captioning in a single model under the GAN framework.

3. What are the important parts of the technique and methods?

The following figure shows the overall architecture of the model.

(Figure: overall model architecture)

The paper proposes a 2-D counterpart to the 1-D positional encoding used in the Transformer.

With 1-D positional encoding, attention is hard to disambiguate because the Transformer has no way to discern spatial information from the flattened image features.

In contrast to the 1-D version, the 2-D encoding makes spatial positions distinguishable for attention, as shown in the following figure.

(Figure: 2-D positional encoding)
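A common way to build such a 2-D encoding (a minimal sketch; the split of channels between row and column is an assumption, not necessarily the paper's exact formulation) is to sinusoidally encode the row index with half of the channels and the column index with the other half, then flatten the H x W grid into tokens:

```python
import numpy as np

def sinusoid(pos, dim):
    """Standard sinusoidal encoding for a vector of scalar positions."""
    i = np.arange(dim // 2)
    angles = pos[:, None] / (10000 ** (2 * i / dim))
    enc = np.zeros((len(pos), dim))
    enc[:, 0::2] = np.sin(angles)  # even channels: sine
    enc[:, 1::2] = np.cos(angles)  # odd channels: cosine
    return enc

def pos_encoding_2d(h, w, dim):
    """Encode rows with the first dim/2 channels and columns with the rest,
    then flatten the h x w grid into (h*w, dim) token encodings."""
    row = sinusoid(np.arange(h), dim // 2)  # (h, dim/2)
    col = sinusoid(np.arange(w), dim // 2)  # (w, dim/2)
    return np.concatenate(
        [np.repeat(row, w, axis=0),   # row half: constant within a row
         np.tile(col, (h, 1))],       # column half: repeats per row
        axis=1)                       # -> (h*w, dim)

pe = pos_encoding_2d(4, 4, 8)
```

Unlike a 1-D encoding of the flattened sequence, two tokens in the same row (or same column) share half of their encoding, so attention can recover spatial relations in the grid.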

So that G can learn a good semantic visual translation from early iterations, the paper introduces the annealing conditional hinge loss for this model.
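A minimal sketch of what an annealed conditional hinge loss for D could look like (the linear annealing schedule and the use of mismatched text pairs for the conditional term are assumptions; the paper's exact formulation may differ):

```python
import numpy as np

def annealed_cond_hinge_d(real_s, fake_s, mismatch_s, t, t_max):
    """Hinge loss for D: unconditional real/fake terms plus a conditional
    term on mismatched image-text pairs whose weight is annealed in over
    training, so G is not penalized for semantics too early."""
    alpha = min(1.0, t / t_max)  # hypothetical linear annealing schedule
    uncond = (np.maximum(0, 1 - real_s).mean()
              + np.maximum(0, 1 + fake_s).mean())
    cond = np.maximum(0, 1 + mismatch_s).mean()
    return uncond + alpha * cond

real_s = np.array([0.5, 2.0])      # D scores on real images
fake_s = np.array([-0.5])          # D scores on generated images
mismatch_s = np.array([0.0])       # D scores on mismatched image-text pairs
early = annealed_cond_hinge_d(real_s, fake_s, mismatch_s, t=0, t_max=100)
late = annealed_cond_hinge_d(real_s, fake_s, mismatch_s, t=100, t_max=100)
```

Early in training the conditional term contributes nothing, and its weight grows until the image-text consistency constraint is fully enforced.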

4. How did they verify it?

The paper evaluates both tasks: text-to-image generation and image captioning.

(Figures: qualitative results on CUB and COCO)

(Table: quantitative results)

5. Is there a debate?