Understanding the semantic relations between vision and language data has become a research trend in artificial intelligence and robotic systems. The lack of training data is a central issue for vision-language understanding. We address the problem of cross-modal retrieval between images and sentences when paired training samples are insufficient. Inspired by recent work in variational inference, we extend the auto-encoding variational Bayes framework to a semi-supervised model for the image-sentence mapping task. Our method does not require all training images and sentences to be paired. The proposed model is an end-to-end system built on a two-level variational embedding structure. Unpaired data are used in the first-level embedding to support intra-modality statistics, so that the lower bound of the joint marginal likelihood of the paired data embeddings can be better approximated. We evaluate the proposed retrieval model on two popular datasets, Flickr30K and Flickr8K, where it outperforms related state-of-the-art methods.
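For readers unfamiliar with the "lower bound" mentioned in the abstract, the generic single-modality bound from the auto-encoding variational Bayes framework is sketched below. This is only the standard textbook form, not the paper's two-level semi-supervised objective, which the abstract does not spell out. The symbols are assumptions introduced for illustration: $x$ is an observed sample (for example an image or sentence feature), $z$ its latent embedding, $q_\phi(z \mid x)$ the encoder, $p_\theta(x \mid z)$ the decoder, and $p(z)$ the prior.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
\;-\; \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
```

The first term rewards embeddings from which the input can be reconstructed, and the second keeps the approximate posterior close to the prior. The abstract suggests that the proposed model tightens an analogous bound on the joint marginal likelihood of paired embeddings by letting unpaired data inform the first-level, per-modality embeddings.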
|Title of host publication||2017 IEEE International Conference on Robotics and Automation (ICRA)|
|Place of Publication||Piscataway|
|Publication status||E-pub ahead of print - 24 Jul 2017|