WHY?

Visual question answering task is compositional in nature.

WHAT?

image

This paper tries to solve VQA by composing modeules to construct a network architecture based on a given question. Primitive modules that can be composed into any configuration of questions are defined: attention, re-attention, combination, classification, and measurement. The key component of the modules is attention mechanism that allow model to focus on the parts of the given image. A question is parsed to form an universal dependency representation which can be map into a network layout. Embedded question is combined to the end of network to capture subtle differences. The composed model is trained end-to-end.

So?

image

This model achieved the best performance on VQA and SHAPES datasets.

Andreas, Jacob, et al. “Neural module networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.