The visual question answering (VQA) task is compositional in nature.



This paper approaches VQA by composing modules into a network architecture tailored to each question. It defines five primitive modules that can be assembled to match the structure of any question: attention, re-attention, combination, classification, and measurement. The key component of these modules is an attention mechanism that lets the model focus on the relevant parts of the given image. A question is parsed into a universal dependency representation, which is then mapped to a network layout. An embedding of the question is combined with the output at the end of the network to capture subtle differences between questions. The composed model is trained end-to-end.
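To make the idea of assembling a network from a layout concrete, here is a minimal sketch in Python. It is not the paper's implementation: in the paper the modules are learned neural networks operating on image features, whereas here each module is a hand-written function over a toy "image" (a list of cells, each a set of attribute words). The layout format (nested tuples), the function names, and the instance words such as "exists" and "and" are assumptions made for illustration, and the question embedding combined at the end of the network is omitted.

```python
from typing import List, Set

Image = List[Set[str]]   # toy "image": one set of attribute words per cell (assumption)
Attention = List[float]  # one attention weight per cell

COLORS = {"red", "green", "blue"}  # hypothetical attribute vocabulary


def attend(image: Image, word: str) -> Attention:
    """Attention: highlight the cells that contain `word`."""
    return [1.0 if word in cell else 0.0 for cell in image]


def re_attend(att: Attention, word: str) -> Attention:
    """Re-attention: transform one attention map into another."""
    if word == "not":
        return [1.0 - a for a in att]
    raise ValueError(f"unknown re_attend instance: {word}")


def combine(a: Attention, b: Attention, word: str) -> Attention:
    """Combination: merge two attention maps."""
    if word == "and":
        return [min(x, y) for x, y in zip(a, b)]
    if word == "or":
        return [max(x, y) for x, y in zip(a, b)]
    raise ValueError(f"unknown combine instance: {word}")


def measure(att: Attention, word: str) -> str:
    """Measurement: map an attention map alone to an answer."""
    if word == "exists":
        return "yes" if max(att) > 0.5 else "no"
    if word == "count":
        return str(sum(1 for a in att if a > 0.5))
    raise ValueError(f"unknown measure instance: {word}")


def classify(image: Image, att: Attention, word: str) -> str:
    """Classification: map the image plus an attention map to an answer."""
    if word == "color":
        best = max(range(len(att)), key=att.__getitem__)  # most attended cell
        colors = sorted(image[best] & COLORS)
        return colors[0] if colors else "unknown"
    raise ValueError(f"unknown classify instance: {word}")


def run(layout, image: Image):
    """Evaluate a layout of the form (module, word, *child_layouts) recursively."""
    name, word, *children = layout
    if name == "attend":
        return attend(image, word)
    inputs = [run(child, image) for child in children]
    if name == "re_attend":
        return re_attend(inputs[0], word)
    if name == "combine":
        return combine(inputs[0], inputs[1], word)
    if name == "measure":
        return measure(inputs[0], word)
    if name == "classify":
        return classify(image, inputs[0], word)
    raise ValueError(f"unknown module: {name}")


# Toy scene with three cells.
image = [{"red", "circle"}, {"blue", "square"}, {"green", "circle"}]

# "Is there a red circle?"
is_red_circle = ("measure", "exists",
                 ("combine", "and", ("attend", "red"), ("attend", "circle")))
# "What color is the square?"
square_color = ("classify", "color", ("attend", "square"))

print(run(is_red_circle, image))  # yes
print(run(square_color, image))   # blue
```

The point of the sketch is that the same small set of modules is reused across questions, and only the layout changes. In the paper, that layout comes from the dependency parse of the question rather than being written by hand.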



The authors report state-of-the-art results, at the time of publication, on both the VQA and SHAPES datasets.

Andreas, Jacob, et al. “Neural module networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.