Previous works achieved successful results in VQA by modeling visual attention. This paper suggests co-attention model for VQA to pay attention to both images (where to look) and words (what words to listen to).


Co-attention model in this paper pay attention to the images and question in three level: word level, phrase level and question level. Embedding matrix is used for word level, 1D convolution with 3 window sizes(1-3) and max polling are used for phrase level, LSTM is used to encode question level vector.


Two methods for co-attention are suggested: parallel co-attention and alternating co-attention. Parallell co-attention first form bilinear affinity matrix to capture the relationship between images and words and result attened visual and word features.

Alternating co-attention method first summerize the question into a single vector, attend to the image based on the question vector, and attend to the question based on the attened image feature. In first step, X = Q and g = 0. In second step, X = V and . In third step, X = Q and .


Co-attention is performed at various level of hierarchy of question. Various level of attended features are recursively encoded with MLP.



Co-attention model with pretrained image feature achieved good result on VQA dataset.

Lu, Jiasen, et al. “Hierarchical question-image co-attention for visual question answering.” Advances In Neural Information Processing Systems. 2016.