Visual question answering (VQA) is the task of answering natural-language questions about images. To handle questions that require multi-step reasoning, Stacked Attention Networks (SANs) stack several layers of attention over image regions, conditioned on the question.
The image model extracts a spatial feature map from the image using a VGGNet.
The question model encodes the question using the final hidden state of an LSTM.
The question can also be encoded with a CNN-based question model, which convolves over word embeddings with filters of several widths and max-pools over time.
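A minimal sketch of such a CNN question encoder, using random (untrained) filters and toy embeddings just to illustrate the shapes; the filter widths and sizes here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_question_encoder(word_embs, filter_widths=(1, 2, 3), n_filters=4):
    """Encode a question by convolving over word embeddings and max-pooling.

    word_embs: (T, d) matrix of word embeddings.
    Returns a vector of length len(filter_widths) * n_filters.
    """
    T, d = word_embs.shape
    pooled = []
    for w in filter_widths:
        # Hypothetical random filters; a real model learns these.
        W = rng.standard_normal((n_filters, w * d)) * 0.1
        feats = []
        for t in range(T - w + 1):
            window = word_embs[t:t + w].reshape(-1)    # (w*d,)
            feats.append(np.tanh(W @ window))          # (n_filters,)
        pooled.append(np.max(np.stack(feats), axis=0)) # max over time
    return np.concatenate(pooled)

q = rng.standard_normal((6, 8))   # 6 words, 8-dim toy embeddings
v_Q = cnn_question_encoder(q)
print(v_Q.shape)                  # 3 widths * 4 filters = 12 features
```

Max-pooling makes the encoding length-invariant: questions of different word counts map to the same fixed-size vector.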
Using the extracted image features (v_I) and question features (v_Q), attention is applied over image regions. Several attention layers can be stacked so the model progressively narrows its focus.
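One attention layer computes h_A = tanh(W_I v_I + W_Q u), region weights p = softmax(w_P h_A), and a refined query u' = sum_i p_i v_i + u. A runnable sketch with random (untrained) parameters and assumed toy dimensions, stacking two such layers:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_layer(v_I, u, W_I, W_Q, w_P):
    """One stacked-attention step (sketch).

    v_I: (m, d) image region features; u: (d,) current query vector.
    Returns the refined query and the attention distribution over regions.
    """
    h = np.tanh(v_I @ W_I.T + W_Q @ u)   # (m, k) combined features
    p = softmax(h @ w_P)                 # attention over m regions
    v_tilde = p @ v_I                    # attention-weighted image summary
    return v_tilde + u, p

m, d, k = 5, 8, 6
v_I = rng.standard_normal((m, d))
u = rng.standard_normal(d)               # initial query = question vector
for _ in range(2):                       # stack two attention layers
    # Hypothetical random parameters; learned in the real model.
    W_I = rng.standard_normal((k, d)) * 0.1
    W_Q = rng.standard_normal((k, d)) * 0.1
    w_P = rng.standard_normal(k) * 0.1
    u, p = attention_layer(v_I, u, W_I, W_Q, w_P)
print(u.shape, p.shape)
```

Each layer adds the attended image summary back into the query, so later layers can attend with a question representation already informed by earlier glimpses.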
SAN achieved state-of-the-art results on DAQUAR-ALL, DAQUAR-REDUCED, COCO-QA, and VQA. Visualizations of the learned attention layers also show progressive focusing on the image regions relevant to the question.
A foundational paper in the VQA area.