  • Compositional Attention Networks for Machine Reasoning

    WHY? Previous methods for visual reasoning lacked interpretability. This paper proposes the MAC network, a fully differentiable and interpretable attention-based model for visual reasoning. WHAT? The MAC network consists of three parts: the input unit, the MAC cells, and the output unit. The input unit encodes the question and the image....


  • Hierarchical Question-Image Co-Attention for Visual Question Answering

    WHY? Previous work achieved strong results in VQA by modeling visual attention. This paper proposes a co-attention model for VQA that attends to both the image (where to look) and the question (which words to listen to). WHAT? The co-attention model in this paper attends to the image and the question in three...


  • MUTAN: Multimodal Tucker Fusion for Visual Question Answering

    WHY? While a bilinear model is an effective way to capture the relationship between two spaces, its number of parameters is often intractable. This paper proposes reducing the number of parameters by controlling the rank of the interaction tensor with a Tucker decomposition. Note: ×_i denotes the i-mode product between a tensor...


  • Chain of Reasoning for Visual Question Answering

    WHY? Previous methods for visual question answering performed one-step or static reasoning, although some questions require a chain of reasoning. WHAT? The chain of reasoning (CoR) model alternately updates the objects and their relations to solve questions that require a chain of reasoning. CoR consists of three parts: data embedding, chain of reasoning and...


  • Deformable Convolutional Networks

    WHY? The spatial sampling of a convolutional neural network is geometrically fixed. This paper proposes two modules that let a CNN capture geometric structure more flexibly. WHAT? Deformable convolution modifies the regular sampling grid of a convolution by augmenting it with offsets. The offsets are generated by a conv layer with 2N channels....
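
The offset idea in the deformable convolution entry above can be sketched in a few lines. This is a minimal pure-Python illustration, not the paper's implementation: it computes one output value of a 3x3 deformable convolution on a single-channel feature map, assuming the offsets are already given (in the paper they are predicted by a conv layer) and that fractional positions are read with bilinear interpolation. The function names `bilinear` and `deformable_conv_at` are hypothetical, chosen for this example only.

```python
import math

def bilinear(feature, y, x):
    """Sample a 2D list `feature` at fractional (y, x); out-of-range reads count as 0."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    # Blend the four integer neighbours, weighted by proximity to (y, x).
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value

def deformable_conv_at(feature, weights, offsets, cy, cx):
    """One output value of a 3x3 deformable convolution centred at (cy, cx).

    weights: 9 kernel weights, row-major over the regular 3x3 grid
    offsets: 9 (dy, dx) pairs, one per sampling point (assumed given here)
    """
    # Regular grid of a 3x3 kernel, relative to the centre.
    grid = [(gy, gx) for gy in (-1, 0, 1) for gx in (-1, 0, 1)]
    out = 0.0
    for w_k, (gy, gx), (oy, ox) in zip(weights, grid, offsets):
        # Each sampling point is shifted by its own offset before reading.
        out += w_k * bilinear(feature, cy + gy + oy, cx + gx + ox)
    return out
```

With all offsets set to `(0.0, 0.0)` this reduces to an ordinary 3x3 convolution at that location; non-zero offsets move each of the nine sampling points independently.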