GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

20 May 2019 in Studies on Deep Learning, Visual Question Answering

WHY?

Previous VQA datasets had several problems. Strong language priors, non-compositional language, and variability in language were key obstacles for models trying to learn proper concepts and logic from VQA datasets. The synthetically generated CLEVR dataset solved these problems to some extent but lacked realism because it stays within a relatively simple domain.

Generative Question Answering: Learning to Answer the Whole Question

02 Apr 2019 in Studies on Deep Learning, Visual Question Answering

WHY?

Discriminative question answering models often overfit to datasets by latching onto any kind of clue that leads to the answer.

Visual Question Generation as Dual Task of Visual Question Answering

01 Apr 2019 in Studies on Deep Learning, Visual Question Answering

WHY?

Visual question answering and visual question generation are complementary tasks. Learning one task may benefit the other.

Learning to Reason: End-to-End Module Networks for Visual Question Answering

12 Feb 2019 in Studies on Deep Learning, Visual Question Answering

WHY?

Earlier neural module networks for VQA depend on a naive semantic parser to unroll the layout of the network. This paper suggests End-to-End Module Networks (N2NMN) to learn the layout directly from the data.

Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge

11 Feb 2019 in Studies on Deep Learning, Visual Question Answering

WHY?

This paper describes several tips and tricks for the VQA challenge, drawn from the first-place model in the 2017 VQA Challenge. It also conducts comprehensive ablation experiments for each trick.

Large-Scale Answerer in Questioner's Mind for Visual Dialog Question Generation

08 Feb 2019 in Studies on Deep Learning, Visual Question Answering

WHY?

AQM solves visual dialogue tasks with an information-theoretic approach. However, the information gain of each candidate question needs to be calculated explicitly, which limits scalability. This paper suggests AQM+ to handle large-scale problems.
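To make the scalability concern concrete, here is a minimal sketch of an explicit information-gain calculation for a single candidate question. It assumes a discrete set of target classes and answers; the function names, the probability tables, and the toy numbers are all hypothetical and are not taken from the AQM or AQM+ code. The point is only that the sum runs over every class and every answer, and has to be repeated for every candidate question.

```python
import numpy as np

def information_gain(prior, answer_likelihood):
    """Mutual information between the target class and the answer.

    prior:             shape (C,),   p(c) over candidate target classes
    answer_likelihood: shape (C, A), p(a | c, q) for one candidate question q
    (Both tables are illustrative inputs, not part of any real AQM API.)
    """
    prior = np.asarray(prior, dtype=float)
    lik = np.asarray(answer_likelihood, dtype=float)
    marginal = prior @ lik                      # p(a | q), shape (A,)
    joint = prior[:, None] * lik                # p(c) * p(a | c, q)
    mask = joint > 0                            # skip zero-probability terms
    ratio = np.ones_like(lik)
    ratio[mask] = lik[mask] / np.broadcast_to(marginal, lik.shape)[mask]
    return float(np.sum(joint * np.log(ratio)))

def select_question(prior, candidates):
    """Pick the candidate question with the largest information gain.

    candidates: dict mapping a question string to its (C, A) likelihood table.
    """
    return max(candidates, key=lambda q: information_gain(prior, candidates[q]))

# Toy example: two target classes, two possible answers.
prior = [0.5, 0.5]
candidates = {
    "informative question": [[1.0, 0.0], [0.0, 1.0]],    # answer reveals the class
    "uninformative question": [[0.5, 0.5], [0.5, 0.5]],  # answer says nothing
}
best = select_question(prior, candidates)
```

With many candidate questions, classes, and answers, this exhaustive loop is the cost that the excerpt describes as a lack of scalability.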

Answerer in Questioner's Mind: Information Theoretic Approach to Goal-Oriented Visual Dialog

31 Jan 2019 in Studies on Deep Learning, Visual Question Answering

WHY?

Goal-oriented dialogue tasks require two agents (a questioner and an answerer) to communicate in order to solve the task. Previous supervised learning and reinforcement learning approaches struggled to produce appropriate questions because of the complexity of forming a sentence. This paper suggests an information-theoretic approach to solve this task.

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

26 Jan 2019 in Studies on Deep Learning, Visual Question Answering

WHY?

Former methods used element-wise sum, element-wise product, or concatenation to represent the relation between two vectors. A bilinear model (outer product) of two vectors is a more expressive way of representing the relation, but its dimensionality usually becomes too large. This paper suggests multimodal compact bilinear pooling (MCB) to represent such relations compactly.
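The following is a minimal NumPy sketch of the general compact bilinear pooling idea: project each vector with a Count Sketch, then combine the two sketches by circular convolution computed with the FFT. This is equivalent to sketching the outer product without ever materializing it. The function names, dimensions, and random seeds are assumptions made for illustration and are not the authors' implementation.

```python
import numpy as np

def make_sketch_params(dim, out_dim, rng):
    """Draw a random hash bucket and a random sign for each input index."""
    h = rng.integers(0, out_dim, size=dim)   # bucket index in [0, out_dim)
    s = rng.choice([-1.0, 1.0], size=dim)    # sign in {-1, +1}
    return h, s

def count_sketch(x, h, s, out_dim):
    """Project x to out_dim dimensions by adding signed entries into buckets."""
    y = np.zeros(out_dim)
    np.add.at(y, h, s * x)                   # unbuffered add handles repeated buckets
    return y

def compact_bilinear(x, z, params_x, params_z, out_dim):
    """Sketch of the outer product of x and z without forming it explicitly."""
    sx = count_sketch(x, *params_x, out_dim)
    sz = count_sketch(z, *params_z, out_dim)
    # Circular convolution of the two sketches, done in the frequency domain.
    return np.fft.irfft(np.fft.rfft(sx) * np.fft.rfft(sz), n=out_dim)

# Toy usage with small, arbitrary dimensions.
rng = np.random.default_rng(0)
x, z = rng.normal(size=6), rng.normal(size=5)
px = make_sketch_params(6, 8, rng)
pz = make_sketch_params(5, 8, rng)
fused = compact_bilinear(x, z, px, pz, 8)    # 8 values instead of 6 * 5 = 30
```

The output size is a free parameter, so the fused representation can stay far smaller than the full outer product while still depending on every pair of input entries.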

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

25 Jan 2019 in Studies on Deep Learning, Visual Question Answering

WHY?

In image captioning and visual question answering, the features of an image are usually extracted from the spatial output layer of a pretrained CNN model.

Compositional Attention Networks for Machine Reasoning

24 Jan 2019 in Studies on Deep Learning, Visual Question Answering

WHY?

Previous methods for visual reasoning lacked interpretability. This paper suggests the MAC network, a fully differentiable and interpretable attention-based visual reasoning model.

Hierarchical Question-Image Co-Attention for Visual Question Answering

23 Jan 2019 in Studies on Deep Learning, Visual Question Answering

WHY?

Previous works achieved successful results in VQA by modeling visual attention. This paper suggests a co-attention model for VQA that attends to both the image (where to look) and the question (which words to listen to).

MUTAN: Multimodal Tucker Fusion for Visual Question Answering

22 Jan 2019 in Studies on Deep Learning, Visual Question Answering

WHY?

While a bilinear model is an effective method for capturing the relationship between two spaces, its number of parameters is often intractable. This paper suggests reducing the number of parameters by controlling the rank of the matrix with Tucker decomposition.
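As a rough illustration of why a Tucker factorization helps, the sketch below compares the parameter count of a full bilinear tensor with that of a Tucker-factorized one, and applies the factorized form to a pair of vectors. All names and sizes are hypothetical choices for this example, and any further rank constraint on the core tensor is not shown.

```python
import numpy as np

def full_bilinear_params(dq, dv, dy):
    """Parameters of a full bilinear tensor of shape (dq, dv, dy)."""
    return dq * dv * dy

def tucker_params(dq, dv, dy, tq, tv, to):
    """Parameters of three factor matrices plus a (tq, tv, to) core tensor."""
    return dq * tq + dv * tv + to * dy + tq * tv * to

def tucker_fusion(q, v, Wq, Wv, core, Wo):
    """Project q and v, mix them through the small core, then map to the output."""
    q_proj = q @ Wq                                        # shape (tq,)
    v_proj = v @ Wv                                        # shape (tv,)
    z = np.einsum("i,j,ijk->k", q_proj, v_proj, core)      # shape (to,)
    return z @ Wo                                          # shape (dy,)

# Illustrative sizes only: the factorized form is far smaller.
n_full = full_bilinear_params(2048, 2048, 1000)
n_tucker = tucker_params(2048, 2048, 1000, 256, 256, 256)

# Tiny random instance to exercise the fusion function.
rng = np.random.default_rng(0)
q, v = rng.normal(size=7), rng.normal(size=6)
Wq, Wv = rng.normal(size=(7, 3)), rng.normal(size=(6, 4))
core, Wo = rng.normal(size=(3, 4, 2)), rng.normal(size=(2, 5))
y = tucker_fusion(q, v, Wq, Wv, core, Wo)
```

The factorized model computes the same kind of bilinear interaction as a full tensor built from these factors, but only the factors and the small core need to be stored and learned.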

Chain of Reasoning for Visual Question Answering

21 Jan 2019 in Studies on Deep Learning, Visual Question Answering

WHY?

Previous methods for visual question answering performed one-step or static reasoning, while some questions require a chain of reasoning.

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

17 Jan 2019 in Studies on Deep Learning, Visual Question Answering

WHY?

A caption for an image can be generated with an attention-based model by aligning each word to a part of the image.
