WHY?

Earlier neural module networks (NMN) for VQA depend on an off-the-shelf semantic parser to decide the layout of the network. This paper proposes End-to-End Module Networks (N2NMN), which learn the layout directly from data.

WHAT?

image

At each step, the layout policy selects one of the predefined modules.

image

Instead of using a semantic parser to derive the module layout from the question, N2NMN trains an encoder-decoder model to predict it. One RNN encodes the question and another RNN (with attention) serves as the layout policy, emitting a sequence of module tokens. Beam search is used to find the layout with the maximum probability.
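The beam search step can be illustrated with a minimal sketch. Everything here is hypothetical: the module names, the probabilities in `toy_policy`, and the function signatures are made up for illustration and stand in for the attentional RNN decoder described above.

```python
import math

def beam_search(step_probs, beam_size, max_len, eos="<eos>"):
    """Return (sequence, log-probability) of the best sequence found.

    step_probs(prefix) -> dict mapping each next token to its probability.
    """
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, p in step_probs(prefix).items():
                candidates.append((prefix + (token,), score + math.log(p)))
        # Keep only the beam_size most probable expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[1])

def toy_policy(prefix):
    # Hypothetical next-module distribution (not taken from the paper).
    table = {
        (): {"find": 0.6, "transform": 0.4},
        ("find",): {"and": 0.5, "<eos>": 0.5},
        ("transform",): {"<eos>": 0.9, "and": 0.1},
    }
    return table.get(prefix, {"<eos>": 1.0})

best_layout, best_logp = beam_search(toy_policy, beam_size=2, max_len=4)
print(best_layout, math.exp(best_logp))
```

In this toy distribution a greedy decoder would start with `find`, the most probable first token, while a beam of size 2 keeps `transform` alive and finds the overall more probable sequence.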

image

The module sequence produced by the decoder is mapped to a syntax tree using Reverse Polish Notation (a post-order traversal of the tree). Since sampling a discrete layout is not differentiable, the REINFORCE algorithm is used to estimate the gradient for the layout policy. Entropy regularization (coefficient 0.005) is used to encourage exploration.
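The mapping from a Reverse Polish Notation sequence to a tree can be sketched with a stack. The module names and arities in `ARITY` are assumptions made for this example, not the paper's exact module inventory; the arity counts only the sub-module outputs a module consumes.

```python
# Hypothetical arities: number of child-module outputs each module consumes.
ARITY = {"find": 0, "transform": 1, "and": 2, "describe": 1}

def rpn_to_tree(tokens, arity=ARITY):
    """Convert a post-order (RPN) module sequence into a nested (name, children) tree."""
    stack = []
    for tok in tokens:
        n = arity[tok]
        if len(stack) < n:
            raise ValueError(f"module {tok!r} needs {n} inputs, found {len(stack)}")
        # Pop the last n subtrees (in order) and make them children of this module.
        split = len(stack) - n
        children = stack[split:]
        del stack[split:]
        stack.append((tok, children))
    if len(stack) != 1:
        raise ValueError("sequence does not form a single tree")
    return stack[0]

tree = rpn_to_tree(["find", "find", "and", "describe"])
print(tree)
```

A sequence that leaves more than one subtree on the stack, or that asks for inputs that are not there, is rejected as an invalid layout.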

Since learning the layout from scratch is challenging, additional knowledge (an expert policy) can be provided for initialization. The KL-divergence between the expert policy and the layout policy is added to the loss function.
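As a rough sketch of the KL term, assume the expert policy is deterministic, i.e. it assigns probability 1 to a single layout per question. Under that assumption the KL-divergence from the expert to the layout policy reduces to the negative log-likelihood of the expert layout. The function name and the input format below are hypothetical.

```python
import math

def cloning_loss(expert_layout, step_probs):
    """Negative log-likelihood of the expert layout under the layout policy.

    step_probs[t] is a dict of next-token probabilities at step t, computed
    with the expert prefix as decoder input (teacher forcing).
    Assumes a deterministic expert, so KL(expert || policy) = -log p(expert layout).
    """
    return -sum(math.log(step_probs[t][tok]) for t, tok in enumerate(expert_layout))

# Made-up probabilities for a three-token expert layout.
expert = ["find", "describe", "<eos>"]
probs = [
    {"find": 0.5, "and": 0.5},
    {"describe": 0.8, "find": 0.2},
    {"<eos>": 1.0},
]
loss = cloning_loss(expert, probs)
print(loss)
```

The loss is zero only when the policy puts all of its probability on the expert layout at every step.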

SO?

N2NMN showed better results than NMN on SHAPES, CLEVR and VQA. The expert policies for SHAPES and CLEVR are provided with the datasets, and the Stanford Parser is used to obtain one for VQA.

image

The learned layout policy is shown to produce appropriate layouts. Continuing to train the layout policy after initializing it from the expert policy is shown to work even better than only cloning the expert policy.

image

Hu, Ronghang, et al. “Learning to reason: End-to-end module networks for visual question answering.” CoRR, abs/1704.05526 (2017).