WHY?

Previous methods for visual reasoning lacked interpretability. This paper suggests MAC network which is fully differentiable and interpretable attention based visual reasoning model.

WHAT?

image

MAC network consists of three parts: the input unit, the MAC cells, and the read unit. The input unit encodes the question and the image. The word embeddings of question is processed by d-dimensional biLSTM. The hidden states of the model are contextual representation for each word. The concatenation of the last hidden state of backward, and forward LSTM is used as a question representation. This question representaion is linearly projected to vectors for each reasoning step. For image features, the conv4 layer output of pretrained ResNet101 is used. Image features are called knowledge base in this paper.

image

Recurrent Memory, Attention, and Composition cell(MAC cell) is newly designed recurrent cell for reasoning operation. p number of MAC cells are stacked to perform multiple reasoning steps. MAC cells have two hidden states: control() and memory(). The control state represents the reasoning operation and the memory state holds intermediate information.

image

The control unit updates the control state. New control state is a soft attention-based weighted average of the question words based on the question vector for each step.

image

The read unit extracts information from memory. The interaction of memory and image features(I) and independent facts(k) are concatenated to produce information(I’). The information(I’) interacts with control units() to produce weights(rv). The knowledge bases are weighted averaged to produce the read unit().

image

The memory unit is updated based on the retrieved read unit and previous memory. There are two ways that the control unit controls the write operation: one is performing self attention and the other is to be used as memory gate.

image

The question representation(q) and the final memory state() are used as input for 2-layer-fc softmax classifier to result the answer.

So?

image

MAC showed the state-of-the-art result on both Clevr and Clevr-Humans without using information about data generation.

image

Also, the attention map of MAC showed that this model is properly solving the task.

Hudson, Drew A., and Christopher D. Manning. “Compositional attention networks for machine reasoning.” arXiv preprint arXiv:1803.03067 (2018).