  • FiLM: Visual Reasoning with a General Conditioning Layer

    WHY? Existing architectures for relational reasoning lack general-purpose components for relational reasoning and visual question answering. WHAT? This paper proposes Feature-wise Linear Modulation (FiLM) to conditionally focus on the image. By linearly transforming the output of each convolutional filter, FiLM conditionally chooses certain filters. The FiLM generator takes questions as...


  • TasNet: Time-Domain Audio Separation Network for Real-Time Single-channel Speech Separation

    WHY? Separating multiple audio sources is a difficult task. Previous works mostly estimated a mask for each source in the time-frequency domain. WHAT? This paper formulates source separation as estimating the mixture weight vectors of multiple sources in the waveform. Time-domain Audio Separation Network (TasNet) tries to find the relative contribution to...


  • Recurrent Relational Networks

    WHY? Some tasks, such as Sudoku, require serial steps of relational inference. WHAT? A recurrent relational network operates on a graph representation of objects. Message passing is used to send relational information to neighboring nodes to solve the task. The loss is minimized at every step. So? This module...


  • Modularity Matters: Learning Invariant Relational Reasoning Tasks

    WHY? Previous CNN models activate fully (fully distributed features) for a single input and perform poorly on invariant relational reasoning. WHAT? The reason previous CNN models are poor at invariant reasoning is the interference problem: when a model tries to learn many patterns, learning each pattern interferes with the others. To...


  • Learning Visual Question Answering by Bootstrapping Hard Attention

    WHY? Hard attention is less explored than soft attention. WHAT? This paper shows that hard attention can be as competitive and efficient as soft attention by bootstrapping hard attention. In contrast to soft attention, hard attention discretely chooses the point to attend to. The key idea is to use the L2-norm of...