  • MUTAN: Multimodal Tucker Fusion for Visual Question Answering

    WHY? While bilinear models are an effective way to capture the relationship between two spaces, the number of parameters is often intractable. This paper proposes reducing the number of parameters by controlling the rank of the parameter tensor with a Tucker decomposition. WHAT? The Tucker...
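
To make the parameter saving concrete, here is a minimal NumPy sketch of a Tucker-factorized bilinear fusion. It is an illustration under stated assumptions, not the paper's implementation: the names (`W_q`, `W_v`, `W_o`, `core`) and all dimensions are hypothetical, and nonlinearities, dropout, and any further rank constraints are omitted.

```python
import numpy as np

def full_bilinear_params(d_q, d_v, d_o):
    # A full bilinear model stores one weight per (question, image, output) triple.
    return d_q * d_v * d_o

def tucker_params(d_q, d_v, d_o, t_q, t_v, t_o):
    # Tucker form: three factor matrices plus a small core tensor.
    return d_q * t_q + d_v * t_v + t_q * t_v * t_o + t_o * d_o

def tucker_fusion(q, v, W_q, W_v, core, W_o):
    # Project each input into its low-dimensional space (hypothetical sizes).
    q_t = q @ W_q                                  # shape (t_q,)
    v_t = v @ W_v                                  # shape (t_v,)
    # Bilinear interaction through the core tensor.
    z = np.einsum("i,j,ijk->k", q_t, v_t, core)    # shape (t_o,)
    # Project to the output space.
    return z @ W_o                                 # shape (d_o,)

# Toy sizes chosen only for illustration.
d_q, d_v, d_o = 8, 10, 6
t_q, t_v, t_o = 3, 4, 2
rng = np.random.default_rng(0)
W_q = rng.normal(size=(d_q, t_q))
W_v = rng.normal(size=(d_v, t_v))
W_o = rng.normal(size=(t_o, d_o))
core = rng.normal(size=(t_q, t_v, t_o))
q = rng.normal(size=d_q)
v = rng.normal(size=d_v)

y = tucker_fusion(q, v, W_q, W_v, core, W_o)
print(full_bilinear_params(d_q, d_v, d_o), tucker_params(d_q, d_v, d_o, t_q, t_v, t_o))
```

The factorized form never builds the full `d_q x d_v x d_o` tensor, yet it computes the same output as a full bilinear model whose tensor is the product of the core with the three factor matrices.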


  • Chain of Reasoning for Visual Question Answering

    WHY? Previous methods for visual question answering performed one-step or static reasoning, while some questions require a chain of reasoning. WHAT? The chain of reasoning (CoR) model alternately updates the objects and their relations to solve questions that require a chain of reasoning. CoR consists of three parts: data embedding, chain of reasoning, and...


  • Deformable Convolutional Networks

    WHY? The spatial sampling of a convolutional neural network is geometrically fixed. This paper proposes two modules that let a CNN capture geometric structure more flexibly. WHAT? Deformable convolution modifies the regular sampling grid of a convolution by augmenting it with offsets. The offsets are generated by a conv layer with 2N channels (an x and y offset for each of the N kernel sampling locations)....
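
As a rough illustration of sampling on an offset grid, the sketch below evaluates a single output value of a 3x3 deformable convolution on a 2D array. It is a simplified sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, the offsets are passed in directly as one `(dy, dx)` pair per kernel location rather than predicted by a conv layer, and fractional positions are read with bilinear interpolation.

```python
import numpy as np

def bilinear_sample(x, py, px):
    # Read x at a fractional position; out-of-range neighbours contribute zero.
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < H and 0 <= xx < W:
                w = (1 - abs(py - yy)) * (1 - abs(px - xx))
                val += w * x[yy, xx]
    return val

def deformable_conv_at(x, weight, offsets, cy, cx):
    # One output value at (cy, cx): each regular grid point is shifted by its
    # own (dy, dx) offset before sampling. offsets has shape (k*k, 2).
    k = weight.shape[0]
    r = k // 2
    out = 0.0
    n = 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            oy, ox = offsets[n]
            out += weight[dy + r, dx + r] * bilinear_sample(x, cy + dy + oy, cx + dx + ox)
            n += 1
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
weight = np.ones((3, 3))
zero_offsets = np.zeros((9, 2))
shifted = np.tile([0.0, 0.5], (9, 1))   # shift every sample half a pixel right

regular = deformable_conv_at(x, weight, zero_offsets, 2, 2)
deformed = deformable_conv_at(x, weight, shifted, 2, 2)
print(regular, deformed)
```

With all offsets set to zero the result matches an ordinary 3x3 convolution at that location, which is why the offsets can be treated as a learned correction to the fixed grid.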


  • Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

    WHY? A caption for an image can be generated with an attention-based model that aligns each word with a part of the image. WHAT? A convolutional neural network extracts features from raw images, resulting in a series of feature vectors. In order to generate a series of words as a caption, LSTM...


  • StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation

    WHY? CycleGAN has been used effectively for image-to-image translation. However, handling more than two domains was difficult. This paper proposes StarGAN to handle multiple domains with a single model. WHAT? StarGAN can be considered a domain-conditioned version of CycleGAN. The discriminator of StarGAN not only classifies real and fake,...