Deep learning travels

Rejoice in what you learn and spray it!
https://lyusungwon.github.io/
Mon, 16 Sep 2019 13:12:31 +0000

COMET: Commonsense Transformers for Automatic Knowledge Graph Construction

<h2 id="commonsense-knowledge-graph">Commonsense Knowledge Graph</h2>
<p>A knowledge graph is a graph representation of knowledge: entities are represented as nodes and relations between entities as edges. A commonsense knowledge graph stores commonsense knowledge in this form. Two common datasets for commonsense knowledge graphs are ATOMIC and ConceptNet.</p>
<h2 id="why">WHY?</h2>
<p><img src="/assets/images/comet1.png" alt="image" class="center-image" />
Commonsense knowledge graphs are inevitably far from complete, since an infinite number of facts can be considered commonsense. COMET transfers implicit commonsense knowledge from a pretrained language model into an explicit knowledge graph. More precisely, the COMET model completes the object entity given a subject entity and a relation.</p>
<h2 id="what">WHAT?</h2>
<p><img src="/assets/images/comet2.png" alt="image" class="center-image" />
COMET uses one of the most popular pretrained language models: GPT. Along with BERT, GPT, pretrained on a huge amount of data, is one of the most successful self-attention-based natural language backbones that can be fine-tuned for various downstream tasks. COMET fine-tunes GPT on the entity completion task.</p>
<p><img src="/assets/images/comet3.png" alt="image" class="center-image" />
The input structure varies slightly by dataset, but the tokens are basically arranged in the order of subject entity, relation, and object entity, with mask tokens between them. The goal is to output the object entity given the subject entity and a relation.</p>
<pre class="MathJax_Preview"><code>\mathcal{L} = - \sum_{t=|s|+|r|}^{|s|+|r|+|o|} \log P(x_t|x_{<t})</code></pre>
<script type="math/tex; mode=display">% <![CDATA[
\mathcal{L} = - \sum_{t=|s|+|r|}^{|s|+|r|+|o|} \log P(x_t|x_{<t}) %]]></script>
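<p>The objective above sums the negative log-likelihood only over the object-entity positions, conditioning on the subject and relation tokens. A minimal sketch of that masking, with toy numbers and assumed function names (not the paper's code):</p>

```python
import numpy as np

def comet_nll(log_probs, s_len, r_len, o_len):
    """Toy sketch of the COMET objective: sum -log P(x_t | x_<t) only over
    the object-entity token positions, ignoring the subject and relation
    tokens. `log_probs[t]` is the model's log-probability at position t.
    (Illustrative only; names and shapes are assumptions.)"""
    start = s_len + r_len          # first object-token position (0-indexed)
    end = start + o_len            # one past the last object token
    return -float(np.sum(log_probs[start:end]))

# Toy example: 2 subject tokens, 1 relation token, 2 object tokens.
lp = np.log(np.array([0.5, 0.5, 0.5, 0.25, 0.25]))
loss = comet_nll(lp, s_len=2, r_len=1, o_len=2)
```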
<h2 id="so">So?</h2>
<p>Using pretrained GPT turned out to be effective for the relation completion task. The outputs were not only judged natural by human evaluators, but also novel. Some interesting outputs are shown below.
<img src="/assets/images/comet4.png" alt="image" class="center-image" /></p>
<p><a href="https://arxiv.org/abs/1906.05317">Bosselut, Antoine, et al. “COMET: Commonsense Transformers for Automatic Knowledge Graph Construction.” arXiv preprint arXiv:1906.05317 (2019).</a></p>
Mon, 16 Sep 2019 20:47:59 +0000
https://lyusungwon.github.io/studies/2019/09/16/comet/
https://lyusungwon.github.io/studies/2019/09/16/comet/

deep-learning, natural-language-processing, knowledge-graph, studies

Renovation

<p>It has been a long time since my last post. I tried to keep posting deep learning papers, but I was too busy: I worked at Naver as an intern and wrote my master’s thesis for graduation at the same time. Then I took a few months away from deep learning papers to recharge. Now I feel it is time to get back to business.</p>
<p>I renovated my blog. This blog template is forked from isme2n.github.io (thank you for the amazing template!). The renovation was not just for a fresh look: I wanted to post more about my thoughts, and I adopted the new template for better separation between categories. In my previous template, separating posts into different categories was inconvenient, so it was really hard to find my personal posts among the overwhelming number of paper-summary posts.</p>
<p>Also, I decided to post only in English from now on, even though writing in Korean is much easier for me. I want to share my thoughts and knowledge with as many people as I can. Since my English is far from perfect, feedback on it is always welcome.</p>
<p>The final change is that posts will be more readable. Honestly, I used to post about deep learning papers for my own future reference and didn’t care about readers. However, I realized that my poor posts can waste people’s valuable time and cause frustration. I decided to post less, but I’ll try harder to write kind and readable posts.</p>
<p>My blog is turning into a kind of homepage. I feel more comfortable with the good old homepage style than with social media, since a homepage is more personal. I’ll use my blog not only to review and share what I learn, but also to organize my thoughts and to practice my English writing.</p>
Mon, 16 Sep 2019 18:24:59 +0000
https://lyusungwon.github.io/writings/2019/09/16/renovation/
https://lyusungwon.github.io/writings/2019/09/16/renovation/

thoughts, blog, writings

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

<h2 id="why">WHY?</h2>
<p>Previous VQA datasets had several problems: strong language priors, non-compositional language, and variability in language were key obstacles preventing models from learning proper concepts and logic. The synthetically generated CLEVR dataset solved these problems to some extent, but lacked realism by remaining in a relatively simple domain.</p>
<h2 id="what">WHAT?</h2>
<p><img src="/assets/images/gqad1.png" alt="image" class="center-image" width="50px" /></p>
<p>The GQA dataset enhances VQA by adopting the methodology of CLEVR. It starts from the Visual Genome scene graph dataset, augmenting the scene graphs by cleaning up their language, pruning edges, and utilizing an object detector. A question engine then builds questions from the scene graphs with 274 structural patterns, object references, and decoys by traversing the graphs. Each question pattern is associated with a functional representation, and with these semantic programs the questions are balanced at two granularity levels. This process yields 113,018 images with 22,669,678 questions in total, with a vocabulary size of 3,097 for questions and 1,878 for answers. The dataset also provides additional metrics beyond accuracy, such as consistency, validity, plausibility, and distribution.</p>
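<p>The question-engine idea can be sketched on a toy scene graph: traverse relation edges and instantiate a structural pattern. The graph schema and template below are made up for illustration; GQA's real engine uses 274 patterns with functional programs over Visual Genome scene graphs.</p>

```python
# Toy scene graph: objects with attributes, plus (subject, relation, object) edges.
scene_graph = {
    "objects": {"o1": {"name": "cup", "attrs": ["red"]},
                "o2": {"name": "table", "attrs": ["wooden"]}},
    "relations": [("o1", "on", "o2")],
}

def generate_questions(graph):
    """Yield (question, answer) pairs by walking the relation edges and
    filling a single structural pattern. Hypothetical template, not GQA's."""
    template = "What is the {attr} {subj} {rel}?"
    for subj_id, rel, obj_id in graph["relations"]:
        subj = graph["objects"][subj_id]
        obj = graph["objects"][obj_id]
        q = template.format(attr=subj["attrs"][0], subj=subj["name"], rel=rel)
        yield q, obj["name"]

pairs = list(generate_questions(scene_graph))
# pairs[0] == ("What is the red cup on?", "table")
```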
<h2 id="so">So?</h2>
<p><img src="/assets/images/gqad2.png" alt="image" class="center-image" width="50px" /></p>
<p>This dataset provides a reasonable benchmark for visual reasoning models that also reflects reality.</p>
<p><a href="https://arxiv.org/abs/1902.09506">Hudson, Drew A., and Christopher D. Manning. “GQA: a new dataset for compositional question answering over real-world images.” arXiv preprint arXiv:1902.09506 (2019).</a></p>
Mon, 20 May 2019 10:39:59 +0000
https://lyusungwon.github.io/studies/2019/05/20/gqad/
https://lyusungwon.github.io/studies/2019/05/20/gqad/

deep-learning, visual-question-answering, studies

Generative Question Answering: Learning to Answer the Whole Question

<h2 id="why">WHY?</h2>
<p>Discriminative question answering models often overfit to datasets by latching onto any kind of clue that leads to the answer.</p>
<h2 id="what">WHAT?</h2>
<p>This paper suggests a Generative Question Answering model (GQA) that generates both questions and answers given contexts. Though the paper applies the model to both QA and VQA tasks, the following explanation focuses only on VQA. The model uses Bayes’ rule to minimize the NLL of the joint distribution of questions and answers given the context (images).</p>
<pre class="MathJax_Preview"><code>\mathcal{L} = -\log p(q,a|c) = -\log p(a|c) - \sum_t \log p(q_t|a, c, q_{0 ... t-1})</code></pre>
<script type="math/tex; mode=display">\mathcal{L} = -\log p(q,a|c) = -\log p(a|c) - \sum_t \log p(q_t|a, c, q_{0 ... t-1})</script>
<p>c is encoded by a pretrained convolutional model (ResNet-101) with 32-dimensional positional encodings. The answer prior (<code class="MathJax_Preview">p(a\|c)</code><script type="math/tex">p(a\|c)</script>) is simply computed with a fully connected layer on the image features. The likelihood of the question (<code class="MathJax_Preview">p(q\|a, c)</code><script type="math/tex">p(q\|a, c)</script>) is modeled with a conditional language model: the question decoder consists of several blocks of residual self-attentive LSTM layers with a GLU on top.</p>
<p><img src="/assets/images/gqa1.png" alt="image" class="center-image" width="50px" /></p>
<p>At inference time, the answer maximizing the posterior (<code class="MathJax_Preview">p(q,a\|c)</code><script type="math/tex">p(q,a\|c)</script>) is returned, which requires computing the question likelihood under every candidate answer.</p>
<pre class="MathJax_Preview"><code>a* = argmax_a p(q|a,c)p(a|c)</code></pre>
<script type="math/tex; mode=display">a* = argmax_a p(q|a,c)p(a|c)</script>
<h2 id="so">So?</h2>
<p><img src="/assets/images/gqa2.png" alt="image" class="center-image" width="50px" /></p>
<p>GQA showed performance comparable to other VQA models while providing interpretable attention maps and word probabilities at the last layer.</p>
<p><img src="/assets/images/gqa3.png" alt="image" class="center-image" width="50px" /></p>
<p><a href="https://openreview.net/forum?id=Bkx0RjA9tX">Lewis, Mike, and Angela Fan. “Generative Question Answering: Learning to Answer the Whole Question.” (2018).</a></p>
Tue, 02 Apr 2019 09:47:59 +0000
https://lyusungwon.github.io/studies/2019/04/02/gqa/
https://lyusungwon.github.io/studies/2019/04/02/gqa/

deep-learning, visual-question-answering, studies

Visual Question Generation as Dual Task of Visual Question Answering

<h2 id="why">WHY?</h2>
<p>Visual question answering and visual question generation are complementary tasks. Learning one task may benefit the other.</p>
<h2 id="what">WHAT?</h2>
<p><img src="/assets/images/iqan1.png" alt="image" class="center-image" width="50px" /></p>
<p>This paper suggests the Invertible Question Answering Network (iQAN), which is trained on the two tasks sharing a pipeline in reverse order.</p>
<p><img src="/assets/images/iqan2.png" alt="image" class="center-image" width="50px" /></p>
<p>The VQA side of iQAN uses the <a href="https://lyusungwon.github.io/visual-question-answering/2019/01/22/mutan.html">MUTAN fusion module</a>, and another MUTAN-based attention module is used for VQG.</p>
<p><img src="/assets/images/iqan3.png" alt="image" class="center-image" width="50px" /></p>
<p>In order to benefit from the other task, weights are shared between the two processes. Dual MUTAN shares the core tensor and projection matrices of the two MUTAN modules. In addition, the input answer projection matrix of the VQG model and the linear classifier matrix of the VQA model are shared. While the two RNNs for VQG and VQA do not share parameters, the word embedding matrices are shared.</p>
<p>Since the two processes are cyclic, a duality regularizer is used to promote consistency. The two task losses (a multinomial classification loss for VQA and a sequence generation loss for VQG) and the two regularizers are jointly trained in a dual training manner.</p>
<pre class="MathJax_Preview"><code>Loss = L_{(VQA)}(a, a*) + L_{(VQG)}(q, q*) + smooth_{L1}(\mathbf{q} - \hat{\mathbf{q}}) + smooth_{L1}(\mathbf{a} - \hat{\mathbf{a}})</code></pre>
<script type="math/tex; mode=display">Loss = L_{(VQA)}(a, a*) + L_{(VQG)}(q, q*) + smooth_{L1}(\mathbf{q} - \hat{\mathbf{q}}) + smooth_{L1}(\mathbf{a} - \hat{\mathbf{a}})</script>
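<p>The joint objective can be sketched in a few lines: the two task losses plus smooth-L1 duality regularizers that pull the reconstructed question and answer representations toward the originals. Shapes and numbers below are illustrative assumptions, not the paper's code.</p>

```python
import numpy as np

def smooth_l1(x):
    """Elementwise smooth-L1 (Huber with delta = 1), summed over the vector."""
    ax = np.abs(x)
    return float(np.sum(np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)))

def iqan_loss(l_vqa, l_vqg, q_feat, q_hat, a_feat, a_hat):
    """Toy sketch of the joint iQAN objective: task losses plus the two
    smooth-L1 duality regularizers from the equation above."""
    return l_vqa + l_vqg + smooth_l1(q_feat - q_hat) + smooth_l1(a_feat - a_hat)

loss = iqan_loss(0.7, 1.2,
                 np.array([1.0, 0.0]), np.array([1.0, 0.5]),   # question reps
                 np.array([0.0]), np.array([2.0]))             # answer reps
```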
<p>Yes/no or number questions are filtered out.</p>
<h2 id="so">So?</h2>
<p>iQAN showed comparable results on the VQA2 and CLEVR datasets.</p>
<p><a href="http://openaccess.thecvf.com/content_cvpr_2018/html/Li_Visual_Question_Generation_CVPR_2018_paper.html">Li, Yikang, et al. “Visual question generation as dual task of visual question answering.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.</a></p>
Mon, 01 Apr 2019 09:27:59 +0000
https://lyusungwon.github.io/studies/2019/04/01/iqan/
https://lyusungwon.github.io/studies/2019/04/01/iqan/

deep-learning, visual-question-answering, studies

Task-Oriented Query Reformulation with Reinforcement Learning

<h2 id="why">WHY?</h2>
<p>Information retrieval from a search engine becomes difficult when the query is incomplete or too complex. This paper suggests a query reformulation system that rewrites the query to maximize the probability that relevant documents are returned.</p>
<h2 id="what">WHAT?</h2>
<p><img src="/assets/images/qrr1.png" alt="image" class="center-image" width="50px" /></p>
<p>Since the output of the query reformulator is discrete, the REINFORCE algorithm is used to train the model.</p>
<p><img src="/assets/images/qrr2.png" alt="image" class="center-image" width="50px" /></p>
<p>The query and candidate term vectors are converted to fixed-size vectors using a CNN with max pooling or an RNN. An LSTM is used to generate the reformulated query sequentially.</p>
<pre class="MathJax_Preview"><code>\phi_a(v)\\
\phi_b(e_i)\\
e_i \in q_0 \cup D_0\\
P(t_i|q_0) = \sigma(U^{\top}tanh(W(\phi_a(v)\|\phi_b(e_i))+b))\\
P(t_i^k|q_0) \propto exp(\phi_b(e_i)^{\top}h_k)\\
h_k = tanh(W_a\phi_a(v)+W_b\phi_b(t^{k-1})+W_h h_{k-1})</code></pre>
<script type="math/tex; mode=display">\phi_a(v)\\
\phi_b(e_i)\\
e_i \in q_0 \cup D_0\\
P(t_i|q_0) = \sigma(U^{\top}tanh(W(\phi_a(v)\|\phi_b(e_i))+b))\\
P(t_i^k|q_0) \propto exp(\phi_b(e_i)^{\top}h_k)\\
h_k = tanh(W_a\phi_a(v)+W_b\phi_b(t^{k-1})+W_h h_{k-1})</script>
<p>To train the model, the REINFORCE algorithm is used, and an entropy regularization loss is added to encourage diversity.</p>
<pre class="MathJax_Preview"><code>C_a = (R - \bar{R})\sum_{t\in T} - log P(t|q_0)\\
\bar{R} = \sigma(S^{\top}tanh(V(\phi_a(v)\|\bar{e})+b))\\
\bar{e} = \frac{1}{N}\sum_{i=1}^N \phi_b(e_i)\\
N = |q_0 \cup D_0|\\
C_b = \alpha\|R - \bar{R}\|^2\\
C_H = -\lambda \sum_{t\in q_0\cup D_0} P(t|q_0) log P(t|q_0)</code></pre>
<script type="math/tex; mode=display">C_a = (R - \bar{R})\sum_{t\in T} - log P(t|q_0)\\
\bar{R} = \sigma(S^{\top}tanh(V(\phi_a(v)\|\bar{e})+b))\\
\bar{e} = \frac{1}{N}\sum_{i=1}^N \phi_b(e_i)\\
N = |q_0 \cup D_0|\\
C_b = \alpha\|R - \bar{R}\|^2\\
C_H = -\lambda \sum_{t\in q_0\cup D_0} P(t|q_0) log P(t|q_0)</script>
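<p>The three cost terms above can be sketched numerically. The array shapes and toy numbers are assumptions for illustration, not the paper's implementation.</p>

```python
import numpy as np

def reinforce_losses(reward, baseline, chosen_probs, term_probs,
                     alpha=0.1, lam=0.01):
    """Toy sketch of the three training terms: the REINFORCE loss C_a with
    a learned baseline R-bar, the baseline regression loss C_b, and the
    entropy term C_H, following the equations above with toy inputs."""
    c_a = (reward - baseline) * float(np.sum(-np.log(chosen_probs)))
    c_b = alpha * (reward - baseline) ** 2
    c_h = -lam * float(np.sum(term_probs * np.log(term_probs)))
    return c_a, c_b, c_h

c_a, c_b, c_h = reinforce_losses(
    reward=1.0, baseline=0.4,
    chosen_probs=np.array([0.5, 0.5]),            # P(t|q_0) of selected terms
    term_probs=np.array([0.25, 0.25, 0.25, 0.25]))  # P(t|q_0) over all terms
```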
<h2 id="so">So?</h2>
<p>Query reformulation showed better performance than raw retrieval, PRF models, and SL methods on the TREC-CAR, Jeopardy, and MSA datasets, measured in recall, precision, and mean average precision.</p>
<p><a href="https://arxiv.org/abs/1704.04572">Nogueira, Rodrigo, and Kyunghyun Cho. “Task-oriented query reformulation with reinforcement learning.” arXiv preprint arXiv:1704.04572 (2017).</a></p>
Thu, 28 Feb 2019 09:38:59 +0000
https://lyusungwon.github.io/studies/2019/02/28/qrr/
https://lyusungwon.github.io/studies/2019/02/28/qrr/

deep-learning, natural-language-processing, studies

Learning to Reason: End-to-End Module Networks for Visual Question Answering

<h2 id="why">WHY?</h2>
<p>The former <a href="https://lyusungwon.github.io/computer-vision/2018/12/15/nmn.html">neural module network for VQA</a> depends on a naive semantic parser to unroll the layout of the network. This paper suggests End-to-End Module Networks (N2NMN) to learn the layout directly from data.</p>
<h2 id="what">WHAT?</h2>
<p><img src="/assets/images/n2nmn1.png" alt="image" class="center-image" width="50px" /></p>
<p>A layout policy selects among the predefined modules.</p>
<p><img src="/assets/images/n2nmn2.png" alt="image" class="center-image" width="50px" /></p>
<p>Instead of using a semantic parser to derive the module layout from the question, N2NMN uses an encoder-decoder structure to learn it. An RNN is used as the question encoder and another RNN (with attention) as the layout policy. Beam search is used to find the maximum-probability layout.</p>
<pre class="MathJax_Preview"><code>u_{ti} = v^{\top}tanh(W_1 h_i + W_2 h_t)\\
\alpha_{ti} = \frac{exp(u_{ti})}{\sum_{j=1}^T exp(u_{tj})}\\
c_t = \sum_{i=1}^T \alpha_{ti} h_i\\
p(m^{(t)}|m^{(1)},...,m^{(t-1)}, q) = softmax(W_3 h_t + W_4 c_t)</code></pre>
<script type="math/tex; mode=display">u_{ti} = v^{\top}tanh(W_1 h_i + W_2 h_t)\\
\alpha_{ti} = \frac{exp(u_{ti})}{\sum_{j=1}^T exp(u_{tj})}\\
c_t = \sum_{i=1}^T \alpha_{ti} h_i\\
p(m^{(t)}|m^{(1)},...,m^{(t-1)}, q) = softmax(W_3 h_t + W_4 c_t)</script>
<p><img src="/assets/images/n2nmn3.png" alt="image" class="center-image" width="50px" /></p>
<p>The sequence of modules from the decoder is mapped into a syntax tree using Reverse Polish Notation (post-order traversal). Since the layout policy is not fully differentiable, the REINFORCE algorithm is used to approximate the gradient, with entropy regularization (0.005) to encourage exploration.</p>
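<p>The Reverse Polish Notation step can be sketched directly: treat the predicted module sequence as postfix and reduce it with a stack. The module names and arities below are illustrative, not the paper's exact module inventory.</p>

```python
def postfix_to_tree(modules, arity):
    """Toy sketch of assembling a module sequence in Reverse Polish Notation
    (post-order) into a syntax tree. `arity` maps each module name to its
    number of sub-module inputs."""
    stack = []
    for m in modules:
        # Pop this module's children (reversed so argument order is preserved).
        args = [stack.pop() for _ in range(arity[m])][::-1]
        stack.append((m, args))
    assert len(stack) == 1, "sequence must reduce to a single tree"
    return stack[0]

# Hypothetical module inventory with arities.
arity = {"find": 0, "transform": 1, "and": 2, "answer": 1}
tree = postfix_to_tree(["find", "find", "and", "answer"], arity)
# ("answer", [("and", [("find", []), ("find", [])])])
```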
<pre class="MathJax_Preview"><code>L(\theta) = E_{l\sim p(l|q;\theta)}[\tilde{L}(\theta, l; q, I)]\\
\nabla L \approx \frac{1}{M}\sum_{m=1}^M ([\tilde{L}(\theta, l_m) - b]\nabla_{\theta}log p(l_m|q;\theta) + \nabla_{\theta}\tilde{L}(\theta, l_m))</code></pre>
<script type="math/tex; mode=display">L(\theta) = E_{l\sim p(l|q;\theta)}[\tilde{L}(\theta, l; q, I)]\\
\nabla L \approx \frac{1}{M}\sum_{m=1}^M ([\tilde{L}(\theta, l_m) - b]\nabla_{\theta}log p(l_m|q;\theta) + \nabla_{\theta}\tilde{L}(\theta, l_m))</script>
<p>Since learning the layout from scratch is challenging, additional knowledge (an expert policy) can be provided for initialization. The KL divergence between the layout policy and the expert policy is added to the loss function.</p>
<h2 id="so">So?</h2>
<p>N2NMN showed better results than NMN on SHAPES, CLEVR, and VQA. The expert policies for SHAPES and CLEVR are provided with the datasets, and the Stanford parser is used for VQA.</p>
<p><img src="/assets/images/n2nmn4.png" alt="image" class="center-image" width="50px" /></p>
<p>The layout policy is shown to learn appropriate layouts. Learning from the expert policy turns out to be even better than simply cloning it.</p>
<p><img src="/assets/images/n2nmn5.png" alt="image" class="center-image" width="50px" /></p>
<p><a href="http://openaccess.thecvf.com/content_ICCV_2017/papers/Hu_Learning_to_Reason_ICCV_2017_paper.pdf">Hu, Ronghang, et al. “Learning to reason: End-to-end module networks for visual question answering.” CoRR, abs/1704.05526 3 (2017).</a></p>
Tue, 12 Feb 2019 08:51:59 +0000
https://lyusungwon.github.io/studies/2019/02/12/n2nmn/
https://lyusungwon.github.io/studies/2019/02/12/n2nmn/

deep-learning, visual-question-answering, studies

Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge

<h2 id="why">WHY?</h2>
<p>This paper describes several tips and tricks for the VQA challenge with the model that won first place in the 2017 VQA challenge, and conducts a comprehensive ablation experiment for each trick.</p>
<h2 id="what">WHAT?</h2>
<p><img src="/assets/images/tt1.png" alt="image" class="center-image" width="50px" /></p>
<p>The architecture of the model is rather simple. For the question embedding, a GRU (512) initialized with pretrained GloVe word vectors (300) is used, and the question is trimmed to a maximum length of 14. For image features, the paper tries two methods: the output of a pretrained ResNet resized to 7 x 7 with average pooling, and bottom-up attention with Faster R-CNN (K=60). Only one glimpse of image attention is used instead of many. Multi-modal fusion is implemented with a Hadamard product.</p>
<p>The paper elaborates six major tricks. First, instead of single-label classification, it allows multiple labels via sigmoid outputs, so the objective becomes a binary cross-entropy loss. Second, instead of a hard score of 1, a soft accuracy score (s = min(m/3, 1)) is used, exploiting the information from multiple human annotators. Third, a gated tanh activation is used for the non-linear layers instead of plain ReLU. The paper also shows that image features from bottom-up attention beat a grid-like feature map, that initializing the classifier with pretrained word vectors is useful, and that large mini-batches and smart shuffling further improve performance. The ablation results are shown below.</p>
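<p>The soft-score trick and the binary cross-entropy objective can be sketched together; the predictions below are toy values, and the loss function is a plain BCE sketch rather than the paper's training code.</p>

```python
import numpy as np

def soft_score(votes):
    """Soft VQA target from human annotators: s = min(votes / 3, 1)."""
    return min(votes / 3.0, 1.0)

def bce(pred, target):
    """Binary cross-entropy against the soft (non-one-hot) targets."""
    pred = np.clip(pred, 1e-7, 1 - 1e-7)  # numerical safety
    return float(-np.mean(target * np.log(pred)
                          + (1 - target) * np.log(1 - pred)))

# Targets for answers that received 0, 1, 2, 3, and 5 matching votes.
targets = np.array([soft_score(v) for v in (0, 1, 2, 3, 5)])
loss = bce(np.array([0.1, 0.3, 0.6, 0.9, 0.95]), targets)
```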
<p><img src="/assets/images/tt2.png" alt="image" class="center-image" width="50px" /></p>
<h2 id="so">So?</h2>
<p><img src="/assets/images/tt3.png" alt="image" class="center-image" width="50px" /></p>
<p>This model achieved the best result in the 2017 VQA challenge.</p>
<p><a href="http://openaccess.thecvf.com/content_cvpr_2018/html/Teney_Tips_and_Tricks_CVPR_2018_paper.html">Teney, Damien, et al. “Tips and tricks for visual question answering: Learnings from the 2017 challenge.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.</a></p>
Mon, 11 Feb 2019 09:03:59 +0000
https://lyusungwon.github.io/studies/2019/02/11/tt/
https://lyusungwon.github.io/studies/2019/02/11/tt/

deep-learning, visual-question-answering, studies

Large-Scale Answerer in Questioner's Mind for Visual Dialog Question Generation

<h2 id="why">WHY?</h2>
<p><a href="">AQM</a> solves visual dialogue tasks with an information-theoretic approach. However, the information gain of each candidate question needs to be computed explicitly, which limits scalability. This paper suggests AQM+ to handle large-scale problems.</p>
<h2 id="what">WHAT?</h2>
<p><img src="/assets/images/aqmp1.png" alt="image" class="center-image" width="50px" /></p>
<p>AQM+ is a sampling-based approximation to the AQM algorithm and differs from AQM in three ways. First, instead of a Q-sampler, candidate questions are sampled using beam search. Second, while the answerer model of AQM was a binary classifier, that of AQM+ is an RNN generator. Third, only a portion of the answer candidates and labels (top-k) is used to approximate the information gain.</p>
<pre class="MathJax_Preview"><code>\tilde{I}_{topk}[C, A_t; q_t, h_{t-1}]
= \sum_{a_t\in \mathbf{A}_{t, topk}(q_t)}\sum_{c\in \mathbf{C}_{t, topk}} \hat{p}_{reg}(c|h_{t-1})\tilde{p}_{reg}(a_t|c, q_t, h_{t-1}) \ln \frac{\tilde{p}_{reg}(a_t|c, q_t, h_{t-1})}{\tilde{p}_{reg}'(a_t|q_t, h_{t-1})}\\
\hat{p}_{reg}(c|h_{t-1}) = \frac{\hat{p}(c|h_{t-1})}{\sum_{c\in \mathbf{C}_{t, topk}}p(c|h_{t-1})}\\
\tilde{p}_{reg}(a_t|c, q_t, h_{t-1}) = \frac{\tilde{p}(a_t|c, q_t, h_{t-1})}{\sum_{a_t\in \mathbf{A}_{t, topk}(q_t)}\tilde{p}(a_t|c, q_t, h_{t-1})}\\
\tilde{p}_{reg}'(a_t|q_t, h_{t-1}) = \sum_{c\in \mathbf{C}_{t, topk}}\hat{p}_{reg}(c|h_{t-1})\cdot\tilde{p}_{reg}(a_t|c, q_t, h_{t-1})</code></pre>
<script type="math/tex; mode=display">\tilde{I}_{topk}[C, A_t; q_t, h_{t-1}]
= \sum_{a_t\in \mathbf{A}_{t, topk}(q_t)}\sum_{c\in \mathbf{C}_{t, topk}} \hat{p}_{reg}(c|h_{t-1})\tilde{p}_{reg}(a_t|c, q_t, h_{t-1}) \ln \frac{\tilde{p}_{reg}(a_t|c, q_t, h_{t-1})}{\tilde{p}_{reg}'(a_t|q_t, h_{t-1})}\\
\hat{p}_{reg}(c|h_{t-1}) = \frac{\hat{p}(c|h_{t-1})}{\sum_{c\in \mathbf{C}_{t, topk}}p(c|h_{t-1})}\\
\tilde{p}_{reg}(a_t|c, q_t, h_{t-1}) = \frac{\tilde{p}(a_t|c, q_t, h_{t-1})}{\sum_{a_t\in \mathbf{A}_{t, topk}(q_t)}\tilde{p}(a_t|c, q_t, h_{t-1})}\\
\tilde{p}_{reg}'(a_t|q_t, h_{t-1}) = \sum_{c\in \mathbf{C}_{t, topk}}\hat{p}_{reg}(c|h_{t-1})\cdot\tilde{p}_{reg}(a_t|c, q_t, h_{t-1})</script>
<p><code class="MathJax_Preview">\mathbf{C}_{t, topk}</code><script type="math/tex">\mathbf{C}_{t, topk}</script> refers to top-K posterior test images from Qpost <code class="MathJax_Preview">\hat{p}_{reg}(c\|h_{t-1})</code><script type="math/tex">\hat{p}_{reg}(c\|h_{t-1})</script>. <code class="MathJax_Preview">\mathbf{Q}_{t, topk}</code><script type="math/tex">\mathbf{Q}_{t, topk}</script> refers to top-K likelihood questions using beam search from Qgen <code class="MathJax_Preview">p(q_t\|h_{t-1})</code><script type="math/tex">p(q_t\|h_{t-1})</script>. <code class="MathJax_Preview">\mathbf{A}_{t, topk}(q_t)</code><script type="math/tex">\mathbf{A}_{t, topk}(q_t)</script> refers to top-1 generated answers for each question and each class from aprxAgen <code class="MathJax_Preview">\tilde{p}(a_t\|c, q_t, h_{t-1})</code><script type="math/tex">\tilde{p}(a_t\|c, q_t, h_{t-1})</script>. Similar to AQM, approximation of the answer generator can be either indA or depA.</p>
<h2 id="so">So?</h2>
<p><img src="/assets/images/aqmp2.png" alt="image" class="center-image" width="50px" /></p>
<p>Instead of GuessWhat, AQM+ is applied to GuessWhich, a more complicated version of GuessWhat. The key differences of GuessWhich are that the questioner has to guess one image out of 9,628 by asking questions, and that the answerer's answers are not limited to binary. Since there are many more labels in this task, analytic computation of the information gain is almost intractable for AQM. AQM+ performed better than SL-Q and RL-QA in various settings.</p>
<p><a href="https://openreview.net/forum?id=rkgT3jRct7">Lee, Sang-Woo, et al. “Large-Scale Answerer in Questioner’s Mind for Visual Dialog Question Generation.” (2018).</a></p>
Fri, 08 Feb 2019 10:01:59 +0000
https://lyusungwon.github.io/studies/2019/02/08/aqmp/
https://lyusungwon.github.io/studies/2019/02/08/aqmp/

deep-learning, visual-question-answering, studies

Answerer in Questioner's Mind: Information Theoretic Approach to Goal-Oriented Visual Dialog

<h2 id="why">WHY?</h2>
<p>Goal-oriented dialogue tasks require two agents (a questioner and an answerer) to communicate to solve the task. Previous supervised and reinforcement learning approaches struggled to produce appropriate questions due to the complexity of forming a sentence. This paper suggests an information-theoretic approach to the task.</p>
<h2 id="what">WHAT?</h2>
<p><img src="/assets/images/aqm1.png" alt="image" class="center-image" width="50px" /></p>
<p>The answerer model for VQA is a simple neural network, the same as in previous methods. In former SL and RL methods, the questioner had two RNN-based models, one for generating a question and one for guessing the answer image. In contrast, the questioner of AQM uses mathematical calculation instead of RNN models.</p>
<p><img src="/assets/images/aqm2.png" alt="image" class="center-image" width="50px" /></p>
<p>The questioner of AQM first generates question candidates to choose from (the Q-sampler). Next, it calculates the information gain of each candidate and picks the question with the greatest information gain.</p>
<pre class="MathJax_Preview"><code>I[C, A_t; q_t, a_{1:t-1}, q_{1:t-1}]\\
= H[C; a_{1:t-1}, q_{1:t-1}] - H[C|A_t; q_t, a_{1:t-1}, q_{1:t-1}]\\
= \sum_{a_t}\sum_c p(c|a_{1:t-1}, q_{1:t-1})p(a_t|c, q_t, a_{1:t-1}, q_{1:t-1}) \ln \frac{p(a_t|c, q_t, a_{1:t-1}, q_{1:t-1})}{p(a_t|q_t, a_{1:t-1}, q_{1:t-1})}\\
p(c|a_{1:t}, q_{1:t}) \propto p(c)\prod_{j=1}^t p(a_j|c, q_j, a_{1:j-1}, q_{1:j-1})</code></pre>
<script type="math/tex; mode=display">I[C, A_t; q_t, a_{1:t-1}, q_{1:t-1}]\\
= H[C; a_{1:t-1}, q_{1:t-1}] - H[C|A_t; q_t, a_{1:t-1}, q_{1:t-1}]\\
= \sum_{a_t}\sum_c p(c|a_{1:t-1}, q_{1:t-1})p(a_t|c, q_t, a_{1:t-1}, q_{1:t-1}) \ln \frac{p(a_t|c, q_t, a_{1:t-1}, q_{1:t-1})}{p(a_t|q_t, a_{1:t-1}, q_{1:t-1})}\\
p(c|a_{1:t}, q_{1:t}) \propto p(c)\prod_{j=1}^t p(a_j|c, q_j, a_{1:j-1}, q_{1:j-1})</script>
<p>Since computing the posterior requires the answerer’s answer distribution, the questioner approximates it.</p>
<pre class="MathJax_Preview"><code>\hat{p}(a_t|c, q_t, a_{1:t-1}, q_{1:t-1}) \propto \tilde{p}'(c)\prod_{j=1}^t \tilde{p}(a_j|c, q_j, a_{1:j-1}, q_{1:j-1})</code></pre>
<script type="math/tex; mode=display">\hat{p}(a_t|c, q_t, a_{1:t-1}, q_{1:t-1}) \propto \tilde{p}'(c)\prod_{j=1}^t \tilde{p}(a_j|c, q_j, a_{1:j-1}, q_{1:j-1})</script>
<p>The questioner selects the question that maximizes the information gain based on this approximate answer distribution and the resulting posterior.</p>
<pre class="MathJax_Preview"><code>q_t^* = argmax_{q_t \in Q} \tilde{I}[C, A_t; q_t, a_{1:t-1}, q_{1:t-1}]\\
= argmax_{q_t \in Q} \sum_{a_t}\sum_c \hat{p}(c|a_{1:t-1}, q_{1:t-1})\tilde{p}(a_t|c, q_t, a_{1:t-1}, q_{1:t-1}) \ln \frac{\tilde{p}(a_t|c, q_t, a_{1:t-1}, q_{1:t-1})}{\tilde{p}'(a_t|q_t, a_{1:t-1}, q_{1:t-1})}\\
\tilde{p}'(a_t|q_t, a_{1:t-1}, q_{1:t-1}) = \sum_c\hat{p}(c|a_{1:t-1}, q_{1:t-1})\cdot\tilde{p}(a_t|c, q_t, a_{1:t-1}, q_{1:t-1})</code></pre>
<script type="math/tex; mode=display">q_t^* = argmax_{q_t \in Q} \tilde{I}[C, A_t; q_t, a_{1:t-1}, q_{1:t-1}]\\
= argmax_{q_t \in Q} \sum_{a_t}\sum_c \hat{p}(c|a_{1:t-1}, q_{1:t-1})\tilde{p}(a_t|c, q_t, a_{1:t-1}, q_{1:t-1}) \ln \frac{\tilde{p}(a_t|c, q_t, a_{1:t-1}, q_{1:t-1})}{\tilde{p}'(a_t|q_t, a_{1:t-1}, q_{1:t-1})}\\
\tilde{p}'(a_t|q_t, a_{1:t-1}, q_{1:t-1}) = \sum_c\hat{p}(c|a_{1:t-1}, q_{1:t-1})\cdot\tilde{p}(a_t|c, q_t, a_{1:t-1}, q_{1:t-1})</script>
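<p>The information-gain criterion can be sketched with toy probability tables: a question whose answer distribution separates the classes has high mutual information, while one that answers the same way for every class has zero. The two candidate "questions" below are hypothetical likelihood tables, not outputs of a trained answerer model.</p>

```python
import numpy as np

def info_gain(prior, answer_lik):
    """Toy sketch of AQM's criterion: the mutual information between the
    class C and the answer A for one candidate question. `prior[c]` plays
    the role of p(c|history); `answer_lik[c][a]` of p~(a|c, q, history)."""
    marg = prior @ answer_lik          # marginal p~'(a|q, history)
    ig = 0.0
    for c, pc in enumerate(prior):
        for a, pa in enumerate(answer_lik[c]):
            if pa > 0:                 # skip zero-probability answers
                ig += pc * pa * np.log(pa / marg[a])
    return ig

prior = np.array([0.5, 0.5])
# A discriminative question: the two classes answer differently.
good_q = np.array([[1.0, 0.0], [0.0, 1.0]])
# A useless question: both classes induce the same answer distribution.
bad_q = np.array([[0.5, 0.5], [0.5, 0.5]])
```

With these tables the discriminative question yields one full bit of information (ln 2 nats) and the useless one yields none, so the questioner would select the former.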
<p>The algorithm for AQM’s questioner is as follows.</p>
<p><img src="/assets/images/aqm3.png" alt="image" class="center-image" width="50px" /></p>
<p>In practice, there are some implementation options. First, the Q-sampler that samples question candidates can be randQ, which samples questions at random, or countQ, which samples questions with the least correlation. Second, YOLO9000 is used to pick the list of candidate objects, with the prior over candidates set to 1/N. Third, the answerer model picks the answer independently of the history. The approximation of the answerer model can also be varied: it can be trained independently of the answerer model (indA) or trained on answers from the answerer model (depA). While these two share the same training dataset, indAhalf and depAhalf do not share the dataset.</p>
<h2 id="so">So?</h2>
<p><img src="/assets/images/aqm4.png" alt="image" class="center-image" width="50px" /></p>
<p>AQM outperformed previous methods in goal-oriented dialogue tasks such as MNIST Counting Dialog and GuessWhat?!. GuessWhat?! is a goal-oriented dialogue task in which the questioner must guess the object in an image by asking questions to the answerer, while only the answerer knows the answer.</p>
<p><a href="https://arxiv.org/abs/1802.03881">Lee, Sang-Woo, Yu-Jung Heo, and Byoung-Tak Zhang. “Answerer in Questioner’s Mind for Goal-Oriented Visual Dialogue.” arXiv preprint arXiv:1802.03881 (2018).</a></p>
Thu, 31 Jan 2019 09:01:59 +0000
https://lyusungwon.github.io/studies/2019/01/31/aqm/
https://lyusungwon.github.io/studies/2019/01/31/aqm/

deep-learning, visual-question-answering, studies

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

<h2 id="why">WHY?</h2>
<p>Former methods used element-wise sum, product, or concatenation to represent the relation between two vectors. A bilinear model (outer product) of the two vectors is a more expressive representation, but its dimensionality usually becomes too big. This paper suggests multimodal compact bilinear pooling (MCB) to represent compact yet expressive relations.</p>
<h2 id="what">WHAT?</h2>
<p><img src="/assets/images/mcb1.png" alt="image" class="center-image" width="50px" /></p>
<p>MCB utilizes the Count Sketch projection for compact encoding of a vector. When projecting a vector v of size n to a vector y of size d (n > d), the CS algorithm first initializes two vectors <code class="MathJax_Preview">s \in \{-1, 1\}^n</code><script type="math/tex">s \in \{-1, 1\}^n</script> and <code class="MathJax_Preview">h\in \{1,...,d\}^n</code><script type="math/tex">h\in \{1,...,d\}^n</script>. h indicates the indices to which the values of v are projected. The algorithm is as follows.</p>
<p><img src="/assets/images/mcb2.png" alt="image" class="center-image" width="50px" /></p>
<p>It has been proven that the Count Sketch of the outer product of two vectors can be expressed as the convolution of their two Count Sketches. Also, convolution in the time domain is equivalent to element-wise product in the frequency domain.</p>
<pre class="MathJax_Preview"><code>\Psi(x\otimes q, h, s) = \Psi(x, h, s) \ast \Psi(q, h, s)\\
x' \ast q' = FFT^{-1}(FFT(x')\odot FFT(q'))</code></pre>
<script type="math/tex; mode=display">\Psi(x\otimes q, h, s) = \Psi(x, h, s) \ast \Psi(q, h, s)\\
x' \ast q' = FFT^{-1}(FFT(x')\odot FFT(q'))</script>
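<p>Under these two identities, a minimal numpy sketch of the MCB fusion might look like this; the hash vectors, dimensions, and names are illustrative assumptions, not the paper's code.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def count_sketch(v, h, s, d):
    """Count Sketch: project v (size n) to size d via y[h[i]] += s[i] * v[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * v)   # unbuffered scatter-add handles hash collisions
    return y

def mcb(x, q, h1, s1, h2, s2, d):
    """Toy MCB fusion: count-sketch both inputs, then combine them with an
    element-wise product in the frequency domain, which equals the circular
    convolution of the two sketches (the identities above)."""
    xs, qs = count_sketch(x, h1, s1, d), count_sketch(q, h2, s2, d)
    return np.real(np.fft.ifft(np.fft.fft(xs) * np.fft.fft(qs)))

n, d = 8, 4   # toy sizes; real MCB uses much larger d
h1, h2 = rng.integers(0, d, n), rng.integers(0, d, n)
s1, s2 = rng.choice([-1.0, 1.0], n), rng.choice([-1.0, 1.0], n)
x, q = rng.standard_normal(n), rng.standard_normal(n)
fused = mcb(x, q, h1, s1, h2, s2, d)
```

The FFT route gives the same result as directly circularly convolving the two sketches, which is the point of the identities above.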
<p><img src="/assets/images/mcb3.png" alt="image" class="center-image" width="50px" /></p>
<p>In the VQA architecture, MCB pooling is used between the image features and the text feature to obtain attention weights, and between the attended image feature and the text feature to make the prediction.</p>
<h2 id="so">So?</h2>
<p><img src="/assets/images/mcb4.png" alt="image" class="center-image" width="50px" /></p>
<p>MCB achieved good results on VQA tasks.</p>
<p><a href="https://arxiv.org/abs/1606.01847">Fukui, Akira, et al. “Multimodal compact bilinear pooling for visual question answering and visual grounding.” arXiv preprint arXiv:1606.01847 (2016).</a></p>
Sat, 26 Jan 2019 09:01:59 +0000
https://lyusungwon.github.io/studies/2019/01/26/mcb/
https://lyusungwon.github.io/studies/2019/01/26/mcb/deep-learningvisual-question-answeringstudiesBottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering<h2 id="why">WHY?</h2>
<p>In image captioning or visual question answering, the features of an image are typically extracted from the spatial output layer of a pretrained CNN model.</p>
<h2 id="what">WHAT?</h2>
<p>This paper suggests bottom-up attention using object detection model for extracting image features.</p>
<p><img src="/assets/images/updown1.png" alt="image" class="center-image" width="50px" /></p>
<p>Faster R-CNN in conjunction with ResNet101 is used, followed by non-maximum suppression with an IoU threshold and mean-pooling. The model was pretrained on ImageNet to classify object classes and trained additionally to predict attribute classes.</p>
<p><img src="/assets/images/updown2.png" alt="image" class="center-image" width="50px" /></p>
<p>The VQA model of this paper is rather simple. It utilizes the ‘gated tanh’ layer for non-linear transformation.</p>
<pre class="MathJax_Preview"><code>f_a(x) = \tilde{y}\circ g\\
\tilde{y} = tanh(Wx + b)\\
g = \sigma(W'x + b')\\
a_i = \mathbf{w}_a^{\top} f_a([\mathbf{v}_i, \mathbf{q}])\\
\mathbf{h} = f_q(\mathbf{q})\circ f_v(\hat{\mathbf{h}})\\
p(y) = \sigma(W_o f_o(\mathbf{h}))</code></pre>
<script type="math/tex; mode=display">f_a(x) = \tilde{y}\circ g\\
\tilde{y} = tanh(Wx + b)\\
g = \sigma(W'x + b')\\
a_i = \mathbf{w}_a^{\top} f_a([\mathbf{v}_i, \mathbf{q}])\\
\mathbf{h} = f_q(\mathbf{q})\circ f_v(\hat{\mathbf{h}})\\
p(y) = \sigma(W_o f_o(\mathbf{h}))</script>
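A minimal NumPy sketch of the gated tanh layer defined above (the weight shapes and random initialization are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
din, dout = 16, 8

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_tanh(x, W, b, Wp, bp):
    """f_a(x) = tanh(Wx + b) o sigma(W'x + b'): a tanh feature
    modulated element-wise by a learned gate in (0, 1)."""
    return np.tanh(x @ W + b) * sigmoid(x @ Wp + bp)

W, Wp = rng.standard_normal((din, dout)), rng.standard_normal((din, dout))
b, bp = rng.standard_normal(dout), rng.standard_normal(dout)
y = gated_tanh(rng.standard_normal(din), W, b, Wp, bp)
```

Because |tanh| < 1 and the sigmoid gate lies in (0, 1), every output coordinate stays strictly inside (-1, 1).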
<h2 id="so">So?</h2>
<p><img src="/assets/images/updown3.png" alt="image" class="center-image" width="50px" /></p>
<p>Bottom-up attention is shown to be more effective than former methods.</p>
<p><img src="/assets/images/updown4.png" alt="image" class="center-image" width="50px" /></p>
<p>The Up-Down model showed competitive results compared to other models on the leaderboard of the VQA 2.0 challenge (ensemble).</p>
<p><a href="http://openaccess.thecvf.com/content_cvpr_2018/CameraReady/1163.pdf">Anderson, Peter, et al. “Bottom-up and top-down attention for image captioning and visual question answering.” CVPR. Vol. 3. No. 5. 2018.</a></p>
Fri, 25 Jan 2019 09:11:59 +0000
https://lyusungwon.github.io/studies/2019/01/25/updown/
https://lyusungwon.github.io/studies/2019/01/25/updown/deep-learningvisual-question-answeringstudiesCompositional Attention Networks for Machine Reasoning<h2 id="why">WHY?</h2>
<p>Previous methods for visual reasoning lacked interpretability. This paper suggests the MAC network, a fully differentiable and interpretable attention-based visual reasoning model.</p>
<h2 id="what">WHAT?</h2>
<p><img src="/assets/images/mac1.png" alt="image" class="center-image" width="50px" /></p>
<p>The MAC network consists of three parts: the input unit, the MAC cells, and the output unit. The input unit encodes the question and the image. The word embeddings of the question are processed by a d-dimensional biLSTM, whose hidden states serve as contextual representations of each word. The concatenation of the last hidden states of the backward and forward LSTMs is used as the question representation. This question representation is linearly projected to a separate vector for each reasoning step. For image features, the conv4 layer output of a pretrained ResNet101 is used. The image features are called the knowledge base in this paper.</p>
<p><img src="/assets/images/mac2.png" alt="image" class="center-image" width="50px" /></p>
<p>The Recurrent Memory, Attention, and Composition cell (MAC cell) is a newly designed recurrent cell for reasoning operations. p MAC cells are stacked to perform multiple reasoning steps. Each MAC cell has two hidden states: control(<code class="MathJax_Preview">c_i</code><script type="math/tex">c_i</script>) and memory(<code class="MathJax_Preview">m_i</code><script type="math/tex">m_i</script>). The control state represents the current reasoning operation and the memory state holds intermediate information.</p>
<p><img src="/assets/images/mac3.png" alt="image" class="center-image" width="50px" /></p>
<p>The control unit updates the control state. The new control state is a soft-attention-based weighted average of the question words, guided by the question vector for that step.</p>
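A rough NumPy sketch of this control-unit attention. The exact projections are simplified here: the weight shapes and the way the prior control state is combined with the step-specific question vector are assumptions for illustration, not the paper's precise parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
d, S = 8, 6   # hidden dimension, number of question words

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

cws = rng.standard_normal((S, d))    # contextual word representations (biLSTM)
q_i = rng.standard_normal(d)         # question vector projected for step i
c_prev = rng.standard_normal(d)      # previous control state
W_cq = rng.standard_normal((2 * d, d))
w_a = rng.standard_normal(d)

# Combine the prior control state with the step-specific question vector,
# score each question word against the result, and take the
# attention-weighted average of the words as the new control state.
cq = np.concatenate([c_prev, q_i]) @ W_cq   # (d,)
a = softmax((cws * cq) @ w_a)               # attention over the S words
c_i = a @ cws                               # new control state
```

Because the control state is a convex combination of question words, it can be read off directly as an attention map, which is what makes the reasoning steps interpretable.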
<p><img src="/assets/images/mac4.png" alt="image" class="center-image" width="50px" /></p>
<p>The read unit extracts information from the knowledge base. The interaction of the memory with the image features (I) and the independent facts (k) are concatenated to produce the information (I’). This information (I’) interacts with the control state (<code class="MathJax_Preview">c_i</code><script type="math/tex">c_i</script>) to produce weights (rv). The knowledge base is then weighted-averaged to produce the retrieved information (<code class="MathJax_Preview">r_i</code><script type="math/tex">r_i</script>).</p>
<p><img src="/assets/images/mac5.png" alt="image" class="center-image" width="50px" /></p>
<p>The memory state is updated based on the retrieved information and the previous memory. There are two ways the control state guides the write operation: one is performing self-attention, and the other is acting as a memory gate.</p>
<p><img src="/assets/images/mac6.png" alt="image" class="center-image" width="50px" /></p>
<p>The question representation(q) and the final memory state(<code class="MathJax_Preview">m_p</code><script type="math/tex">m_p</script>) are used as input for a 2-layer fully-connected softmax classifier to produce the answer.</p>
<h2 id="so">So?</h2>
<p><img src="/assets/images/mac7.png" alt="image" class="center-image" width="50px" /></p>
<p>MAC showed state-of-the-art results on both CLEVR and CLEVR-Humans without using information about the data generation process.</p>
<p><img src="/assets/images/mac8.png" alt="image" class="center-image" width="50px" /></p>
<p>Also, the attention maps of MAC show that the model solves the task properly.</p>
<p><a href="https://arxiv.org/abs/1803.03067">Hudson, Drew A., and Christopher D. Manning. “Compositional attention networks for machine reasoning.” arXiv preprint arXiv:1803.03067 (2018).</a></p>
Thu, 24 Jan 2019 13:11:59 +0000
https://lyusungwon.github.io/studies/2019/01/24/mac/
https://lyusungwon.github.io/studies/2019/01/24/mac/deep-learningvisual-question-answeringstudiesHierarchical Question-Image Co-Attention for Visual Question Answering<h2 id="why">WHY?</h2>
<p>Previous works achieved successful results in VQA by modeling visual attention. This paper suggests a co-attention model for VQA that pays attention to both images (where to look) and words (what words to listen to).</p>
<h2 id="what">WHAT?</h2>
<p>The co-attention model in this paper attends to the image and the question at three levels: word level, phrase level, and question level. An embedding matrix is used at the word level, 1D convolutions with three window sizes (1-3) followed by max pooling are used at the phrase level, and an LSTM encodes the question-level vector.</p>
<p><img src="/assets/images/coatt1.png" alt="image" class="center-image" width="50px" /></p>
<p>Two methods for co-attention are suggested: parallel co-attention and alternating co-attention. Parallel co-attention first forms a bilinear affinity matrix to capture the relationship between image regions and words, and produces attended visual and word features.</p>
<pre class="MathJax_Preview"><code>\mathbf{C} = tanh(\mathbf{Q}^{\top}\mathbf{W}_b\mathbf{V})\\
\mathbf{H}^v = tanh(\mathbf{W}_v\mathbf{V} + (\mathbf{W}_q\mathbf{Q})\mathbf{C}), \mathbf{H}^q = tanh(\mathbf{W}_q\mathbf{Q} + (\mathbf{W}_v\mathbf{V})\mathbf{C}^{\top})\\
\mathbf{a}^v = softmax(\mathbf{w}_{hv}^{\top}\mathbf{H}^v), \mathbf{a}^q = softmax(\mathbf{w}_{hq}^{\top}\mathbf{H}^q)\\
\hat{\mathbf{v}} = \sum_{n=1}^N a_n^v\mathbf{v}_n, \hat{\mathbf{q}} = \sum_{t=1}^T a_t^q\mathbf{q}_t</code></pre>
<script type="math/tex; mode=display">\mathbf{C} = tanh(\mathbf{Q}^{\top}\mathbf{W}_b\mathbf{V})\\
\mathbf{H}^v = tanh(\mathbf{W}_v\mathbf{V} + (\mathbf{W}_q\mathbf{Q})\mathbf{C}), \mathbf{H}^q = tanh(\mathbf{W}_q\mathbf{Q} + (\mathbf{W}_v\mathbf{V})\mathbf{C}^{\top})\\
\mathbf{a}^v = softmax(\mathbf{w}_{hv}^{\top}\mathbf{H}^v), \mathbf{a}^q = softmax(\mathbf{w}_{hq}^{\top}\mathbf{H}^q)\\
\hat{\mathbf{v}} = \sum_{n=1}^N a_n^v\mathbf{v}_n, \hat{\mathbf{q}} = \sum_{t=1}^T a_t^q\mathbf{q}_t</script>
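The parallel co-attention equations above can be sketched in NumPy. Feature dimensions and random weights are illustrative; columns of V and Q are image regions and question words respectively:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, T, k = 6, 4, 5, 8   # feature dim, image regions, question words, hidden

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V = rng.standard_normal((d, N))   # image features, one column per region
Q = rng.standard_normal((d, T))   # question features, one column per word
Wb = rng.standard_normal((d, d))
Wv, Wq = rng.standard_normal((k, d)), rng.standard_normal((k, d))
w_hv, w_hq = rng.standard_normal(k), rng.standard_normal(k)

C = np.tanh(Q.T @ Wb @ V)               # (T, N) affinity matrix
Hv = np.tanh(Wv @ V + (Wq @ Q) @ C)     # question-informed image map (k, N)
Hq = np.tanh(Wq @ Q + (Wv @ V) @ C.T)   # image-informed question map (k, T)
a_v = softmax(w_hv @ Hv)                # attention over regions
a_q = softmax(w_hq @ Hq)                # attention over words
v_hat, q_hat = V @ a_v, Q @ a_q         # attended visual and word features
```

The affinity matrix C lets each modality's attention scores see a transformed view of the other modality, which is the "parallel" part: both attentions are computed in one pass.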
<p>The alternating co-attention method first summarizes the question into a single vector, then attends to the image based on the question vector, and finally attends to the question based on the attended image feature. In the first step, X = Q and g = 0. In the second step, X = V and <code class="MathJax_Preview">g = \hat{s}</code><script type="math/tex">g = \hat{s}</script>. In the third step, X = Q and <code class="MathJax_Preview">g = \hat{v}</code><script type="math/tex">g = \hat{v}</script>.</p>
<pre class="MathJax_Preview"><code>\hat{\mathbf{x}} = \mathcal{A}(\mathbf{X}; \mathbf{g})\\
\mathbf{H} = tanh(\mathbf{W}_x\mathbf{X} + (\mathbf{W}_g\mathbf{g)1^{\top}})\\
\mathbf{a}^x = softmax(\mathbf{w}^{\top}_{hx}\mathbf{H})\\
\hat{\mathbf{x}} = \sum a_i^x \mathbf{x}_i</code></pre>
<script type="math/tex; mode=display">\hat{\mathbf{x}} = \mathcal{A}(\mathbf{X}; \mathbf{g})\\
\mathbf{H} = tanh(\mathbf{W}_x\mathbf{X} + (\mathbf{W}_g\mathbf{g)1^{\top}})\\
\mathbf{a}^x = softmax(\mathbf{w}^{\top}_{hx}\mathbf{H})\\
\hat{\mathbf{x}} = \sum a_i^x \mathbf{x}_i</script>
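The three alternating steps can be sketched as repeated calls to one attention operator A(X; g). For brevity this sketch shares one set of weights across the three steps, whereas in practice each step would have its own parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, T, k = 6, 4, 5, 8   # feature dim, image regions, question words, hidden

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(X, g, Wx, Wg, w_hx):
    """x_hat = A(X; g): soft attention over the columns of X, guided by g."""
    H = np.tanh(Wx @ X + (Wg @ g)[:, None])   # broadcast g to every column
    a = softmax(w_hx @ H)                     # weights over columns of X
    return X @ a                              # attended feature

V = rng.standard_normal((d, N))
Q = rng.standard_normal((d, T))
Wx, Wg = rng.standard_normal((k, d)), rng.standard_normal((k, d))
w_hx = rng.standard_normal(k)

s_hat = attend(Q, np.zeros(d), Wx, Wg, w_hx)  # step 1: X = Q, g = 0
v_hat = attend(V, s_hat, Wx, Wg, w_hx)        # step 2: X = V, g = s_hat
q_hat = attend(Q, v_hat, Wx, Wg, w_hx)        # step 3: X = Q, g = v_hat
```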
<p><img src="/assets/images/coatt2.png" alt="image" class="center-image" width="50px" /></p>
<p>Co-attention is performed at each level of the question hierarchy. The attended features from the different levels are recursively encoded with an MLP.</p>
<pre class="MathJax_Preview"><code>\mathbf{h}^w = tanh(\mathbf{W}_w(\hat{\mathbf{q}}^w + \hat{\mathbf{v}}^w))\\
\mathbf{h}^p = tanh(\mathbf{W}_p[(\hat{\mathbf{q}}^p + \hat{\mathbf{v}}^p), \mathbf{h}^w])\\
\mathbf{h}^s = tanh(\mathbf{W}_s[(\hat{\mathbf{q}}^s + \hat{\mathbf{v}}^s), \mathbf{h}^p])\\
\mathbf{p} = softmax(\mathbf{W}_h\mathbf{h}^s)</code></pre>
<script type="math/tex; mode=display">\mathbf{h}^w = tanh(\mathbf{W}_w(\hat{\mathbf{q}}^w + \hat{\mathbf{v}}^w))\\
\mathbf{h}^p = tanh(\mathbf{W}_p[(\hat{\mathbf{q}}^p + \hat{\mathbf{v}}^p), \mathbf{h}^w])\\
\mathbf{h}^s = tanh(\mathbf{W}_s[(\hat{\mathbf{q}}^s + \hat{\mathbf{v}}^s), \mathbf{h}^p])\\
\mathbf{p} = softmax(\mathbf{W}_h\mathbf{h}^s)</script>
<h2 id="so">So?</h2>
<p><img src="/assets/images/coatt3.png" alt="image" class="center-image" width="50px" /></p>
<p>The co-attention model with pretrained image features achieved good results on the VQA dataset.</p>
<p><a href="http://papers.nips.cc/paper/6202-hierarchical-question-image-co-attention-for-visual-question-answering">Lu, Jiasen, et al. “Hierarchical question-image co-attention for visual question answering.” Advances In Neural Information Processing Systems. 2016.</a></p>
Wed, 23 Jan 2019 09:05:59 +0000
https://lyusungwon.github.io/studies/2019/01/23/coatt/
https://lyusungwon.github.io/studies/2019/01/23/coatt/deep-learningvisual-question-answeringstudiesMUTAN: Multimodal Tucker Fusion for Visual Question Answering<h2 id="why">WHY?</h2>
<p>While the bilinear model is an effective method for capturing the relationship between two spaces, the number of parameters is often intractable. This paper suggests reducing the number of parameters by controlling the rank of the core tensor with a Tucker decomposition.</p>
<h2 id="note">Note</h2>
<p>With <code class="MathJax_Preview">\times_i</code><script type="math/tex">\times_i</script> denoting the i-mode product between a tensor and a matrix, the Tucker decomposition of a tensor <code class="MathJax_Preview">\tau</code><script type="math/tex">\tau</script> is as follows.</p>
<pre class="MathJax_Preview"><code>\mathbf{\tau} \in \mathbb{R}^{d_q \times d_v \times |\mathcal{A}|}\\
\mathbf{\tau} = ((\mathbf{\tau_c} \times_1 \mathbf{W}_q) \times_2 \mathbf{W}_v) \times_3 \mathbf{W}_O</code></pre>
<script type="math/tex; mode=display">\mathbf{\tau} \in \mathbb{R}^{d_q \times d_v \times |\mathcal{A}|}\\
\mathbf{\tau} = ((\mathbf{\tau_c} \times_1 \mathbf{W}_q) \times_2 \mathbf{W}_v) \times_3 \mathbf{W}_O</script>
<h2 id="what">WHAT?</h2>
<p><img src="/assets/images/mutan1.png" alt="image" class="center-image" width="50px" /></p>
<p>The Tucker decomposition shows that a tensor can be represented with a limited number of parameters. The bilinear relationship between question vectors and image vectors can then be written in Tucker fusion form.</p>
<pre class="MathJax_Preview"><code>\mathbf{\tau} = ((\mathbf{\tau_c} \times_1 (\mathbf{q}^{\top}\mathbf{W}_q)) \times_2 (\mathbf{v}^{\top}\mathbf{W}_v)) \times_3 \mathbf{W}_O\\
\mathbf{z} = (\mathbf{\tau}_c \times_1 \mathbf{\tilde{q}}) \times_2 \mathbf{\tilde{v}} \in \mathbb{R}^{t_o}\\
\mathbf{y} = \mathbf{z}^{\top}\mathbf{W}_O</code></pre>
<script type="math/tex; mode=display">\mathbf{\tau} = ((\mathbf{\tau_c} \times_1 (\mathbf{q}^{\top}\mathbf{W}_q)) \times_2 (\mathbf{v}^{\top}\mathbf{W}_v)) \times_3 \mathbf{W}_O\\
\mathbf{z} = (\mathbf{\tau}_c \times_1 \mathbf{\tilde{q}}) \times_2 \mathbf{\tilde{v}} \in \mathbb{R}^{t_o}\\
\mathbf{y} = \mathbf{z}^{\top}\mathbf{W}_O</script>
<p>We can control the sparsity and expressiveness of the bilinear model by controlling the rank of each slice of <code class="MathJax_Preview">\mathbf{\tau}_c</code><script type="math/tex">\mathbf{\tau}_c</script>. If we impose rank R on each slice of the core tensor, then each slice can be represented as the sum of R rank-one matrices.</p>
<pre class="MathJax_Preview"><code>\mathbf{z}[k] = \tilde{\mathbf{q}}^{\top}\mathbf{\tau}_c[:,:,k]\tilde{\mathbf{v}}\\
\mathbf{\tau}_c[:,:,k] = \sum_{r=1}^R \mathbf{m}_r^k \otimes \mathbf{n}_r^{k\top}\\
\mathbf{z}[k] = \sum_{r=1}^R(\tilde{\mathbf{q}}^{\top}\mathbf{m}_r^k)(\tilde{\mathbf{v}}^{\top}\mathbf{n}_r^k)\\
\mathbf{z} = \sum_{r=1}^R \mathbf{z}_r\\
\mathbf{z}_r = (\tilde{\mathbf{q}}^{\top}\mathbf{M}_r)*(\tilde{\mathbf{v}}^{\top}\mathbf{N}_r)</code></pre>
<script type="math/tex; mode=display">\mathbf{z}[k] = \tilde{\mathbf{q}}^{\top}\mathbf{\tau}_c[:,:,k]\tilde{\mathbf{v}}\\
\mathbf{\tau}_c[:,:,k] = \sum_{r=1}^R \mathbf{m}_r^k \otimes \mathbf{n}_r^{k\top}\\
\mathbf{z}[k] = \sum_{r=1}^R(\tilde{\mathbf{q}}^{\top}\mathbf{m}_r^k)(\tilde{\mathbf{v}}^{\top}\mathbf{n}_r^k)\\
\mathbf{z} = \sum_{r=1}^R \mathbf{z}_r\\
\mathbf{z}_r = (\tilde{\mathbf{q}}^{\top}\mathbf{M}_r)*(\tilde{\mathbf{v}}^{\top}\mathbf{N}_r)</script>
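The equivalence between the rank-R form and the dense core-tensor contraction can be checked numerically. The dimensions and rank R below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
dq, dv, to, R = 8, 10, 5, 3   # projected question/image dims, output dim, rank

# Low-rank factors: slice k of the core is T_c[:, :, k] = sum_r m_r^k (n_r^k)^T.
M = rng.standard_normal((R, dq, to))   # M_r stacked over ranks
N = rng.standard_normal((R, dv, to))

q_t = rng.standard_normal(dq)   # projected question  q~
v_t = rng.standard_normal(dv)   # projected image     v~

# MUTAN fusion: z = sum_r (q~^T M_r) * (v~^T N_r), element-wise product.
z = sum((q_t @ M[r]) * (v_t @ N[r]) for r in range(R))

# Equivalent dense form: build the core tensor explicitly and contract.
Tc = np.einsum('rik,rjk->ijk', M, N)            # T_c[i, j, k]
z_dense = np.einsum('i,ijk,j->k', q_t, Tc, v_t)
```

The low-rank path never materializes the dq x dv x t_o core tensor, which is exactly where the parameter savings come from.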
<p><img src="/assets/images/mutan2.png" alt="image" class="center-image" width="50px" /></p>
<p>MCB and MLB can be expressed as special cases of MUTAN’s generalized form.</p>
<h2 id="so">So?</h2>
<p><img src="/assets/images/mutan3.png" alt="image" class="center-image" width="50px" /></p>
<p>MUTAN can represent rich multimodal representations and achieved state-of-the-art results on the VQA dataset.</p>
<p><a href="https://arxiv.org/abs/1705.06676">Ben-Younes, Hedi, et al. “Mutan: Multimodal tucker fusion for visual question answering.” 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017.</a></p>
Tue, 22 Jan 2019 09:03:59 +0000
https://lyusungwon.github.io/studies/2019/01/22/mutan/
https://lyusungwon.github.io/studies/2019/01/22/mutan/deep-learningvisual-question-answeringstudiesChain of Reasoning for Visual Question Answering<h2 id="why">WHY?</h2>
<p>Previous methods for visual question answering performed one-step or static reasoning, while some questions require a chain of reasoning steps.</p>
<h2 id="what">WHAT?</h2>
<p><img src="/assets/images/cor1.png" alt="image" class="center-image" width="50px" /></p>
<p>The chain of reasoning (CoR) model alternately updates objects and their relations to solve questions that require a chain of reasoning.</p>
<p><img src="/assets/images/cor2.png" alt="image" class="center-image" width="50px" /></p>
<p>CoR consists of three parts: data embedding, the chain of reasoning, and decision making. Data embedding encodes images and questions into vectors with Faster R-CNN and a GRU, respectively.</p>
<p><img src="/assets/images/cor3.png" alt="image" class="center-image" width="50px" /></p>
<p>The chain of reasoning consists of a series of sub-chains that perform relational reasoning and object refining. With the objects of the image at step t denoted by <code class="MathJax_Preview">O^{(t)}</code><script type="math/tex">O^{(t)}</script>, the output is calculated as follows.</p>
<pre class="MathJax_Preview"><code>P^{t} = relu(O^{t}W_o^{t}), S^{t} = relu(QW_q^{t})\\
F^{t} = \sum_{k=1}^K(P^{t}W_{p, k}^{t})\odot(S^{t}W_{s, k})\\
\alpha^{t} = softmax(F^{t}W_f^{t})\\
\tilde{Q}^{t} = (\alpha^{t})^T O^{t}</code></pre>
<script type="math/tex; mode=display">P^{t} = relu(O^{t}W_o^{t}), S^{t} = relu(QW_q^{t})\\
F^{t} = \sum_{k=1}^K(P^{t}W_{p, k}^{t})\odot(S^{t}W_{s, k})\\
\alpha^{t} = softmax(F^{t}W_f^{t})\\
\tilde{Q}^{t} = (\alpha^{t})^T O^{t}</script>
<p>Relations between objects are calculated with a guidance conditioned on the question. Objects are then refined with a weighted sum of the relations.</p>
<pre class="MathJax_Preview"><code>G_l = \sigma(relu(QW_{l_1})W_{l_2}), G_r = \sigma(relu(QW_{r_1})W_{r_2})\\
R_{ij}^{t} = (O_i^{t}\odot G_l) \oplus (O_j^{(1)}\odot G_r)\\
O_j^{t+1} = \sum_{i=1}^m \alpha_i^t R_{ij}^t</code></pre>
<script type="math/tex; mode=display">G_l = \sigma(relu(QW_{l_1})W_{l_2}), G_r = \sigma(relu(QW_{r_1})W_{r_2})\\
R_{ij}^{t} = (O_i^{t}\odot G_l) \oplus (O_j^{(1)}\odot G_r)\\
O_j^{t+1} = \sum_{i=1}^m \alpha_i^t R_{ij}^t</script>
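A NumPy sketch of one refinement step, reading ⊕ as element-wise addition for illustration (the paper's exact combination operator and the parameterization of the gates may differ; the uniform attention weights are also a stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 4, 6   # number of objects, feature dimension

O = rng.standard_normal((m, d))          # objects O^{(t)}
G_l, G_r = rng.random(d), rng.random(d)  # question-conditioned gates in (0, 1)
alpha = np.full(m, 1.0 / m)              # attention weights over objects

# R[i, j] = (O_i * G_l) + (O_j * G_r): pairwise relation of objects i and j,
# built by broadcasting the gated objects against each other.
R = (O * G_l)[:, None, :] + (O * G_r)[None, :, :]   # shape (m, m, d)

# Refined objects: attention-weighted sum of relations over the i index.
O_next = np.einsum('i,ijd->jd', alpha, R)
```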
<p>Decisions are made with all the objects from each step.</p>
<pre class="MathJax_Preview"><code>O^* = [relu(O^{1}W^{1});relu(O^{2}W^{2});...;relu(O^{T}W^{T})]\\
H = \sum_{k=1}^K(O^* W_{O^*, k})\odot(QW_{q', k})\\
\hat{a} = softmax(HW_h)</code></pre>
<script type="math/tex; mode=display">O^* = [relu(O^{1}W^{1});relu(O^{2}W^{2});...;relu(O^{T}W^{T})]\\
H = \sum_{k=1}^K(O^* W_{O^*, k})\odot(QW_{q', k})\\
\hat{a} = softmax(HW_h)</script>
<h2 id="so">So?</h2>
<p><img src="/assets/images/cor4.png" alt="image" class="center-image" width="50px" /></p>
<p>CoR achieved the best results on various VQA tasks, including VQA 1.0, VQA 2.0, COCO-QA, and TDIUC. Visualization shows that CoR performs appropriate reasoning.</p>
<h2 id="critic">Critic</h2>
<p>I think more explanation is needed for the architecture of the relational reasoning and object refining steps. Also, the performance differences in the ablation studies seem too small to draw conclusions from.</p>
<p><a href="https://papers.nips.cc/paper/7311-chain-of-reasoning-for-visual-question-answering">Wu, Chenfei, et al. “Chain of Reasoning for Visual Question Answering.” Advances in Neural Information Processing Systems. 2018.</a></p>
Mon, 21 Jan 2019 09:03:59 +0000
https://lyusungwon.github.io/studies/2019/01/21/cor/
https://lyusungwon.github.io/studies/2019/01/21/cor/deep-learningvisual-question-answeringstudiesDeformable Convolutional Networks<h2 id="why">WHY?</h2>
<p>The spatial sampling of a convolutional neural network is geometrically fixed. This paper suggests two modules that allow CNNs to capture geometric structure more flexibly.</p>
<h2 id="what">WHAT?</h2>
<p>Deformable convolution modifies the regular grid <code class="MathJax_Preview">\mathcal{R}</code><script type="math/tex">\mathcal{R}</script> of a convolution by augmenting <code class="MathJax_Preview">\mathcal{R}</code><script type="math/tex">\mathcal{R}</script> with offsets. The offsets are generated by a conv layer with 2N channels, where N is the number of grid points. For example, consider a convolution with a 3x3 kernel and dilation 1.</p>
<p><img src="/assets/images/dcn1.png" alt="image" class="center-image" width="50px" /></p>
<pre class="MathJax_Preview"><code>\mathcal{R} = \{(-1, -1), (-1, 0),...,(0, 1), (1, 1)\}\\
\mathbf{y}(\mathbf{p}_0) = \sum_{\mathbf{p}_n\in\mathcal{R}}\mathbf{w}(\mathbf{p}_n)\cdot\mathbf{x}(\mathbf{p} + \mathbf{p}_n)\\
\mathbf{y}(\mathbf{p}_0) = \sum_{\mathbf{p}_n\in\mathcal{R}}\mathbf{w}(\mathbf{p}_n)\cdot\mathbf{x}(\mathbf{p} + \mathbf{p}_n + \delta\mathbf{p}_n)\\</code></pre>
<script type="math/tex; mode=display">\mathcal{R} = \{(-1, -1), (-1, 0),...,(0, 1), (1, 1)\}\\
\mathbf{y}(\mathbf{p}_0) = \sum_{\mathbf{p}_n\in\mathcal{R}}\mathbf{w}(\mathbf{p}_n)\cdot\mathbf{x}(\mathbf{p} + \mathbf{p}_n)\\
\mathbf{y}(\mathbf{p}_0) = \sum_{\mathbf{p}_n\in\mathcal{R}}\mathbf{w}(\mathbf{p}_n)\cdot\mathbf{x}(\mathbf{p} + \mathbf{p}_n + \delta\mathbf{p}_n)\\</script>
<p>Since the offsets can be fractional, bilinear interpolation is used.</p>
<pre class="MathJax_Preview"><code>\mathbf{x}(\mathbf{p}) = \sum_q G(\mathbf{q}, \mathbf{p})\mathbf{x}(\mathbf{q})\\
G(\mathbf{q}, \mathbf{p}) = g(q_x, p_x)\cdot g(q_y, p_y)\\
g(a, b) = max(0, 1 - |a - b|)</code></pre>
<script type="math/tex; mode=display">\mathbf{x}(\mathbf{p}) = \sum_q G(\mathbf{q}, \mathbf{p})\mathbf{x}(\mathbf{q})\\
G(\mathbf{q}, \mathbf{p}) = g(q_x, p_x)\cdot g(q_y, p_y)\\
g(a, b) = max(0, 1 - |a - b|)</script>
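The interpolation kernel can be implemented directly. Since g(a, b) = max(0, 1 - |a - b|) vanishes beyond distance 1, only the four integer neighbours of a fractional location p contribute:

```python
import numpy as np

def bilinear_sample(x, p):
    """Sample a 2D map x at fractional location p = (py, px) using
    x(p) = sum_q G(q, p) x(q), with the triangular kernel g above."""
    H, W = x.shape
    py, px = p
    val = 0.0
    # Only the 4 integer neighbours of p have nonzero kernel weight.
    for qy in (int(np.floor(py)), int(np.floor(py)) + 1):
        for qx in (int(np.floor(px)), int(np.floor(px)) + 1):
            if 0 <= qy < H and 0 <= qx < W:
                w = max(0.0, 1 - abs(qy - py)) * max(0.0, 1 - abs(qx - px))
                val += w * x[qy, qx]
    return val
```

For example, sampling the 2x2 map [[0, 1], [2, 3]] at its center (0.5, 0.5) averages all four entries and returns 1.5, and sampling at an integer location reduces to plain indexing.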
<p>This kind of augmentation enables the CNN to capture images under various transformations of scale, aspect ratio, and rotation.</p>
<p><img src="/assets/images/dcn2.png" alt="image" class="center-image" width="50px" /></p>
<p>The second module is deformable RoI pooling for object detection. Deformable RoI pooling divides the RoI into k x k bins and outputs a k x k feature map y. Here the offsets are generated by a fully-connected layer.</p>
<p><img src="/assets/images/dcn3.png" alt="image" class="center-image" width="50px" /></p>
<pre class="MathJax_Preview"><code>\mathbf{y} = \sum_{\mathbf{p}\in bin(i, j)} \mathbf{x}(\mathbf{p}_0 + \mathbf{p})/n_{ij}\\
\mathbf{y} = \sum_{\mathbf{p}\in bin(i, j)} \mathbf{x}(\mathbf{p}_0 + \mathbf{p} + \delta \mathbf{p}_{ij})/n_{ij}\\
\delta \mathbf{p}_{ij} = \gamma\cdot\delta\hat{\mathbf{p}}_{ij} \circ (w, h)\\</code></pre>
<script type="math/tex; mode=display">\mathbf{y} = \sum_{\mathbf{p}\in bin(i, j)} \mathbf{x}(\mathbf{p}_0 + \mathbf{p})/n_{ij}\\
\mathbf{y} = \sum_{\mathbf{p}\in bin(i, j)} \mathbf{x}(\mathbf{p}_0 + \mathbf{p} + \delta \mathbf{p}_{ij})/n_{ij}\\
\delta \mathbf{p}_{ij} = \gamma\cdot\delta\hat{\mathbf{p}}_{ij} \circ (w, h)\\</script>
<p>Position-sensitive RoI pooling replaces the general feature maps with position-sensitive score maps with <code class="MathJax_Preview">k^2(C+1)</code><script type="math/tex">k^2(C+1)</script> channels.</p>
<p><img src="/assets/images/dcn4.png" alt="image" class="center-image" width="50px" /></p>
<h2 id="so">So?</h2>
<p><img src="/assets/images/dcn5.png" alt="image" class="center-image" width="50px" />
<img src="/assets/images/dcn6.png" alt="image" class="center-image" width="50px" />
Deformable convolution networks performed better than normal convolution networks on semantic segmentation and object detection.</p>
<h2 id="critic">Critic</h2>
<p>The amazing property of DCN is that the receptive field of its filters can vary with object size. I assume that the feature vectors of DCN may represent real objects, which could be useful in VQA.</p>
<p><a href="http://openaccess.thecvf.com/content_ICCV_2017/papers/Dai_Deformable_Convolutional_Networks_ICCV_2017_paper.pdf">Dai, Jifeng, et al. “Deformable convolutional networks.” CoRR, abs/1703.06211 1.2 (2017): 3.</a></p>
Fri, 18 Jan 2019 09:02:59 +0000
https://lyusungwon.github.io/studies/2019/01/18/dcn/
https://lyusungwon.github.io/studies/2019/01/18/dcn/deep-learningcomputer-visionstudiesShow, Attend and Tell: Neural Image Caption Generation with Visual Attention<h2 id="why">WHY?</h2>
<p>A caption for an image can be generated with an attention-based model by aligning each word to a part of the image.</p>
<h2 id="what">WHAT?</h2>
<p><img src="/assets/images/nicg1.png" alt="image" class="center-image" width="50px" />
A convolutional neural network extracts features from raw images, resulting in a series of feature vectors. In order to generate a series of words as a caption, an LSTM with an attention model is used. The generated feature vectors are used as annotation vectors for attention. The previous word, the previous hidden vector, and the current context vector are concatenated and projected to the dimension of the hidden vector to serve as input for the gates. A deep output layer produces the output word probability distribution.</p>
<p><img src="/assets/images/nicg2.png" alt="image" class="center-image" width="50px" /></p>
<pre class="MathJax_Preview"><code>e_{ti} = f_{att}(\mathbf{a_i}, \mathbf{h_{t-1}})\\
\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^L \exp(e_{tk})}\\
\hat{z}_t = \phi(\{\mathbf{a}_i\}, \{\alpha_i\})\\
p(\mathbf{y}_\mathbf{a}, \mathbf{y}_1^{t-1}) \propto \exp(\mathbf{L}_o(\mathbf{E}\mathbf{y}_{t-1} + \mathbf{L}_h \mathbf{h}_t + \mathbf{L}_z\hat{\mathbf{z}}_t))</code></pre>
<script type="math/tex; mode=display">e_{ti} = f_{att}(\mathbf{a_i}, \mathbf{h_{t-1}})\\
\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^L \exp(e_{tk})}\\
\hat{z}_t = \phi(\{\mathbf{a}_i\}, \{\alpha_i\})\\
p(\mathbf{y}_\mathbf{a}, \mathbf{y}_1^{t-1}) \propto \exp(\mathbf{L}_o(\mathbf{E}\mathbf{y}_{t-1} + \mathbf{L}_h \mathbf{h}_t + \mathbf{L}_z\hat{\mathbf{z}}_t))</script>
<p>The weighting function of the annotation vectors and a hidden vector (<code class="MathJax_Preview">\phi</code><script type="math/tex">\phi</script>) can vary. Stochastic hard attention treats the attention weights as the parameters of a multinoulli distribution and samples a place of attention from that distribution. The variational lower bound on the marginal log-likelihood can be maximized by estimating gradients with the REINFORCE algorithm. Deterministic soft attention takes the expectation of the context vector by weight-summing the annotation vectors. Doubly stochastic attention encourages the model to focus on every part of the image by imposing a regularization term on the loss function.</p>
<pre class="MathJax_Preview"><code>L_d = -\log(P(\mathbf{y}|\mathbf{x})) + \lambda\sum^L_i(1 - \sum_t^C\alpha_{ti})^2</code></pre>
<script type="math/tex; mode=display">L_d = -\log(P(\mathbf{y}|\mathbf{x})) + \lambda\sum^L_i(1 - \sum_t^C\alpha_{ti})^2</script>
<h2 id="so">So?</h2>
<p><img src="/assets/images/nicg3.png" alt="image" class="center-image" width="50px" />
The attention model was able to generate captions by sequentially focusing on parts of the images.</p>
<h2 id="critic">Critic</h2>
<p>The attention for words other than keywords tends to drift around. There could also be attention over relations, since some words refer to the relations between objects.</p>
<p><a href="http://proceedings.mlr.press/v37/xuc15.pdf">Xu, Kelvin, et al. “Show, attend and tell: Neural image caption generation with visual attention.” International conference on machine learning. 2015.</a></p>
Thu, 17 Jan 2019 10:32:59 +0000
https://lyusungwon.github.io/studies/2019/01/17/nicg/
https://lyusungwon.github.io/studies/2019/01/17/nicg/deep-learningvisual-question-answeringstudiesMobileNet: Efficient Convolutional Neural Networks for Mobile Vision Applications<h2 id="why">WHY?</h2>
<p>Recent neural network models are getting bigger to push performance to the limit. This paper suggests MobileNet, which reduces the size of a neural network enough to deploy it on mobile devices.</p>
<h2 id="what">WHAT?</h2>
<p>Several techniques are used for MobileNet.</p>
<p><img src="/assets/images/mbnet1" alt="image" class="center-image" width="50px" /></p>
<p>The most important component of MobileNet is the depthwise separable convolution. Assume a feature map of size <code class="MathJax_Preview">D_F\cdot D_F \cdot M</code><script type="math/tex">D_F\cdot D_F \cdot M</script>. A standard convolution layer consists of N filters of size <code class="MathJax_Preview">D_K\cdot D_K \cdot M</code><script type="math/tex">D_K\cdot D_K \cdot M</script>. Instead, the depthwise separable convolution replaces this with M depthwise convolution filters of size <code class="MathJax_Preview">D_K\cdot D_K\cdot 1</code><script type="math/tex">D_K\cdot D_K\cdot 1</script> and N pointwise convolution filters of size <code class="MathJax_Preview">1\cdot 1 \cdot M</code><script type="math/tex">1\cdot 1 \cdot M</script>.</p>
<p><img src="/assets/images/mbnet2" alt="image" class="center-image" width="50px" /></p>
<p>The new, efficient architecture based on this method not only reduces the number of multiply-adds, but also concentrates the computation in pointwise convolution layers, one of the most efficient operations thanks to general matrix multiplication (GEMM).</p>
<p>MobileNet introduces two additional hyperparameters to reduce computation. The width multiplier <code class="MathJax_Preview">\alpha</code><script type="math/tex">\alpha</script> reduces the number of channels in each layer. The resolution multiplier <code class="MathJax_Preview">\rho</code><script type="math/tex">\rho</script> reduces the height and width of each layer. The number of computations is reduced from</p>
<pre class="MathJax_Preview"><code>D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F</code></pre>
<script type="math/tex; mode=display">D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F</script>
<p>to</p>
<pre class="MathJax_Preview"><code>D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F</code></pre>
<script type="math/tex; mode=display">D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F</script>
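Plugging a typical layer into these cost formulas shows the savings. With α = ρ = 1 the ratio between the two counts reduces to 1/N + 1/D_K²; the layer sizes below are illustrative choices:

```python
# Multiply-add counts from the formulas above (with alpha = rho = 1).
def standard_cost(DK, M, N, DF):
    # One pass of N standard DK x DK x M filters over a DF x DF map.
    return DK * DK * M * N * DF * DF

def separable_cost(DK, M, N, DF):
    # Depthwise pass (DK*DK*M*DF*DF) plus pointwise pass (M*N*DF*DF).
    return DK * DK * M * DF * DF + M * N * DF * DF

# A typical layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
ratio = separable_cost(3, 512, 512, 14) / standard_cost(3, 512, 512, 14)
# ratio = 1/N + 1/DK^2, roughly 0.113 for this layer: about 9x fewer mult-adds.
```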
<h2 id="so">So?</h2>
<p><img src="/assets/images/mbnet3" alt="image" class="center-image" width="50px" />
MobileNet decreased the number of parameters and computations dramatically with only a slight decrease in performance on various tasks, including classification and detection.</p>
<h2 id="critic">Critic</h2>
<p>It is amazing that convolution filters can be decomposed into a depthwise convolution and a pointwise convolution while preserving much of their representational power. Could there be a similar method for the RNN family?</p>
<p><a href="https://arxiv.org/abs/1704.04861">Howard, Andrew G., et al. “Mobilenets: Efficient convolutional neural networks for mobile vision applications.” arXiv preprint arXiv:1704.04861 (2017).</a></p>
Tue, 15 Jan 2019 09:59:59 +0000
https://lyusungwon.github.io/studies/2019/01/15/mbnet/
https://lyusungwon.github.io/studies/2019/01/15/mbnet/deep-learningdeep-learningstudiesStarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation<h2 id="why">WHY?</h2>
<p><img src="/assets/images/stargan1.png" alt="image" class="center-image" width="50px" /></p>
<p><a href="https://lyusungwon.github.io/generative-models/2018/04/05/cyclegan.html">CycleGAN</a> has been used effectively in image-to-image translation. However, handling more than two domains was difficult. This paper suggests StarGAN to handle multiple domains with a single model.</p>
<h2 id="what">WHAT?</h2>
<p>StarGAN can be considered a domain-conditioned version of CycleGAN.</p>
<p><img src="/assets/images/stargan2.png" alt="image" class="center-image" width="50px" /></p>
<p>The discriminator of StarGAN classifies not only real versus fake, but also the domain labels. The loss function of StarGAN consists of three terms: an adversarial loss, a domain classification loss, and a reconstruction loss.</p>
<pre class="MathJax_Preview"><code>\mathcal{L}_{adv} = \mathbb{E}_x[\log D_{src}(x)] + \mathbb{E}_{x,c}[\log(1-D_{src}(G(x,c)))]\\
\mathcal{L}_{cls}^r = \mathbb{E}_{x, c'}[-\log D_{cls}(c'|x)]\\
\mathcal{L}_{cls}^f = \mathbb{E}_{x, c}[-\log D_{cls}(c|G(x, c))]\\
\mathcal{L}_{rec} = \mathbb{E}_{x, c, c'}[\|x - G(G(x, c), c')\|_1]\\
\mathcal{L}_D = -\mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}_{cls}^r\\
\mathcal{L}_G = \mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}_{cls}^f + \lambda_{rec}\mathcal{L}_{rec}</code></pre>
<script type="math/tex; mode=display">\mathcal{L}_{adv} = \mathbb{E}_x[\log D_{src}(x)] + \mathbb{E}_{x,c}[\log(1-D_{src}(G(x,c)))]\\
\mathcal{L}_{cls}^r = \mathbb{E}_{x, c'}[-log D_{cls}(c'|x)]\\
\mathcal{L}_{cls}^f = \mathbb{E}_{x, c}[-log D_{cls}(c|G(x, c))]\\
\mathcal{L}_{rec} = \mathbb{E}_{x, c, c'}[\|x - G(G(x, c), c')\|_1]\\
\mathcal{L}_D = -\mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}_{cls}^r\\
\mathcal{L}_G = \mathcal{L}_{adv} + \lambda_{cls}\mathcal{L}_{cls}^f + \lambda_{rec}\mathcal{L}_{rec}</script>
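<p>As a toy illustration of how these terms combine, here is a pure-Python sketch with hypothetical scalar discriminator outputs standing in for the expectations (this is not the authors' code; the function name and toy inputs are made up for illustration):</p>

```python
import math

def stargan_losses(d_real, d_fake, p_cls_real, p_cls_fake, x, x_rec,
                   lambda_cls=1.0, lambda_rec=10.0):
    """Combine the StarGAN loss terms for toy inputs (hypothetical helper).

    d_real, d_fake: D_src outputs on a real image x and a fake G(x, c).
    p_cls_real:     D_cls(c'|x), probability of the true domain of the real image.
    p_cls_fake:     D_cls(c|G(x, c)), probability of the target domain of the fake.
    x, x_rec:       original image and reconstruction G(G(x, c), c') as flat lists.
    """
    l_adv = math.log(d_real) + math.log(1.0 - d_fake)
    l_cls_r = -math.log(p_cls_real)                    # domain loss on real images (for D)
    l_cls_f = -math.log(p_cls_fake)                    # domain loss on fakes (for G)
    l_rec = sum(abs(a - b) for a, b in zip(x, x_rec))  # L1 reconstruction (cycle) loss
    loss_d = -l_adv + lambda_cls * l_cls_r
    loss_g = l_adv + lambda_cls * l_cls_f + lambda_rec * l_rec
    return loss_d, loss_g

loss_d, loss_g = stargan_losses(0.9, 0.1, 0.8, 0.7, [0.5, 0.5], [0.4, 0.6])
```

<p>The defaults mirror the weights stated in the post; the reconstruction term dominates the generator loss unless the cycle is nearly perfect.</p>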
<p><code class="MathJax_Preview">\lambda_{cls} = 1, \lambda_{rec} = 10</code><script type="math/tex">\lambda_{cls} = 1, \lambda_{rec} = 10</script> in all experiments. To enable StarGAN to train on multiple datasets with different label sets, a mask vector is introduced as conditional information.</p>
<pre class="MathJax_Preview"><code>\tilde{c} = [c_1,...,c_n, m]</code></pre>
<script type="math/tex; mode=display">\tilde{c} = [c_1,...,c_n, m]</script>
<p>The model architecture is adopted from CycleGAN, and the WGAN-GP objective is used as the loss function.</p>
<h2 id="so">So?</h2>
<p><img src="/assets/images/stargan3.png" alt="image" class="center-image" width="50px" />
StarGAN successfully generated images conditioned on labels from different domains with a single model.</p>
<h2 id="critic">Critic</h2>
<p>I think the key point in StarGAN is effective conditioning on labels. Recent conditioning methods, such as those from PGGAN and BigGAN or AdaIN from SB-GAN, seem likely to be effective in this setting.</p>
<p><a href="https://arxiv.org/abs/1711.09020">Choi, Yunjey, et al. “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation.” arXiv preprint arXiv:1711.09020 (2017).</a></p>
Tue, 15 Jan 2019 09:01:59 +0000
https://lyusungwon.github.io/studies/2019/01/15/stargan/
https://lyusungwon.github.io/studies/2019/01/15/stargan/deep-learninggenerative-modelsstudiesU-net: Convolutional Networks for Biomedical Image Segmentation<h2 id="why">WHY?</h2>
<p>Image segmentation requires a lot of annotated images. This paper suggests efficient training for image segmentation using data augmentation and a new network structure.</p>
<h2 id="what">WHAT?</h2>
<p>This paper suggests the U-net architecture, a modified version of the fully convolutional network.</p>
<p><img src="/assets/images/unet1.png" alt="image" class="center-image" width="50px" /></p>
<p>U-net consists of a contracting path and an expansive path. The contracting path is a repeated application of two 3x3 convolutions with ReLU followed by 2x2 max pooling, with the number of channels doubled at each step. The expansive path is roughly symmetric to the contracting path: a 2x2 up-convolution that halves the number of channels, followed by two 3x3 convolutions with ReLU. After each upsampling, the output of the corresponding contracting layer is cropped and concatenated to provide contextual information. These symmetric paths form a U-shaped structure.</p>
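<p>Because the 3x3 convolutions are unpadded (valid), every layer shrinks the feature map, which is why input and output tiles differ in size. A short sketch that only tracks spatial sizes (not the network itself) reproduces the paper's 572x572 input tile mapping to a 388x388 output map:</p>

```python
def unet_sizes(size, depth=4):
    """Track the spatial size through U-net: each valid 3x3 conv shrinks the
    map by 2, 2x2 max pooling halves it, and 2x2 up-convolution doubles it.
    Returns the skip-connection sizes and the final output size."""
    skips = []
    for _ in range(depth):      # contracting path
        size = size - 2 - 2     # two 3x3 valid convolutions
        skips.append(size)      # saved (then cropped) for the skip connection
        size = size // 2        # 2x2 max pooling
    size = size - 2 - 2         # bottleneck convolutions
    for _ in range(depth):      # expansive path
        size = size * 2         # 2x2 up-convolution
        size = size - 2 - 2     # two 3x3 convs after crop-and-concatenate
    return skips, size

skips, out = unet_sizes(572)    # -> skips [568, 280, 136, 64], output 388
```

<p>The mismatch between each skip size and the upsampled size is exactly what the cropping step absorbs.</p>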
<p><img src="/assets/images/unet2.png" alt="image" class="center-image" width="50px" /></p>
<p>An overlap-tile strategy is used to predict the segmentation of arbitrarily large images. For efficient use of data, various kinds of data augmentation are used, including shifts, rotations, elastic deformations and gray value variations.</p>
<p><img src="/assets/images/unet3.png" alt="image" class="center-image" width="50px" /></p>
<p>To separate touching objects, a weighted loss on the separation border is used.</p>
<h2 id="so">So?</h2>
<p>U-net achieved good results on various medical image segmentation tasks with a small amount of data.</p>
<h2 id="critic">Critic</h2>
<p>It would be better if the paper focused on either the U-net structure or efficient training with data augmentation alone. The individual contribution of each method is not clear from the experimental results.</p>
<p><a href="https://arxiv.org/abs/1505.04597">Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation.” International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.</a></p>
Mon, 14 Jan 2019 09:01:59 +0000
https://lyusungwon.github.io/studies/2019/01/14/unet/
https://lyusungwon.github.io/studies/2019/01/14/unet/deep-learningcomputer-visionstudiesA Style-Based Generator Architecture for Generative Adversarial Networks<h2 id="why">WHY?</h2>
<p>High-quality, disentangled generation of images has been a goal for all generative models. This paper suggests a style-based generator architecture for GANs with techniques borrowed from the field of style transfer.</p>
<h2 id="what">WHAT?</h2>
<p>SBGAN changes the generator architecture on top of <a href="https://lyusungwon.github.io/generative-models/2019/01/02/pggan.html">PGGAN</a>.</p>
<p><img src="/assets/images/sbgan1.png" alt="image" class="center-image" width="50px" /></p>
<p>Instead of feeding a latent code z through the input layer of the generator, SBGAN starts from a learned constant. Latent codes are transformed by an 8-layer MLP mapping network and fed into each layer through adaptive instance normalization (AdaIN). This paper argues that the mapping network produces an intermediate latent space that does not have to follow the distribution of the training data, which helps disentanglement.</p>
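<p>AdaIN itself is simple: each feature-map channel is normalized to zero mean and unit variance, then given a style-specific scale and bias (which in SBGAN come from a learned affine transform of the mapping network's output). A minimal single-channel sketch with hand-picked style values:</p>

```python
import math

def adain(channel, y_scale, y_bias, eps=1e-8):
    """Adaptive instance normalization for one feature-map channel (flat list):
    normalize to zero mean / unit std, then apply a style-derived scale and bias."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    std = math.sqrt(var + eps)
    return [y_scale * (v - mean) / std + y_bias for v in channel]

out = adain([1.0, 2.0, 3.0, 4.0], y_scale=2.0, y_bias=0.5)
```

<p>After the transform, the channel's mean equals the style bias and its standard deviation equals the style scale, which is how each layer's statistics are overwritten by the style.</p>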
<p><img src="/assets/images/sbgan2.png" alt="image" class="center-image" width="50px" /></p>
<p>Finally, external noise inputs are added to each layer to provide stochasticity.</p>
<p><img src="/assets/images/sbgan3.png" alt="image" class="center-image" width="50px" /></p>
<p>A comprehensive ablation study was conducted. Starting from (A) PGGAN, (B) bilinear up/downsampling operations helped improve the quality. (C) The mapping network with AdaIN and (D) the constant input tensor further improved the results, as did (E) external input noise and (F) mixing regularization, which generates a portion of the images from a mix of two random latent codes. The truncation trick in W is used only for demonstration.</p>
<p><img src="/assets/images/sbgan4.png" alt="image" class="center-image" width="50px" /></p>
<h2 id="so">So?</h2>
<p>By mixing the latent codes in different scales of layers, different levels of style are mixed.</p>
<p><img src="/assets/images/sbgan5.png" alt="image" class="center-image" width="50px" /></p>
<p>Injecting the input noise at different levels of layers controls stochastic variation at different scales.</p>
<p><img src="/assets/images/sbgan6.png" alt="image" class="center-image" width="50px" /></p>
<p><a href="https://arxiv.org/abs/1812.04948">Karras, Tero, Samuli Laine, and Timo Aila. “A Style-Based Generator Architecture for Generative Adversarial Networks.” arXiv preprint arXiv:1812.04948 (2018).</a></p>
Fri, 11 Jan 2019 09:01:59 +0000
https://lyusungwon.github.io/studies/2019/01/11/sbgan/
https://lyusungwon.github.io/studies/2019/01/11/sbgan/deep-learninggenerative-modelsstudiesNeural Arithmetic Logic Units<h2 id="why">WHY?</h2>
<p>Neural networks are poor at manipulating numerical information outside the range of the training set.</p>
<h2 id="what">WHAT?</h2>
<p><img src="/assets/images/nalu.png" alt="image" class="center-image" width="50px" /></p>
<p>This paper suggests two models that learn to manipulate and extrapolate numbers. The first is the neural accumulator (NAC), which accumulates input quantities additively. Its weight matrix W is a continuous relaxation of a matrix whose elements are -1, 0, or 1, so outputs are additions or subtractions of inputs without arbitrary rescaling.</p>
<pre class="MathJax_Preview"><code>\mathbf{W} = \tanh(\hat{\mathbf{W}})\odot\sigma(\hat{\mathbf{M}})\\
\mathbf{a} = \mathbf{W}\mathbf{x}</code></pre>
<script type="math/tex; mode=display">\mathbf{W} = \tanh(\hat{\mathbf{W}})\odot\sigma(\hat{\mathbf{M}})\\
\mathbf{a} = \mathbf{W}\mathbf{x}</script>
<p>The second model is the neural arithmetic logic unit (NALU), which can also perform multiplicative arithmetic. A NALU is a gated sum of a NAC and a second NAC that operates in log space.</p>
<pre class="MathJax_Preview"><code>\mathbf{g} = \sigma(\mathbf{G}\mathbf{x})\\
\mathbf{m} = \exp \mathbf{W}(\log(|\mathbf{x}|+\epsilon))\\
\mathbf{y} = \mathbf{g}\cdot\mathbf{a} + (1 - \mathbf{g})\cdot \mathbf{m}</code></pre>
<script type="math/tex; mode=display">\mathbf{g} = \sigma(\mathbf{G}\mathbf{x})\\
\mathbf{m} = \exp \mathbf{W}(\log(|\mathbf{x}|+\epsilon))\\
\mathbf{y} = \mathbf{g}\cdot\mathbf{a} + (1 - \mathbf{g})\cdot \mathbf{m}</script>
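<p>A minimal sketch of one NALU output unit makes the gating concrete: the log-space path turns sums of logs into products, and the gate chooses between the additive and multiplicative paths. The toy weights below are hand-picked, not learned:</p>

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nac_weight(w_hat, m_hat):
    """Effective NAC weight tanh(W_hat) * sigmoid(M_hat), biased toward {-1, 0, 1}."""
    return math.tanh(w_hat) * sigmoid(m_hat)

def nalu(x, W, G, eps=1e-10):
    """One NALU cell with a single output unit.
    W: effective NAC weights (one per input), G: gate weights."""
    a = sum(w * v for w, v in zip(W, x))                            # additive path
    log_m = sum(w * math.log(abs(v) + eps) for w, v in zip(W, x))
    m = math.exp(log_m)                                             # multiplicative path
    g = sigmoid(sum(gv * v for gv, v in zip(G, x)))                 # learned gate
    return g * a + (1.0 - g) * m

# With weights ~1 and the gate pushed to 0, the cell multiplies its inputs.
y = nalu([3.0, 4.0], W=[1.0, 1.0], G=[-10.0, -10.0])   # ~ 3 * 4
```

<p>With the gate pushed to 1 instead, the same weights would compute 3 + 4, which is why a single cell can cover both operations.</p>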
<h2 id="so">So?</h2>
<p>NALU successfully operated on various tasks with extrapolated numerical values, including simple function learning, MNIST counting and arithmetic, language-to-number translation and program evaluation. NALU even performed well on non-numerical extrapolation tasks such as tracking time in a grid-world environment and MNIST parity prediction.</p>
<p><a href="http://papers.nips.cc/paper/8027-neural-arithmetic-logic-units">Trask, Andrew, et al. “Neural arithmetic logic units.” Advances in Neural Information Processing Systems. 2018.</a></p>
Thu, 10 Jan 2019 14:58:59 +0000
https://lyusungwon.github.io/studies/2019/01/10/nalu/
https://lyusungwon.github.io/studies/2019/01/10/nalu/deep-learningdeep-learningstudiesVAE with a VampPrior<h2 id="why">WHY?</h2>
<p>Choosing an appropriate prior is important for a VAE. This paper suggests a two-layered VAE with a flexible VampPrior.</p>
<h2 id="what">WHAT?</h2>
<p>The original variational lower-bound of VAE can be decomposed as follows.</p>
<pre class="MathJax_Preview"><code>\mathbb{E}_{x\sim q(x)}[\ln p(x)] \geq \mathbb{E}_{x\sim q(x)}[\mathbb{E}_{q_{\phi}(z|x)}[\ln p_{\theta}(x|z)+\ln p_{\lambda}(z) - \ln q_{\phi}(z|x)]] \triangleq \mathcal{L}(\phi, \theta, \lambda) \\
= \mathbb{E}_{x \sim q(x)}[\mathbb{E}_{q_{\phi}(z|x)}[\ln p_{\theta}(x|z)]] + \mathbb{E}_{x\sim q(x)}[\mathbb{H}[q_{\phi}(z|x)]] - \mathbb{E}_{z\sim q(z)}[-\ln p_{\lambda}(z)]</code></pre>
<script type="math/tex; mode=display">\mathbb{E}_{x\sim q(x)}[\ln p(x)] \geq \mathbb{E}_{x\sim q(x)}[\mathbb{E}_{q_{\phi}(z|x)}[\ln p_{\theta}(x|z)+\ln p_{\lambda}(z) - \ln q_{\phi}(z|x)]] \triangleq \mathcal{L}(\phi, \theta, \lambda) \\
= \mathbb{E}_{x \sim q(x)}[\mathbb{E}_{q_{\phi}(z|x)}[\ln p_{\theta}(x|z)]] + \mathbb{E}_{x\sim q(x)}[\mathbb{H}[q_{\phi}(z|x)]] - \mathbb{E}_{z\sim q(z)}[-\ln p_{\lambda}(z)]</script>
<p>The first component is the negative reconstruction error, the second is the expectation of the entropy of the variational posterior, and the last is the cross-entropy between the aggregated posterior and the prior. Usually the prior is fixed to a simple distribution such as a standard Gaussian, but the prior that maximizes the ELBO turns out to be the aggregated posterior.</p>
<pre class="MathJax_Preview"><code>\max_{p_{\lambda}(z)} - \mathbb{E}_{z \sim q(z)}[-\ln p_{\lambda}(z)] + \beta (\int p_{\lambda}(z)dz -1)\\
p_{\lambda}^*(z) = \frac{1}{N}\sum_{n=1}^N q_{\phi}(z|x_n)</code></pre>
<script type="math/tex; mode=display">\max_{p_{\lambda}(z)} - \mathbb{E}_{z \sim q(z)}[-\ln p_{\lambda}(z)] + \beta (\int p_{\lambda}(z)dz -1)\\
p_{\lambda}^*(z) = \frac{1}{N}\sum_{n=1}^N q_{\phi}(z|x_n)</script>
<p>However, this choice not only leads to overfitting but is also expensive to compute. This paper therefore suggests the variational mixture of posteriors prior (VampPrior), which approximates the prior with a mixture of variational posteriors evaluated at pseudo-inputs. These pseudo-inputs are learned by backpropagation.</p>
<pre class="MathJax_Preview"><code>p_{\lambda}(z) = \frac{1}{K}\sum^K_{k=1}q_{\phi}(z|u_k)</code></pre>
<script type="math/tex; mode=display">p_{\lambda}(z) = \frac{1}{K}\sum^K_{k=1}q_{\phi}(z|u_k)</script>
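<p>Evaluating the VampPrior is just evaluating a uniform mixture of the encoder's posteriors at the K pseudo-inputs. A one-dimensional sketch with hand-picked (mu, sigma) pairs standing in for the encoder outputs q(z|u_k):</p>

```python
import math

def gauss_logpdf(z, mu, sigma):
    """Log density of a 1-D Gaussian N(z | mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)

def vamp_log_prior(z, posteriors):
    """log p(z) for a VampPrior: uniform mixture of K Gaussian posteriors,
    each given as a (mu, sigma) pair produced by a pseudo-input u_k."""
    K = len(posteriors)
    logs = [gauss_logpdf(z, mu, s) for mu, s in posteriors]
    mx = max(logs)  # log-sum-exp trick for numerical stability
    return mx + math.log(sum(math.exp(l - mx) for l in logs)) - math.log(K)

lp = vamp_log_prior(0.0, [(-1.0, 0.5), (1.0, 0.5)])
```

<p>Unlike a fixed standard Gaussian, the mixture can place mass wherever the pseudo-inputs move it, approaching the aggregated posterior as K grows.</p>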
<p>In order to prevent the inactive stochastic units problem, this paper suggests a two-layered VAE.</p>
<p><img src="/assets/images/vpvae.png" alt="image" class="center-image" width="50px" /></p>
<pre class="MathJax_Preview"><code>q_{\phi}(z_1|x, z_2) q_{\psi}(z_2|x)\\
p_{\theta}(x|z_1, z_2) p_{\lambda}(z_1|z_2)p(z_2)\\
p(z_2) = \frac{1}{K}\sum_{k=1}^K q_{\psi}(z_2|u_k)\\
p_{\lambda}(z_1|z_2) = \mathcal{N}(z_1|\mu_{\lambda}(z_2), diag(\sigma_{\lambda}^2(z_2)))\\
q_{\phi}(z_1|x, z_2) = \mathcal{N}(z_1|\mu_{\phi}(x, z_2), diag(\sigma_{\phi}^2(x, z_2)))\\
q_{\psi}(z_2|x) = \mathcal{N}(z_2|\mu_{\psi}(x), diag(\sigma_{\psi}^2(x)))</code></pre>
<script type="math/tex; mode=display">q_{\phi}(z_1|x, z_2) q_{\psi}(z_2|x)\\
p_{\theta}(x|z_1, z_2) p_{\lambda}(z_1|z_2)p(z_2)\\
p(z_2) = \frac{1}{K}\sum_{k=1}^K q_{\psi}(z_2|u_k)\\
p_{\lambda}(z_1|z_2) = \mathcal{N}(z_1|\mu_{\lambda}(z_2), diag(\sigma_{\lambda}^2(z_2)))\\
q_{\phi}(z_1|x, z_2) = \mathcal{N}(z_1|\mu_{\phi}(x, z_2), diag(\sigma_{\phi}^2(x, z_2)))\\
q_{\psi}(z_2|x) = \mathcal{N}(z_2|\mu_{\psi}(x), diag(\sigma_{\psi}^2(x)))</script>
<h2 id="so">So?</h2>
<p>The HVAE with VampPrior achieved good results on various datasets (MNIST, dynamic MNIST, OMNIGLOT, Caltech 101 Silhouettes, Frey Faces and Histopathology patches), not only in log-likelihood (LL) but also in sample quality, reducing the blurring problem of the standard VAE.</p>
<p><a href="https://arxiv.org/abs/1705.07120">Tomczak, Jakub M., and Max Welling. “VAE with a VampPrior.” arXiv preprint arXiv:1705.07120 (2017).</a></p>
Wed, 09 Jan 2019 09:35:59 +0000
https://lyusungwon.github.io/studies/2019/01/09/vpvae/
https://lyusungwon.github.io/studies/2019/01/09/vpvae/deep-learninggenerative-modelsstudiesLarge Scale GAN Training for High Fidelity Natural Image Synthesis<h2 id="why">WHY?</h2>
<p>Generating high-resolution images with GANs is difficult despite recent advances. This paper suggests BigGAN, which adds a few tricks to a previous model to generate large-scale images without <a href="https://lyusungwon.github.io/generative-models/2019/01/02/pggan.html">progressively growing the network</a>.</p>
<h2 id="what">WHAT?</h2>
<p>BigGAN builds a series of tricks on top of a baseline model: Self-Attention GAN (SA-GAN), which uses self-attention modules in both the generator and the discriminator and is trained with the hinge loss. Class information is provided to the generator with class-conditional BatchNorm and to the discriminator with projection. <a href="https://lyusungwon.github.io/generative-models/2018/07/31/sgan.html">Spectral Norm</a> is used on the generator. On top of this baseline model, five major tricks are used.</p>
<p><img src="/assets/images/biggan1.png" alt="image" class="center-image" width="50px" /></p>
<p>1) Simply increasing the batch size (by a factor of 8: 256 to 2048) greatly improved the performance (46% in IS score). 2) Increasing the capacity by increasing the number of channels (by 50%) in every layer improved the performance (21% in IS score). 3) A shared class embedding which is linearly projected to each BatchNorm layer and 4) a variant of hierarchical latent space further improved the performance.</p>
<p>The authors also explored new kinds of prior distribution. It turned out that 5) a truncated normal distribution can be used to control the trade-off between variety and fidelity (the Truncation Trick). A modified version (below) of Orthogonal Regularization (above) was found to help reduce the saturation artifacts caused by the Truncation Trick.</p>
<pre class="MathJax_Preview"><code>R_{\beta}(W) = \beta\|W^TW - I\|_F^2\\
R_{\beta}(W) = \beta\|W^T W\odot(1-I)\|^2_F</code></pre>
<script type="math/tex; mode=display">R_{\beta}(W) = \beta\|W^TW - I\|_F^2\\
R_{\beta}(W) = \beta\|W^T W\odot(1-I)\|^2_F</script>
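<p>The difference between the two penalties is easy to see on a scaled orthogonal matrix: the original version also punishes filter norms deviating from 1, while the BigGAN variant only penalizes off-diagonal correlation between filters. A tiny pure-Python sketch (beta follows the value used in the paper, to my knowledge 1e-4):</p>

```python
def frob2(M):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(v * v for row in M for v in row)

def gram(W):
    """Compute W^T W for W given as a list of rows."""
    cols = list(zip(*W))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def ortho_reg(W, beta=1e-4):
    """Original penalty: beta * ||W^T W - I||_F^2."""
    G = gram(W)
    n = len(G)
    return beta * frob2([[G[i][j] - (1.0 if i == j else 0.0) for j in range(n)]
                         for i in range(n)])

def ortho_reg_biggan(W, beta=1e-4):
    """BigGAN variant: beta * ||W^T W * (1 - I)||_F^2, i.e. only the
    off-diagonal entries are penalized; filter norms stay unconstrained."""
    G = gram(W)
    n = len(G)
    return beta * frob2([[G[i][j] if i != j else 0.0 for j in range(n)]
                         for i in range(n)])

W = [[2.0, 0.0], [0.0, 2.0]]  # orthogonal directions, but norm 2 per column
```

<p>For this W the original penalty is positive while the BigGAN variant is exactly zero, which is the relaxation the paper exploits.</p>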
<h2 id="so">So?</h2>
<p>After a comprehensive analysis of the training of the model, this paper concludes that stability comes from the interaction between the discriminator and the generator, so that reasonable conditioning, together with relaxing constraints to allow collapse at a later stage, is helpful to achieve good results.</p>
<p><img src="/assets/images/biggan2.png" alt="image" class="center-image" width="50px" /></p>
<p>BigGAN was able to generate high-quality, large-scale images by both quantitative and qualitative measures.</p>
<p><a href="https://arxiv.org/abs/1809.11096">Brock, Andrew, Jeff Donahue, and Karen Simonyan. “Large scale gan training for high fidelity natural image synthesis.” arXiv preprint arXiv:1809.11096 (2018).</a></p>
Tue, 08 Jan 2019 14:53:59 +0000
https://lyusungwon.github.io/studies/2019/01/08/biggan/
https://lyusungwon.github.io/studies/2019/01/08/biggan/deep-learninggenerative-modelsstudiesBilinear Attention Networks<h2 id="why">WHY?</h2>
<p>Representing the bilinear relationship of two inputs is expensive. <a href="https://lyusungwon.github.io/deep-learning/2018/09/19/mlb.html">MLB</a> efficiently reduced the number of parameters by substituting the bilinear operation with a Hadamard product. This paper extends this idea to capture bilinear attention between two multi-channel inputs.</p>
<h2 id="what">WHAT?</h2>
<p>Using low-rank bilinear pooling, attention on visual inputs given a question vector can be computed efficiently. This can be generalized to represent a bilinear model for two multi-channel inputs, where <code class="MathJax_Preview">\mathbf{f}_k'</code><script type="math/tex">\mathbf{f}_k'</script> indicates the k-th element of the intermediate representation,</p>
<pre class="MathJax_Preview"><code>\mathbf{f}_k' = (\mathbf{X}^T\mathbf{U}')^T_k\mathcal{K}(\mathbf{Y}^T\mathbf{V}')_k\\
= \sum_{i=1}^{\rho}\sum_{j=1}^{\phi}\mathcal{A}_{ij}(\mathbf{X}^T_i\mathbf{U}_k')(\mathbf{V}'_k^T\mathbf{Y}_j) = \sum_{i=1}^{\rho}\sum_{j=1}^{\phi}\mathcal{A}_{ij}\mathbf{X}_i^T(\mathbf{U}_k'\mathbf{V}_k'^T)\mathbf{Y}_j\\
\mathbf{f} = \mathbf{P}^T\mathbf{f}'\\
= BAN(\mathbf{X}, \mathbf{Y}; \mathcal{A})</code></pre>
<script type="math/tex; mode=display">\mathbf{f}_k' = (\mathbf{X}^T\mathbf{U}')^T_k\mathcal{K}(\mathbf{Y}^T\mathbf{V}')_k\\
= \sum_{i=1}^{\rho}\sum_{j=1}^{\phi}\mathcal{A}_{ij}(\mathbf{X}^T_i\mathbf{U}_k')(\mathbf{V}'_k^T\mathbf{Y}_j) = \sum_{i=1}^{\rho}\sum_{j=1}^{\phi}\mathcal{A}_{ij}\mathbf{X}_i^T(\mathbf{U}_k'\mathbf{V}_k'^T)\mathbf{Y}_j\\
\mathbf{f} = \mathbf{P}^T\mathbf{f}'\\
= BAN(\mathbf{X}, \mathbf{Y}; \mathcal{A})</script>
<p>Attention between two input channels can be calculated similarly as in MLB.</p>
<pre class="MathJax_Preview"><code>\mathcal{A}_g = softmax(((1\cdot\mathbf{p}_g^T)\circ \mathbf{X}^T\mathbf{U})\mathbf{V}^T\mathbf{Y})</code></pre>
<script type="math/tex; mode=display">\mathcal{A}_g = softmax(((1\cdot\mathbf{p}_g^T)\circ \mathbf{X}^T\mathbf{U})\mathbf{V}^T\mathbf{Y})</script>
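<p>The attention map reduces to a softmax over low-rank bilinear logits between every pair of channels. A small pure-Python sketch with hypothetical 2-d features and a single glimpse; for simplicity the softmax here is taken over all pairs at once:</p>

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def bilinear_attention(X, Y, U, V, p):
    """Low-rank bilinear attention map between channels of X and Y.
    logit[i][j] = p . ((U x_i) * (V y_j)), softmax-normalized over all pairs."""
    logits = [[sum(pk * a * b
                   for pk, a, b in zip(p, matvec(U, xi), matvec(V, yj)))
               for yj in Y] for xi in X]
    flat = [v for row in logits for v in row]
    mx = max(flat)  # stabilized softmax
    Z = sum(math.exp(v - mx) for v in flat)
    return [[math.exp(v - mx) / Z for v in row] for row in logits]

A = bilinear_attention(
    X=[[1.0, 0.0], [0.0, 1.0]],          # two image-region features
    Y=[[1.0, 0.0]],                       # one question-token feature
    U=[[1.0, 0.0], [0.0, 1.0]],           # toy projection matrices
    V=[[1.0, 0.0], [0.0, 1.0]],
    p=[1.0, 1.0])
```

<p>The region aligned with the question token receives most of the attention mass, which is the pairing behavior the bilinear form is designed to capture.</p>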
<p>BAN thus amounts to a linear projection over multiple glimpses of co-attention: low-rank bilinear pooling allowed efficient calculation of multiple attention distributions given a single reference vector, and BAN generalizes this to two multi-channel inputs.</p>
<p><a href="https://arxiv.org/abs/1805.07932">Kim, Jin-Hwa, Jaehyun Jun, and Byoung-Tak Zhang. “Bilinear attention networks.” Advances in Neural Information Processing Systems. 2018.</a></p>
Mon, 07 Jan 2019 09:55:59 +0000
https://lyusungwon.github.io/studies/2019/01/07/ban/
https://lyusungwon.github.io/studies/2019/01/07/ban/deep-learningcomputer-visionstudiesHybrid computing using a neural network with dynamic external memory<h2 id="why">WHY?</h2>
<p>Using external memory as in a modern computer enables a neural network to use extensible memory. This paper suggests the Differentiable Neural Computer (DNC), an advanced version of the <a href="https://lyusungwon.github.io/deep-learning/2018/06/06/ntm.html">Neural Turing Machine</a>.</p>
<h2 id="what">WHAT?</h2>
<p>Reading and writing in DNC are implemented with differentiable attention mechanism.</p>
<p><img src="/assets/images/dnc1.png" alt="image" class="center-image" width="50px" /></p>
<p>The controller of DNC is a variant of the LSTM architecture that takes the concatenation of an input vector(<code class="MathJax_Preview">x_t</code><script type="math/tex">x_t</script>) and a set of read vectors(<code class="MathJax_Preview">r_{t-1}^1,...,r_{t-1}^R</code><script type="math/tex">r_{t-1}^1,...,r_{t-1}^R</script>) as input. This input is concatenated with the hidden vectors from both the previous timestep(<code class="MathJax_Preview">h_{t-1}^l</code><script type="math/tex">h_{t-1}^l</script>) and the previous layer(<code class="MathJax_Preview">h_t^{l-1}</code><script type="math/tex">h_t^{l-1}</script>) to produce the next hidden vector(<code class="MathJax_Preview">h_t^l</code><script type="math/tex">h_t^l</script>). Hidden vectors from all layers at a timestep are concatenated to emit an output vector(<code class="MathJax_Preview">\upsilon_t</code><script type="math/tex">\upsilon_t</script>) and an interface vector(<code class="MathJax_Preview">\xi_t</code><script type="math/tex">\xi_t</script>). The output vector(<code class="MathJax_Preview">y_t</code><script type="math/tex">y_t</script>) is the sum of <code class="MathJax_Preview">\upsilon_t</code><script type="math/tex">\upsilon_t</script> and a linear projection of the read vectors of the current timestep.</p>
<pre class="MathJax_Preview"><code>v_t = W_y[h_t^1;...;h_t^L]\\
\xi_t = W_{\xi}[h_t^1;...;h_t^L]\\
y_t = \upsilon_t + W_t[r_t^1;...;r_t^R]</code></pre>
<script type="math/tex; mode=display">v_t = W_y[h_t^1;...;h_t^L]\\
\xi_t = W_{\xi}[h_t^1;...;h_t^L]\\
y_t = \upsilon_t + W_t[r_t^1;...;r_t^R]</script>
<p>The interface vector consists of many components that interact with memory: R read keys(<code class="MathJax_Preview">\mathbf{k}_t^{r,i}\in R^W</code><script type="math/tex">\mathbf{k}_t^{r,i}\in R^W</script>), read strengths(<code class="MathJax_Preview">\beta_t^{r,i}</code><script type="math/tex">\beta_t^{r,i}</script>), a write key(<code class="MathJax_Preview">\mathbf{k}_t^w\in R^W</code><script type="math/tex">\mathbf{k}_t^w\in R^W</script>), a write strength(<code class="MathJax_Preview">\beta_t^w</code><script type="math/tex">\beta_t^w</script>), an erase vector(<code class="MathJax_Preview">\mathbf{e}_t\in R^W</code><script type="math/tex">\mathbf{e}_t\in R^W</script>), a write vector(<code class="MathJax_Preview">\mathbf{v}_t\in R^W</code><script type="math/tex">\mathbf{v}_t\in R^W</script>), R free gates(<code class="MathJax_Preview">f_t^i</code><script type="math/tex">f_t^i</script>), the allocation gate(<code class="MathJax_Preview">g_t^a</code><script type="math/tex">g_t^a</script>), the write gate(<code class="MathJax_Preview">g_t^w</code><script type="math/tex">g_t^w</script>) and R read modes(<code class="MathJax_Preview">\mathbf{\pi}_t^i</code><script type="math/tex">\mathbf{\pi}_t^i</script>).</p>
<pre class="MathJax_Preview"><code>\mathbf{\xi}_t = [\mathbf{k}_t^{r,1};...;\mathbf{k}_t^{r,R};\beta_t^{r,1};...;\beta_t^{r,R};\mathbf{k}_t^w;\beta_t^w;\mathbf{e}_t;\mathbf{v}_t;f_t^1;...;f_t^R;g_t^a;g_t^w;\mathbf{\pi}_t^1;...;\mathbf{\pi}_t^R]</code></pre>
<script type="math/tex; mode=display">\mathbf{\xi}_t = [\mathbf{k}_t^{r,1};...;\mathbf{k}_t^{r,R};\beta_t^{r,1};...;\beta_t^{r,R};\mathbf{k}_t^w;\beta_t^w;\mathbf{e}_t;\mathbf{v}_t;f_t^1;...;f_t^R;g_t^a;g_t^w;\mathbf{\pi}_t^1;...;\mathbf{\pi}_t^R]</script>
<p>Read vectors are computed with read weights over the memory. The memory matrix is updated with the write weights, the write vector and the erase vector.</p>
<pre class="MathJax_Preview"><code>\mathbf{r}_t^i = M_t^T\mathbf{w}_t^{r,i}\\
M_t = M_{t-1}\odot(E-\mathbf{w}^w_t\mathbf{e}_t^T)+\mathbf{w}^w_t\mathbf{v}_t^T</code></pre>
<script type="math/tex; mode=display">\mathbf{r}_t^i = M_t^T\mathbf{w}_t^{r,i}\\
M_t = M_{t-1}\odot(E-\mathbf{w}^w_t\mathbf{e}_t^T)+\mathbf{w}^w_t\mathbf{v}_t^T</script>
<p>Memory is addressed with content-based addressing and dynamic memory allocation. Content-based addressing is basically the same as the attention mechanism. Dynamic memory allocation is designed to free memory, analogous to the free-list memory allocation scheme.</p>
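<p>Content-based addressing can be sketched in a few lines: a key is compared to every memory row by cosine similarity, and the strength beta sharpens the resulting softmax. This toy sketch covers only the content-based part, not the allocation or temporal-link mechanisms the full DNC mixes in:</p>

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)) + 1e-8
    return num / den

def content_weights(memory, key, beta):
    """Content-based addressing: softmax over cosine similarities between the
    key and each memory row, sharpened by the strength beta."""
    scores = [beta * cosine(row, key) for row in memory]
    mx = max(scores)  # stabilized softmax
    Z = sum(math.exp(s - mx) for s in scores)
    return [math.exp(s - mx) / Z for s in scores]

M = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # toy memory with three rows
w = content_weights(M, key=[1.0, 0.0], beta=10.0)
```

<p>A larger beta concentrates the weighting on the best-matching row; beta near zero gives an almost uniform read.</p>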
<h2 id="so">So?</h2>
<p><img src="/assets/images/dnc1.png" alt="image" class="center-image" width="50px" /></p>
<p>DNC showed good results on the bAbI task and graph tasks.</p>
<p><a href="https://www.nature.com/articles/nature20101">Graves, Alex, et al. “Hybrid computing using a neural network with dynamic external memory.” Nature 538.7626 (2016): 471.</a></p>
Fri, 04 Jan 2019 09:01:59 +0000
https://lyusungwon.github.io/studies/2019/01/04/dnc/
https://lyusungwon.github.io/studies/2019/01/04/dnc/deep-learningdeep-learningstudiesSSD: Single Shot MultiBox Detector<h2 id="why">WHY?</h2>
<p>The object box proposal process is complicated and slow in object detection pipelines. This paper proposes the Single Shot Detector (SSD) to detect objects with a single neural network.</p>
<h2 id="what">WHAT?</h2>
<p><img src="/assets/images/ssd1.png" alt="image" class="center-image" width="50px" /></p>
<p>SSD produces a fixed-size collection of bounding boxes and scores the presence of class objects in the boxes.</p>
<p><img src="/assets/images/ssd2.png" alt="image" class="center-image" width="50px" /></p>
<p>The front of SSD is a standard classification model with the classification layer truncated (the base network). After the base network, convolutional layers are added that decrease in size progressively. These progressively decreasing feature layers represent bounding boxes at multiple granularities. For each feature map, classifiers for k box shapes are applied over 3 x 3 areas to compute the scores of c class labels and 4 relative offsets. For an m x n feature map, 3 x 3 filters with k x (c + 4) channels are applied to produce (c + 4)kmn outputs.</p>
<p>This process requires ground-truth bounding boxes. Default boxes whose Jaccard overlap with a ground-truth box is higher than a threshold (0.5) are considered matches. The training objective consists of two parts, the localization loss and the confidence loss, where <code class="MathJax_Preview">x_{ij}^p</code><script type="math/tex">x_{ij}^p</script> indicates the matching of the i-th default box to the j-th ground truth box of category p.</p>
<pre class="MathJax_Preview"><code>L(x, c, l, g) = \frac{1}{N}(L_{conf}(x,c) + \alpha L_{loc}(x, l, g))\\
L_{loc}(x, l, g) = \sum^N_{i\in Pos}\sum_{m\in\{cx, cy, w, h\}} x_{ij}^k smooth_{L1}(l_i^m - \hat{g}_j^m)\\
\hat{g}_j^{cx} = \frac{(g_j^{cx} - d_i^{cx})}{d_i^w}\\
\hat{g}_j^{cy} = \frac{(g_j^{cy} - d_i^{cy})}{d_i^h}\\
\hat{g}_j^{w} = \log \frac{g_j^{w}}{d_i^w}\\
\hat{g}_j^{h} = \log \frac{g_j^{h}}{d_i^h}\\
L_{conf}(x, c) = - \sum^N_{i\in Pos}x_{ij}^p \log(\hat{c}_i^p) - \sum_{i\in Neg}\log(\hat{c}_i^0)</code></pre>
<script type="math/tex; mode=display">L(x, c, l, g) = \frac{1}{N}(L_{conf}(x,c) + \alpha L_{loc}(x, l, g))\\
L_{loc}(x, l, g) = \sum^N_{i\in Pos}\sum_{m\in\{cx, cy, w, h\}} x_{ij}^k smooth_{L1}(l_i^m - \hat{g}_j^m)\\
\hat{g}_j^{cx} = \frac{(g_j^{cx} - d_i^{cx})}{d_i^w}\\
\hat{g}_j^{cy} = \frac{(g_j^{cy} - d_i^{cy})}{d_i^h}\\
\hat{g}_j^{w} = \log \frac{g_j^{w}}{d_i^w}\\
\hat{g}_j^{h} = \log \frac{g_j^{h}}{d_i^h}\\
L_{conf}(x, c) = - \sum^N_{i\in Pos}x_{ij}^p \log(\hat{c}_i^p) - \sum_{i\in Neg}\log(\hat{c}_i^0)</script>
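<p>The offset encoding in the localization targets above is invertible, which is what lets the regressed values be decoded back into boxes at test time. A minimal sketch with a hypothetical default box and ground-truth box in (cx, cy, w, h) form:</p>

```python
import math

def encode_offsets(g, d):
    """Encode a ground-truth box g relative to a default box d,
    following the localization-loss targets: centers are offset relative
    to the default box size, widths/heights become log ratios."""
    gcx, gcy, gw, gh = g
    dcx, dcy, dw, dh = d
    return ((gcx - dcx) / dw, (gcy - dcy) / dh,
            math.log(gw / dw), math.log(gh / dh))

def decode_offsets(t, d):
    """Invert the encoding to recover the box the regressor points at."""
    tcx, tcy, tw, th = t
    dcx, dcy, dw, dh = d
    return (dcx + tcx * dw, dcy + tcy * dh,
            dw * math.exp(tw), dh * math.exp(th))

g = (0.55, 0.5, 0.2, 0.4)   # ground truth: shifted right, twice as tall
d = (0.5, 0.5, 0.2, 0.2)    # matched default box
t = encode_offsets(g, d)
```

<p>Normalizing by the default box size keeps the regression targets in a similar range regardless of box scale.</p>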
<p>Boxes of different aspect ratios are proposed for each feature map.</p>
<pre class="MathJax_Preview"><code>s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k-1), k\in[1,m]</code></pre>
<script type="math/tex; mode=display">s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k-1), k\in[1,m]</script>
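<p>For example, with s_min = 0.2, s_max = 0.9 (the values given in the paper) and m = 6 feature maps, the formula yields evenly spaced scales:</p>

```python
def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Scale of default boxes for each of the m feature maps,
    linearly spaced between s_min and s_max per the SSD scale formula."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

scales = default_box_scales(6)  # coarse layers get large boxes, fine layers small ones
```
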
<p>With feature maps of different granularities and aspect ratios, SSD can cover a wide range of boxes.</p>
<h2 id="so">So?</h2>
<p><img src="/assets/images/ssd3.png" alt="image" class="center-image" width="50px" /></p>
<p>SSD achieved better results on PASCAL VOC2007, 2012 and COCO than Fast and Faster R-CNN.</p>
<h2 id="critic">Critic</h2>
<p>Integrating a redundant and slow process into the neural network seems like a convenient idea. However, I think there might be a better way to suggest boxes than proposing hundreds of them.</p>
<p><a href="https://arxiv.org/abs/1512.02325">Liu, Wei, et al. “Ssd: Single shot multibox detector.” European conference on computer vision. Springer, Cham, 2016.</a></p>
Thu, 03 Jan 2019 09:31:59 +0000
https://lyusungwon.github.io/studies/2019/01/03/ssd/
https://lyusungwon.github.io/studies/2019/01/03/ssd/deep-learningcomputer-visionstudiesProgressive Growing of GANs for improved Quality, Stability, and Variation<h2 id="why">WHY?</h2>
<p>Training GANs on high-resolution images is known to be difficult.</p>
<h2 id="what">WHAT?</h2>
<p>This paper suggests a new method of training GANs progressively from coarse to fine scales.</p>
<p><img src="/assets/images/pggan1.png" alt="image" class="center-image" width="50px" /></p>
<p>A generator and a discriminator are first trained with low-resolution real and fake images. As the input image size grows, the generator and discriminator each add a layer on top of the previously learned layers.</p>
<p><img src="/assets/images/pggan2.png" alt="image" class="center-image" width="50px" /></p>
<p>A new layer is added smoothly to preserve the previously learned layers. Doubling and halving the image resolution are implemented with nearest-neighbor filtering and average pooling, respectively. The weight <code class="MathJax_Preview">\alpha</code><script type="math/tex">\alpha</script> increases linearly from 0 to 1.</p>
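<p>The smooth fade-in amounts to linearly blending the new layer's output with the upsampled output of the old path. A minimal NumPy sketch (not the authors' code; names are my own), assuming <code>(N, C, H, W)</code> arrays:</p>

```python
import numpy as np

def fade_in(old_out, new_out, alpha):
    """Blend the upsampled old (lower-resolution) path with the new
    layer's output. alpha ramps linearly from 0 to 1 during the
    transition; at alpha = 1 only the new layer remains."""
    return (1.0 - alpha) * old_out + alpha * new_out

def nearest_neighbor_upsample(x):
    """Double spatial resolution by repeating each pixel 2x2.
    x has shape (N, C, H, W)."""
    return x.repeat(2, axis=2).repeat(2, axis=3)
```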
<p>To increase variation, the standard deviation of each feature over the minibatch is averaged into a single value, replicated across all spatial locations, and used as an additional feature map near the end of the discriminator.</p>
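<p>A minimal sketch of that minibatch standard-deviation feature map (my own simplification of the layer, assuming <code>(N, C, H, W)</code> NumPy arrays):</p>

```python
import numpy as np

def minibatch_stddev_feature(x):
    """x: (N, C, H, W). Compute the std of every feature over the
    batch, average into one scalar, and replicate it as an extra
    feature map of shape (N, 1, H, W) to concatenate onto x."""
    std = x.std(axis=0)       # (C, H, W): per-feature std over the batch
    mean_std = std.mean()     # single scalar statistic
    n, _, h, w = x.shape
    return np.full((n, 1, h, w), mean_std)
```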
<p>In order to reduce unhealthy competition between the generator and discriminator, this paper suggests two methods. The first is to scale the weights at runtime with the per-layer normalization constant from He’s initializer. Compared to adaptive SGD methods, this applies an equalized learning rate across all weights. The second is to normalize the feature vector in each pixel of the generator to unit length after each convolutional layer, using a variant of “local response normalization”. This prevents the magnitudes in both networks from escalating excessively.</p>
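<p>Both tricks are simple elementwise operations; here is a NumPy sketch of each (my own names, not the authors' code):</p>

```python
import numpy as np

def pixel_norm(x, eps=1e-8):
    """Normalize the feature vector at each pixel to roughly unit
    length: x / sqrt(mean over channels of x^2 + eps).
    x has shape (N, C, H, W)."""
    return x / np.sqrt((x ** 2).mean(axis=1, keepdims=True) + eps)

def he_runtime_scale(weight_shape):
    """Per-layer constant from He's initializer, applied at runtime
    (equalized learning rate): w_hat = w * sqrt(2 / fan_in).
    weight_shape = (out_ch, in_ch, kh, kw)."""
    fan_in = int(np.prod(weight_shape[1:]))
    return np.sqrt(2.0 / fan_in)
```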
<h2 id="so">So?</h2>
<p><img src="/assets/images/pggan3.png" alt="image" class="center-image" width="50px" /></p>
<p>This paper suggests using the sliced Wasserstein distance (SWD) over MS-SSIM as a metric. PGGAN produced better high-resolution images than previous methods by both quantitative and qualitative measures.</p>
<h2 id="critic">Critic</h2>
<p>Progressive training of GANs is not only intuitive but also a highly effective way to train the networks. Amazing image quality!</p>
<p><a href="https://arxiv.org/abs/1710.10196">Karras, Tero, et al. “Progressive growing of gans for improved quality, stability, and variation.” arXiv preprint arXiv:1710.10196 (2017).</a></p>
Wed, 02 Jan 2019 11:31:59 +0000
https://lyusungwon.github.io/studies/2019/01/02/pggan/
https://lyusungwon.github.io/studies/2019/01/02/pggan/
deep-learning, generative-models, studies
BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding<h2 id="why">WHY?</h2>
<p>The former <a href="https://lyusungwon.github.io/natural-language-processing/2018/03/21/transformer.html">Transformer</a> was a unidirectional language model.</p>
<h2 id="what">WHAT?</h2>
<p><img src="/assets/images/bert1.png" alt="image" class="center-image" width="50px" /></p>
<p>BERT is a multi-layer bidirectional Transformer encoder.</p>
<p><img src="/assets/images/bert2.png" alt="image" class="center-image" width="50px" /></p>
<p>An input sequence can be either a single text sentence or a pair of text sentences. The first token of every sequence is a classification token, and sentences in a sequence are separated by a separation token. Each token in the sequence is the sum of three components: token embeddings, segment embeddings, and position embeddings. The two sentences in a sequence are additionally differentiated by the segment embeddings.</p>
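<p>The three-way sum can be sketched as follows (a minimal toy illustration, not the paper's dimensions; the sizes and names are my own):</p>

```python
import numpy as np

# Toy embedding tables: BERT's input representation for each token is
# the elementwise sum of token, segment, and position embeddings.
rng = np.random.default_rng(0)
vocab_size, max_len, hidden = 100, 16, 8
tok_emb = rng.standard_normal((vocab_size, hidden))
seg_emb = rng.standard_normal((2, hidden))        # sentence A vs. sentence B
pos_emb = rng.standard_normal((max_len, hidden))

def embed(token_ids, segment_ids):
    """Sum the three embedding lookups for a sequence of token ids."""
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]
```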
<p>Two kinds of pretraining tasks are used for BERT. Since BERT is bidirectional, a masked language modeling task is used: masked LM randomly chooses words in the corpus, replaces them with a mask token, and feeds the hidden vector of the mask token to the softmax. BERT randomly chooses 15% of the tokens; 80% of the chosen words are turned into mask tokens, 10% are replaced with a random word, and the remaining 10% are kept unchanged. The second task is next-sentence prediction. The training loss is the sum of the mean masked LM likelihood and the mean next-sentence-prediction likelihood.</p>
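<p>The 80/10/10 corruption procedure can be sketched like this (a simplified illustration in plain Python; function and parameter names are my own, and real implementations work on subword ids rather than words):</p>

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", vocab=None, p=0.15, seed=0):
    """BERT-style masking: choose about p of the positions; of those,
    80% become the mask token, 10% a random word, and 10% stay
    unchanged. Returns the corrupted tokens and the chosen positions
    (the prediction targets)."""
    rng = random.Random(seed)
    vocab = vocab or tokens
    out, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < p:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                out[i] = mask_token          # 80%: mask
            elif r < 0.9:
                out[i] = rng.choice(vocab)   # 10%: random word
            # else: 10% kept unchanged
    return out, targets
```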
<p><img src="/assets/images/bert3.png" alt="image" class="center-image" width="50px" /></p>
<p>Fine-tuning differs for each task. A sentence-level classification task only needs a weight matrix W multiplied with the classification token. Start and end vectors are trained to estimate the answer span in the question-answering task. A classifier is trained to label each token in the named-entity recognition task.</p>
<h2 id="so">So?</h2>
<p>BERT achieved state-of-the-art results on nearly every language task, including the GLUE datasets, the SQuAD dataset, a named-entity recognition task, and the SWAG dataset, with a single model. A broad ablation study is conducted to prove the effectiveness of BERT.</p>
<h2 id="critic">Critic</h2>
<p>No argument about the incredible results, but I’m not sure what BERT is actually doing. Work on interpretability could be useful to examine whether BERT is really solving the tasks.</p>
<p><a href="https://arxiv.org/abs/1810.04805">Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).</a></p>
Wed, 02 Jan 2019 09:00:59 +0000
https://lyusungwon.github.io/studies/2019/01/02/bert/
https://lyusungwon.github.io/studies/2019/01/02/bert/
deep-learning, natural-language-processing, studies