A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues
WHY?
The hierarchical recurrent encoder-decoder (HRED) model, which aims to capture the hierarchical structure of sequential data, tends to fail because the model is encouraged to capture only local structure and the LSTM often suffers from vanishing gradients.
WHAT?
The Latent Variable Hierarchical Recurrent Encoder-Decoder (VHRED) tries to improve HRED by forcing the model to learn a stochastic latent variable z with variational inference. The generative process samples z_n from a prior conditioned on the previous utterances w_{<n}, and the inference process infers z_n from a posterior that also sees the current utterance w_n.
P_{\theta}(z_n \mid w_{<n}) = \mathcal{N}(\mu_{\text{prior}}(w_{<n}), \Sigma_{\text{prior}}(w_{<n}))\\
P_{\theta}(w_n \mid z_n, w_{<n}) = \prod_{m=1}^{M_n} P_{\theta}(w_{n,m} \mid z_n, w_{<n}, w_{n,<m})\\
\log P_{\theta}(w_1, \dots, w_N) \geq \sum_{n=1}^{N} -KL[Q_{\psi}(z_n \mid w_{\leq n}) \,\|\, P_{\theta}(z_n \mid w_{<n})] + E_{Q_{\psi}(z_n \mid w_{\leq n})}[\log P_{\theta}(w_n \mid z_n, w_{<n})]\\
Q_{\psi}(z_n \mid w_{\leq N}) \approx Q_{\psi}(z_n \mid w_{\leq n}) = \mathcal{N}(\mu_{\text{posterior}}(w_{\leq n}), \Sigma_{\text{posterior}}(w_{\leq n}))
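As a concrete illustration, here is a minimal PyTorch sketch of the latent step these equations describe: the prior and approximate posterior are parameterized as diagonal Gaussians, z_n is sampled with the reparameterization trick, and the KL term of the bound is computed analytically. The module and variable names (LatentStep, prior_net, post_net, context, utt_enc) are my own shorthand, not the paper's implementation.

```python
# Illustrative sketch of the VHRED utterance-level latent step (not the authors' code).
import torch
import torch.nn as nn


class LatentStep(nn.Module):
    """Prior/posterior Gaussians over the utterance-level latent z_n."""

    def __init__(self, context_dim, enc_dim, z_dim):
        super().__init__()
        # Prior conditions only on the context summary of w_{<n}.
        self.prior_net = nn.Linear(context_dim, 2 * z_dim)
        # Posterior additionally sees the encoding of the current utterance w_n.
        self.post_net = nn.Linear(context_dim + enc_dim, 2 * z_dim)

    @staticmethod
    def _split(params):
        # Split into mean and log-variance of a diagonal Gaussian.
        mu, log_var = params.chunk(2, dim=-1)
        return mu, log_var

    def forward(self, context, utt_enc):
        # Prior P(z_n | w_{<n}) and posterior Q(z_n | w_{<=n}).
        mu_p, logvar_p = self._split(self.prior_net(context))
        mu_q, logvar_q = self._split(self.post_net(torch.cat([context, utt_enc], dim=-1)))

        # Reparameterized sample from the posterior (used at training time).
        z = mu_q + torch.randn_like(mu_q) * torch.exp(0.5 * logvar_q)

        # Analytic KL[Q || P] between two diagonal Gaussians.
        kl = 0.5 * (
            logvar_p - logvar_q
            + (torch.exp(logvar_q) + (mu_q - mu_p) ** 2) / torch.exp(logvar_p)
            - 1.0
        ).sum(-1)
        return z, kl
```

The per-utterance term of the bound is then the decoder's reconstruction log-likelihood of w_n given z_n and the context, minus this KL; summing over the N utterances gives the training objective.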
So?
VHRED tends to perform better on long contexts than the LSTM and HRED baselines.
Critique
Applying variational inference to sequential data is a nice idea, but I'm not sure it significantly improves performance. This model could also be used to produce a good paragraph vector.