# Neural Variational Inference and Learning in Belief Networks

## WHY?

Directed latent variable models are known to be difficult to train at large scale because the posterior distribution over the latent variables is intractable.

## WHAT?

This paper proposes learning an inference model, parameterized by a feed-forward network, jointly with the generative model. Since the exact posterior `P_{\theta}(h|x)` is intractable, it is approximated by `Q_{\phi}(h|x)`, which yields a variational lower bound on the log-likelihood:

`\log P_{\theta}(x) = \log\sum_h P_{\theta}(x,h) \geq \sum_h Q_{\phi}(h|x) \log \frac{P_{\theta}(x,h)}{Q_{\phi}(h|x)} = E_Q[\log P_{\theta}(x, h) - \log Q_{\phi}(h|x)] = \mathcal{L}(x, \theta, \phi)`

`\mathcal{L}(x, \theta, \phi) = \log P_{\theta}(x) - KL(Q_{\phi}(h|x) \| P_{\theta}(h|x))`
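A minimal NumPy sketch of this bound, assuming a toy one-layer sigmoid belief network (all shapes, parameter names, and initializations below are illustrative, not the paper's architecture): the lower bound is estimated by averaging `\log P_{\theta}(x,h) - \log Q_{\phi}(h|x)` over samples from the inference network.

```python
# Assumed toy SBN: P(h) = Bernoulli(sigmoid(b_h)),
# P(x|h) = Bernoulli(sigmoid(W h + b_x)), Q(h|x) = Bernoulli(sigmoid(V x + c)).
import numpy as np

rng = np.random.default_rng(0)
D, H = 6, 4                                            # observed / latent sizes (arbitrary)
W, b_x = 0.1 * rng.normal(size=(D, H)), np.zeros(D)    # generative parameters theta
b_h = np.zeros(H)
V, c = 0.1 * rng.normal(size=(H, D)), np.zeros(H)      # inference parameters phi

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def bernoulli_logpmf(z, p):
    return np.sum(z * np.log(p) + (1 - z) * np.log(1 - p))

def elbo_estimate(x, n_samples=100):
    """Monte Carlo estimate of L(x) = E_Q[log P(x,h) - log Q(h|x)]."""
    q = sigmoid(V @ x + c)                     # Bernoulli means of Q(h|x)
    total = 0.0
    for _ in range(n_samples):
        h = (rng.random(H) < q).astype(float)  # h ~ Q(h|x)
        log_p = bernoulli_logpmf(h, sigmoid(b_h)) \
              + bernoulli_logpmf(x, sigmoid(W @ h + b_x))   # log P(x,h)
        log_q = bernoulli_logpmf(h, q)                       # log Q(h|x)
        total += log_p - log_q
    return total / n_samples

x = (rng.random(D) < 0.5).astype(float)        # a fake binary observation
print("ELBO estimate:", elbo_estimate(x))
```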

Because `h` is a discrete variable sampled from the inference network, the expectations above cannot be computed exactly, so the gradients of the lower bound are estimated by Monte Carlo sampling; the gradient with respect to the inference parameters, which flows through the stochastic variable `h`, uses the score function (REINFORCE) estimator:

`\nabla_{\theta}\mathcal{L}(x) = E_Q[\nabla_{\theta}\log P_{\theta}(x, h)]`

`\nabla_{\phi}\mathcal{L}(x) = E_Q[(\log P_{\theta}(x,h) - \log Q_{\phi}(h|x)) \nabla_{\phi}\log Q_{\phi}(h|x)]`

`\nabla_{\theta}\mathcal{L}(x) \approx \frac{1}{n}\sum^n_{i=1}\nabla_{\theta}\log P_{\theta}(x, h^{(i)})`

`\nabla_{\phi}\mathcal{L}(x) \approx \frac{1}{n}\sum^n_{i=1}(\log P_{\theta}(x,h^{(i)}) - \log Q_{\phi}(h^{(i)}|x)) \nabla_{\phi}\log Q_{\phi}(h^{(i)}|x)`
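A sketch of this score-function estimator for the inference parameters `\phi = (V, c)` of the same assumed toy model as above; for a Bernoulli with mean `sigmoid(a)`, the score with respect to the logits `a` is simply `h - q`.

```python
import numpy as np

def reinforce_grad_phi(x, W, b_x, b_h, V, c, n_samples=100,
                       rng=np.random.default_rng(1)):
    """Score-function (REINFORCE) estimate of grad_phi L(x) for the toy SBN
    above, where phi = (V, c) parameterizes Q(h|x)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    logpmf = lambda z, p: np.sum(z * np.log(p) + (1 - z) * np.log(1 - p))
    q = sigmoid(V @ x + c)                           # Bernoulli means of Q(h|x)
    gV, gc = np.zeros_like(V), np.zeros_like(c)
    for _ in range(n_samples):
        h = (rng.random(q.shape) < q).astype(float)  # h ~ Q(h|x)
        l = (logpmf(h, sigmoid(b_h))                 # log P(h)
             + logpmf(x, sigmoid(W @ h + b_x))       # + log P(x|h)
             - logpmf(h, q))                         # - log Q(h|x): learning signal
        score = h - q                  # d log Q(h|x) / d logits for a Bernoulli
        gc += l * score
        gV += l * np.outer(score, x)
    return gV / n_samples, gc / n_samples
```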

However, the score function estimator typically has high gradient variance, so variance reduction techniques are applied: a global baseline `c`, learned during training, and an input-dependent baseline `C_{\psi}(x)` are subtracted from the learning signal. The input-dependent baseline is itself trained to minimize the mean squared error of the centered signal:

`l_{\phi}(x, h) = \log P_{\theta}(x,h) - \log Q_{\phi}(h|x)`

`E_{Q}[(l_{\phi}(x, h) - C_{\psi}(x) - c)^2]`
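A sketch of the baseline correction on the same assumed toy model: the centered signal replaces `l_{\phi}(x,h)` in the REINFORCE update, and the baselines are regressed onto the signal by one gradient step on the squared error. The linear form of `C_{\psi}(x)` and the learning rate are assumptions here; the paper uses a small neural network for `C_{\psi}`.

```python
import numpy as np

def center_signal_and_update_baselines(l, x, u, d, c0, lr=1e-3):
    """Subtract the input-dependent baseline C_psi(x) = u @ x + d and the
    global baseline c0 from the learning signal l, then take one gradient
    step on (l - C_psi(x) - c0)^2 so the baselines keep tracking l.
    The centered signal is what multiplies grad_phi log Q(h|x) in REINFORCE."""
    centered = l - (u @ x + d) - c0
    u_new = u + lr * centered * x     # descent on the squared error w.r.t. u
    d_new = d + lr * centered         # ... w.r.t. d
    c0_new = c0 + lr * centered       # ... w.r.t. the global baseline
    return centered, u_new, d_new, c0_new
```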

To further stabilize training, the learning signal is also divided by a running estimate of its standard deviation whenever that estimate is greater than 1. If the inference network is structured as a chain of factorized conditionals (one per layer), a local learning signal can be used for each conditional, containing only the terms that the corresponding layer can actually influence:

`\nabla_{\phi_i}\mathcal{L}(x) = E_{Q(h^{1:i-1}|x)}[E_{Q(h^{i:n}|h^{i-1})}[l^i_{\phi}(x,h)\nabla_{\phi_i}\log Q_{\phi_i}(h^i|h^{i-1})]]`

`l^i_{\phi}(x, h) = \log P_{\theta}(h^{i-1:n}) - \log Q_{\phi}(h^{i:n}|h^{i-1})`
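A sketch of the variance normalization, assuming exponential running averages for the mean and variance of the learning signal (the decay constant 0.9 is an arbitrary choice): the signal is rescaled by the running standard deviation only when that estimate exceeds 1.

```python
import numpy as np

class SignalNormalizer:
    """Tracks running estimates of the learning signal's mean and variance,
    and divides the signal by the running standard deviation, but only when
    that standard deviation is greater than 1."""
    def __init__(self, alpha=0.9):
        self.alpha, self.mean, self.var = alpha, 0.0, 1.0

    def __call__(self, l):
        self.mean = self.alpha * self.mean + (1 - self.alpha) * l
        self.var = self.alpha * self.var + (1 - self.alpha) * (l - self.mean) ** 2
        return l / max(np.sqrt(self.var), 1.0)  # leave l unchanged if std <= 1
```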

With this factorization, a separate layer-dependent baseline also needs to be learned for each local learning signal, as sketched below.
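A sketch of the local learning signals for an assumed two-layer version of the toy model (again, all names and shapes are illustrative); in practice each `l^i` would be centered by its own layer-dependent baseline and normalized as above.

```python
import numpy as np

# Assumed two-layer toy SBN:
# P(h2) = Bern(sig(b2)), P(h1|h2) = Bern(sig(W2 h2 + b1)), P(x|h1) = Bern(sig(W1 h1 + bx))
# Q(h1|x) = Bern(sig(V1 x + c1)), Q(h2|h1) = Bern(sig(V2 h1 + c2))
rng = np.random.default_rng(2)
D, H1, H2 = 6, 4, 3
W1, bx = 0.1 * rng.normal(size=(D, H1)), np.zeros(D)
W2, b1 = 0.1 * rng.normal(size=(H1, H2)), np.zeros(H1)
b2 = np.zeros(H2)
V1, c1 = 0.1 * rng.normal(size=(H1, D)), np.zeros(H1)
V2, c2 = 0.1 * rng.normal(size=(H2, H1)), np.zeros(H2)

sig = lambda a: 1.0 / (1.0 + np.exp(-a))
logpmf = lambda z, p: np.sum(z * np.log(p) + (1 - z) * np.log(1 - p))

def local_signals(x):
    """Sample h ~ Q(h|x) layer by layer and return the local learning
    signals l^1 and l^2; each layer would also get its own baseline."""
    q1 = sig(V1 @ x + c1); h1 = (rng.random(H1) < q1).astype(float)
    q2 = sig(V2 @ h1 + c2); h2 = (rng.random(H2) < q2).astype(float)
    log_p_x_given_h1 = logpmf(x, sig(W1 @ h1 + bx))
    log_p_h1_given_h2 = logpmf(h1, sig(W2 @ h2 + b1))
    log_p_h2 = logpmf(h2, sig(b2))
    log_q1, log_q2 = logpmf(h1, q1), logpmf(h2, q2)
    # l^1 keeps every term, since layer 1 influences the whole chain.
    l1 = (log_p_x_given_h1 + log_p_h1_given_h2 + log_p_h2) - (log_q1 + log_q2)
    # l^2 drops the terms layer 2 cannot influence: log P(x|h1) and log Q(h1|x).
    l2 = (log_p_h1_given_h2 + log_p_h2) - log_q2
    return l1, l2
```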

## So?

An SBN trained with NVIL achieved better NLL on MNIST than the same SBN trained with the wake-sleep algorithm, and the paper compares the results against other models including DARN, NADE, RBM, and MoB. An SBN trained with NVIL also outperformed LDA on document modeling.

## Critique

This seems like a smart move, but the reparameterization trick of the VAE turned out to be very strong for continuous latent variables. NVIL can still be used in cases where the distribution is impossible to reparameterize, such as discrete latent variables.