Transformation Autoregressive Networks

WHY?

Autoregressive models have been a dominant approach to density estimation. Separately, non-linear transformations of variables make it possible to track how a density changes under an invertible map. Transformation Autoregressive Networks (TAN) combine non-linear transformations with autoregressive models to capture more complicated data densities.

WHAT?

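TAN is composed of two modules: an autoregressive model and a non-linear transformation. The autoregressive model represents the probability of the data as a product of sequential conditionals. Each conditional is a mixture model (MM) whose parameters $\theta$ are computed from a hidden state $h_i$ that summarizes the preceding dimensions.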

$p(x_1,...,x_d) = \prod^d_{i=1}p(x_i|x_{i-1},...,x_1)\\ p(x_i|x_{i-1},...,x_1) = p(x_i|MM(\theta(x_{i-1},...,x_1)))\\ \theta(x_{i-1},...,x_1) = f(h_i)\\ h_i = g_i(x_{i-1},...,x_1)$
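The factorization above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the conditionals are 1-D Gaussian mixtures, and `toy_params` is a hypothetical stand-in for the learned map $\theta(x_{i-1},...,x_1) = f(h_i)$, not the paper's network.

```python
import math

def mog_logpdf(x, weights, means, stds):
    """Log density of a 1-D Gaussian mixture at x (log-sum-exp for stability)."""
    terms = [
        math.log(w) - 0.5 * math.log(2 * math.pi) - math.log(s)
        - 0.5 * ((x - m) / s) ** 2
        for w, m, s in zip(weights, means, stds)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def autoregressive_logp(x, param_fn):
    """Chain rule: log p(x) = sum_i log p(x_i | x_{<i})."""
    return sum(mog_logpdf(x[i], *param_fn(x[:i])) for i in range(len(x)))

def toy_params(prefix):
    """Hypothetical theta(x_{<i}): the mixture means shift with the prefix sum."""
    shift = 0.5 * sum(prefix)
    weights = [0.3, 0.7]
    means = [shift - 1.0, shift + 1.0]
    stds = [1.0, 0.5]
    return weights, means, stds

logp = autoregressive_logp([0.2, -0.4, 1.1], toy_params)
```

Because each conditional only sees the prefix `x[:i]`, the per-dimension log densities add up to a valid joint log density.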

TAN proposes two kinds of autoregressive models. The first is the Linear Autoregressive Model (LAM), which has a different weight matrix for each conditional. The second is the Recurrent Autoregressive Model (RAM), which shares parameters across conditionals through a recurrent hidden state. The two models trade off flexibility against the number of parameters.

$g_i(x_{i-1},...,x_1) = W^{(i)}x_{<i} + b\\ h_i = g(x_{i-1}, g(x_{i-2},...,x_1)) = g(x_{i-1}, h_{i-1})$
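To make the difference between the two hidden-state computations concrete, here is a minimal sketch. The tanh cell used for $g$ in the recurrent case, the zero initial state, and all weight values are assumptions chosen for illustration; the paper's recurrent cell may differ.

```python
import math

def lam_hidden(x, weights, bias):
    """LAM: h_i = W^(i) x_{<i} + b, with a separate matrix W^(i) per dimension.

    weights[i] has one row per hidden unit and i columns (one per earlier x_j).
    """
    hidden = []
    for i in range(len(x)):
        prefix = x[:i]
        h_i = [sum(w * v for w, v in zip(row, prefix)) + b
               for row, b in zip(weights[i], bias)]
        hidden.append(h_i)
    return hidden

def ram_hidden(x, w_in, w_rec, bias):
    """RAM: h_i = g(x_{i-1}, h_{i-1}), one shared set of weights for every i.

    g is assumed to be a plain tanh RNN cell with a zero initial state.
    """
    k = len(bias)
    h = [0.0] * k                      # the first state sees no inputs
    hidden = [h]
    for x_prev in x[:-1]:
        h = [math.tanh(w_in[j] * x_prev
                       + sum(w_rec[j][m] * h[m] for m in range(k))
                       + bias[j])
             for j in range(k)]
        hidden.append(h)
    return hidden

x = [1.0, 2.0, 3.0]
hidden_size = 2
# One weight matrix per dimension: the i-th has shape (hidden_size, i).
lam_weights = [[[0.1] * i for _ in range(hidden_size)] for i in range(len(x))]
lam_states = lam_hidden(x, lam_weights, [0.0, 0.0])
ram_states = ram_hidden(x, [0.5, -0.5], [[0.1, 0.0], [0.0, 0.1]], [0.0, 0.0])
```

In this sketch the LAM weight count grows quadratically with the number of dimensions, while the RAM weight count is independent of it, which is the trade-off described above.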

To track the density through a transformation, the transformation must be invertible and the determinant of its Jacobian must be easy to compute. The paper suggests several transformations that meet these criteria.

$p(x_1, ..., x_d) = \left|\det \frac{dq}{dx}\right| \prod_{i=1}^d p(z_i|z_{i-1},...,z_1), \quad z = q(x)$

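The first is the Linear Transformation. The matrix $A$ is parameterized through its LU decomposition, so its determinant is the product of the diagonal entries of $U$ (taking $L$ to have a unit diagonal).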

$z = Ax + b\\ A = LU\\ \det\frac{dz}{dx} = \prod^d_{i=1} U_{ii}$
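A small sketch of why the LU form is convenient: the log-determinant is read directly off the diagonal of $U$ with no decomposition at evaluation time. It assumes $L$ is lower triangular with a unit diagonal; the matrices and the helper names `matmul` and `linear_transform` are illustrative, not taken from the paper.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product for small dense matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def linear_transform(x, lower, upper, bias):
    """z = A x + b with A = L U; returns z and log|det dz/dx|."""
    a = matmul(lower, upper)
    z = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(a, bias)]
    # With a unit-diagonal L, det A = prod_i U_ii.
    log_abs_det = sum(math.log(abs(upper[i][i])) for i in range(len(upper)))
    return z, log_abs_det

lower = [[1.0, 0.0, 0.0],
         [0.5, 1.0, 0.0],
         [-0.2, 0.3, 1.0]]
upper = [[2.0, 1.0, 0.0],
         [0.0, -1.5, 0.4],
         [0.0, 0.0, 0.5]]
z, log_abs_det = linear_transform([1.0, -1.0, 2.0], lower, upper, [0.0, 0.1, -0.1])
```

The transformation stays invertible as long as every diagonal entry of `upper` is non-zero.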

The second is the Recurrent Transformation, where $r$ is a ReLU unit and $r_{\alpha}$ is a leaky ReLU unit.

$z_i = r_{\alpha}(yx_i + w^Ts_{i-1} + b), s_i = r(ux_i + v^Ts_{i-1} + a)\\ r_{\alpha}(t) = \mathbb{I}\{t<0\}\alpha t + \mathbb{I}\{t \geq 0\}t\\ x_i = \frac{1}{y}(r^{-1}_{\alpha}(z_i)- w^T s_{i-1} - b)\\ s_i = r(ux_i + v^Ts_{i-1} + a)\\ \det\frac{dz}{dx} = y^d\prod^d_{i=1}r'_{\alpha}(yx_i + w^Ts_{i-1} + b)\\ r'_{\alpha}(t) = \mathbb{I}\{t>0\} + \alpha\mathbb{I}\{t < 0\}$
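The forward pass, the sequential inverse, and the log-determinant can be sketched as follows. The sketch assumes a vector-valued state with $s_0 = 0$, treats $u$ and $a$ as vectors and $v$ as a square matrix `V`, and uses made-up parameter values; it requires $y \neq 0$ and $\alpha > 0$ for invertibility.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def next_state(x_i, s, p):
    """s_i = r(u x_i + V s_{i-1} + a), with r the elementwise ReLU."""
    return [max(0.0, p["u"][j] * x_i + dot(p["V"][j], s) + p["a"][j])
            for j in range(len(s))]

def forward(x, p):
    """Returns z and log|det dz/dx| = d log|y| + sum_i log r'_alpha(t_i)."""
    s = [0.0] * len(p["w"])            # s_0 = 0 is an assumption
    z, log_abs_det = [], 0.0
    for x_i in x:
        t = p["y"] * x_i + dot(p["w"], s) + p["b"]
        z.append(t if t >= 0 else p["alpha"] * t)
        slope = 1.0 if t >= 0 else p["alpha"]   # t == 0 has measure zero
        log_abs_det += math.log(abs(p["y"])) + math.log(slope)
        s = next_state(x_i, s, p)
    return z, log_abs_det

def inverse(z, p):
    """Recovers x one dimension at a time, rebuilding the state as it goes."""
    s = [0.0] * len(p["w"])
    x = []
    for z_i in z:
        t = z_i if z_i >= 0 else z_i / p["alpha"]   # inverse leaky ReLU
        x_i = (t - dot(p["w"], s) - p["b"]) / p["y"]
        x.append(x_i)
        s = next_state(x_i, s, p)
    return x

params = {"y": 1.5, "w": [0.4, -0.3], "b": 0.1, "alpha": 0.1,
          "u": [0.5, -0.2], "V": [[0.1, 0.2], [0.0, 0.3]], "a": [0.05, 0.1]}
x = [0.3, -1.2, 0.8]
z, log_abs_det = forward(x, params)
x_back = inverse(z, params)
```

The Jacobian is lower triangular because $z_i$ depends only on $x_1,...,x_i$, which is why the determinant reduces to a product of per-dimension slopes.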

The third is the Recurrent Shift Transformation, which performs an additive shift based on the preceding dimensions. The determinant of the Jacobian is always 1 for this transformation.

$z_i = x_i + m(s_{i-1})\\ s_i = g(x_i, s_{i-1})$
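A minimal sketch of the shift and its inverse, assuming a scalar state with $s_0 = 0$. The functions `m` and `g` below are hypothetical placeholders for whatever learned shift and state-update functions are used.

```python
import math

def shift_forward(x, m, g, s0=0.0):
    """z_i = x_i + m(s_{i-1}), s_i = g(x_i, s_{i-1}); log|det| is 0."""
    s, z = s0, []
    for x_i in x:
        z.append(x_i + m(s))
        s = g(x_i, s)
    return z

def shift_inverse(z, m, g, s0=0.0):
    """x_i = z_i - m(s_{i-1}); the state is rebuilt from the recovered x_i."""
    s, x = s0, []
    for z_i in z:
        x_i = z_i - m(s)
        x.append(x_i)
        s = g(x_i, s)
    return x

def m(s):
    return 2.0 * s + 0.1                     # hypothetical shift function

def g(x_i, s):
    return math.tanh(0.7 * x_i + 0.5 * s)    # hypothetical state update

x = [0.3, -1.2, 0.8]
z = shift_forward(x, m, g)
x_back = shift_inverse(z, m, g)
```

Since each $z_i$ adds to $x_i$ a quantity that depends only on earlier dimensions, the Jacobian is lower triangular with ones on the diagonal, so `m` and `g` can be arbitrary functions without affecting invertibility.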

These transformations can be composed. By combining transformations of variables with rich autoregressive models, we can estimate the density of the data.

$-\log p(x_1,...,x_d) = -\sum_{t=1}^T \log\left|\det\frac{d q^{(t)}}{d q^{(t-1)}}\right| - \sum_{i=1}^d \log p(q_i(x)|h_i)$
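The objective can be read as a loop: push $x$ through each transformation, accumulate the log-determinants, then score the result under the base density. In the sketch below the base density is an independent standard normal standing in for the autoregressive conditionals, and `scale` and `prefix_shift` are toy transformations; all of these are simplifying assumptions for illustration.

```python
import math

def scale(factors):
    """Elementwise scaling; log|det| = sum_i log|factor_i|."""
    def apply(x):
        z = [f * v for f, v in zip(factors, x)]
        return z, sum(math.log(abs(f)) for f in factors)
    return apply

def prefix_shift(x):
    """z_i = x_i + 0.5 * x_{i-1}: a unit-Jacobian shift, so log|det| = 0."""
    z = [x[0]] + [x[i] + 0.5 * x[i - 1] for i in range(1, len(x))]
    return z, 0.0

def std_normal_logp(z):
    """Stand-in base density (an assumption, not the autoregressive model)."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * v * v for v in z)

def negative_log_likelihood(x, transforms, base_logp):
    """-log p(x) = -sum_t log|det dq^(t)/dq^(t-1)| - log p_base(q(x))."""
    q, total_log_det = x, 0.0
    for transform in transforms:
        q, log_det = transform(q)
        total_log_det += log_det
    return -total_log_det - base_logp(q)

x = [0.5, -1.0]
nll = negative_log_likelihood(x, [scale([2.0, 0.5]), prefix_shift], std_normal_logp)
```

Minimizing this quantity over the parameters of both the transformations and the base model is what ties the two modules together.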

So?

Compared to MADE, Real NVP, and MAF, TAN achieved the best log-likelihood not only on MNIST but also on various real-world datasets.

Critic

A clever combination of autoregressive models and flows.

Oliva, Junier B., et al. “Transformation Autoregressive Networks.” arXiv preprint arXiv:1801.09819 (2018).
