# Transformation Autoregressive Networks

## WHY?

Autoregressive model has been dominant model for density estimation. On the other hand, various non-linear transformations techniques enabled tracking of density after transformation of variables. Transformation Autoregressive Networks(TAN) combined non-linear transformation into autoregressive model to capture more complicated density of data.

## WHAT?

TAN is composded of two module: autogregressive model and non-linear transformation. Autoregressive model represent the probability of data with products of sequential conditionals.

```
p(x_1,...,x_d) = \prod^d_{i=1}p(x_i|x_{i-1},...,x_1)\\
p(x_i|x_{i-1},...,x_1}) = p(x_i|MM(\theta(x_{i-1},...,x_1)))\\
\theta(x_{i-1},...,x_1) = f(h_i)\\
h_i = g_i(x_{i-1},...,x_1)
```

TAN proposes two kinds of autoregressive models. First is Linear Autoregressive Model(LAM) which have different weight matrix per conditional. Second is Recurrent Autoregressive Model(RAM). These two models have trade-off of flexibility and the number of parameters.

```
g_i(x_{i-1},...,x_1) = W^{(i)}x_{<i} + b\\
h_i = g(x_{i-1}, g(x_{i-2},...,x_1)) = g(x_{i-1}, h_{i-1})
```

To tract the probability of transformation, the transformation must be invertible and the determinant of its Jacobian have to be computed easily. This paper suggests several transformations that meet this criteria.

`p(x_1, ... x_d) = \|\det \frac{dq}{dx}\| \prod_{i=1}^d p(z_i|z_{i-1},...,z_1)`

First is Linear Transformation. The determinant of matrix A can be computed via LU decomposition.

```
z = Ax + b\\
A = LU\\
\det{dz}{dx} = \prod^d_{i=1} U_{ii}
```

Second is Recurrent Transformation. r is a ReLU unit and `r_{\alpha}`

is leaky ReLU unit.

```
z_i = r_{\alpha}(yx_i + w^Ts_{i-1} + b), s_i = r(ux_i + v^Ts_{i-1} + a)\\
r_{\alpha}(t) = \mathbb{I}\{t<0\}\alpha t + \mathbb{I}\{t \geq 0\}t\\
x_i = \frac{1}{y}(r^{-1}_{\alpha}(z_i^{(r)})- w^T s_{i-1} - b)\\
s_i = r(ux_i + v^Ts_{i-1} + a)\\
\det\frac{dz}{dx} = y^d\prod^d_{i=1}r'_{\alpha}(yx_i + w^Ts_{i-1} + b)\\
r'_{\alpha}(t) = \mathbb{I}\{t>0\} + \alpha\mathbb{I}\{t < 0\}
```

Third is Recurrent Shift Transformation that perform additive shift based on prior dimensions. The determinant of Jacobian is always 1 for this transformation.

```
z_i = x_i + m(s_{i-1})\\
s_i = g(x_i, s_{i-1})
```

These various transformations can be composed. Combining transformations of variables and rich autoregressive models, we can estimate the density of data.

`-\logp(x_i,...,x_d) = -\Sum^{t=1}^T \log\|\det\frac{d q^{(t)}}{d q^{(t-1)}}\| - \Sum_{t=1}^d \log p(q_i(x)|h_i)`

## So?

Compared to MADE, Real NVP, and MAF, TAN showed best log likelihood not only in MNIST, but also in various real world data.

## Critic

Clever combining of autoregressive model and flows.