The same motivation with Concrete.
Gumbel-Softmax distribution is the same as Concrete distribution. GS distribution appoaches to on-hot as temperature goes 0. However, GS samples are not exactly the same as categorical samples resulting bias. This GS estimator becomes close to unbiased but the variance of gradient increase (trade-off). If we do not replace categorical variables with GS variables, we can estimate the gradient with Straight-Through GS estimator. (3) is score function estimator, (4) is Straight-Through estimator and (5) is GS estimator (pathwise derivative).
GS outperformed in generation of MNIST images in NLL. ST-GS outperformed previous approach in semi-supervised classification.
Nice visualization. I wonder how this can be used in NLP.