• Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond

    WHY? Gradient descent methods depend on the first-order gradient of a loss function with respect to the parameters, while the second-order derivative (the Hessian) is often neglected. WHAT? This paper explored the exact Hessian product of a neural network (after convergence) and discovered that the eigenvalues of the Hessian separate into two groups: 0s and...
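
    As a toy illustration of a zero Hessian eigenvalue at a converged solution, the sketch below uses a hypothetical two-parameter loss 0.5 * (a*b - 1)^2. This loss is an assumption chosen for illustration and is not from the paper. The loss is minimized along the whole curve a*b = 1, so at the minimum (1, 1) the Hessian has one flat direction. The Hessian is approximated with central finite differences, and the function names are illustrative.

```python
import math


def loss(a, b):
    # Hypothetical over-parameterized toy loss: any (a, b) with a * b == 1 is a minimum.
    return 0.5 * (a * b - 1.0) ** 2


def hessian_2d(fn, a, b, h=1e-4):
    # Central finite-difference approximation of the 2x2 Hessian entries.
    faa = (fn(a + h, b) - 2 * fn(a, b) + fn(a - h, b)) / h**2
    fbb = (fn(a, b + h) - 2 * fn(a, b) + fn(a, b - h)) / h**2
    fab = (fn(a + h, b + h) - fn(a + h, b - h)
           - fn(a - h, b + h) + fn(a - h, b - h)) / (4 * h**2)
    return faa, fab, fbb


def symmetric_2x2_eigenvalues(faa, fab, fbb):
    # Closed-form eigenvalues of the symmetric matrix [[faa, fab], [fab, fbb]].
    mean = 0.5 * (faa + fbb)
    radius = math.hypot(0.5 * (faa - fbb), fab)
    return mean - radius, mean + radius


# Evaluate at the minimum (a, b) = (1, 1).
small, large = symmetric_2x2_eigenvalues(*hessian_2d(loss, 1.0, 1.0))
print(small, large)  # approximately 0 and 2
```

    The near-zero eigenvalue corresponds to moving along the curve a*b = 1, where the loss does not change to second order.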


  • Variational Inference for Monte Carlo Objectives

    WHY? Recent variational training requires sampling from the variational posterior to estimate the gradient. The NVIL estimator suggests a method to estimate the gradient of the loss function with respect to the parameters. Since the score function estimator is known to have high variance, a baseline is used as a variance reduction technique. However, this technique is insufficient...
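
    The variance reduction from a baseline can be seen in a minimal sketch, assuming a single Bernoulli latent variable with logit theta and a hypothetical reward f(b) = (b - 0.45)^2. Neither the reward nor the constant baseline used here comes from the paper (NVIL learns its baselines); they are chosen only to keep the example small.

```python
import math
import random
from statistics import mean, variance


def f(b):
    # Hypothetical objective evaluated on the sampled latent value.
    return (b - 0.45) ** 2


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score_function_grads(theta, baseline, n_samples, rng):
    # Per-sample score-function estimates of d/dtheta E_{b ~ Bernoulli(sigmoid(theta))}[f(b)].
    p = sigmoid(theta)
    grads = []
    for _ in range(n_samples):
        b = 1.0 if rng.random() < p else 0.0
        score = b - p  # d/dtheta log Bernoulli(b; sigmoid(theta))
        grads.append((f(b) - baseline) * score)
    return grads


rng = random.Random(0)
theta = 0.3
p = sigmoid(theta)

# Exact gradient, available here because b takes only two values.
true_grad = p * (1 - p) * (f(1.0) - f(0.0))

# Assumed constant baseline: the exact expected value of f.
constant_baseline = p * f(1.0) + (1 - p) * f(0.0)

plain = score_function_grads(theta, 0.0, 20000, rng)
with_baseline = score_function_grads(theta, constant_baseline, 20000, rng)

print("true gradient:", true_grad)
print("no baseline:   mean", mean(plain), "variance", variance(plain))
print("with baseline: mean", mean(with_baseline), "variance", variance(with_baseline))
```

    Both estimators are unbiased because the score has zero mean, so subtracting a constant changes only the variance.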


  • [Pytorch] MADE

    PyTorch implementation of MADE: Masked Autoencoder for Distribution Estimation.
    https://github.com/Lyusungwon/generative_models_pytorch
    Reference: https://github.com/karpathy/pytorch-made
    Note: Autoregressive sampling was tricky.
    Results
    Config: model: 180817182411_made_1000_200_0.001_28_28_1000_2_1_False, epochs 1000, batch-size 200, lr 1e-3, hidden-size 1000, layer-size 2, mask-num 1, start-sample 394, random-order False
    Figures: Test loss; Samples (Original, Reconstruction, Inpainting input, Inpainting output)
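
    For readers unfamiliar with MADE, the masking idea can be sketched in plain Python for a single hidden layer with the natural input ordering (matching random-order False above). This is an illustrative sketch, not the code from either repository; the function names and the small sizes are assumptions.

```python
import random


def build_made_masks(n_inputs, hidden_size, rng):
    # Degrees 1..D for the inputs (natural ordering); requires n_inputs >= 2.
    in_deg = list(range(1, n_inputs + 1))
    # Each hidden unit gets a degree in [1, D - 1].
    hid_deg = [rng.randint(1, n_inputs - 1) for _ in range(hidden_size)]
    # Hidden unit k may see input d only if hid_deg[k] >= in_deg[d].
    mask_in = [[1 if hk >= d else 0 for d in in_deg] for hk in hid_deg]
    # Output d may see hidden unit k only if in_deg[d] > hid_deg[k].
    mask_out = [[1 if d > hk else 0 for hk in hid_deg] for d in in_deg]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    # conn[o][i] is True if some masked path links input i to output o.
    n = len(mask_out)
    hidden = range(len(mask_in))
    return [[any(mask_out[o][k] and mask_in[k][i] for k in hidden)
             for i in range(n)]
            for o in range(n)]


mask_in, mask_out = build_made_masks(6, 16, random.Random(0))
conn = connectivity(mask_in, mask_out)

for row in conn:
    print("".join("x" if c else "." for c in row))
```

    The printed pattern is strictly lower triangular: output d depends only on inputs before d, which is why sampling has to proceed one dimension at a time.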


  • Noisy Network for Exploration

    WHY? Efficient exploration by an agent in reinforcement learning is an important issue. Conventional exploration heuristics include ε-greedy for DQN and an entropy reward for A3C. WHAT? NoisyNet is a neural network whose parameters are replaced with a parametric function of noise. There are two options for the noise: independent Gaussian noise...
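
    The independent Gaussian noise option can be sketched as a linear layer whose weights are mu + sigma * eps, with a separate eps drawn for every weight and bias. This is a minimal forward-pass sketch in plain Python; the class name, the initialization values, and the absence of any training code are assumptions made for illustration.

```python
import random


class NoisyLinear:
    """Linear layer with w = mu + sigma * eps, eps ~ N(0, 1) drawn per parameter."""

    def __init__(self, in_features, out_features, sigma_init=0.1, seed=0):
        self.rng = random.Random(seed)
        bound = (3.0 / in_features) ** 0.5  # illustrative initialization range
        self.mu_w = [[self.rng.uniform(-bound, bound) for _ in range(in_features)]
                     for _ in range(out_features)]
        self.mu_b = [self.rng.uniform(-bound, bound) for _ in range(out_features)]
        self.sigma_w = [[sigma_init] * in_features for _ in range(out_features)]
        self.sigma_b = [sigma_init] * out_features
        self.sample_noise()

    def sample_noise(self):
        # Independent Gaussian noise: one fresh sample per weight and per bias.
        self.eps_w = [[self.rng.gauss(0.0, 1.0) for _ in row] for row in self.mu_w]
        self.eps_b = [self.rng.gauss(0.0, 1.0) for _ in self.mu_b]

    def forward(self, x):
        out = []
        for i in range(len(self.mu_b)):
            w_row = [m + s * e for m, s, e in
                     zip(self.mu_w[i], self.sigma_w[i], self.eps_w[i])]
            b = self.mu_b[i] + self.sigma_b[i] * self.eps_b[i]
            out.append(sum(w * xi for w, xi in zip(w_row, x)) + b)
        return out


layer = NoisyLinear(3, 2, sigma_init=0.5)
x = [1.0, 2.0, 3.0]
y_first = layer.forward(x)
layer.sample_noise()  # resample, e.g. before the next action selection
y_second = layer.forward(x)
print(y_first, y_second)
```

    In a real agent, mu and sigma would both be learned by gradient descent, so the network itself controls how much noise, and therefore exploration, it keeps.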


  • Unsupervised Deep Embedding for Clustering Analysis

    WHY? There had been little study on learning representations that focus on clustering. WHAT? Deep Embedded Clustering (DEC) consists of two phases: (1) parameter initialization with a deep autoencoder and (2) parameter optimization. The paper first describes the second phase. Assuming the encoder and initial cluster centroids are given, two steps are alternated...