Gradient descent methods rely on the first-order gradient of a loss function with respect to the parameters. The second-order derivative (the Hessian), however, is often neglected.
This paper explored the exact Hessian product of a neural network (after convergence) and discovered that the eigenvalues of the Hessian separate into two groups: zeros and large positive values, so the Hessian is singular. This property did not depend on the loss function or on the choice of initial point. By varying the model parameters and the data, the paper observed that the bulk of the eigenvalues depends on the architecture, while the top discrete eigenvalues depend on the data.
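To make the "zeros plus large positive values" picture concrete, here is a minimal sketch. It does not use the paper's networks or data: it assumes a toy over-parameterized linear least-squares model, for which the Hessian-vector product is available in closed form. The sizes, the random data, and names such as `hessian_vector_product` are illustrative assumptions.

```python
import numpy as np

# Assumption: a toy linear least-squares model stands in for a neural network.
rng = np.random.default_rng(0)
n_samples, n_params = 5, 20            # hypothetical sizes: more parameters than data points
X = rng.normal(size=(n_samples, n_params))
y = rng.normal(size=n_samples)

def grad(w):
    """First-order gradient of the loss 0.5 * mean((X @ w - y) ** 2)."""
    return X.T @ (X @ w - y) / n_samples

def hessian_vector_product(v):
    """Exact Hessian-vector product H @ v, computed without forming H."""
    return X.T @ (X @ v) / n_samples

# Build the full Hessian column by column from Hessian-vector products.
H = np.column_stack([hessian_vector_product(e) for e in np.eye(n_params)])

# H is symmetric, so eigvalsh applies; it returns eigenvalues in ascending order.
eigenvalues = np.linalg.eigvalsh(H)
tol = 1e-10                            # threshold for treating an eigenvalue as zero
n_zero = int(np.sum(np.abs(eigenvalues) < tol))
n_positive = int(np.sum(eigenvalues > tol))
print("zero eigenvalues:", n_zero)
print("positive eigenvalues:", n_positive)
```

In this toy setting the Hessian has rank at most `n_samples`, so with generic data one would expect `n_params - n_samples` zero eigenvalues and `n_samples` positive ones. For a real network the Hessian-vector product would typically come from automatic differentiation instead of a closed form.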
These properties of the Hessian provide useful information about the loss landscape.
A deeper discussion of these properties would have been welcome.