WHY?

For audio source separation, traditional approaches used only the magnitude of the spectrogram and ignored the phase. Earlier work on Deep Complex Networks (DCN) provided the building blocks for complex-valued arithmetic in convolutional networks.

WHAT?

Deep Complex U-Net (DCU) modifies the plain DCN to better preserve audio information. First, DCU uses strided complex-valued convolutional layers instead of max pooling. Second, it uses complex batch normalization. Third, it uses leaky CReLU instead of CReLU. For source separation, DCU estimates a complex-valued mask instead of a real-valued mask.
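As a rough illustration of two of these building blocks, the sketch below implements a strided complex-valued 1-D convolution (written as cross-correlation, as deep learning libraries usually do) and a leaky CReLU in plain Python. The function names, the 1-D setting, the stride of 2, and the slope of 0.01 are assumptions made for this example, not values taken from the paper.

```python
def leaky_crelu(z: complex, slope: float = 0.01) -> complex:
    """Leaky ReLU applied to the real and imaginary parts independently.

    The slope value is a hypothetical default chosen for illustration.
    """
    def leaky(v: float) -> float:
        return v if v >= 0 else slope * v

    return complex(leaky(z.real), leaky(z.imag))


def complex_conv1d(x, w, stride=2):
    """Strided 'valid' 1-D cross-correlation of complex input x with complex kernel w.

    Each product follows complex multiplication:
    (a + ib)(c + id) = (ac - bd) + i(ad + bc).
    A stride larger than 1 downsamples the signal, which is the role
    max pooling would otherwise play.
    """
    out = []
    for start in range(0, len(x) - len(w) + 1, stride):
        re, im = 0.0, 0.0
        for k, wk in enumerate(w):
            xk = x[start + k]
            re += wk.real * xk.real - wk.imag * xk.imag
            im += wk.real * xk.imag + wk.imag * xk.real
        out.append(complex(re, im))
    return out


if __name__ == "__main__":
    signal = [1 + 1j, 2 - 1j, 0 + 3j, -1 + 0j, 2 + 2j]
    kernel = [1j, 1 + 0j]
    features = complex_conv1d(signal, kernel, stride=2)
    print([leaky_crelu(f) for f in features])
```

Writing the real and imaginary accumulations separately mirrors how a complex convolution is typically built from real-valued convolutions in practice.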

The problem with a complex-valued mask is that its values can range from -∞ to ∞ (unbounded mask). An unbounded mask can represent any appropriate mask, but it has been shown empirically to be difficult to optimize.

Previous work suggested applying a sigmoid activation to the real and imaginary parts separately. However, given the distribution of appropriate masks, this activation covers only a very limited region of the complex plane. DCU suggests polar-coordinate-wise masking, which keeps the mask bounded within the unit circle in complex space.
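A minimal sketch of polar-coordinate-wise masking is given below. It assumes that the magnitude of the raw network output is squashed with tanh and that the phase is kept unchanged. The note above does not say which squashing function is used, so tanh and the function names are assumptions of this example.

```python
import cmath
import math


def bounded_polar_mask(o: complex) -> complex:
    """Map an unbounded complex output o to a mask inside the unit circle.

    Magnitude: tanh(|o|), which lies in [0, 1) for finite inputs (assumed squashing function).
    Phase: o / |o|, i.e. the phase of o is preserved.
    """
    mag = abs(o)
    if mag == 0.0:
        return 0j  # phase is undefined at the origin; return a zero mask
    return math.tanh(mag) * (o / mag)


def apply_mask(mask: complex, mixture_bin: complex) -> complex:
    """Complex multiplication scales the magnitude and rotates the phase."""
    return mask * mixture_bin


if __name__ == "__main__":
    raw = 3 + 4j  # hypothetical network output for one time-frequency bin
    m = bounded_polar_mask(raw)
    print(abs(m), cmath.phase(m))
    print(apply_mask(m, 0.5 - 0.2j))
```

Unlike a sigmoid on each part, which restricts the mask to the first quadrant, this mask can take any phase while its magnitude stays bounded.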

Since the earlier MSE losses (Spectrogram-MSE, Wave-MSE) did not correlate well with the evaluation measures, DCU proposed a weighted-SDR loss. The source-to-distortion ratio (SDR) measures how much distortion the reconstructed audio contains.

The weighted-SDR loss is designed to avoid division by zero and to keep the loss bounded between -1 and 1.
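The sketch below shows one way such a loss can be written, assuming the common formulation: a negative cosine similarity between target and estimate, computed for both the clean signal and the residual noise, and combined with an energy-based weight. The note above does not spell out the formula, so the exact form, the function names, and the small `eps` guard are assumptions of this example.

```python
import math


def _dot(a, b):
    """Inner product of two equal-length real-valued waveforms."""
    return sum(p * q for p, q in zip(a, b))


def neg_cosine(target, estimate, eps=1e-8):
    """Negative cosine similarity, bounded in [-1, 1]; -1 is a perfect match.

    eps (hypothetical value) guards against division by zero for silent signals.
    """
    num = _dot(target, estimate)
    den = math.sqrt(_dot(target, target)) * math.sqrt(_dot(estimate, estimate))
    return -num / (den + eps)


def weighted_sdr_loss(mixture, clean, estimate, eps=1e-8):
    """Energy-weighted sum of the speech term and the noise term."""
    noise = [x - y for x, y in zip(mixture, clean)]
    noise_est = [x - e for x, e in zip(mixture, estimate)]
    e_clean = _dot(clean, clean)
    e_noise = _dot(noise, noise)
    # alpha is the share of the mixture energy that belongs to the clean source
    alpha = e_clean / (e_clean + e_noise + eps)
    return (alpha * neg_cosine(clean, estimate, eps)
            + (1.0 - alpha) * neg_cosine(noise, noise_est, eps))


if __name__ == "__main__":
    mixture = [1.0, 0.5, -0.5, 2.0]
    clean = [0.5, 0.5, -1.0, 1.0]
    print(weighted_sdr_loss(mixture, clean, clean))                # close to -1
    print(weighted_sdr_loss(mixture, clean, [-v for v in clean]))  # clearly worse
```

Because both terms are cosine similarities and the weights sum to at most 1, the loss stays within [-1, 1], and the noise term keeps it informative even when the clean target is silent.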

SO?

DCU achieved state-of-the-art results on CSIG, CBAK, COVL, PESQ, and SSNR compared to SEGAN, Wavenet, MMSE-GAN, and Deep Feature Loss.

Critic

An incredible application of complex-valued networks!

Not disclosed yet