Speech Enhancement Using Deep Neural Networks [Audio/Speech Processing : Part 3]

Leonard Loo
3 min read · Apr 19, 2020

In this article, we discuss how to perform Speech Enhancement, the problem of recovering clean speech from a noisy environment.

This is Part 3 of the series:

  1. Part 1 : Understanding common features used
  2. Part 2 : Speech Data Augmentation

Contents:

  1. Traditional Methods
  2. Direct Mapping
  3. Mask-based Post-processing
  4. LSTM-based
  5. Time Domain Loss

Traditional Methods

While I will not go into detail on the conventional approaches here, I will simply list them. These are some notable approaches to speech enhancement from before deep neural networks came into play (a brief spectral-subtraction sketch follows the list):

  1. Spectral subtraction
  2. Wiener Filtering
  3. MMSE Estimator
  4. OM-LSA & MCRA
  5. IMCRA
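To give a flavor of these classical methods, here is a minimal sketch of the first one, spectral subtraction. It assumes the noise magnitude can be estimated from the first few (speech-free) frames; the STFT parameters and the use of librosa are my own choices, not prescribed by any particular paper.

```python
import numpy as np
import librosa

def spectral_subtraction(noisy, n_fft=512, hop=128, noise_frames=10):
    """Estimate the noise magnitude from the first few (assumed speech-free)
    frames, subtract it from every frame, and keep the noisy phase."""
    stft = librosa.stft(noisy, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(stft), np.angle(stft)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_mag, 0.0)  # half-wave rectify: no negative magnitudes
    return librosa.istft(clean_mag * np.exp(1j * phase), hop_length=hop)
```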

Direct Mapping

Here, only classical fully-connected layers are used; convolutional and recurrent layers are not yet explored.

After converting the speech to a spectrogram, the network takes in a context of 11 frames to predict the center frame.

Configuration of hidden layers: 3 hidden layers with 2048 neurons each. Finally, the output layer is the center frame of the enhanced spectrogram.
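Here is a minimal PyTorch sketch of that architecture. The 11-frame context and the 3 × 2048 hidden layers come from the description above; the 257 frequency bins (a 512-point FFT), the ReLU activations, and the MSE loss are assumptions on my part.

```python
import torch
import torch.nn as nn

N_FREQ = 257   # frequency bins per frame (assumes a 512-point FFT)
CONTEXT = 11   # frames of input context, as described above

# Fully-connected network: 11-frame noisy context in, enhanced center frame out.
model = nn.Sequential(
    nn.Flatten(),                                   # (batch, 11, 257) -> (batch, 11*257)
    nn.Linear(CONTEXT * N_FREQ, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, N_FREQ),                        # enhanced center frame
)

noisy_context = torch.randn(8, CONTEXT, N_FREQ)     # dummy batch of spectrogram windows
clean_center = torch.randn(8, N_FREQ)               # dummy clean target frames
loss = nn.MSELoss()(model(noisy_context), clean_center)
```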

Mask-based Post-processing

Here, on top of the noisy spectrogram input, the noisy MFCC is used as an additional feature to complement the learning. The clean MFCC and a mask are also predicted as additional outputs.

The predicted mask helps in post-processing, as it learns which frequency bins are noisier and hence require more masking; input that is already clean requires little or no masking.
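A minimal sketch of this multi-task setup is below. The trunk and head sizes, the sigmoid mask, and the way the mask is applied in post-processing (multiplying it into the predicted spectrogram) are my assumptions; the original paper may combine the outputs differently.

```python
import torch
import torch.nn as nn

class MaskedEnhancer(nn.Module):
    """From the noisy spectrogram frame plus noisy MFCC, predict the clean
    spectrogram frame, the clean MFCC, and a per-bin mask in [0, 1]."""
    def __init__(self, n_freq=257, n_mfcc=40, hidden=2048):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(n_freq + n_mfcc, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.spec_head = nn.Linear(hidden, n_freq)   # enhanced spectrogram frame
        self.mfcc_head = nn.Linear(hidden, n_mfcc)   # clean MFCC (auxiliary target)
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec, noisy_mfcc):
        h = self.trunk(torch.cat([noisy_spec, noisy_mfcc], dim=-1))
        return self.spec_head(h), self.mfcc_head(h), self.mask_head(h)

model = MaskedEnhancer()
spec, mfcc, mask = model(torch.randn(8, 257), torch.randn(8, 40))
post_processed = mask * spec   # noisier bins (mask near 0) are attenuated more
```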

LSTM-based

The only difference between this and the previous two methods is that an LSTM replaces the classical fully-connected layers.

Because speech is a time series, it is highly dependent on temporal context, so a recurrent model seems the better choice.
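A minimal sketch of such a model is below; the hidden size, layer count, and per-frame linear output head are my assumptions, since the text only specifies that an LSTM replaces the fully-connected trunk.

```python
import torch
import torch.nn as nn

class LSTMEnhancer(nn.Module):
    """LSTM trunk carries temporal context across the whole utterance,
    so no fixed 11-frame context window is needed."""
    def __init__(self, n_freq=257, hidden=512, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)   # enhanced frame at every time step

    def forward(self, noisy_spec):             # (batch, time, n_freq)
        h, _ = self.lstm(noisy_spec)
        return self.out(h)

enhanced = LSTMEnhancer()(torch.randn(4, 100, 257))   # shape (4, 100, 257)
```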

Time Domain Loss (TDL)

This paper takes a two-step approach, dealing with noise and reverberation separately. The phase is also used to reconstruct the waveform, which provides an additional loss: TDL.

DNN 1 and DNN 2 here can be thought of as using any of the above methods (direct mapping, mask-based, etc.), as long as they output the enhanced spectrum.

TDL is used because not all time-frequency units are equally important. For example, for a particular word, the emphasis at 30 Hz might matter more than that at 150 Hz. Reconstructing the spectra into waveforms using the clean phase allows us to address this issue under the hood.
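Below is a minimal sketch of such a loss: combine the predicted magnitude with the clean phase, invert the STFT, and compare the waveforms. The STFT parameters, the Hann window, and the MSE comparison are my assumptions, not necessarily the paper's exact formulation.

```python
import math
import torch

def time_domain_loss(enhanced_mag, clean_mag, clean_phase, n_fft=512, hop=128):
    """Reconstruct both waveforms using the clean phase, then compare them."""
    window = torch.hann_window(n_fft)
    est_wav = torch.istft(enhanced_mag * torch.exp(1j * clean_phase),
                          n_fft=n_fft, hop_length=hop, window=window)
    ref_wav = torch.istft(clean_mag * torch.exp(1j * clean_phase),
                          n_fft=n_fft, hop_length=hop, window=window)
    return torch.mean((est_wav - ref_wav) ** 2)

# Dummy magnitudes/phase with 257 bins (n_fft // 2 + 1) and 100 frames.
mag_est, mag_ref = torch.rand(257, 100), torch.rand(257, 100)
phase = torch.rand(257, 100) * 2 * math.pi
loss = time_domain_loss(mag_est, mag_ref, phase)
```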

My implementation of this paper can be found here.
