Dynamic Speech Generation to Enhance Intelligibility in Noisy Environments

 

Abstract

Synthetic speech is occasionally enhanced by a fixed amount before being presented to the listener. However, this approach neglects the diverse types and levels of noise, potentially resulting in either unintelligible or unpleasant speech. This paper proposes a dynamic speech enhancement approach, which generates speech while accounting for the presence of noise. Our methodology extends the WaveRNN vocoder by conditioning it not only on the speech mel-spectrograms but also on the spectrogram of the background noise. In training, the target speech is tilted according to the listeners' preference, and we further introduce an objective metric that aims to maximize the spectro-temporal regions where the target energy exceeds the masker energy. The generated speech was tested in speech-shaped noise at various noise levels. Our evaluation results showed that (a) in more adverse conditions, a fixed-amount post-enhanced baseline system and the proposed dynamically-enhanced system performed equally well in terms of intelligibility and preference, and (b) in the least noisy conditions, where the intelligibility scores of all models were nearly identical, listeners preferred the dynamically-enhanced speech. Our findings demonstrate and reinforce the benefits of using dynamic speech enhancement techniques in noisy environments.

 

Tested Vocoders

Four systems were trained and then compared:

  1. Baseline WaveRNN (BL): The state-of-the-art WaveRNN vocoder.
  2. Post-processed enhancement of BL-generated speech (BL-post): The speech generated by the BL model is enhanced by a fixed-amount modification of the spectral tilt. This is achieved by minimizing the log-likelihood between the glimpse profile of the speech signal and the model’s learned distribution at the -10 dB SNR level.
  3. Environment-aware WaveRNN (EA λ=0): The proposed noise-conditioned vocoder with λ=0 (no extra loss term).
  4. Environment-aware WaveRNN with intelligibility-enhancing loss term (EA λ=1): Similar to the previous model, but with an extra term in the loss function.
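
The extra loss term in system 4 is described only as a metric that favors spectro-temporal regions where the target energy exceeds the masker energy. The following is a minimal sketch of one way such a quantity could be computed, not the authors' implementation. The function names `glimpse_proportion` and `total_loss`, the 3 dB threshold, and the additive combination with weight `lam` are all assumptions made for illustration.

```python
import numpy as np


def glimpse_proportion(speech_power, noise_power, threshold_db=3.0, eps=1e-12):
    """Fraction of time-frequency cells where speech exceeds noise.

    Both inputs are non-negative power spectrograms of identical shape
    (frequency bins x frames). The 3 dB default threshold is an assumption,
    not a value taken from the paper.
    """
    speech_power = np.asarray(speech_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    if speech_power.shape != noise_power.shape:
        raise ValueError("spectrograms must have the same shape")
    # Local SNR in dB for every time-frequency cell.
    local_snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    # A cell counts as a "glimpse" when the local SNR exceeds the threshold.
    return float(np.mean(local_snr_db > threshold_db))


def total_loss(vocoder_loss, speech_power, noise_power, lam=1.0):
    """Hypothetical combination: base loss minus lam times the glimpse proportion.

    With lam=0 this reduces to the base vocoder loss (the EA λ=0 setting).
    """
    return vocoder_loss - lam * glimpse_proportion(speech_power, noise_power)
```

The hard threshold used here is not differentiable, so an actual training loss would need a smooth surrogate (for example, a sigmoid of the local SNR); the sketch only illustrates the quantity being rewarded.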

Systems 1 and 2 were trained exclusively on the untilted dataset, while systems 3 and 4 were trained on both the tilted and untilted datasets.
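
This page does not specify how the tilted dataset was produced. As a rough illustration of what a spectral-tilt modification can look like, the sketch below flattens the tilt with a first-order pre-emphasis filter and then restores the original RMS level, so that the change redistributes energy toward higher frequencies rather than making the signal louder. The filter choice, the coefficient `alpha=0.95`, and the function name `flatten_spectral_tilt` are assumptions, not details taken from the paper.

```python
import numpy as np


def flatten_spectral_tilt(signal, alpha=0.95):
    """Boost high frequencies with y[n] = x[n] - alpha * x[n-1], keeping RMS fixed.

    `alpha` is an assumed pre-emphasis coefficient; the first sample is
    passed through unchanged (x[-1] is treated as 0).
    """
    signal = np.asarray(signal, dtype=float)
    if signal.size == 0:
        return signal
    emphasized = np.concatenate(([signal[0]], signal[1:] - alpha * signal[:-1]))
    in_rms = np.sqrt(np.mean(signal ** 2))
    out_rms = np.sqrt(np.mean(emphasized ** 2))
    if out_rms == 0.0:
        return emphasized
    # Rescale so the tilted signal has the same overall energy as the input.
    return emphasized * (in_rms / out_rms)
```

Keeping the RMS level constant matters in this setting because the SNR is defined relative to the speech level; a tilt that also raised the overall level would change the effective SNR.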

Sample ID    SNR       BL   BL-post   EA λ=0   EA λ=1
LJ002-0166   -10 dB
LJ004-0215   -10 dB
LJ002-0166   -7.5 dB
LJ004-0215   -7.5 dB
LJ002-0166   -2 dB
LJ004-0215   -2 dB
LJ002-0166   6 dB
LJ004-0215   6 dB

 

Citing

O. Simantiraki, M. Markaki, and Y. Pantazis: Dynamic Speech Generation to Enhance Intelligibility in Noisy Environments. Accepted for presentation at ICASSP 2025.

Get In Touch

Foundation for Research and Technology - Hellas
N. Plastira 100, Vassilika Vouton, GR-700 13, Heraklion, Crete

+30 2810 391825

pantazis@iacm.forth.gr