STOMA

Dynamic Speech Generation to Enhance Intelligibility in Noisy Environments

Abstract

Synthetic speech is occasionally enhanced by a fixed amount before being presented to the listener. However, this approach neglects the diverse types and levels of noise, potentially resulting in speech that is either unintelligible or unpleasant. This paper proposes a dynamic speech enhancement approach, which generates speech while accounting for the presence of noise. Our methodology extends the WaveRNN vocoder by conditioning it not only on the speech mel-spectrograms but also on the spectrogram of the background noise. In training, the target speech is tilted according to the listeners' preference, and we further introduce an objective metric that aims to maximize the spectro-temporal regions where the target energy exceeds the masker energy. The generated speech was tested in speech-shaped noise at various noise levels. Our evaluation results showed that (a) in the more adverse conditions, a fixed-amount post-enhanced baseline system and the proposed dynamically-enhanced system performed equally well in terms of intelligibility and preference, and (b) in the least noisy conditions, where the intelligibility scores of all models were nearly identical, listeners preferred the dynamically-enhanced speech. Our findings demonstrate and reinforce the benefits of using dynamic speech enhancement techniques in noisy environments.

Tested Vocoders

Four systems were trained and then compared:

Baseline WaveRNN (BL): Refers to the state-of-the-art WaveRNN vocoder.

Post-processed enhancement of BL-generated speech (BL-post): The speech generated by the BL model is enhanced by a fixed-amount modification of its spectral tilt. The amount of tilt is obtained by minimizing the log-likelihood between the glimpse profile of the speech signal and the model’s learned distribution at an SNR of -10 dB.
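
To make the notions of "glimpse" and "fixed-amount spectral tilt" more concrete, the following is a minimal sketch and not the implementation used for these systems. It assumes that speech and noise are available as dB-scaled spectrograms of identical shape (bands x frames), counts a time-frequency cell as a glimpse when the speech exceeds the noise by an assumed 3 dB local criterion, and applies the tilt as a zero-mean linear gain ramp across bands. The function names `glimpse_proportion` and `apply_spectral_tilt`, the threshold, and the ramp shape are all hypothetical choices made for illustration.

```python
import numpy as np


def glimpse_proportion(speech_db, noise_db, threshold_db=3.0):
    """Fraction of time-frequency cells where speech exceeds noise.

    speech_db, noise_db: arrays of shape (n_bands, n_frames), in dB.
    threshold_db: local SNR criterion (3 dB is an assumed value).
    """
    speech_db = np.asarray(speech_db, dtype=float)
    noise_db = np.asarray(noise_db, dtype=float)
    if speech_db.shape != noise_db.shape:
        raise ValueError("speech and noise spectrograms must have the same shape")
    # A cell is a glimpse when the target energy exceeds the masker energy
    # by at least the local criterion.
    glimpses = speech_db > noise_db + threshold_db
    return float(glimpses.mean())


def apply_spectral_tilt(spec_db, tilt_db_per_band):
    """Apply a fixed linear tilt (in dB per band) across frequency bands.

    The ramp is made zero-mean in dB as a crude stand-in for keeping the
    overall level unchanged; a real system would normalize signal energy.
    """
    spec_db = np.asarray(spec_db, dtype=float)
    ramp = tilt_db_per_band * np.arange(spec_db.shape[0], dtype=float)
    ramp -= ramp.mean()
    return spec_db + ramp[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.normal(0.0, 5.0, size=(40, 100))  # toy speech spectrogram (dB)
    noise = rng.normal(0.0, 5.0, size=(40, 100))   # toy masker spectrogram (dB)
    print("GP before tilt:", glimpse_proportion(speech, noise))
    print("GP after tilt: ", glimpse_proportion(apply_spectral_tilt(speech, 0.5), noise))
```

In this toy setting, a fixed post-processing system would choose a single value of `tilt_db_per_band` in advance and apply it regardless of the noise, whereas a dynamic system would adapt the enhancement to the noise spectrogram it observes.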