Synthetic speech is occasionally enhanced by a fixed amount before being presented to the listener. However, this approach neglects the diverse types and levels of noise, potentially resulting in speech that is either unintelligible or unpleasant. This paper proposes a dynamic speech enhancement approach, which generates speech while accounting for the presence of noise. Our methodology extends the WaveRNN vocoder by conditioning it not only on the speech mel-spectrograms but also on the spectrogram of the background noise. In training, the target speech is tilted according to the listeners' preference, and we further introduce an objective metric that aims at maximizing the spectro-temporal regions where the target energy exceeds the masker energy. The generated speech was tested in speech-shaped noise at various noise levels. Our evaluation results showed that (a) in the more adverse conditions, the fixed-amount post-enhanced baseline system and the proposed dynamically enhanced system performed equally well in terms of intelligibility and preference, and (b) in the least noisy conditions, where the intelligibility scores of all models were nearly identical, listeners preferred the dynamically enhanced speech. Our findings demonstrate and reinforce the benefits of using dynamic speech enhancement techniques in noisy environments.
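The objective mentioned above rewards spectro-temporal regions where the target energy exceeds the masker energy. A minimal sketch of such a measure is given here, assuming that speech and noise energies are available in dB on the same time-frequency grid. The function name `glimpse_proportion`, the `threshold_db` parameter, and the list-of-frames layout are illustrative assumptions, not taken from the paper, whose actual metric may be defined differently.

```python
def glimpse_proportion(speech_db, noise_db, threshold_db=0.0):
    """Fraction of time-frequency cells where speech exceeds noise.

    speech_db, noise_db: lists of frames, each frame a list of per-band
    energies in dB (hypothetical layout, assumed for illustration).
    threshold_db: margin by which speech must exceed noise (assumed).
    """
    if len(speech_db) != len(noise_db):
        raise ValueError("speech and noise need the same number of frames")
    total = 0
    glimpsed = 0
    for s_frame, n_frame in zip(speech_db, noise_db):
        if len(s_frame) != len(n_frame):
            raise ValueError("frames need the same number of bands")
        for s, n in zip(s_frame, n_frame):
            total += 1
            # A cell counts when its local SNR is above the threshold.
            if s - n > threshold_db:
                glimpsed += 1
    return glimpsed / total if total else 0.0


# Toy example: 2 frames x 3 bands, energies in dB.
speech = [[60.0, 55.0, 40.0], [62.0, 50.0, 45.0]]
noise = [[50.0, 58.0, 42.0], [50.0, 58.0, 42.0]]
gp = glimpse_proportion(speech, noise)
```

In this toy example three of the six cells have speech energy above the noise energy. A hard count like this is not differentiable, so a training objective would presumably rely on a smooth surrogate.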
Four systems were trained and then compared:
| Sample ID | SNR | BL | BL-post | EA λ=0 | EA λ=1 |
|---|---|---|---|---|---|
| LJ002-0166 | -10 dB | | | | |
| LJ004-0215 | -10 dB | | | | |
| LJ002-0166 | -7.5 dB | | | | |
| LJ004-0215 | -7.5 dB | | | | |
| LJ002-0166 | -2 dB | | | | |
| LJ004-0215 | -2 dB | | | | |
| LJ002-0166 | 6 dB | | | | |
| LJ004-0215 | 6 dB | | | | |
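The SNR column above gives the ratio of speech power to noise power in dB. As a rough sketch of how a noise signal can be scaled to reach a target SNR before mixing, assuming equal-length lists of samples: `mean_power` and `mix_at_snr` are hypothetical helpers written for illustration, not the evaluation code used for these samples.

```python
import math


def mean_power(x):
    """Average power (mean of squared samples) of a signal."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech-to-noise power equals `snr_db`, then mix.

    Returns (mixture, scaled_noise). Hypothetical helper for illustration.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise must have non-zero power")
    # SNR_dB = 10*log10(P_s / (g^2 * P_n))  =>  g = sqrt(P_s / (P_n * 10^(SNR_dB/10)))
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    scaled_noise = [gain * v for v in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


# Toy signals; real use would pass audio samples at the same sampling rate.
speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.1, 0.1, -0.1, -0.1]
mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=-10.0)
achieved_db = 10 * math.log10(mean_power(speech) / mean_power(scaled_noise))
```

Here the speech level is left untouched and only the noise gain changes, so a lower SNR means louder noise rather than quieter speech.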
Citing
O. Simantiraki, M. Markaki, and Y. Pantazis: Dynamic Speech Generation to Enhance Intelligibility in Noisy Environments. Accepted for presentation at ICASSP 2025.
Foundation for Research and Technology - Hellas
N. Plastira 100, Vassilika Vouton
GR-700 13, Heraklion, Crete
+30 2810 391825
pantazis@iacm.forth.gr