Disentangled speech representations allow precise control over individual speech attributes, such as content, speaker identity, and style, enabling more flexible and natural voice synthesis engines. This study advances speech synthesis by developing novel disentangled speech representation algorithms. Techniques grounded in information theory, including recently proposed regularized variational mutual information estimators supplemented with a gradient reversal layer, were integrated to refine the representation of independent speech attributes. Using the Expresso dataset within the FastSpeech 2 framework, this work demonstrates significant improvements in the controllability and quality of synthetic speech. Objective metrics, including cosine similarity matrices, Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI), were computed, complemented by subjective assessments of speech quality. The results show that the proposed methods outperform existing approaches, as evidenced by superior A/B testing outcomes, improved inter-cluster distance metrics, and higher PESQ and STOI scores, highlighting the gains of the developed systems in intelligibility, naturalness, and overall speech quality.
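The objective evaluation mentioned above relies on cosine similarity matrices and inter-cluster distances computed over speaker embeddings. The following is a minimal numpy sketch of those two metrics, not the authors' evaluation code; the embedding values, dimensionality, and function names are hypothetical, chosen only to illustrate the computation.

```python
import numpy as np

def cosine_similarity_matrix(E):
    """Pairwise cosine similarity between rows of an embedding matrix E of shape (n, d)."""
    normed = E / np.linalg.norm(E, axis=1, keepdims=True)
    return normed @ normed.T

def inter_cluster_distance(E, labels):
    """Mean Euclidean distance between per-speaker centroids (larger = better separated)."""
    centroids = np.stack([E[labels == s].mean(axis=0) for s in np.unique(labels)])
    dists = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    iu = np.triu_indices(len(centroids), k=1)  # upper triangle, excluding the diagonal
    return dists[iu].mean()

# Toy data: 2 "speakers" with 3 synthetic 4-dim embeddings each (hypothetical values).
rng = np.random.default_rng(0)
E = np.concatenate([rng.normal(0.0, 0.1, (3, 4)), rng.normal(1.0, 0.1, (3, 4))])
labels = np.array([0, 0, 0, 1, 1, 1])

S = cosine_similarity_matrix(E)
print(S.shape)  # (6, 6)
print(inter_cluster_distance(E, labels))
```

Well-separated speaker clusters yield a block structure in the cosine similarity matrix and a large inter-cluster distance, which is how these metrics quantify disentanglement of speaker identity.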
Use Case 1: Multispeaker Audio Synthesis Results by Model
| Model | sp0 | sp1 | sp2 | sp3 |
|---|---|---|---|---|
| CC | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| CC&GRL | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| CLUB | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| GRL | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| INFO | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| MINE | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| WC | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
Use Cases 2-4: Multispeaker Audio Synthesis Results by Model
| Style | Use Case 2 | Use Case 3 | Use Case 4 |
|---|---|---|---|
| Confused | (audio sample) | (audio sample) | (audio sample) |
| Default | (audio sample) | (audio sample) | (audio sample) |
| Enunc | (audio sample) | (audio sample) | (audio sample) |
| Happy | (audio sample) | (audio sample) | (audio sample) |
| Laugh | (audio sample) | (audio sample) | (audio sample) |
| Sad | (audio sample) | (audio sample) | (audio sample) |
| Whisper | (audio sample) | (audio sample) | (audio sample) |
Citation:
T. Kassiotis and Y. Pantazis: Disentangling Speech Representations with Mutual Information Estimators for Expressive Synthesis. 33rd European Signal Processing Conference (EUSIPCO 2025), Palermo, Italy, September 8-12, 2025.
Foundation for Research and Technology - Hellas
N. Plastira 100, Vassilika Vouton
GR-700 13, Heraklion, Crete
+30 2810 391825
pantazis@iacm.forth.gr