STOMA

Disentangling Speech Representations with Mutual Information Estimators for Expressive Synthesis

Abstract

Disentangled speech representations allow precise control over individual speech attributes, such as content, speaker identity, and style, enabling more flexible and natural voice synthesis engines. This study advances speech synthesis by developing disentangled speech representation algorithms. Techniques grounded in information theory, such as recently proposed regularized variational mutual information estimators supplemented with a gradient reversal layer, were integrated to refine the representation of independent speech attributes. Using the Expresso dataset within the FastSpeech 2 framework, this work demonstrates significant improvements in the controllability and quality of synthetic speech. Evaluation combined objective metrics, including cosine similarity matrices, perceptual evaluation of speech quality (PESQ), and short-time objective intelligibility (STOI), with subjective assessment of speech quality. The results show that the proposed methods outperform existing approaches, as evidenced by superior A/B testing outcomes, improved inter-cluster distance metrics, and higher PESQ and STOI scores, indicating gains in intelligibility, naturalness, and overall speech quality.
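
As a rough illustration of the cosine similarity matrices listed among the objective metrics, the sketch below computes pairwise cosine similarities between a few embedding vectors. The function name and the toy vectors are hypothetical, chosen only for illustration; the embeddings and evaluation pipeline actually used in the study are not described on this page.

```python
import math


def cosine_similarity_matrix(embeddings):
    """Return the pairwise cosine similarity matrix for a list of vectors.

    Hypothetical helper for illustration; not the study's evaluation code.
    """
    norms = [math.sqrt(sum(x * x for x in v)) for v in embeddings]
    if any(n == 0.0 for n in norms):
        raise ValueError("cosine similarity is undefined for zero vectors")
    size = len(embeddings)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            matrix[i][j] = dot / (norms[i] * norms[j])
    return matrix


# Toy 3-dimensional vectors standing in for speaker or style embeddings
# (assumed values, not taken from any model).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
sim = cosine_similarity_matrix(toy_embeddings)
```

In such a matrix, values near 1 on the diagonal blocks and values near 0 between different speakers or styles would suggest that the corresponding embeddings are well separated.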

Use Case 1: Multispeaker Audio Synthesis Results by Model

Generated Sentence: "The fish twisted and turned on the bent hook."
Style: Default
Speakers: All
Models: All (CC & GRL, GRL, CC, WC, MINE, INFO, CLUB)

	sp0	sp1	sp2	sp3
CC
CC & GRL
CLUB
GRL
INFO
MINE
WC

Use Cases 2-4: Multispeaker Audio Synthesis Results by Model

Generated Sentences:
Use Case 2: "The fish twisted and turned on the bent hook."
Use Case 3: "Amazing, we won the match."
Use Case 4: "In his absence, the council adopted the change."
Style: All Styles (Confused, Default, Enunciated, Happy, Sad, Laugh, Whisper)
Speakers: Same Speaker
Models: CC & GRL