FluentTTS: Text-dependent Fine-grained Style Control for Multi-style TTS
Official PyTorch implementation of FluentTTS: Text-dependent Fine-grained Style Control for Multi-style TTS. The code is based on the repositories listed in the Acknowledgements below.
Abstract: In this paper, we propose a method to flexibly control the local prosodic variation of a neural text-to-speech (TTS) model. To provide expressiveness for synthesized speech, conventional TTS models utilize utterance-wise global style embeddings that are obtained by compressing frame-level embeddings along the time axis. However, since utterance-wise global features do not contain sufficient information to represent the characteristics of word-level local features, they are not appropriate for directly controlling prosody at a fine scale. In multi-style TTS models, the capability to control local prosody is very important because it plays a key role in finding the most appropriate text-to-speech pair among many one-to-many mapping candidates. To explicitly provide local prosodic characteristics alongside the contextual information of the corresponding input text, we propose a module to predict the fundamental frequency (F0) at the phoneme level.
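To make the distinction above concrete, the following is a minimal, hypothetical PyTorch sketch (not the actual FluentTTS modules; all dimensions, layer choices, and names are placeholders): a conventional global style encoder compresses frame-level reference embeddings along the time axis into a single utterance-level vector, while a phoneme-level F0 predictor keeps one prosodic value per input symbol.

```python
import torch
import torch.nn as nn

class GlobalStyleEncoder(nn.Module):
    """Toy version of the conventional global-style approach: compress
    frame-level reference features into one utterance-level embedding
    by averaging over the time axis."""
    def __init__(self, n_mels: int = 80, style_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(n_mels, style_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) -> (batch, style_dim)
        frame_emb = self.proj(mel)    # frame-level embeddings
        return frame_emb.mean(dim=1)  # collapse the time axis

class PhonemeF0Predictor(nn.Module):
    """Toy phoneme-level F0 predictor: one F0 value per input symbol,
    i.e. a handle on local prosody that a single global vector lacks."""
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.predictor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, text_hidden: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, phonemes, hidden_dim) -> (batch, phonemes)
        return self.predictor(text_hidden).squeeze(-1)

if __name__ == "__main__":
    mel = torch.randn(2, 400, 80)          # reference mel-spectrogram
    text_hidden = torch.randn(2, 35, 256)  # encoder outputs for 35 phonemes
    global_style = GlobalStyleEncoder()(mel)        # (2, 128): one vector per utterance
    phoneme_f0 = PhonemeF0Predictor()(text_hidden)  # (2, 35): one F0 value per phoneme
    print(global_style.shape, phoneme_f0.shape)
```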
Visit our Demo for audio samples.
- Clone this repository
- Install the Python requirements listed in requirements.txt (e.g., pip install -r requirements.txt).
- As in the referenced code, please modify the return values of torch.nn.functional.multi_head_attention_forward() (defined in torch/nn/functional.py of your PyTorch installation) so that the attention weights of all heads can be drawn in the validation step.
```python
# Before
return attn_output, attn_output_weights.sum(dim=1) / num_heads
# After
return attn_output, attn_output_weights
```
1-1. Prepare text preprocessing
1-2. Make data filelists following the format of filelists/example_filelist.txt; they are used for preprocessing and training. A minimal parsing sketch follows the example below.

```
/your/data/path/angry_f_1234.wav|your_data_text|speaker_type
/your/data/path/happy_m_5678.wav|your_data_text|speaker_type
/your/data/path/sadness_f_111.wav|your_data_text|speaker_type
...
```
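The sketch below assumes each filelist line has the three pipe-separated fields shown above (wav path, text, speaker type); the function name and returned keys are illustrative only.

```python
def load_filelist(path: str):
    """Parse a filelist whose lines look like 'wav_path|text|speaker_type'."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            wav_path, text, speaker_type = line.split("|")
            entries.append({"wav": wav_path, "text": text, "speaker": speaker_type})
    return entries

# e.g. entries = load_filelist("filelists/example_filelist.txt")
```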
1-3. To find the number of speakers and emotions and to define the file names used for saving, we rely on the format of filelists/example_filelist.txt. Please modify the data-specific parts (annotated) in utils/data_utils.py, extract_emb.py, mean_i2i.py, and inference.py accordingly.
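As a rough illustration of the kind of data-specific logic referred to above, the sketch below parses the naming convention implied by the example filelist (<emotion>_<gender>_<id>.wav); the helper name is hypothetical, and your own dataset will likely need a different rule.

```python
import os

def parse_example_name(wav_path: str):
    """Hypothetical parser for names like 'angry_f_1234.wav' from the example
    filelist. Replace this with the convention used by your own dataset."""
    base = os.path.splitext(os.path.basename(wav_path))[0]
    emotion, gender, utt_id = base.split("_")
    return emotion, gender, utt_id

# parse_example_name("/your/data/path/angry_f_1234.wav") -> ("angry", "f", "1234")
```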
2-3. Modify the data path and the train/validation filelist paths in hparams.py.
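For orientation, the relevant entries in hparams.py will look roughly like the snippet below; the attribute names here are assumptions, so use the names actually defined in the file.

```python
# Illustrative values only; the real attribute names are defined in hparams.py.
data_path = "/your/data/path"                    # root of the wav files
training_files = "filelists/train_filelist.txt"  # training filelist
validation_files = "filelists/val_filelist.txt"  # validation filelist
```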
Training

```
python train.py -o [SAVE DIRECTORY PATH] -m [BASE OR PROP]
```

- `-c`: Checkpoint path for loading
- `-o`: Path for saving checkpoints and logs
- `-m`: Choose the baseline or proposed model
Mean (i2i) style embedding extraction (optional)

0-1. Extract emotion embeddings of the dataset

```
python extract_emb.py -o [SAVE DIRECTORY PATH] -c [CHECKPOINT PATH] -m [BASE OR PROP]
```

- `-o`: Path for saving emotion embeddings
- `-c`: Checkpoint path for loading
- `-m`: Choose the baseline or proposed model
0-2. Compute the mean (or i2i) embeddings. A sketch of the mean case follows the options below.
```
python mean_i2i.py -i [EXTRACTED EMB PATH] -o [SAVE DIRECTORY PATH] -m [NEU OR ALL]
```

- `-i`: Path of the saved emotion embeddings
- `-o`: Path for saving the mean or i2i embeddings
- `-m`: Whether only the neutral emotion or all other emotions are treated as the farthest emotion (explained in mean_i2i.py)
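For intuition about the mean option (the actual mean and i2i algorithms are implemented in mean_i2i.py; the grouping keys and data layout below are assumptions), mean embeddings can be viewed as per-(speaker, emotion) averages of the extracted embeddings.

```python
import numpy as np
from collections import defaultdict

def mean_embeddings(embs):
    """Sketch of the 'mean' option only: average all extracted emotion
    embeddings that share the same (speaker, emotion) label.
    embs: iterable of (speaker, emotion, vector) with 1-D numpy vectors.
    The i2i variant is described in mean_i2i.py itself."""
    groups = defaultdict(list)
    for speaker, emotion, vec in embs:
        groups[(speaker, emotion)].append(vec)
    return {key: np.stack(vs).mean(axis=0) for key, vs in groups.items()}
```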
Inference

```
python inference.py -c [CHECKPOINT PATH] -v [VOCODER PATH] -s [MEAN EMB PATH] -o [SAVE DIRECTORY PATH] -m [BASE OR PROP]
```
- `-c`: Checkpoint path of the acoustic model
- `-v`: Checkpoint path of the vocoder (HiFi-GAN)
- `-s` (optional): Path of the saved mean (i2i) embeddings
- `-o`: Path for saving the generated wavs
- `-m`: Choose the baseline or proposed model
- `--control` (optional): F0 control at the utterance or phoneme level
- `--hz` (optional): Value (in Hz) by which to modify F0
- `--ref_dir` (optional): Path of reference wavs; use when you do not apply the mean (i2i) algorithms
- `--spk` (optional): Use with `--ref_dir`
- `--emo` (optional): Use with `--ref_dir`
- `--slide` (optional): Use when you want to apply sliding-window attention in Multispeech
Acknowledgements

We referred to the following repositories for the official version of the implementation.