FluentTTS: Text-dependent Fine-grained Style Control for Multi-style TTS

Overview

Official PyTorch implementation of FluentTTS: Text-dependent Fine-grained Style Control for Multi-style TTS. The code builds on the repositories listed in the Acknowledgements below.

Abstract: In this paper, we propose a method to flexibly control the local prosodic variation of a neural text-to-speech (TTS) model. To provide expressiveness for synthesized speech, conventional TTS models utilize utterance-wise global style embeddings that are obtained by compressing frame-level embeddings along the time axis. However, since utterance-wise global features do not contain sufficient information to represent word-level local features, they are not appropriate for directly controlling prosody at a fine scale. In multi-style TTS models, the capability to control local prosody is very important because local prosody plays a key role in selecting the most appropriate text-to-speech pair among the many one-to-many mapping candidates. To explicitly tie local prosodic characteristics to the contextual information of the corresponding input text, we propose a module that predicts the fundamental frequency ($F0$) for each text position, conditioned on the utterance-wise global style embedding. We also estimate multi-style embeddings using a multi-style encoder, which takes as inputs both the global utterance-wise embedding and the local $F0$ embedding. Our multi-style embedding enhances the naturalness and expressiveness of synthesized speech and can control prosody styles at the word or phoneme level.
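
For intuition, the snippet below is a minimal PyTorch sketch of the idea described in the abstract. The module structure, names, and dimensions are illustrative assumptions, not the authors' implementation: an F0 value is predicted per text position conditioned on the global style embedding, and the resulting local F0 embedding is fused with the global embedding into a multi-style embedding.

    import torch
    import torch.nn as nn

    class MultiStyleEncoder(nn.Module):
        """Sketch: fuse an utterance-level style embedding with a local,
        text-dependent F0 embedding (all dimensions are assumptions)."""

        def __init__(self, text_dim=256, style_dim=128):
            super().__init__()
            # Predict a per-position F0 from text features + global style.
            self.f0_predictor = nn.Sequential(
                nn.Linear(text_dim + style_dim, 256),
                nn.ReLU(),
                nn.Linear(256, 1),
            )
            self.f0_embed = nn.Linear(1, style_dim)          # local F0 -> embedding
            self.fuse = nn.Linear(2 * style_dim, style_dim)  # global + local -> multi-style

        def forward(self, text_feats, global_style):
            # text_feats: (B, T, text_dim); global_style: (B, style_dim)
            g = global_style.unsqueeze(1).expand(-1, text_feats.size(1), -1)
            f0 = self.f0_predictor(torch.cat([text_feats, g], dim=-1))  # (B, T, 1)
            local = self.f0_embed(f0)                                   # (B, T, style_dim)
            return self.fuse(torch.cat([g, local], dim=-1))             # (B, T, style_dim)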

Visit our Demo for audio samples.

Prerequisites

  • Clone this repository
  • Install Python requirements. Please refer to requirements.txt
  • Following the Code reference, modify the return value of torch.nn.functional.multi_head_attention_forward() so that the attention weights of all heads are returned; this is used to plot the attention of every head during the validation step.
    # Before
    return attn_output, attn_output_weights.sum(dim=1) / num_heads
    # After
    return attn_output, attn_output_weights
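
    Note that this edit changes the installed PyTorch source (torch/nn/functional.py). As a quick sanity check (our own snippet, not part of the repo), the returned attention weights should keep a separate head dimension after the change; on newer PyTorch versions (1.11+), passing average_attn_weights=False to nn.MultiheadAttention.forward() may achieve the same result without editing the source.

    import torch
    import torch.nn as nn

    mha = nn.MultiheadAttention(embed_dim=256, num_heads=4)
    x = torch.randn(10, 2, 256)  # (seq_len, batch, embed_dim)
    _, attn_weights = mha(x, x, x)
    # Before the patch: torch.Size([2, 10, 10])    (averaged over heads)
    # After the patch:  torch.Size([2, 4, 10, 10]) (one map per head)
    print(attn_weights.shape)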
    

Preprocessing

  1. Prepare text preprocessing

    1-1. Our code was developed for an internal Korean dataset. If you run the code on another language, please modify the files related to symbols and text preprocessing in text/ and hparams.py.

    1-2. Make data filelists following the format of filelists/example_filelist.txt. They are used for preprocessing and training.

    /your/data/path/angry_f_1234.wav|your_data_text|speaker_type
    /your/data/path/happy_m_5678.wav|your_data_text|speaker_type
    /your/data/path/sadness_f_111.wav|your_data_text|speaker_type
    ...
    

    1-3. To determine the number of speakers and emotions, and to define the file names used when saving outputs, we rely on the format of filelists/example_filelist.txt. Please modify the data-specific parts (annotated in the code) in utils/data_utils.py, extract_emb.py, mean_i2i.py, and inference.py; a sketch of the assumed convention follows below.
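
    As referenced above, this is a sketch of the assumed filelist and file-name convention (a hypothetical helper, not part of the repo): each line is "wav_path|text|speaker", and the emotion label is read from the prefix of the wav file name.

    from pathlib import Path

    def parse_filelist_line(line: str):
        """Split a filelist line and recover the emotion from the file name."""
        wav_path, text, speaker = line.strip().split("|")
        emotion = Path(wav_path).stem.split("_")[0]  # e.g. "angry"
        return wav_path, text, speaker, emotion

    print(parse_filelist_line(
        "/your/data/path/angry_f_1234.wav|your_data_text|speaker_type"
    ))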

  2. Preprocessing

    2-1. Before running preprocess.py, set path (the data path) and file_path (the filelist you created in 1-2) on lines 21 and 25.

    2-2. Run

    python preprocess.py
    

    2-3. Update the data path and the training/validation filelist paths in hparams.py, e.g. as sketched below.
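
    For illustration, the relevant entries might look like the following (the attribute names here are assumptions; check your copy of hparams.py for the exact ones):

    data_path = '/your/data/path'                    # root of the preprocessed data
    training_files = 'filelists/train_filelist.txt'  # filelist from step 1-2
    validation_files = 'filelists/val_filelist.txt'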

Training

python train.py -o [SAVE DIRECTORY PATH] -m [BASE OR PROP] 

(Arguments)

-c: Ckpt path for loading
-o: Path for saving ckpt and log
-m: Choose baseline or proposed model
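
For example, to train the proposed model and write checkpoints and logs to ./outdir (the directory and the exact value accepted by -m are placeholders; see the argument parser in train.py for the valid choices):

python train.py -o ./outdir -m prop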

Inference

  1. Mean (i2i) style embedding extraction (optional)

    1-1. Extract the emotion embeddings of the dataset

    python extract_emb.py -o [SAVE DIRECTORY PATH] -c [CHECKPOINT PATH] -m [BASE OR PROP]
    

    (Arguments)

    -o: Path for saving emotion embs
    -c: Ckpt path for loading
    -m: Choose baseline or proposed model
    

    1-2. Compute the mean (or i2i) embeddings

    python mean_i2i.py -i [EXTRACTED EMB PATH] -o [SAVE DIRECTORY PATH] -m [NEU OR ALL]
    

    (Arguments)

    -i: Path of the saved emotion embeddings
    -o: Path for saving the mean or i2i embeddings
    -m: Whether the farthest emotion is measured against neutral only or against all other emotions (explained in mean_i2i.py); a sketch of the mean case follows below
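
    A minimal sketch of the mean-embedding computation (the file layout and emotion set are assumptions; mean_i2i.py implements the actual mean and i2i algorithms):

    import glob
    import numpy as np

    mean_embs = {}
    for emo in ["neutral", "angry", "happy", "sadness"]:  # assumed label set
        files = glob.glob(f"extracted_embs/{emo}_*.npy")  # assumed layout
        mean_embs[emo] = np.mean([np.load(f) for f in files], axis=0)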
    
  2. Inference

    python inference.py -c [CHECKPOINT PATH] -v [VOCODER PATH] -s [MEAN EMB PATH] -o [SAVE DIRECTORY PATH] -m [BASE OR PROP]
    

    (Arguments)

    -c: Ckpt path of the acoustic model
    -v: Ckpt path of the vocoder (HiFi-GAN)
    -s (optional): Path of the saved mean (i2i) embeddings
    -o: Path for saving generated wavs
    -m: Choose baseline or proposed model
    --control (optional): F0 control at the utterance or phoneme level
    --hz (optional): Amount (in Hz) by which to modify F0
    --ref_dir (optional): Path of reference wavs. Use when you do not apply the mean (i2i) algorithms.
    --spk (optional): Use together with --ref_dir
    --emo (optional): Use together with --ref_dir
    --slide (optional): Apply sliding-window attention as in MultiSpeech
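
    For example, synthesizing with the proposed model, a HiFi-GAN vocoder, and saved mean embeddings while shifting F0 at the phoneme level (all paths and option values are placeholders; see the argument parser in inference.py for the exact choices):

    python inference.py -c ckpt/prop.pt -v hifigan/generator.pt -s mean_embs/ -o results/ -m prop --control phoneme --hz 30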
    

Acknowledgements

We referred to the following repositories for the official implementation.

  1. NVIDIA/tacotron2: Link
  2. Deepest-Project/Transformer-TTS: Link
  3. NVIDIA/FastPitch: Link
  4. KevinMIN95/StyleSpeech: Link
  5. Kyubong/g2pK: Link
  6. jik876/hifi-gan: Link
  7. KinglittleQ/GST-Tacotron: Link