This is the official code repository for the paper "OverFlow: Putting flows on top of neural transducers for better TTS". For audio examples, visit our demo page. A pre-trained model (female) and a pre-trained model (male) are also available.
OverFlow is now also available in Coqui TTS, making it easier to use and experiment with. You can find the training recipe under `recipes/ljspeech/overflow`. More recipes are rolling out soon!
```bash
# Install TTS
pip install tts

# Change --text to the desired text prompt
# Change --out_path to the desired output path
tts --text "Hello world!" --model_name tts_models/en/ljspeech/overflow --vocoder_name vocoder_models/en/ljspeech/hifigan_v2 --out_path output.wav
```

The current plan is to maintain both repositories.
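If you prefer a Python API over the CLI, Coqui TTS exposes one as well. Below is a minimal sketch, assuming a recent release of the `tts` package (the `TTS.api.TTS` class belongs to Coqui TTS, not to this repository):

```python
# Minimal sketch using Coqui TTS's Python API (assumes a recent `tts` release).
from TTS.api import TTS

# Load OverFlow; the model (and its default vocoder) is downloaded on first use.
tts = TTS(model_name="tts_models/en/ljspeech/overflow")

# Synthesise a prompt straight to a wav file.
tts.tts_to_file(text="Hello world!", file_path="output.wav")
```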
- Download and extract the LJ Speech dataset. Place it in the `data` folder such that the directory becomes `data/LJSpeech-1.1`. Otherwise, update the filelists in `data/filelists` accordingly.
- Clone this repository:
```bash
git clone https://github.com/shivammehta25/OverFlow.git
```
- If using multiple GPUs, set `gradient_checkpoint=False` in `src/hparams.py`.
- Initialise the submodules:
```bash
git submodule init; git submodule update
```
- Make sure you have Docker installed and running.
- It is recommended to use Docker, as it manages the CUDA runtime libraries and the Python dependencies specified in the `Dockerfile` itself.
- Alternatively, if you do not intend to use Docker, you can install the dependencies with pip:
```bash
pip install -r requirements.txt
```
- Run `bash start.sh`; it will install all the dependencies and run the container.
- Check `src/hparams.py` for hyperparameters and set the GPUs (a sketch of these settings follows this list):
  - For multi-GPU training, set GPUs to `[0, 1, ...]`
  - For CPU training (not recommended), set GPUs to an empty list `[]`
  - Check the location of transcriptions
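For orientation, the settings discussed above might look roughly like this in `src/hparams.py`. This is an illustrative sketch, not the file's actual contents; the filelist attribute names in particular are hypothetical placeholders:

```python
# Illustrative excerpt of src/hparams.py (a sketch, not the actual file).
gpus = [0]                   # [0, 1, ...] for multi-GPU, [] for CPU training
gradient_checkpoint = False  # set to False when training on multiple GPUs
precision = 32               # 16 for mixed precision, 32 for full precision

# Hypothetical attribute names -- check src/hparams.py for the real ones.
training_files = "data/filelists/ljs_audio_text_train_filelist.txt"
validation_files = "data/filelists/ljs_audio_text_val_filelist.txt"
```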
- Once your filelists and hparams are updated, run `python generate_data_properties.py` to generate `data_parameters.pt` for your dataset (the default `data_parameters.pt` for LJSpeech is available in the repository); a quick way to inspect the file is sketched after this list.
- Run `python train.py` to train the model.
  - Checkpoints will be saved in `hparams.checkpoint_dir`.
  - Tensorboard logs will be saved in `hparams.tensorboard_log_dir`.
- To resume training, run:
```bash
python train.py -c <CHECKPOINT_PATH>
```
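If you generate `data_parameters.pt` for a custom dataset and want to sanity-check it, it is an ordinary PyTorch file; a quick hypothetical inspection:

```python
# Quick sanity check of the generated dataset statistics (a sketch; the
# exact contents of data_parameters.pt depend on this repository's code).
import torch

params = torch.load("data_parameters.pt", map_location="cpu")
print(params)  # dataset-level properties used to initialise the model
```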
- Download our pre-trained LJ Speech model.
- Alternatively, you can use a pre-trained RyanSpeech model.
- Download the pre-trained HiFiGAN model.
- We recommend using the HiFiGAN model fine-tuned on Tacotron 2 if you cannot fine-tune it on OverFlow (a loading sketch follows below).
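For reference, vocoding with a downloaded HiFiGAN checkpoint outside the provided scripts might look like the sketch below, assuming the layout of the original HiFi-GAN repository (its `env.AttrDict`, `models.Generator`, and the `'generator'` key in its checkpoints); `synthesis.ipynb` and `overflow_speak.py` already handle this for you:

```python
# Sketch of HiFiGAN vocoding, assuming the original HiFi-GAN repository's
# layout (env.AttrDict, models.Generator, 'generator' checkpoint key).
import json
import torch
from env import AttrDict
from models import Generator

with open("<HIFIGAN_CONFIG_PATH>") as f:
    h = AttrDict(json.load(f))

generator = Generator(h)
state = torch.load("<HIFIGAN_PATH>", map_location="cpu")
generator.load_state_dict(state["generator"])
generator.eval()
generator.remove_weight_norm()

with torch.no_grad():
    # Placeholder input; in practice, use the mel spectrogram from OverFlow.
    mel = torch.randn(1, h.num_mels, 100)  # [1, n_mels, T]
    audio = generator(mel).squeeze()
```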
- Run `jupyter notebook` and open `synthesis.ipynb`, or use the `overflow_speak.py` file:
```bash
python overflow_speak.py -t "Hello world" --checkpoint_path <CHECKPOINT_PATH> --hifigan_checkpoint_path <HIFIGAN_PATH> --hifigan_config <HIFIGAN_CONFIG_PATH>
python overflow_speak.py -f <FILENAME> --checkpoint_path <CHECKPOINT_PATH> --hifigan_checkpoint_path <HIFIGAN_PATH> --hifigan_config <HIFIGAN_CONFIG_PATH>
```
- In `src/hparams.py`, change `hparams.precision` to `16` for mixed precision and `32` for full precision.
- Since the code uses PyTorch Lightning, providing more than one element in the list of GPUs will enable multi-GPU training. So change `hparams.gpus` to `[0, 1, 2]` for multi-GPU training and to a single element `[0]` for single-GPU training (see the sketch below).
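These hparams map directly onto PyTorch Lightning's `Trainer` arguments. A minimal sketch of the mechanism, assuming a Lightning 1.x `Trainer` (illustrative, not this repository's exact training code):

```python
# Minimal sketch of how the hparams map onto a PyTorch Lightning 1.x Trainer
# (illustrative; train.py wires these up for you).
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=[0, 1, 2],  # hparams.gpus: more than one element enables multi-GPU
    precision=16,    # hparams.precision: 16 for mixed, 32 for full precision
)
# trainer.fit(model)  # `model` would be the LightningModule built from hparams
```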
- If you encounter this error message:
```
ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data' (/opt/conda/lib/python3.8/site-packages/torchmetrics/utilities/data.py)
```
- Update the `requirements.txt` file with these requirements:
```
torch==1.11.0a0+b6df043
--extra-index-url https://download.pytorch.org/whl/cu113
torchmetrics==0.6.0
```

If you have any questions or comments, please open an issue on our GitHub repository.
If you use or build on our method or code for your research, please cite our paper:
```
@inproceedings{mehta2023overflow,
  title={{O}ver{F}low: {P}utting flows on top of neural transducers for better {TTS}},
  author={Mehta, Shivam and Kirkland, Ambika and Lameris, Harm and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
  booktitle={Proc. Interspeech},
  pages={4279--4283},
  doi={10.21437/Interspeech.2023-1996},
  year={2023}
}
```
The code implementation is based on NVIDIA's implementation of Tacotron 2 and on Glow-TTS, and uses PyTorch Lightning for boilerplate-free code.
