
This thesis explores advancements in Text-to-Speech (TTS) synthesis using diffusion models. By replacing traditional autoregressive models with a denoising diffusion probabilistic model (DDPM), we introduce a hybrid architecture that improves inference speed, scalability, and output quality. The proposed system leverages latent discrete representations through EnCodec and demonstrates robust zero-shot synthesis capabilities.
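For background, a DDPM corrupts a clean latent with Gaussian noise over a fixed schedule and trains a network to undo that corruption. The snippet below is a minimal, generic sketch of the forward noising step and the standard epsilon-prediction loss; it is illustrative background only — the `denoiser` network, tensor shapes, and schedule values are assumptions, not the thesis implementation, which operates on discrete EnCodec tokens rather than raw continuous latents.

```python
import torch

# Linear beta schedule (a common DDPM choice); T diffusion steps.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def forward_noise(x0, t, noise):
    """q(x_t | x_0): sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise.
    x0 is assumed to be [batch, channels, time]."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def ddpm_loss(denoiser, x0):
    """Standard epsilon-prediction objective: MSE between true and predicted noise.
    `denoiser` is a hypothetical network taking (x_t, t)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = forward_noise(x0, t, noise)
    return torch.nn.functional.mse_loss(denoiser(x_t, t), noise)
```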
Key contributions include:
- Faster, parallelized generation.
- Enhanced speaker similarity and speech naturalness in zero-shot settings.
- Improved generalization to noisy and out-of-distribution data.
- Diffusion Models in TTS: A novel approach to generating discrete codec tokens using structured noise and denoising.
- Neural Codec Integration: Utilizing EnCodec for efficient latent space representation (a tokenization sketch follows this list).
- Non-Autoregressive Synthesis: Faster inference through parallel token generation.
- Evaluation: Achieved higher audio similarity and speaker consistency compared to baseline autoregressive models.
- Performance: Reduced inference latency by over 40% and training time by 44%.
- Generalization: Demonstrated strong performance on out-of-distribution datasets like LibriSpeech.
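As a concrete illustration of the EnCodec step mentioned above, the snippet below sketches how a waveform can be turned into discrete codec tokens with the publicly released `encodec` package. The input file name is a placeholder, and this is a usage sketch rather than the project's own preprocessing code.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Pretrained 24 kHz EnCodec model; 6 kbps corresponds to 8 codebooks per frame.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# "speaker.wav" is a placeholder input file.
wav, sr = torchaudio.load("speaker.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)

# Discrete latent representation: [batch, n_codebooks, n_frames] integer codes.
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
print(codes.shape)
```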
Performance comparison between the baseline autoregressive model and the proposed diffusion-based model:

| Model | Inference Speed (tokens/sec) | Latency (sec) | Training Time (hours) |
|---|---|---|---|
| Baseline | 120.27 | 3.7 | 500 |
| Proposed | 211.90 | 2.1 | 280 |
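The headline numbers quoted above follow directly from this table; a quick arithmetic check:

```python
# Sanity-check the reported improvements using the figures from the table above.
baseline = {"tokens_per_sec": 120.27, "latency_s": 3.7, "train_h": 500}
proposed = {"tokens_per_sec": 211.90, "latency_s": 2.1, "train_h": 280}

speedup = proposed["tokens_per_sec"] / baseline["tokens_per_sec"]   # ~1.76x throughput
latency_cut = 1 - proposed["latency_s"] / baseline["latency_s"]     # ~43% -> "over 40%"
training_cut = 1 - proposed["train_h"] / baseline["train_h"]        # 44%
print(f"{speedup:.2f}x throughput, {latency_cut:.0%} lower latency, {training_cut:.0%} less training time")
```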

- Clone this repository:

```bash
git clone https://github.com/csulb-datascience/TTS-with-Diffusion-model.git
```
- Training

To start the training process, run the following command, replacing `config/train/diffused.yml` with the path to your own YAML configuration file:

```bash
python -m vall_e.train yaml=config/train/diffused.yml
```
- Inference

To synthesize speech with a cloned voice, run the following command, where the arguments are the text to synthesize, a reference recording of the target speaker, and the output path for the generated audio:

```bash
python -m vall_e 'The Sentence to be cloned' data/test/speakersample.wav proposed/generated_sample_location.wav
```
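As an optional sanity check (not part of the project's tooling), the generated file can be loaded and inspected with `torchaudio`; the path matches the output argument in the command above.

```python
import torchaudio

# Load the synthesized waveform written by the inference command above.
wav, sample_rate = torchaudio.load("proposed/generated_sample_location.wav")
print(f"shape={tuple(wav.shape)}, sample_rate={sample_rate}, "
      f"duration={wav.shape[-1] / sample_rate:.2f}s")
```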