
This thesis explores advancements in Text-to-Speech (TTS) synthesis using diffusion models. By replacing traditional autoregressive models with a denoising diffusion probabilistic model (DDPM), we introduce a hybrid architecture that improves inference speed, scalability, and output quality. The proposed system leverages latent discrete representations through EnCodec and demonstrates robust zero-shot synthesis capabilities.
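For background, a DDPM corrupts a clean latent with Gaussian noise over a fixed schedule and trains a network to undo that corruption. The snippet below is a minimal, generic sketch of the forward noising step and the standard epsilon-prediction loss; it is illustrative background only — the `denoiser` network, tensor shapes, and schedule values are assumptions, not the thesis implementation, which operates on discrete EnCodec tokens rather than raw continuous latents.

```python
import torch

# Linear beta schedule (a common DDPM choice); T diffusion steps.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def forward_noise(x0, t, noise):
    """q(x_t | x_0): sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise.
    x0 is assumed to be [batch, channels, time]."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def ddpm_loss(denoiser, x0):
    """Standard epsilon-prediction objective: MSE between true and predicted noise.
    `denoiser` is a hypothetical network taking (x_t, t)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = forward_noise(x0, t, noise)
    return torch.nn.functional.mse_loss(denoiser(x_t, t), noise)
```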
Key contributions include:
- Faster, parallelized generation.
- Enhanced speaker similarity and speech naturalness in zero-shot settings.
- Improved generalization to noisy and out-of-distribution data.
- Diffusion Models in TTS: A novel approach to generating discrete codec tokens using structured noise and denoising.
- Neural Codec Integration: Utilizing EnCodec for efficient latent space representation (a tokenization sketch follows this list).
- Non-Autoregressive Synthesis: Faster inference through parallel token generation.
- Evaluation: Achieved higher audio similarity and speaker consistency compared to baseline autoregressive models.
- Performance: Reduced inference latency by over 40% and training time by 44%.
- Generalization: Demonstrated strong performance on out-of-distribution datasets like LibriSpeech.
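As a concrete illustration of the EnCodec step mentioned above, the snippet below sketches how a waveform can be turned into discrete codec tokens with the publicly released `encodec` package. The input file name is a placeholder, and this is a usage sketch rather than the project's own preprocessing code.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Pretrained 24 kHz EnCodec model; 6 kbps corresponds to 8 codebooks per frame.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# "speaker.wav" is a placeholder input file.
wav, sr = torchaudio.load("speaker.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)

# Discrete latent representation: [batch, n_codebooks, n_frames] integer codes.
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
print(codes.shape)
```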
Performance comparison between the baseline autoregressive model and the proposed diffusion-based model:

| Model | Inference Speed (tokens/sec) | Latency (sec) | Training Time (hours) |
|---|---|---|---|
| Baseline | 120.27 | 3.7 | 500 |
| Proposed | 211.90 | 2.1 | 280 |
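The headline numbers quoted above follow directly from this table; a quick arithmetic check:

```python
# Sanity-check the reported improvements using the figures from the table above.
baseline = {"tokens_per_sec": 120.27, "latency_s": 3.7, "train_h": 500}
proposed = {"tokens_per_sec": 211.90, "latency_s": 2.1, "train_h": 280}

speedup = proposed["tokens_per_sec"] / baseline["tokens_per_sec"]   # ~1.76x throughput
latency_cut = 1 - proposed["latency_s"] / baseline["latency_s"]     # ~43% -> "over 40%"
training_cut = 1 - proposed["train_h"] / baseline["train_h"]        # 44%
print(f"{speedup:.2f}x throughput, {latency_cut:.0%} lower latency, {training_cut:.0%} less training time")
```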

- Clone this repository:

```bash
git clone https://github.com/csulb-datascience/TTS-with-Diffusion-model.git
```
- Training

To start the training process, run the following command, replacing `config/train/diffused.yml` with the path to your own YAML configuration file:

```bash
python -m vall_e.train yaml=config/train/diffused.yml
```
- Inference

To synthesize speech with a cloned voice, run the following command, where the arguments are the text to synthesize, a reference recording of the target speaker, and the output path for the generated audio:

```bash
python -m vall_e 'The Sentence to be cloned' data/test/speakersample.wav proposed/generated_sample_location.wav
```
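As an optional sanity check (not part of the project's tooling), the generated file can be loaded and inspected with `torchaudio`; the path matches the output argument in the command above.

```python
import torchaudio

# Load the synthesized waveform written by the inference command above.
wav, sample_rate = torchaudio.load("proposed/generated_sample_location.wav")
print(f"shape={tuple(wav.shape)}, sample_rate={sample_rate}, "
      f"duration={wav.shape[-1] / sample_rate:.2f}s")
```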