TTS-with-Diffusion-model

Improving Text-to-Speech Synthesis with Diffusion Models: A D3PM Approach for Discrete Codecs

Abstract

This thesis explores advancements in Text-to-Speech (TTS) synthesis using diffusion models. By replacing traditional autoregressive models with a denoising diffusion probabilistic model (DDPM), we introduce a hybrid architecture that improves inference speed, scalability, and output quality. The proposed system leverages latent discrete representations through EnCodec and demonstrates robust zero-shot synthesis capabilities.

Key contributions include:

  • Faster, parallelized generation.
  • Enhanced speaker similarity and speech naturalness in zero-shot settings.
  • Improved generalization to noisy and out-of-distribution data.

Key Features

  • Diffusion Models in TTS: A novel approach to generating discrete codec tokens using structured noise and denoising.
  • Neural Codec Integration: Utilizing EnCodec for efficient latent space representation.
  • Non-Autoregressive Synthesis: Faster inference through parallel token generation.
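The "structured noise" applied to discrete codec tokens can be illustrated with a minimal sketch of a forward corruption process. This README does not describe the repository's actual transition matrix or noise schedule, so the example assumes a uniform transition kernel (one of the choices described for D3PM), a codebook size of 1024, and a hand-picked schedule. The function names `corrupt_tokens` and `forward_diffuse` are hypothetical and are not taken from the codebase.

```python
import random


def corrupt_tokens(tokens, beta, vocab_size, rng):
    """Apply one forward step of a uniform-transition discrete diffusion kernel.

    Each token is independently resampled uniformly from the vocabulary with
    probability beta, and kept unchanged otherwise.
    """
    return [
        rng.randrange(vocab_size) if rng.random() < beta else tok
        for tok in tokens
    ]


def forward_diffuse(tokens, betas, vocab_size, seed=0):
    """Return the list [x_0, x_1, ..., x_T] of progressively noisier sequences."""
    rng = random.Random(seed)
    trajectory = [list(tokens)]
    for beta in betas:
        trajectory.append(corrupt_tokens(trajectory[-1], beta, vocab_size, rng))
    return trajectory


VOCAB_SIZE = 1024                        # assumed codebook size, not from the README
clean = [17, 512, 3, 900, 42, 42, 7, 256]  # made-up codec token ids
betas = [0.1, 0.3, 0.6, 1.0]             # assumed schedule; beta = 1.0 resamples every token

trajectory = forward_diffuse(clean, betas, VOCAB_SIZE)
for step, noisy in enumerate(trajectory):
    kept = sum(a == b for a, b in zip(clean, noisy))
    print(f"t={step}: {kept}/{len(clean)} tokens match the clean sequence")
```

A denoising network would be trained to reverse these steps. Because it can predict every position of the sequence at once, generation does not have to proceed token by token, which is the basis of the non-autoregressive speedup claimed above.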

Results

  • Evaluation: Achieved higher audio similarity and speaker consistency compared to baseline autoregressive models.
  • Performance: Reduced inference latency by about 43% (3.7 s to 2.1 s) and training time by 44% (500 h to 280 h), while raising inference speed from 120.27 to 211.90 tokens/sec.
  • Generalization: Demonstrated strong performance on out-of-distribution datasets like LibriSpeech.

| Model | Inference Speed (tokens/sec) | Latency (sec) | Training Time (hours) |
|---|---|---|---|
| Baseline | 120.27 | 3.7 | 500 |
| Proposed | 211.90 | 2.1 | 280 |
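
The percentage improvements can be recomputed directly from the table. The short sketch below does the arithmetic; the helper name `relative_change` is illustrative, and the only inputs are the six numbers reported above.

```python
def relative_change(baseline, proposed):
    """Percentage change of the proposed value relative to the baseline."""
    return (proposed - baseline) / baseline * 100.0


# (baseline, proposed) pairs copied from the results table
results = {
    "inference speed (tokens/sec)": (120.27, 211.90),
    "latency (sec)": (3.7, 2.1),
    "training time (hours)": (500, 280),
}

changes = {name: relative_change(b, p) for name, (b, p) in results.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
# inference speed (tokens/sec): +76.2%
# latency (sec): -43.2%
# training time (hours): -44.0%
```

The latency and training-time figures match the "over 40%" and "44%" reductions quoted in the Results list. The throughput gain works out to roughly 76%.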

Getting Started

  1. Clone this repository:
    git clone https://github.com/csulb-datascience/TTS-with-Diffusion-model.git
    
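  2. Training: start training with the command below, replacing config/train/diffused.yml with the path to your own YAML configuration file. The leading ! is notebook (Jupyter/Colab) syntax; omit it when running from a regular shell.
    !python -m vall_e.train yaml=config/train/diffused.yml
  3. Inference: to synthesize a sentence in the voice of a reference speaker, run the command below. Judging by the example paths, the arguments are the text to speak, the reference speaker recording, and the output file, in that order.
    !python -m vall_e 'The Sentence to be cloned' data/test/speakersample.wav proposed/generated_sample_location.wav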
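
For batch synthesis from a script rather than a notebook, the inference command can be wrapped in a small Python helper. This is a sketch under assumptions: `build_inference_command` and `run_inference` are hypothetical names, the `vall_e` package is assumed to be importable in the current environment, and only the three positional arguments shown in the command above are used.

```python
import subprocess
import sys


def build_inference_command(text, reference_wav, output_wav):
    """Build the argument list for `python -m vall_e <text> <reference> <output>`."""
    # sys.executable keeps the call inside the currently active Python environment.
    return [sys.executable, "-m", "vall_e", text, reference_wav, output_wav]


def run_inference(text, reference_wav, output_wav):
    """Run inference as a subprocess; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_inference_command(text, reference_wav, output_wav), check=True)


command = build_inference_command(
    "The Sentence to be cloned",
    "data/test/speakersample.wav",
    "proposed/generated_sample_location.wav",
)
# Show the command without actually launching the model.
print(" ".join(command[1:]))
```

Passing the arguments as a list avoids shell quoting problems when the sentence contains spaces or apostrophes. Call `run_inference` with the same three arguments to generate audio.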
    
    
    
