Skip to content

[magpietts] added logging of consumed samples at each training step. #13849

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

XuesongYang
Copy link
Collaborator

@XuesongYang XuesongYang commented Jun 6, 2025

The Lhotse dataloader applies dynamic batching based on duration buckets, causing the number of consumed samples at each training step to vary. Tracking the number of consumed samples enables fairer model comparisons than relying solely on training step counts. The Megatron models also tracked consumed samples.

This PR adds sample consumption logs to the wandb UI and appends this information to checkpoint filenames.

ckpt filename changes
MagpieTTS-EN-Lhotse--val_loss=11.7273-step=1350-epoch=26.ckpt -->
MagpieTTS-EN-Lhotse--val_loss=11.7273-step=1350-epoch=26-consumed_samples=96466.ckpt

wandb UI changes
Screenshot 2025-06-05 at 11 52 56 PM

…g it into wandb UI. The checkpoint filename also appended with it.

Signed-off-by: Xuesong Yang <[email protected]>
@XuesongYang XuesongYang requested a review from Copilot June 6, 2025 06:53
@github-actions github-actions bot added the TTS label Jun 6, 2025
@XuesongYang XuesongYang requested a review from blisc June 6, 2025 06:53
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds logging for the number of consumed samples at each training step to facilitate fairer model comparisons and improve checkpoint traceability. Key changes include:

  • New instance variables and functions added in the MagpieTTS model to track consumed samples.
  • Updates to log calls during training to propagate the consumed samples count to the UI and checkpoints.
  • Modifications to YAML configuration files to include consumed_samples in checkpoint filenames.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
nemo/collections/tts/models/magpietts.py Implements consumed samples tracking and logging in training and resuming.
examples/tts/conf/magpietts/magpietts_multilingual_v1.yaml Updates checkpoint filename format to include consumed_samples.
examples/tts/conf/magpietts/magpietts_lhotse_dc_en.yaml Updates checkpoint filename format to include consumed_samples.
examples/tts/conf/magpietts/magpietts_en.yaml Updates checkpoint filename format to include consumed_samples.
examples/tts/conf/magpietts/magpietts_dc_en.yaml Updates checkpoint filename format to include consumed_samples.
Comments suppressed due to low confidence (2)

nemo/collections/tts/models/magpietts.py:277

  • Consider including the checkpoint path (ckpt_path) in the warning log message to aid in debugging when parsing fails.
init_consumed_samples = int(re.search(r"consumed_samples\=(\d+)", ckpt_path).group(1))

examples/tts/conf/magpietts/magpietts_multilingual_v1.yaml:229

  • [nitpick] Ensure that the updated filename format including 'consumed_samples' is clearly documented for users, such as in the changelog or release notes.
filename: '${name}--{${exp_manager.checkpoint_callback_params.monitor}:.4f}-{step}-{epoch}-{consumed_samples:.0f}'

Copy link
Collaborator

@blisc blisc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to root cause checkpoint issues

@XuesongYang
Copy link
Collaborator Author

Lhotse applied dynamic batching for a duration bucket across ranks so that each rank will proceed with varying batch sizes. This PR would then makes training progress hanging right after the validation is complete. Need to find alternative ways of reducing with sum batch sizes across ranks.

convert to draft for further investigation.

@XuesongYang XuesongYang marked this pull request as draft June 10, 2025 17:02
Copy link
Contributor

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Jun 28, 2025
Copy link
Contributor

github-actions bot commented Jul 6, 2025

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this Jul 6, 2025
@XuesongYang XuesongYang reopened this Jul 6, 2025
@github-actions github-actions bot removed the stale label Jul 7, 2025
Copy link
Contributor

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Jul 21, 2025
Copy link
Contributor

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this Jul 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants