-
Notifications
You must be signed in to change notification settings - Fork 3k
[magpietts] added logging of consumed samples at each training step. #13849
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[magpietts] added logging of consumed samples at each training step. #13849
Conversation
…g it into wandb UI. The checkpoint filename also appended with it. Signed-off-by: Xuesong Yang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds logging for the number of consumed samples at each training step to facilitate fairer model comparisons and improve checkpoint traceability. Key changes include:
- New instance variables and functions added in the MagpieTTS model to track consumed samples.
- Updates to log calls during training to propagate the consumed samples count to the UI and checkpoints.
- Modifications to YAML configuration files to include consumed_samples in checkpoint filenames.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
nemo/collections/tts/models/magpietts.py | Implements consumed samples tracking and logging in training and resuming. |
examples/tts/conf/magpietts/magpietts_multilingual_v1.yaml | Updates checkpoint filename format to include consumed_samples. |
examples/tts/conf/magpietts/magpietts_lhotse_dc_en.yaml | Updates checkpoint filename format to include consumed_samples. |
examples/tts/conf/magpietts/magpietts_en.yaml | Updates checkpoint filename format to include consumed_samples. |
examples/tts/conf/magpietts/magpietts_dc_en.yaml | Updates checkpoint filename format to include consumed_samples. |
Comments suppressed due to low confidence (2)
nemo/collections/tts/models/magpietts.py:277
- Consider including the checkpoint path (ckpt_path) in the warning log message to aid in debugging when parsing fails.
init_consumed_samples = int(re.search(r"consumed_samples\=(\d+)", ckpt_path).group(1))
examples/tts/conf/magpietts/magpietts_multilingual_v1.yaml:229
- [nitpick] Ensure that the updated filename format including 'consumed_samples' is clearly documented for users, such as in the changelog or release notes.
filename: '${name}--{${exp_manager.checkpoint_callback_params.monitor}:.4f}-{step}-{epoch}-{consumed_samples:.0f}'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to root cause checkpoint issues
Lhotse applied dynamic batching for a duration bucket across ranks so that each rank will proceed with varying batch sizes. This PR would then makes training progress hanging right after the validation is complete. Need to find alternative ways of reducing with sum batch sizes across ranks. convert to draft for further investigation. |
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days. |
This PR was closed because it has been inactive for 7 days since being marked as stale. |
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days. |
This PR was closed because it has been inactive for 7 days since being marked as stale. |
The Lhotse dataloader applies dynamic batching based on duration buckets, causing the number of consumed samples at each training step to vary. Tracking the number of consumed samples enables fairer model comparisons than relying solely on training step counts. The Megatron models also tracked consumed samples.
This PR adds sample consumption logs to the wandb UI and appends this information to checkpoint filenames.
ckpt filename changes
MagpieTTS-EN-Lhotse--val_loss=11.7273-step=1350-epoch=26.ckpt -->
MagpieTTS-EN-Lhotse--val_loss=11.7273-step=1350-epoch=26-consumed_samples=96466.ckpt
wandb UI changes
