
🐛[BUG]: Validation Loss Not Converging for Bracket Example #228

@ghasemiAb

Description


Version

24.12

On which installation method(s) does this occur?

Docker

Describe the issue

Issue: Validation Loss Not Converging for Bracket Example
The validation loss does not converge when we train the bracket example. We tested two different Modulus containers, 24.12 and 23.08, and observed different outcomes:

Container Version 24.12:
The validation loss remained unchanged even after running for 2 million iterations. This suggests that the model is not making progress toward convergence.

Image

Container Version 23.08:
Without restarting from the checkpoint, the results showed some improvement, but the model did not fully converge, especially for the Z components. After 2 million iterations, the validation loss for the Z components plateaued at 0.7, while the other components reached 0.3.
When restarting from the checkpoint, the loss curve either oscillated or diverged, preventing further progress.

Image
Image

Container Version 23.05:
The results with this version were similar to those obtained with container version 23.08, showing a consistent pattern of incomplete convergence, particularly for the Z components.
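
One way to make "remained unchanged" and "plateau" comparable across the container runs is to measure how much the validation loss varies over a trailing window. The following is only a minimal sketch, assuming the validation loss has been exported as (iteration, loss) pairs. The function name, window size, and tolerance are hypothetical and are not part of Modulus.

```python
from typing import Sequence, Tuple


def has_plateaued(
    history: Sequence[Tuple[int, float]],
    window: int = 5,
    rel_tol: float = 0.01,
) -> bool:
    """Return True if the last `window` losses vary by less than `rel_tol`
    relative to their mean. `history` holds (iteration, loss) pairs."""
    if len(history) < window:
        # Not enough validation points to judge.
        return False
    recent = [loss for _, loss in history[-window:]]
    mean = sum(recent) / window
    if mean == 0:
        # All-zero losses: nothing left to improve.
        return True
    spread = max(recent) - min(recent)
    return spread / abs(mean) < rel_tol


if __name__ == "__main__":
    # Illustrative, made-up values only (not taken from the runs above).
    flat = [(step, 0.7) for step in range(0, 10)]
    print(has_plateaued(flat))
```

Running the same check on each component's loss history would show whether the Z components stall earlier than the others.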

ldc_2d_zeroEq with container version 24.12

The validation losses converge, so the convergence problem does not appear to affect the ldc_2d_zeroEq example.

Minimum reproducible example

Relevant log output

Environment details

Other/Misc.

No response

Labels

1 - On Deck (to be worked on next), bug (something isn't working), external (issues/PRs filed by people outside the core team)
